Announcements
- January 23-24, 2012: CSP 2012 PI Workshop, Walnut Creek, CA
- January 19-20, 2012: Microbial Genome Reannotation Workshop, Walnut Creek, CA
Releases
- January 31, 2012: Clostridium sp. BNL1100
- December 1, 2011: Acidovorax avenae avenae ATCC 19860
- December 1, 2011: Alicycliphilus denitrificans K601
- December 1, 2011: Cellulosilyticum lentocellum RHM5, DSM 5427
- December 1, 2011: Delftia sp. Cs1-4
Benchmarks
1. Microbial genome gene model quality control.
Gene models that are defined by the IMG-ER gene-calling pipeline undergo a combination of automatic and manual quality control. Gene models are analyzed for correct start and stop positions, broken fragments, uniqueness, and dubiousness, while intergenic regions without gene definitions but with uninterrupted or sparsely interrupted ORFs are analyzed for missing genes.
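The search for uninterrupted ORFs in intergenic regions can be illustrated with a minimal sketch. This is not the IMG-ER or GenePRIMP implementation; the function names, the `min_codons` threshold, and the choice to ignore start codons are all assumptions made for illustration. It scans the six reading frames of a region and reports stretches without a stop codon. "Sparsely interrupted" ORFs are not handled here.

```python
# Hypothetical helper, not part of IMG-ER or GenePRIMP.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_stretches(seq, min_codons=50):
    """Find stop-free stretches in all six reading frames.

    Returns tuples (strand, frame, start, end, codons). Coordinates are
    0-based, end-exclusive, and relative to the scanned strand (for "-"
    they refer to the reverse complement). Start codons are not required,
    which is an assumption of this sketch.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOP_CODONS:
                    codons = (pos - run_start) // 3
                    if codons >= min_codons:
                        hits.append((strand, frame, run_start, pos, codons))
                    run_start = pos + 3  # next stretch begins after the stop
                pos += 3
            # Stretch that runs to the end of the region without a stop.
            codons = (pos - run_start) // 3
            if codons >= min_codons:
                hits.append((strand, frame, run_start, pos, codons))
    return hits


if __name__ == "__main__":
    region = "ATG" + "GCT" * 10 + "TAA" + "CC"  # made-up intergenic sequence
    for hit in stop_free_stretches(region, min_codons=5):
        print(hit)
```

In practice a candidate found this way would still need homology support before being treated as a missed gene, since long stop-free stretches also occur by chance, especially in GC-rich genomes.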
- Results: Automatic identification and correction of gene model anomalies is done using GenePRIMP. Details of the types of anomalies that are identified are available at http://geneprimp.jgi-psf.org/gperrors.html. A subset of identified anomalies is automatically corrected based on homology to related proteins in a filtered version of the non-redundant protein database in IMG. More complex anomalies are appropriately tagged for manual curation performed by JGI scientists.
- Reference:
2. Microbial genome gene prediction benchmark.
GenePRIMP was used to compare the accuracy of five popular gene finders (Prodigal, GeneMark, Glimmer3, RAST, and AMIGene) by evaluating their gene calls for two genomes: the bacterium Mycobacterium sp. Spyr1 (Myco, GC% = 67.9, Size = 6 Mb) and the archaeon Methanosphaerula palustris E1-9c (Meth, GC% = 55.35, Size = 2.9 Mb). These genomes were selected because of the high number of modifications made to their gene models during manual curation.
- Results: The benchmark was based on the number of anomalies of each type detected by GenePRIMP as shown in the Table below:

Results of automated gene finding for the two genomes mentioned above vary wildly among the different tools and pipelines. Notably, Glimmer3 predicts the most unique genes (522); 226 of these were not called by any other gene caller and only 38 genes are predicted by all others. Glimmer3 identifies 515 more genes (18%) than does Prodigal, which identifies the lowest number of genes; many of these additional genes are among the 522 unique genes predicted by Glimmer3. We observe considerable variation in the gene finders' identification of translation initiation sites. Glimmer3, GeneMark, and RAST show a tendency to predict genes shorter than their homologs, whereas AMIGene calls more long genes than any of the others.
The occurrence of missed genes and of predicted genes that are longer or shorter than their homologs reflects the current limitations of automated gene finding in microbial genomes. The number of broken and interrupted genes identified in the gene calls indicates the sensitivity of the respective gene caller: higher numbers attest to the greater ability of that gene caller to identify shorter regions of CDSs, including small fragments in highly degraded pseudogenes. This facilitates the correction of sequencing artifacts (artificial frameshifts) and the correct annotation of pseudogenes and genes with unusual translational features (e.g., the recoding of stop codons).
- Reference:
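The kind of comparison discussed above (genes unique to one gene caller, genes predicted by all, and disagreement on translation initiation sites) can be sketched as follows. This is an illustrative assumption, not the procedure used in the benchmark: two predictions are treated as the same gene when they share strand and 3' end, and the caller names, coordinates, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def gene_key(call):
    """Identify a gene by strand and 3' end (an assumption of this sketch).

    call is (start, end, strand) with start < end.
    """
    start, end, strand = call
    return (strand, end if strand == "+" else start)


def compare_callers(calls_by_caller):
    """Return (unique genes per caller, genes predicted by every caller)."""
    keys = {name: {gene_key(c) for c in calls}
            for name, calls in calls_by_caller.items()}
    support = Counter(k for ks in keys.values() for k in ks)
    unique = {name: {k for k in ks if support[k] == 1}
              for name, ks in keys.items()}
    shared_by_all = {k for k, n in support.items() if n == len(keys)}
    return unique, shared_by_all


def start_disagreements(calls_by_caller):
    """Genes for which callers report more than one 5' end."""
    starts = defaultdict(set)
    for calls in calls_by_caller.values():
        for start, end, strand in calls:
            five_prime = start if strand == "+" else end
            starts[gene_key((start, end, strand))].add(five_prime)
    return {k for k, s in starts.items() if len(s) > 1}


if __name__ == "__main__":
    # Made-up coordinates for three hypothetical gene callers.
    calls = {
        "callerA": [(100, 400, "+"), (500, 800, "-"), (900, 1200, "+")],
        "callerB": [(130, 400, "+"), (500, 800, "-")],
        "callerC": [(100, 400, "+"), (500, 830, "-"), (1500, 1800, "+")],
    }
    unique, shared = compare_callers(calls)
    print(unique)
    print(shared)
    print(start_disagreements(calls))
```

Keying on the 3' end keeps predictions with different start sites grouped as one gene, so start-site disagreement can be counted separately from presence or absence of the gene.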