Comparative Analysis Methods and Tools

Motivation

Genome annotation and analysis require the development and validation of new algorithms and tools. Directions of this development include methods to analyze eukaryotic genome organization (tandem and segmental duplication; gene-based synteny, including across multiple related genomes), gene structure (intron conservation or loss across genomes), gene gain and loss (detecting possible errors in automated clustering results used for gene family analysis, building whole-genome phylogenetic trees from clustering results, and Pfam domain analysis to detect expanded and lost families), genome evolution, gene expression, genome variation, metabolic pathways, and regulatory elements. A further direction is testing new tools on validated gene sets for accuracy and speed: gene predictors (including those that use RNA-Seq data and synteny-based approaches), pipelines (e.g., MAKER), repeat-finding software, and non-coding RNA finding software. This project aims at (1) developing algorithms and prototypes for new genome analysis methods for publication and (2) testing new gene prediction and genome analysis tools for possible integration into the production annotation process.

Comparative Gene Modeling

Comparative gene modeling aims to improve the initial gene predictions for a set of closely related organisms and to correct missing or incorrectly predicted genes (incorrect splice sites, chimeras, gene fragments, etc.). The idea is that, for closely related genomes, most orthologs share the same conserved gene structure. The algorithm maps all gene models predicted in all genomes onto each individual genome and, for each locus, selects from the potentially many competing models the one that most closely resembles the homologous genes from the other genomes. This procedure may be iterated several times until no further change in the gene models is observed.
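The select-and-iterate loop described above can be illustrated with a small sketch. This is not the production algorithm: the data layout (a gene model as a tuple of exon lengths, candidates grouped by a hypothetical family and genome name), the toy similarity function, and the names refine and similarity are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a gene model is a tuple of exon lengths.
Model = Tuple[int, ...]
# candidates[family][genome] -> competing models mapped to that locus
Candidates = Dict[str, Dict[str, List[Model]]]


def similarity(a: Model, b: Model) -> float:
    """Toy structure similarity: 1.0 for identical exon lengths, lower otherwise."""
    if len(a) != len(b):
        return 0.0  # different exon counts are treated as dissimilar
    diff = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + diff)


def refine(candidates: Candidates, max_rounds: int = 10) -> Dict[str, Dict[str, Model]]:
    """Pick, per locus, the model closest to the homologs chosen in other genomes."""
    # The first candidate at each locus stands in for the initial prediction.
    chosen = {fam: {g: models[0] for g, models in loci.items()}
              for fam, loci in candidates.items()}
    for _ in range(max_rounds):
        changed = False
        for fam, loci in candidates.items():
            for genome, models in loci.items():
                others = [m for g, m in chosen[fam].items() if g != genome]

                def score(m: Model) -> float:
                    return sum(similarity(m, o) for o in others)

                best = max(models, key=score)
                # Switch only on a strict improvement so that ties keep the current model.
                if score(best) > score(chosen[fam][genome]):
                    chosen[fam][genome] = best
                    changed = True
        if not changed:  # no model changed in this round: stop iterating
            break
    return chosen


# Made-up example: genome g3 starts with a model whose first two exons are fused.
example: Candidates = {
    "famA": {
        "g1": [(100, 200, 150)],
        "g2": [(100, 200, 150)],
        "g3": [(300, 150), (100, 200, 150)],
    }
}
result = refine(example)
print(result["famA"]["g3"])
```

In this made-up example the competing three-exon model in g3 matches the structure of its homologs in g1 and g2, so it replaces the initial two-exon model, and a second round confirms that nothing else changes. A real implementation would score alignments of mapped models rather than exon lengths.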

Results

For the basidiomycete Dichomitus squalens, reannotation using comparative modeling is compared with the initial JGI production annotation:

                                     JGI Annotation pipeline   Comparative modeling
Number of predicted gene models      12,290                    12,802
  with Swissprot hits                7,356                     7,900
  with non-repeat PFAM domains       6,010                     6,353
  with EST support                   10,796                    11,105
  with >90% EST support              9,178                     9,444
Number of unique PFAM domains        2,245                     2,322
Average EST coverage per gene        93.3%                     93.3%
Splice sites supported by ESTs       102,200                   104,246
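
To make the size of these differences easier to read, the absolute and relative gains can be computed directly from the counts in the table. The snippet below is only a convenience sketch; the counts are copied from the table, and the helper name relative_gain is an assumption introduced here.

```python
# Counts copied from the table above: (JGI annotation pipeline, comparative modeling).
counts = {
    "Predicted gene models": (12290, 12802),
    "With Swissprot hits": (7356, 7900),
    "With non-repeat PFAM domains": (6010, 6353),
    "With EST support": (10796, 11105),
    "With >90% EST support": (9178, 9444),
    "Unique PFAM domains": (2245, 2322),
    "Splice sites supported by ESTs": (102200, 104246),
}


def relative_gain(before: int, after: int) -> float:
    """Percentage change from the pipeline count to the comparative modeling count."""
    return 100.0 * (after - before) / before


for name, (before, after) in counts.items():
    print(f"{name}: {after - before:+d} ({relative_gain(before, after):.1f}%)")
```

The average EST coverage per gene is left out because it is already a percentage and is identical in both columns.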