Validation of Short Read ESTs and Genomic Assemblies

Motivation

The current JGI Annotation Pipeline is optimized for use with hybrid 454-Illumina genomic assemblies and with 454 ESTs and contigs derived from them. The complete replacement of 454 sequencing by Illumina sequencing poses new challenges to current practice: Illumina reads are shorter than 454 reads and therefore more difficult to assemble into accurate genomic scaffolds or EST contigs, which may have significant downstream effects on annotation quality. At the same time, the much deeper EST coverage provided by Illumina sequencing (relative to 454) poses computational challenges to some of the software currently standard in the JGI Annotation Pipeline. This project aims to assess the utility of Illumina-only ESTs, EST contigs, and genomic assemblies for whole-genome annotation. To perform these assessments, we are choosing and developing simple metrics relevant to gene prediction quality, performing systematic and controlled annotation experiments to measure the effects of substituting Illumina for 454 inputs, and developing, as needed, new protocols and programs to compensate for any deleterious effects.

Results

Assessment of ESTs and EST contigs is nearly complete. Changes to the annotation process in response to these assessments are under consideration and being implemented. Genomic assemblies are not yet available for assessment, but planning for such assessments has begun. Remaining problems include overclustering of EST contigs, which leads to chimaeric transcript models.