CombEST

CombEST: annotation of fungal genomes using Illumina EST data

Motivation

Gene modeling from RNA-Seq ESTs struggles with very large numbers of short ESTs. Ab initio EST assemblies produced with genome assemblers such as Velvet, when mapped to genomic sequences to build gene models, suffer from fragmentation, chimerism, and misinterpretation of alternative splicing. CombEST is a new approach based on genome-guided EST assembly that offers better performance, higher-quality gene models, and simpler, parallelizable computation than ab initio methods.

Results

The CombEST algorithm consists of three steps. In the first step, ESTs are mapped to genome sequences using one of the public alignment tools GMAP, TopHat, or BLAT. In the second step, these alignments are sorted by genomic location and grouped into congregations (sets of overlapping alignments), which are then assembled into gene models. In the final step, chimeric gene models are detected and split using base coverage profiles. In addition, fragmented models predicted by CombEST can be improved by combining them with gene models predicted by other methods.
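The second step can be illustrated with a minimal C++ sketch that sorts alignments by genomic location and groups overlapping ones into congregations. It is not CombEST's actual code. The `Alignment` and `Congregation` types, the function name `groupIntoCongregations`, and the half-open coordinate convention are all assumptions made for illustration, and strand and intron structure are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record for one EST-to-genome alignment (assumed layout,
// not CombEST's real data structure). Coordinates are half-open: [start, end).
struct Alignment {
    std::string contig;  // name of the genomic sequence
    long start;          // 0-based inclusive start
    long end;            // exclusive end
};

// Hypothetical congregation: a maximal run of alignments on one contig
// that are chained together by overlaps.
struct Congregation {
    std::string contig;
    long start;
    long end;
    std::vector<Alignment> members;
};

// Sort alignments by (contig, start, end), then sweep once, merging each
// alignment into the current congregation if it overlaps the span so far.
std::vector<Congregation> groupIntoCongregations(std::vector<Alignment> alignments) {
    std::sort(alignments.begin(), alignments.end(),
              [](const Alignment& a, const Alignment& b) {
                  if (a.contig != b.contig) return a.contig < b.contig;
                  if (a.start != b.start) return a.start < b.start;
                  return a.end < b.end;
              });

    std::vector<Congregation> groups;
    for (const Alignment& a : alignments) {
        assert(a.start <= a.end);  // precondition on input coordinates
        const bool overlapsLast = !groups.empty() &&
                                  groups.back().contig == a.contig &&
                                  a.start < groups.back().end;
        if (overlapsLast) {
            // Extend the current congregation to cover this alignment.
            groups.back().end = std::max(groups.back().end, a.end);
            groups.back().members.push_back(a);
        } else {
            // No overlap (or a new contig): start a new congregation.
            groups.push_back(Congregation{a.contig, a.start, a.end, {a}});
        }
    }
    return groups;
}
```

Under these assumptions, each congregation depends only on its own alignments once the sort is done, so congregations could in principle be assembled into gene models independently of one another, which is in keeping with the parallelizable computation described in the Motivation.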

The algorithm is implemented in object-oriented C++ with a focus on performance and modularity. Tested on a single CPU on 10+ genomes with variable EST coverage, CombEST demonstrated a 1,000- to 10,000-fold speed-up compared to PASA and good quality of predicted gene models.