Microbial Program at JGI

Assembly

Production assembly of microbial genomes is straightforward when high-quality libraries and sequence data are available. Considerable effort was initially spent on improving library construction, in particular for long mate pair libraries. The key challenge now is scaling to accommodate the very large number of genomes that will be sequenced over the next several years. To address this challenge, we have pushed microbial genome assembly "upstream" into the sequencing production area, so that these assemblies are carried out by the QA/QC team. This change takes advantage of the team's extensive expertise in efficiently analyzing very large numbers of data sets: the group is responsible for QC and analysis of every library produced for all science programs. To support analysis of thousands of libraries per year, the QA/QC team has developed a data processing framework for automatic analysis and assembly.

The Microbial Program produces a variety of products (single cell, isolate quick draft, isolate high quality draft, etc.), and pipelines have been developed to automatically assemble projects of each type. For single cells, software was developed to address the sequence coverage bias and contamination expected in this type of project. After pre-filtering, single cell sequence data are assembled with both the Velvet and Allpaths-lg assemblers, and the results are merged to form a final consensus assembly. Standard microbial genomes are assembled using Allpaths-lg, which produces assemblies with very good contiguity and high consensus accuracy.
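The per-product routing described above can be pictured as a small lookup table. The sketch below is illustrative only and is not the QA/QC team's framework: the names `AssemblyPlan`, `PLANS`, `plan_for`, and `describe_steps` are hypothetical, the product-type keys are invented labels, and treating both isolate draft types as "standard" genomes assembled with Allpaths-lg is an assumption. No assembler is actually invoked; the code only lists the steps a project of each type would go through.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyPlan:
    """Hypothetical description of how one product type is assembled."""
    prefilter: bool        # pre-filter reads before assembly?
    assemblers: tuple      # assembler names, in the order they are run
    merge_results: bool    # merge the assemblies into one consensus?


# Assumption: both isolate draft types follow the "standard microbial genome" route.
STANDARD = AssemblyPlan(prefilter=False, assemblers=("Allpaths-lg",), merge_results=False)

PLANS = {
    "single_cell": AssemblyPlan(
        prefilter=True, assemblers=("Velvet", "Allpaths-lg"), merge_results=True
    ),
    "isolate_quick_draft": STANDARD,
    "isolate_high_quality_draft": STANDARD,
}


def plan_for(product_type):
    """Return the plan for a product type, rejecting unknown types."""
    try:
        return PLANS[product_type]
    except KeyError:
        raise ValueError(f"no assembly pipeline defined for {product_type!r}") from None


def describe_steps(product_type):
    """List the processing steps for a product type as readable strings."""
    plan = plan_for(product_type)
    steps = []
    if plan.prefilter:
        steps.append("pre-filter reads for coverage bias and contamination")
    steps.extend(f"assemble with {name}" for name in plan.assemblers)
    if plan.merge_results:
        steps.append("merge assemblies into a final consensus")
    return steps


if __name__ == "__main__":
    for product in PLANS:
        print(product, "->", describe_steps(product))
```

Keeping the routing in a table rather than in branching code means a new product type can be supported by adding one entry, which matters when thousands of libraries per year are processed automatically.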

Microbial genome R&D assembly efforts involve continuous testing of new assemblers and scaffolding tools as they appear in the literature, as well as assembly of data sets that the R&D group produces with new protocols such as long mate pairs and new versions of Illumina chemistry. The results and recommendations of these analyses guide changes in production protocols and assembly strategies.

Microbial genome finishing efforts focus on speeding up the finishing process, reducing cost, and integrating PacBio data. The finishing pipeline comprises five major modules: QC, assembly integration, repeat resolution, gap closing, and reporting. To reduce costs, repeats of known function (rRNAs and transposons), which on average account for more than half of the repeats in bacterial genomes, will not be finished to base-pair perfection. Instead, a consensus sequence is generated and placed at the proper locations. All other gaps, whether unique or repetitive, will be finished during gap closure with additional coverage obtained by sequencing amplicons with PacBio. Only projects that have reached high quality draft status are considered for finishing.
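The finishing policy above reduces to two decisions: whether a project is eligible at all, and how each gap is handled. The following is a minimal sketch of that logic under stated assumptions, not the actual finishing pipeline: the function names, the status string `"high_quality_draft"`, and the repeat labels `"rRNA"` and `"transposon"` are all hypothetical placeholders chosen for illustration.

```python
# The five modules named in the text, in the order they are listed there.
FINISHING_MODULES = (
    "QC",
    "assembly integration",
    "repeat resolution",
    "gap closing",
    "reporting",
)

# Repeats of known function that are not finished to base-pair perfection.
# The label strings are hypothetical; real annotations may differ.
KNOWN_FUNCTION_REPEATS = frozenset({"rRNA", "transposon"})


def is_eligible_for_finishing(draft_status):
    """Only high quality draft projects are considered for finishing."""
    return draft_status == "high_quality_draft"


def gap_strategy(repeat_kind=None):
    """Choose how to handle one gap.

    repeat_kind is None for a unique (non-repetitive) gap, otherwise a
    label for the repeat family that causes the gap.
    """
    if repeat_kind in KNOWN_FUNCTION_REPEATS:
        return "place consensus sequence"
    # Every other gap, unique or repetitive, gets extra PacBio amplicon coverage.
    return "sequence amplicons with PacBio"


if __name__ == "__main__":
    for kind in ("rRNA", "transposon", "unclassified_repeat", None):
        print(kind, "->", gap_strategy(kind))
```

Because rRNAs and transposons account for more than half of bacterial repeats on average, routing them to the cheaper consensus path is where most of the cost saving in this scheme would come from.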