Week 6 - Assembling genomes

This week I worked on the arduous task of whole-genome shotgun fragment assembly: putting the randomly cut DNA back together to form contiguous sequences (contigs, and ideally whole genomes) representative of the most predominant members of the community. Whole-genome assembly is an important step in studying this microbial community because it allows us to answer two questions: (1) who is present, and (2) how are they contributing to key metabolic processes like carbon and nitrogen fixation (through gene prediction and annotation)? However, whole-genome assembly is only possible if the metagenome contains enough coverage of the genomic DNA. Assembling contigs the size of whole genomes is also resource intensive. To address these problems, a number of parameters were set to ensure that genome assembly would be both accurate and efficient.

First, digital normalization was used to rid the metagenome of redundant reads that might skew assembly (http://ivory.idyll.org/blog/what-is-diginorm.html). This step greatly reduced the computing costs. Additionally, the program Velvet was used for assembly rather than Geneious. Velvet differs from overlap-based assemblers in that it does not use the overlap-layout-consensus approach; instead, it breaks reads into k-mers and uses them to construct a de Bruijn graph. With Velvet, the coverage cutoff and minimum contig length can also be constrained to increase the accuracy of assembly. This method of assembly is well suited to environmental genomic sequences and has been shown to assemble contigs with high accuracy. The Velvet assembly generated contigs with an N50 of 90 kb.
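To make the de Bruijn graph and N50 ideas a little more concrete, here is a minimal Python sketch. It is only an illustration and not how Velvet is implemented: the function names `de_bruijn_graph` and `n50` are my own, and the toy reads and contig lengths are made up. The first function links each (k-1)-mer to the (k-1)-mers that follow it in the reads; the second reports the contig length at which the longest contigs together cover at least half of the total assembled bases.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph from reads.

    Nodes are (k-1)-mers; each k-mer in a read adds an edge from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of
    the contig at which the running total first reaches half of the
    total assembled length. Returns 0 for an empty list.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical toy data, not from the actual metagenome.
    toy_reads = ["ACGTAC", "GTACGG"]
    for node, successors in de_bruijn_graph(toy_reads, k=4).items():
        print(node, "->", successors)
    print("N50:", n50([2, 3, 4, 5, 6]))
```

A real assembler also has to deal with sequencing errors, reverse complements, and repeats, which is where settings like the coverage cutoff come in.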

Having assembled contigs of considerable length, the next step was to figure out who in the community is responsible for the key metabolic processes. To do this, I spent the latter part of the week building the program Glimmer-MG, which first predicts which parts of the sequences are genes based on ORFs and then annotates these genes with their respective functions. Glimmer-MG uses phylogenetic binning to compare the sampled genomes with annotated microbial reference genomes obtained from RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/). Glimmer trains itself on the reference genomes and then classifies the sampled genomes based on phylogenetic similarity. While this script ran from Wednesday through Friday, I caught up on some much-needed reading.
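For anyone wondering what "predicting genes based on ORFs" means at the simplest level, here is a small Python sketch of a naive ORF scan. This is a deliberate simplification and not Glimmer-MG's actual method, which relies on models trained on reference genomes. The function name `find_orfs`, the `min_codons` parameter, and the example sequence are assumptions of mine, and the sketch only looks at the forward strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Find simple open reading frames on the forward strand.

    Scans all three reading frames for an ATG followed in-frame by a
    stop codon. Returns a sorted list of (start, end) positions, where
    end is exclusive and includes the stop codon. min_codons is the
    minimum number of codons before the stop, counting the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    # Hypothetical toy sequence, not real data.
    toy = "CCATGAAATGACC"
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])
```

A real gene finder would also scan the reverse complement and score candidate ORFs rather than accepting every one, since many ORFs in a contig are not genes.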

On Friday we visited Bonneville Dam and Multnomah Falls. It was a beautiful day to get out of the lab and enjoy the outdoors. I had not been to Bonneville since early in grade school, so it was interesting to see some of the fisheries improvements.