Fall 2020: Investigating Microbial Communities
8
Whole metagenome sequencing and analysis.
BIT 477/577 Fall 2020 Students
Module Learning Objectives (MOs)
- MO 2.1. Perform quality control of read libraries. (CO 3, CO 4, CO 5)
- MO 2.2. Predict taxonomic population structure of environmental shotgun reads. (CO 3, CO 4, CO 5)
- MO 2.3. Identify high-quality bins to extract and annotate. (CO 3, CO 4, CO 5)
- MO 2.4. Annotate genes of extracted MAGs. (CO 3, CO 4, CO 5)
- MO 2.5. Place MAGs into a species tree. (CO 3, CO 4, CO 5)
Metagenome assembled genomes (MAGs)
MO 2.1. Two conditions in the read quality report that suggest reads need trimming before assembly:
- Per Sequence Quality Scores: Ideally, you will only see a single peak on the far right. If you see a second peak to the center or left, you may have a subset of low-quality reads that need to be trimmed out.
- Sequence Duplication Levels: This plot shows the relative number of sequences at each level of duplication. It is based on only a subset of the data, but it should give you a good idea of how many sequences have duplicates. If a substantial share of sequences falls at a duplication level of 2 or more, you may need to trim the library to remove them.
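The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not the quality-control tool's actual implementation: it assumes a plain 4-line-per-record FASTQ with Phred+33 quality encoding, counts only exact duplicate sequences, and uses a made-up toy library (`TOY_FASTQ`) and hypothetical function names.

```python
from collections import Counter
from io import StringIO

# Hypothetical toy library: two identical high-quality reads, one low-quality read.
TOY_FASTQ = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTTT\n+\n####\n"

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def mean_quality(qual, offset=33):
    """Mean Phred score of one read (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def summarize(handle):
    """Return (quality histogram, duplication levels) for a read library."""
    quality_hist = Counter()  # truncated mean quality -> number of reads
    seq_counts = Counter()    # sequence -> number of times it occurs
    for _, seq, qual in read_fastq(handle):
        quality_hist[int(mean_quality(qual))] += 1
        seq_counts[seq] += 1
    # duplication level -> number of distinct sequences seen that many times
    dup_levels = Counter(seq_counts.values())
    return quality_hist, dup_levels

quality_hist, dup_levels = summarize(StringIO(TOY_FASTQ))
print(sorted(quality_hist.items()))  # [(2, 1), (40, 2)]
print(sorted(dup_levels.items()))    # [(1, 1), (2, 1)]
```

In the toy output, the read with mean quality 2 is the kind of second, low-quality peak that suggests trimming, and the entry at duplication level 2 shows one sequence that occurs twice.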
I had difficulty figuring out how to find metagenome files for the assignment, so here is the workflow that worked for me. To obtain metagenomic sequencing data for the Gold Narrative:
- Go to ncbi.nlm.nih.gov.
- Next to the search bar, select “SRA” and type in your keywords.
- When the results appear, select “DNA” and “Library layout=paired” on the left; on the right, under “search in related databases,” select “BioProject.”
- Select a result.
- At the bottom of the page, click “SRA”; this brings up the sample information.
- At the bottom of the list, under “Runs,” click the hyperlink.
- This brings up a page with a table. The hyperlink in the table beginning with “https://sra-download…” is the link to copy and paste into the KBase data import.
Addendum: if you load ncbi.nlm.nih.gov through your NCSU proxy, the link you copy will begin with “https://sra-download-ncbi-nlm-nih-gov.prox.lib.ncsu.edu” and will not work in KBase.
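A quick way to catch this before pasting a link into KBase is to check the hostname. The sketch below is only an illustration: the function name is hypothetical, the proxy suffix is taken from the proxied link above, and the second URL is a placeholder rather than a real download link.

```python
from urllib.parse import urlparse

# Suffix observed in the proxied link above.
PROXY_SUFFIX = ".prox.lib.ncsu.edu"

def is_proxied(url):
    """Return True if the URL's hostname goes through the NCSU library proxy."""
    host = urlparse(url).hostname or ""
    return host.endswith(PROXY_SUFFIX)

print(is_proxied("https://sra-download-ncbi-nlm-nih-gov.prox.lib.ncsu.edu"))  # True
print(is_proxied("https://example.org/placeholder-run"))  # False
```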
Here is a great explanation of what CheckM does: https://www.youtube.com/watch?v=sLtSDs3sh6k&feature=youtu.be
Interleaved files: single FASTA/FASTQ files that contain both the forward and the reverse reads, typically with each forward read immediately followed by its reverse mate.
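To make the layout concrete, here is a minimal sketch that interleaves a forward and a reverse FASTQ file. It assumes both files list the same reads in the same order and use 4 lines per record; the function names and the toy reads are hypothetical.

```python
from io import StringIO
from itertools import islice, zip_longest

def fastq_records(handle):
    """Yield each FASTQ record as a list of its 4 newline-terminated lines."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield [line.rstrip("\n") + "\n" for line in record]

def interleave(forward, reverse, out):
    """Write forward/reverse mates alternately to out; return the number of pairs."""
    pairs = 0
    for f_rec, r_rec in zip_longest(fastq_records(forward), fastq_records(reverse)):
        if f_rec is None or r_rec is None:
            raise ValueError("forward and reverse files have different numbers of reads")
        out.writelines(f_rec)  # forward read first
        out.writelines(r_rec)  # then its reverse mate
        pairs += 1
    return pairs

# Toy paired reads (made up for illustration).
FWD = "@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n"
REV = "@read1/2\nTTAA\n+\nIIII\n@read2/2\nCCGG\n+\nIIII\n"

out = StringIO()
n_pairs = interleave(StringIO(FWD), StringIO(REV), out)
headers = out.getvalue().splitlines()[0::4]
print(n_pairs)  # 2
print(headers)  # ['@read1/1', '@read1/2', '@read2/1', '@read2/2']
```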
“CheckM improves on this by checking for single-copy genes that a genome of the bin’s taxonomy is expected to have [24]. The percentage of expected single-copy genes that is found in a bin is interpreted as its completion, while the contamination is estimated from the percentage of single-copy genes that are found in duplicate.” https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0541-1
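The arithmetic described in the quote can be sketched as follows. This is a simplified illustration of the idea only, not CheckM's actual algorithm, which works with lineage-specific marker sets; the marker names, counts, and function name are hypothetical.

```python
def estimate_bin_quality(expected_markers, marker_counts):
    """Return (completeness %, contamination %) for one bin.

    expected_markers: single-copy genes expected for the bin's taxonomy
    marker_counts:    how many copies of each marker were found in the bin
    """
    expected = set(expected_markers)
    found = sum(1 for m in expected if marker_counts.get(m, 0) >= 1)
    duplicated = sum(1 for m in expected if marker_counts.get(m, 0) >= 2)
    completeness = 100.0 * found / len(expected)
    contamination = 100.0 * duplicated / len(expected)
    return completeness, contamination

# Hypothetical bin: 3 of 4 expected markers found, one of them twice.
expected = ["geneA", "geneB", "geneC", "geneD"]
counts = {"geneA": 1, "geneB": 2, "geneC": 1}
print(estimate_bin_quality(expected, counts))  # (75.0, 25.0)
```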
The KBase app catalog also provides fairly detailed information on the various apps. For instance, to cut down on compute cost, the Kaiju app can be scaled to different taxonomic levels depending on what you are looking for.
https://kbase.us/applist/apps/kb_kaiju/run_kaiju/release
I like a quick reference to help make decisions. This paper helps draw out what a “high-quality” bin looks like:
Li, Qi; Lin, Phoebe; Yang, Chen; Wang, Juanping; Lin, Yan; Mengyuan, Shen; Park, Min; Li, Tao; Zhao, Jindong. (2018). A Large-Scale Comparative Metagenomic Study Reveals the Functional Interactions in Six Bloom-Forming Microcystis-Epibiont Communities. Frontiers in Microbiology, 9. doi:10.3389/fmicb.2018.00746.
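Once you have settled on criteria for a “high-quality” bin, applying them is a simple filter. In the sketch below, the default cutoffs (at least 90% completeness, at most 5% contamination) are an assumed convention used for illustration and are not taken from the paper above; the bin names and values are made up. Substitute whatever thresholds your reference uses.

```python
def high_quality_bins(bins, min_completeness=90.0, max_contamination=5.0):
    """Return names of bins that meet both cutoffs.

    bins maps a bin name to (completeness %, contamination %).
    The default cutoffs are illustrative assumptions.
    """
    return [
        name
        for name, (completeness, contamination) in bins.items()
        if completeness >= min_completeness and contamination <= max_contamination
    ]

# Hypothetical bin statistics.
toy_bins = {
    "bin.001": (97.2, 1.1),  # passes both cutoffs
    "bin.002": (92.0, 8.4),  # complete enough, but too contaminated
    "bin.003": (61.5, 0.3),  # clean, but too incomplete
}
print(high_quality_bins(toy_bins))  # ['bin.001']
```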
Here is a good link that provides a tutorial on how to do quality control analysis on read libraries.
Metagenome assembled genomes
