Fall 2020: Investigating Microbial Communities
12
Data visualization and statistics.
BIT 477/577 Fall 2020 Students
Module Learning Objectives (MOs)
- MO 6.1. Given a formatted dataset and an appropriate visualization tool, the participant will be able to accurately summarize the output for different measures of diversity.
- MO 6.2. Interpret and evaluate a 2D representation of multidimensional data (ordination plot).
- MO 6.3. Describe what published metagenomic survey figures represent.
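As a small illustration of the "measures of diversity" mentioned in MO 6.1, here is a minimal Python sketch of three common alpha-diversity summaries: observed richness, the Shannon index, and the Gini-Simpson index. The two count vectors are made-up examples, not data from the course, and the function names are only illustrative.

```python
import math

def observed_richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

def simpson_index(counts):
    """Gini-Simpson diversity 1 - sum(p_i^2); higher means more even."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical read counts for four taxa in two samples (assumed data).
even_sample = [25, 25, 25, 25]    # all taxa equally abundant
skewed_sample = [97, 1, 1, 1]     # one taxon dominates

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name,
          observed_richness(counts),
          round(shannon_index(counts), 3),
          round(simpson_index(counts), 3))
```

Both samples have the same richness (four taxa), but the even sample scores higher on the Shannon and Gini-Simpson indices, which is why richness alone can hide differences between communities.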
Here is a well-organized paper outlining metagenomic tools, titled “Multiple Data Analyses and Statistical Approaches for Analyzing Data from Metagenomic Studies and Clinical Trials”: https://link.springer.com/protocol/10.1007/978-1-4939-9074-0_20.
Here is an overview of ordination methods by M. W. Palmer. I found the “Why ordination?” section really helpful. http://ordination.okstate.edu/overview.htm
A helpful resource for using Phyloseq for microbiome data: https://www.bioconductor.org/packages/devel/bioc/vignettes/phyloseq/inst/doc/phyloseq-analysis.html
I found this helpful for explaining PCA, especially the figures: https://builtin.com/data-science/step-step-explanation-principal-component-analysis
Here is a short document I found on ordination. It goes further into the mathematics behind PCA, PCoA, and NMDS.
https://img.jgi.doe.gov/docs/Ordination.pdf
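To make the PCoA idea a little more concrete, here is a minimal Python sketch (not taken from the linked document) that computes Bray-Curtis dissimilarities between samples and then runs a classical PCoA by double-centering the squared distance matrix and taking its leading eigenvectors. The count table is invented, the function names are hypothetical, and reporting "explained" variance relative to the positive eigenvalues only is one common convention that I am assuming here.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples) of a count table."""
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den
    return dist

def pcoa(dist, n_axes=2):
    """Classical PCoA: double-center -0.5 * D^2, then take the top eigenvectors."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]          # sort axes from largest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Bray-Curtis is not Euclidean, so small negative eigenvalues can occur;
    # clip them to zero before taking the square root.
    top = np.clip(eigvals[:n_axes], 0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(top)
    explained = top / eigvals[eigvals > 0].sum()
    return coords, explained

# Hypothetical table: 4 samples (rows) x 4 taxa (columns), two obvious groups.
counts = np.array([[40, 30,  5,  0],
                   [38, 33,  4,  1],
                   [ 2,  5, 45, 30],
                   [ 0,  6, 40, 35]], dtype=float)

dist = bray_curtis(counts)
coords, explained = pcoa(dist)
print(coords)      # one (axis 1, axis 2) coordinate pair per sample
print(explained)   # share of positive-eigenvalue variation on each axis
```

In this toy table the first two samples and the last two samples end up on opposite sides of the first axis, which is the kind of grouping an ordination plot is meant to reveal.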
There is a lot of flexibility in R for data visualization. Here is a link that walks through ordination and how to generate some of the plot types in R in a digestible and straightforward way: https://ourcodingclub.github.io/tutorials/ordination/
This article was helpful to find different statistical tools for metagenomic analysis. It describes most tools that can be used in R, discusses their functions, and covers which tools are best suited for different types of datasets.
In a discussion of dimensionality reduction, I’m surprised no one has mentioned t-SNE. It is a very useful tool for embedding samples into a low-dimensional space and is very good at grouping similar samples. Unfortunately, the distances between points in a t-SNE plot are often not meaningful, but its grouping capabilities are extremely good for highly non-linear data. I’ve linked a tutorial with a Python example below. I’m unsure about R, but it is trivial to use in MATLAB (command: tsne(data)).
https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1
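For a quick feel of what running t-SNE looks like, here is a minimal Python sketch using scikit-learn's TSNE class. It is not taken from the linked tutorial; the data are randomly generated stand-ins for two groups of samples, and the perplexity value is just an assumed starting point that would need tuning on real data.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: two groups of 20 samples, each with 50 features.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(20, 50))
group_b = rng.normal(loc=5.0, scale=1.0, size=(20, 50))
data = np.vstack([group_a, group_b])

# Perplexity must be smaller than the number of samples; 10 is an assumption.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(data)

print(embedding.shape)  # one 2D point per sample
```

Each row of the embedding can be plotted as a point and colored by group. As noted above, the grouping is the informative part; the distances between clusters should not be over-interpreted.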
Here is a paper that I read during my lit review that describes a microorganism from the genus Nitrospira that is capable of complete ammonia oxidation (comammox). This is a pretty big development in my research area (soil microbiology/nitrogen cycling). Figure 3 in this paper is a good example of a Bayesian inference tree, which shows the relationship between the comammox AmoA gene and the AmoA genes found in other nitrifying microorganisms. doi: 10.1038/nature16461