Fall 2019: Investigating the Microbial Communities in Mortality Composts
2
Different metrics.
BIT 477/577 Fall 2019 Students and Carlos Goller
Learning Objectives
- Define and explain the concepts of metadata, OTU, rarefaction curve.
- Explain three different diversity metrics.
- Identify and describe the limitations and assumptions of certain diversity metrics.
What is diversity?
Definitions
- Metadata
- Data about the data. For example, date, location of sample collection, the concentration of DNA samples, etc.
- Standards for metadata are published by the Genomic Standards Consortium (gensc.org), which creates standard descriptors for metadata and sequencing approaches and helps classify the data.
- Metadata can be distinct to specific fields (clinical microbiology has different metadata than environmental microbiology).
- Indicates the “where, when and what” conditions of samples.
- OTU (Operational Taxonomic Unit): a cluster of closely related sequences, grouped by sequence similarity and used as a working stand-in for a species.
- Not always reproducible: cluster boundaries can shift with the data set and the clustering method.
- Traditionally used as a means of species identification or classifying sequence clusters
- Generally based on 16S or 18S rRNA gene sequences (ribotyping)
- Reference-based: compares sequences against known reference standards
- It can miss or misidentify species. Only as good as the reference data set.
- De novo: compares sequences against the other data in the set
- Captures information based on what is in the sample. Not generalizable.
- Amplicon Sequence Variants (ASVs): Dr. Ben Callahan
- Empirically based on the exact sequence
- Produced by DADA2, which resolves variants by denoising the reads rather than by similarity clustering
- Rarefaction Curve: a curve describing the growth of the number of species discovered (y-axis) as a function of the number of individuals sampled (x-axis).
- Allows researchers to assess species richness from sampling results
- For sequencing data: number of reads (x-axis) versus number of distinct sequence variants observed (y-axis)
- Not to be confused with “rarefying”
- Rarefying normalizes by the number of sequences present in the various samples so that all samples have the same number of sequences
- Each sample is randomly subsampled to a common depth, and that subsample is used in place of the full sample
- Controversial because some of the collected data are excluded
- High-abundance organisms affect the likelihood of finding low-abundance organisms
- Our experiment: Superimposing our data
- Comparing species present
- Used to determine whether or not we need additional sequencing
- Assumptions
- Assumes differences between sequences are genuine and not errors
- Shotgun sequencing will have great difficulty capturing the rare members
- How likely is it that more sequencing will help identify low-abundance organisms (i.e., those not yet factored into the graph)?
- Equal probability of identifying species in samples
- Rare organisms may have minimal effect
- A higher plateau indicates more sequence variability (greater richness). A plateau usually suggests that you sequenced “enough” to identify the majority of the organisms.
- The bioinformatics pipeline has minimized sequencing errors
- Every “new” distinct read represents a new organism
- Alpha diversity (focuses on one sample)
- General: within a sample
- Species richness: How many different species are present?
- Species diversity: How different is the distribution?
- Beta diversity (across many distinct samples)
- General: comparing the microbial composition of one environment with that of another
- Who is present and who is left out?
- “Is your sample different and how?”
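To make the rarefaction curve and rarefying definitions above concrete, here is a minimal Python sketch. The count table and the names `rarefaction_curve`, `step`, and `n_reps` are made up for illustration and are not from our compost data or from any specific pipeline. The idea is to randomly subsample reads at increasing depths and record how many distinct taxa show up at each depth.

```python
import random

def rarefaction_curve(counts, step, n_reps=10, seed=0):
    """Mean number of distinct taxa observed at increasing subsample depths.

    counts: dict mapping a taxon label to its read count (hypothetical data).
    step:   spacing between subsample depths.
    n_reps: number of random subsamples averaged at each depth.
    """
    rng = random.Random(seed)
    # Expand the count table into one label per read.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    curve = []
    for depth in range(step, len(reads) + 1, step):
        # Subsample without replacement and count the distinct taxa seen.
        richness = [len(set(rng.sample(reads, depth))) for _ in range(n_reps)]
        curve.append((depth, sum(richness) / n_reps))
    return curve

# Hypothetical sample: two dominant taxa and a few rare ones (100 reads total).
counts = {"taxon_A": 50, "taxon_B": 30, "taxon_C": 15, "taxon_D": 4, "taxon_E": 1}
curve = rarefaction_curve(counts, step=20)
for depth, mean_richness in curve:
    print(depth, mean_richness)
```

Each `(depth, mean_richness)` pair is one point on the curve. At the full depth every taxon is recovered, while at shallow depths the rare taxa are often missed, which is why dominant organisms affect the chance of seeing rare ones. A single draw at one fixed depth is the same operation used when rarefying a sample.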
USEFUL Videos
https://youtu.be/9ZvoR89HYP8 [Alpha Diversity, 16 min]
https://youtu.be/lcbp6EecDg4 [Beta Diversity, 20 min]
https://youtu.be/M8ylvsS0MHg [UniFrac, 18 min]
Diversity metrics
Shannon
Measures how evenly microbes are distributed in a sample.
Advantages
- Accounts for both the number of species and the abundance of each species
Limitations
- A single number that combines richness and evenness rather than measuring each independently
- Does not account for the uniqueness of the community (high biodiversity with undesirable organisms or low biodiversity with rare organisms)
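As a rough illustration of the Shannon index, the sketch below uses the standard formula H = -sum(p_i * ln(p_i)), where p_i is the proportion of reads belonging to species i. The function name `shannon_index` and both count vectors are made up for this example.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over species with nonzero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

# Two hypothetical samples with the same richness (4 species).
even_sample = [25, 25, 25, 25]    # perfectly even
skewed_sample = [97, 1, 1, 1]     # one dominant species

h_even = shannon_index(even_sample)      # ln(4), about 1.386
h_skewed = shannon_index(skewed_sample)  # about 0.168
print(h_even, h_skewed)
```

Both samples contain four species, yet the index differs a lot. This is the limitation noted above: a single number mixes richness and evenness, so a low value alone does not say which of the two is responsible.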
Chao1
Estimates the true species richness of a sample from abundance data (an alpha diversity measure). Requires the number of individuals observed for each class (for example, each species or OTU) in the sample.
- Takes into account rare species (those observed only once or twice)
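A minimal sketch of Chao1, assuming the commonly used bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)), where S_obs is the observed number of species, F1 is the number of singletons (species seen once), and F2 is the number of doubletons (species seen twice). The function name and the counts are hypothetical.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-species counts."""
    s_obs = sum(1 for c in counts if c > 0)   # species actually observed
    f1 = sum(1 for c in counts if c == 1)     # singletons
    f2 = sum(1 for c in counts if c == 2)     # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Hypothetical sample: 7 observed species, 3 singletons, 1 doubleton.
sample = [10, 6, 3, 2, 1, 1, 1]
estimate = chao1(sample)   # 7 + (3 * 2) / (2 * 2) = 8.5
print(estimate)
```

The more singletons there are relative to doubletons, the more unseen species the estimator assumes, which is how it takes rare species into account.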
Bray-Curtis
- Beta diversity
- 0 to 1 scale, easy to interpret: 0 means the samples are identical, 1 means they are completely different.
- Dissimilarity index
Limitations
- It has been argued that this coefficient may provide misleading results for species abundance data containing zeros (e.g. Orlóci, 1972; Orlóci, 1978; Legendre and Gallagher, 2001)
- Often misinterpreted as a distance. It is only a dissimilarity index: it is computed from abundance counts and does not satisfy the triangle inequality.
Advantages
- Intuitive scale for readers: easy to calculate, and the scale runs from 0 to 1 so the meaning is clear
- To express it as a percent similarity, subtract the Bray-Curtis dissimilarity (remember, a number between 0 and 1) from 1, then multiply by 100.
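The Bray-Curtis calculation can be sketched as follows, using the standard form sum(|a_i - b_i|) / sum(a_i + b_i) over per-species counts. The function name and the two count vectors are made up for illustration.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples given as count vectors."""
    if len(a) != len(b):
        raise ValueError("samples must list counts for the same species")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical here
    return sum(abs(x - y) for x, y in zip(a, b)) / total

# Hypothetical counts for the same three species in two samples.
sample_1 = [10, 0, 5]
sample_2 = [5, 5, 5]

dissimilarity = bray_curtis(sample_1, sample_2)   # 10 / 30, about 0.333
percent_similarity = (1 - dissimilarity) * 100    # about 66.7
print(dissimilarity, percent_similarity)
```

Identical samples give 0 and samples that share no species give 1, which matches the 0 to 1 scale described above.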
Jaccard
“Similarity between sites”
- Compares two samples based on the presence and absence of microbes.
- Size of the intersection divided by the size of the union of the samples.
Assumption
The more species the samples have in common, the more similar they are to each other.
Limitations/Disadvantages
Bad for small data sets; based on ranking, collaborative data sets; does not account for abundance or phylogeny.
Helpful Visualizations
- https://www.oreilly.com/library/view/hands-on-convolutional-neural/9781789130331/a0267a8a-bd4a-452a-9e5a-8b276d7787a0.xhtml
- https://thatware.co/jaccard-similarity/
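The Jaccard definition above (size of the intersection divided by the size of the union) maps directly onto Python sets. This is only a sketch; the function name and taxon labels are hypothetical.

```python
def jaccard_similarity(sample_a, sample_b):
    """Size of the intersection divided by the size of the union."""
    union = sample_a | sample_b
    if not union:
        return 1.0  # two empty samples are treated as identical here
    return len(sample_a & sample_b) / len(union)

# Hypothetical presence lists for two sites.
site_1 = {"taxon_A", "taxon_B", "taxon_C"}
site_2 = {"taxon_B", "taxon_C", "taxon_D"}

similarity = jaccard_similarity(site_1, site_2)   # 2 shared / 4 total = 0.5
distance = 1 - similarity                         # 0.5
print(similarity, distance)
```

Only presence and absence enter the calculation, so a taxon with one read counts the same as a taxon with thousands.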
Euclidean Distance
Definition
The square root of the sum of the squared differences between two data sets (similar to the Pythagorean theorem).
Builds on the most basic idea of differences between samples but can be applied across multiple variables (in a matrix).
Applications
Beta diversity. For standardized data, Euclidean distance is inversely related to correlation: the more correlated two samples are, the smaller the distance between them.
Limitation
Only appropriate when all variables are on the same scale; not good for clustering; very bad at multidimensional analysis.
Advantages
Simple to calculate with no elaborate analysis; good for absolute magnitudes.
Data Required
Empirical values on two populations where the data is on the same scale.
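A short sketch of the Euclidean distance definition above, written out by hand so each step is visible. The function name and the counts are made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# The classic 3-4-5 right triangle, as in the Pythagorean theorem.
d_triangle = euclidean_distance([3, 0], [0, 4])   # 5.0

# Hypothetical counts for three species in two samples.
d_samples = euclidean_distance([10, 0, 5], [5, 5, 5])   # sqrt(50), about 7.07
print(d_triangle, d_samples)
```

Because the differences are squared before summing, variables measured on a larger scale dominate the result, which is why the data need to be on the same scale.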
UniFrac
- Phylogeny-based beta diversity
- Fraction of the observed branch length that is unique to one sample or the other
- Identical communities D=0, related communities D=0.5, unrelated communities D=1
Unweighted
Uses only presence/absence, which emphasizes the minor species
Limitations
Unlike weighted UniFrac, does not take the abundance of populations into account
Weighted
Takes relative abundance into account, emphasizing the more dominant species; quantitative
Limitations:
- Not a reliable approach to measuring similarity
- Small samples can inflate weighted UniFrac values
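To show what "branch length unique to either sample" means, here is a toy sketch of unweighted UniFrac. It assumes the tree has already been reduced to a list of branches, each with a length and the set of tips below it. That representation, the function name, and the four-taxon tree are all made up for illustration; real analyses use a phylogenetics library and a full tree.

```python
def unweighted_unifrac(branches, sample_a, sample_b):
    """Unique branch length divided by total observed branch length.

    branches: list of (length, set_of_tips_below_branch) pairs.
    sample_a, sample_b: sets of tip labels present in each sample.
    """
    unique = 0.0
    observed = 0.0
    for length, tips in branches:
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a or in_b:
            observed += length          # branch leads to at least one sample
            if in_a != in_b:
                unique += length        # branch leads to only one sample
    return unique / observed if observed else 0.0

# Toy tree ((A:1,B:1):1,(C:1,D:1):1) written as branches.
tree = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]

d_identical = unweighted_unifrac(tree, {"A", "B"}, {"A", "B"})   # 0.0
d_partial = unweighted_unifrac(tree, {"A", "B"}, {"B", "C"})     # 3 / 5 = 0.6
d_unrelated = unweighted_unifrac(tree, {"A", "B"}, {"C", "D"})   # 1.0
print(d_identical, d_partial, d_unrelated)
```

Weighted UniFrac follows the same idea but weights each branch by the difference in relative abundance between the two samples, which this sketch does not attempt.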
Rarefaction Curve
Curve depicting the discovery of species or OTUs as a function of sequencing effort.