Chapter 2: What is diversity?  Different metrics.

BIT 477/577 Fall 2019 Students; Carlos Goller

Fall 2019: Investigating the Microbial Communities in Mortality Composts

Learning Objectives

Define and explain the concepts of metadata, OTU, and rarefaction curve.
Explain three different diversity metrics.
Identify and describe the limitations and assumptions of certain diversity metrics.

What is diversity?

Definitions

Metadata
- Data about the data, for example the date and location of sample collection or the concentration of the DNA samples.
- Indicates the “where, when and what” conditions of samples.
- The Genomic Standards Consortium (gensc.org) publishes standard descriptors for metadata and sequencing approaches, which are used to help classify the data.
- Metadata can be specific to a field (clinical microbiology uses different metadata than environmental microbiology).

OTU (Operational Taxonomic Unit): a cluster of closely related sequences, grouped by sequence similarity, that serves as a working stand-in for a species.

- Traditionally used as a means of species identification or of classifying sequence clusters
- Generally built from 16S or 18S rRNA gene sequences (ribotyping)
- Not reproducible: OTU assignments can change with the data set and the clustering method
- Reference-based: compares sequences against known reference standards
  - It can miss or misidentify species. Only as good as the reference data set.
- De novo: compares sequences against the other data in the set
  - Captures information based on what is in the sample. Not generalizable to other data sets.
- Amplicon Sequence Variants (ASVs): Dr. Ben Callahan
  - Empirically based on the exact sequence
  - Inferred by DADA2, which denoises reads rather than clustering them

Rarefaction Curve: a curve describing the growth of the number of species discovered (y-axis) as a function of the number of individuals sampled (x-axis).
- Allows researchers to assess species richness from sampling results
- For sequencing data, the axes become read number (x) and number of distinct sequence variants, i.e. species (y)
- Not to be confused with “rarefying”
  - Normalizing by the number of sequences in each sample so that all samples have the same number of sequences
  - Done by going back to each sample and randomly sub-sampling it down to a common number of sequences
    - Controversial because some of the collected data are excluded.
- High abundance organisms affect the likelihood of finding low abundance organisms
- Our experiment: Superimposing our data
  - Comparing species present
  - Used to determine whether or not we need additional sequencing
  - A plateau usually suggests that you sequenced “enough” to identify the majority of the organisms. A higher plateau indicates a richer (more varied) sample.
  - Open question: how likely is it that more sequencing will help identify low abundance organisms (i.e. those not yet factored into the graph)?
  - Shotgun sequencing will have great difficulty capturing the rare members
  - Assumptions
    - Differences between sequences are genuine and not errors
    - The bioinformatics pipeline has minimized sequencing errors
    - Every “new” (previously unseen) read represents a new organism
    - Equal probability of identifying species in samples
    - Rare organisms may have minimal effect
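
The difference between rarefying a sample and drawing a rarefaction curve can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in class: the function names `rarefy` and `rarefaction_curve`, the taxon labels, and the read counts are all made up for the example, and the curve is estimated by simple repeated random subsampling.

```python
import random

def rarefy(counts, depth, rng):
    """Randomly subsample `depth` reads, without replacement, from one sample.

    `counts` maps a taxon label to its read count.
    """
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    subsample = {}
    for taxon in rng.sample(pool, depth):
        subsample[taxon] = subsample.get(taxon, 0) + 1
    return subsample

def rarefaction_curve(counts, depths, repeats=50, seed=0):
    """Mean number of taxa observed at each read depth (x = reads, y = taxa)."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        richness = [len(rarefy(counts, depth, rng)) for _ in range(repeats)]
        curve.append((depth, sum(richness) / repeats))
    return curve

# Hypothetical sample: two dominant taxa and three rare ones (808 reads).
sample = {"A": 500, "B": 300, "C": 5, "D": 2, "E": 1}
for depth, mean_taxa in rarefaction_curve(sample, [10, 50, 200, 808]):
    print(depth, round(mean_taxa, 2))
```

At shallow depths the rare taxa are usually missed because the dominant taxa take up most of the reads. The curve only reaches all five taxa once every read is included.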

Alpha diversity (focuses on one sample)
- General: diversity within a sample
- Species richness: How many different species are present?
- Species diversity: How are individuals distributed among those species (richness together with evenness)?

Beta diversity (across many distinct samples)
- General: comparing the microbial composition of one environment to another
  - Who is present, who is left out
  - “Is your sample different and how?”

Useful Videos

https://youtu.be/9ZvoR89HYP8 [Alpha Diversity, 16 min]

https://youtu.be/lcbp6EecDg4 [Beta Diversity, 20 min]

https://youtu.be/M8ylvsS0MHg [UniFrac, 18 min]

Diversity metrics

Shannon

Measures how many kinds of microbes are in a sample and how evenly they are distributed.

Advantages

- Accounts for the number of species and abundance of species

Limitations

- Combines richness and evenness into a single number rather than reporting them as independent measurements
- Does not account for which organisms make up the community (a highly diverse community may be full of undesirable organisms, and a low-diversity community may contain rare organisms)
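
As a rough illustration, the Shannon index is commonly written as H = -Σ p_i ln(p_i), where p_i is the proportion of the sample belonging to species i. The sketch below assumes the natural logarithm (some tools use log base 2 instead), and the two count vectors are invented to show the effect of evenness.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no reads")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical samples with the same richness (4 taxa) but different evenness.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
print(shannon_index(even))    # equals ln(4), the maximum for 4 taxa
print(shannon_index(skewed))  # much lower: one taxon dominates
```

Both samples contain four taxa, so the difference in the index comes entirely from how evenly the reads are spread. This is why the single number cannot be split back into richness and evenness.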

Chao1

Estimates the true species richness of a sample from abundance data (an alpha diversity measure). It requires the number of individuals observed for each class (taxon) in the sample.

Takes rare species into account: the estimate is driven by the taxa that were seen only once or twice.
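
A minimal sketch of how rare species enter the calculation, assuming the widely used bias-corrected form of Chao1: S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 is the number of taxa seen exactly once (singletons) and F2 the number seen exactly twice (doubletons). The counts are invented for the example.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-taxon counts."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    # The correction term stays defined even when there are no doubletons.
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Hypothetical sample: 7 observed taxa, 3 singletons, 2 doubletons.
counts = [10, 5, 2, 2, 1, 1, 1]
print(chao1(counts))  # 8.0, i.e. about one taxon estimated to be unseen
```

Many singletons relative to doubletons suggests that more rare taxa remain undetected, so the estimate rises above the observed richness.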

Bray-Curtis

A beta diversity measure.
A dissimilarity index on a 0 to 1 scale that is easy to interpret: 0 means the samples are identical, 1 means they are completely different.

Limitations

- It has been argued that this coefficient may give misleading results for species abundance data containing zeros (e.g. Orlóci, 1972; Orlóci, 1978; Legendre and Gallagher, 2001)
- Often misinterpreted as a distance. It is only a dissimilarity index: it is computed from counts and does not satisfy the triangle inequality, so it is not a true distance metric

Advantages

- Intuitive scale for readers. Easy to calculate, and because the scale runs from 0 to 1 the meaning is clear
- Can be reported as a percent similarity: subtract the Bray-Curtis dissimilarity (remember, a number between 0 and 1) from 1, then multiply by 100.
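
For illustration, Bray-Curtis dissimilarity can be computed as the sum of absolute differences in counts divided by the total counts in both samples. The sketch below assumes raw counts stored in dictionaries and at least one non-empty sample. The function name and the counts are made up.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two {taxon: count} samples."""
    taxa = set(sample_a) | set(sample_b)
    difference = sum(abs(sample_a.get(t, 0) - sample_b.get(t, 0)) for t in taxa)
    total = sum(sample_a.values()) + sum(sample_b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return difference / total

# Hypothetical counts for two samples.
site_1 = {"A": 6, "B": 4}
site_2 = {"A": 2, "B": 4, "C": 4}
dissimilarity = bray_curtis(site_1, site_2)
percent_similarity = (1 - dissimilarity) * 100
print(dissimilarity, percent_similarity)
```

Here the absolute differences add up to 8 out of 20 total counts, so the dissimilarity is 0.4 and the two samples are about 60 percent similar.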

Jaccard

“Similarity between sites”

Compares two samples based on the presence and absence of microbes.
Size of the intersection divided by the size of the union of the samples.

Assumption

The more species the samples have in common, the more similar they are to each other.

Limitations/Disadvantages

- Performs poorly on small data sets
- Based on ranking, collaborative data sets
- Does not account for phylogeny
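
The intersection-over-union calculation can be sketched with Python sets. The function name and the species labels are placeholders chosen for the example. Only presence or absence matters, so abundances never enter the calculation.

```python
def jaccard_similarity(taxa_a, taxa_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(taxa_a), set(taxa_b)
    union = a | b
    if not union:
        raise ValueError("both samples are empty")
    return len(a & b) / len(union)

# Hypothetical presence lists for two sites.
site_1 = ["A", "B", "C"]
site_2 = ["B", "C", "D"]
print(jaccard_similarity(site_1, site_2))  # 0.5: 2 shared taxa out of 4 total
```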

Helpful Visualizations

Euclidean Distance

Definition

The square root of the sum of the squared differences between two samples (a generalization of the Pythagorean theorem).

Builds on the most basic idea of differences between samples but can be applied across multiple variables (in a matrix).
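
A short sketch of the definition, assuming each sample is a list of values measured for the same variables in the same order and on the same scale. The function name and the numbers are illustrative only.

```python
import math

def euclidean_distance(sample_a, sample_b):
    """Square root of the sum of squared differences across all variables."""
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sample_a, sample_b)))

# With two variables this is the Pythagorean theorem: a 3-4-5 triangle.
print(euclidean_distance([3, 0], [0, 4]))  # 5.0
```

The same function works for any number of variables, which is how the idea extends from two measurements to a full abundance matrix.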

Applications

Beta diversity. Correlation is inversely related to Euclidean distance (the more strongly two samples correlate, the smaller the distance between them).

Limitations

Only appropriate when the variables are on the same scale; not well suited to clustering; performs poorly in multidimensional analysis.

Advantages

Simple to compute, requires no elaborate analysis, and is good for comparing absolute magnitudes.

Data Required

Empirical values for two populations, with the data on the same scale.

UniFrac

Phylogeny-based beta diversity
The fraction of the observed branch length on the phylogenetic tree that is unique to one sample or the other
Identical communities give D=0, unrelated communities give D=1, and partially related communities fall in between (e.g. D=0.5)

Unweighted

Uses only presence/absence, which emphasizes the minor species

Limitations

Does not take the abundance of populations into account, as weighted UniFrac does
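
The unweighted calculation can be sketched on a toy tree. This is a simplified illustration, not how real UniFrac software stores a phylogeny: the tree is given as a hypothetical list of (branch length, set of tips below that branch) pairs, and the tip names, branch lengths, and function name are invented.

```python
def unweighted_unifrac(branches, taxa_a, taxa_b):
    """Fraction of observed branch length that leads to only one sample."""
    a, b = set(taxa_a), set(taxa_b)
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & a)
        in_b = bool(tips & b)
        if in_a and in_b:
            shared += length   # branch leads to taxa found in both samples
        elif in_a or in_b:
            unique += length   # branch leads to taxa found in one sample only
    observed = unique + shared
    if observed == 0:
        raise ValueError("neither sample contains a taxon from the tree")
    return unique / observed

# Hypothetical tree ((A:1,B:1):1,(C:1,D:1):1) written as branch records.
tree = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]
print(unweighted_unifrac(tree, {"A", "B"}, {"A", "B"}))  # 0.0, identical
print(unweighted_unifrac(tree, {"A", "B"}, {"C", "D"}))  # 1.0, unrelated
print(unweighted_unifrac(tree, {"A", "B"}, {"B", "C"}))  # 0.6, partly shared
```

Because only presence or absence is checked, a taxon with a single read counts as much as a dominant one, which is the limitation noted above.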

Weighted

Quantitative: takes relative abundance into account, which emphasizes the more dominant species
