Chapter 4. What are OTUs and ASVs? Introduction to QIIME environment and DADA2.

BIT 477/577 Fall 2019 Students; Carlos Goller

Fall 2019: Investigating the Microbial Communities in Mortality Composts

Learning Objectives

Discuss and critically evaluate each step of a 16S QIIME standard operating procedure (SOP).
Explain the utility and fundamental tools of QIIME2.
Compare and contrast OTUs and ASVs.

Historically, two methods were used to observe microbes:

Microscope
Culture

These methods are very limited: microscopy can’t give any information about genes, and the majority of microbes can’t be cultured. Sequencing can overcome both of these challenges and continues to get cheaper.

Marker-gene Sequencing:

Targeted approach
Analyzes the composition of a community
Targets a variable region that has conserved flanking regions (where the primers bind)
Uses barcodes

Challenges:

Errors

OTUs: operational taxonomic units

- Clusters of sequences that fall within a fixed similarity threshold
- Closed-reference methods for defining OTUs: reads that are sufficiently similar to a sequence in a reference database are recruited into a corresponding OTU
- De novo methods of defining OTUs: reads are grouped into OTUs as a function of their pairwise similarities

Closed reference OTUs: mapping against a reference database

Limitations:
- OTUs are defined by reference sequences that were not directly observed in the data/sample; amplicons with no matching reference are lost
- Dependent on the reference database used (e.g., did you use a human gut reference set when looking at a marine community?)

Reference Sequences:
- Confirmed/supported, previously characterized sequence(s)
- Data are mapped to the closest reference
- Represented data: OTUs (closed reference)
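The closed-reference idea above can be sketched in a few lines of Python. This is only a toy illustration, not how QIIME 2 or any real OTU picker is implemented: the function names, the position-by-position identity measure, the equal-length toy sequences, and the 0.9 threshold in the usage note are all assumptions made for the example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_otus(reads, references, threshold=0.97):
    """Assign each read to its closest reference; set aside reads below threshold.

    `references` maps a (hypothetical) reference ID to its sequence.
    Returns (counts per reference OTU, list of reads that were not recruited).
    """
    otu_counts = {}
    unassigned = []
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            otu_counts[best_id] = otu_counts.get(best_id, 0) + 1
        else:
            # No sufficiently similar reference: this read is lost to the analysis.
            unassigned.append(read)
    return otu_counts, unassigned
```

With two made-up 20-nt references, `ref_A = "ACGTACGTACGTACGTACGT"` and `ref_B = "TTTTGGGGCCCCAAAATTTT"`, and a threshold of 0.9, a read with one mismatch to `ref_A` is recruited into the `ref_A` OTU, while a read of twenty `G`s matches neither reference and is dropped. That dropped read is the "lost amplicon data" limitation listed above.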

de novo OTUs:

Made directly from the data, as opposed to by proxy via a reference database.

Data are clustered around the most common representative sequences, grouped to within 3%, for example.

Most common algorithm: find the most abundant sequence in the data, enclose the region around it (~70%), collapse, analyze, re-collapse
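A minimal sketch of the greedy, abundance-ordered clustering described above is shown below. It is an illustration under stated assumptions, not the algorithm of any particular tool: the function names are hypothetical, identity is measured position by position on equal-length toy sequences, and the threshold is a parameter (0.97 corresponds to the "within 3%" example).

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_de_novo_otus(reads, threshold=0.97):
    """Cluster reads into OTUs, seeding each OTU with the most abundant sequence left.

    Returns a dict mapping each centroid sequence to the total read count of its OTU.
    """
    abundance = Counter(reads)
    centroids = []   # centroid sequences, in the order they were created
    otu_counts = {}
    # Most abundant sequences first; ties broken alphabetically for determinism.
    for seq, count in sorted(abundance.items(), key=lambda kv: (-kv[1], kv[0])):
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                otu_counts[centroid] += count   # collapse into an existing OTU
                break
        else:
            centroids.append(seq)               # start a new OTU
            otu_counts[seq] = count
    return otu_counts
```

Because the centroids depend on which sequences happen to be most abundant in a given dataset, running the same procedure on new data can produce different OTUs, which is the replicability weakness discussed in this section.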

Strengths

1. Working with actual data
2. Groups errors where they are closest
3. More resolution/precision

Weaknesses

1. Lost resolution from collapsing; no reference database is perfect, so data can be lost
2. Very sensitive to detail (okay at the phylum level, but less so at the more granular levels/lower taxa)
    1. Increased occurrence of random error
3. Labels are not consistent
4. Not replicable: every run with new data may generate new OTUs

Amplicon Sequence Variants

Removes errors; can resolve true populations even when they are close together
Consistent labels that can be reproduced between samples, rather than labels that only come from clustering
Continuous data integration, because each dataset can be analyzed separately and then compared to the whole

Distinguishing Signal from Noise

ASVs (DADA2)

Effective Hamming Distance
Allows continuous data integration
Eliminates the need for reprocessing of raw data.
Unlimited dataset size
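To make the distance idea concrete, here is a small Python sketch of the plain Hamming distance (the number of positions at which two equal-length sequences differ), together with the number of mismatches a fixed OTU threshold tolerates. It only illustrates the distance measure; DADA2 itself relies on a statistical error model that is not reproduced here, and the function names and toy sequences are assumptions of this example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def max_mismatches_within_threshold(read_length: int, similarity_percent: int = 97) -> int:
    """Largest number of mismatches that still satisfies the similarity threshold."""
    return read_length * (100 - similarity_percent) // 100


# Two made-up 100-nt variants that differ at a single position.
variant_a = "ACGT" * 25
variant_b = variant_a[:50] + "A" + variant_a[51:]   # position 50 changes from G to A

distance = hamming_distance(variant_a, variant_b)
tolerated = max_mismatches_within_threshold(len(variant_a), 97)
merged_by_otu_clustering = distance <= tolerated
```

Here the two variants differ by one nucleotide, well inside what a 97% threshold tolerates for a 100-nt read, so threshold-based clustering would place them in the same OTU. An ASV method aims to keep them as two separate variants when both are judged to be real.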

OTUs

A limitation of OTUs is that the approach groups the data in a way that may make the dataset unusable in the future, for example if the OTUs identified in the past are not identified in a future dataset. The OTUs are a secondary dataset derived from the data actually generated, so they may not be relevant or applicable later, and their labels are potentially inconsistent.

ASVs (Amplicon sequence variants):

Unique sequences inferred to be present in the original sample, after correcting for errors from sequencing and sample preparation (e.g., PCR).
Does not use the arbitrary dissimilarity thresholds that define molecular OTUs.
These methods infer the biological sequences in the sample prior to the introduction of amplification and sequencing errors, and distinguish sequence variants that differ by as little as one nucleotide.

ASVs are precise, tractable, reproducible, comprehensive, continuous labels.

Also known as exact sequence variants (ESVs), zero-radius OTUs (zOTUs), sub-OTUs (sOTUs), haplotypes, oligotypes …
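One way to picture why ASVs act as consistent labels is that the label is the sequence itself, so tables from separate runs can be combined without reprocessing the raw reads. The sketch below assumes a simple, hypothetical table layout (sample ID mapped to sequence counts) and made-up sample names and sequences; it is not the QIIME 2 feature-table format.

```python
from collections import Counter


def merge_asv_tables(*tables):
    """Merge per-run tables of the form {sample_id: {sequence: count}}.

    Because each feature is labelled by its exact sequence, the same variant
    found in different runs lands under the same key automatically.
    """
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            merged.setdefault(sample, Counter()).update(counts)
    return merged


# Hypothetical tables from two independently processed sequencing runs.
run_1 = {"compost_day0": {"ACGTAC": 10, "TTGGCC": 4}}
run_2 = {"compost_day30": {"ACGTAC": 3, "GGGAAA": 8}}

merged = merge_asv_tables(run_1, run_2)
shared_variants = set(merged["compost_day0"]) & set(merged["compost_day30"])
```

In this toy case `shared_variants` contains the one sequence seen in both runs. With de novo OTUs the same comparison would require re-clustering both datasets together, because an OTU label from one run has no fixed meaning in another.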

Figure: A 16S QIIME standard operating procedure (SOP).

Figure source: https://h3abionet.github.io/H3ABionet-SOPs/16s-rRNA

What is QIIME 2? A microbiome analysis pipeline. “QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.”

QIIME2 SOP: https://chmi-sops.github.io/mydoc_qiime2.html

QIIME2 workflow overview: https://docs.qiime2.org/2019.10/tutorials/overview/

Connect to the server (Linux)
Obtain and import FASTQ files and metadata
Demultiplex (determine which sample each read came from)
Sequence quality control, denoising, and clustering to produce a feature table (the counts, or frequencies, of each unique sequence in each sample in the dataset) → uses the DADA2 plugin, or you can use Deblur
Feature table summary
Phylogenetic diversity analyses, including weighted and unweighted UniFrac
Alpha and beta diversity analysis
Alpha rarefaction plotting
Taxonomic analysis
Differential abundance across samples → uses ANCOM
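Two of the steps above, demultiplexing and building a feature table, can be illustrated with a short Python sketch. It is a conceptual toy, not what QIIME 2 runs internally: it assumes the barcode is simply the first few bases of each read, skips quality filtering and denoising entirely, and uses hypothetical function names, barcodes, and sample IDs.

```python
from collections import Counter, defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group reads by sample using the barcode at the start of each read."""
    per_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_length], read[barcode_length:]
        sample = barcode_to_sample.get(barcode)
        if sample is not None:          # reads with an unknown barcode are skipped
            per_sample[sample].append(insert)
    return dict(per_sample)


def feature_table(per_sample):
    """Count how often each unique sequence occurs in each sample."""
    return {sample: Counter(seqs) for sample, seqs in per_sample.items()}


# Hypothetical barcodes and reads (4-nt barcode followed by a 4-nt insert).
barcodes = {"AAAA": "sample_1", "CCCC": "sample_2"}
raw_reads = ["AAAAACGT", "AAAAACGT", "AAAATTGG", "CCCCACGT", "GGGGACGT"]

table = feature_table(demultiplex(raw_reads, barcodes, barcode_length=4))
```

The resulting `table` has one row per sample and one count per unique sequence, which is the shape of data that the later summary, diversity, and differential abundance steps work from. The last read carries a barcode that matches no sample and is therefore left out.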

A YouTube video that discusses QIIME: https://youtu.be/nWeRN2lKIto

License

Creative Commons Attribution 4.0 International License