DADA2 workflow for processing the 16S rRNA and the ITS high temperature data sets. Workflow uses paired end reads, beginning with raw fastq files, ending with sequence and taxonomy tables.
In order to run this workflow, you either need the raw data, available on the figshare project site (see below), or the trimmed data, available from the European Nucleotide Archive (ENA) under project accession number PRJEB45074. See the Data Availability page for more details.
All files needed to run this workflow can be downloaded from figshare.
NOTE: The reverse reads for 16S rRNA data set are unusable (see below), therefore this workflow deals with forward reads only.
We will screen the forward reads for reverse primers and show the quality plots of the reverse reads at the beginning of the workflow. After that, reverse reads are not used.
Overview
Sequence Files & Samples
We sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot.
save an image of seqtab & taxtab for next part of workflow
1. Set Working Environment
Next, we need to setup the working environment by renaming the fastq files and defining a path for the working directory.
Rename samples
To make the parsing easier, we will eliminate the lane and well number from each sample. If you do not wish to do that you will need to adjust the code accordingly. I am positive there are more elegant ways of doing this.
CAUTION. If you use this for your own data, please check that this code works on a few backup files before proceeding.
Before we start the DADA2 workflow we need to run catadapt(Martin 2011) on all fastq.gz files to trim the primers. For bacteria and archaea, we amplified the V4 hypervariable region of the 16S rRNA gene using the primer pair 515F (GTGCCAGCMGCCGCGGTAA) and 806R (GGACTACHVGGGTWTCTAAT) (Caporaso et al. 2011), which should yield an amplicon length of about 253 bp.
Next, we check the presence and orientation of these primers in the data. I started doing this for ITS data because of primer read-through but I really like the general idea of doing it just to make sure nothing funny is going of with the data. To do this, we will create all orientations of the input primer sequences. In other words the Forward, Complement, Reverse, and Reverse Complement variations.
Now we do a little pre-filter step to eliminate ambiguous bases (Ns) because Ns make mapping of short primer sequences difficult. This step removes any reads with Ns. Again, set some files paths, this time for the filtered reads.
Nice. Time to assess the number of times a primer (and all primer orientations) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads. Since this is 16S rRNA we do not anticipate any issues but it is worth checking anyway.
sampnum<-2
primerHits<-function(primer, fn){# Counts number of reads in which the primer is foundnhits<-vcountPattern(primer, sread(readFastq(fn)), fixed =FALSE)return(sum(nhits>0))}
As expected, forward primers predominantly in the forward reads and very little evidence of reverse primers.
3. Remove Primers
Now we can run catadapt(Martin 2011) to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt.
cutadapt<-"/Users/rad/miniconda3/envs/cutadapt/bin/cutadapt"system2(cutadapt, args ="--version")# Run shell commands from R
2.8
We set paths and trim the forward primer and the reverse-complement of the reverse primer off of R1 (forward reads) and trim the reverse primer and the reverse-complement of the forward primer off of R2 (reverse reads).
This is cutadapt 2.8 with Python 3.7.6
Command line parameters: -g GTGCCAGCMGCCGCGGTAA -a ATTAGAWACCCBDGTAGTCC \
-m 20 -n 2 -e 0.12 \
-o RAW/cutadapt/P1_R1.fastq.gz RAW/filtN/P1_R1.fastq.gz
Note. If the code above removes all of the base pairs in a sequence, you will get downstream errors unless you set the -m flag. This flag sets the minimum length and reads shorter than this will be discarded. Without this flag, reads of length 0 will be kept and cause issues. Also, a lot of output will be written to the screen by cutadapt!.
We can now count the number of primers in the sequences from the output of cutadapt.
Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads.
(16S rRNA) Table 2 | Total reads per sample after prefiltering and primer removal (using cutadapt).
4. Quality Assessment & Filtering
We need the forward and reverse fastq file names plus the sample names.
First let’s look at the quality of our reads. The numbers in brackets specify which samples to view. Here we are looking at an aggregate plot of all data (except the negative control).
These parameters should be set based on the anticipated length of the amplicon and the read quality.
And here is a table of how the filtering step affected the number of reads in each sample. As you can see, there are a few samples that started with a low read count to begin with—we will likely remove those samples at some point.
(16S rRNA) Table 3 | Total reads per sample after filtering.
5. Learn Error Rates
Now it is time to assess the error rate of the data. The DADA2 algorithm uses a parametric error model. Every amplicon data set has a different set of error rates and the learnErrors method learns this error model from the data. It does this by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. The algorithm begins with an initial guess, for which the maximum possible error rates in the data are used.
(16S rRNA) Figure 3 | Forward reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.
The error rates for each possible transition (A to C, A to G, etc.) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected.
6. Dereplicate Reads
Now we can use derepFastq to identify the unique sequences in the forward and reverse fastq files.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R1.fastq.gz
Encountered 39950 unique sequences from 79804 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R1.fastq.gz
Encountered 31163 unique sequences from 72760 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R1.fastq.gz
Encountered 35274 unique sequences from 62598 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R1.fastq.gz
Encountered 42119 unique sequences from 81240 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R1.fastq.gz
Encountered 40336 unique sequences from 86732 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R1.fastq.gz
Encountered 31104 unique sequences from 60860 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R1.fastq.gz
Encountered 35169 unique sequences from 63037 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R1.fastq.gz
Encountered 27752 unique sequences from 54127 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R1.fastq.gz
Encountered 49045 unique sequences from 109421 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R1.fastq.gz
Encountered 46853 unique sequences from 94740 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R1.fastq.gz
Encountered 35363 unique sequences from 71678 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R1.fastq.gz
Encountered 35567 unique sequences from 73639 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R1.fastq.gz
Encountered 38518 unique sequences from 72263 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R1.fastq.gz
Encountered 43426 unique sequences from 89315 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R1.fastq.gz
Encountered 13673 unique sequences from 33000 total sequences read.
7. DADA2 & ASV Inference
At this point we are ready to apply the core sample inference algorithm (dada) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference.
If pool = TRUE, the algorithm will pool together all samples prior to sample inference.
If pool = FALSE, sample inference is performed on each sample individually.
If pool = "pseudo", the algorithm will perform pseudo-pooling between individually processed samples.
We tested all three methods through the full pipeline. Click the + to see the results of the test. For our final analysis, we chose pool = FALSE for this data set.
Show/hide Results of dada pooling options
Here are summary tables of results from the tests. Values are from the final sequence table produced by each option.
(16S rRNA) Table 4 | Total number of reads, total number of ASVs, minimum and maximum ASVs, followed by the number of singletons, doubletons, etc. for pooling options.
(16S rRNA) Table 5 | Total number of reads and ASVs by sample for pooling options.
dadaFs<-dada(derepFs, err =errF, multithread =TRUE, pool =FALSE)
Detailed results of dada on forward reads
Sample 1 - 79804 reads in 39950 unique sequences.
Sample 2 - 72760 reads in 31163 unique sequences.
Sample 3 - 62598 reads in 35274 unique sequences.
Sample 4 - 81240 reads in 42119 unique sequences.
Sample 5 - 86732 reads in 40336 unique sequences.
Sample 6 - 60860 reads in 31104 unique sequences.
Sample 7 - 63037 reads in 35169 unique sequences.
Sample 8 - 54127 reads in 27752 unique sequences.
Sample 9 - 109421 reads in 49045 unique sequences.
Sample 10 - 94740 reads in 46853 unique sequences.
Sample 11 - 71678 reads in 35363 unique sequences.
Sample 12 - 73639 reads in 35567 unique sequences.
Sample 13 - 72263 reads in 38518 unique sequences.
Sample 14 - 89315 reads in 43426 unique sequences.
Sample 15 - 33000 reads in 13673 unique sequences.
As an example, we can inspect the returned dada-class object for the forward and reverse reads from the sample #2:
The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants. We have 21685 sequence variants but also a range of sequence lengths. Since many of these sequence variants are singletons or doubletons, we can just select a range that corresponds to the expected amplicon length and eliminate the spurious reads.
(16S rRNA) Figure 4 | Distribution of read length by total ASVs before removing extreme length variants.
After removing the extreme length variants, we have 21573, a reduction of 112 sequence variants.
9. Remove Chimeras
Even though the dada method corrects substitution and indel errors, chimeric sequences remain. According to the DADA2 documentation, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant parent sequences.
Chimera checking removed an additional 1241 sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about 4.03405% of the merged sequence reads. Not bad.
The last thing we want to do is write the sequence table to an RDS file.
The assignTaxonomy command implements the naive Bayesian classifier, so for reproducible results you need to set a random number seed (see issue #538). We did this at the beginning of the workflow. For taxonomic assignment, we are using the Silva version 138 (Quast et al. 2012). The developers of DADA2 maintain a formatted version of the database.
We will read in the RDS file containing the sequence table saved above. We also need to run remove(list = ls()) command, otherwise the final image we save will be huge. This way, the image only contains the sample data, seqtab, and taxtab after running removeBimeraDenovo.
And finally, we save an image for use in the analytic part of the workflow. This R data file will be needed as the input for the phyloseq portion of the workflow. See the Data Availability page for complete details on where to get this file.
All files needed to run this workflow can be downloaded from figshare.
Overview
Sequence Files & Samples
We sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot.
(ITS) Table 1 | Sample data & associated sequencing information.
Workflow
Our workflow is pretty much ripped from the DADA2 ITS Workflow (1.8) on the DADA2 website. That webpage contains thorough explanations for each step so we will not repeat most of that here. For more details, check out the post. The workflow consists of the following steps:
save an image of seqtab & taxtab for next part of workflow
1. Set Working Environment
Next, we need to setup the working environment by renaming the fastq files and defining a path for the working directory.
Rename samples
To make the parsing easier, we will eliminate the lane and well number from each sample. If you do not wish to do that you will need to adjust the code accordingly. I am positive there are more elegant ways of doing this.
CAUTION. You should check that this code works on a few backup files before proceeding.
Before we start the DADA2 workflow we need to run catadapt(Martin 2011) on all fastq.gz files to trim the primers. For this part of the study we used the primer pair ITS1f (CTTGGTCATTTAGAGGAAGTAA) (Gardes and Bruns 1993) and ITS2 (GCTGCGTTCTTCATCGATGC) (White et al. 1990) which should yield variable amplicon lengths between 100 to 400 bp. What we are looking for is potential read-through scenarios that are possible when sequencing the length variable ITS region as described in the DADA2 ITS Workflow (1.8). Please refer to this document for a complete explanation.
Next, we check the presence and orientation of these primers in the data. To do this we will create all orientations of the input primer sequences. In other words the Forward, Complement, Reverse, and Reverse Complement variations.
Now we do a little pre-filter step to eliminate ambiguous bases (Ns) because Ns make mapping of short primer sequences difficult. This step removes any reads with Ns. Again, set some files paths, this time for the filtered reads.
Sweet. Time to assess the number of times a primer (and all possible primer orientation) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads
sampnum<-1primerHits<-function(primer, fn){# Counts number of reads in which the primer is foundnhits<-vcountPattern(primer, sread(readFastq(fn)), fixed =FALSE)return(sum(nhits>0))}
What does this table mean? I wondered the same thing. Let’s break it down. Sample P1 had 50,894 sequences after the original filtering described earlier. The code searched the forward and reverse fastq files for all 8 primers. If we look at the two outputs, we see the forward primer is found in the forward reads in its forward orientation but also in some reverse reads in its reverse-complement orientation. The reverse primer is found in the reverse reads in its forward orientation but also in some forward reads in its reverse-complement orientation. This is due to read-through when the ITS region is short.
3. Remove Primers
Now we can run catadapt(Martin 2011) to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt.
cutadapt<-"/PATH/to/cutadapt"system2(cutadapt, args ="--version")# Run shell commands from R
2.8
We set paths and trim the forward primer and the reverse-complement of the reverse primer off of R1 (forward reads) and trim the reverse primer and the reverse-complement of the forward primer off of R2 (reverse reads).
This is cutadapt 2.8 with Python 3.7.6
Command line parameters: -g CTTGGTCATTTAGAGGAAGTAA -a GCATCGATGAAGAACGCAGC \
-G GCTGCGTTCTTCATCGATGC -A TTACTTCCTCTAAATGACCAAG \
-m 20 -n 2 \
-o RAW/cutadapt/P1_R1.fastq.gz \
-p RAW/cutadapt/P1_R2.fastq.gz RAW/filtN/P1_R1.fastq.gz RAW/filtN/P1_R2.fastq.gz
Note. If the code above removes all of the base pairs in a sequence, you will get downstream errors unless you set the -m flag. This flag sets the minimum length and reads shorter than this will be discarded. Without this flag, reads of length 0 will be kept and cause issues. Also, a lot of output will be written to the screen by cutadapt!
We can now count the number of primers in the sequences from the output of cutadapt.
Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads.
(ITS) Table 2 | Total reads per sample after prefiltering and primer removal (using cutadapt).
4. Quality Assessment & Filtering
We need the forward and reverse fastq file names and the sample names.
These parameters should be set based on the anticipated length of the amplicon and the read quality.
And here is a table of how the filtering step affected the number of reads in each sample.
(ITS) Table 3 | Total reads per sample after filtering.
5. Learn Error Rates
Time to assess the error rate of the data. The rest of the workflow is very similar to the 16S workflows presented previously. So I will basically stop talking.
108788735 total bases in 495784 reads from 11 samples will be used for learning the error rates.
109358094 total bases in 495784 reads from 11 samples will be used for learning the error rates.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R2.fastq.gz
Encountered 13081 unique sequences from 35999 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R2.fastq.gz
Encountered 17965 unique sequences from 48230 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R2.fastq.gz
Encountered 18610 unique sequences from 80877 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R2.fastq.gz
Encountered 19600 unique sequences from 58956 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R2.fastq.gz
Encountered 16145 unique sequences from 42563 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R2.fastq.gz
Encountered 7496 unique sequences from 17548 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R2.fastq.gz
Encountered 19933 unique sequences from 68259 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R2.fastq.gz
Encountered 199 unique sequences from 308 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R2.fastq.gz
Encountered 18495 unique sequences from 48518 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R2.fastq.gz
Encountered 17216 unique sequences from 51055 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R2.fastq.gz
Encountered 17111 unique sequences from 43471 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R2.fastq.gz
Encountered 3809 unique sequences from 9488 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R2.fastq.gz
Encountered 14140 unique sequences from 41563 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R2.fastq.gz
Encountered 14856 unique sequences from 35722 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R2.fastq.gz
Encountered 45 unique sequences from 85 total sequences read.
7. DADA2 & ASV Inference
At this point we are ready to apply the core sample inference algorithm (dada) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference.
If pool = TRUE, the algorithm will pool together all samples prior to sample inference.
If pool = FALSE, sample inference is performed on each sample individually.
If pool = "pseudo", the algorithm will perform pseudo-pooling between individually processed samples.
We tested all three methods through the full pipeline. Click the + to see the results of the test. For our final analysis, we chose pool = TRUE for this data set.
Show/hide Results of dada pooling options
Here are a few summary tables of results from the tests. Values are from the final sequence table prodcuced by each option.
(ITS) Table 4 | Total number of reads, total number of ASVs, minimum and maximum ASVs, followed by the number of singletons, doubletons, etc. for pooling options.
(ITS) Table 5 | Total number of reads and ASVs by sample for pooling options.
Forward Reads
dadaFs<-dada(derepFs, err =errF, pool =TRUE, multithread =TRUE)
15 samples were pooled: 582642 reads in 102887 unique sequences.
29795 paired-reads (in 946 unique pairings) successfully merged out of 32193 (in 1177 pairings) input.
41880 paired-reads (in 719 unique pairings) successfully merged out of 46715 (in 941 pairings) input.
64805 paired-reads (in 815 unique pairings) successfully merged out of 66771 (in 1015 pairings) input.
Duplicate sequences in merged output.
51144 paired-reads (in 1011 unique pairings) successfully merged out of 57613 (in 1325 pairings) input.
38442 paired-reads (in 767 unique pairings) successfully merged out of 41985 (in 994 pairings) input.
14188 paired-reads (in 349 unique pairings) successfully merged out of 16981 (in 452 pairings) input.
56594 paired-reads (in 1018 unique pairings) successfully merged out of 64411 (in 1306 pairings) input.
296 paired-reads (in 95 unique pairings) successfully merged out of 301 (in 100 pairings) input.
35713 paired-reads (in 619 unique pairings) successfully merged out of 41006 (in 786 pairings) input.
45851 paired-reads (in 955 unique pairings) successfully merged out of 50174 (in 1240 pairings) input.
36121 paired-reads (in 870 unique pairings) successfully merged out of 41541 (in 1135 pairings) input.
Duplicate sequences in merged output.
9172 paired-reads (in 335 unique pairings) successfully merged out of 9403 (in 402 pairings) input.
38392 paired-reads (in 704 unique pairings) successfully merged out of 40897 (in 886 pairings) input.
29155 paired-reads (in 748 unique pairings) successfully merged out of 34979 (in 990 pairings) input.
Duplicate sequences in merged output.
80 paired-reads (in 14 unique pairings) successfully merged out of 82 (in 16 pairings) input.
Chimera checking removed an additional 7 sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about 0.09865% of the merged sequence reads. Curious.
Abarenkov, Kessy, Allan Zirk, Timo Piirmann, Raivo Pöhönen, Filipp Ivanov, R. Henrik Nilsson, and Urmas Kõljalg. 2020. “UNITE General FASTA Release for Fungi.”https://dx.doi.org/10.15156/BIO/786368.
Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.”Nature Methods 13 (7): 581. https://doi.org/10.1038/nmeth.3869.
Caporaso, J Gregory, Christian L Lauber, William A Walters, Donna Berg-Lyons, Catherine A Lozupone, Peter J Turnbaugh, Noah Fierer, and Rob Knight. 2011. “Global Patterns of 16S rRNA Diversity at a Depth of Millions of Sequences Per Sample.”Proceedings of the National Academy of Sciences 108: 4516–22. https://doi.org/https://doi.org/10.1073/pnas.1000080107.
Gardes, Monique, and Thomas D Bruns. 1993. “ITS Primers with Enhanced Specificity for Basidiomycetes-Application to the Identification of Mycorrhizae and Rusts.”Molecular Ecology 2 (2): 113–18. https://doi.org/10.1111/j.1365-294X.1993.tb00005.x.
Nilsson, Rolf Henrik, Karl-Henrik Larsson, Andy F S Taylor, Johan Bengtsson-Palme, Thomas S Jeppesen, Dmitry Schigel, Peter Kennedy, et al. 2019. “The UNITE Database for Molecular Identification of Fungi: Handling Dark Taxa and Parallel Taxonomic Classifications.”Nucleic Acids Research 47 (D1): D259–64. https://doi.org/10.1093/nar/gky1022.
Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2012. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.”Nucleic Acids Research 41 (D1): D590–96. https://doi.org/10.1093/nar/gks1219.
Wang, Qiong, George M Garrity, James M Tiedje, and James R Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.”Applied and Environmental Microbiology 73 (16): 5261–67. https://doi.org/10.1128/AEM.00062-07.
White, Thomas J, Thomas Bruns, SJWT Lee, John Taylor, et al. 1990. “Amplification and Direct Sequencing of Fungal Ribosomal RNA Genes for Phylogenetics.”PCR Protocols: A Guide to Methods and Applications 18 (1): 315–22.
Source Code
---title: "1. DADA2 Workflow"description: | DADA2 workflow for processing the 16S rRNA and the ITS high temperature data sets. Workflow uses paired end reads, beginning with raw fastq files, ending with sequence and taxonomy tables.---```{r}#| message: false#| results: hide#| code-fold: true#| code-summary: "Click here for setup information"knitr::opts_chunk$set(echo =TRUE, eval =FALSE)set.seed(919191)#pacman::p_depends(ape, local = TRUE) #pacman::p_depends_reverse(ape, local = TRUE) library(dada2); packageVersion("dada2")library(ShortRead); packageVersion("ShortRead")library(ggplot2); packageVersion("ggplot2")pacman::p_load(tidyverse, gridExtra, grid, formatR, reactable, downloadthis, captioner, reactablefmtr, install =FALSE, update =FALSE)options(scipen=999)knitr::opts_current$get(c("cache","cache.path","cache.rebuild","dependson","autodep"))remove(list =ls())source(file.path("assets", "functions.R"))``````{r}#| echo: false#| eval: true#| results: hide# Create the caption(s) with captionercaption_tab <-captioner(prefix ="(16S rRNA) Table", suffix =" |", style ="b")caption_fig <-captioner(prefix ="(16S rRNA) Figure", suffix =" |", style ="b")# Create a function for referring to the tables in textref <-function(x) str_extract(x, "[^|]*") %>%trimws(which ="right", whitespace ="[ ]")``````{r}#| echo: false#| eval: truesource("assets/captions/captions_dada2.R")```# PrerequisitesIn order to run this workflow, you either need the *raw* data, available on the figshare project site (see below), or the *trimmed* data, available from the European Nucleotide Archive (ENA) under project accession number [PRJEB45074](https://www.ebi.ac.uk/ena/browser/view/PRJEB45074). See the [Data Availability](data-availability.html) page for more details.This workflow contains the code we used to process the 16S rRNA and ITS data sets using [DADA2](https://benjjneb.github.io/dada2/)[@callahan2016dada2]. Workflow construction is based on the [DADA2 Pipeline Tutorial (1.8)](https://benjjneb.github.io/dada2/tutorial_1_8.html) and the primer identification section of the [DADA2 ITS Pipeline Workflow (1.8)](https://benjjneb.github.io/dada2/ITS_workflow.html).# 16s rRNA Data## Workflow InputAll files needed to run this workflow can be downloaded from figshare. <iframe src="https://widgets.figshare.com/articles/14686665/embed?show_title=1" width="100%" height="351" allowfullscreen frameborder="0"></iframe>> NOTE: The reverse reads for 16S rRNA data set are unusable (see below), therefore this workflow deals with forward reads *only*.We will screen the forward reads for reverse primers and show the quality plots of the reverse reads at the beginning of the workflow. After that, reverse reads are not used.## Overview### Sequence Files & SamplesWe sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot.```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/ssu18_sample_seq_info.txt",header =TRUE, sep ="\t")```<small>`r caption_tab("seq_dets_ssu")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =TRUE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), raw_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/ssu18_sample_seq_info.txt dname="ssu18_sample_seq_info" label="Download table as text file" icon="table" type="primary" >}}</br>### WorkflowThe workflow consists of the following steps:| Step | Command | What we’re doing ||------|------------------------|------------------------------------------------------------|| 1 | multiple | prepare input file names & paths || 2 | multiple | Define primers (all orientations) || 3 |`cutadapt`| Remove primers || 4 |`filterAndTrim()`| assess quality & filter reads || 5 |`learnErrors()`| generate an error model for the data || 6 |`derepFastq`| dereplicate sequences || 7 |`dada()`| infer ASVs on both forward & reverse reads independently || 8 |`makeSequenceTable()`| generate a count table || 9 |`removeBimeraDenovo()`| screen for & remove chimeras || 10 || track reads through workflow || 11 |`assignTaxonomy()`| assign taxonomy & finish workflow || 12 |`save.image()`| save an image of seqtab & taxtab for next part of workflow |## 1. Set Working EnvironmentNext, we need to setup the working environment by renaming the fastq files and defining a path for the working directory.### Rename samplesTo make the parsing easier, we will eliminate the lane and well number from each sample. If you do not wish to do that you will need to adjust the code accordingly. I am *positive* there are more elegant ways of doing this.> CAUTION. If you use this for your own data, please check that this code works on a few backup files before proceeding.```{r}#| warning: falsesetwd("RAW")orig_fastq <-list.files(pattern ="*.fastq.gz")newname_fastq <-gsub("_S.*_L001", "", orig_fastq)newname_fastq <-gsub("_001", "", newname_fastq)file.rename(orig_fastq, newname_fastq)setwd("../")``````{r}path <-"RAW"head(list.files(path))``````{r}#| echo: false#| eval: truec("P1_R1.fastq.gz", "P1_R2.fastq.gz", "P1-8C_R1.fastq.gz","P1-8C_R2.fastq.gz", "P10_R1.fastq.gz", "P10_R2.fastq.gz")```Then, we generate matched lists of the forward and reverse read files. We also parse out the sample name.```{r}fnFs <-sort(list.files(path, pattern ="_R1.fastq.gz", full.names =TRUE))fnRs <-sort(list.files(path, pattern ="_R2.fastq.gz", full.names =TRUE))``````{r}#| warning: false#| fig-height: 3p1a <-plotQualityProfile(fnFs[1:15], aggregate =TRUE)p2a <-plotQualityProfile(fnRs[1:15], aggregate =TRUE)p3a <-grid.arrange(p1a, p2a, nrow =1)ggsave("plot_qscores_2018a.png", p3a, width =7, height =3)``````{r}#| echo: falsesystem("cp files/dada2/figures/ssu18_plot_qscores_a.png include/dada2/ssu18_plot_qscores_a.png")```Unfortunately, the reverse reads are unusable.```{r}#| echo: false#| eval: true#| fig-height: 4#| fig-width: 4#| warning: falseknitr::include_graphics("include/dada2/ssu18_plot_qscores_a.png")```<small>`r caption_fig("qual_scores_ssu")`</small>## 2. Define PrimersBefore we start the DADA2 workflow we need to run [catadapt](https://github.com/marcelm/cutadapt)[@martin2011cutadapt] on all `fastq.gz` files to trim the primers. For bacteria and archaea, we amplified the V4 hypervariable region of the 16S rRNA gene using the primer pair 515F (GTGCCAGCMGCCGCGGTAA) and 806R (GGACTACHVGGGTWTCTAAT) [@caporaso2011global], which should yield an amplicon length of about 253 bp.First we define the primers.```{r}FWD <-"GTGCCAGCMGCCGCGGTAA"REV <-"GGACTACHVGGGTWTCTAAT"```Next, we check the presence and orientation of these primers in the data. I started doing this for ITS data because of primer read-through but I really like the general idea of doing it just to make sure nothing funny is going of with the data. To do this, we will create all orientations of the input primer sequences. In other words the *Forward*, *Complement*, *Reverse*, and *Reverse Complement* variations.```{r}allOrients <-function(primer) {require(Biostrings) dna <-DNAString(primer) orients <-c(Forward = dna,Complement =complement(dna),Reverse =reverse(dna),RevComp =reverseComplement(dna))return(sapply(orients, toString))}FWD.orients <-allOrients(FWD)REV.orients <-allOrients(REV)FWD.orientsREV.orients`````` Forward Complement Reverse"GTGCCAGCMGCCGCGGTAA" "CACGGTCGKCGGCGCCATT" "AATGGCGCCGMCGACCGTG" RevComp"TTACCGCGGCKGCTGGCAC" Forward Complement Reverse"GGACTACHVGGGTWTCTAAT" "CCTGATGDBCCCAWAGATTA" "TAATCTWTGGGVHCATCAGG" RevComp"ATTAGAWACCCBDGTAGTCC"```Now we do a little pre-filter step to eliminate ambiguous bases (*Ns*) because *Ns* make mapping of short primer sequences difficult. This step removes any reads with *Ns*. Again, set some files paths, this time for the filtered reads.```{r}fnFs.filtN <-file.path(path, "filtN", basename(fnFs))filterAndTrim(fnFs, fnFs.filtN,maxN =0, multithread =TRUE)```Nice. Time to assess the number of times a primer (and all primer orientations) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads. Since this is 16S rRNA we do not anticipate any issues but it is worth checking anyway.```{r}#| eval: truesampnum <-2``````{r}primerHits <-function(primer, fn) {# Counts number of reads in which the primer is found nhits <-vcountPattern(primer, sread(readFastq(fn)), fixed =FALSE)return(sum(nhits >0))}```#### Forward primers```{r}rbind(FWD.ForwardReads =sapply(FWD.orients, primerHits,fn = fnFs.filtN[[sampnum]]))`````` Forward Complement Reverse RevCompFWD.ForwardReads 74707 0 0 0```#### Reverse primers```{r}rbind(REV.ForwardReads =sapply(REV.orients, primerHits,fn = fnFs.filtN[[sampnum]]))`````` Forward Complement Reverse RevCompREV.ForwardReads 0 0 0 165```As expected, forward primers predominantly in the forward reads and very little evidence of reverse primers.## 3. Remove PrimersNow we can run [catadapt](https://github.com/marcelm/cutadapt)[@martin2011cutadapt] to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt.```{r}cutadapt <-"/Users/rad/miniconda3/envs/cutadapt/bin/cutadapt"system2(cutadapt, args ="--version") # Run shell commands from R``````2.8```We set paths and trim the forward primer and the reverse-complement of the reverse primer off of R1 (forward reads) and trim the reverse primer and the reverse-complement of the forward primer off of R2 (reverse reads).```{r}path.cut <-file.path(path, "cutadapt")if(!dir.exists(path.cut)) dir.create(path.cut)fnFs.cut <-file.path(path.cut, basename(fnFs))FWD.RC <- dada2:::rc(FWD)REV.RC <- dada2:::rc(REV)R1.flags <-paste("-g", FWD, "-a", REV.RC)for(i inseq_along(fnFs)) {system2(cutadapt,args =c(R1.flags,"-m", 20, "-n", 2, "-e", 0.12,"-o", fnFs.cut[i], fnFs.filtN[i]))}``````This is cutadapt 2.8 with Python 3.7.6Command line parameters: -g GTGCCAGCMGCCGCGGTAA -a ATTAGAWACCCBDGTAGTCC \ -m 20 -n 2 -e 0.12 \ -o RAW/cutadapt/P1_R1.fastq.gz RAW/filtN/P1_R1.fastq.gz```*Note*. If the code above removes all of the base pairs in a sequence, you will get downstream errors unless you set the `-m` flag. This flag sets the minimum length and reads shorter than this will be discarded. Without this flag, reads of length 0 will be kept and cause issues. *Also, a lot of output will be written to the screen by cutadapt!*.We can now count the number of primers in the sequences from the output of cutadapt.```{r}rbind(FWD.ForwardReads =sapply(FWD.orients, primerHits, fn = fnFs.cut[[sampnum]]),REV.ForwardReads =sapply(REV.orients, primerHits, fn = fnFs.cut[[sampnum]]))`````` Forward Complement Reverse RevCompFWD.ForwardReads 0 0 0 0REV.ForwardReads 0 0 0 0```Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads.```{r}#| echo: falsetemp1 <-read.table("files/dada2/tables/ssu18_read_changes.txt", header =TRUE)temp1[, 4:6] <-list(NULL)write.table(temp1, "files/dada2/tables/ssu18_cut.txt", sep ="\t", quote =FALSE)``````{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/ssu18_cut.txt",header =TRUE, sep ="\t")seq_table$per_reads_kept <-round(seq_table$cutadapt_reads/seq_table$raw_reads, digits =3)write.table(seq_table, "files/dada2/tables/ssu18_cutadapt.txt", row.names =FALSE, sep ="\t", quote =FALSE)```<small>`r caption_tab("seq_cutadapt_ssu")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), raw_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))),pre_filt_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))),cutadapt_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/ssu18_cutadapt.txt dname="ssu18_cutadapt" label="Download table as text file" icon="table" type="primary" >}} ## 4. Quality Assessment & FilteringWe need the forward and reverse fastq file names plus the sample names.```{r}cutFs <-sort(list.files(path.cut, pattern ="_R1.fastq.gz",full.names =TRUE))get.sample.name <-function(fname) strsplit(basename(fname), "_")[[1]][1]sample.names <-unname(sapply(cutFs, get.sample.name))head(sample.names)``````{r}#| eval: truec("P1", "P1-8C", "P10", "P2", "P3", "P3-8C")```First let's look at the quality of our reads. The numbers in brackets specify which samples to view. Here we are looking at an aggregate plot of all data (except the negative control).### Quality plots```{r}#| warning: falsep1 <-plotQualityProfile(cutFs[1:15], aggregate =TRUE)ggsave("ssu18_plot_qscores.png", p1, width =7, height =3)ggsave("ssu18_plot_qscores.png", p1)``````{r}#| echo: falsesystem("cp files/dada2/figures/ssu18_plot_qscores_b.png include/dada2/ssu18_plot_qscores_b.png")``````{r}#| echo: false#| eval: true#| warning: false#| out-width: 55%#| fig-align: centerknitr::include_graphics("include/dada2/ssu18_plot_qscores_b.png")```<small>`r caption_fig("qual_scores_after_ssu")`</small>The forward reads look good.### FilteringFirst, we again make some path variables and setup a new directory of filtered reads.```{r}filtFs <-file.path(path.cut, "filtered", basename(cutFs))```And then we trim the reads.```{r}out <-filterAndTrim(cutFs, filtFs,maxN =0, maxEE =2,truncQ =2, rm.phix =TRUE,compress =TRUE, multithread =20)out```:::{.callout-note}These parameters should be set based on the anticipated length of the amplicon and the read quality.:::And here is a table of how the filtering step affected the number of reads in each sample. As you can see, there are a few samples that started with a low read count to begin with---we will likely remove those samples at some point.```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/ssu18_filter.txt",header =TRUE, sep ="\t")seq_table$per_reads_kept <-round(seq_table$reads_out/seq_table$reads_in, digits =3)write.table(seq_table, "files/dada2/tables/ssu18_filter.txt", row.names =FALSE, sep ="\t", quote =FALSE)```<small>`r caption_tab("filter_ssu")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), reads_in =colDef(footer =function(values) sprintf("%.0f", sum(values))),reads_out =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/ssu18_filter.txt dname="ssu18_filter" label="Download table as text file" icon="table" type="primary" >}} ## 5. Learn Error RatesNow it is time to assess the error rate of the data. The DADA2 algorithm uses a parametric error model. Every amplicon data set has a different set of error rates and the `learnErrors` method learns this error model *from the data*. It does this by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. The algorithm begins with an initial guess, for which the maximum possible error rates in the data are used.#### Forward Reads```{r}errF <-learnErrors(filtFs, multithread =FALSE)``````103002523 total bases in 443994 reads from 6 samples will be used for learning the error rates.``````{r}#| warning: falseplotErrors(errF, nominalQ=TRUE)``````{r}#| echo: falsesystem("cp files/dada2/figures/ssu18_plot_errorF.png include/dada2/ssu18_plot_errorF.png")``````{r}#| echo: false#| eval: trueknitr::include_graphics("include/dada2/ssu18_plot_errorF.png")```<small>`r caption_fig("error_F_ssu")`</small>The error rates for each possible transition (A to C, A to G, etc.) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected.## 6. Dereplicate ReadsNow we can use `derepFastq` to identify the unique sequences in the forward and reverse fastq files.#### Forward Reads```{r}#| echo: false#| eval: true# TO run and cache the derepFastq you must set# `cache.lazy=FALSE`# For both code blocks``````{r}derepFs <-derepFastq(filtFs)names(derepFs) <- sample.names```<details><summary>Detailed results of `derep` forward reads</summary>```Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R1.fastq.gzEncountered 39950 unique sequences from 79804 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R1.fastq.gzEncountered 31163 unique sequences from 72760 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R1.fastq.gzEncountered 35274 unique sequences from 62598 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R1.fastq.gzEncountered 42119 unique sequences from 81240 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R1.fastq.gzEncountered 40336 unique sequences from 86732 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R1.fastq.gzEncountered 31104 unique sequences from 60860 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R1.fastq.gzEncountered 35169 unique sequences from 63037 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R1.fastq.gzEncountered 27752 unique sequences from 54127 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R1.fastq.gzEncountered 49045 unique sequences from 109421 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R1.fastq.gzEncountered 46853 unique sequences from 94740 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R1.fastq.gzEncountered 35363 unique sequences from 71678 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R1.fastq.gzEncountered 35567 unique sequences from 73639 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R1.fastq.gzEncountered 38518 unique sequences from 72263 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R1.fastq.gzEncountered 43426 unique sequences from 89315 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R1.fastq.gzEncountered 13673 unique sequences from 33000 total sequences read.```</details><br/>## 7. DADA2 & ASV InferenceAt this point we are ready to apply the core sample inference algorithm (`dada`) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference. If `pool = TRUE`, the algorithm will pool together all samples prior to sample inference. If `pool = FALSE`, sample inference is performed on each sample individually. If `pool = "pseudo"`, the algorithm will perform pseudo-pooling between individually processed samples.We tested all three methods through the full pipeline. Click the `+` to see the results of the test. For our final analysis, we chose `pool = FALSE` for this data set. <details><summary>Show/hide Results of `dada` pooling options</summary><br/>Here are summary tables of results from the tests. Values are from the final sequence table produced by each option. ```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/ssu18_dada_compare.txt",header =TRUE, sep ="\t", row.names =1)write.table(seq_table, "files/dada2/tables/ssu18_pooling_summary.txt", row.names =FALSE, sep ="\t", quote =FALSE)```<small>`r caption_tab("pooling_summary_ssu")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/ssu18_pooling_summary.txt dname="ssu18_pooling_summary" label="Download table as text file" icon="table" type="primary" >}} ```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/ssu18_dada_compare_by_samp.txt",header =TRUE, sep ="\t")write.table(seq_table, "files/dada2/tables/ssu18_pooling_results.txt", row.names =FALSE, sep ="\t", quote =FALSE)```<small>`r caption_tab("pooling_ssu")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", minWidth =150, filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/ssu18_pooling_results.txt dname="ssu18_pooling_results" label="Download table as text file" icon="table" type="primary" >}} </details>```{r}dadaFs <-dada(derepFs, err = errF, multithread =TRUE, pool =FALSE)```<details><summary>Detailed results of `dada` on forward reads</summary>```Sample 1 - 79804 reads in 39950 unique sequences.Sample 2 - 72760 reads in 31163 unique sequences.Sample 3 - 62598 reads in 35274 unique sequences.Sample 4 - 81240 reads in 42119 unique sequences.Sample 5 - 86732 reads in 40336 unique sequences.Sample 6 - 60860 reads in 31104 unique sequences.Sample 7 - 63037 reads in 35169 unique sequences.Sample 8 - 54127 reads in 27752 unique sequences.Sample 9 - 109421 reads in 49045 unique sequences.Sample 10 - 94740 reads in 46853 unique sequences.Sample 11 - 71678 reads in 35363 unique sequences.Sample 12 - 73639 reads in 35567 unique sequences.Sample 13 - 72263 reads in 38518 unique sequences.Sample 14 - 89315 reads in 43426 unique sequences.Sample 15 - 33000 reads in 13673 unique sequences.```</details>As an example, we can inspect the returned `dada-class` object for the forward and reverse reads from the sample #`r sampnum`:```{r}dadaFs[[sampnum]]``````dada-class: object describing DADA2 denoising results1759 sequence variants were inferred from 31163 input unique sequences.Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16```This output tells us how many true sequence variants the DADA2 algorithm inferred from the unique sequences, in this case the sample 2.## 8. Construct Sequence TableNow we construct an amplicon sequence variant (ASV) table.```{r}seqtab <-makeSequenceTable(dadaFs)dim(seqtab)``````{r}#| echo: false#| eval: trueseqtab <-c(15, 21685)seqtab``````{r}table(nchar(getSequences(seqtab)))`````` 22 23 24 25 27 41 66 73 120 149 185 190 193 2 1 1 2 1 1 1 1 1 1 1 9 1 198 200 201 202 208 209 211 215 217 220 221 222 223 1 1 1 1 1 1 1 1 2 4 14 21 8 224 225 227 228 229 230 231 232 233 234 238 251 3 4 6 4 5 9 20 21342 190 12 2 8 ```The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants. We have `r (seqtab)[2]` sequence variants but also a range of sequence lengths. Since many of these sequence variants are singletons or doubletons, we can just select a range that corresponds to the expected amplicon length and eliminate the spurious reads.```{r}#| echo: falseread_length <-data.frame(nchar(getSequences(seqtab)))colnames(read_length) <-"length"plot_a <-qplot(length, data = read_length,geom ="histogram", binwidth =1,xlab ="read length",ylab ="total variants",xlim =c(200, 275))ggsave("read_length_before.png", plot_a, width =7, height =3)``````{r}#| echo: falsesystem("cp files/dada2/figures/ssu18_read_length_before.png include/dada2/ssu18_read_length_before.png")``````{r}#| echo: false#| eval: trueknitr::include_graphics("include/dada2/ssu18_read_length_before.png")```<small>`r caption_fig("read_length_ssu")`</small>```{r}seqtab.2<- seqtab[,nchar(colnames(seqtab)) %in%seq(230,235)]dim(seqtab.2)```:::{.callout-note}The values you select should be based on the sequence table generated above.:::```{r}#| echo: false#| eval: trueseqtab2 <-c(15, 21573)seqtab2``````{r}table(nchar(getSequences(seqtab.2)))`````` 230 231 232 233 234 9 20 21342 190 12 ```After removing the extreme length variants, we have `r (seqtab2)[2]`, a reduction of `r (seqtab)[2] - (seqtab2)[2]` sequence variants.```{r}#| echo: falsesaveRDS(seqtab.2, "ssu18_seqtab.rds")```## 9. Remove ChimerasEven though the `dada` method corrects substitution and indel errors, chimeric sequences remain. According to the DADA2 documentation, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be *exactly reconstructed* by combining a left-segment and a right-segment from two more abundant *parent* sequences.```{r}seqtab.nochim <-removeBimeraDenovo(seqtab.2,method ="consensus",multithread =20,verbose =TRUE)dim(seqtab.nochim)``````Identified 1241 bimeras out of 21573 input sequences.``````{r}#| echo: false#| eval: trueseqtab3 <-c(15, 20332)seqtab3``````{r}sum(seqtab.nochim)/sum(seqtab.2)``````{r}#| echo: false#| eval: truechim_rem <-0.95965950.9596595```Chimera checking removed an additional `r (seqtab2)[2] - (seqtab3)[2]` sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about `r 100*(1-chim_rem)`% of the merged sequence reads. Not bad.The last thing we want to do is write the sequence table to an `RDS` file.```{r}saveRDS(seqtab.nochim, "ssu18_seqtab.nochim.rds")```## 10. Track Reads through WorkflowAt this point we can look at the number of reads that made it through each step of the workflow for every sample.```{r}getN <-function(x) sum(getUniques(x))track <-cbind(out,sapply(dadaFs, getN),rowSums(seqtab.nochim))colnames(track) <-c("input", "filtered", "denoisedF", "nonchim")rownames(track) <- sample.names``````{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/ssu18_read_changes.txt",header =TRUE, sep ="\t")```<small>`r caption_tab("track_ssu")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =TRUE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), raw =colDef(footer =function(values) sprintf("%.0f", sum(values))),pre_filt =colDef(footer =function(values) sprintf("%.0f", sum(values))),pre_filt =colDef(footer =function(values) sprintf("%.0f", sum(values))),cut =colDef(footer =function(values) sprintf("%.0f", sum(values))),filtered =colDef(footer =function(values) sprintf("%.0f", sum(values))),denoisedF =colDef(footer =function(values) sprintf("%.0f", sum(values))),nonchim =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/ssu18_read_changes.txt dname="ssu18_track_read_changes" label="Download table as text file" icon="table" type="primary" >}} And we can save a copy of the table for later use.```{r}write.table(track, "ssu18_read_changes.txt",sep ="\t", quote =FALSE, col.names=NA)``````{r}#| echo: falsesave.image("pre_tax_16s_2018.rdata")```## 11. Assign TaxonomyThe `assignTaxonomy` command implements the naive Bayesian classifier, so for reproducible results you need to set a random number seed (see issue [#538](https://github.com/benjjneb/dada2/issues/538)). We did this at the beginning of the workflow. For taxonomic assignment, we are using the Silva version 138 [@quast2012silva]. The developers of DADA2 maintain a [formatted version of the database](https://benjjneb.github.io/dada2/training.html).We will read in the `RDS` file containing the sequence table saved above. We also need to run `remove(list = ls())` command, otherwise the final image we save will be huge. This way, the image only contains the sample data, seqtab, and taxtab *after* running `removeBimeraDenovo`.```{r}remove(list =ls())seqtab <-readRDS("ssu18_seqtab.nochim.rds")```And then native Bayesian [@wang2007naive] classifier against the Silva database.```{r}tax_silva <-assignTaxonomy( seqtab, "silva_nr_v138_train_set.fa.gz", multithread =TRUE)```## 12. Save ImageAnd finally, we save an image for use in the analytic part of the workflow. This R data file will be needed as the input for the phyloseq portion of the workflow. See the [Data Availability](data-availability.html) page for complete details on where to get this file.```{r}save.image("ssu18_dada2_wf.rdata")``````{r}#| echo: false# Classify with IdTAX dbremove(list =ls())seqtab <-readRDS("ssu18_seqtab.nochim.rds")tax_silva <-assignTaxonomy(seqtab, "silva_nr_v138_train_set.fa.gz", multithread =TRUE)dna <-DNAStringSet(getSequences(seqtab)) # Create a DNAStringSet from the ASVsload("SILVA_SSU_r138_2018.RData")ids <-IdTaxa(dna, trainingSet, strand ="top", processors =NULL, verbose =FALSE) # use all processorsranks <-c("domain", "phylum", "class", "order", "family", "genus") # ranks of interest# Convert the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomytax_id <-t(sapply(ids, function(x) { m <-match(ranks, x$rank) taxa <- x$taxon[m] taxa[startsWith(taxa, "unclassified_")] <-NA taxa}))colnames(tax_id) <- ranks; rownames(tax_id) <-getSequences(seqtab)save.image("ssu18_data_prep_wf_ALT.rdata")saveRDS(seqtab, "ssu18_seqtab.nochim.2.rds")saveRDS(tax_silva, "ssu18_tax_silva.rds")saveRDS(tax_id, "ssu18_tax_id.rds")```## R Session Information & CodeThis workflow was run on the [Smithsonian High Performance Cluster (SI/HPC)](https://doi.org/10.25572/SIHPC), Smithsonian Institution. Below are the specific packages and versions used in this workflow using both `sessionInfo()` and `devtools::session_info()`. Click the arrow to see the details.<details><summary>Click here for Session Information</summary>```sessionInfo()R version 4.0.0 (2020-04-24)Platform: x86_64-conda_cos6-linux-gnu (64-bit)Running under: CentOS Linux 7 (Core)Matrix products: defaultBLAS/LAPACK: /home/scottjj/miniconda3/envs/R-4/lib/libopenblasp-r0.3.9.solocale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=Cattached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods baseother attached packages: [1] gridExtra_2.3 ggplot2_3.3.0 [3] ShortRead_1.46.0 GenomicAlignments_1.24.0 [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0 [7] matrixStats_0.56.0 Biobase_2.48.0 [9] Rsamtools_2.4.0 GenomicRanges_1.40.0[11] GenomeInfoDb_1.24.0 Biostrings_2.56.0[13] XVector_0.28.0 IRanges_2.22.1[15] S4Vectors_0.26.0 BiocParallel_1.22.0[17] BiocGenerics_0.34.0 dada2_1.16.0[19] Rcpp_1.0.4.6loaded via a namespace (and not attached): [1] plyr_1.8.6 RColorBrewer_1.1-2 pillar_1.4.4 [4] compiler_4.0.0 bitops_1.0-6 tools_4.0.0 [7] zlibbioc_1.34.0 lifecycle_0.2.0 tibble_3.0.1[10] gtable_0.3.0 lattice_0.20-41 png_0.1-7[13] pkgconfig_2.0.3 rlang_0.4.6 Matrix_1.2-18[16] GenomeInfoDbData_1.2.3 withr_2.2.0 stringr_1.4.0[19] hwriter_1.3.2 vctrs_0.3.0 glue_1.4.1[22] R6_2.4.1 jpeg_0.1-8.1 latticeExtra_0.6-29[25] reshape2_1.4.4 magrittr_1.5 scales_1.1.1[28] ellipsis_0.3.1 colorspace_1.4-1 stringi_1.4.6[31] RCurl_1.98-1.2 RcppParallel_5.0.1 munsell_0.5.0[34] crayon_1.3.4devtools::session_info()─ Session info ─────────────────────────────────────────────────────────────── setting value version R version 4.0.0 (2020-04-24) os CentOS Linux 7 (Core) system x86_64, linux-gnu ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2020-06-17─ Packages ─────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0) backports 1.1.7 2020-05-13 [1] CRAN (R 4.0.0) Biobase * 2.48.0 2020-04-27 [1] Bioconductor BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor BiocParallel * 1.22.0 2020-04-27 [1] Bioconductor Biostrings * 2.56.0 2020-04-27 [1] Bioconductor bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.0) callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0) cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0) colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0) crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0) dada2 * 1.16.0 2020-04-27 [1] Bioconductor DelayedArray * 0.14.0 2020-04-27 [1] Bioconductor desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0) devtools 2.3.0 2020-04-10 [1] CRAN (R 4.0.0) digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0) ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0) fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0) fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0) GenomeInfoDb * 1.24.0 2020-04-27 [1] Bioconductor GenomeInfoDbData 1.2.3 2020-05-22 [1] Bioconductor GenomicAlignments * 1.24.0 2020-04-27 [1] Bioconductor GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 4.0.0) glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.0) gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.0.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0) hwriter 1.3.2 2014-09-10 [1] CRAN (R 4.0.0) IRanges * 2.22.1 2020-04-28 [1] Bioconductor jpeg 0.1-8.1 2019-10-24 [1] CRAN (R 4.0.0) lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.0) latticeExtra 0.6-29 2019-12-19 [1] CRAN (R 4.0.0) lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0) magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0) Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.0) matrixStats * 0.56.0 2020-03-13 [1] CRAN (R 4.0.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0) pillar 1.4.4 2020-05-05 [1] CRAN (R 4.0.0) pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 4.0.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 4.0.0) plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0) png 0.1-7 2013-12-03 [1] CRAN (R 4.0.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0) processx 3.4.2 2020-02-09 [1] CRAN (R 4.0.0) ps 1.3.3 2020-05-08 [1] CRAN (R 4.0.0) R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0) RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.0) Rcpp * 1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0) RcppParallel 5.0.1 2020-05-06 [1] CRAN (R 4.0.0) RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0) remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0) reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.0) rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0) Rsamtools * 2.4.0 2020-04-27 [1] Bioconductor S4Vectors * 0.26.0 2020-04-27 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0) ShortRead * 1.46.0 2020-04-27 [1] Bioconductor stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0) stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0) SummarizedExperiment * 1.18.1 2020-04-30 [1] Bioconductor testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0) tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0) usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.0) vctrs 0.3.0 2020-05-11 [1] CRAN (R 4.0.0) withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0) XVector * 0.28.0 2020-04-27 [1] Bioconductor zlibbioc 1.34.0 2020-04-27 [1] Bioconductor```</details><br/># ITS Data```{r}#| include: falseremove(list =ls())``````{r}#| echo: false#| eval: true#| results: hide# Create the caption(s) with captionercaption_tab <-captioner(prefix ="(ITS) Table", suffix =" |", style ="b")caption_fig <-captioner(prefix ="(ITS) Figure", suffix =" |", style ="b")# Create a function for referring to the tables in textref <-function(x) str_extract(x, "[^|]*") %>%trimws(which ="right", whitespace ="[ ]")``````{r}#| echo: false#| eval: true#| results: hide## Table captionscaption_tab("seq_dets_its", "Sample data & associated sequencing information.")caption_tab("seq_cutadapt_its", "Total reads per sample after prefiltering and primer removal (using cutadapt).")caption_tab("filter_its", "Total reads per sample after filtering.")caption_tab("pooling_summary_its", "Total number of reads, total number of ASVs, minimum and maximum ASVs, followed by the number of singletons, doubletons, etc. for pooling options.")caption_tab("pooling_its", "Total number of reads and ASVs by sample for pooling options.")caption_tab("track_its", "Tracking reads changes at each step of the DADA2 workflow.")#<small>`r caption_tab("qual_scores_after_its")`</small>``````{r}#| echo: false#| eval: true#| results: hide## Figure captionscaption_fig("qual_scores_its", "Aggregated quality score plots for forward (left) & reverse (right) reads before filtering.")caption_fig("qual_scores_after_its", "Aggregated quality score plot for forward reads after prefilter & primer removal.")caption_fig("error_F_its", "Forward reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.")caption_fig("error_R_its", "Reverse reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.")caption_fig("read_length_its", "Distribution of read length by total ASVs before removing extreme length variants.")#<small>`r caption_fig("read_length_its")`</small>```## Workflow InputAll files needed to run this workflow can be downloaded from figshare. <iframe src="https://widgets.figshare.com/articles/14686755/embed?show_title=1" width="568" height="351" allowfullscreen frameborder="0"></iframe>## Overview### Sequence Files & SamplesWe sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot.```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/its18_sample_seq_info.txt",header =TRUE, sep ="\t")```<small>`r caption_tab("seq_dets_its")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =TRUE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), raw_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/its18_sample_seq_info.txt dname="its18_sample_seq_info" label="Download table as text file" icon="table" type="primary" >}} ### WorkflowOur workflow is pretty much ripped from the [DADA2 ITS Workflow (1.8)](https://benjjneb.github.io/dada2/ITS_workflow.html) on the DADA2 website. That webpage contains thorough explanations for each step so we will not repeat most of that here. For more details, check out the post. The workflow consists of the following steps:| Step | Command | What we’re doing ||------|------------------------|------------------------------------------------------------|| 1 | multiple | prepare input file names & paths || 2 | multiple | Define primers (all orientations) || 3 |`cutadapt`| Remove primers || 4 |`filterAndTrim()`| assess quality & filter reads || 5 |`learnErrors()`| generate an error model for the data || 6 |`derepFastq`| dereplicate sequences || 7 |`dada()`| infer ASVs on both forward & reverse reads independently || 8 |`mergePairs()`| merge forward & reverse reads to further refine ASVs || 9 |`makeSequenceTable()`| generate a count table || 10 |`removeBimeraDenovo()`| screen for & remove chimeras || 11 || track reads through workflow || 12 |`assignTaxonomy()`| assign taxonomy & finish workflow || 13 |`save.image()`| save an image of seqtab & taxtab for next part of workflow |## 1. Set Working EnvironmentNext, we need to setup the working environment by renaming the fastq files and defining a path for the working directory.#### Rename samplesTo make the parsing easier, we will eliminate the lane and well number from each sample. If you do not wish to do that you will need to adjust the code accordingly. I am *positive* there are more elegant ways of doing this.> CAUTION. You should check that this code works on a few backup files before proceeding.```{r}#| warning: falsesetwd("RAW")orig_fastq <-list.files(pattern ="*.fastq.gz")newname_fastq <-gsub("_S.*_L001", "", orig_fastq)newname_fastq <-gsub("_001", "", newname_fastq)file.rename(orig_fastq, newname_fastq)setwd("../")``````{r}path <-"RAW"head(list.files(path))``````{r}#| echo: false#| eval: truec("P1_R1.fastq.gz", "P1_R2.fastq.gz", "P1-8C_R1.fastq.gz","P1-8C_R2.fastq.gz", "P10_R1.fastq.gz", "P10_R2.fastq.gz")```Then, we generate matched lists of the forward and reverse read files. We also parse out the sample name.```{r}fnFs <-sort(list.files(path, pattern ="_R1.fastq.gz", full.names =TRUE))fnRs <-sort(list.files(path, pattern ="_R2.fastq.gz", full.names =TRUE))```## 2. Define PrimersBefore we start the DADA2 workflow we need to run [catadapt](https://github.com/marcelm/cutadapt)[@martin2011cutadapt] on all `fastq.gz` files to trim the primers. For this part of the study we used the primer pair ITS1f (CTTGGTCATTTAGAGGAAGTAA) [@gardes1993its] and ITS2 (GCTGCGTTCTTCATCGATGC) [@white1990amplification] which should yield variable amplicon lengths between 100 to 400 bp. What we are looking for is potential read-through scenarios that are possible when sequencing the length variable ITS region as described in the [DADA2 ITS Workflow (1.8)](https://benjjneb.github.io/dada2/ITS_workflow.html). Please refer to this document for a complete explanation.First we define the primers.```{r}FWD <-"CTTGGTCATTTAGAGGAAGTAA"REV <-"GCTGCGTTCTTCATCGATGC"```Next, we check the presence and orientation of these primers in the data. To do this we will create all orientations of the input primer sequences. In other words the *Forward*, *Complement*, *Reverse*, and *Reverse Complement* variations.```{r}allOrients <-function(primer) {require(Biostrings) dna <-DNAString(primer) orients <-c(Forward = dna,Complement =complement(dna),Reverse =reverse(dna),RevComp =reverseComplement(dna))return(sapply(orients, toString))}FWD.orients <-allOrients(FWD)REV.orients <-allOrients(REV)FWD.orientsREV.orients`````` Forward Complement Reverse"CTTGGTCATTTAGAGGAAGTAA" "GAACCAGTAAATCTCCTTCATT" "AATGAAGGAGATTTACTGGTTC" RevComp"TTACTTCCTCTAAATGACCAAG" Forward Complement Reverse"GCTGCGTTCTTCATCGATGC" "CGACGCAAGAAGTAGCTACG" "CGTAGCTACTTCTTGCGTCG" RevComp"GCATCGATGAAGAACGCAGC"```Now we do a little pre-filter step to eliminate ambiguous bases (*Ns*) because *Ns* make mapping of short primer sequences difficult. This step removes any reads with *Ns*. Again, set some files paths, this time for the filtered reads.```{r}fnFs.filtN <-file.path(path, "filtN", basename(fnFs))fnRs.filtN <-file.path(path, "filtN", basename(fnRs))filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN,maxN =0, multithread =TRUE)```Sweet. Time to assess the number of times a primer (and all possible primer orientation) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads```{r}sampnum <-1primerHits <-function(primer, fn) {# Counts number of reads in which the primer is found nhits <-vcountPattern(primer, sread(readFastq(fn)), fixed =FALSE)return(sum(nhits >0))}```#### Forward primers```{r}rbind(FWD.ForwardReads =sapply(FWD.orients, primerHits,fn = fnFs.filtN[[sampnum]]),FWD.ReverseReads =sapply(FWD.orients, primerHits,fn = fnRs.filtN[[sampnum]]))`````` Forward Complement Reverse RevCompFWD.ForwardReads 48467 0 0 0FWD.ReverseReads 0 0 0 5078```#### Reverse primers```{r}rbind(REV.ForwardReads =sapply(REV.orients, primerHits,fn = fnFs.filtN[[sampnum]]),REV.ReverseReads =sapply(REV.orients, primerHits,fn = fnRs.filtN[[sampnum]]))`````` Forward Complement Reverse RevCompREV.ForwardReads 0 0 0 6237REV.ReverseReads 43902 0 0 0``````{r}#| echo: falsefnFsXX <-sort(list.files(path, pattern ="_R1.fastq.gz", full.names =FALSE))qsamp <- fnFsXX[sampnum]qsamp <-sapply(strsplit(qsamp, "_"), `[`, 1)### add this for inline calc `r countLines(fnFs.filtN[[sampnum]])/4````What does this table mean? I wondered the same thing. Let's break it down. Sample P1 had 50,894 sequences after the original filtering described earlier. The code searched the forward and reverse fastq files for all 8 primers. If we look at the two outputs, we see the forward primer is found in the forward reads in its forward orientation but also in some reverse reads in its reverse-complement orientation. The reverse primer is found in the reverse reads in its forward orientation but also in some forward reads in its reverse-complement orientation. This is due to read-through when the ITS region is short.## 3. Remove PrimersNow we can run [catadapt](https://github.com/marcelm/cutadapt)[@martin2011cutadapt] to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt.```{r}cutadapt <-"/PATH/to/cutadapt"system2(cutadapt, args ="--version") # Run shell commands from R``````2.8```We set paths and trim the forward primer and the reverse-complement of the reverse primer off of R1 (forward reads) and trim the reverse primer and the reverse-complement of the forward primer off of R2 (reverse reads).```{r}path.cut <-file.path(path, "cutadapt")if(!dir.exists(path.cut)) dir.create(path.cut)fnFs.cut <-file.path(path.cut, basename(fnFs))fnRs.cut <-file.path(path.cut, basename(fnRs))FWD.RC <- dada2:::rc(FWD)REV.RC <- dada2:::rc(REV)R1.flags <-paste("-g", FWD, "-a", REV.RC)R2.flags <-paste("-G", REV, "-A", FWD.RC)for(i inseq_along(fnFs)) {system2(cutadapt,args =c(R1.flags, R2.flags,"-m", 20, "-n", 2,"-o", fnFs.cut[i],"-p", fnRs.cut[i], fnFs.filtN[i], fnRs.filtN[i]))}``````This is cutadapt 2.8 with Python 3.7.6Command line parameters: -g CTTGGTCATTTAGAGGAAGTAA -a GCATCGATGAAGAACGCAGC \ -G GCTGCGTTCTTCATCGATGC -A TTACTTCCTCTAAATGACCAAG \ -m 20 -n 2 \ -o RAW/cutadapt/P1_R1.fastq.gz \ -p RAW/cutadapt/P1_R2.fastq.gz RAW/filtN/P1_R1.fastq.gz RAW/filtN/P1_R2.fastq.gz```*Note*. If the code above removes all of the base pairs in a sequence, you will get downstream errors unless you set the `-m` flag. This flag sets the minimum length and reads shorter than this will be discarded. Without this flag, reads of length 0 will be kept and cause issues. *Also, a lot of output will be written to the screen by cutadapt!*We can now count the number of primers in the sequences from the output of cutadapt.```{r}rbind(FWD.ForwardReads =sapply(FWD.orients, primerHits, fn = fnFs.cut[[sampnum]]),FWD.ReverseReads =sapply(FWD.orients, primerHits, fn = fnRs.cut[[sampnum]]),REV.ForwardReads =sapply(REV.orients, primerHits, fn = fnFs.cut[[sampnum]]),REV.ReverseReads =sapply(REV.orients, primerHits, fn = fnRs.cut[[sampnum]]))`````` Forward Complement Reverse RevCompFWD.ForwardReads 0 0 0 0FWD.ReverseReads 0 0 0 0REV.ForwardReads 0 0 0 0REV.ReverseReads 0 0 0 0```Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads.```{r}#| echo: falsetemp1 <-read.table("files/dada2/tables/its18_read_changes.txt", header =TRUE)temp1[, 4:8] <-list(NULL)write.table(temp1, "files/dada2/tables/its18_cut.txt", sep ="\t", quote =FALSE)``````{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/its18_cut.txt",header =TRUE, sep ="\t")seq_table$per_reads_kept <-round(seq_table$cutadapt_reads/seq_table$raw_reads, digits =3)write.table(seq_table, "files/dada2/tables/its18_cutadapt.txt", row.names =FALSE, sep ="\t", quote =FALSE)```<small>`r caption_tab("seq_cutadapt_its")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), raw_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))),pre_filt_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))),cutadapt_reads =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/its18_cutadapt.txt dname="its18_cutadapt" label="Download table as text file" icon="table" type="primary" >}}## 4. Quality Assessment & FilteringWe need the forward and reverse fastq file names and the sample names.```{r}cutFs <-sort(list.files(path.cut, pattern ="_R1.fastq.gz",full.names =TRUE))cutRs <-sort(list.files(path.cut, pattern ="_R2.fastq.gz",full.names =TRUE))get.sample.name <-function(fname) strsplit(basename(fname), "_")[[1]][1]sample.names <-unname(sapply(cutFs, get.sample.name))head(sample.names)``````{r}#| echo: false#| eval: truec("P1", "P1-8C", "P10", "P2", "P3", "P3-8C")```### Quality plotsThen we inspect read quality.```{r}p1 <-plotQualityProfile(cutFs[1:15], aggregate =TRUE)p2 <-plotQualityProfile(cutRs[1:15], aggregate =TRUE)p3 <-grid.arrange(p1, p2, nrow =1)ggsave("plot_qscores_2018.png", p3, width =7, height =3)``````{r}#| echo: falsesystem("cp files/dada2/figures/its18_plot_qscores.png include/dada2/its18_plot_qscores.png")``````{r}#| echo: false#| eval: true#| fig-height: 3#| warning: falseknitr::include_graphics("include/dada2/its18_plot_qscores.png")```<small>`r caption_fig("qual_scores_its")`</small>### Filtering```{r}filtFs <-file.path(path.cut, "filtered", basename(cutFs))filtRs <-file.path(path.cut, "filtered", basename(cutRs))```We use a `minLen` value to get rid of very short length sequences.```{r}out <-filterAndTrim(cutFs, filtFs, cutRs, filtRs,maxN =0, maxEE =c(2, 2), truncQ =2,minLen =50, rm.phix =TRUE,compress =TRUE, multithread =TRUE)out```:::{.callout-note}These parameters should be set based on the anticipated length of the amplicon and the read quality.:::And here is a table of how the filtering step affected the number of reads in each sample.```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/its18_filter.txt",header =TRUE, sep ="\t")seq_table$per_reads_kept <-round(seq_table$reads_out/seq_table$reads_in, digits =3)write.table(seq_table, "files/dada2/tables/its18_filter.txt", row.names =FALSE, sep ="\t", quote =FALSE)```<small>`r caption_tab("filter_its")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), reads_in =colDef(footer =function(values) sprintf("%.0f", sum(values))),reads_out =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/its18_filter.txt dname="its18_filter" label="Download table as text file" icon="table" type="primary" >}}## 5. Learn Error RatesTime to assess the error rate of the data. The rest of the workflow is very similar to the 16S workflows presented previously. So I will basically stop talking.#### Forward Reads```{r}errF <-learnErrors(filtFs, multithread =TRUE)``````108788735 total bases in 495784 reads from 11 samples will be used for learning the error rates.109358094 total bases in 495784 reads from 11 samples will be used for learning the error rates.``````{r}plotErrors(errF, nominalQ =TRUE)``````{r}#| echo: falsep3 <-plotErrors(errF, nominalQ =TRUE)ggsave("plot_errorF_1.png", p3, width =7, height =5)ggsave("plot_errorF_2.png", p3)``````{r}#| echo: falsesystem("cp files/dada2/figures/its18_plot_errorF.png include/dada2/its18_plot_errorF.png")``````{r}#| echo: false#| eval: trueknitr::include_graphics("include/dada2/its18_plot_errorF.png")```<small>`r caption_fig("error_F_its")`</small>#### Reverse Reads```{r}errR <-learnErrors(filtRs, multithread =TRUE)``````109358094 total bases in 495784 reads from 11 samples will be used for learning the error rates.``````{r}plotErrors(errR, nominalQ =TRUE)``````{r}#| echo: falsep4 <-plotErrors(errR, nominalQ =TRUE)ggsave("plot_errorR_1.png", p4, width =7, height =5)ggsave("plot_errorR_2.png", p4)``````{r}#| echo: falsesystem("cp files/dada2/figures/its18_plot_errorR.png include/dada2/its18_plot_errorR.png")``````{r}#| echo: false#| eval: trueknitr::include_graphics("include/dada2/its18_plot_errorR.png")```<small>`r caption_fig("error_R_its")`</small>## 6. Dereplicate Reads#### Forward Reads```{r}derepFs <-derepFastq(filtFs, verbose =TRUE)names(derepFs) <- sample.names```<br/><details><summary>Detailed results of derep forward reads</summary>```Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R1.fastq.gzEncountered 7277 unique sequences from 35999 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R1.fastq.gzEncountered 11366 unique sequences from 48230 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R1.fastq.gzEncountered 11714 unique sequences from 80877 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R1.fastq.gzEncountered 13088 unique sequences from 58956 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R1.fastq.gzEncountered 10180 unique sequences from 42563 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R1.fastq.gzEncountered 4480 unique sequences from 17548 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R1.fastq.gzEncountered 12259 unique sequences from 68259 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R1.fastq.gzEncountered 160 unique sequences from 308 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R1.fastq.gzEncountered 10436 unique sequences from 48518 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R1.fastq.gzEncountered 11230 unique sequences from 51055 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R1.fastq.gzEncountered 10316 unique sequences from 43471 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R1.fastq.gzEncountered 2121 unique sequences from 9488 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R1.fastq.gzEncountered 9236 unique sequences from 41563 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R1.fastq.gzEncountered 9000 unique sequences from 35722 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R1.fastq.gzEncountered 42 unique sequences from 85 total sequences read.```</details><br/>#### Reverse Reads```{r}derepRs <-derepFastq(filtRs, verbose =TRUE)names(derepRs) <- sample.names```<br/><details><summary>Detailed results of derep reverse reads</summary>```Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R2.fastq.gzEncountered 13081 unique sequences from 35999 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R2.fastq.gzEncountered 17965 unique sequences from 48230 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R2.fastq.gzEncountered 18610 unique sequences from 80877 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R2.fastq.gzEncountered 19600 unique sequences from 58956 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R2.fastq.gzEncountered 16145 unique sequences from 42563 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R2.fastq.gzEncountered 7496 unique sequences from 17548 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R2.fastq.gzEncountered 19933 unique sequences from 68259 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R2.fastq.gzEncountered 199 unique sequences from 308 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R2.fastq.gzEncountered 18495 unique sequences from 48518 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R2.fastq.gzEncountered 17216 unique sequences from 51055 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R2.fastq.gzEncountered 17111 unique sequences from 43471 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R2.fastq.gzEncountered 3809 unique sequences from 9488 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R2.fastq.gzEncountered 14140 unique sequences from 41563 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R2.fastq.gzEncountered 14856 unique sequences from 35722 total sequences read.Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R2.fastq.gzEncountered 45 unique sequences from 85 total sequences read.```</details><br/>## 7. DADA2 & ASV InferenceAt this point we are ready to apply the core sample inference algorithm (dada) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference. If `pool = TRUE`, the algorithm will pool together all samples prior to sample inference. If `pool = FALSE`, sample inference is performed on each sample individually. If `pool = "pseudo"`, the algorithm will perform pseudo-pooling between individually processed samples.We tested all three methods through the full pipeline. Click the `+` to see the results of the test. For our final analysis, we chose `pool = TRUE` for this data set. <details><summary>Show/hide Results of `dada` pooling options</summary>Here are a few summary tables of results from the tests. Values are from the final sequence table prodcuced by each option. ```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/its18_dada_compare.txt",header =TRUE, sep ="\t", row.names =1)```<small>`r caption_tab("pooling_summary_its")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/its18_dada_compare.txt dname="its18_dada_compare" label="Download table as text file" icon="table" type="primary" >}}```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/its18_dada_compare_by_samp.txt",header =TRUE, sep ="\t")```<small>`r caption_tab("pooling_its")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", minWidth =150, filterable =FALSE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/its18_dada_compare_by_samp.txt dname="its18_pooling_results" label="Download table as text file" icon="table" type="primary" >}}</details>#### Forward Reads```{r}dadaFs <-dada(derepFs, err = errF, pool =TRUE, multithread =TRUE)``````15 samples were pooled: 582642 reads in 102887 unique sequences.``````{r}dadaFs[[sampnum]]``````dada-class: object describing DADA2 denoising results1175 sequence variants were inferred from 7277 input unique sequences.Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16```#### Reverse Reads```{r}dadaRs <-dada(derepRs, err = errR, pool =TRUE, multithread =TRUE)``````15 samples were pooled: 582642 reads in 174866 unique sequences.``````{r}dadaRs[[sampnum]]``````dada-class: object describing DADA2 denoising results1116 sequence variants were inferred from 13081 input unique sequences.Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16``````{r}save.image("pre_merge_its_2018.rdata")```## 8. Merge Paired Reads```{r}mergers <-mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)```<details><summary>Detailed results of merging reads</summary>```29795 paired-reads (in 946 unique pairings) successfully merged out of 32193 (in 1177 pairings) input.41880 paired-reads (in 719 unique pairings) successfully merged out of 46715 (in 941 pairings) input.64805 paired-reads (in 815 unique pairings) successfully merged out of 66771 (in 1015 pairings) input.Duplicate sequences in merged output.51144 paired-reads (in 1011 unique pairings) successfully merged out of 57613 (in 1325 pairings) input.38442 paired-reads (in 767 unique pairings) successfully merged out of 41985 (in 994 pairings) input.14188 paired-reads (in 349 unique pairings) successfully merged out of 16981 (in 452 pairings) input.56594 paired-reads (in 1018 unique pairings) successfully merged out of 64411 (in 1306 pairings) input.296 paired-reads (in 95 unique pairings) successfully merged out of 301 (in 100 pairings) input.35713 paired-reads (in 619 unique pairings) successfully merged out of 41006 (in 786 pairings) input.45851 paired-reads (in 955 unique pairings) successfully merged out of 50174 (in 1240 pairings) input.36121 paired-reads (in 870 unique pairings) successfully merged out of 41541 (in 1135 pairings) input.Duplicate sequences in merged output.9172 paired-reads (in 335 unique pairings) successfully merged out of 9403 (in 402 pairings) input.38392 paired-reads (in 704 unique pairings) successfully merged out of 40897 (in 886 pairings) input.29155 paired-reads (in 748 unique pairings) successfully merged out of 34979 (in 990 pairings) input.Duplicate sequences in merged output.80 paired-reads (in 14 unique pairings) successfully merged out of 82 (in 16 pairings) input.```</details><br/>```{r}#| echo: falsehead(mergers[[1]])```## 9. Construct Sequence Table```{r}seqtabF <-makeSequenceTable(mergers)dim(seqtabF)``````{r}#| echo: false#| eval: trueseqtabX <-c(15, 3364)seqtabX``````{r}table(nchar(getSequences(seqtabF)))``````130 131 132 133 136 139 140 141 143 144 146 147 148 149 150 151 152 153 154 155 2 2 1 1 1 1 1 3 2 1 3 5 5 4 5 5 3 2 5 4 156 159 160 161 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 1 2 1 1 2 2 4 9 2 1 2 5 5 5 8 9 2 5 10 13 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 13 14 11 14 12 15 19 12 12 13 9 22 11 11 12 15 17 13 10 11 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 22 16 10 11 13 17 10 8 10 12 13 7 20 18 24 30 39 38 20 19 219 220 221 222 223 224 225 226 228 229 230 231 232 233 234 235 236 237 238 239 12 18 21 23 23 29 32 32 6 48 4 73 20 27 29 19 31 33 30 29 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 35 21 36 16 16 18 17 25 22 29 17 27 28 22 20 18 28 19 28 25 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 23 11 15 13 16 18 23 17 12 12 16 15 32 17 22 21 21 18 16 17 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 13 22 13 18 12 13 13 18 17 15 14 18 8 11 13 17 12 11 13 9 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 16 10 9 3 8 12 8 13 20 15 8 13 16 11 10 8 18 10 8 12 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 8 5 12 17 13 9 8 8 9 12 16 13 14 11 12 13 18 10 11 10 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 9 11 8 8 5 15 8 16 11 15 4 9 7 9 12 14 5 12 11 5 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 2 4 11 5 2 14 6 12 7 5 6 4 3 9 5 10 8 9 4 6 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 5 3 7 7 4 2 3 4 3 5 5 1 3 4 3 6 5 4 3 2 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 421 11 5 3 3 3 3 2 1 2 3 3 1 3 3 1 2 2 1 2 2 422 423 424 427 430 431 432 434 436 437 439 440 441 442 443 446 447 448 2 3 2 1 3 2 4 2 1 2 4 3 1 2 1 1 2 1 ``````{r}#| echo: falseread_length <-data.frame(nchar(getSequences(seqtab)))colnames(read_length) <-"length"plot_a <-qplot(length, data = read_length,geom ="histogram", binwidth =1,xlab ="read length",ylab ="total variants",xlim =c(200, 300))ggsave("read_length_before.png", plot_a, width =7, height =3)``````{r}#| echo: falsesystem("cp files/dada2/figures/its18_read_length_before.png include/dada2/its18_read_length_before.png")``````{r}#| echo: false#| eval: trueknitr::include_graphics("include/dada2/its18_read_length_before.png")```<small>`r caption_fig("read_length_its")`</small>As expected, a wide range of length variability in the the amplified ITS region.```{r}seqtab <- seqtabF[,nchar(colnames(seqtabF)) %in%seq(100,450)]dim(seqtab)``````{r}#| echo: false#| eval: trueseqtabY <-c(15, 3364)seqtabY```After removing the extreme length variants, we have `r (seqtabY)[2]`, a reduction of `r (seqtabX)[2] - (seqtabY)[2]` sequence variants.```{r}table(nchar(getSequences(seqtab)))``````{r}#| echo: falsesaveRDS(seqtab, "its18_seqtab.rds")```## 10. Remove Chimeras```{r}seqtab.nochim <-removeBimeraDenovo(seqtab,method ="consensus",multithread =TRUE, verbose =TRUE)dim(seqtab.nochim)``````Identified 7 bimeras out of 3364 input sequences.``````{r}#| echo: false#| eval: trueseqtabZ <-c(15, 3357)seqtabZ``````{r}sum(seqtab.nochim)/sum(seqtab.2)``````{r}#| echo: false#| eval: truechim_rem <-0.99901350.9990135```Chimera checking removed an additional `r (seqtabY)[2] - (seqtabZ)[2]` sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about `r 100*(1-chim_rem)`% of the merged sequence reads. Curious.```{r}table(colSums(seqtab.nochim>0))`````` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 151247 711 443 289 204 123 102 86 49 32 24 15 23 7 2``````{r}table(colSums(seqtab.nochim>0))`````` 14 95 335 349 616 702 719 745 765 812 867 945 954 1011 1017 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1```The last thing we want to do is write the sequence table to an `RDS` file.```{r}saveRDS(seqtab.nochim, "its18_seqtab.nochim.rds")``````{r}#| echo: falsetable(nchar(getSequences(seqtab.nochim)))```## 11. Track Reads through Workflow```{r}#| echo: false#| eval: trueseq_table <-read.table("files/dada2/tables/its18_read_changes.txt",header =TRUE, sep ="\t")```<small>`r caption_tab("track_its")`</small>```{r}#| echo: false#| eval: truereactable(seq_table,defaultColDef =colDef(header =function(value) gsub("_", " ", value, fixed =TRUE),cell =function(value) format(value, nsmall =1),align ="center", filterable =TRUE, sortable =TRUE, resizable =TRUE,footerStyle =list(fontWeight ="bold") ), columns =list(Sample_ID =colDef(name ="Sample ID", sticky ="left", style =list(borderRight ="1px solid #eee"),headerStyle =list(borderRight ="1px solid #eee"), align ="left",minWidth =150, footer ="Total reads"), raw =colDef(footer =function(values) sprintf("%.0f", sum(values))),pre_filt =colDef(footer =function(values) sprintf("%.0f", sum(values))),pre_filt =colDef(footer =function(values) sprintf("%.0f", sum(values))),cut =colDef(footer =function(values) sprintf("%.0f", sum(values))),filtered =colDef(footer =function(values) sprintf("%.0f", sum(values))),denoisedF =colDef(footer =function(values) sprintf("%.0f", sum(values))),denoisedR =colDef(footer =function(values) sprintf("%.0f", sum(values))),merged =colDef(footer =function(values) sprintf("%.0f", sum(values))),nonchim =colDef(footer =function(values) sprintf("%.0f", sum(values))) ), searchable =TRUE, defaultPageSize =5, pageSizeOptions =c(5, 10, nrow(seq_table)), showPageSizeOptions =TRUE, highlight =TRUE, bordered =TRUE, striped =TRUE, compact =FALSE, wrap =FALSE, showSortable =TRUE, fullWidth =TRUE,theme =reactableTheme(style =list(fontSize ="0.8em")))```{{< downloadthis files/dada2/tables/its18_read_changes.txt dname="its18_track_read_changes" label="Download table as text file" icon="table" type="primary" >}} ```{r}#| echo: falsesave.image("pre_tax_its_2018.rdata")```## 12. Assign TaxonomyFor taxonomic classification, we used the UNITE [@nilsson2019unite] database, specifically the [UNITE general FASTA release for Fungi (v. 04.02.2020)](https://plutof.ut.ee/#/doi/10.15156/BIO/786368)[@abarenkov2020unite].```{r}#| echo: falseremove(list =ls())seqtab <-readRDS("its18_seqtab.nochim.rds")``````{r}tax <-assignTaxonomy(seqtab,"sh_general_release_dynamic_s_04.02.2020_dev.fasta",multithread =20, tryRC =TRUE)saveRDS(tax, "its18_tax_its.rds")``````UNITE fungal taxonomic reference detected.```## 13. Save Image```{r}save.image("its18_dada2_wf.rdata")```## R Session Information & CodeThis workflow was run on the [Smithsonian High Performance Cluster (SI/HPC)](https://doi.org/10.25572/SIHPC), Smithsonian Institution. Below are the specific packages and versions used in this workflow using both `sessionInfo()` and `devtools::session_info()`. Click the arrow to see the details.<details><summary>Show/hide R Session Info</summary>```sessionInfo()R version 4.0.0 (2020-04-24)Platform: x86_64-conda_cos6-linux-gnu (64-bit)Running under: CentOS Linux 7 (Core)Matrix products: defaultBLAS/LAPACK: /home/scottjj/miniconda3/envs/R-4/lib/libopenblasp-r0.3.9.solocale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=Cattached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods baseother attached packages: [1] gridExtra_2.3 ggplot2_3.3.0 [3] ShortRead_1.46.0 GenomicAlignments_1.24.0 [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0 [7] matrixStats_0.56.0 Biobase_2.48.0 [9] Rsamtools_2.4.0 GenomicRanges_1.40.0[11] GenomeInfoDb_1.24.0 Biostrings_2.56.0[13] XVector_0.28.0 IRanges_2.22.1[15] S4Vectors_0.26.0 BiocParallel_1.22.0[17] BiocGenerics_0.34.0 dada2_1.16.0[19] Rcpp_1.0.4.6loaded via a namespace (and not attached): [1] plyr_1.8.6 RColorBrewer_1.1-2 pillar_1.4.4 [4] compiler_4.0.0 bitops_1.0-6 tools_4.0.0 [7] zlibbioc_1.34.0 lifecycle_0.2.0 tibble_3.0.1[10] gtable_0.3.0 lattice_0.20-41 png_0.1-7[13] pkgconfig_2.0.3 rlang_0.4.6 Matrix_1.2-18[16] GenomeInfoDbData_1.2.3 withr_2.2.0 stringr_1.4.0[19] hwriter_1.3.2 vctrs_0.3.0 glue_1.4.1[22] R6_2.4.1 jpeg_0.1-8.1 latticeExtra_0.6-29[25] reshape2_1.4.4 magrittr_1.5 scales_1.1.1[28] ellipsis_0.3.1 colorspace_1.4-1 stringi_1.4.6[31] RCurl_1.98-1.2 RcppParallel_5.0.1 munsell_0.5.0[34] crayon_1.3.4devtools::session_info()─ Session info ─────────────────────────────────────────────────────────────── setting value version R version 4.0.0 (2020-04-24) os CentOS Linux 7 (Core) system x86_64, linux-gnu ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2020-06-04─ Packages ─────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0) backports 1.1.7 2020-05-13 [1] CRAN (R 4.0.0) Biobase * 2.48.0 2020-04-27 [1] Bioconductor BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor BiocParallel * 1.22.0 2020-04-27 [1] Bioconductor Biostrings * 2.56.0 2020-04-27 [1] Bioconductor bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.0) callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0) cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0) colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0) crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0) dada2 * 1.16.0 2020-04-27 [1] Bioconductor DelayedArray * 0.14.0 2020-04-27 [1] Bioconductor desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0) devtools 2.3.0 2020-04-10 [1] CRAN (R 4.0.0) digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0) ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0) fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0) fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0) GenomeInfoDb * 1.24.0 2020-04-27 [1] Bioconductor GenomeInfoDbData 1.2.3 2020-05-22 [1] Bioconductor GenomicAlignments * 1.24.0 2020-04-27 [1] Bioconductor GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 4.0.0) glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.0) gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.0.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0) hwriter 1.3.2 2014-09-10 [1] CRAN (R 4.0.0) IRanges * 2.22.1 2020-04-28 [1] Bioconductor jpeg 0.1-8.1 2019-10-24 [1] CRAN (R 4.0.0) lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.0) latticeExtra 0.6-29 2019-12-19 [1] CRAN (R 4.0.0) lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0) magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0) Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.0) matrixStats * 0.56.0 2020-03-13 [1] CRAN (R 4.0.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0) pillar 1.4.4 2020-05-05 [1] CRAN (R 4.0.0) pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 4.0.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 4.0.0) plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0) png 0.1-7 2013-12-03 [1] CRAN (R 4.0.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0) processx 3.4.2 2020-02-09 [1] CRAN (R 4.0.0) ps 1.3.3 2020-05-08 [1] CRAN (R 4.0.0) R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0) RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.0) Rcpp * 1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0) RcppParallel 5.0.1 2020-05-06 [1] CRAN (R 4.0.0) RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0) remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0) reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.0) rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0) Rsamtools * 2.4.0 2020-04-27 [1] Bioconductor S4Vectors * 0.26.0 2020-04-27 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0) ShortRead * 1.46.0 2020-04-27 [1] Bioconductor stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0) stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0) SummarizedExperiment * 1.18.1 2020-04-30 [1] Bioconductor testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0) tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0) usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.0) vctrs 0.3.0 2020-05-11 [1] CRAN (R 4.0.0) withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0) XVector * 0.28.0 2020-04-27 [1] Bioconductor zlibbioc 1.34.0 2020-04-27 [1] Bioconductor```</details><br/># Workflow Output Data products generated in this workflow can be downloaded from figshare.<iframe src="https://widgets.figshare.com/articles/14687184/embed?show_title=1" width="100%" height="351" frameborder="0"></iframe>The end!```{r}#| include: false#| eval: trueremove(list =ls())```<a href="data-prep.html" class="btnnav button_round" style="float: right;">Next workflow:<br> 2. Data Set Preparation</a><br><br>```{r}#| message: false #| results: hide#| eval: true#| echo: falseremove(list =ls())### COmmon formatting scripts### NOTE: captioner.R must be read BEFORE captions_XYZ.Rsource(file.path("assets", "functions.R"))```#### Source Code {.appendix}The source code for this page can be accessed on GitHub `r fa(name = "github")` by [clicking this link](`r source_code()`). #### Data Availability {.appendix}Raw fastq files available on figshare at [10.25573/data.14686665](https://doi.org/10.25573/data.14686665) for the 16S rRNA data and [10.25573/data.14686755](https://doi.org/10.25573/data.14686755) for the ITS data. Trimmed fastq files (primers removed) available through the ENA under project accession number [PRJEB45074 (ERP129199)](https://www.ebi.ac.uk/ena/browser/view/PRJEB45074). Data generated in this workflow can be accessed on figshare at [10.25573/data.14687184](https://doi.org/10.25573/data.14687184).#### Last updated on {.appendix}```{r}#| echo: false#| eval: trueSys.time()```