Click here for setup information.

knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
set.seed(919191)
#pacman::p_depends(ape, local = TRUE)  
#pacman::p_depends_reverse(ape, local = TRUE)  
library(dada2); packageVersion("dada2")
library(ShortRead); packageVersion("ShortRead")
library(ggplot2); packageVersion("ggplot2")
pacman::p_load(tidyverse, gridExtra, grid, 
               formatR, reactable, downloadthis,
               captioner, sparkline, 
               install = FALSE, update = FALSE)

options(scipen=999)
knitr::opts_current$get(c(
  "cache",
  "cache.path",
  "cache.rebuild",
  "dependson",
  "autodep"
))
remove(list = ls())
source(file.path("assets", "functions.R"))

Prerequisites

In order to run this workflow, you either need the raw data, available on the figshare project site (see below), or the trimmed data, available from the European Nucleotide Archive (ENA) under project accession number PRJEB45074. See the Data Availability page for more details.

This workflow contains the code we used to process the 16S rRNA and ITS data sets using DADA2 (Callahan et al. 2016). Workflow construction is based on the DADA2 Pipeline Tutorial (1.8) and the primer identification section of the DADA2 ITS Pipeline Workflow (1.8).

16s rRNA Data

Workflow Input

All files needed to run this workflow can be downloaded from figshare.

NOTE: The reverse reads for 16S rRNA data set are unusable (see below), therefore this workflow deals with forward reads only.

We will screen the forward reads for reverse primers and show the quality plots of the reverse reads at the beginning of the workflow. After that, reverse reads are not used.

Overview

Sequence Files & Samples

We sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot.

(16S rRNA) Table 1 | Sample data & associated sequencing information.

Workflow

The workflow consists of the following steps:

Step	Command	What we’re doing
1	multiple	prepare input file names & paths
2	multiple	Define primers (all orientations)
3	`cutadapt`	Remove primers
4	`filterAndTrim()`	assess quality & filter reads
5	`learnErrors()`	generate an error model for the data
6	`derepFastq`	dereplicate sequences
7	`dada()`	infer ASVs on both forward & reverse reads independently
8	`makeSequenceTable()`	generate a count table
9	`removeBimeraDenovo()`	screen for & remove chimeras
10		track reads through workflow
11	`assignTaxonomy()`	assign taxonomy & finish workflow
12	`save.image()`	save an image of seqtab & taxtab for next part of workflow

1. Set Working Environment

Next, we need to setup the working environment by renaming the fastq files and defining a path for the working directory.

Rename samples

To make the parsing easier, we will eliminate the lane and well number from each sample. If you do not wish to do that you will need to adjust the code accordingly. I am positive there are more elegant ways of doing this.

CAUTION. If you use this for your own data, please check that this code works on a few backup files before proceeding.

setwd("RAW")
orig_fastq <- list.files(pattern = "*.fastq.gz")
newname_fastq <- gsub("_S.*_L001", "", orig_fastq)
newname_fastq <- gsub("_001", "", newname_fastq)
file.rename(orig_fastq, newname_fastq)
setwd("../")

path <- "RAW"
head(list.files(path))

[1] "P1_R1.fastq.gz"    "P1_R2.fastq.gz"    "P1-8C_R1.fastq.gz"
[4] "P1-8C_R2.fastq.gz" "P10_R1.fastq.gz"   "P10_R2.fastq.gz"

Then, we generate matched lists of the forward and reverse read files. We also parse out the sample name.

fnFs <- sort(list.files(path, pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2.fastq.gz", full.names = TRUE))

p1a <- plotQualityProfile(fnFs[1:15], aggregate = TRUE)
p2a <- plotQualityProfile(fnRs[1:15], aggregate = TRUE)

p3a <- grid.arrange(p1a, p2a, nrow = 1)
ggsave("plot_qscores_2018a.png", p3a, width = 7, height = 3)

Unfortunately, the reverse reads are unusable.

(16S rRNA) Figure 1 | Aggregated quality score plots for forward (left) & reverse (right) reads.

2. Define Primers

Before we start the DADA2 workflow we need to run catadapt (Martin 2011) on all fastq.gz files to trim the primers. For bacteria and archaea, we amplified the V4 hypervariable region of the 16S rRNA gene using the primer pair 515F (GTGCCAGCMGCCGCGGTAA) and 806R (GGACTACHVGGGTWTCTAAT) (Caporaso et al. 2011), which should yield an amplicon length of about 253 bp.

First we define the primers.

FWD <- "GTGCCAGCMGCCGCGGTAA"
REV <- "GGACTACHVGGGTWTCTAAT"

Next, we check the presence and orientation of these primers in the data. I started doing this for ITS data because of primer read-through but I really like the general idea of doing it just to make sure nothing funny is going of with the data. To do this, we will create all orientations of the input primer sequences. In other words the Forward, Complement, Reverse, and Reverse Complement variations.

allOrients <- function(primer) {
    require(Biostrings)
    dna <- DNAString(primer)
    orients <- c(Forward = dna,
                 Complement = complement(dna),
                 Reverse = reverse(dna),
                 RevComp = reverseComplement(dna))
    return(sapply(orients, toString))
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
REV.orients

              Forward            Complement               Reverse
"GTGCCAGCMGCCGCGGTAA" "CACGGTCGKCGGCGCCATT" "AATGGCGCCGMCGACCGTG"
              RevComp
"TTACCGCGGCKGCTGGCAC"
               Forward             Complement                Reverse
"GGACTACHVGGGTWTCTAAT" "CCTGATGDBCCCAWAGATTA" "TAATCTWTGGGVHCATCAGG"
               RevComp
"ATTAGAWACCCBDGTAGTCC"

Now we do a little pre-filter step to eliminate ambiguous bases (Ns) because Ns make mapping of short primer sequences difficult. This step removes any reads with Ns. Again, set some files paths, this time for the filtered reads.

fnFs.filtN <- file.path(path, "filtN", basename(fnFs))
filterAndTrim(fnFs, fnFs.filtN,
              maxN = 0, multithread = TRUE)

Nice. Time to assess the number of times a primer (and all primer orientations) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads. Since this is 16S rRNA we do not anticipate any issues but it is worth checking anyway.

sampnum <- 2

primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}

Forward primers

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits,
                                fn = fnFs.filtN[[sampnum]]))

                 Forward Complement Reverse RevComp
FWD.ForwardReads   74707          0       0       0

Reverse primers

rbind(REV.ForwardReads = sapply(REV.orients, primerHits,
                                fn = fnFs.filtN[[sampnum]]))

                 Forward Complement Reverse RevComp
REV.ForwardReads       0          0       0     165

As expected, forward primers predominantly in the forward reads and very little evidence of reverse primers.

3. Remove Primers

Now we can run catadapt (Martin 2011) to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt.

cutadapt <- "/Users/rad/miniconda3/envs/cutadapt/bin/cutadapt"
system2(cutadapt, args = "--version") # Run shell commands from R

2.8

We set paths and trim the forward primer and the reverse-complement of the reverse primer off of R1 (forward reads) and trim the reverse primer and the reverse-complement of the forward primer off of R2 (reverse reads).

path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)

R1.flags <- paste("-g", FWD, "-a", REV.RC)

for(i in seq_along(fnFs)) {system2(cutadapt,
                                   args = c(R1.flags,
                                            "-m", 20, "-n", 2, "-e", 0.12,
                                            "-o", fnFs.cut[i],
                                            fnFs.filtN[i]))}

This is cutadapt 2.8 with Python 3.7.6
Command line parameters: -g GTGCCAGCMGCCGCGGTAA -a ATTAGAWACCCBDGTAGTCC \
                         -m 20 -n 2 -e 0.12 \
                         -o RAW/cutadapt/P1_R1.fastq.gz RAW/filtN/P1_R1.fastq.gz

Note. If the code above removes all of the base pairs in a sequence, you will get downstream errors unless you set the -m flag. This flag sets the minimum length and reads shorter than this will be discarded. Without this flag, reads of length 0 will be kept and cause issues. Also, a lot of output will be written to the screen by cutadapt!.

We can now count the number of primers in the sequences from the output of cutadapt.

rbind(
  FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[sampnum]]),
  REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[sampnum]]))

                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
REV.ForwardReads       0          0       0       0

Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads.

(16S rRNA) Table 2 | Total reads per sample after prefiltering and primer removal (using cutadapt).

4. Quality Assessment & Filtering

We need the forward and reverse fastq file names plus the sample names.

cutFs <- sort(list.files(path.cut, pattern = "_R1.fastq.gz",
                         full.names = TRUE))
get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.names <- unname(sapply(cutFs, get.sample.name))
head(sample.names)

c("P1", "P1-8C", "P10", "P2", "P3", "P3-8C")

[1] "P1"    "P1-8C" "P10"   "P2"    "P3"    "P3-8C"

First let’s look at the quality of our reads. The numbers in brackets specify which samples to view. Here we are looking at an aggregate plot of all data (except the negative control).

Quality plots

p1 <- plotQualityProfile(cutFs[1:15], aggregate = TRUE)
ggsave("ssu18_plot_qscores.png", p1, width = 7, height = 3)
ggsave("ssu18_plot_qscores.png", p1)

(16S rRNA) Figure 2 | Aggregated quality score plot for forward reads after prefilter & primer removal.

The forward reads look good.

Filtering

First, we again make some path variables and setup a new directory of filtered reads.

filtFs <- file.path(path.cut, "filtered", basename(cutFs))

And then we trim the reads.

out <- filterAndTrim(cutFs, filtFs,
                     maxN = 0, maxEE = 2,
                     truncQ = 2, rm.phix = TRUE,
                     compress = TRUE, multithread = 20)
out

Note

These parameters should be set based on the anticipated length of the amplicon and the read quality.

And here is a table of how the filtering step affected the number of reads in each sample. As you can see, there are a few samples that started with a low read count to begin with—we will likely remove those samples at some point.

(16S rRNA) Table 3 | Total reads per sample after filtering.

5. Learn Error Rates

Now it is time to assess the error rate of the data. The DADA2 algorithm uses a parametric error model. Every amplicon data set has a different set of error rates and the learnErrors method learns this error model from the data. It does this by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. The algorithm begins with an initial guess, for which the maximum possible error rates in the data are used.

Forward Reads

errF <- learnErrors(filtFs, multithread = FALSE)

103002523 total bases in 443994 reads from 6 samples will be used for learning the error rates.

plotErrors(errF, nominalQ=TRUE)

(16S rRNA) Figure 3 | Forward reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.

The error rates for each possible transition (A to C, A to G, etc.) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected.

6. Dereplicate Reads

Now we can use derepFastq to identify the unique sequences in the forward and reverse fastq files.

Forward Reads

derepFs <- derepFastq(filtFs)
names(derepFs) <- sample.names

Detailed results of derep forward reads


Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R1.fastq.gz
Encountered 39950 unique sequences from 79804 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R1.fastq.gz
Encountered 31163 unique sequences from 72760 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R1.fastq.gz
Encountered 35274 unique sequences from 62598 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R1.fastq.gz
Encountered 42119 unique sequences from 81240 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R1.fastq.gz
Encountered 40336 unique sequences from 86732 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R1.fastq.gz
Encountered 31104 unique sequences from 60860 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R1.fastq.gz
Encountered 35169 unique sequences from 63037 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R1.fastq.gz
Encountered 27752 unique sequences from 54127 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R1.fastq.gz
Encountered 49045 unique sequences from 109421 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R1.fastq.gz
Encountered 46853 unique sequences from 94740 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R1.fastq.gz
Encountered 35363 unique sequences from 71678 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R1.fastq.gz
Encountered 35567 unique sequences from 73639 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R1.fastq.gz
Encountered 38518 unique sequences from 72263 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R1.fastq.gz
Encountered 43426 unique sequences from 89315 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R1.fastq.gz
Encountered 13673 unique sequences from 33000 total sequences read.

7. DADA2 & ASV Inference

At this point we are ready to apply the core sample inference algorithm (dada) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference.

If pool = TRUE, the algorithm will pool together all samples prior to sample inference.
If pool = FALSE, sample inference is performed on each sample individually.
If pool = "pseudo", the algorithm will perform pseudo-pooling between individually processed samples.

We tested all three methods through the full pipeline. Click the + to see the results of the test. For our final analysis, we chose pool = FALSE for this data set.

Show/hide Results of dada pooling options

Here are summary tables of results from the tests. Values are from the final sequence table produced by each option.

(16S rRNA) Table 4 | Total number of reads, total number of ASVs, minimum and maximum ASVs, followed by the number of singletons, doubletons, etc. for pooling options.

(16S rRNA) Table 5 | Total number of reads and ASVs by sample for pooling options.

dadaFs <- dada(derepFs, err = errF, multithread = TRUE, pool = FALSE)

Detailed results of dada on forward reads


Sample 1 - 79804 reads in 39950 unique sequences.
Sample 2 - 72760 reads in 31163 unique sequences.
Sample 3 - 62598 reads in 35274 unique sequences.
Sample 4 - 81240 reads in 42119 unique sequences.
Sample 5 - 86732 reads in 40336 unique sequences.
Sample 6 - 60860 reads in 31104 unique sequences.
Sample 7 - 63037 reads in 35169 unique sequences.
Sample 8 - 54127 reads in 27752 unique sequences.
Sample 9 - 109421 reads in 49045 unique sequences.
Sample 10 - 94740 reads in 46853 unique sequences.
Sample 11 - 71678 reads in 35363 unique sequences.
Sample 12 - 73639 reads in 35567 unique sequences.
Sample 13 - 72263 reads in 38518 unique sequences.
Sample 14 - 89315 reads in 43426 unique sequences.
Sample 15 - 33000 reads in 13673 unique sequences.

As an example, we can inspect the returned dada-class object for the forward and reverse reads from the sample #2:

dadaFs[[sampnum]]

dada-class: object describing DADA2 denoising results
1759 sequence variants were inferred from 31163 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

This output tells us how many true sequence variants the DADA2 algorithm inferred from the unique sequences, in this case the sample 2.

8. Construct Sequence Table

Now we construct an amplicon sequence variant (ASV) table.

seqtab <- makeSequenceTable(dadaFs)
dim(seqtab)

[1]    15 21685

table(nchar(getSequences(seqtab)))

   22    23    24    25    27    41    66    73   120   149   185   190   193 
    2     1     1     2     1     1     1     1     1     1     1     9     1 
  198   200   201   202   208   209   211   215   217   220   221   222   223 
    1     1     1     1     1     1     1     1     2     4    14    21     8 
  224   225   227   228   229   230   231   232   233   234   238   251 
    3     4     6     4     5     9    20 21342   190    12     2     8

The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants. We have 21685 sequence variants but also a range of sequence lengths. Since many of these sequence variants are singletons or doubletons, we can just select a range that corresponds to the expected amplicon length and eliminate the spurious reads.

(16S rRNA) Figure 4 | Distribution of read length by total ASVs before removing extreme length variants.

seqtab.2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(230,235)]
dim(seqtab.2)

Note

The values you select should be based on the sequence table generated above.

[1]    15 21573

table(nchar(getSequences(seqtab.2)))

  230   231   232   233   234 
    9    20 21342   190    12

After removing the extreme length variants, we have 21573, a reduction of 112 sequence variants.

9. Remove Chimeras

Even though the dada method corrects substitution and indel errors, chimeric sequences remain. According to the DADA2 documentation, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant parent sequences.

seqtab.nochim <- removeBimeraDenovo(seqtab.2,
                                    method = "consensus",
                                    multithread = 20,
                                    verbose = TRUE)
dim(seqtab.nochim)

Identified 1241 bimeras out of 21573 input sequences.

[1]    15 20332

sum(seqtab.nochim)/sum(seqtab.2)

[1] 0.9596595

Chimera checking removed an additional 1241 sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about 4.03405% of the merged sequence reads. Not bad.

The last thing we want to do is write the sequence table to an RDS file.

saveRDS(seqtab.nochim, "ssu18_seqtab.nochim.rds")

10. Track Reads through Workflow

At this point we can look at the number of reads that made it through each step of the workflow for every sample.

getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN),
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "nonchim")
rownames(track) <- sample.names

(16S rRNA) Table 6 | Tracking reads changes at each step of the DADA2 workflow.

And we can save a copy of the table for later use.

write.table(track, "ssu18_read_changes.txt",
            sep = "\t", quote = FALSE, col.names=NA)

11. Assign Taxonomy

The assignTaxonomy command implements the naive Bayesian classifier, so for reproducible results you need to set a random number seed (see issue #538). We did this at the beginning of the workflow. For taxonomic assignment, we are using the Silva version 138 (Quast et al. 2012). The developers of DADA2 maintain a formatted version of the database.

We will read in the RDS file containing the sequence table saved above. We also need to run remove(list = ls()) command, otherwise the final image we save will be huge. This way, the image only contains the sample data, seqtab, and taxtab after running removeBimeraDenovo.

remove(list = ls())
seqtab <- readRDS("ssu18_seqtab.nochim.rds")

And then native Bayesian (Wang et al. 2007) classifier against the Silva database.

tax_silva <- assignTaxonomy(
  seqtab, "silva_nr_v138_train_set.fa.gz", multithread = TRUE)

12. Save Image

And finally, we save an image for use in the analytic part of the workflow. This R data file will be needed as the input for the phyloseq portion of the workflow. See the Data Availability page for complete details on where to get this file.

save.image("ssu18_dada2_wf.rdata")

R Session Information & Code

This workflow was run on the Smithsonian High Performance Cluster (SI/HPC), Smithsonian Institution. Below are the specific packages and versions used in this workflow using both sessionInfo() and devtools::session_info(). Click the arrow to see the details.

Show/hide R Session Info

sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/scottjj/miniconda3/envs/R-4/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] gridExtra_2.3               ggplot2_3.3.0
 [3] ShortRead_1.46.0            GenomicAlignments_1.24.0
 [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
 [7] matrixStats_0.56.0          Biobase_2.48.0
 [9] Rsamtools_2.4.0             GenomicRanges_1.40.0
[11] GenomeInfoDb_1.24.0         Biostrings_2.56.0
[13] XVector_0.28.0              IRanges_2.22.1
[15] S4Vectors_0.26.0            BiocParallel_1.22.0
[17] BiocGenerics_0.34.0         dada2_1.16.0
[19] Rcpp_1.0.4.6

loaded via a namespace (and not attached):
 [1] plyr_1.8.6             RColorBrewer_1.1-2     pillar_1.4.4
 [4] compiler_4.0.0         bitops_1.0-6           tools_4.0.0
 [7] zlibbioc_1.34.0        lifecycle_0.2.0        tibble_3.0.1
[10] gtable_0.3.0           lattice_0.20-41        png_0.1-7
[13] pkgconfig_2.0.3        rlang_0.4.6            Matrix_1.2-18
[16] GenomeInfoDbData_1.2.3 withr_2.2.0            stringr_1.4.0
[19] hwriter_1.3.2          vctrs_0.3.0            glue_1.4.1
[22] R6_2.4.1               jpeg_0.1-8.1           latticeExtra_0.6-29
[25] reshape2_1.4.4         magrittr_1.5           scales_1.1.1
[28] ellipsis_0.3.1         colorspace_1.4-1       stringi_1.4.6
[31] RCurl_1.98-1.2         RcppParallel_5.0.1     munsell_0.5.0
[34] crayon_1.3.4

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.0 (2020-04-24)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2020-06-17

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version  date       lib source
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.0.0)
 backports              1.1.7    2020-05-13 [1] CRAN (R 4.0.0)
 Biobase              * 2.48.0   2020-04-27 [1] Bioconductor
 BiocGenerics         * 0.34.0   2020-04-27 [1] Bioconductor
 BiocParallel         * 1.22.0   2020-04-27 [1] Bioconductor
 Biostrings           * 2.56.0   2020-04-27 [1] Bioconductor
 bitops                 1.0-6    2013-08-17 [1] CRAN (R 4.0.0)
 callr                  3.4.3    2020-03-28 [1] CRAN (R 4.0.0)
 cli                    2.0.2    2020-02-28 [1] CRAN (R 4.0.0)
 colorspace             1.4-1    2019-03-18 [1] CRAN (R 4.0.0)
 crayon                 1.3.4    2017-09-16 [1] CRAN (R 4.0.0)
 dada2                * 1.16.0   2020-04-27 [1] Bioconductor
 DelayedArray         * 0.14.0   2020-04-27 [1] Bioconductor
 desc                   1.2.0    2018-05-01 [1] CRAN (R 4.0.0)
 devtools               2.3.0    2020-04-10 [1] CRAN (R 4.0.0)
 digest                 0.6.25   2020-02-23 [1] CRAN (R 4.0.0)
 ellipsis               0.3.1    2020-05-15 [1] CRAN (R 4.0.0)
 fansi                  0.4.1    2020-01-08 [1] CRAN (R 4.0.0)
 fs                     1.4.1    2020-04-04 [1] CRAN (R 4.0.0)
 GenomeInfoDb         * 1.24.0   2020-04-27 [1] Bioconductor
 GenomeInfoDbData       1.2.3    2020-05-22 [1] Bioconductor
 GenomicAlignments    * 1.24.0   2020-04-27 [1] Bioconductor
 GenomicRanges        * 1.40.0   2020-04-27 [1] Bioconductor
 ggplot2              * 3.3.0    2020-03-05 [1] CRAN (R 4.0.0)
 glue                   1.4.1    2020-05-13 [1] CRAN (R 4.0.0)
 gridExtra            * 2.3      2017-09-09 [1] CRAN (R 4.0.0)
 gtable                 0.3.0    2019-03-25 [1] CRAN (R 4.0.0)
 hwriter                1.3.2    2014-09-10 [1] CRAN (R 4.0.0)
 IRanges              * 2.22.1   2020-04-28 [1] Bioconductor
 jpeg                   0.1-8.1  2019-10-24 [1] CRAN (R 4.0.0)
 lattice                0.20-41  2020-04-02 [1] CRAN (R 4.0.0)
 latticeExtra           0.6-29   2019-12-19 [1] CRAN (R 4.0.0)
 lifecycle              0.2.0    2020-03-06 [1] CRAN (R 4.0.0)
 magrittr               1.5      2014-11-22 [1] CRAN (R 4.0.0)
 Matrix                 1.2-18   2019-11-27 [1] CRAN (R 4.0.0)
 matrixStats          * 0.56.0   2020-03-13 [1] CRAN (R 4.0.0)
 memoise                1.1.0    2017-04-21 [1] CRAN (R 4.0.0)
 munsell                0.5.0    2018-06-12 [1] CRAN (R 4.0.0)
 pillar                 1.4.4    2020-05-05 [1] CRAN (R 4.0.0)
 pkgbuild               1.0.8    2020-05-07 [1] CRAN (R 4.0.0)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.0.0)
 pkgload                1.0.2    2018-10-29 [1] CRAN (R 4.0.0)
 plyr                   1.8.6    2020-03-03 [1] CRAN (R 4.0.0)
 png                    0.1-7    2013-12-03 [1] CRAN (R 4.0.0)
 prettyunits            1.1.1    2020-01-24 [1] CRAN (R 4.0.0)
 processx               3.4.2    2020-02-09 [1] CRAN (R 4.0.0)
 ps                     1.3.3    2020-05-08 [1] CRAN (R 4.0.0)
 R6                     2.4.1    2019-11-12 [1] CRAN (R 4.0.0)
 RColorBrewer           1.1-2    2014-12-07 [1] CRAN (R 4.0.0)
 Rcpp                 * 1.0.4.6  2020-04-09 [1] CRAN (R 4.0.0)
 RcppParallel           5.0.1    2020-05-06 [1] CRAN (R 4.0.0)
 RCurl                  1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0)
 remotes                2.1.1    2020-02-15 [1] CRAN (R 4.0.0)
 reshape2               1.4.4    2020-04-09 [1] CRAN (R 4.0.0)
 rlang                  0.4.6    2020-05-02 [1] CRAN (R 4.0.0)
 rprojroot              1.3-2    2018-01-03 [1] CRAN (R 4.0.0)
 Rsamtools            * 2.4.0    2020-04-27 [1] Bioconductor
 S4Vectors            * 0.26.0   2020-04-27 [1] Bioconductor
 scales                 1.1.1    2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.0.0)
 ShortRead            * 1.46.0   2020-04-27 [1] Bioconductor
 stringi                1.4.6    2020-02-17 [1] CRAN (R 4.0.0)
 stringr                1.4.0    2019-02-10 [1] CRAN (R 4.0.0)
 SummarizedExperiment * 1.18.1   2020-04-30 [1] Bioconductor
 testthat               2.3.2    2020-03-02 [1] CRAN (R 4.0.0)
 tibble                 3.0.1    2020-04-20 [1] CRAN (R 4.0.0)
 usethis                1.6.1    2020-04-29 [1] CRAN (R 4.0.0)
 vctrs                  0.3.0    2020-05-11 [1] CRAN (R 4.0.0)
 withr                  2.2.0    2020-04-20 [1] CRAN (R 4.0.0)
 XVector              * 0.28.0   2020-04-27 [1] Bioconductor
 zlibbioc               1.34.0   2020-04-27 [1] Bioconductor

ITS Data

Workflow Input

All files needed to run this workflow can be downloaded from figshare.

Overview

Sequence Files & Samples

We sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot.

(ITS) Table 1 | Sample data & associated sequencing information.

Workflow

Our workflow is pretty much ripped from the DADA2 ITS Workflow (1.8) on the DADA2 website. That webpage contains thorough explanations for each step so we will not repeat most of that here. For more details, check out the post. The workflow consists of the following steps:

Step	Command	What we’re doing
1	multiple	prepare input file names & paths
2	multiple	Define primers (all orientations)
3	`cutadapt`	Remove primers
4	`filterAndTrim()`	assess quality & filter reads
5	`learnErrors()`	generate an error model for the data
6	`derepFastq`	dereplicate sequences
7	`dada()`	infer ASVs on both forward & reverse reads independently
8	`mergePairs()`	merge forward & reverse reads to further refine ASVs
9	`makeSequenceTable()`	generate a count table
10	`removeBimeraDenovo()`	screen for & remove chimeras
11		track reads through workflow
12	`assignTaxonomy()`	assign taxonomy & finish workflow
13	`save.image()`	save an image of seqtab & taxtab for next part of workflow

1. Set Working Environment

Next, we need to setup the working environment by renaming the fastq files and defining a path for the working directory.

Rename samples

CAUTION. You should check that this code works on a few backup files before proceeding.

setwd("RAW")
orig_fastq <- list.files(pattern = "*.fastq.gz")
newname_fastq <- gsub("_S.*_L001", "", orig_fastq)
newname_fastq <- gsub("_001", "", newname_fastq)
file.rename(orig_fastq, newname_fastq)
setwd("../")

path <- "RAW"
head(list.files(path))

[1] "P1_R1.fastq.gz"    "P1_R2.fastq.gz"    "P1-8C_R1.fastq.gz"
[4] "P1-8C_R2.fastq.gz" "P10_R1.fastq.gz"   "P10_R2.fastq.gz"

Then, we generate matched lists of the forward and reverse read files. We also parse out the sample name.

fnFs <- sort(list.files(path, pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2.fastq.gz", full.names = TRUE))

2. Define Primers

Before we start the DADA2 workflow we need to run catadapt (Martin 2011) on all fastq.gz files to trim the primers. For this part of the study we used the primer pair ITS1f (CTTGGTCATTTAGAGGAAGTAA) (Gardes and Bruns 1993) and ITS2 (GCTGCGTTCTTCATCGATGC) (White et al. 1990) which should yield variable amplicon lengths between 100 to 400 bp. What we are looking for is potential read-through scenarios that are possible when sequencing the length variable ITS region as described in the DADA2 ITS Workflow (1.8). Please refer to this document for a complete explanation.

First we define the primers.

FWD <- "CTTGGTCATTTAGAGGAAGTAA"
REV <- "GCTGCGTTCTTCATCGATGC"

Next, we check the presence and orientation of these primers in the data. To do this we will create all orientations of the input primer sequences. In other words the Forward, Complement, Reverse, and Reverse Complement variations.

allOrients <- function(primer) {
    require(Biostrings)
    dna <- DNAString(primer)
    orients <- c(Forward = dna,
                 Complement = complement(dna),
                 Reverse = reverse(dna),
                 RevComp = reverseComplement(dna))
    return(sapply(orients, toString))
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients
REV.orients

                 Forward               Complement                  Reverse
"CTTGGTCATTTAGAGGAAGTAA" "GAACCAGTAAATCTCCTTCATT" "AATGAAGGAGATTTACTGGTTC"
                 RevComp
"TTACTTCCTCTAAATGACCAAG"
               Forward             Complement                Reverse
"GCTGCGTTCTTCATCGATGC" "CGACGCAAGAAGTAGCTACG" "CGTAGCTACTTCTTGCGTCG"
               RevComp
"GCATCGATGAAGAACGCAGC"

fnFs.filtN <- file.path(path, "filtN", basename(fnFs))
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN,
              maxN = 0, multithread = TRUE)

Sweet. Time to assess the number of times a primer (and all possible primer orientation) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads

sampnum <- 1
primerHits <- function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}

Forward primers

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits,
                                fn = fnFs.filtN[[sampnum]]),
    FWD.ReverseReads = sapply(FWD.orients, primerHits,
                              fn = fnRs.filtN[[sampnum]]))

                 Forward Complement Reverse RevComp
FWD.ForwardReads   48467          0       0       0
FWD.ReverseReads       0          0       0    5078

Reverse primers

rbind(REV.ForwardReads = sapply(REV.orients, primerHits,
                                fn = fnFs.filtN[[sampnum]]),
       REV.ReverseReads = sapply(REV.orients, primerHits,
                                 fn = fnRs.filtN[[sampnum]]))

                 Forward Complement Reverse RevComp
REV.ForwardReads       0          0       0    6237
REV.ReverseReads   43902          0       0       0

What does this table mean? I wondered the same thing. Let’s break it down. Sample P1 had 50,894 sequences after the original filtering described earlier. The code searched the forward and reverse fastq files for all 8 primers. If we look at the two outputs, we see the forward primer is found in the forward reads in its forward orientation but also in some reverse reads in its reverse-complement orientation. The reverse primer is found in the reverse reads in its forward orientation but also in some forward reads in its reverse-complement orientation. This is due to read-through when the ITS region is short.

3. Remove Primers

Now we can run catadapt (Martin 2011) to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt.

cutadapt <- "/PATH/to/cutadapt"
system2(cutadapt, args = "--version") # Run shell commands from R

2.8

path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))

FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)

R1.flags <- paste("-g", FWD, "-a", REV.RC)
R2.flags <- paste("-G", REV, "-A", FWD.RC)

for(i in seq_along(fnFs)) {system2(cutadapt,
                                   args = c(R1.flags, R2.flags,
                                            "-m", 20, "-n", 2,
                                            "-o", fnFs.cut[i],
                                            "-p", fnRs.cut[i],
                                            fnFs.filtN[i], fnRs.filtN[i]))}

This is cutadapt 2.8 with Python 3.7.6
Command line parameters: -g CTTGGTCATTTAGAGGAAGTAA -a GCATCGATGAAGAACGCAGC \
                         -G GCTGCGTTCTTCATCGATGC -A TTACTTCCTCTAAATGACCAAG \
                         -m 20 -n 2 \
                         -o RAW/cutadapt/P1_R1.fastq.gz \
                         -p RAW/cutadapt/P1_R2.fastq.gz RAW/filtN/P1_R1.fastq.gz RAW/filtN/P1_R2.fastq.gz

We can now count the number of primers in the sequences from the output of cutadapt.

rbind(
  FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[sampnum]]),
  FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[sampnum]]),
  REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[sampnum]]),
  REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[sampnum]]))

                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0

Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads.

(ITS) Table 2 | Total reads per sample after prefiltering and primer removal (using cutadapt).

4. Quality Assessment & Filtering

We need the forward and reverse fastq file names and the sample names.

cutFs <- sort(list.files(path.cut, pattern = "_R1.fastq.gz",
                         full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "_R2.fastq.gz",
                         full.names = TRUE))

get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1]
sample.names <- unname(sapply(cutFs, get.sample.name))
head(sample.names)

[1] "P1"    "P1-8C" "P10"   "P2"    "P3"    "P3-8C"

Quality plots

Then we inspect read quality.

p1 <- plotQualityProfile(cutFs[1:15], aggregate = TRUE)
p2 <- plotQualityProfile(cutRs[1:15], aggregate = TRUE)

p3 <- grid.arrange(p1, p2, nrow = 1)
ggsave("plot_qscores_2018.png", p3, width = 7, height = 3)

(ITS) Figure 1 | Aggregated quality score plots for forward (left) & reverse (right) reads before filtering.

Filtering

filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))

We use a minLen value to get rid of very short length sequences.

out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs,
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     minLen = 50, rm.phix = TRUE,
                     compress = TRUE, multithread = TRUE)
out

Note

These parameters should be set based on the anticipated length of the amplicon and the read quality.

And here is a table of how the filtering step affected the number of reads in each sample.

(ITS) Table 3 | Total reads per sample after filtering.

5. Learn Error Rates

Time to assess the error rate of the data. The rest of the workflow is very similar to the 16S workflows presented previously. So I will basically stop talking.

Forward Reads

errF <- learnErrors(filtFs, multithread = TRUE)

108788735 total bases in 495784 reads from 11 samples will be used for learning the error rates.
109358094 total bases in 495784 reads from 11 samples will be used for learning the error rates.

plotErrors(errF, nominalQ = TRUE)

(ITS) Figure 3 | Forward reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.

Reverse Reads

errR <- learnErrors(filtRs, multithread = TRUE)

109358094 total bases in 495784 reads from 11 samples will be used for learning the error rates.

plotErrors(errR, nominalQ = TRUE)

(ITS) Figure 4 | Reverse reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.

6. Dereplicate Reads

Forward Reads

derepFs <- derepFastq(filtFs, verbose = TRUE)
names(derepFs) <- sample.names

Detailed results of derep forward reads


Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R1.fastq.gz
Encountered 7277 unique sequences from 35999 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R1.fastq.gz
Encountered 11366 unique sequences from 48230 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R1.fastq.gz
Encountered 11714 unique sequences from 80877 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R1.fastq.gz
Encountered 13088 unique sequences from 58956 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R1.fastq.gz
Encountered 10180 unique sequences from 42563 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R1.fastq.gz
Encountered 4480 unique sequences from 17548 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R1.fastq.gz
Encountered 12259 unique sequences from 68259 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R1.fastq.gz
Encountered 160 unique sequences from 308 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R1.fastq.gz
Encountered 10436 unique sequences from 48518 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R1.fastq.gz
Encountered 11230 unique sequences from 51055 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R1.fastq.gz
Encountered 10316 unique sequences from 43471 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R1.fastq.gz
Encountered 2121 unique sequences from 9488 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R1.fastq.gz
Encountered 9236 unique sequences from 41563 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R1.fastq.gz
Encountered 9000 unique sequences from 35722 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R1.fastq.gz
Encountered 42 unique sequences from 85 total sequences read.

Reverse Reads

derepRs <- derepFastq(filtRs, verbose = TRUE)
names(derepRs) <- sample.names

Detailed results of derep reverse reads


Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R2.fastq.gz
Encountered 13081 unique sequences from 35999 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R2.fastq.gz
Encountered 17965 unique sequences from 48230 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R2.fastq.gz
Encountered 18610 unique sequences from 80877 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R2.fastq.gz
Encountered 19600 unique sequences from 58956 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R2.fastq.gz
Encountered 16145 unique sequences from 42563 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R2.fastq.gz
Encountered 7496 unique sequences from 17548 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R2.fastq.gz
Encountered 19933 unique sequences from 68259 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R2.fastq.gz
Encountered 199 unique sequences from 308 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R2.fastq.gz
Encountered 18495 unique sequences from 48518 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R2.fastq.gz
Encountered 17216 unique sequences from 51055 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R2.fastq.gz
Encountered 17111 unique sequences from 43471 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R2.fastq.gz
Encountered 3809 unique sequences from 9488 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R2.fastq.gz
Encountered 14140 unique sequences from 41563 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R2.fastq.gz
Encountered 14856 unique sequences from 35722 total sequences read.
Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R2.fastq.gz
Encountered 45 unique sequences from 85 total sequences read.

7. DADA2 & ASV Inference

At this point we are ready to apply the core sample inference algorithm (dada) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference.

We tested all three methods through the full pipeline. Click the + to see the results of the test. For our final analysis, we chose pool = TRUE for this data set.

Show/hide Results of dada pooling options

Here are a few summary tables of results from the tests. Values are from the final sequence table prodcuced by each option.

(ITS) Table 4 | Total number of reads, total number of ASVs, minimum and maximum ASVs, followed by the number of singletons, doubletons, etc. for pooling options.

(ITS) Table 5 | Total number of reads and ASVs by sample for pooling options.

Forward Reads

dadaFs <- dada(derepFs, err = errF, pool = TRUE, multithread = TRUE)

15 samples were pooled: 582642 reads in 102887 unique sequences.

dadaFs[[sampnum]]

dada-class: object describing DADA2 denoising results
1175 sequence variants were inferred from 7277 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

Reverse Reads

dadaRs <- dada(derepRs, err = errR, pool = TRUE, multithread = TRUE)

15 samples were pooled: 582642 reads in 174866 unique sequences.

dadaRs[[sampnum]]

dada-class: object describing DADA2 denoising results
1116 sequence variants were inferred from 13081 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

save.image("pre_merge_its_2018.rdata")

8. Merge Paired Reads

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

Detailed results of merging reads


29795 paired-reads (in 946 unique pairings) successfully merged out of 32193 (in 1177 pairings) input.
41880 paired-reads (in 719 unique pairings) successfully merged out of 46715 (in 941 pairings) input.
64805 paired-reads (in 815 unique pairings) successfully merged out of 66771 (in 1015 pairings) input.
Duplicate sequences in merged output.
51144 paired-reads (in 1011 unique pairings) successfully merged out of 57613 (in 1325 pairings) input.
38442 paired-reads (in 767 unique pairings) successfully merged out of 41985 (in 994 pairings) input.
14188 paired-reads (in 349 unique pairings) successfully merged out of 16981 (in 452 pairings) input.
56594 paired-reads (in 1018 unique pairings) successfully merged out of 64411 (in 1306 pairings) input.
296 paired-reads (in 95 unique pairings) successfully merged out of 301 (in 100 pairings) input.
35713 paired-reads (in 619 unique pairings) successfully merged out of 41006 (in 786 pairings) input.
45851 paired-reads (in 955 unique pairings) successfully merged out of 50174 (in 1240 pairings) input.
36121 paired-reads (in 870 unique pairings) successfully merged out of 41541 (in 1135 pairings) input.
Duplicate sequences in merged output.
9172 paired-reads (in 335 unique pairings) successfully merged out of 9403 (in 402 pairings) input.
38392 paired-reads (in 704 unique pairings) successfully merged out of 40897 (in 886 pairings) input.
29155 paired-reads (in 748 unique pairings) successfully merged out of 34979 (in 990 pairings) input.
Duplicate sequences in merged output.
80 paired-reads (in 14 unique pairings) successfully merged out of 82 (in 16 pairings) input.

9. Construct Sequence Table

seqtabF <- makeSequenceTable(mergers)
dim(seqtabF)

[1]   15 3364

table(nchar(getSequences(seqtabF)))

130 131 132 133 136 139 140 141 143 144 146 147 148 149 150 151 152 153 154 155 
  2   2   1   1   1   1   1   3   2   1   3   5   5   4   5   5   3   2   5   4 
156 159 160 161 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 
  1   2   1   1   2   2   4   9   2   1   2   5   5   5   8   9   2   5  10  13 
179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 
 13  14  11  14  12  15  19  12  12  13   9  22  11  11  12  15  17  13  10  11 
199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 
 22  16  10  11  13  17  10   8  10  12  13   7  20  18  24  30  39  38  20  19 
219 220 221 222 223 224 225 226 228 229 230 231 232 233 234 235 236 237 238 239 
 12  18  21  23  23  29  32  32   6  48   4  73  20  27  29  19  31  33  30  29 
240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 
 35  21  36  16  16  18  17  25  22  29  17  27  28  22  20  18  28  19  28  25 
260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 
 23  11  15  13  16  18  23  17  12  12  16  15  32  17  22  21  21  18  16  17 
280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 
 13  22  13  18  12  13  13  18  17  15  14  18   8  11  13  17  12  11  13   9 
300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 
 16  10   9   3   8  12   8  13  20  15   8  13  16  11  10   8  18  10   8  12 
320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 
  8   5  12  17  13   9   8   8   9  12  16  13  14  11  12  13  18  10  11  10 
340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 
  9  11   8   8   5  15   8  16  11  15   4   9   7   9  12  14   5  12  11   5 
360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 
  2   4  11   5   2  14   6  12   7   5   6   4   3   9   5  10   8   9   4   6 
380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 
  5   3   7   7   4   2   3   4   3   5   5   1   3   4   3   6   5   4   3   2 
400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 421 
 11   5   3   3   3   3   2   1   2   3   3   1   3   3   1   2   2   1   2   2 
422 423 424 427 430 431 432 434 436 437 439 440 441 442 443 446 447 448 
  2   3   2   1   3   2   4   2   1   2   4   3   1   2   1   1   2   1

(ITS) Figure 5 | Distribution of read length by total ASVs before removing extreme length variants.

As expected, a wide range of length variability in the the amplified ITS region.

seqtab <- seqtabF[,nchar(colnames(seqtabF)) %in% seq(100,450)]
dim(seqtab)

[1]   15 3364

After removing the extreme length variants, we have 3364, a reduction of 0 sequence variants.

table(nchar(getSequences(seqtab)))

10. Remove Chimeras

seqtab.nochim <- removeBimeraDenovo(seqtab,
                                    method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
dim(seqtab.nochim)

Identified 7 bimeras out of 3364 input sequences.

[1]   15 3357

sum(seqtab.nochim)/sum(seqtab.2)

[1] 0.9990135

Chimera checking removed an additional 7 sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about 0.09865% of the merged sequence reads. Curious.

table(colSums(seqtab.nochim>0))

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
1247  711  443  289  204  123  102   86   49   32   24   15   23    7    2

table(colSums(seqtab.nochim>0))

  14   95  335  349  616  702  719  745  765  812  867  945  954 1011 1017
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1

The last thing we want to do is write the sequence table to an RDS file.

saveRDS(seqtab.nochim, "its18_seqtab.nochim.rds")

11. Track Reads through Workflow

(ITS) Table 6 | Tracking reads changes at each step of the DADA2 workflow.

12. Assign Taxonomy

For taxonomic classification, we used the UNITE (Nilsson et al. 2019) database, specifically the UNITE general FASTA release for Fungi (v. 04.02.2020) (Abarenkov et al. 2020).

tax <- assignTaxonomy(seqtab,
                      "sh_general_release_dynamic_s_04.02.2020_dev.fasta",
                      multithread = 20, tryRC = TRUE)
saveRDS(tax, "its18_tax_its.rds")

UNITE fungal taxonomic reference detected.

13. Save Image

save.image("its18_dada2_wf.rdata")

R Session Information & Code

Show/hide R Session Info

sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/scottjj/miniconda3/envs/R-4/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils
 [8] datasets  methods   base

other attached packages:
 [1] gridExtra_2.3               ggplot2_3.3.0
 [3] ShortRead_1.46.0            GenomicAlignments_1.24.0
 [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
 [7] matrixStats_0.56.0          Biobase_2.48.0
 [9] Rsamtools_2.4.0             GenomicRanges_1.40.0
[11] GenomeInfoDb_1.24.0         Biostrings_2.56.0
[13] XVector_0.28.0              IRanges_2.22.1
[15] S4Vectors_0.26.0            BiocParallel_1.22.0
[17] BiocGenerics_0.34.0         dada2_1.16.0
[19] Rcpp_1.0.4.6

loaded via a namespace (and not attached):
 [1] plyr_1.8.6             RColorBrewer_1.1-2     pillar_1.4.4
 [4] compiler_4.0.0         bitops_1.0-6           tools_4.0.0
 [7] zlibbioc_1.34.0        lifecycle_0.2.0        tibble_3.0.1
[10] gtable_0.3.0           lattice_0.20-41        png_0.1-7
[13] pkgconfig_2.0.3        rlang_0.4.6            Matrix_1.2-18
[16] GenomeInfoDbData_1.2.3 withr_2.2.0            stringr_1.4.0
[19] hwriter_1.3.2          vctrs_0.3.0            glue_1.4.1
[22] R6_2.4.1               jpeg_0.1-8.1           latticeExtra_0.6-29
[25] reshape2_1.4.4         magrittr_1.5           scales_1.1.1
[28] ellipsis_0.3.1         colorspace_1.4-1       stringi_1.4.6
[31] RCurl_1.98-1.2         RcppParallel_5.0.1     munsell_0.5.0
[34] crayon_1.3.4

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.0 (2020-04-24)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2020-06-04

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version  date       lib source
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.0.0)
 backports              1.1.7    2020-05-13 [1] CRAN (R 4.0.0)
 Biobase              * 2.48.0   2020-04-27 [1] Bioconductor
 BiocGenerics         * 0.34.0   2020-04-27 [1] Bioconductor
 BiocParallel         * 1.22.0   2020-04-27 [1] Bioconductor
 Biostrings           * 2.56.0   2020-04-27 [1] Bioconductor
 bitops                 1.0-6    2013-08-17 [1] CRAN (R 4.0.0)
 callr                  3.4.3    2020-03-28 [1] CRAN (R 4.0.0)
 cli                    2.0.2    2020-02-28 [1] CRAN (R 4.0.0)
 colorspace             1.4-1    2019-03-18 [1] CRAN (R 4.0.0)
 crayon                 1.3.4    2017-09-16 [1] CRAN (R 4.0.0)
 dada2                * 1.16.0   2020-04-27 [1] Bioconductor
 DelayedArray         * 0.14.0   2020-04-27 [1] Bioconductor
 desc                   1.2.0    2018-05-01 [1] CRAN (R 4.0.0)
 devtools               2.3.0    2020-04-10 [1] CRAN (R 4.0.0)
 digest                 0.6.25   2020-02-23 [1] CRAN (R 4.0.0)
 ellipsis               0.3.1    2020-05-15 [1] CRAN (R 4.0.0)
 fansi                  0.4.1    2020-01-08 [1] CRAN (R 4.0.0)
 fs                     1.4.1    2020-04-04 [1] CRAN (R 4.0.0)
 GenomeInfoDb         * 1.24.0   2020-04-27 [1] Bioconductor
 GenomeInfoDbData       1.2.3    2020-05-22 [1] Bioconductor
 GenomicAlignments    * 1.24.0   2020-04-27 [1] Bioconductor
 GenomicRanges        * 1.40.0   2020-04-27 [1] Bioconductor
 ggplot2              * 3.3.0    2020-03-05 [1] CRAN (R 4.0.0)
 glue                   1.4.1    2020-05-13 [1] CRAN (R 4.0.0)
 gridExtra            * 2.3      2017-09-09 [1] CRAN (R 4.0.0)
 gtable                 0.3.0    2019-03-25 [1] CRAN (R 4.0.0)
 hwriter                1.3.2    2014-09-10 [1] CRAN (R 4.0.0)
 IRanges              * 2.22.1   2020-04-28 [1] Bioconductor
 jpeg                   0.1-8.1  2019-10-24 [1] CRAN (R 4.0.0)
 lattice                0.20-41  2020-04-02 [1] CRAN (R 4.0.0)
 latticeExtra           0.6-29   2019-12-19 [1] CRAN (R 4.0.0)
 lifecycle              0.2.0    2020-03-06 [1] CRAN (R 4.0.0)
 magrittr               1.5      2014-11-22 [1] CRAN (R 4.0.0)
 Matrix                 1.2-18   2019-11-27 [1] CRAN (R 4.0.0)
 matrixStats          * 0.56.0   2020-03-13 [1] CRAN (R 4.0.0)
 memoise                1.1.0    2017-04-21 [1] CRAN (R 4.0.0)
 munsell                0.5.0    2018-06-12 [1] CRAN (R 4.0.0)
 pillar                 1.4.4    2020-05-05 [1] CRAN (R 4.0.0)
 pkgbuild               1.0.8    2020-05-07 [1] CRAN (R 4.0.0)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.0.0)
 pkgload                1.0.2    2018-10-29 [1] CRAN (R 4.0.0)
 plyr                   1.8.6    2020-03-03 [1] CRAN (R 4.0.0)
 png                    0.1-7    2013-12-03 [1] CRAN (R 4.0.0)
 prettyunits            1.1.1    2020-01-24 [1] CRAN (R 4.0.0)
 processx               3.4.2    2020-02-09 [1] CRAN (R 4.0.0)
 ps                     1.3.3    2020-05-08 [1] CRAN (R 4.0.0)
 R6                     2.4.1    2019-11-12 [1] CRAN (R 4.0.0)
 RColorBrewer           1.1-2    2014-12-07 [1] CRAN (R 4.0.0)
 Rcpp                 * 1.0.4.6  2020-04-09 [1] CRAN (R 4.0.0)
 RcppParallel           5.0.1    2020-05-06 [1] CRAN (R 4.0.0)
 RCurl                  1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0)
 remotes                2.1.1    2020-02-15 [1] CRAN (R 4.0.0)
 reshape2               1.4.4    2020-04-09 [1] CRAN (R 4.0.0)
 rlang                  0.4.6    2020-05-02 [1] CRAN (R 4.0.0)
 rprojroot              1.3-2    2018-01-03 [1] CRAN (R 4.0.0)
 Rsamtools            * 2.4.0    2020-04-27 [1] Bioconductor
 S4Vectors            * 0.26.0   2020-04-27 [1] Bioconductor
 scales                 1.1.1    2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.0.0)
 ShortRead            * 1.46.0   2020-04-27 [1] Bioconductor
 stringi                1.4.6    2020-02-17 [1] CRAN (R 4.0.0)
 stringr                1.4.0    2019-02-10 [1] CRAN (R 4.0.0)
 SummarizedExperiment * 1.18.1   2020-04-30 [1] Bioconductor
 testthat               2.3.2    2020-03-02 [1] CRAN (R 4.0.0)
 tibble                 3.0.1    2020-04-20 [1] CRAN (R 4.0.0)
 usethis                1.6.1    2020-04-29 [1] CRAN (R 4.0.0)
 vctrs                  0.3.0    2020-05-11 [1] CRAN (R 4.0.0)
 withr                  2.2.0    2020-04-20 [1] CRAN (R 4.0.0)
 XVector              * 0.28.0   2020-04-27 [1] Bioconductor
 zlibbioc               1.34.0   2020-04-27 [1] Bioconductor

Workflow Output

Data products generated in this workflow can be downloaded from figshare.

The end!

Next workflow:
2. Data Set Preparation

Source Code

The source code for this page can be accessed on GitHub by clicking this link.

Data Availability

Raw fastq files available on figshare at 10.25573/data.14686665 for the 16S rRNA data and 10.25573/data.14686755 for the ITS data. Trimmed fastq files (primers removed) available through the ENA under project accession number PRJEB45074 (ERP129199). Data generated in this workflow can be accessed on figshare at 10.25573/data.14687184.

Last updated on

[1] "2022-06-29 07:30:39 EST"

References

Abarenkov, Kessy, Allan Zirk, Timo Piirmann, Raivo Pöhönen, Filipp Ivanov, R. Henrik Nilsson, and Urmas Kõljalg. 2020. “UNITE General FASTA Release for Fungi.” https://dx.doi.org/10.15156/BIO/786368.

Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “Dada2: High-Resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13 (7): 581. https://doi.org/10.1038/nmeth.3869.

Caporaso, J Gregory, Christian L Lauber, William A Walters, Donna Berg-Lyons, Catherine A Lozupone, Peter J Turnbaugh, Noah Fierer, and Rob Knight. 2011. “Global Patterns of 16s rRNA Diversity at a Depth of Millions of Sequences Per Sample.” Proceedings of the National Academy of Sciences 108: 4516–22. https://doi.org/https://doi.org/10.1073/pnas.1000080107.

Gardes, Monique, and Thomas D Bruns. 1993. “ITS Primers with Enhanced Specificity for Basidiomycetes-Application to the Identification of Mycorrhizae and Rusts.” Molecular Ecology 2 (2): 113–18. https://doi.org/10.1111/j.1365-294X.1993.tb00005.x.

Martin, Marcel. 2011. “Cutadapt Removes Adapter Sequences from High-Throughput Sequencing Reads.” EMBnet. Journal 17 (1): 10–12. https://doi.org/10.14806/ej.17.1.200.

Nilsson, Rolf Henrik, Karl-Henrik Larsson, Andy F S Taylor, Johan Bengtsson-Palme, Thomas S Jeppesen, Dmitry Schigel, Peter Kennedy, et al. 2019. “The UNITE Database for Molecular Identification of Fungi: Handling Dark Taxa and Parallel Taxonomic Classifications.” Nucleic Acids Research 47 (D1): D259–64. https://doi.org/10.1093/nar/gky1022.

Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2012. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41 (D1): D590–96. https://doi.org/10.1093/nar/gks1219.

Wang, Qiong, George M Garrity, James M Tiedje, and James R Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.” Applied and Environmental Microbiology 73 (16): 5261–67. https://doi.org/10.1128/AEM.00062-07.

White, Thomas J, Thomas Bruns, SJWT Lee, John Taylor, et al. 1990. “Amplification and Direct Sequencing of Fungal Ribosomal RNA Genes for Phylogenetics.” PCR Protocols: A Guide to Methods and Applications 18 (1): 315–22.

--- title: "1. DADA2 Workflow" description: | DADA2 workflow for processing the 16S rRNA and the ITS high temperature data sets. Workflow uses paired end reads, beginning with raw fastq files, ending with sequence and taxonomy tables. --- <details markdown="1"> <summary>Click here for setup information.</summary> ```{r} #| message: false #| results: hide knitr::opts_chunk$set(echo = TRUE, eval = FALSE) set.seed(919191) #pacman::p_depends(ape, local = TRUE) #pacman::p_depends_reverse(ape, local = TRUE) library(dada2); packageVersion("dada2") library(ShortRead); packageVersion("ShortRead") library(ggplot2); packageVersion("ggplot2") pacman::p_load(tidyverse, gridExtra, grid, formatR, reactable, downloadthis, captioner, sparkline, install = FALSE, update = FALSE) options(scipen=999) knitr::opts_current$get(c( "cache", "cache.path", "cache.rebuild", "dependson", "autodep" )) remove(list = ls()) source(file.path("assets", "functions.R")) ``` </details> ```{r} #| echo: false #| eval: true #| results: hide # Create the caption(s) with captioner caption_tab <- captioner(prefix = "(16S rRNA) Table", suffix = " |", style = "b") caption_fig <- captioner(prefix = "(16S rRNA) Figure", suffix = " |", style = "b") # Create a function for referring to the tables in text ref <- function(x) str_extract(x, "[^|]*") %>% trimws(which = "right", whitespace = "[ ]") ``` ```{r} #| echo: false #| eval: true source("assets/captions/captions_dada2.R") ``` # Prerequisites In order to run this workflow, you either need the *raw* data, available on the figshare project site (see below), or the *trimmed* data, available from the European Nucleotide Archive (ENA) under project accession number [PRJEB45074](https://www.ebi.ac.uk/ena/browser/view/PRJEB45074). See the [Data Availability](data-availability.html) page for more details. This workflow contains the code we used to process the 16S rRNA and ITS data sets using [DADA2](https://benjjneb.github.io/dada2/) [@callahan2016dada2]. Workflow construction is based on the [DADA2 Pipeline Tutorial (1.8)](https://benjjneb.github.io/dada2/tutorial_1_8.html) and the primer identification section of the [DADA2 ITS Pipeline Workflow (1.8)](https://benjjneb.github.io/dada2/ITS_workflow.html). # 16s rRNA Data ## Workflow Input All files needed to run this workflow can be downloaded from figshare. <iframe src="https://widgets.figshare.com/articles/14686665/embed?show_title=1" width="100%" height="351" allowfullscreen frameborder="0"></iframe> > NOTE: The reverse reads for 16S rRNA data set are unusable (see below), therefore this workflow deals with forward reads *only*. We will screen the forward reads for reverse primers and show the quality plots of the reverse reads at the beginning of the workflow. After that, reverse reads are not used. ## Overview ### Sequence Files & Samples We sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot. ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/ssu18_sample_seq_info.txt", header = TRUE, sep = "\t") ``` `r caption_tab("seq_dets_ssu")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "ssu18_sample_seq_info", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), raw_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ### Workflow The workflow consists of the following steps: | Step | Command | What we’re doing | |------|------------------------|------------------------------------------------------------| | 1 | multiple | prepare input file names & paths | | 2 | multiple | Define primers (all orientations) | | 3 | `cutadapt` | Remove primers | | 4 | `filterAndTrim()` | assess quality & filter reads | | 5 | `learnErrors()` | generate an error model for the data | | 6 | `derepFastq` | dereplicate sequences | | 7 | `dada()` | infer ASVs on both forward & reverse reads independently | | 8 | `makeSequenceTable()` | generate a count table | | 9 | `removeBimeraDenovo()` | screen for & remove chimeras | | 10 | | track reads through workflow | | 11 | `assignTaxonomy()` | assign taxonomy & finish workflow | | 12 | `save.image()` | save an image of seqtab & taxtab for next part of workflow | ## 1. Set Working Environment Next, we need to setup the working environment by renaming the fastq files and defining a path for the working directory. ### Rename samples To make the parsing easier, we will eliminate the lane and well number from each sample. If you do not wish to do that you will need to adjust the code accordingly. I am *positive* there are more elegant ways of doing this. > CAUTION. If you use this for your own data, please check that this code works on a few backup files before proceeding. ```{r} #| warning: false setwd("RAW") orig_fastq <- list.files(pattern = "*.fastq.gz") newname_fastq <- gsub("_S.*_L001", "", orig_fastq) newname_fastq <- gsub("_001", "", newname_fastq) file.rename(orig_fastq, newname_fastq) setwd("../") ``` ```{r} path <- "RAW" head(list.files(path)) ``` ```{r} #| echo: false #| eval: true c("P1_R1.fastq.gz", "P1_R2.fastq.gz", "P1-8C_R1.fastq.gz", "P1-8C_R2.fastq.gz", "P10_R1.fastq.gz", "P10_R2.fastq.gz") ``` Then, we generate matched lists of the forward and reverse read files. We also parse out the sample name. ```{r} fnFs <- sort(list.files(path, pattern = "_R1.fastq.gz", full.names = TRUE)) fnRs <- sort(list.files(path, pattern = "_R2.fastq.gz", full.names = TRUE)) ``` ```{r} #| warning: false #| fig-height: 3 p1a <- plotQualityProfile(fnFs[1:15], aggregate = TRUE) p2a <- plotQualityProfile(fnRs[1:15], aggregate = TRUE) p3a <- grid.arrange(p1a, p2a, nrow = 1) ggsave("plot_qscores_2018a.png", p3a, width = 7, height = 3) ``` ```{r} #| echo: false system("cp files/dada2/figures/ssu18_plot_qscores_a.png include/dada2/ssu18_plot_qscores_a.png") ``` Unfortunately, the reverse reads are unusable. ```{r} #| echo: false #| eval: true #| fig-height: 4 #| fig-width: 4 #| warning: false knitr::include_graphics("include/dada2/ssu18_plot_qscores_a.png") ``` `r caption_fig("qual_scores_ssu")` ## 2. Define Primers Before we start the DADA2 workflow we need to run [catadapt](https://github.com/marcelm/cutadapt) [@martin2011cutadapt] on all `fastq.gz` files to trim the primers. For bacteria and archaea, we amplified the V4 hypervariable region of the 16S rRNA gene using the primer pair 515F (GTGCCAGCMGCCGCGGTAA) and 806R (GGACTACHVGGGTWTCTAAT) [@caporaso2011global], which should yield an amplicon length of about 253 bp. First we define the primers. ```{r} FWD <- "GTGCCAGCMGCCGCGGTAA" REV <- "GGACTACHVGGGTWTCTAAT" ``` Next, we check the presence and orientation of these primers in the data. I started doing this for ITS data because of primer read-through but I really like the general idea of doing it just to make sure nothing funny is going of with the data. To do this, we will create all orientations of the input primer sequences. In other words the *Forward*, *Complement*, *Reverse*, and *Reverse Complement* variations. ```{r} allOrients <- function(primer) { require(Biostrings) dna <- DNAString(primer) orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), RevComp = reverseComplement(dna)) return(sapply(orients, toString)) } FWD.orients <- allOrients(FWD) REV.orients <- allOrients(REV) FWD.orients REV.orients ``` ``` Forward Complement Reverse "GTGCCAGCMGCCGCGGTAA" "CACGGTCGKCGGCGCCATT" "AATGGCGCCGMCGACCGTG" RevComp "TTACCGCGGCKGCTGGCAC" Forward Complement Reverse "GGACTACHVGGGTWTCTAAT" "CCTGATGDBCCCAWAGATTA" "TAATCTWTGGGVHCATCAGG" RevComp "ATTAGAWACCCBDGTAGTCC" ``` Now we do a little pre-filter step to eliminate ambiguous bases (*Ns*) because *Ns* make mapping of short primer sequences difficult. This step removes any reads with *Ns*. Again, set some files paths, this time for the filtered reads. ```{r} fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) filterAndTrim(fnFs, fnFs.filtN, maxN = 0, multithread = TRUE) ``` Nice. Time to assess the number of times a primer (and all primer orientations) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads. Since this is 16S rRNA we do not anticipate any issues but it is worth checking anyway. ```{r} #| eval: true sampnum <- 2 ``` ```{r} primerHits <- function(primer, fn) { # Counts number of reads in which the primer is found nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) return(sum(nhits > 0)) } ``` #### Forward primers ```{r} rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[sampnum]])) ``` ``` Forward Complement Reverse RevComp FWD.ForwardReads 74707 0 0 0 ``` #### Reverse primers ```{r} rbind(REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[sampnum]])) ``` ``` Forward Complement Reverse RevComp REV.ForwardReads 0 0 0 165 ``` As expected, forward primers predominantly in the forward reads and very little evidence of reverse primers. ## 3. Remove Primers Now we can run [catadapt](https://github.com/marcelm/cutadapt) [@martin2011cutadapt] to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt. ```{r} cutadapt <- "/Users/rad/miniconda3/envs/cutadapt/bin/cutadapt" system2(cutadapt, args = "--version") # Run shell commands from R ``` ``` 2.8 ``` We set paths and trim the forward primer and the reverse-complement of the reverse primer off of R1 (forward reads) and trim the reverse primer and the reverse-complement of the forward primer off of R2 (reverse reads). ```{r} path.cut <- file.path(path, "cutadapt") if(!dir.exists(path.cut)) dir.create(path.cut) fnFs.cut <- file.path(path.cut, basename(fnFs)) FWD.RC <- dada2:::rc(FWD) REV.RC <- dada2:::rc(REV) R1.flags <- paste("-g", FWD, "-a", REV.RC) for(i in seq_along(fnFs)) {system2(cutadapt, args = c(R1.flags, "-m", 20, "-n", 2, "-e", 0.12, "-o", fnFs.cut[i], fnFs.filtN[i]))} ``` ``` This is cutadapt 2.8 with Python 3.7.6 Command line parameters: -g GTGCCAGCMGCCGCGGTAA -a ATTAGAWACCCBDGTAGTCC \ -m 20 -n 2 -e 0.12 \ -o RAW/cutadapt/P1_R1.fastq.gz RAW/filtN/P1_R1.fastq.gz ``` *Note*. If the code above removes all of the base pairs in a sequence, you will get downstream errors unless you set the `-m` flag. This flag sets the minimum length and reads shorter than this will be discarded. Without this flag, reads of length 0 will be kept and cause issues. *Also, a lot of output will be written to the screen by cutadapt!*. We can now count the number of primers in the sequences from the output of cutadapt. ```{r} rbind( FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[sampnum]]), REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[sampnum]])) ``` ``` Forward Complement Reverse RevComp FWD.ForwardReads 0 0 0 0 REV.ForwardReads 0 0 0 0 ``` Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads. ```{r} #| echo: false temp1 <- read.table("files/dada2/tables/ssu18_read_changes.txt", header = TRUE) temp1[, 4:6] <- list(NULL) write.table(temp1, "files/dada2/tables/ssu18_cut.txt", sep = "\t", quote = FALSE) ``` ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/ssu18_cut.txt", header = TRUE, sep = "\t") seq_table$per_reads_kept <- round(seq_table$cutadapt_reads/seq_table$raw_reads, digits = 3) ``` `r caption_tab("seq_cutadapt_ssu")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "ssu18_cutadapt", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), raw_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))), pre_filt_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))), cutadapt_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ## 4. Quality Assessment & Filtering We need the forward and reverse fastq file names plus the sample names. ```{r} cutFs <- sort(list.files(path.cut, pattern = "_R1.fastq.gz", full.names = TRUE)) get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1] sample.names <- unname(sapply(cutFs, get.sample.name)) head(sample.names) ``` ```{r} #| eval: true c("P1", "P1-8C", "P10", "P2", "P3", "P3-8C") ``` First let's look at the quality of our reads. The numbers in brackets specify which samples to view. Here we are looking at an aggregate plot of all data (except the negative control). ### Quality plots ```{r} #| warning: false p1 <- plotQualityProfile(cutFs[1:15], aggregate = TRUE) ggsave("ssu18_plot_qscores.png", p1, width = 7, height = 3) ggsave("ssu18_plot_qscores.png", p1) ``` ```{r} #| echo: false system("cp files/dada2/figures/ssu18_plot_qscores_b.png include/dada2/ssu18_plot_qscores_b.png") ``` ```{r} #| echo: false #| eval: true #| warning: false #| out-width: 55% #| fig-align: center knitr::include_graphics("include/dada2/ssu18_plot_qscores_b.png") ``` `r caption_fig("qual_scores_after_ssu")` The forward reads look good. ### Filtering First, we again make some path variables and setup a new directory of filtered reads. ```{r} filtFs <- file.path(path.cut, "filtered", basename(cutFs)) ``` And then we trim the reads. ```{r} out <- filterAndTrim(cutFs, filtFs, maxN = 0, maxEE = 2, truncQ = 2, rm.phix = TRUE, compress = TRUE, multithread = 20) out ``` :::{.callout-note} These parameters should be set based on the anticipated length of the amplicon and the read quality. ::: And here is a table of how the filtering step affected the number of reads in each sample. As you can see, there are a few samples that started with a low read count to begin with---we will likely remove those samples at some point. ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/ssu18_filter.txt", header = TRUE, sep = "\t") seq_table$per_reads_kept <- round(seq_table$reads_out/seq_table$reads_in, digits = 3) ``` `r caption_tab("filter_ssu")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "ssu18_filter", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), reads_in = colDef(footer = function(values) sprintf("%.0f", sum(values))), reads_out = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ## 5. Learn Error Rates Now it is time to assess the error rate of the data. The DADA2 algorithm uses a parametric error model. Every amplicon data set has a different set of error rates and the `learnErrors` method learns this error model *from the data*. It does this by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. The algorithm begins with an initial guess, for which the maximum possible error rates in the data are used. #### Forward Reads ```{r} errF <- learnErrors(filtFs, multithread = FALSE) ``` ``` 103002523 total bases in 443994 reads from 6 samples will be used for learning the error rates. ``` ```{r} #| warning: false plotErrors(errF, nominalQ=TRUE) ``` ```{r} #| echo: false system("cp files/dada2/figures/ssu18_plot_errorF.png include/dada2/ssu18_plot_errorF.png") ``` ```{r} #| echo: false #| eval: true knitr::include_graphics("include/dada2/ssu18_plot_errorF.png") ``` `r caption_fig("error_F_ssu")` The error rates for each possible transition (A to C, A to G, etc.) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. ## 6. Dereplicate Reads Now we can use `derepFastq` to identify the unique sequences in the forward and reverse fastq files. #### Forward Reads ```{r} #| echo: false #| eval: true # TO run and cache the derepFastq you must set # `cache.lazy=FALSE` # For both code blocks ``` ```{r} derepFs <- derepFastq(filtFs) names(derepFs) <- sample.names ``` <details markdown="1"> <summary>Detailed results of `derep` forward reads</summary> <pre><code> Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R1.fastq.gz Encountered 39950 unique sequences from 79804 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R1.fastq.gz Encountered 31163 unique sequences from 72760 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R1.fastq.gz Encountered 35274 unique sequences from 62598 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R1.fastq.gz Encountered 42119 unique sequences from 81240 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R1.fastq.gz Encountered 40336 unique sequences from 86732 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R1.fastq.gz Encountered 31104 unique sequences from 60860 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R1.fastq.gz Encountered 35169 unique sequences from 63037 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R1.fastq.gz Encountered 27752 unique sequences from 54127 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R1.fastq.gz Encountered 49045 unique sequences from 109421 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R1.fastq.gz Encountered 46853 unique sequences from 94740 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R1.fastq.gz Encountered 35363 unique sequences from 71678 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R1.fastq.gz Encountered 35567 unique sequences from 73639 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R1.fastq.gz Encountered 38518 unique sequences from 72263 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R1.fastq.gz Encountered 43426 unique sequences from 89315 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R1.fastq.gz Encountered 13673 unique sequences from 33000 total sequences read. </code></pre> </details> ## 7. DADA2 & ASV Inference At this point we are ready to apply the core sample inference algorithm (`dada`) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference. If `pool = TRUE`, the algorithm will pool together all samples prior to sample inference. If `pool = FALSE`, sample inference is performed on each sample individually. If `pool = "pseudo"`, the algorithm will perform pseudo-pooling between individually processed samples. We tested all three methods through the full pipeline. Click the `+` to see the results of the test. For our final analysis, we chose `pool = FALSE` for this data set. <details markdown="1"> <summary>Show/hide Results of `dada` pooling options </summary> Here are summary tables of results from the tests. Values are from the final sequence table produced by each option. ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/ssu18_dada_compare.txt", header = TRUE, sep = "\t", row.names = 1) ``` `r caption_tab("pooling_summary_ssu")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "ssu18_pooling_summary", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/ssu18_dada_compare_by_samp.txt", header = TRUE, sep = "\t") ``` `r caption_tab("pooling_ssu")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "ssu18_pooling_results", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", minWidth = 150, filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` </details> ```{r} dadaFs <- dada(derepFs, err = errF, multithread = TRUE, pool = FALSE) ``` <details markdown="1"> <summary>Detailed results of `dada` on forward reads</summary> <pre><code> Sample 1 - 79804 reads in 39950 unique sequences. Sample 2 - 72760 reads in 31163 unique sequences. Sample 3 - 62598 reads in 35274 unique sequences. Sample 4 - 81240 reads in 42119 unique sequences. Sample 5 - 86732 reads in 40336 unique sequences. Sample 6 - 60860 reads in 31104 unique sequences. Sample 7 - 63037 reads in 35169 unique sequences. Sample 8 - 54127 reads in 27752 unique sequences. Sample 9 - 109421 reads in 49045 unique sequences. Sample 10 - 94740 reads in 46853 unique sequences. Sample 11 - 71678 reads in 35363 unique sequences. Sample 12 - 73639 reads in 35567 unique sequences. Sample 13 - 72263 reads in 38518 unique sequences. Sample 14 - 89315 reads in 43426 unique sequences. Sample 15 - 33000 reads in 13673 unique sequences. </code></pre> </details> As an example, we can inspect the returned `dada-class` object for the forward and reverse reads from the sample #`r sampnum`: ```{r} dadaFs[[sampnum]] ``` ``` dada-class: object describing DADA2 denoising results 1759 sequence variants were inferred from 31163 input unique sequences. Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16 ``` This output tells us how many true sequence variants the DADA2 algorithm inferred from the unique sequences, in this case the sample 2. ## 8. Construct Sequence Table Now we construct an amplicon sequence variant (ASV) table. ```{r} seqtab <- makeSequenceTable(dadaFs) dim(seqtab) ``` ```{r} #| echo: false #| eval: true seqtab <- c(15, 21685) seqtab ``` ```{r} table(nchar(getSequences(seqtab))) ``` ``` 22 23 24 25 27 41 66 73 120 149 185 190 193 2 1 1 2 1 1 1 1 1 1 1 9 1 198 200 201 202 208 209 211 215 217 220 221 222 223 1 1 1 1 1 1 1 1 2 4 14 21 8 224 225 227 228 229 230 231 232 233 234 238 251 3 4 6 4 5 9 20 21342 190 12 2 8 ``` The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants. We have `r (seqtab)[2]` sequence variants but also a range of sequence lengths. Since many of these sequence variants are singletons or doubletons, we can just select a range that corresponds to the expected amplicon length and eliminate the spurious reads. ```{r} #| echo: false read_length <- data.frame(nchar(getSequences(seqtab))) colnames(read_length) <- "length" plot_a <- qplot(length, data = read_length, geom = "histogram", binwidth = 1, xlab = "read length", ylab = "total variants", xlim = c(200, 275)) ggsave("read_length_before.png", plot_a, width = 7, height = 3) ``` ```{r} #| echo: false system("cp files/dada2/figures/ssu18_read_length_before.png include/dada2/ssu18_read_length_before.png") ``` ```{r} #| echo: false #| eval: true knitr::include_graphics("include/dada2/ssu18_read_length_before.png") ``` `r caption_fig("read_length_ssu")` ```{r} seqtab.2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(230,235)] dim(seqtab.2) ``` :::{.callout-note} The values you select should be based on the sequence table generated above. ::: ```{r} #| echo: false #| eval: true seqtab2 <- c(15, 21573) seqtab2 ``` ```{r} table(nchar(getSequences(seqtab.2))) ``` ``` 230 231 232 233 234 9 20 21342 190 12 ``` After removing the extreme length variants, we have `r (seqtab2)[2]`, a reduction of `r (seqtab)[2] - (seqtab2)[2]` sequence variants. ```{r} #| echo: false saveRDS(seqtab.2, "ssu18_seqtab.rds") ``` ## 9. Remove Chimeras Even though the `dada` method corrects substitution and indel errors, chimeric sequences remain. According to the DADA2 documentation, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be *exactly reconstructed* by combining a left-segment and a right-segment from two more abundant *parent* sequences. ```{r} seqtab.nochim <- removeBimeraDenovo(seqtab.2, method = "consensus", multithread = 20, verbose = TRUE) dim(seqtab.nochim) ``` ``` Identified 1241 bimeras out of 21573 input sequences. ``` ```{r} #| echo: false #| eval: true seqtab3 <- c(15, 20332) seqtab3 ``` ```{r} sum(seqtab.nochim)/sum(seqtab.2) ``` ```{r} #| echo: false #| eval: true chim_rem <- 0.9596595 0.9596595 ``` Chimera checking removed an additional `r (seqtab2)[2] - (seqtab3)[2]` sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about `r 100*(1-chim_rem)`% of the merged sequence reads. Not bad. The last thing we want to do is write the sequence table to an `RDS` file. ```{r} saveRDS(seqtab.nochim, "ssu18_seqtab.nochim.rds") ``` ## 10. Track Reads through Workflow At this point we can look at the number of reads that made it through each step of the workflow for every sample. ```{r} getN <- function(x) sum(getUniques(x)) track <- cbind(out, sapply(dadaFs, getN), rowSums(seqtab.nochim)) colnames(track) <- c("input", "filtered", "denoisedF", "nonchim") rownames(track) <- sample.names ``` ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/ssu18_read_changes.txt", header = TRUE, sep = "\t") ``` `r caption_tab("track_ssu")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "ssu18_track_read_changes", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), raw = colDef(footer = function(values) sprintf("%.0f", sum(values))), pre_filt = colDef(footer = function(values) sprintf("%.0f", sum(values))), pre_filt = colDef(footer = function(values) sprintf("%.0f", sum(values))), cut = colDef(footer = function(values) sprintf("%.0f", sum(values))), filtered = colDef(footer = function(values) sprintf("%.0f", sum(values))), denoisedF = colDef(footer = function(values) sprintf("%.0f", sum(values))), nonchim = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` And we can save a copy of the table for later use. ```{r} write.table(track, "ssu18_read_changes.txt", sep = "\t", quote = FALSE, col.names=NA) ``` ```{r} #| echo: false save.image("pre_tax_16s_2018.rdata") ``` ## 11. Assign Taxonomy The `assignTaxonomy` command implements the naive Bayesian classifier, so for reproducible results you need to set a random number seed (see issue [#538](https://github.com/benjjneb/dada2/issues/538)). We did this at the beginning of the workflow. For taxonomic assignment, we are using the Silva version 138 [@quast2012silva]. The developers of DADA2 maintain a [formatted version of the database](https://benjjneb.github.io/dada2/training.html). We will read in the `RDS` file containing the sequence table saved above. We also need to run `remove(list = ls())` command, otherwise the final image we save will be huge. This way, the image only contains the sample data, seqtab, and taxtab *after* running `removeBimeraDenovo`. ```{r} remove(list = ls()) seqtab <- readRDS("ssu18_seqtab.nochim.rds") ``` And then native Bayesian [@wang2007naive] classifier against the Silva database. ```{r} tax_silva <- assignTaxonomy( seqtab, "silva_nr_v138_train_set.fa.gz", multithread = TRUE) ``` ## 12. Save Image And finally, we save an image for use in the analytic part of the workflow. This R data file will be needed as the input for the phyloseq portion of the workflow. See the [Data Availability](data-availability.html) page for complete details on where to get this file. ```{r} save.image("ssu18_dada2_wf.rdata") ``` ```{r} #| echo: false # Classify with IdTAX db remove(list = ls()) seqtab <- readRDS("ssu18_seqtab.nochim.rds") tax_silva <- assignTaxonomy(seqtab, "silva_nr_v138_train_set.fa.gz", multithread = TRUE) dna <- DNAStringSet(getSequences(seqtab)) # Create a DNAStringSet from the ASVs load("SILVA_SSU_r138_2018.RData") ids <- IdTaxa(dna, trainingSet, strand = "top", processors = NULL, verbose = FALSE) # use all processors ranks <- c("domain", "phylum", "class", "order", "family", "genus") # ranks of interest # Convert the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy tax_id <- t(sapply(ids, function(x) { m <- match(ranks, x$rank) taxa <- x$taxon[m] taxa[startsWith(taxa, "unclassified_")] <- NA taxa })) colnames(tax_id) <- ranks; rownames(tax_id) <- getSequences(seqtab) save.image("ssu18_data_prep_wf_ALT.rdata") saveRDS(seqtab, "ssu18_seqtab.nochim.2.rds") saveRDS(tax_silva, "ssu18_tax_silva.rds") saveRDS(tax_id, "ssu18_tax_id.rds") ``` ## R Session Information & Code This workflow was run on the [Smithsonian High Performance Cluster (SI/HPC)](https://doi.org/10.25572/SIHPC), Smithsonian Institution. Below are the specific packages and versions used in this workflow using both `sessionInfo()` and `devtools::session_info()`. Click the arrow to see the details. <details markdown="1"> <summary>Show/hide R Session Info</summary> ``` sessionInfo() R version 4.0.0 (2020-04-24) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /home/scottjj/miniconda3/envs/R-4/lib/libopenblasp-r0.3.9.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods base other attached packages: [1] gridExtra_2.3 ggplot2_3.3.0 [3] ShortRead_1.46.0 GenomicAlignments_1.24.0 [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0 [7] matrixStats_0.56.0 Biobase_2.48.0 [9] Rsamtools_2.4.0 GenomicRanges_1.40.0 [11] GenomeInfoDb_1.24.0 Biostrings_2.56.0 [13] XVector_0.28.0 IRanges_2.22.1 [15] S4Vectors_0.26.0 BiocParallel_1.22.0 [17] BiocGenerics_0.34.0 dada2_1.16.0 [19] Rcpp_1.0.4.6 loaded via a namespace (and not attached): [1] plyr_1.8.6 RColorBrewer_1.1-2 pillar_1.4.4 [4] compiler_4.0.0 bitops_1.0-6 tools_4.0.0 [7] zlibbioc_1.34.0 lifecycle_0.2.0 tibble_3.0.1 [10] gtable_0.3.0 lattice_0.20-41 png_0.1-7 [13] pkgconfig_2.0.3 rlang_0.4.6 Matrix_1.2-18 [16] GenomeInfoDbData_1.2.3 withr_2.2.0 stringr_1.4.0 [19] hwriter_1.3.2 vctrs_0.3.0 glue_1.4.1 [22] R6_2.4.1 jpeg_0.1-8.1 latticeExtra_0.6-29 [25] reshape2_1.4.4 magrittr_1.5 scales_1.1.1 [28] ellipsis_0.3.1 colorspace_1.4-1 stringi_1.4.6 [31] RCurl_1.98-1.2 RcppParallel_5.0.1 munsell_0.5.0 [34] crayon_1.3.4 devtools::session_info() ─ Session info ─────────────────────────────────────────────────────────────── setting value version R version 4.0.0 (2020-04-24) os CentOS Linux 7 (Core) system x86_64, linux-gnu ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2020-06-17 ─ Packages ─────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0) backports 1.1.7 2020-05-13 [1] CRAN (R 4.0.0) Biobase * 2.48.0 2020-04-27 [1] Bioconductor BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor BiocParallel * 1.22.0 2020-04-27 [1] Bioconductor Biostrings * 2.56.0 2020-04-27 [1] Bioconductor bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.0) callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0) cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0) colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0) crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0) dada2 * 1.16.0 2020-04-27 [1] Bioconductor DelayedArray * 0.14.0 2020-04-27 [1] Bioconductor desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0) devtools 2.3.0 2020-04-10 [1] CRAN (R 4.0.0) digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0) ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0) fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0) fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0) GenomeInfoDb * 1.24.0 2020-04-27 [1] Bioconductor GenomeInfoDbData 1.2.3 2020-05-22 [1] Bioconductor GenomicAlignments * 1.24.0 2020-04-27 [1] Bioconductor GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 4.0.0) glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.0) gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.0.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0) hwriter 1.3.2 2014-09-10 [1] CRAN (R 4.0.0) IRanges * 2.22.1 2020-04-28 [1] Bioconductor jpeg 0.1-8.1 2019-10-24 [1] CRAN (R 4.0.0) lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.0) latticeExtra 0.6-29 2019-12-19 [1] CRAN (R 4.0.0) lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0) magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0) Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.0) matrixStats * 0.56.0 2020-03-13 [1] CRAN (R 4.0.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0) pillar 1.4.4 2020-05-05 [1] CRAN (R 4.0.0) pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 4.0.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 4.0.0) plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0) png 0.1-7 2013-12-03 [1] CRAN (R 4.0.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0) processx 3.4.2 2020-02-09 [1] CRAN (R 4.0.0) ps 1.3.3 2020-05-08 [1] CRAN (R 4.0.0) R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0) RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.0) Rcpp * 1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0) RcppParallel 5.0.1 2020-05-06 [1] CRAN (R 4.0.0) RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0) remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0) reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.0) rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0) Rsamtools * 2.4.0 2020-04-27 [1] Bioconductor S4Vectors * 0.26.0 2020-04-27 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0) ShortRead * 1.46.0 2020-04-27 [1] Bioconductor stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0) stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0) SummarizedExperiment * 1.18.1 2020-04-30 [1] Bioconductor testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0) tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0) usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.0) vctrs 0.3.0 2020-05-11 [1] CRAN (R 4.0.0) withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0) XVector * 0.28.0 2020-04-27 [1] Bioconductor zlibbioc 1.34.0 2020-04-27 [1] Bioconductor ``` </details> # ITS Data ```{r} #| include: false remove(list = ls()) ``` ```{r} #| echo: false #| eval: true #| results: hide # Create the caption(s) with captioner caption_tab <- captioner(prefix = "(ITS) Table", suffix = " |", style = "b") caption_fig <- captioner(prefix = "(ITS) Figure", suffix = " |", style = "b") # Create a function for referring to the tables in text ref <- function(x) str_extract(x, "[^|]*") %>% trimws(which = "right", whitespace = "[ ]") ``` ```{r} #| echo: false #| eval: true #| results: hide ## Table captions caption_tab("seq_dets_its", "Sample data & associated sequencing information.") caption_tab("seq_cutadapt_its", "Total reads per sample after prefiltering and primer removal (using cutadapt).") caption_tab("filter_its", "Total reads per sample after filtering.") caption_tab("pooling_summary_its", "Total number of reads, total number of ASVs, minimum and maximum ASVs, followed by the number of singletons, doubletons, etc. for pooling options.") caption_tab("pooling_its", "Total number of reads and ASVs by sample for pooling options.") caption_tab("track_its", "Tracking reads changes at each step of the DADA2 workflow.") #`r caption_tab("qual_scores_after_its")` ``` ```{r} #| echo: false #| eval: true #| results: hide ## Figure captions caption_fig("qual_scores_its", "Aggregated quality score plots for forward (left) & reverse (right) reads before filtering.") caption_fig("qual_scores_after_its", "Aggregated quality score plot for forward reads after prefilter & primer removal.") caption_fig("error_F_its", "Forward reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.") caption_fig("error_R_its", "Reverse reads: Observed frequency of each transition (e.g., T -> G) as a function of the associated quality score.") caption_fig("read_length_its", "Distribution of read length by total ASVs before removing extreme length variants.") #`r caption_fig("read_length_its")` ``` ## Workflow Input All files needed to run this workflow can be downloaded from figshare. <iframe src="https://widgets.figshare.com/articles/14686755/embed?show_title=1" width="568" height="351" allowfullscreen frameborder="0"></iframe> ## Overview ### Sequence Files & Samples We sequenced a total of 15 samples collected from 10 different plots at a single depth in each plot. ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/its18_sample_seq_info.txt", header = TRUE, sep = "\t") ``` `r caption_tab("seq_dets_its")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "its18_sample_seq_info", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), raw_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ### Workflow Our workflow is pretty much ripped from the [DADA2 ITS Workflow (1.8)](https://benjjneb.github.io/dada2/ITS_workflow.html) on the DADA2 website. That webpage contains thorough explanations for each step so we will not repeat most of that here. For more details, check out the post. The workflow consists of the following steps: | Step | Command | What we’re doing | |------|------------------------|------------------------------------------------------------| | 1 | multiple | prepare input file names & paths | | 2 | multiple | Define primers (all orientations) | | 3 | `cutadapt` | Remove primers | | 4 | `filterAndTrim()` | assess quality & filter reads | | 5 | `learnErrors()` | generate an error model for the data | | 6 | `derepFastq` | dereplicate sequences | | 7 | `dada()` | infer ASVs on both forward & reverse reads independently | | 8 | `mergePairs()` | merge forward & reverse reads to further refine ASVs | | 9 | `makeSequenceTable()` | generate a count table | | 10 | `removeBimeraDenovo()` | screen for & remove chimeras | | 11 | | track reads through workflow | | 12 | `assignTaxonomy()` | assign taxonomy & finish workflow | | 13 | `save.image()` | save an image of seqtab & taxtab for next part of workflow | ## 1. Set Working Environment Next, we need to setup the working environment by renaming the fastq files and defining a path for the working directory. #### Rename samples To make the parsing easier, we will eliminate the lane and well number from each sample. If you do not wish to do that you will need to adjust the code accordingly. I am *positive* there are more elegant ways of doing this. > CAUTION. You should check that this code works on a few backup files before proceeding. ```{r} #| warning: false setwd("RAW") orig_fastq <- list.files(pattern = "*.fastq.gz") newname_fastq <- gsub("_S.*_L001", "", orig_fastq) newname_fastq <- gsub("_001", "", newname_fastq) file.rename(orig_fastq, newname_fastq) setwd("../") ``` ```{r} path <- "RAW" head(list.files(path)) ``` ```{r} #| echo: false #| eval: true c("P1_R1.fastq.gz", "P1_R2.fastq.gz", "P1-8C_R1.fastq.gz", "P1-8C_R2.fastq.gz", "P10_R1.fastq.gz", "P10_R2.fastq.gz") ``` Then, we generate matched lists of the forward and reverse read files. We also parse out the sample name. ```{r} fnFs <- sort(list.files(path, pattern = "_R1.fastq.gz", full.names = TRUE)) fnRs <- sort(list.files(path, pattern = "_R2.fastq.gz", full.names = TRUE)) ``` ## 2. Define Primers Before we start the DADA2 workflow we need to run [catadapt](https://github.com/marcelm/cutadapt) [@martin2011cutadapt] on all `fastq.gz` files to trim the primers. For this part of the study we used the primer pair ITS1f (CTTGGTCATTTAGAGGAAGTAA) [@gardes1993its] and ITS2 (GCTGCGTTCTTCATCGATGC) [@white1990amplification] which should yield variable amplicon lengths between 100 to 400 bp. What we are looking for is potential read-through scenarios that are possible when sequencing the length variable ITS region as described in the [DADA2 ITS Workflow (1.8)](https://benjjneb.github.io/dada2/ITS_workflow.html). Please refer to this document for a complete explanation. First we define the primers. ```{r} FWD <- "CTTGGTCATTTAGAGGAAGTAA" REV <- "GCTGCGTTCTTCATCGATGC" ``` Next, we check the presence and orientation of these primers in the data. To do this we will create all orientations of the input primer sequences. In other words the *Forward*, *Complement*, *Reverse*, and *Reverse Complement* variations. ```{r} allOrients <- function(primer) { require(Biostrings) dna <- DNAString(primer) orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), RevComp = reverseComplement(dna)) return(sapply(orients, toString)) } FWD.orients <- allOrients(FWD) REV.orients <- allOrients(REV) FWD.orients REV.orients ``` ``` Forward Complement Reverse "CTTGGTCATTTAGAGGAAGTAA" "GAACCAGTAAATCTCCTTCATT" "AATGAAGGAGATTTACTGGTTC" RevComp "TTACTTCCTCTAAATGACCAAG" Forward Complement Reverse "GCTGCGTTCTTCATCGATGC" "CGACGCAAGAAGTAGCTACG" "CGTAGCTACTTCTTGCGTCG" RevComp "GCATCGATGAAGAACGCAGC" ``` Now we do a little pre-filter step to eliminate ambiguous bases (*Ns*) because *Ns* make mapping of short primer sequences difficult. This step removes any reads with *Ns*. Again, set some files paths, this time for the filtered reads. ```{r} fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) fnRs.filtN <- file.path(path, "filtN", basename(fnRs)) filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE) ``` Sweet. Time to assess the number of times a primer (and all possible primer orientation) appear in the forward and reverse reads. According to the workflow, counting the primers on one set of paired end fastq files is sufficient to see if there is a problem. This assumes that all the files were created using the same library prep. Basically for both primers, we will search for all four orientations in both forward and reverse reads ```{r} sampnum <- 1 primerHits <- function(primer, fn) { # Counts number of reads in which the primer is found nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) return(sum(nhits > 0)) } ``` #### Forward primers ```{r} rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[sampnum]]), FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[sampnum]])) ``` ``` Forward Complement Reverse RevComp FWD.ForwardReads 48467 0 0 0 FWD.ReverseReads 0 0 0 5078 ``` #### Reverse primers ```{r} rbind(REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[sampnum]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[sampnum]])) ``` ``` Forward Complement Reverse RevComp REV.ForwardReads 0 0 0 6237 REV.ReverseReads 43902 0 0 0 ``` ```{r} #| echo: false fnFsXX <- sort(list.files(path, pattern = "_R1.fastq.gz", full.names = FALSE)) qsamp <- fnFsXX[sampnum] qsamp <- sapply(strsplit(qsamp, "_"), `[`, 1) ### add this for inline calc `r countLines(fnFs.filtN[[sampnum]])/4` ``` What does this table mean? I wondered the same thing. Let's break it down. Sample P1 had 50,894 sequences after the original filtering described earlier. The code searched the forward and reverse fastq files for all 8 primers. If we look at the two outputs, we see the forward primer is found in the forward reads in its forward orientation but also in some reverse reads in its reverse-complement orientation. The reverse primer is found in the reverse reads in its forward orientation but also in some forward reads in its reverse-complement orientation. This is due to read-through when the ITS region is short. ## 3. Remove Primers Now we can run [catadapt](https://github.com/marcelm/cutadapt) [@martin2011cutadapt] to remove the primers from the fastq sequences. A little setup first. If this command executes successfully it means R has found cutadapt. ```{r} cutadapt <- "/PATH/to/cutadapt" system2(cutadapt, args = "--version") # Run shell commands from R ``` ``` 2.8 ``` We set paths and trim the forward primer and the reverse-complement of the reverse primer off of R1 (forward reads) and trim the reverse primer and the reverse-complement of the forward primer off of R2 (reverse reads). ```{r} path.cut <- file.path(path, "cutadapt") if(!dir.exists(path.cut)) dir.create(path.cut) fnFs.cut <- file.path(path.cut, basename(fnFs)) fnRs.cut <- file.path(path.cut, basename(fnRs)) FWD.RC <- dada2:::rc(FWD) REV.RC <- dada2:::rc(REV) R1.flags <- paste("-g", FWD, "-a", REV.RC) R2.flags <- paste("-G", REV, "-A", FWD.RC) for(i in seq_along(fnFs)) {system2(cutadapt, args = c(R1.flags, R2.flags, "-m", 20, "-n", 2, "-o", fnFs.cut[i], "-p", fnRs.cut[i], fnFs.filtN[i], fnRs.filtN[i]))} ``` ``` This is cutadapt 2.8 with Python 3.7.6 Command line parameters: -g CTTGGTCATTTAGAGGAAGTAA -a GCATCGATGAAGAACGCAGC \ -G GCTGCGTTCTTCATCGATGC -A TTACTTCCTCTAAATGACCAAG \ -m 20 -n 2 \ -o RAW/cutadapt/P1_R1.fastq.gz \ -p RAW/cutadapt/P1_R2.fastq.gz RAW/filtN/P1_R1.fastq.gz RAW/filtN/P1_R2.fastq.gz ``` *Note*. If the code above removes all of the base pairs in a sequence, you will get downstream errors unless you set the `-m` flag. This flag sets the minimum length and reads shorter than this will be discarded. Without this flag, reads of length 0 will be kept and cause issues. *Also, a lot of output will be written to the screen by cutadapt!* We can now count the number of primers in the sequences from the output of cutadapt. ```{r} rbind( FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[sampnum]]), FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[sampnum]]), REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[sampnum]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[sampnum]])) ``` ``` Forward Complement Reverse RevComp FWD.ForwardReads 0 0 0 0 FWD.ReverseReads 0 0 0 0 REV.ForwardReads 0 0 0 0 REV.ReverseReads 0 0 0 0 ``` Basically, primers are no longer detected in the cutadapted reads. Now, for each sample, we can take a look at how the pre-filtering step and primer removal affected the total number of raw reads. ```{r} #| echo: false temp1 <- read.table("files/dada2/tables/its18_read_changes.txt", header = TRUE) temp1[, 4:8] <- list(NULL) write.table(temp1, "files/dada2/tables/its18_cut.txt", sep = "\t", quote = FALSE) ``` ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/its18_cut.txt", header = TRUE, sep = "\t") seq_table$per_reads_kept <- round(seq_table$cutadapt_reads/seq_table$raw_reads, digits = 3) ``` `r caption_tab("seq_cutadapt_its")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "its18_cutadapt", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), raw_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))), pre_filt_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))), cutadapt_reads = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ## 4. Quality Assessment & Filtering We need the forward and reverse fastq file names and the sample names. ```{r} cutFs <- sort(list.files(path.cut, pattern = "_R1.fastq.gz", full.names = TRUE)) cutRs <- sort(list.files(path.cut, pattern = "_R2.fastq.gz", full.names = TRUE)) get.sample.name <- function(fname) strsplit(basename(fname), "_")[[1]][1] sample.names <- unname(sapply(cutFs, get.sample.name)) head(sample.names) ``` ```{r} #| echo: false #| eval: true c("P1", "P1-8C", "P10", "P2", "P3", "P3-8C") ``` ### Quality plots Then we inspect read quality. ```{r} p1 <- plotQualityProfile(cutFs[1:15], aggregate = TRUE) p2 <- plotQualityProfile(cutRs[1:15], aggregate = TRUE) p3 <- grid.arrange(p1, p2, nrow = 1) ggsave("plot_qscores_2018.png", p3, width = 7, height = 3) ``` ```{r} #| echo: false system("cp files/dada2/figures/its18_plot_qscores.png include/dada2/its18_plot_qscores.png") ``` ```{r} #| echo: false #| eval: true #| fig-height: 3 #| warning: false knitr::include_graphics("include/dada2/its18_plot_qscores.png") ``` `r caption_fig("qual_scores_its")` ### Filtering ```{r} filtFs <- file.path(path.cut, "filtered", basename(cutFs)) filtRs <- file.path(path.cut, "filtered", basename(cutRs)) ``` We use a `minLen` value to get rid of very short length sequences. ```{r} out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 2), truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = TRUE) out ``` :::{.callout-note} These parameters should be set based on the anticipated length of the amplicon and the read quality. ::: And here is a table of how the filtering step affected the number of reads in each sample. ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/its18_filter.txt", header = TRUE, sep = "\t") seq_table$per_reads_kept <- round(seq_table$reads_out/seq_table$reads_in, digits = 3) ``` `r caption_tab("filter_its")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "its18_filter", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), reads_in = colDef(footer = function(values) sprintf("%.0f", sum(values))), reads_out = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ## 5. Learn Error Rates Time to assess the error rate of the data. The rest of the workflow is very similar to the 16S workflows presented previously. So I will basically stop talking. #### Forward Reads ```{r} errF <- learnErrors(filtFs, multithread = TRUE) ``` ``` 108788735 total bases in 495784 reads from 11 samples will be used for learning the error rates. 109358094 total bases in 495784 reads from 11 samples will be used for learning the error rates. ``` ```{r} plotErrors(errF, nominalQ = TRUE) ``` ```{r} #| echo: false p3 <- plotErrors(errF, nominalQ = TRUE) ggsave("plot_errorF_1.png", p3, width = 7, height = 5) ggsave("plot_errorF_2.png", p3) ``` ```{r} #| echo: false system("cp files/dada2/figures/its18_plot_errorF.png include/dada2/its18_plot_errorF.png") ``` ```{r} #| echo: false #| eval: true knitr::include_graphics("include/dada2/its18_plot_errorF.png") ``` `r caption_fig("error_F_its")` #### Reverse Reads ```{r} errR <- learnErrors(filtRs, multithread = TRUE) ``` ``` 109358094 total bases in 495784 reads from 11 samples will be used for learning the error rates. ``` ```{r} plotErrors(errR, nominalQ = TRUE) ``` ```{r} #| echo: false p4 <- plotErrors(errR, nominalQ = TRUE) ggsave("plot_errorR_1.png", p4, width = 7, height = 5) ggsave("plot_errorR_2.png", p4) ``` ```{r} #| echo: false system("cp files/dada2/figures/its18_plot_errorR.png include/dada2/its18_plot_errorR.png") ``` ```{r} #| echo: false #| eval: true knitr::include_graphics("include/dada2/its18_plot_errorR.png") ``` `r caption_fig("error_R_its")` ## 6. Dereplicate Reads #### Forward Reads ```{r} derepFs <- derepFastq(filtFs, verbose = TRUE) names(derepFs) <- sample.names ``` <details markdown="1"> <summary>Detailed results of derep forward reads</summary> <pre><code> Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R1.fastq.gz Encountered 7277 unique sequences from 35999 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R1.fastq.gz Encountered 11366 unique sequences from 48230 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R1.fastq.gz Encountered 11714 unique sequences from 80877 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R1.fastq.gz Encountered 13088 unique sequences from 58956 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R1.fastq.gz Encountered 10180 unique sequences from 42563 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R1.fastq.gz Encountered 4480 unique sequences from 17548 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R1.fastq.gz Encountered 12259 unique sequences from 68259 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R1.fastq.gz Encountered 160 unique sequences from 308 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R1.fastq.gz Encountered 10436 unique sequences from 48518 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R1.fastq.gz Encountered 11230 unique sequences from 51055 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R1.fastq.gz Encountered 10316 unique sequences from 43471 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R1.fastq.gz Encountered 2121 unique sequences from 9488 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R1.fastq.gz Encountered 9236 unique sequences from 41563 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R1.fastq.gz Encountered 9000 unique sequences from 35722 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R1.fastq.gz Encountered 42 unique sequences from 85 total sequences read. </code></pre> </details> #### Reverse Reads ```{r} derepRs <- derepFastq(filtRs, verbose = TRUE) names(derepRs) <- sample.names ``` <details markdown="1"> <summary>Detailed results of derep reverse reads</summary> <pre><code> Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1_R2.fastq.gz Encountered 13081 unique sequences from 35999 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P1-8C_R2.fastq.gz Encountered 17965 unique sequences from 48230 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P10_R2.fastq.gz Encountered 18610 unique sequences from 80877 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P2_R2.fastq.gz Encountered 19600 unique sequences from 58956 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3_R2.fastq.gz Encountered 16145 unique sequences from 42563 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P3-8C_R2.fastq.gz Encountered 7496 unique sequences from 17548 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P4_R2.fastq.gz Encountered 19933 unique sequences from 68259 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5_R2.fastq.gz Encountered 199 unique sequences from 308 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P5-8C_R2.fastq.gz Encountered 18495 unique sequences from 48518 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P6_R2.fastq.gz Encountered 17216 unique sequences from 51055 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7_R2.fastq.gz Encountered 17111 unique sequences from 43471 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P7-8C_R2.fastq.gz Encountered 3809 unique sequences from 9488 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P8_R2.fastq.gz Encountered 14140 unique sequences from 41563 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9_R2.fastq.gz Encountered 14856 unique sequences from 35722 total sequences read. Dereplicating sequence entries in Fastq file: RAW/cutadapt/filtered/P9-8C_R2.fastq.gz Encountered 45 unique sequences from 85 total sequences read. </code></pre> </details> ## 7. DADA2 & ASV Inference At this point we are ready to apply the core sample inference algorithm (dada) to the filtered and trimmed sequence data. DADA2 offers three options for whether and how to pool samples for ASV inference. If `pool = TRUE`, the algorithm will pool together all samples prior to sample inference. If `pool = FALSE`, sample inference is performed on each sample individually. If `pool = "pseudo"`, the algorithm will perform pseudo-pooling between individually processed samples. We tested all three methods through the full pipeline. Click the `+` to see the results of the test. For our final analysis, we chose `pool = TRUE` for this data set. <details markdown="1"> <summary>Show/hide Results of `dada` pooling options </summary> Here are a few summary tables of results from the tests. Values are from the final sequence table prodcuced by each option. ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/its18_dada_compare.txt", header = TRUE, sep = "\t", row.names = 1) ``` `r caption_tab("pooling_summary_its")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "its18_pooling_summary", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/its18_dada_compare_by_samp.txt", header = TRUE, sep = "\t") ``` `r caption_tab("pooling_its")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "its18_pooling_results", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", minWidth = 150, filterable = FALSE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` </details> #### Forward Reads ```{r} dadaFs <- dada(derepFs, err = errF, pool = TRUE, multithread = TRUE) ``` ``` 15 samples were pooled: 582642 reads in 102887 unique sequences. ``` ```{r} dadaFs[[sampnum]] ``` ``` dada-class: object describing DADA2 denoising results 1175 sequence variants were inferred from 7277 input unique sequences. Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16 ``` #### Reverse Reads ```{r} dadaRs <- dada(derepRs, err = errR, pool = TRUE, multithread = TRUE) ``` ``` 15 samples were pooled: 582642 reads in 174866 unique sequences. ``` ```{r} dadaRs[[sampnum]] ``` ``` dada-class: object describing DADA2 denoising results 1116 sequence variants were inferred from 13081 input unique sequences. Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16 ``` ```{r} save.image("pre_merge_its_2018.rdata") ``` ## 8. Merge Paired Reads ```{r} mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE) ``` <details markdown="1"> <summary>Detailed results of merging reads</summary> <pre><code> 29795 paired-reads (in 946 unique pairings) successfully merged out of 32193 (in 1177 pairings) input. 41880 paired-reads (in 719 unique pairings) successfully merged out of 46715 (in 941 pairings) input. 64805 paired-reads (in 815 unique pairings) successfully merged out of 66771 (in 1015 pairings) input. Duplicate sequences in merged output. 51144 paired-reads (in 1011 unique pairings) successfully merged out of 57613 (in 1325 pairings) input. 38442 paired-reads (in 767 unique pairings) successfully merged out of 41985 (in 994 pairings) input. 14188 paired-reads (in 349 unique pairings) successfully merged out of 16981 (in 452 pairings) input. 56594 paired-reads (in 1018 unique pairings) successfully merged out of 64411 (in 1306 pairings) input. 296 paired-reads (in 95 unique pairings) successfully merged out of 301 (in 100 pairings) input. 35713 paired-reads (in 619 unique pairings) successfully merged out of 41006 (in 786 pairings) input. 45851 paired-reads (in 955 unique pairings) successfully merged out of 50174 (in 1240 pairings) input. 36121 paired-reads (in 870 unique pairings) successfully merged out of 41541 (in 1135 pairings) input. Duplicate sequences in merged output. 9172 paired-reads (in 335 unique pairings) successfully merged out of 9403 (in 402 pairings) input. 38392 paired-reads (in 704 unique pairings) successfully merged out of 40897 (in 886 pairings) input. 29155 paired-reads (in 748 unique pairings) successfully merged out of 34979 (in 990 pairings) input. Duplicate sequences in merged output. 80 paired-reads (in 14 unique pairings) successfully merged out of 82 (in 16 pairings) input. </code></pre> </details> ```{r} #| echo: false head(mergers[[1]]) ``` ## 9. Construct Sequence Table ```{r} seqtabF <- makeSequenceTable(mergers) dim(seqtabF) ``` ```{r} #| echo: false #| eval: true seqtabX <- c(15, 3364) seqtabX ``` ```{r} table(nchar(getSequences(seqtabF))) ``` ``` 130 131 132 133 136 139 140 141 143 144 146 147 148 149 150 151 152 153 154 155 2 2 1 1 1 1 1 3 2 1 3 5 5 4 5 5 3 2 5 4 156 159 160 161 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 1 2 1 1 2 2 4 9 2 1 2 5 5 5 8 9 2 5 10 13 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 13 14 11 14 12 15 19 12 12 13 9 22 11 11 12 15 17 13 10 11 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 22 16 10 11 13 17 10 8 10 12 13 7 20 18 24 30 39 38 20 19 219 220 221 222 223 224 225 226 228 229 230 231 232 233 234 235 236 237 238 239 12 18 21 23 23 29 32 32 6 48 4 73 20 27 29 19 31 33 30 29 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 35 21 36 16 16 18 17 25 22 29 17 27 28 22 20 18 28 19 28 25 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 23 11 15 13 16 18 23 17 12 12 16 15 32 17 22 21 21 18 16 17 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 13 22 13 18 12 13 13 18 17 15 14 18 8 11 13 17 12 11 13 9 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 16 10 9 3 8 12 8 13 20 15 8 13 16 11 10 8 18 10 8 12 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 8 5 12 17 13 9 8 8 9 12 16 13 14 11 12 13 18 10 11 10 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 9 11 8 8 5 15 8 16 11 15 4 9 7 9 12 14 5 12 11 5 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 2 4 11 5 2 14 6 12 7 5 6 4 3 9 5 10 8 9 4 6 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 5 3 7 7 4 2 3 4 3 5 5 1 3 4 3 6 5 4 3 2 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 421 11 5 3 3 3 3 2 1 2 3 3 1 3 3 1 2 2 1 2 2 422 423 424 427 430 431 432 434 436 437 439 440 441 442 443 446 447 448 2 3 2 1 3 2 4 2 1 2 4 3 1 2 1 1 2 1 ``` ```{r} #| echo: false read_length <- data.frame(nchar(getSequences(seqtab))) colnames(read_length) <- "length" plot_a <- qplot(length, data = read_length, geom = "histogram", binwidth = 1, xlab = "read length", ylab = "total variants", xlim = c(200, 300)) ggsave("read_length_before.png", plot_a, width = 7, height = 3) ``` ```{r} #| echo: false system("cp files/dada2/figures/its18_read_length_before.png include/dada2/its18_read_length_before.png") ``` ```{r} #| echo: false #| eval: true knitr::include_graphics("include/dada2/its18_read_length_before.png") ``` `r caption_fig("read_length_its")` As expected, a wide range of length variability in the the amplified ITS region. ```{r} seqtab <- seqtabF[,nchar(colnames(seqtabF)) %in% seq(100,450)] dim(seqtab) ``` ```{r} #| echo: false #| eval: true seqtabY <- c(15, 3364) seqtabY ``` After removing the extreme length variants, we have `r (seqtabY)[2]`, a reduction of `r (seqtabX)[2] - (seqtabY)[2]` sequence variants. ```{r} table(nchar(getSequences(seqtab))) ``` ```{r} #| echo: false saveRDS(seqtab, "its18_seqtab.rds") ``` ## 10. Remove Chimeras ```{r} seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, verbose = TRUE) dim(seqtab.nochim) ``` ``` Identified 7 bimeras out of 3364 input sequences. ``` ```{r} #| echo: false #| eval: true seqtabZ <- c(15, 3357) seqtabZ ``` ```{r} sum(seqtab.nochim)/sum(seqtab.2) ``` ```{r} #| echo: false #| eval: true chim_rem <- 0.9990135 0.9990135 ``` Chimera checking removed an additional `r (seqtabY)[2] - (seqtabZ)[2]` sequence variants however, when we account for the abundances of each variant, we see chimeras accounts for about `r 100*(1-chim_rem)`% of the merged sequence reads. Curious. ```{r} table(colSums(seqtab.nochim>0)) ``` ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1247 711 443 289 204 123 102 86 49 32 24 15 23 7 2 ``` ```{r} table(colSums(seqtab.nochim>0)) ``` ``` 14 95 335 349 616 702 719 745 765 812 867 945 954 1011 1017 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ``` The last thing we want to do is write the sequence table to an `RDS` file. ```{r} saveRDS(seqtab.nochim, "its18_seqtab.nochim.rds") ``` ```{r} #| echo: false table(nchar(getSequences(seqtab.nochim))) ``` ## 11. Track Reads through Workflow ```{r} #| echo: false #| eval: true seq_table <- read.table("files/dada2/tables/its18_read_changes.txt", header = TRUE, sep = "\t") ``` `r caption_tab("track_its")` ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "its18_track_read_changes", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads"), raw = colDef(footer = function(values) sprintf("%.0f", sum(values))), pre_filt = colDef(footer = function(values) sprintf("%.0f", sum(values))), pre_filt = colDef(footer = function(values) sprintf("%.0f", sum(values))), cut = colDef(footer = function(values) sprintf("%.0f", sum(values))), filtered = colDef(footer = function(values) sprintf("%.0f", sum(values))), denoisedF = colDef(footer = function(values) sprintf("%.0f", sum(values))), denoisedR = colDef(footer = function(values) sprintf("%.0f", sum(values))), merged = colDef(footer = function(values) sprintf("%.0f", sum(values))), nonchim = colDef(footer = function(values) sprintf("%.0f", sum(values))) ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ```{r} #| echo: false save.image("pre_tax_its_2018.rdata") ``` ## 12. Assign Taxonomy For taxonomic classification, we used the UNITE [@nilsson2019unite] database, specifically the [UNITE general FASTA release for Fungi (v. 04.02.2020)](https://plutof.ut.ee/#/doi/10.15156/BIO/786368) [@abarenkov2020unite]. ```{r} #| echo: false remove(list = ls()) seqtab <- readRDS("its18_seqtab.nochim.rds") ``` ```{r} tax <- assignTaxonomy(seqtab, "sh_general_release_dynamic_s_04.02.2020_dev.fasta", multithread = 20, tryRC = TRUE) saveRDS(tax, "its18_tax_its.rds") ``` ``` UNITE fungal taxonomic reference detected. ``` ## 13. Save Image ```{r} save.image("its18_dada2_wf.rdata") ``` ## R Session Information & Code This workflow was run on the [Smithsonian High Performance Cluster (SI/HPC)](https://doi.org/10.25572/SIHPC), Smithsonian Institution. Below are the specific packages and versions used in this workflow using both `sessionInfo()` and `devtools::session_info()`. Click the arrow to see the details. <details markdown="1"> <summary>Show/hide R Session Info</summary> ``` sessionInfo() R version 4.0.0 (2020-04-24) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /home/scottjj/miniconda3/envs/R-4/lib/libopenblasp-r0.3.9.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 parallel stats graphics grDevices utils [8] datasets methods base other attached packages: [1] gridExtra_2.3 ggplot2_3.3.0 [3] ShortRead_1.46.0 GenomicAlignments_1.24.0 [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0 [7] matrixStats_0.56.0 Biobase_2.48.0 [9] Rsamtools_2.4.0 GenomicRanges_1.40.0 [11] GenomeInfoDb_1.24.0 Biostrings_2.56.0 [13] XVector_0.28.0 IRanges_2.22.1 [15] S4Vectors_0.26.0 BiocParallel_1.22.0 [17] BiocGenerics_0.34.0 dada2_1.16.0 [19] Rcpp_1.0.4.6 loaded via a namespace (and not attached): [1] plyr_1.8.6 RColorBrewer_1.1-2 pillar_1.4.4 [4] compiler_4.0.0 bitops_1.0-6 tools_4.0.0 [7] zlibbioc_1.34.0 lifecycle_0.2.0 tibble_3.0.1 [10] gtable_0.3.0 lattice_0.20-41 png_0.1-7 [13] pkgconfig_2.0.3 rlang_0.4.6 Matrix_1.2-18 [16] GenomeInfoDbData_1.2.3 withr_2.2.0 stringr_1.4.0 [19] hwriter_1.3.2 vctrs_0.3.0 glue_1.4.1 [22] R6_2.4.1 jpeg_0.1-8.1 latticeExtra_0.6-29 [25] reshape2_1.4.4 magrittr_1.5 scales_1.1.1 [28] ellipsis_0.3.1 colorspace_1.4-1 stringi_1.4.6 [31] RCurl_1.98-1.2 RcppParallel_5.0.1 munsell_0.5.0 [34] crayon_1.3.4 devtools::session_info() ─ Session info ─────────────────────────────────────────────────────────────── setting value version R version 4.0.0 (2020-04-24) os CentOS Linux 7 (Core) system x86_64, linux-gnu ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/New_York date 2020-06-04 ─ Packages ─────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0) backports 1.1.7 2020-05-13 [1] CRAN (R 4.0.0) Biobase * 2.48.0 2020-04-27 [1] Bioconductor BiocGenerics * 0.34.0 2020-04-27 [1] Bioconductor BiocParallel * 1.22.0 2020-04-27 [1] Bioconductor Biostrings * 2.56.0 2020-04-27 [1] Bioconductor bitops 1.0-6 2013-08-17 [1] CRAN (R 4.0.0) callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.0) cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.0) colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.0) crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.0) dada2 * 1.16.0 2020-04-27 [1] Bioconductor DelayedArray * 0.14.0 2020-04-27 [1] Bioconductor desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0) devtools 2.3.0 2020-04-10 [1] CRAN (R 4.0.0) digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.0) ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0) fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.0) fs 1.4.1 2020-04-04 [1] CRAN (R 4.0.0) GenomeInfoDb * 1.24.0 2020-04-27 [1] Bioconductor GenomeInfoDbData 1.2.3 2020-05-22 [1] Bioconductor GenomicAlignments * 1.24.0 2020-04-27 [1] Bioconductor GenomicRanges * 1.40.0 2020-04-27 [1] Bioconductor ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 4.0.0) glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.0) gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.0.0) gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0) hwriter 1.3.2 2014-09-10 [1] CRAN (R 4.0.0) IRanges * 2.22.1 2020-04-28 [1] Bioconductor jpeg 0.1-8.1 2019-10-24 [1] CRAN (R 4.0.0) lattice 0.20-41 2020-04-02 [1] CRAN (R 4.0.0) latticeExtra 0.6-29 2019-12-19 [1] CRAN (R 4.0.0) lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.0) magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.0) Matrix 1.2-18 2019-11-27 [1] CRAN (R 4.0.0) matrixStats * 0.56.0 2020-03-13 [1] CRAN (R 4.0.0) memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.0) munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0) pillar 1.4.4 2020-05-05 [1] CRAN (R 4.0.0) pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 4.0.0) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0) pkgload 1.0.2 2018-10-29 [1] CRAN (R 4.0.0) plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.0) png 0.1-7 2013-12-03 [1] CRAN (R 4.0.0) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0) processx 3.4.2 2020-02-09 [1] CRAN (R 4.0.0) ps 1.3.3 2020-05-08 [1] CRAN (R 4.0.0) R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.0) RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.0.0) Rcpp * 1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0) RcppParallel 5.0.1 2020-05-06 [1] CRAN (R 4.0.0) RCurl 1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0) remotes 2.1.1 2020-02-15 [1] CRAN (R 4.0.0) reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.0) rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0) rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.0) Rsamtools * 2.4.0 2020-04-27 [1] Bioconductor S4Vectors * 0.26.0 2020-04-27 [1] Bioconductor scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0) ShortRead * 1.46.0 2020-04-27 [1] Bioconductor stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.0) stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.0) SummarizedExperiment * 1.18.1 2020-04-30 [1] Bioconductor testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.0) tibble 3.0.1 2020-04-20 [1] CRAN (R 4.0.0) usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.0) vctrs 0.3.0 2020-05-11 [1] CRAN (R 4.0.0) withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.0) XVector * 0.28.0 2020-04-27 [1] Bioconductor zlibbioc 1.34.0 2020-04-27 [1] Bioconductor ``` </details> # Workflow Output Data products generated in this workflow can be downloaded from figshare. <iframe src="https://widgets.figshare.com/articles/14687184/embed?show_title=1" width="100%" height="351" frameborder="0"></iframe> The end! ```{r} #| include: false #| eval: true remove(list = ls()) ``` <a href="data-prep.html" class="btnnav button_round" style="float: right;">Next workflow: 2. Data Set Preparation</a> ```{r} #| message: false #| results: hide #| eval: true #| echo: false remove(list = ls()) ### COmmon formatting scripts ### NOTE: captioner.R must be read BEFORE captions_XYZ.R source(file.path("assets", "functions.R")) ``` #### Source Code {.appendix} The source code for this page can be accessed on GitHub `r fa(name = "github")` by [clicking this link](`r source_code()`). #### Data Availability {.appendix} Raw fastq files available on figshare at [10.25573/data.14686665](https://doi.org/10.25573/data.14686665) for the 16S rRNA data and [10.25573/data.14686755](https://doi.org/10.25573/data.14686755) for the ITS data. Trimmed fastq files (primers removed) available through the ENA under project accession number [PRJEB45074 (ERP129199)](https://www.ebi.ac.uk/ena/browser/view/PRJEB45074). Data generated in this workflow can be accessed on figshare at [10.25573/data.14687184](https://doi.org/10.25573/data.14687184). #### Last updated on {.appendix} ```{r} #| echo: false #| eval: true Sys.time() ```