Workflow for preparation of the 16S rRNA and ITS data sets. These steps are needed before analyzing the data. In this workflow, sample groups are defined, and phyloseq objects are created and curated.
Workflow Input
You will either need to run the DADA2 workflow or download the files linked below. See the Data Availability page for complete details.
Files needed to run this workflow can be downloaded from figshare. These are the output files from the DADA2 workflow page.
These workflows share many common variable names so you must split the workflow into a script for each data set OR run the command remove(list = ls()) before beginning the next workflow.
16S rRNA Data
Prerequisites
In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 16S rRNA or begin with the output from that workflow, data_prep_ssu18_wf.rdata. See the Data Availability page for complete details.
Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 16S rRNA data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.
Read Counts Assessment
Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed through the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section. Table headers are as follows:
| Header | Description |
|---|---|
| Sample ID | the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair |
| Plot | the experimental plot number |
| Depth | the depth where the sample was collected |
| Treatment | warm or control |
| Temp | temperature of soil heating (0C, +4C, or +8C) |
| Pair | control/treatment coupling |
| input | number of raw reads |
| nochim | final read count after removing chimeras |
| Remain | percent of reads remaining from input to nonchim |
| FastqID | base name of the fastq file |
(16S rRNA) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.
Defining Groups
Load the data packet produced in the final step of the DADA2 workflow. This packet (ssu18_dada2_wf.rdata) contains the ASV-by-sample table and the ASV taxonomy table.
Rename the samples so the names include plot and depth information.
After we load the data packet, we next need to format sample names and define groups.
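Here is a minimal sketch of the name parsing, assuming the ASV-by-sample table from the DADA2 workflow is named `seqtab` and that the sample IDs follow the `P07_D00_010_W3D` convention (the `substr()` positions depend on that exact format):

```r
samples.out <- rownames(seqtab)            # sample IDs from the ASV-by-sample table
sample_name <- samples.out                 # full ID, e.g. P07_D00_010_W3D
plot        <- substr(samples.out, 1, 3)   # plot, e.g. P07
depth       <- substr(samples.out, 6, 11)  # depth, e.g. 00_010
treatment   <- substr(samples.out, 13, 13) # W (warm) or C (control)
temp        <- substr(samples.out, 14, 14) # heating temperature, e.g. 0, 3, or 8
pair        <- substr(samples.out, 15, 15) # control/treatment pair, e.g. D
```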
We have a total of 15 samples, 5 sample pairs, from 10 plots (1 depth). There are 2 treatments corresponding to 3 different temperature regimes.
And finally we define a sample data frame that holds the different groups we extracted from the sample names.
```r
# define a sample data frame
samdf <- data.frame(SamName = sample_name,
                    PLOT    = plot,
                    DEPTH   = depth,
                    TREAT   = treatment,
                    TEMP    = temp,
                    PAIR    = pair)
rownames(samdf) <- samples.out
```
For example, P07_D00_010_W3D is the sample heated at +3C from plot P07, collected from a depth of 00_010 cm. This sample is in group D, which consists of P07_D00_010_W3D, P07_D00_010_W8D, and P08_D00_010_C0D.
Create a Phyloseq Object
A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.
Note
A phyloseq object contains an ASV table (taxa abundances), sample metadata, and a taxonomy table (mapping ASVs to higher-level taxonomic classifications).
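A minimal sketch of the object construction and ASV renaming, assuming the DADA2 outputs are named `seqtab` (ASV-by-sample table) and `tax_silva` (taxonomy matrix):

```r
library(phyloseq)

# build the phyloseq object from the ASV table, sample data, and taxonomy
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
               sample_data(samdf),
               tax_table(tax_silva))

# keep each ASV sequence as an extra taxonomy column, then switch the
# taxa names from raw sequences to ASV1, ASV2, ASV3, ...
tax_table(ps) <- cbind(tax_table(ps), rownames(tax_table(ps)))
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
head(taxa_names(ps))
```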
B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.
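Roughly, that amounts to appending the new ASV IDs as a final column and giving the taxonomy table readable column names (a sketch):

```r
# the last two columns now hold the original sequence and the short ASV ID
tax_table(ps) <- cbind(tax_table(ps), rownames(tax_table(ps)))
colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
                             "Family", "Genus", "ASV_SEQ", "ASV_ID")
ps
```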
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20332 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20332 taxa by 8 taxonomic ranks ]
C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.
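Something along these lines, with the output paths adjusted to your own project layout:

```r
# export the uncurated ("full") taxonomy and sequence tables
write.table(tax_table(ps),
            "files/data-prep/tables/ssu18_full_tax_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps)),
            "files/data-prep/tables/ssu18_full_seq_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)
```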
Remove Contaminants & Unwanted Taxa
Let’s see if we have any potential contaminants. We can use some inline R code to see the taxonomy table for any taxa of interest.
Note
The code (hidden by default) is written as `r "Chloroplast" %in% tax_table(ps)`.
Are Mitochondria present? TRUE
Are Chloroplast present? TRUE
Are Eukaryota present? TRUE
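Written out as plain R, the three checks are simple membership tests against the taxonomy table:

```r
# check whether these names appear anywhere in the taxonomy table
"Mitochondria" %in% tax_table(ps)
"Chloroplast"  %in% tax_table(ps)
"Eukaryota"    %in% tax_table(ps)
```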
Let’s remove these taxa—Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from plants. We must do each of these in turn using phyloseq and it gets a little messy.
Why messy? The subset_taxa command removes anything that is NA for the specified taxonomic level or above. For example, let's say you run the subset_taxa command using Family != "Mitochondria". Seems like you should get a phyloseq object with everything except Mitochondria. But the command not only gets rid of Mitochondria, it also removes everything else that has NA for Family and above. In my experience this is not well documented, and I had to dig through the files to figure out what was happening.
Anyway, to remove the taxa we do the following:
Subset the taxa and generate a ps object of just the taxa of interest,
Select the ASV column only, turn it into a factor, and use this to remove those ASVs from the ps object.
Remove Mitochondria ASVs
Remember the original data set contained 20332 ASVs. Here we generate a file with mitochondria ASVs only.
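A sketch of the two-step pattern; the same block is reused for Chloroplast (Order) and Eukaryota (Kingdom) by swapping the rank and name:

```r
# 1) subset to Mitochondria, 2) pull their ASV IDs (column 8 of the tax
# table), 3) keep everything in ps that is NOT in that set
MT1 <- subset_taxa(ps, Family == "Mitochondria")
MT1 <- as(tax_table(MT1), "matrix")
MT1df <- as.factor(MT1[, 8])

goodTaxa <- setdiff(taxa_names(ps), MT1df)
ps_no_mito <- prune_taxa(goodTaxa, ps)
ps_no_mito
```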
We repeat the same pattern for Chloroplast and Eukaryota ASVs. Finally, we drop any ASVs that are NA at the Kingdom level; here we can use subset_taxa directly because there are no ranks above Kingdom to worry about. This last step eliminated an additional 50 Kingdom-level NA ASVs from the phyloseq object.
Rename NA taxonomic ranks
Phyloseq has an odd way of dealing with taxonomic ranks that have no value, in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to the value of the next highest classified rank. For example, ASV26 is not classified at the Genus level but is at the Family level (Xanthobacteraceae), so we change the Genus name to Family_Xanthobacteraceae. The code comes from two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.
One thing this code does is reassign the functions class and order to taxonomic ranks. This can cause issues if you need these functions.
So you need to run something like rm(class, order, phylum, kingdom) at the end of the code to remove these variables. For now, I have not come up with a better solution.
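A sketch of the renaming loop, adapted from the GitHub issues above and stopping at the Genus level (it uses short k_, p_, c_, o_, f_ prefixes for the borrowed ranks):

```r
tax.clean <- data.frame(tax_table(ps_filt))
for (i in 1:6) tax.clean[, i] <- as.character(tax.clean[, i])
tax.clean[is.na(tax.clean)] <- ""

# walk each ASV and fill empty ranks with the next-highest classified rank
for (i in 1:nrow(tax.clean)) {
  if (tax.clean[i, 2] == "") {
    kingdom <- paste("k_", tax.clean[i, 1], sep = "")
    tax.clean[i, 2:6] <- kingdom
  } else if (tax.clean[i, 3] == "") {
    phylum <- paste("p_", tax.clean[i, 2], sep = "")
    tax.clean[i, 3:6] <- phylum
  } else if (tax.clean[i, 4] == "") {
    class <- paste("c_", tax.clean[i, 3], sep = "")
    tax.clean[i, 4:6] <- class
  } else if (tax.clean[i, 5] == "") {
    order <- paste("o_", tax.clean[i, 4], sep = "")
    tax.clean[i, 5:6] <- order
  } else if (tax.clean[i, 6] == "") {
    tax.clean$Genus[i] <- paste("f", tax.clean$Family[i], sep = "_")
  }
}

tax_table(ps_filt) <- as.matrix(tax.clean)
rm(class, order, phylum, kingdom)  # remove the variables that mask base functions
```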
Note. The original code mentioned above is written for data sets that have species-level designations.
Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.
Finally, we rename the ps object. This is now our working data set.
ps_work <- ps_filt
Add Phylogenetic Tree
One final task is to add a phylogenetic tree to the phyloseq object.
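A sketch using ape::rtree() to attach a rooted tree with one tip per ASV; note that rtree() generates a random topology, not a sequence-based phylogeny:

```r
library(ape)

# random rooted tree with one tip per ASV, merged into the working object
ps_work_tree <- rtree(ntaxa(ps_work), rooted = TRUE,
                      tip.label = taxa_names(ps_work))
ps_work <- merge_phyloseq(ps_work, ps_work_tree)
ps_work
```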
Data Set Summary
Now that we have removed contaminants and unwanted taxa, we can summarize the data set. Let’s start with the data in our working phyloseq object using summarize_phyloseq from the microbiome R package (Lahti, Sudarshan, et al. 2017), as we did above.
| Metric | Results |
|---|---|
| Min. number of reads | 25088 |
| Max. number of reads | 86600 |
| Total number of reads | 936640 |
| Average number of reads | 62443 |
| Median number of reads | 62117 |
| Total ASVs | 20173 |
| Sparsity | 0.891 |
| Any ASVs sum to 1 or less? | FALSE |
| Number of singleton ASVs | NA |
| Percent of ASVs that are singletons | NA |
| Number of sample variables | 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR) |
We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample.
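A sketch of those two data frames:

```r
# total reads per sample
total_reads <- data.frame(total_reads = sample_sums(ps_work))
total_reads <- tibble::rownames_to_column(total_reads, "Sample_ID")

# observed ASVs per sample
total_asvs <- estimate_richness(ps_work, measures = "Observed")
total_asvs <- tibble::rownames_to_column(total_asvs, "Sample_ID")
```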
(16S rRNA) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.
Review
We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.
The full phyloseq object before removing contaminants.
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20332 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20332 taxa by 8 taxonomic ranks ]
ITS Data
Prerequisites
In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 2018 ITS or begin with the output from that workflow, data_prep_its18_wf.rdata. See the Data Availability page for complete details.
Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 ITS data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.
Read Counts Assessment
Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed from the beginning to the end of the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section.
Table headers are as follows:
| Header | Description |
|---|---|
| Sample ID | the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair |
| Plot | the experimental plot number |
| Depth | the depth where the sample was collected |
| Treatment | warm or control |
| Temp | temperature of soil heating (0C, +4C, or +8C) |
| Pair | control/treatment coupling |
| input | number of raw reads |
| nochim | final read count after removing chimeras |
| Remain | percent of reads remaining from input to nonchim |
| FastqID | base name of the fastq file |
(ITS) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.
The following samples have fewer than 1000 reads at the end of the workflow:
P05_D00_010_W3C, P09_D00_010_W8E.
We can also plot the final read count by the temperature where the sample was collected.
Defining Groups
Load the data packet produced in the final step of the DADA2 workflow. This packet (its18_dada2_wf.rdata) contains the ASV-by-sample table and the ASV taxonomy table.
Rename the samples so the names include plot and depth information. For convenience, we renamed any negative controls (NC) to P00_D00-000_NNN.
After we load the data packet we next need to format sample names and define groups.
For example, P08_D00_010_C0D is the sample heated at +0C from plot P08, collected from a depth of 00_010 cm. This sample is in group D, which consists of P07_D00_010_W3D, P07_D00_010_W8D, and P08_D00_010_C0D.
Create a Phyloseq Object
Fix Taxon Names
The UNITE database we used for classification adds a little prefix to every rank call, like k__ for Kingdom, o__ for Order, etc. To keep in step with our other analyses, we will remove these tags. Of course, we also make a ps object with the original names just in case. At this point we will also remove the Species rank since it is essentially useless.
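A sketch, assuming the UNITE taxonomy matrix from the DADA2 workflow is named `tax`:

```r
# drop the Species rank and keep a copy with the original prefixed names
tax_mod_o <- data.frame(tax)
tax_mod_o$Species <- NULL
tax_mod_o <- as.matrix(tax_mod_o)

# strip the UNITE rank prefixes (k__, p__, c__, o__, f__, g__)
tax_mod <- tax_mod_o
for (prefix in c("k__", "p__", "c__", "o__", "f__", "g__")) {
  tax_mod <- sub(prefix, "", tax_mod, fixed = TRUE)
}
```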
A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.
Note
A phyloseq object contains an ASV table (taxa abundances), sample metadata, and a taxonomy table (mapping ASVs to higher-level taxonomic classifications).
B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3357 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3357 taxa by 8 taxonomic ranks ]
C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.
At this point we would normally remove unwanted taxa and/or negative controls. Since we do not have any of these for this dataset, we are going to save the phyloseq object with a new name. That way, if we do decide later on to remove some taxa it will be easier to plug in the code here and then continue with the rest of the workflow.
ps_filt <- ps
Remove Low-Count Samples
Next, we can remove samples with really low read counts, say less than 500 reads.
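A sketch of the pruning, which also drops any ASVs left with zero reads after the low-count samples are removed:

```r
# keep samples with more than 500 reads
ps_filt_no_low <- prune_samples(sample_sums(ps_filt) > 500, ps_filt)

# drop ASVs that no longer have any reads in the remaining samples
ps_filt_no_nc <- prune_taxa(taxa_sums(ps_filt_no_low) > 0, ps_filt_no_low)
ps_filt_no_nc
```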
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3355 taxa and 13 samples ]
sample_data() Sample Data: [ 13 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3355 taxa by 8 taxonomic ranks ]
Rename NA taxonomic ranks
Phyloseq has an odd way of dealing with taxonomic ranks that have no value, in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to the value of the next highest classified rank. For example, ASV4 is not classified at the Genus level but is at the Family level (Nitrososphaeraceae), so we change the Genus name to Family_Nitrososphaeraceae. The code comes from two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.
One thing this code does is reassign the functions class and order to taxonomic ranks. This can cause issues if you need these functions.
So you need to run something like rm(class, order, phylum, kingdom) at the end of the code to remove these variables. For now, I have not come up with a better solution.
Before doing that, we will save a copy of the phyloseq object with the original taxa names.
Note. The original code is written for data sets that have species-level designations.
Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.
Finally, we rename the ps object. This is now our working data set.
ps_work <- ps_filt_no_nc
Data Set Summary
Now that we have removed the samples with low read counts, we can summarize the data in our working phyloseq object. Let’s start with the data set as a whole using summarize_phyloseq from the microbiome R package (Lahti, Sudarshan, et al. 2017), as we did above.
| Metric | Results |
|---|---|
| Min. number of reads | 9172 |
| Max. number of reads | 64636 |
| Total number of reads | 490767 |
| Average number of reads | 37751 |
| Median number of reads | 38371 |
| Total ASVs | 3355 |
| Sparsity | 0.671 |
| Any ASVs sum to 1 or less? | TRUE |
| Number of singleton ASVs | 5 |
| Percent of ASVs that are singletons | 0.149 |
| Number of sample variables | 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR) |
We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample.
(ITS) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.
Review
We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.
The full phyloseq object before removing contaminants.
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3357 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3357 taxa by 8 taxonomic ranks ]
Final Steps
The last tasks to complete are to a) save copies of the taxonomy and sequence tables and b) save a workspace image for the next part of the workflow.
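A sketch of these final steps, with the file paths adjusted to your own project layout:

```r
# a) save the curated taxonomy and sequence tables
write.table(tax_table(ps_work),
            "files/data-prep/tables/its18_work_tax_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps_work)),
            "files/data-prep/tables/its18_work_seq_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)

# b) save a workspace image for the next part of the workflow
save.image("data_prep_its18_wf.rdata")
```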
McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS One 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.