knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
set.seed(119)
#library(conflicted)
#pacman::p_depends(ape, local = TRUE)
#pacman::p_depends_reverse(ape, local = TRUE)
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
pacman::p_load(tidyverse, DT, microbiome, taxa, formatR,
captioner, reactable, downloadthis, fontawesome,
metacoder, ampvis2, ape, statnet.common,
install = FALSE, update = FALSE)
options(scipen=999)
knitr::opts_current$get(c(
"cache",
"cache.path",
"cache.rebuild",
"dependson",
"autodep"
))
Click here for setup information.
Workflow Input
You will either need to run the DADA2 workflow or download the files linked below. See the Data Availability page for complete details.
Files needed to run this workflow can be downloaded from figshare. These are the output files from the DADA2 workflow page.
These workflows share many common variable names so you must split the workflow into a script for each data set OR run the command remove(list = ls())
before beginning the next workflow.
16s rRNA Data
Prerequisites
In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 16S rRNA or begin with the output from that workflow, data_prep_ssu18_wf.rdata
. See the Data Availability page for complete details.
Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 16S rRNA data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.
Read Counts Assessment
Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed through the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section. Table headers are as follows:
Header | Description |
---|---|
Sample ID |
the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair |
Plot |
the experimental plot number |
Depth |
the depth where the sample was collected |
Treatment |
warm or control |
Temp |
temperature of soil heating (0C, +4C, or +8C) |
Pair |
control/treatment coupling |
input |
number of raw reads |
nochim |
final read count after removing chimeras |
Remain |
percent of reads remaining from input to nonchim
|
FastqID |
base name of the fastq file |
(16S rRNA) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.
Defining Groups
Load the data packet produced in the final step of the DADA2 workflow. This packet (
ssu18_dada2_wf.rdata
) contains the ASV-by-sample table and the ASV taxonomy table.Rename the samples so names have plot and Depth info.
After we load the data packet, we next need to format sample names and define groups.
Code
seq_table <- seq_table[order(seq_table$`root name`), ]
tmp_names <- seq_table$`Sample_ID`
load("files/dada2/rdata/ssu18_dada2_wf.rdata")
identical(seq_table$`root name`, rownames(seqtab))
rownames(seqtab) <- tmp_names
samples.out <- rownames(seqtab)
sample_name <- substr(samples.out, 1, 999)
plot <- substr(samples.out, 0, 3)
depth <- substr(samples.out, 6, 11)
treatment <- substr(samples.out, 13, 13)
temp <- substr(samples.out, 14, 14)
pair <- substr(samples.out, 15, 15)
We have a total of 15 samples, 5 sample pairs, from 10 plots (1 depth). There are 2 treatments corresponding to 3 different temperature regimes.
- And finally we define a sample data frame that holds the different groups we extracted from the sample names.
Code
#define a sample data frame
samdf <- data.frame(SamName = sample_name,
PLOT = plot,
DEPTH = depth,
TREAT = treatment,
TEMP = temp,
PAIR = pair)
rownames(samdf) <- samples.out
For example, P07_D00_010_W3D is the sample heated at +3C from plot P07, collected from a depth of 00_010cm. This sample is in group D which consists of P07_D00_010_W3D, P07_D00_010_W8D, P08_D00_010_C0D.
Create a Phyloseq Object
A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.
Code
# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
sample_data(samdf), tax_table(tax_silva))
tax_table(ps) <- cbind(tax_table(ps),
rownames(tax_table(ps)))
# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
rownames(tax_table(ps)))
[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"
So the complete data set contains 20332 ASVs. We can use the microbiome R package (Lahti, Sudarshan, et al. 2017) to get some additional summary data from the phyloseq object.
Metric | Results |
---|---|
Min. number of reads | 25151 |
Max. number of reads | 86841 |
Total number of reads | 937761 |
Average number of reads | 62517 |
Median number of reads | 62171 |
Total ASVs | 20332 |
Sparsity | 0.898 |
Any ASVs sum to 1 or less? | FALSE |
Number of singleton ASVs | NA |
Percent of ASVs that are singletons | NA |
Number of sample variables are: | 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR) |
B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20332 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20332 taxa by 8 taxonomic ranks ]
C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full
to indicate that these are the raw outputs.
write.table(tax_table(ps),
"files/data-prep/tables/ssu18_full_tax_table.txt",
sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps)),
"files/data-prep/tables/ssu18_full_seq_table.txt",
sep = "\t", quote = FALSE, col.names = NA)
Remove Contaminants & Unwanted Taxa
Let’s see if we have any potential contaminants. We can use some inline R code to see the taxonomy table for any taxa of interest.
- Are Mitochondria present? TRUE
- Are Chloroplast present? TRUE
- Are Eukaryota present? TRUE
Let’s remove these taxa—Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from plants. We must do each of these in turn using phyloseq and it gets a little messy.
Why messy? The subset_taxa
command removes anything that is NA
for the specified taxonomic level or above. For example, lets say you run the subset_taxa
command using Family != "Mitochondria
“. Seems like you should get a phyloseq object with everything except Mitochondria. But actually the command not only gets rid of Mitochondria but everything else that has NA
for Family and above. In my experience this is not well documented and I had to dig through the files to figure out what was happening.
Anyway, to remove the taxa we do the following:
- Subset the taxa and generate a
ps
object of just the taxa of interest, - Select the ASV column only, turn it into a factor, and use this to remove
from the ps
object.
Remove Mitochondria ASVs
Remember the original data set contained 20332 ASVs. Here we generate a file with mitochondria ASVs only.
MT1 <- subset_taxa(ps, Family == "Mitochondria")
MT1 <- as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps), MT1df)
ps_no_mito <- prune_taxa(goodTaxa, ps)
ps_no_mito
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20291 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20291 taxa by 8 taxonomic ranks ]
Looks like this removed 41 Mitochondria ASVs. We will duplicate the code block to remove other groups.
Remove Chloroplast ASVs
And again with Chloroplast ASVs only.
CH1 <- subset_taxa(ps_no_mito, Order == "Chloroplast")
CH1 <- as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_no_mito), CH1df)
ps_no_chloro <- prune_taxa(goodTaxa, ps_no_mito)
ps_no_chloro
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20274 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20274 taxa by 8 taxonomic ranks ]
The code removed an additional 17 Chloroplast ASVs.
Remove Eukaryota ASVs
And again with Eukaryota ASVs only.
EU1 <- subset_taxa(ps_no_chloro, Kingdom == "Eukaryota")
EU1 <- as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_no_chloro), EU1df)
ps_no_euk <- prune_taxa(goodTaxa, ps_no_chloro)
ps_no_euk
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20223 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20223 taxa by 8 taxonomic ranks ]
The code removed an additional 51 Eukaryota ASVs from the ps
object.
Remove any Kingdom NAs
Here we can just use the straight up subset_taxa
command since we do not need to worry about any ranks above Kingdom also being removed.
ps_filt <- subset_taxa(ps_no_euk, !is.na(Kingdom))
ps_work_o <- ps_filt
The code eliminated an additional 50 Kingdom level NA ASVs from the phyloseq object.
Rename NA taxonomic ranks
Phyloseq has an odd way of dealing with taxonomic ranks that have no value—in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to have a value of the next highest classified rank. For example, ASV26
is not classified at the Genus level but is at Family level (Xanthobacteraceae). So we change the Genus name to Family_Xanthobacteraceae. The code for comes from these two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.
One thing this code does is reassign the functions
class
andorder
to taxonomic ranks. This can cause issues if you need these functions.
So you need to run something like this rm(class, order, phylum, kingdom)
at the end of the code to remove these as variables. For now, I have not come up with a better solution.
Code
tax.clean <- data.frame(tax_table(ps_filt))
for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])}
tax.clean[is.na(tax.clean)] <- ""
for (i in 1:nrow(tax.clean)){
if (tax.clean[i,2] == ""){
kingdom <- base::paste("k_", tax.clean[i,1], sep = "")
tax.clean[i, 2:6] <- kingdom
} else if (tax.clean[i,3] == ""){
phylum <- base::paste("p_", tax.clean[i,2], sep = "")
tax.clean[i, 3:6] <- phylum
} else if (tax.clean[i,4] == ""){
class <- base::paste("c_", tax.clean[i,3], sep = "")
tax.clean[i, 4:6] <- class
} else if (tax.clean[i,5] == ""){
order <- base::paste("o_", tax.clean[i,4], sep = "")
tax.clean[i, 5:6] <- order
} else if (tax.clean[i,6] == ""){
tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_")
}
}
tax_table(ps_filt) <- as.matrix(tax.clean)
rank_names(ps_filt)
rm(class, order, phylum, kingdom)
Still the same ranks. That’s good. What about the new groups? Let’s take a peak at some families.
head(get_taxa_unique(ps_filt, "Family"), 16)
[1] "Planococcaceae" "Chthoniobacteraceae" "Nitrososphaeraceae"
[4] "p_RCP2-54" "o_Rokubacteriales" "Methyloligellaceae"
[7] "Nitrospiraceae" "Xanthobacteraceae" "o_Vicinamibacterales"
[10] "Vicinamibacteraceae" "c_Subgroup_5" "Xiphinematobacteraceae"
[13] "Entotheonellaceae" "Chitinophagaceae" "Nitrosomonadaceae"
[16] "o_Acidobacteriales"
Nice. Bye-bye NA
Note. The original code mentioned above is written for data sets that have species-level designations.
Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.
Finally, we rename the ps object. This is now our working data set.
ps_work <- ps_filt
Add Phylogenetic Tree
One final task is to add a phylogenetic tree to the phyloseq object.
ps_work_tree <- rtree(ntaxa(ps_work), rooted = TRUE,
tip.label = taxa_names(ps_work))
ps_work <- merge_phyloseq(ps_work,
sample_data,
ps_work_tree)
ps_work_o_tree <- rtree(ntaxa(ps_work_o), rooted = TRUE,
tip.label = taxa_names(ps_work_o))
ps_work_o <- merge_phyloseq(ps_work_o,
sample_data,
ps_work_o_tree)
Data Set Summary
Now that we have removed the NC and sample with zero reads, we can summarize the data set. Let’s start with the in our working phyloseq object using the summarize_phyloseq
from the microbiome R package (Lahti, Sudarshan, et al. 2017) as we did above.
Metric | Results |
---|---|
Min. number of reads | 25088 |
Max. number of reads | 86600 |
Total number of reads | 936640 |
Average number of reads | 62443 |
Median number of reads | 62117 |
Total ASVs | 20173 |
Sparsity | 0.891 |
Any ASVs sum to 1 or less? | FALSE |
Number of singleton ASVs | NA |
Percent of ASVs that are singletons | NA |
Number of sample variables are: | 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR) |
We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample.
Code
total_reads <- sample_sums(ps_work)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")
total_asvs <- estimate_richness(ps_work,
measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")
total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID)
And then we merge these two data frames with the sample data frame.
Code
sam_details <- samdf
rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "Plot", "Depth",
"Treatment", "Temp", "Pair")
tmp_tab1 <- seq_table
tmp_tab1[, c(2:7,11)] <- NULL
tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>%
left_join(., total_asvs) %>%
left_join(., tmp_tab1)
reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads
asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed
tmp_tab2$reads_lost <- reads_lost
tmp_tab2$asvs_lost <- asvs_lost
seq_table2 <- tmp_tab2[, c(1:8,12,13)]
colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH",
"TREAT", "TEMP", "PAIR", "total_reads",
"total_ASVs", "reads_lost", "ASVs_lost")
(16S rRNA) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.
Review
We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.
The full phyloseq object before removing contaminants.
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20332 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20332 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps, "Family"), 16)
[1] "Planococcaceae" "Chthoniobacteraceae" "Nitrososphaeraceae"
[4] NA "Methyloligellaceae" "Nitrospiraceae"
[7] "Xanthobacteraceae" "Vicinamibacteraceae" "Xiphinematobacteraceae"
[10] "Entotheonellaceae" "Chitinophagaceae" "Nitrosomonadaceae"
[13] "B1-7BS" "WX65" "Gaiellaceae"
[16] "Steroidobacteraceae"
The phyloseq object before changing NA ranks.
ps_work_o
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20173 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20173 taxa by 8 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 20173 tips and 20172 internal nodes ]
head(get_taxa_unique(ps_work_o, "Family"), 16)
[1] "A4b" "Burkholderiaceae"
[3] "Peptococcaceae" NA
[5] "Polyangiaceae" "Pedosphaeraceae"
[7] "AKYH767" "Micromonosporaceae"
[9] "Acidobacteriaceae_(Subgroup_1)" "Pyrinomonadaceae"
[11] "Clostridiaceae" "Myxococcaceae"
[13] "Hyphomonadaceae" "Roseiflexaceae"
[15] "type_III" "Ktedonobacteraceae"
The phyloseq object after changing NA ranks.
ps_work
phyloseq-class experiment-level object
otu_table() OTU Table: [ 20173 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 20173 taxa by 8 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 20173 tips and 20172 internal nodes ]
head(get_taxa_unique(ps_work, "Family"), 16)
[1] "c_Subgroup_22" "c_OM190" "A4b"
[4] "Diplorickettsiaceae" "k_Bacteria" "Gemmataceae"
[7] "c_Gammaproteobacteria" "Microscillaceae" "Entotheonellaceae"
[10] "Syntrophaceae" "p_WS2" "Verrucomicrobiaceae"
[13] "Pyrinomonadaceae" "Caulobacteraceae" "Parachlamydiaceae"
[16] "c_Phycisphaerae"
Final Steps
The last thing tasks to complete are to a) save copies of the taxonomy and sequence tables and b) save an image of the workflow for the next section.
write.table(tax_table(ps_work),
"files/data-prep/tables/ssu18_work_tax_table.txt",
sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps_work)),
"files/data-prep/tables/ssu18_work_seq_table.txt",
sep = "\t", quote = FALSE, col.names = NA)
ITS Data
Prerequisites
In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 2018 ITS or begin with the output from that workflow, data_prep_its18_wf.rdata
. See the Data Availability page for complete details.
Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 ITS data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.
Read Counts Assessment
Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed from to beginning to the end of the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section.
Table headers are as follows:
Header | Description |
---|---|
Sample ID |
the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair |
Plot |
the experimental plot number |
Depth |
the depth where the sample was collected |
Treatment |
warm or control |
Temp |
temperature of soil heating (0C, +4C, or +8C) |
Pair |
control/treatment coupling |
input |
number of raw reads |
nochim |
final read count after removing chimeras |
Remain |
percent of reads remaining from input to nonchim
|
FastqID |
base name of the fastq file |
(ITS) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.
The following samples have fewer than 1000 reads at the end of the workflow:
P05_D00_010_W3C, P09_D00_010_W8E.
We can also plot the final read count by the temperature where the sample was collected.
Defining Groups
Load the data packet produced in the final step of the DADA2 workflow. This packet (
its18_dada2_wf.rdata
) contains the ASV-by-sample table and the ASV taxonomy table.Rename the samples so names have plot and Depth info. For convenience, we renamed any Negative Controls (NC) to
P00_D00-000_NNN
.After we load the data packet we next need to format sample names and define groups.
Code
seq_table <- seq_table[order(seq_table$`root name`), ]
tmp_names <- seq_table$`Sample_ID`
load("files/dada2/rdata/its18_dada2_wf.rdata")
identical(seq_table$`root name`, rownames(seqtab))
rownames(seqtab) <- tmp_names
samples.out <- rownames(seqtab)
sample_name <- substr(samples.out, 1, 999)
plot <- substr(samples.out, 0, 3)
depth <- substr(samples.out, 6, 11)
treatment <- substr(samples.out, 13, 13)
temp <- substr(samples.out, 14, 14)
pair <- substr(samples.out, 15, 15)
We have a total of 15 samples, 5 sample pairs, from 10 plots (1 depth). There are 2 treatments corresponding to 3 different temperature regimes.
- And finally we define a sample data frame that holds the different groups we extracted from the sample names.
Code
samdf <- data.frame(SamName = sample_name,
PLOT = plot,
DEPTH = depth,
TREAT = treatment,
TEMP = temp,
PAIR = pair)
rownames(samdf) <- samples.out
For example, P08_D00_010_C0D is the sample heated at +0C from plot P08, collected from a depth of 00_010cm. This sample is in group D which consists of P07_D00_010_W3D, P07_D00_010_W8D, P08_D00_010_C0D.
Create a Phyloseq Object
Fix Taxon Names
The UNITE database we used for classification adds a little prefix to every rank call, like k__
for Kingdom, o__
for Order, etc. To keep in step with our other analyses, we will remove these tags. Of course, we also make a ps object with the original names just in case. At this point we will also remove the Species rank since it is essentially useless.
tax_mod_o <- data.frame(tax)
tax_mod_o$Species <- NULL
tax_mod_o <- as.matrix(tax_mod_o)
tax_mod <- sub("k__", "", tax_mod_o)
tax_mod <- sub("p__", "", tax_mod)
tax_mod <- sub("c__", "", tax_mod)
tax_mod <- sub("o__", "", tax_mod)
tax_mod <- sub("f__", "", tax_mod)
tax_mod <- sub("g__", "", tax_mod)
A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.
Code
# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
sample_data(samdf), tax_table(tax_mod))
tax_table(ps) <- cbind(tax_table(ps),
rownames(tax_table(ps)))
# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
rownames(tax_table(ps)))
head(taxa_names(ps))
[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"
So the complete data set contains 3357 ASVs. We can also use the microbiome R package (Lahti, Sudarshan, et al. 2017) to get some additional summary data from the phyloseq object.
Metric | Results |
---|---|
Min. number of reads | 80 |
Max. number of reads | 64636 |
Total number of reads | 491143 |
Average number of reads | 32743 |
Median number of reads | 36063 |
Total ASVs | 3357 |
Sparsity | 0.802 |
Any ASVs sum to 1 or less? | TRUE |
Number of singleton ASVs | 3 |
Percent of ASVs that are singletons | 0.089 |
Number of sample variables are: | 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR) |
B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.
colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
"Family", "Genus", "ASV_SEQ", "ASV_ID")
rank_names(ps)
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3357 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3357 taxa by 8 taxonomic ranks ]
C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full
to indicate that these are the raw outputs.
Code
write.table(tax_table(ps),
"files/data-prep/tables/its18_full_tax_table.txt",
sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps)),
"files/data-prep/tables/its18_full_seq_table.txt",
sep="\t", quote = FALSE, col.names=NA)
At this point we would normally remove unwanted taxa and/or negative controls. Since we do not have any of these for this dataset, we are going to save the phyloseq object with a new name. That way, if we do decide later on to remove some taxa it will be easier to plug in the code here and then continue with the rest of the workflow.
ps_filt <- ps
Remove Low-Count Samples
Next, we can remove samples with really low read counts, say less than 500 reads.
ps_filt_no_low <- prune_samples(sample_sums(ps_filt) > 500, ps_filt)
ps_filt_no_low
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3357 taxa and 13 samples ]
sample_data() Sample Data: [ 13 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3357 taxa by 8 taxonomic ranks ]
We lost 2 sample(s). After removing samples we need to check whether any ASVs ended up with no reads.
no_reads <- taxa_sums(ps_filt_no_low) == 0
And we lost 2 ASV(s). So now we must remove these from the ps object.
ps_filt_no_nc <- prune_taxa(taxa_sums(ps_filt_no_low) > 0, ps_filt_no_low)
ps_filt_no_nc
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3355 taxa and 13 samples ]
sample_data() Sample Data: [ 13 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3355 taxa by 8 taxonomic ranks ]
Rename NA taxonomic ranks
Phyloseq has an odd way of dealing with taxonomic ranks that have no value—in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to have a value of the next highest classified rank. For example, ASV4
is not classified at the Genus level but is at Family level (Nitrososphaeraceae). So we change the Genus name to Family_Nitrososphaeraceae. The code for comes from these two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.
One thing this code does is reassign the functions
class
andorder
to taxonomic ranks. This can cause issues if you need these functions.
So you need to run something like this rm(class, order, phylum, kingdom)
at the end of the code to remove these as variables. For now, I have not come up with a better solution.
Before doing that, we will save a copy of the phyloseq object with the original taxa names.
ps_work_o <- ps_filt_no_nc
Code
tax.clean <- data.frame(tax_table(ps_filt_no_nc))
for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])}
tax.clean[is.na(tax.clean)] <- ""
for (i in 1:nrow(tax.clean)){
if (tax.clean[i,2] == ""){
kingdom <- base::paste("k_", tax.clean[i,1], sep = "")
tax.clean[i, 2:6] <- kingdom
} else if (tax.clean[i,3] == ""){
phylum <- base::paste("p_", tax.clean[i,2], sep = "")
tax.clean[i, 3:6] <- phylum
} else if (tax.clean[i,4] == ""){
class <- base::paste("c_", tax.clean[i,3], sep = "")
tax.clean[i, 4:6] <- class
} else if (tax.clean[i,5] == ""){
order <- base::paste("o_", tax.clean[i,4], sep = "")
tax.clean[i, 5:6] <- order
} else if (tax.clean[i,6] == ""){
tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_")
}
}
tax_table(ps_filt_no_nc) <- as.matrix(tax.clean)
rank_names(ps_filt_no_nc)
rm(class, order, phylum, kingdom)
Still the same ranks. That’s good. What about the new groups? Let’s take a peak at some families.
head(get_taxa_unique(ps_filt_no_nc, "Family"), 16)
[1] "Hygrophoraceae" "Mycosphaerellaceae" "p_Ascomycota"
[4] "Trichosporonaceae" "c_Agaricomycetes" "Archaeorhizomycetaceae"
[7] "Chaetomellaceae" "Geastraceae" "k_Fungi"
[10] "Metschnikowiaceae" "Clavariaceae" "o_Agaricales"
[13] "Ascobolaceae" "Trichocomaceae" "Piptocephalidaceae"
[16] "p_Chytridiomycota"
Nice. Adios NA.
Note. The original code is written for data sets that have species-level designations.
Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.
Finally, we rename the ps object. This is now our working data set.
ps_work <- ps_filt_no_nc
Data Set Summary
Now that we have removed the NC and samples with low reads, we can summarize the data in our working phyloseq object. Let’s start with the data set as a whole using the summarize_phyloseq
from the microbiome R package (Lahti, Sudarshan, et al. 2017) as we did above.
Metric | Results |
---|---|
Min. number of reads | 9172 |
Max. number of reads | 64636 |
Total number of reads | 490767 |
Average number of reads | 37751 |
Median number of reads | 38371 |
Total ASVs | 3355 |
Sparsity | 0.671 |
Any ASVs sum to 1 or less? | TRUE |
Number of singleton ASVs | 5 |
Percent of ASVs that are singletons | 0.149 |
Number of sample variables are: | 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR) |
We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample.
Code
total_reads <- sample_sums(ps_work)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")
total_asvs <- estimate_richness(ps_work, measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")
total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID)
And then we merge these two data frames with the sample data frame.
Code
rm(list = ls(pattern = "tmp_"))
sam_details <- samdf
rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "Plot", "Depth",
"Treatment", "Temp", "Pair")
tmp_tab1 <- seq_table
tmp_tab1[, c(2:7,11)] <- NULL
tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>%
left_join(., total_asvs) %>%
left_join(., tmp_tab1)
reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads
asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed
tmp_tab2$reads_lost <- reads_lost
tmp_tab2$asvs_lost <- asvs_lost
seq_table2 <- tmp_tab2[, c(1:8,12,13)]
colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH",
"TREAT", "TEMP", "PAIR", "total_reads",
"total_ASVs", "reads_lost", "ASVs_lost")
(ITS) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.
Review
We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.
The full phyloseq object before removing contaminants.
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3357 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3357 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps, "Family"), 16)
[1] "Hygrophoraceae" "Mycosphaerellaceae"
[3] NA "Trichosporonaceae"
[5] "Archaeorhizomycetaceae" "Chaetomellaceae"
[7] "Geastraceae" "Metschnikowiaceae"
[9] "Clavariaceae" "Ascobolaceae"
[11] "Trichocomaceae" "Piptocephalidaceae"
[13] "Mortierellaceae" "Pleosporales_fam_Incertae_sedis"
[15] "Glomeraceae" "Ceratobasidiaceae"
The phyloseq object before removing the prefix of taxon ranks. This still has low coverage samples.
ps_o
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3357 taxa and 15 samples ]
sample_data() Sample Data: [ 15 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3357 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps_o, "Family"), 16)
[1] "f__Hygrophoraceae" "f__Mycosphaerellaceae"
[3] NA "f__Trichosporonaceae"
[5] "f__Archaeorhizomycetaceae" "f__Chaetomellaceae"
[7] "f__Geastraceae" "f__Metschnikowiaceae"
[9] "f__Clavariaceae" "f__Ascobolaceae"
[11] "f__Trichocomaceae" "f__Piptocephalidaceae"
[13] "f__Mortierellaceae" "f__Pleosporales_fam_Incertae_sedis"
[15] "f__Glomeraceae" "f__Ceratobasidiaceae"
The phyloseq object before changing NA ranks.
ps_work_o
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3355 taxa and 13 samples ]
sample_data() Sample Data: [ 13 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3355 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps_work_o, "Family"), 16)
[1] "Hygrophoraceae" "Mycosphaerellaceae"
[3] NA "Trichosporonaceae"
[5] "Archaeorhizomycetaceae" "Chaetomellaceae"
[7] "Geastraceae" "Metschnikowiaceae"
[9] "Clavariaceae" "Ascobolaceae"
[11] "Trichocomaceae" "Piptocephalidaceae"
[13] "Mortierellaceae" "Pleosporales_fam_Incertae_sedis"
[15] "Glomeraceae" "Ceratobasidiaceae"
The phyloseq object after changing NA ranks.
ps_work
phyloseq-class experiment-level object
otu_table() OTU Table: [ 3355 taxa and 13 samples ]
sample_data() Sample Data: [ 13 samples by 6 sample variables ]
tax_table() Taxonomy Table: [ 3355 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps_work, "Family"), 16)
[1] "Hygrophoraceae" "Mycosphaerellaceae" "p_Ascomycota"
[4] "Trichosporonaceae" "c_Agaricomycetes" "Archaeorhizomycetaceae"
[7] "Chaetomellaceae" "Geastraceae" "k_Fungi"
[10] "Metschnikowiaceae" "Clavariaceae" "o_Agaricales"
[13] "Ascobolaceae" "Trichocomaceae" "Piptocephalidaceae"
[16] "p_Chytridiomycota"
Final Steps
The last thing tasks to complete are to a) save copies of the taxonomy and sequence tables and b) save an image of the workflow for the next part of the workflow.
Code
write.table(tax_table(ps_work),
"files/data-prep/tables/its18_work_tax_table.txt",
sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_work)),
"files/data-prep/tables/its18_work_seq_table.txt",
sep="\t", quote = FALSE, col.names=NA)
Workflow Output
Data products generated in this workflow can be downloaded from figshare.
That’s all for this part!
Next workflow:
3. Filtering Previous workflow:
1. DADA2 Workflow
Source Code
The source code for this page can be accessed on GitHub by clicking this link.
Data Availability
Data generated in this workflow and the .Rdata
need to run the workflow can be accessed on figshare at 10.25573/data.14690739.
Last updated on
[1] "2022-06-29 07:31:56 EST"