2. Data Set Preparation

Workflow for preparation of the 16S rRNA and ITS data sets. These steps are needed before analyzing the data. In this workflow, sample groups are defined, and phyloseq objects are created and curated.

Click here for setup information.
knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
set.seed(119)
#library(conflicted)
#pacman::p_depends(ape, local = TRUE)  
#pacman::p_depends_reverse(ape, local = TRUE)  
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
pacman::p_load(tidyverse, DT, microbiome, taxa, formatR,
               captioner, reactable, downloadthis, fontawesome,
               metacoder, ampvis2, ape, statnet.common, 
               install = FALSE, update = FALSE)

options(scipen=999)
knitr::opts_current$get(c(
  "cache",
  "cache.path",
  "cache.rebuild",
  "dependson",
  "autodep"
))

Workflow Input

You will either need to run the DADA2 workflow or download the files linked below. See the Data Availability page for complete details.

Files needed to run this workflow can be downloaded from figshare. These are the output files from the DADA2 workflow page.

These workflows share many common variable names so you must split the workflow into a script for each data set OR run the command remove(list = ls()) before beginning the next workflow.

16s rRNA Data

Prerequisites

In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 16S rRNA or begin with the output from that workflow, data_prep_ssu18_wf.rdata. See the Data Availability page for complete details.

Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 16S rRNA data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.

Read Counts Assessment

Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed through the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section. Table headers are as follows:

Header Description
Sample ID the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair
Plot the experimental plot number
Depth the depth where the sample was collected
Treatment warm or control
Temp temperature of soil heating (0C, +4C, or +8C)
Pair control/treatment coupling
input number of raw reads
nochim final read count after removing chimeras
Remain percent of reads remaining from input to nonchim
FastqID base name of the fastq file


(16S rRNA) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.

Defining Groups

  1. Load the data packet produced in the final step of the DADA2 workflow. This packet (ssu18_dada2_wf.rdata) contains the ASV-by-sample table and the ASV taxonomy table.

  2. Rename the samples so names have plot and Depth info.

  3. After we load the data packet, we next need to format sample names and define groups.

Code
seq_table <- seq_table[order(seq_table$`root name`), ]
tmp_names <- seq_table$`Sample_ID`
load("files/dada2/rdata/ssu18_dada2_wf.rdata")
identical(seq_table$`root name`, rownames(seqtab))
rownames(seqtab) <- tmp_names
samples.out <- rownames(seqtab)

sample_name <- substr(samples.out, 1, 999)
plot <- substr(samples.out, 0, 3)
depth <- substr(samples.out, 6, 11)
treatment <- substr(samples.out, 13, 13)
temp <- substr(samples.out, 14, 14)
pair <- substr(samples.out, 15, 15)

We have a total of 15 samples, 5 sample pairs, from 10 plots (1 depth). There are 2 treatments corresponding to 3 different temperature regimes.

  1. And finally we define a sample data frame that holds the different groups we extracted from the sample names.
Code
#define a sample data frame
samdf <- data.frame(SamName = sample_name,
                    PLOT = plot,
                    DEPTH = depth,
                    TREAT = treatment,
                    TEMP = temp,
                    PAIR = pair)
rownames(samdf) <- samples.out

For example, P07_D00_010_W3D is the sample heated at +3C from plot P07, collected from a depth of 00_010cm. This sample is in group D which consists of P07_D00_010_W3D, P07_D00_010_W8D, P08_D00_010_C0D.

Create a Phyloseq Object

A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.

Note

A phyloseq object contains ASV table (taxa abundances), sample metadata, and taxonomy table (mapping between ASVs and higher-level taxonomic classifications).

Code
# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
                   sample_data(samdf), tax_table(tax_silva))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))

# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))
[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"

So the complete data set contains 20332 ASVs. We can use the microbiome R package (Lahti, Sudarshan, et al. 2017) to get some additional summary data from the phyloseq object.

Metric Results
Min. number of reads 25151
Max. number of reads 86841
Total number of reads 937761
Average number of reads 62517
Median number of reads 62171
Total ASVs 20332
Sparsity 0.898
Any ASVs sum to 1 or less? FALSE
Number of singleton ASVs NA
Percent of ASVs that are singletons NA
Number of sample variables are: 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.

colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20332 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20332 taxa by 8 taxonomic ranks ]

C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.

write.table(tax_table(ps),
            "files/data-prep/tables/ssu18_full_tax_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps)),
            "files/data-prep/tables/ssu18_full_seq_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)

Remove Contaminants & Unwanted Taxa

Let’s see if we have any potential contaminants. We can use some inline R code to see the taxonomy table for any taxa of interest.

Note

The code (hidden by default) is written as `r "Chloroplast" %in% tax_table(ps)`.

  • Are Mitochondria present? TRUE
  • Are Chloroplast present? TRUE
  • Are Eukaryota present? TRUE

Let’s remove these taxa—Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from plants. We must do each of these in turn using phyloseq and it gets a little messy.

Why messy? The subset_taxa command removes anything that is NA for the specified taxonomic level or above. For example, lets say you run the subset_taxa command using Family != "Mitochondria“. Seems like you should get a phyloseq object with everything except Mitochondria. But actually the command not only gets rid of Mitochondria but everything else that has NA for Family and above. In my experience this is not well documented and I had to dig through the files to figure out what was happening.

Anyway, to remove the taxa we do the following:

  • Subset the taxa and generate a ps object of just the taxa of interest,
  • Select the ASV column only, turn it into a factor, and use this to remove from the ps object.

Remove Mitochondria ASVs

Remember the original data set contained 20332 ASVs. Here we generate a file with mitochondria ASVs only.

MT1 <- subset_taxa(ps, Family == "Mitochondria")
MT1 <-  as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps), MT1df)
ps_no_mito <- prune_taxa(goodTaxa, ps)
ps_no_mito
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20291 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20291 taxa by 8 taxonomic ranks ]

Looks like this removed 41 Mitochondria ASVs. We will duplicate the code block to remove other groups.

Remove Chloroplast ASVs

And again with Chloroplast ASVs only.

CH1 <- subset_taxa(ps_no_mito, Order == "Chloroplast")
CH1 <-  as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_no_mito), CH1df)
ps_no_chloro <- prune_taxa(goodTaxa, ps_no_mito)
ps_no_chloro
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20274 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20274 taxa by 8 taxonomic ranks ]

The code removed an additional 17 Chloroplast ASVs.

Remove Eukaryota ASVs

And again with Eukaryota ASVs only.

EU1 <- subset_taxa(ps_no_chloro, Kingdom == "Eukaryota")
EU1 <-  as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_no_chloro), EU1df)
ps_no_euk <- prune_taxa(goodTaxa, ps_no_chloro)
ps_no_euk
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20223 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20223 taxa by 8 taxonomic ranks ]

The code removed an additional 51 Eukaryota ASVs from the ps object.

Remove any Kingdom NAs

Here we can just use the straight up subset_taxa command since we do not need to worry about any ranks above Kingdom also being removed.

ps_filt <- subset_taxa(ps_no_euk, !is.na(Kingdom))
ps_work_o <- ps_filt

The code eliminated an additional 50 Kingdom level NA ASVs from the phyloseq object.

Rename NA taxonomic ranks

Phyloseq has an odd way of dealing with taxonomic ranks that have no value—in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to have a value of the next highest classified rank. For example, ASV26 is not classified at the Genus level but is at Family level (Xanthobacteraceae). So we change the Genus name to Family_Xanthobacteraceae. The code for comes from these two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.

One thing this code does is reassign the functions class and order to taxonomic ranks. This can cause issues if you need these functions.

So you need to run something like this rm(class, order, phylum, kingdom) at the end of the code to remove these as variables. For now, I have not come up with a better solution.

Code
tax.clean <- data.frame(tax_table(ps_filt))
for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])}
tax.clean[is.na(tax.clean)] <- ""

for (i in 1:nrow(tax.clean)){
    if (tax.clean[i,2] == ""){
        kingdom <- base::paste("k_", tax.clean[i,1], sep = "")
        tax.clean[i, 2:6] <- kingdom
    } else if (tax.clean[i,3] == ""){
        phylum <- base::paste("p_", tax.clean[i,2], sep = "")
        tax.clean[i, 3:6] <- phylum
    } else if (tax.clean[i,4] == ""){
        class <- base::paste("c_", tax.clean[i,3], sep = "")
        tax.clean[i, 4:6] <- class
    } else if (tax.clean[i,5] == ""){
        order <- base::paste("o_", tax.clean[i,4], sep = "")
        tax.clean[i, 5:6] <- order
    } else if (tax.clean[i,6] == ""){
        tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_")
        }
}
tax_table(ps_filt) <- as.matrix(tax.clean)
rank_names(ps_filt)
rm(class, order, phylum, kingdom)

Still the same ranks. That’s good. What about the new groups? Let’s take a peak at some families.

head(get_taxa_unique(ps_filt, "Family"), 16)
 [1] "Planococcaceae"         "Chthoniobacteraceae"    "Nitrososphaeraceae"    
 [4] "p_RCP2-54"              "o_Rokubacteriales"      "Methyloligellaceae"    
 [7] "Nitrospiraceae"         "Xanthobacteraceae"      "o_Vicinamibacterales"  
[10] "Vicinamibacteraceae"    "c_Subgroup_5"           "Xiphinematobacteraceae"
[13] "Entotheonellaceae"      "Chitinophagaceae"       "Nitrosomonadaceae"     
[16] "o_Acidobacteriales"    

Nice. Bye-bye NA

Note. The original code mentioned above is written for data sets that have species-level designations.

Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.

Finally, we rename the ps object. This is now our working data set.

ps_work <- ps_filt

Add Phylogenetic Tree

One final task is to add a phylogenetic tree to the phyloseq object.

ps_work_tree <- rtree(ntaxa(ps_work), rooted = TRUE,
                      tip.label = taxa_names(ps_work))
ps_work <- merge_phyloseq(ps_work,
                          sample_data,
                          ps_work_tree)
ps_work_o_tree <- rtree(ntaxa(ps_work_o), rooted = TRUE,
                      tip.label = taxa_names(ps_work_o))
ps_work_o <- merge_phyloseq(ps_work_o,
                          sample_data,
                          ps_work_o_tree)

Data Set Summary

Now that we have removed the NC and sample with zero reads, we can summarize the data set. Let’s start with the in our working phyloseq object using the summarize_phyloseq from the microbiome R package (Lahti, Sudarshan, et al. 2017) as we did above.

Metric Results
Min. number of reads 25088
Max. number of reads 86600
Total number of reads 936640
Average number of reads 62443
Median number of reads 62117
Total ASVs 20173
Sparsity 0.891
Any ASVs sum to 1 or less? FALSE
Number of singleton ASVs NA
Percent of ASVs that are singletons NA
Number of sample variables are: 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample.

Code
total_reads <- sample_sums(ps_work)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")

total_asvs <- estimate_richness(ps_work,
                                measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")
total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID)

And then we merge these two data frames with the sample data frame.

Code
sam_details <- samdf
rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "Plot", "Depth",
                                  "Treatment", "Temp", "Pair")
tmp_tab1 <- seq_table
tmp_tab1[, c(2:7,11)] <- NULL
tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>%
                left_join(., total_asvs) %>%
                left_join(., tmp_tab1)

reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads 
asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed 
tmp_tab2$reads_lost <- reads_lost
tmp_tab2$asvs_lost <- asvs_lost

seq_table2 <- tmp_tab2[, c(1:8,12,13)]
colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH",
                          "TREAT", "TEMP", "PAIR", "total_reads",
                          "total_ASVs", "reads_lost", "ASVs_lost")

(16S rRNA) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.

Review

We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.

The full phyloseq object before removing contaminants.

ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20332 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20332 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps, "Family"), 16)
 [1] "Planococcaceae"         "Chthoniobacteraceae"    "Nitrososphaeraceae"    
 [4] NA                       "Methyloligellaceae"     "Nitrospiraceae"        
 [7] "Xanthobacteraceae"      "Vicinamibacteraceae"    "Xiphinematobacteraceae"
[10] "Entotheonellaceae"      "Chitinophagaceae"       "Nitrosomonadaceae"     
[13] "B1-7BS"                 "WX65"                   "Gaiellaceae"           
[16] "Steroidobacteraceae"   

The phyloseq object before changing NA ranks.

ps_work_o
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20173 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20173 taxa by 8 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 20173 tips and 20172 internal nodes ]
head(get_taxa_unique(ps_work_o, "Family"), 16)
 [1] "A4b"                            "Burkholderiaceae"              
 [3] "Peptococcaceae"                 NA                              
 [5] "Polyangiaceae"                  "Pedosphaeraceae"               
 [7] "AKYH767"                        "Micromonosporaceae"            
 [9] "Acidobacteriaceae_(Subgroup_1)" "Pyrinomonadaceae"              
[11] "Clostridiaceae"                 "Myxococcaceae"                 
[13] "Hyphomonadaceae"                "Roseiflexaceae"                
[15] "type_III"                       "Ktedonobacteraceae"            

The phyloseq object after changing NA ranks.

ps_work
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20173 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20173 taxa by 8 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 20173 tips and 20172 internal nodes ]
head(get_taxa_unique(ps_work, "Family"), 16)
 [1] "c_Subgroup_22"         "c_OM190"               "A4b"                  
 [4] "Diplorickettsiaceae"   "k_Bacteria"            "Gemmataceae"          
 [7] "c_Gammaproteobacteria" "Microscillaceae"       "Entotheonellaceae"    
[10] "Syntrophaceae"         "p_WS2"                 "Verrucomicrobiaceae"  
[13] "Pyrinomonadaceae"      "Caulobacteraceae"      "Parachlamydiaceae"    
[16] "c_Phycisphaerae"      

Final Steps

The last thing tasks to complete are to a) save copies of the taxonomy and sequence tables and b) save an image of the workflow for the next section.

write.table(tax_table(ps_work),
            "files/data-prep/tables/ssu18_work_tax_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps_work)),
            "files/data-prep/tables/ssu18_work_seq_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)

ITS Data

Prerequisites

In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 2018 ITS or begin with the output from that workflow, data_prep_its18_wf.rdata. See the Data Availability page for complete details.

Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 ITS data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.

Read Counts Assessment

Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed from to beginning to the end of the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section.

Table headers are as follows:

Header Description
Sample ID the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair
Plot the experimental plot number
Depth the depth where the sample was collected
Treatment warm or control
Temp temperature of soil heating (0C, +4C, or +8C)
Pair control/treatment coupling
input number of raw reads
nochim final read count after removing chimeras
Remain percent of reads remaining from input to nonchim
FastqID base name of the fastq file

(ITS) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.

The following samples have fewer than 1000 reads at the end of the workflow:

P05_D00_010_W3C, P09_D00_010_W8E.

We can also plot the final read count by the temperature where the sample was collected.

Defining Groups

  1. Load the data packet produced in the final step of the DADA2 workflow. This packet (its18_dada2_wf.rdata) contains the ASV-by-sample table and the ASV taxonomy table.

  2. Rename the samples so names have plot and Depth info. For convenience, we renamed any Negative Controls (NC) to P00_D00-000_NNN.

  3. After we load the data packet we next need to format sample names and define groups.

Code
seq_table <- seq_table[order(seq_table$`root name`), ]
tmp_names <- seq_table$`Sample_ID`
load("files/dada2/rdata/its18_dada2_wf.rdata")
identical(seq_table$`root name`, rownames(seqtab))
rownames(seqtab) <- tmp_names
samples.out <- rownames(seqtab)
sample_name <- substr(samples.out, 1, 999)
plot <- substr(samples.out, 0, 3)
depth <- substr(samples.out, 6, 11)
treatment <- substr(samples.out, 13, 13)
temp <- substr(samples.out, 14, 14)
pair <- substr(samples.out, 15, 15)

We have a total of 15 samples, 5 sample pairs, from 10 plots (1 depth). There are 2 treatments corresponding to 3 different temperature regimes.

  1. And finally we define a sample data frame that holds the different groups we extracted from the sample names.
Code
samdf <- data.frame(SamName = sample_name,
                    PLOT = plot,
                    DEPTH = depth,
                    TREAT = treatment,
                    TEMP = temp,
                    PAIR = pair)
rownames(samdf) <- samples.out

For example, P08_D00_010_C0D is the sample heated at +0C from plot P08, collected from a depth of 00_010cm. This sample is in group D which consists of P07_D00_010_W3D, P07_D00_010_W8D, P08_D00_010_C0D.

Create a Phyloseq Object

Fix Taxon Names

The UNITE database we used for classification adds a little prefix to every rank call, like k__ for Kingdom, o__ for Order, etc. To keep in step with our other analyses, we will remove these tags. Of course, we also make a ps object with the original names just in case. At this point we will also remove the Species rank since it is essentially useless.

tax_mod_o <- data.frame(tax)
tax_mod_o$Species <- NULL
tax_mod_o <- as.matrix(tax_mod_o)
tax_mod <- sub("k__", "", tax_mod_o)
tax_mod <- sub("p__", "", tax_mod)
tax_mod <- sub("c__", "", tax_mod)
tax_mod <- sub("o__", "", tax_mod)
tax_mod <- sub("f__", "", tax_mod)
tax_mod <- sub("g__", "", tax_mod)

A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.

Note

A phyloseq object contains ASV table (taxa abundances), sample metadata, and taxonomy table (mapping between ASVs and higher-level taxonomic classifications).

Code
# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
                   sample_data(samdf), tax_table(tax_mod))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))

# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))
head(taxa_names(ps))
[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"

So the complete data set contains 3357 ASVs. We can also use the microbiome R package (Lahti, Sudarshan, et al. 2017) to get some additional summary data from the phyloseq object.

Metric Results
Min. number of reads 80
Max. number of reads 64636
Total number of reads 491143
Average number of reads 32743
Median number of reads 36063
Total ASVs 3357
Sparsity 0.802
Any ASVs sum to 1 or less? TRUE
Number of singleton ASVs 3
Percent of ASVs that are singletons 0.089
Number of sample variables are: 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.

colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
rank_names(ps)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]

C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.

Code
write.table(tax_table(ps),
            "files/data-prep/tables/its18_full_tax_table.txt",
            sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps)),
            "files/data-prep/tables/its18_full_seq_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

At this point we would normally remove unwanted taxa and/or negative controls. Since we do not have any of these for this dataset, we are going to save the phyloseq object with a new name. That way, if we do decide later on to remove some taxa it will be easier to plug in the code here and then continue with the rest of the workflow.

ps_filt <- ps

Remove Low-Count Samples

Next, we can remove samples with really low read counts, say less than 500 reads.

ps_filt_no_low <- prune_samples(sample_sums(ps_filt) > 500, ps_filt)
ps_filt_no_low
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]

We lost 2 sample(s). After removing samples we need to check whether any ASVs ended up with no reads.

no_reads <- taxa_sums(ps_filt_no_low) == 0

And we lost 2 ASV(s). So now we must remove these from the ps object.

ps_filt_no_nc <- prune_taxa(taxa_sums(ps_filt_no_low) > 0, ps_filt_no_low)
ps_filt_no_nc
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3355 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3355 taxa by 8 taxonomic ranks ]

Rename NA taxonomic ranks

Phyloseq has an odd way of dealing with taxonomic ranks that have no value—in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to have a value of the next highest classified rank. For example, ASV4 is not classified at the Genus level but is at Family level (Nitrososphaeraceae). So we change the Genus name to Family_Nitrososphaeraceae. The code for comes from these two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.

One thing this code does is reassign the functions class and order to taxonomic ranks. This can cause issues if you need these functions.

So you need to run something like this rm(class, order, phylum, kingdom) at the end of the code to remove these as variables. For now, I have not come up with a better solution.

Before doing that, we will save a copy of the phyloseq object with the original taxa names.

ps_work_o <- ps_filt_no_nc
Code
tax.clean <- data.frame(tax_table(ps_filt_no_nc))
for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])}
tax.clean[is.na(tax.clean)] <- ""

for (i in 1:nrow(tax.clean)){
    if (tax.clean[i,2] == ""){
        kingdom <- base::paste("k_", tax.clean[i,1], sep = "")
        tax.clean[i, 2:6] <- kingdom
    } else if (tax.clean[i,3] == ""){
        phylum <- base::paste("p_", tax.clean[i,2], sep = "")
        tax.clean[i, 3:6] <- phylum
    } else if (tax.clean[i,4] == ""){
        class <- base::paste("c_", tax.clean[i,3], sep = "")
        tax.clean[i, 4:6] <- class
    } else if (tax.clean[i,5] == ""){
        order <- base::paste("o_", tax.clean[i,4], sep = "")
        tax.clean[i, 5:6] <- order
    } else if (tax.clean[i,6] == ""){
        tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_")
        }
}
tax_table(ps_filt_no_nc) <- as.matrix(tax.clean)
rank_names(ps_filt_no_nc)
rm(class, order, phylum, kingdom)

Still the same ranks. That’s good. What about the new groups? Let’s take a peak at some families.

head(get_taxa_unique(ps_filt_no_nc, "Family"), 16)
 [1] "Hygrophoraceae"         "Mycosphaerellaceae"     "p_Ascomycota"          
 [4] "Trichosporonaceae"      "c_Agaricomycetes"       "Archaeorhizomycetaceae"
 [7] "Chaetomellaceae"        "Geastraceae"            "k_Fungi"               
[10] "Metschnikowiaceae"      "Clavariaceae"           "o_Agaricales"          
[13] "Ascobolaceae"           "Trichocomaceae"         "Piptocephalidaceae"    
[16] "p_Chytridiomycota"     

Nice. Adios NA.

Note. The original code is written for data sets that have species-level designations.

Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.

Finally, we rename the ps object. This is now our working data set.

ps_work <- ps_filt_no_nc

Data Set Summary

Now that we have removed the NC and samples with low reads, we can summarize the data in our working phyloseq object. Let’s start with the data set as a whole using the summarize_phyloseq from the microbiome R package (Lahti, Sudarshan, et al. 2017) as we did above.

Metric Results
Min. number of reads 9172
Max. number of reads 64636
Total number of reads 490767
Average number of reads 37751
Median number of reads 38371
Total ASVs 3355
Sparsity 0.671
Any ASVs sum to 1 or less? TRUE
Number of singleton ASVs 5
Percent of ASVs that are singletons 0.149
Number of sample variables are: 6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample.

Code
total_reads <- sample_sums(ps_work)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")

total_asvs <- estimate_richness(ps_work, measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")
total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID)

And then we merge these two data frames with the sample data frame.

Code
rm(list = ls(pattern = "tmp_"))

sam_details <- samdf
rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "Plot", "Depth",
                                  "Treatment", "Temp", "Pair")
tmp_tab1 <- seq_table
tmp_tab1[, c(2:7,11)] <- NULL
tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>%
                left_join(., total_asvs) %>%
                left_join(., tmp_tab1)

reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads 
asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed 
tmp_tab2$reads_lost <- reads_lost
tmp_tab2$asvs_lost <- asvs_lost

seq_table2 <- tmp_tab2[, c(1:8,12,13)]
colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH",
                          "TREAT", "TEMP", "PAIR", "total_reads",
                          "total_ASVs", "reads_lost", "ASVs_lost")

(ITS) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.

Review

We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.

The full phyloseq object before removing contaminants.

ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps, "Family"), 16)
 [1] "Hygrophoraceae"                  "Mycosphaerellaceae"             
 [3] NA                                "Trichosporonaceae"              
 [5] "Archaeorhizomycetaceae"          "Chaetomellaceae"                
 [7] "Geastraceae"                     "Metschnikowiaceae"              
 [9] "Clavariaceae"                    "Ascobolaceae"                   
[11] "Trichocomaceae"                  "Piptocephalidaceae"             
[13] "Mortierellaceae"                 "Pleosporales_fam_Incertae_sedis"
[15] "Glomeraceae"                     "Ceratobasidiaceae"              

The phyloseq object before removing the prefix of taxon ranks. This still has low coverage samples.

ps_o
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps_o, "Family"), 16)
 [1] "f__Hygrophoraceae"                  "f__Mycosphaerellaceae"             
 [3] NA                                   "f__Trichosporonaceae"              
 [5] "f__Archaeorhizomycetaceae"          "f__Chaetomellaceae"                
 [7] "f__Geastraceae"                     "f__Metschnikowiaceae"              
 [9] "f__Clavariaceae"                    "f__Ascobolaceae"                   
[11] "f__Trichocomaceae"                  "f__Piptocephalidaceae"             
[13] "f__Mortierellaceae"                 "f__Pleosporales_fam_Incertae_sedis"
[15] "f__Glomeraceae"                     "f__Ceratobasidiaceae"              

The phyloseq object before changing NA ranks.

ps_work_o
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3355 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3355 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps_work_o, "Family"), 16)
 [1] "Hygrophoraceae"                  "Mycosphaerellaceae"             
 [3] NA                                "Trichosporonaceae"              
 [5] "Archaeorhizomycetaceae"          "Chaetomellaceae"                
 [7] "Geastraceae"                     "Metschnikowiaceae"              
 [9] "Clavariaceae"                    "Ascobolaceae"                   
[11] "Trichocomaceae"                  "Piptocephalidaceae"             
[13] "Mortierellaceae"                 "Pleosporales_fam_Incertae_sedis"
[15] "Glomeraceae"                     "Ceratobasidiaceae"              

The phyloseq object after changing NA ranks.

ps_work
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3355 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3355 taxa by 8 taxonomic ranks ]
head(get_taxa_unique(ps_work, "Family"), 16)
 [1] "Hygrophoraceae"         "Mycosphaerellaceae"     "p_Ascomycota"          
 [4] "Trichosporonaceae"      "c_Agaricomycetes"       "Archaeorhizomycetaceae"
 [7] "Chaetomellaceae"        "Geastraceae"            "k_Fungi"               
[10] "Metschnikowiaceae"      "Clavariaceae"           "o_Agaricales"          
[13] "Ascobolaceae"           "Trichocomaceae"         "Piptocephalidaceae"    
[16] "p_Chytridiomycota"     

Final Steps

The last thing tasks to complete are to a) save copies of the taxonomy and sequence tables and b) save an image of the workflow for the next part of the workflow.

Code
write.table(tax_table(ps_work),
            "files/data-prep/tables/its18_work_tax_table.txt",
            sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_work)),
            "files/data-prep/tables/its18_work_seq_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

Workflow Output

Data products generated in this workflow can be downloaded from figshare.

That’s all for this part!

Next workflow:
3. Filtering
Previous workflow:
1. DADA2 Workflow


Source Code

The source code for this page can be accessed on GitHub by clicking this link.

Data Availability

Data generated in this workflow and the .Rdata need to run the workflow can be accessed on figshare at 10.25573/data.14690739.

Last updated on

[1] "2022-06-29 07:31:56 EST"

References

Lahti, Leo, Shetty Sudarshan, et al. 2017. “Tools for Microbiome Analysis in r. Version 1.9.97.” https://github.com/microbiome/microbiome.
McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An r Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS One 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.