Click here for setup information.

knitr::opts_chunk$set(echo = TRUE, eval = FALSE)
set.seed(119)
#library(conflicted)
#pacman::p_depends(ape, local = TRUE)  
#pacman::p_depends_reverse(ape, local = TRUE)  
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
pacman::p_load(tidyverse, DT, microbiome, taxa, formatR,
               captioner, reactable, downloadthis, fontawesome,
               metacoder, ampvis2, ape, statnet.common, 
               install = FALSE, update = FALSE)

options(scipen=999)
knitr::opts_current$get(c(
  "cache",
  "cache.path",
  "cache.rebuild",
  "dependson",
  "autodep"
))

Workflow Input

You will either need to run the DADA2 workflow or download the files linked below. See the Data Availability page for complete details.

Files needed to run this workflow can be downloaded from figshare. These are the output files from the DADA2 workflow page.

These workflows share many common variable names so you must split the workflow into a script for each data set OR run the command remove(list = ls()) before beginning the next workflow.

16s rRNA Data

Prerequisites

In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 16S rRNA or begin with the output from that workflow, data_prep_ssu18_wf.rdata. See the Data Availability page for complete details.

Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 16S rRNA data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.

Read Counts Assessment

Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed through the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section. Table headers are as follows:

Header	Description
`Sample ID`	the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair
`Plot`	the experimental plot number
`Depth`	the depth where the sample was collected
`Treatment`	warm or control
`Temp`	temperature of soil heating (0C, +4C, or +8C)
`Pair`	control/treatment coupling
`input`	number of raw reads
`nochim`	final read count after removing chimeras
`Remain`	percent of reads remaining from `input` to `nonchim`
`FastqID`	base name of the fastq file

(16S rRNA) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.

Defining Groups

Load the data packet produced in the final step of the DADA2 workflow. This packet (ssu18_dada2_wf.rdata) contains the ASV-by-sample table and the ASV taxonomy table.
Rename the samples so names have plot and Depth info.
After we load the data packet, we next need to format sample names and define groups.

Code

seq_table <- seq_table[order(seq_table$`root name`), ]
tmp_names <- seq_table$`Sample_ID`
load("files/dada2/rdata/ssu18_dada2_wf.rdata")
identical(seq_table$`root name`, rownames(seqtab))
rownames(seqtab) <- tmp_names
samples.out <- rownames(seqtab)

sample_name <- substr(samples.out, 1, 999)
plot <- substr(samples.out, 0, 3)
depth <- substr(samples.out, 6, 11)
treatment <- substr(samples.out, 13, 13)
temp <- substr(samples.out, 14, 14)
pair <- substr(samples.out, 15, 15)

We have a total of 15 samples, 5 sample pairs, from 10 plots (1 depth). There are 2 treatments corresponding to 3 different temperature regimes.

And finally we define a sample data frame that holds the different groups we extracted from the sample names.

Code

#define a sample data frame
samdf <- data.frame(SamName = sample_name,
                    PLOT = plot,
                    DEPTH = depth,
                    TREAT = treatment,
                    TEMP = temp,
                    PAIR = pair)
rownames(samdf) <- samples.out

For example, P07_D00_010_W3D is the sample heated at +3C from plot P07, collected from a depth of 00_010cm. This sample is in group D which consists of P07_D00_010_W3D, P07_D00_010_W8D, P08_D00_010_C0D.

Create a Phyloseq Object

A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.

Note

A phyloseq object contains ASV table (taxa abundances), sample metadata, and taxonomy table (mapping between ASVs and higher-level taxonomic classifications).

Code

# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
                   sample_data(samdf), tax_table(tax_silva))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))

# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))

[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"

So the complete data set contains 20332 ASVs. We can use the microbiome R package (Lahti, Sudarshan, et al. 2017) to get some additional summary data from the phyloseq object.

Metric	Results
Min. number of reads	25151
Max. number of reads	86841
Total number of reads	937761
Average number of reads	62517
Median number of reads	62171
Total ASVs	20332
Sparsity	0.898
Any ASVs sum to 1 or less?	FALSE
Number of singleton ASVs	NA
Percent of ASVs that are singletons	NA
Number of sample variables are:	6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.

colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
ps

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20332 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20332 taxa by 8 taxonomic ranks ]

C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.

write.table(tax_table(ps),
            "files/data-prep/tables/ssu18_full_tax_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps)),
            "files/data-prep/tables/ssu18_full_seq_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)

Remove Contaminants & Unwanted Taxa

Let’s see if we have any potential contaminants. We can use some inline R code to see the taxonomy table for any taxa of interest.

Note

The code (hidden by default) is written as `r "Chloroplast" %in% tax_table(ps)`.

Are Mitochondria present? TRUE
Are Chloroplast present? TRUE
Are Eukaryota present? TRUE

Let’s remove these taxa—Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from plants. We must do each of these in turn using phyloseq and it gets a little messy.

Why messy? The subset_taxa command removes anything that is NA for the specified taxonomic level or above. For example, lets say you run the subset_taxa command using Family != "Mitochondria“. Seems like you should get a phyloseq object with everything except Mitochondria. But actually the command not only gets rid of Mitochondria but everything else that has NA for Family and above. In my experience this is not well documented and I had to dig through the files to figure out what was happening.

Anyway, to remove the taxa we do the following:

Subset the taxa and generate a ps object of just the taxa of interest,
Select the ASV column only, turn it into a factor, and use this to remove from the ps object.

Remove Mitochondria ASVs

Remember the original data set contained 20332 ASVs. Here we generate a file with mitochondria ASVs only.

MT1 <- subset_taxa(ps, Family == "Mitochondria")
MT1 <-  as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps), MT1df)
ps_no_mito <- prune_taxa(goodTaxa, ps)
ps_no_mito

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20291 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20291 taxa by 8 taxonomic ranks ]

Looks like this removed 41 Mitochondria ASVs. We will duplicate the code block to remove other groups.

Remove Chloroplast ASVs

And again with Chloroplast ASVs only.

CH1 <- subset_taxa(ps_no_mito, Order == "Chloroplast")
CH1 <-  as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_no_mito), CH1df)
ps_no_chloro <- prune_taxa(goodTaxa, ps_no_mito)
ps_no_chloro

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20274 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20274 taxa by 8 taxonomic ranks ]

The code removed an additional 17 Chloroplast ASVs.

Remove Eukaryota ASVs

And again with Eukaryota ASVs only.

EU1 <- subset_taxa(ps_no_chloro, Kingdom == "Eukaryota")
EU1 <-  as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_no_chloro), EU1df)
ps_no_euk <- prune_taxa(goodTaxa, ps_no_chloro)
ps_no_euk

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20223 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20223 taxa by 8 taxonomic ranks ]

The code removed an additional 51 Eukaryota ASVs from the ps object.

Remove any Kingdom NAs

Here we can just use the straight up subset_taxa command since we do not need to worry about any ranks above Kingdom also being removed.

ps_filt <- subset_taxa(ps_no_euk, !is.na(Kingdom))

ps_work_o <- ps_filt

The code eliminated an additional 50 Kingdom level NA ASVs from the phyloseq object.

Rename NA taxonomic ranks

Phyloseq has an odd way of dealing with taxonomic ranks that have no value—in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to have a value of the next highest classified rank. For example, ASV26 is not classified at the Genus level but is at Family level (Xanthobacteraceae). So we change the Genus name to Family_Xanthobacteraceae. The code for comes from these two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.

One thing this code does is reassign the functions class and order to taxonomic ranks. This can cause issues if you need these functions.

So you need to run something like this rm(class, order, phylum, kingdom) at the end of the code to remove these as variables. For now, I have not come up with a better solution.

Code

tax.clean <- data.frame(tax_table(ps_filt))
for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])}
tax.clean[is.na(tax.clean)] <- ""

for (i in 1:nrow(tax.clean)){
    if (tax.clean[i,2] == ""){
        kingdom <- base::paste("k_", tax.clean[i,1], sep = "")
        tax.clean[i, 2:6] <- kingdom
    } else if (tax.clean[i,3] == ""){
        phylum <- base::paste("p_", tax.clean[i,2], sep = "")
        tax.clean[i, 3:6] <- phylum
    } else if (tax.clean[i,4] == ""){
        class <- base::paste("c_", tax.clean[i,3], sep = "")
        tax.clean[i, 4:6] <- class
    } else if (tax.clean[i,5] == ""){
        order <- base::paste("o_", tax.clean[i,4], sep = "")
        tax.clean[i, 5:6] <- order
    } else if (tax.clean[i,6] == ""){
        tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_")
        }
}
tax_table(ps_filt) <- as.matrix(tax.clean)
rank_names(ps_filt)
rm(class, order, phylum, kingdom)

Still the same ranks. That’s good. What about the new groups? Let’s take a peak at some families.

head(get_taxa_unique(ps_filt, "Family"), 16)

 [1] "Planococcaceae"         "Chthoniobacteraceae"    "Nitrososphaeraceae"    
 [4] "p_RCP2-54"              "o_Rokubacteriales"      "Methyloligellaceae"    
 [7] "Nitrospiraceae"         "Xanthobacteraceae"      "o_Vicinamibacterales"  
[10] "Vicinamibacteraceae"    "c_Subgroup_5"           "Xiphinematobacteraceae"
[13] "Entotheonellaceae"      "Chitinophagaceae"       "Nitrosomonadaceae"     
[16] "o_Acidobacteriales"

Nice. Bye-bye NA

Note. The original code mentioned above is written for data sets that have species-level designations.

Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.

Finally, we rename the ps object. This is now our working data set.

ps_work <- ps_filt

Add Phylogenetic Tree

One final task is to add a phylogenetic tree to the phyloseq object.

ps_work_tree <- rtree(ntaxa(ps_work), rooted = TRUE,
                      tip.label = taxa_names(ps_work))
ps_work <- merge_phyloseq(ps_work,
                          sample_data,
                          ps_work_tree)
ps_work_o_tree <- rtree(ntaxa(ps_work_o), rooted = TRUE,
                      tip.label = taxa_names(ps_work_o))
ps_work_o <- merge_phyloseq(ps_work_o,
                          sample_data,
                          ps_work_o_tree)

Data Set Summary

Now that we have removed the NC and sample with zero reads, we can summarize the data set. Let’s start with the in our working phyloseq object using the summarize_phyloseq from the microbiome R package (Lahti, Sudarshan, et al. 2017) as we did above.

Metric	Results
Min. number of reads	25088
Max. number of reads	86600
Total number of reads	936640
Average number of reads	62443
Median number of reads	62117
Total ASVs	20173
Sparsity	0.891
Any ASVs sum to 1 or less?	FALSE
Number of singleton ASVs	NA
Percent of ASVs that are singletons	NA
Number of sample variables are:	6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample.

Code

total_reads <- sample_sums(ps_work)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")

total_asvs <- estimate_richness(ps_work,
                                measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")
total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID)

And then we merge these two data frames with the sample data frame.

Code

sam_details <- samdf
rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "Plot", "Depth",
                                  "Treatment", "Temp", "Pair")
tmp_tab1 <- seq_table
tmp_tab1[, c(2:7,11)] <- NULL
tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>%
                left_join(., total_asvs) %>%
                left_join(., tmp_tab1)

reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads 
asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed 
tmp_tab2$reads_lost <- reads_lost
tmp_tab2$asvs_lost <- asvs_lost

seq_table2 <- tmp_tab2[, c(1:8,12,13)]
colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH",
                          "TREAT", "TEMP", "PAIR", "total_reads",
                          "total_ASVs", "reads_lost", "ASVs_lost")

(16S rRNA) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.

Review

We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.

The full phyloseq object before removing contaminants.

ps

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20332 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20332 taxa by 8 taxonomic ranks ]

head(get_taxa_unique(ps, "Family"), 16)

 [1] "Planococcaceae"         "Chthoniobacteraceae"    "Nitrososphaeraceae"    
 [4] NA                       "Methyloligellaceae"     "Nitrospiraceae"        
 [7] "Xanthobacteraceae"      "Vicinamibacteraceae"    "Xiphinematobacteraceae"
[10] "Entotheonellaceae"      "Chitinophagaceae"       "Nitrosomonadaceae"     
[13] "B1-7BS"                 "WX65"                   "Gaiellaceae"           
[16] "Steroidobacteraceae"

The phyloseq object before changing NA ranks.

ps_work_o

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20173 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20173 taxa by 8 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 20173 tips and 20172 internal nodes ]

head(get_taxa_unique(ps_work_o, "Family"), 16)

 [1] "A4b"                            "Burkholderiaceae"              
 [3] "Peptococcaceae"                 NA                              
 [5] "Polyangiaceae"                  "Pedosphaeraceae"               
 [7] "AKYH767"                        "Micromonosporaceae"            
 [9] "Acidobacteriaceae_(Subgroup_1)" "Pyrinomonadaceae"              
[11] "Clostridiaceae"                 "Myxococcaceae"                 
[13] "Hyphomonadaceae"                "Roseiflexaceae"                
[15] "type_III"                       "Ktedonobacteraceae"

The phyloseq object after changing NA ranks.

ps_work

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 20173 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 20173 taxa by 8 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 20173 tips and 20172 internal nodes ]

head(get_taxa_unique(ps_work, "Family"), 16)

 [1] "c_Subgroup_22"         "c_OM190"               "A4b"                  
 [4] "Diplorickettsiaceae"   "k_Bacteria"            "Gemmataceae"          
 [7] "c_Gammaproteobacteria" "Microscillaceae"       "Entotheonellaceae"    
[10] "Syntrophaceae"         "p_WS2"                 "Verrucomicrobiaceae"  
[13] "Pyrinomonadaceae"      "Caulobacteraceae"      "Parachlamydiaceae"    
[16] "c_Phycisphaerae"

Final Steps

The last thing tasks to complete are to a) save copies of the taxonomy and sequence tables and b) save an image of the workflow for the next section.

write.table(tax_table(ps_work),
            "files/data-prep/tables/ssu18_work_tax_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)
write.table(t(otu_table(ps_work)),
            "files/data-prep/tables/ssu18_work_seq_table.txt",
            sep = "\t", quote = FALSE, col.names = NA)

ITS Data

Prerequisites

In order to run this workflow, you either need to run the corresponding DADA2 Workflow for 2018 ITS or begin with the output from that workflow, data_prep_its18_wf.rdata. See the Data Availability page for complete details.

Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to prepare the 2018 ITS data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects.

Read Counts Assessment

Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 workflow. We need to inspect how total reads changed from to beginning to the end of the workflow. While we are at it, let’s create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the summary table at the end of the DADA2 section.

Table headers are as follows:

Header	Description
`Sample ID`	the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair
`Plot`	the experimental plot number
`Depth`	the depth where the sample was collected
`Treatment`	warm or control
`Temp`	temperature of soil heating (0C, +4C, or +8C)
`Pair`	control/treatment coupling
`input`	number of raw reads
`nochim`	final read count after removing chimeras
`Remain`	percent of reads remaining from `input` to `nonchim`
`FastqID`	base name of the fastq file

(ITS) Table 1 | Sample summary table including read changes at start and end of DADA2 workflow.

The following samples have fewer than 1000 reads at the end of the workflow:

P05_D00_010_W3C, P09_D00_010_W8E.

We can also plot the final read count by the temperature where the sample was collected.

Defining Groups

Load the data packet produced in the final step of the DADA2 workflow. This packet (its18_dada2_wf.rdata) contains the ASV-by-sample table and the ASV taxonomy table.
Rename the samples so names have plot and Depth info. For convenience, we renamed any Negative Controls (NC) to P00_D00-000_NNN.
After we load the data packet we next need to format sample names and define groups.

Code

seq_table <- seq_table[order(seq_table$`root name`), ]
tmp_names <- seq_table$`Sample_ID`
load("files/dada2/rdata/its18_dada2_wf.rdata")
identical(seq_table$`root name`, rownames(seqtab))
rownames(seqtab) <- tmp_names
samples.out <- rownames(seqtab)
sample_name <- substr(samples.out, 1, 999)
plot <- substr(samples.out, 0, 3)
depth <- substr(samples.out, 6, 11)
treatment <- substr(samples.out, 13, 13)
temp <- substr(samples.out, 14, 14)
pair <- substr(samples.out, 15, 15)

We have a total of 15 samples, 5 sample pairs, from 10 plots (1 depth). There are 2 treatments corresponding to 3 different temperature regimes.

And finally we define a sample data frame that holds the different groups we extracted from the sample names.

Code

samdf <- data.frame(SamName = sample_name,
                    PLOT = plot,
                    DEPTH = depth,
                    TREAT = treatment,
                    TEMP = temp,
                    PAIR = pair)
rownames(samdf) <- samples.out

For example, P08_D00_010_C0D is the sample heated at +0C from plot P08, collected from a depth of 00_010cm. This sample is in group D which consists of P07_D00_010_W3D, P07_D00_010_W8D, P08_D00_010_C0D.

Create a Phyloseq Object

Fix Taxon Names

The UNITE database we used for classification adds a little prefix to every rank call, like k__ for Kingdom, o__ for Order, etc. To keep in step with our other analyses, we will remove these tags. Of course, we also make a ps object with the original names just in case. At this point we will also remove the Species rank since it is essentially useless.

tax_mod_o <- data.frame(tax)
tax_mod_o$Species <- NULL
tax_mod_o <- as.matrix(tax_mod_o)
tax_mod <- sub("k__", "", tax_mod_o)
tax_mod <- sub("p__", "", tax_mod)
tax_mod <- sub("c__", "", tax_mod)
tax_mod <- sub("o__", "", tax_mod)
tax_mod <- sub("f__", "", tax_mod)
tax_mod <- sub("g__", "", tax_mod)

Note

A phyloseq object contains ASV table (taxa abundances), sample metadata, and taxonomy table (mapping between ASVs and higher-level taxonomic classifications).

Code

# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
                   sample_data(samdf), tax_table(tax_mod))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))

# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))
head(taxa_names(ps))

[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"

So the complete data set contains 3357 ASVs. We can also use the microbiome R package (Lahti, Sudarshan, et al. 2017) to get some additional summary data from the phyloseq object.

Metric	Results
Min. number of reads	80
Max. number of reads	64636
Total number of reads	491143
Average number of reads	32743
Median number of reads	36063
Total ASVs	3357
Sparsity	0.802
Any ASVs sum to 1 or less?	TRUE
Number of singleton ASVs	3
Percent of ASVs that are singletons	0.089
Number of sample variables are:	6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.

colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
rank_names(ps)

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]

C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.

Code

write.table(tax_table(ps),
            "files/data-prep/tables/its18_full_tax_table.txt",
            sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps)),
            "files/data-prep/tables/its18_full_seq_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

At this point we would normally remove unwanted taxa and/or negative controls. Since we do not have any of these for this dataset, we are going to save the phyloseq object with a new name. That way, if we do decide later on to remove some taxa it will be easier to plug in the code here and then continue with the rest of the workflow.

ps_filt <- ps

Remove Low-Count Samples

Next, we can remove samples with really low read counts, say less than 500 reads.

ps_filt_no_low <- prune_samples(sample_sums(ps_filt) > 500, ps_filt)

ps_filt_no_low

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]

We lost 2 sample(s). After removing samples we need to check whether any ASVs ended up with no reads.

no_reads <- taxa_sums(ps_filt_no_low) == 0

And we lost 2 ASV(s). So now we must remove these from the ps object.

ps_filt_no_nc <- prune_taxa(taxa_sums(ps_filt_no_low) > 0, ps_filt_no_low)

ps_filt_no_nc

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3355 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3355 taxa by 8 taxonomic ranks ]

Rename NA taxonomic ranks

Phyloseq has an odd way of dealing with taxonomic ranks that have no value—in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NAs to have a value of the next highest classified rank. For example, ASV4 is not classified at the Genus level but is at Family level (Nitrososphaeraceae). So we change the Genus name to Family_Nitrososphaeraceae. The code for comes from these two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990.

One thing this code does is reassign the functions class and order to taxonomic ranks. This can cause issues if you need these functions.

So you need to run something like this rm(class, order, phylum, kingdom) at the end of the code to remove these as variables. For now, I have not come up with a better solution.

Before doing that, we will save a copy of the phyloseq object with the original taxa names.

ps_work_o <- ps_filt_no_nc

Code

tax.clean <- data.frame(tax_table(ps_filt_no_nc))
for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])}
tax.clean[is.na(tax.clean)] <- ""

for (i in 1:nrow(tax.clean)){
    if (tax.clean[i,2] == ""){
        kingdom <- base::paste("k_", tax.clean[i,1], sep = "")
        tax.clean[i, 2:6] <- kingdom
    } else if (tax.clean[i,3] == ""){
        phylum <- base::paste("p_", tax.clean[i,2], sep = "")
        tax.clean[i, 3:6] <- phylum
    } else if (tax.clean[i,4] == ""){
        class <- base::paste("c_", tax.clean[i,3], sep = "")
        tax.clean[i, 4:6] <- class
    } else if (tax.clean[i,5] == ""){
        order <- base::paste("o_", tax.clean[i,4], sep = "")
        tax.clean[i, 5:6] <- order
    } else if (tax.clean[i,6] == ""){
        tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_")
        }
}
tax_table(ps_filt_no_nc) <- as.matrix(tax.clean)
rank_names(ps_filt_no_nc)
rm(class, order, phylum, kingdom)

Still the same ranks. That’s good. What about the new groups? Let’s take a peak at some families.

head(get_taxa_unique(ps_filt_no_nc, "Family"), 16)

 [1] "Hygrophoraceae"         "Mycosphaerellaceae"     "p_Ascomycota"          
 [4] "Trichosporonaceae"      "c_Agaricomycetes"       "Archaeorhizomycetaceae"
 [7] "Chaetomellaceae"        "Geastraceae"            "k_Fungi"               
[10] "Metschnikowiaceae"      "Clavariaceae"           "o_Agaricales"          
[13] "Ascobolaceae"           "Trichocomaceae"         "Piptocephalidaceae"    
[16] "p_Chytridiomycota"

Nice. Adios NA.

Note. The original code is written for data sets that have species-level designations.

Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you.

Finally, we rename the ps object. This is now our working data set.

ps_work <- ps_filt_no_nc

Data Set Summary

Now that we have removed the NC and samples with low reads, we can summarize the data in our working phyloseq object. Let’s start with the data set as a whole using the summarize_phyloseq from the microbiome R package (Lahti, Sudarshan, et al. 2017) as we did above.

Metric	Results
Min. number of reads	9172
Max. number of reads	64636
Total number of reads	490767
Average number of reads	37751
Median number of reads	38371
Total ASVs	3355
Sparsity	0.671
Any ASVs sum to 1 or less?	TRUE
Number of singleton ASVs	5
Percent of ASVs that are singletons	0.149
Number of sample variables are:	6 (SamName, PLOT, DEPTH, TREAT, TEMP, PAIR)

Code

total_reads <- sample_sums(ps_work)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")

total_asvs <- estimate_richness(ps_work, measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")
total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID)

And then we merge these two data frames with the sample data frame.

Code

rm(list = ls(pattern = "tmp_"))

sam_details <- samdf
rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "Plot", "Depth",
                                  "Treatment", "Temp", "Pair")
tmp_tab1 <- seq_table
tmp_tab1[, c(2:7,11)] <- NULL
tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>%
                left_join(., total_asvs) %>%
                left_join(., tmp_tab1)

reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads 
asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed 
tmp_tab2$reads_lost <- reads_lost
tmp_tab2$asvs_lost <- asvs_lost

seq_table2 <- tmp_tab2[, c(1:8,12,13)]
colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH",
                          "TREAT", "TEMP", "PAIR", "total_reads",
                          "total_ASVs", "reads_lost", "ASVs_lost")

(ITS) Table 2 | Sample summary table showing the final number of reads and ASVs (per sample) as well as the number of reads/ASVs lost during curation.

Review

We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have.

The full phyloseq object before removing contaminants.

ps

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]

head(get_taxa_unique(ps, "Family"), 16)

 [1] "Hygrophoraceae"                  "Mycosphaerellaceae"             
 [3] NA                                "Trichosporonaceae"              
 [5] "Archaeorhizomycetaceae"          "Chaetomellaceae"                
 [7] "Geastraceae"                     "Metschnikowiaceae"              
 [9] "Clavariaceae"                    "Ascobolaceae"                   
[11] "Trichocomaceae"                  "Piptocephalidaceae"             
[13] "Mortierellaceae"                 "Pleosporales_fam_Incertae_sedis"
[15] "Glomeraceae"                     "Ceratobasidiaceae"

The phyloseq object before removing the prefix of taxon ranks. This still has low coverage samples.

ps_o

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3357 taxa and 15 samples ]
sample_data() Sample Data:       [ 15 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3357 taxa by 8 taxonomic ranks ]

head(get_taxa_unique(ps_o, "Family"), 16)

 [1] "f__Hygrophoraceae"                  "f__Mycosphaerellaceae"             
 [3] NA                                   "f__Trichosporonaceae"              
 [5] "f__Archaeorhizomycetaceae"          "f__Chaetomellaceae"                
 [7] "f__Geastraceae"                     "f__Metschnikowiaceae"              
 [9] "f__Clavariaceae"                    "f__Ascobolaceae"                   
[11] "f__Trichocomaceae"                  "f__Piptocephalidaceae"             
[13] "f__Mortierellaceae"                 "f__Pleosporales_fam_Incertae_sedis"
[15] "f__Glomeraceae"                     "f__Ceratobasidiaceae"

The phyloseq object before changing NA ranks.

ps_work_o

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3355 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3355 taxa by 8 taxonomic ranks ]

head(get_taxa_unique(ps_work_o, "Family"), 16)

 [1] "Hygrophoraceae"                  "Mycosphaerellaceae"             
 [3] NA                                "Trichosporonaceae"              
 [5] "Archaeorhizomycetaceae"          "Chaetomellaceae"                
 [7] "Geastraceae"                     "Metschnikowiaceae"              
 [9] "Clavariaceae"                    "Ascobolaceae"                   
[11] "Trichocomaceae"                  "Piptocephalidaceae"             
[13] "Mortierellaceae"                 "Pleosporales_fam_Incertae_sedis"
[15] "Glomeraceae"                     "Ceratobasidiaceae"

The phyloseq object after changing NA ranks.

ps_work

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3355 taxa and 13 samples ]
sample_data() Sample Data:       [ 13 samples by 6 sample variables ]
tax_table()   Taxonomy Table:    [ 3355 taxa by 8 taxonomic ranks ]

head(get_taxa_unique(ps_work, "Family"), 16)

 [1] "Hygrophoraceae"         "Mycosphaerellaceae"     "p_Ascomycota"          
 [4] "Trichosporonaceae"      "c_Agaricomycetes"       "Archaeorhizomycetaceae"
 [7] "Chaetomellaceae"        "Geastraceae"            "k_Fungi"               
[10] "Metschnikowiaceae"      "Clavariaceae"           "o_Agaricales"          
[13] "Ascobolaceae"           "Trichocomaceae"         "Piptocephalidaceae"    
[16] "p_Chytridiomycota"

Final Steps

The last thing tasks to complete are to a) save copies of the taxonomy and sequence tables and b) save an image of the workflow for the next part of the workflow.

Code

write.table(tax_table(ps_work),
            "files/data-prep/tables/its18_work_tax_table.txt",
            sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_work)),
            "files/data-prep/tables/its18_work_seq_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

Workflow Output

Data products generated in this workflow can be downloaded from figshare.

That’s all for this part!

Next workflow:
3. Filtering Previous workflow:
1. DADA2 Workflow

Source Code

The source code for this page can be accessed on GitHub by clicking this link.

Data Availability

Data generated in this workflow and the .Rdata need to run the workflow can be accessed on figshare at 10.25573/data.14690739.

Last updated on

[1] "2022-06-29 07:31:56 EST"

References

Lahti, Leo, Shetty Sudarshan, et al. 2017. “Tools for Microbiome Analysis in r. Version 1.9.97.” https://github.com/microbiome/microbiome.

McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An r Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS One 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.

--- title: "2. Data Set Preparation" description: | Workflow for preparation of the 16S rRNA and ITS data sets. These steps are needed before analyzing the data. In this workflow, sample groups are defined, and phyloseq objects are created and curated. --- <details markdown="1"> <summary><strong>Click here</strong> for setup information.</summary> ```{r} #| message: false #| results: hide knitr::opts_chunk$set(echo = TRUE, eval = FALSE) set.seed(119) #library(conflicted) #pacman::p_depends(ape, local = TRUE) #pacman::p_depends_reverse(ape, local = TRUE) library(phyloseq); packageVersion("phyloseq") library(Biostrings); packageVersion("Biostrings") pacman::p_load(tidyverse, DT, microbiome, taxa, formatR, captioner, reactable, downloadthis, fontawesome, metacoder, ampvis2, ape, statnet.common, install = FALSE, update = FALSE) options(scipen=999) knitr::opts_current$get(c( "cache", "cache.path", "cache.rebuild", "dependson", "autodep" )) ``` </details> # Workflow Input > You will either need to run the [DADA2 workflow](dada2.html) or download the files linked below. See the [Data Availability](data-availability.html) page for complete details. Files needed to run this workflow can be downloaded from figshare. These are the output files from the [DADA2 workflow](dada2.html) page. <iframe src="https://widgets.figshare.com/articles/14687184/embed?show_title=1" width="100%" height="351" allowfullscreen frameborder="0"></iframe> These workflows share many common variable names so you must split the workflow into a script for each data set OR run the command `remove(list = ls())` before beginning the next workflow. # 16s rRNA Data ```{r} #| include: false #| eval: true ## ONLY FOR PAGE BUILD remove(list = ls()) load("page_build/data_prep_ssu18_wf.rdata") ``` ```{r} #| echo: false #| eval: true #| results: hide # Create the caption(s) with captioner caption_tab <- captioner(prefix = "(16S rRNA) Table", suffix = " |", style = "b") caption_fig <- captioner(prefix = "(16S rRNA) Figure", suffix = " |", style = "b") # Create a function for referring to the tables in text ref <- function(x) str_extract(x, "[^|]*") %>% trimws(which = "right", whitespace = "[ ]") ``` ```{r} #| echo: false #| eval: true source("assets/captions/captions_data_prep_ssu.R") ``` ## Prerequisites In order to run this workflow, you either need to run the corresponding [DADA2 Workflow for 16S rRNA](dada2-high-temp.html#s-rrna-data.html) or begin with the output from that workflow, `data_prep_ssu18_wf.rdata`. See the [Data Availability](data-availability.html) page for complete details. Unless otherwise noted, we primarily use [phyloseq](https://joey711.github.io/phyloseq/) [@mcmurdie2013phyloseq] in this section of the workflow to prepare the 2018 16S rRNA data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects. ## Read Counts Assessment Before we begin, let's create a summary table containing some basic sample metadata and the read count data from the [DADA2 workflow](dada2-high-temp.html#s-rrna-data.html). We need to inspect how total reads changed through the workflow. While we are at it, let's create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the [summary table](dada2-high-temp.html#track-reads-through-workflow) at the end of the DADA2 section. Table headers are as follows: | Header | Description | |---------------------------|------------------------------------------------------------------------------------------| | `Sample ID` | the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair | | `Plot` | the experimental plot number | | `Depth` | the depth where the sample was collected | | `Treatment` | warm or control | | `Temp` | temperature of soil heating (0C, +4C, or +8C) | | `Pair` | control/treatment coupling | | `input` | number of raw reads | | `nochim` | final read count after removing chimeras | | `Remain` | percent of reads remaining from `input` to `nonchim` | | `FastqID` | base name of the fastq file | <br/> ```{r} #| echo: false tmp_tab1 <- read.table( "files/dada2/tables/ssu18_sample_seq_info.txt", header = TRUE, sep = "\t") tmp_tab1$root <- gsub("_.*", "", tmp_tab1$Fastq_ID_Forward) tmp_tab1[9:ncol(tmp_tab1)-1] <- NULL tmp_tab1 <- tmp_tab1[, c(1,8,2:7)] tmp_tab2 <- read.table( "files/dada2/tables/ssu18_read_changes.txt", header = TRUE, sep = "\t") tmp_tab2[2:6] <- NULL tmp_tab3 <- read.table( "files/dada2/tables/ssu18_dada_compare_by_samp.txt", header = TRUE, sep = "\t") tmp_tab3 <- tmp_tab3[c(1,5)] seq_table <- left_join(tmp_tab1, tmp_tab2, by = c("Sample_ID" = "Sample_ID")) %>% left_join(., tmp_tab3, by = "Sample_ID") seq_table$per_reads_kept <- round(seq_table$nonchim/seq_table$raw_reads, digits = 3) colnames(seq_table) <- c("Sample_ID", "root name", "Plot", "Treatment", "Temp", "Depth", "Pair", "input reads", "final reads", "no. ASVs", "per. reads remain") rm(list = ls(pattern = "tmp_")) ``` <small>`r caption_tab("seq_summary_ssu")`</small> ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "ssu18_sample_summary", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads") ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ```{r} #| echo: false wong_pal <- c("#D55E00", "#56B4E9", "#009E73", "#0072B2", "#F0E442", "#CC79A7", "#E69F00", "#7F7F7F", "#000000") ``` ## Defining Groups 1. Load the data packet produced in the final step of the DADA2 workflow. This packet (`ssu18_dada2_wf.rdata`) contains the ASV-by-sample table and the ASV taxonomy table. 2. Rename the samples so names have plot and Depth info. 3. After we load the data packet, we next need to format sample names and define groups. ```{r} #| code-fold: true seq_table <- seq_table[order(seq_table$`root name`), ] tmp_names <- seq_table$`Sample_ID` load("files/dada2/rdata/ssu18_dada2_wf.rdata") identical(seq_table$`root name`, rownames(seqtab)) rownames(seqtab) <- tmp_names samples.out <- rownames(seqtab) sample_name <- substr(samples.out, 1, 999) plot <- substr(samples.out, 0, 3) depth <- substr(samples.out, 6, 11) treatment <- substr(samples.out, 13, 13) temp <- substr(samples.out, 14, 14) pair <- substr(samples.out, 15, 15) ``` We have a total of **`r length(unique(sample_name))`** samples, **`r length(unique(pair))`** sample pairs, from **`r length(unique(plot))`** plots (**`r length(unique(depth))`** depth). There are **`r length(unique(treatment))`** treatments corresponding to **`r length(unique(temp))`** different temperature regimes. 4. And finally we define a sample data frame that holds the different groups we extracted from the sample names. ```{r} #| code-fold: true #define a sample data frame samdf <- data.frame(SamName = sample_name, PLOT = plot, DEPTH = depth, TREAT = treatment, TEMP = temp, PAIR = pair) rownames(samdf) <- samples.out ``` ```{r} #| echo: false #| eval: true example_sam <- "P07_D00_010_W3D" qpair <- samdf[example_sam,]$PAIR qgroup <- samdf %>% dplyr::filter(PAIR == qpair) ``` For example, `r example_sam` is the sample heated at **+`r samdf[example_sam,]$TEMP`C** from plot **`r samdf[example_sam,]$PLOT`**, collected from a depth of `r samdf[example_sam,]$DEPTH`cm. This sample is in group **`r qpair`** which consists of `r qgroup$SamName`. ## Create a Phyloseq Object **A**. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention---ASV1, ASV2, ASV3, and so on. :::{.callout-note} A phyloseq object contains ASV table (taxa abundances), sample metadata, and taxonomy table (mapping between ASVs and higher-level taxonomic classifications). ::: ```{r} #| code-fold: true # this create the phyloseq object ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE), sample_data(samdf), tax_table(tax_silva)) tax_table(ps) <- cbind(tax_table(ps), rownames(tax_table(ps))) # adding unique ASV names taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps))) tax_table(ps) <- cbind(tax_table(ps), rownames(tax_table(ps))) ``` ```{r} #| echo: false #| eval: true head(taxa_names(ps)) ``` So the complete data set contains `r ntaxa(ps)` ASVs. We can use the [microbiome R package](https://github.com/microbiome/microbiome/) [@lahti2017microbiome] to get some additional summary data from the phyloseq object. ```{r} #| echo: false min_read_ps <- min(readcount(ps)) max_read_ps <- max(readcount(ps)) total_reads_ps <- sum(readcount(ps)) mean_reads_ps <- round(mean(readcount(ps)), digits = 0) median_reads_ps <- median(readcount(ps)) total_asvs_ps <- ntaxa(ps) singleton_ps <- tryCatch(ntaxa(rare(ps, detection = 1, prevalence = 0)), error=function(err) NA) singleton_ps_perc <- tryCatch(round((100*(ntaxa(rare(ps, detection = 1, prevalence = 0)) / ntaxa(ps))), digits = 3), error=function(err) NA) sparsity_ps <- round(length(which(abundances(ps) == 0))/length(abundances(ps)), digits = 3) ``` | Metric | Results | |-------------------------------------|------------------------------------------------------| | Min. number of reads | `r min_read_ps` | | Max. number of reads | `r max_read_ps` | | Total number of reads | `r total_reads_ps` | | Average number of reads | `r mean_reads_ps` | | Median number of reads | `r median_reads_ps` | | Total ASVs | `r total_asvs_ps` | | Sparsity | `r sparsity_ps` | | Any ASVs sum to 1 or less? | `r isTRUE(singleton_ps >= 1)` | | Number of singleton ASVs | `r singleton_ps` | | Percent of ASVs that are singletons | `r singleton_ps_perc` | | Number of sample variables are: | `r length(sample_data(ps))` (`r colnames(samdf)`) | **B**. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object. ```{r} colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "ASV_SEQ", "ASV_ID") ps ``` ```{r} #| echo: false #| eval: true ps ``` **C**. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix `full` to indicate that these are the *raw* outputs. ```{r} write.table(tax_table(ps), "files/data-prep/tables/ssu18_full_tax_table.txt", sep = "\t", quote = FALSE, col.names = NA) write.table(t(otu_table(ps)), "files/data-prep/tables/ssu18_full_seq_table.txt", sep = "\t", quote = FALSE, col.names = NA) ``` ## Remove Contaminants & Unwanted Taxa Let's see if we have any potential contaminants. We can use some [inline R code](https://rmarkdown.rstudio.com/lesson-4.html) to see the taxonomy table for any taxa of interest. :::{.callout-note} The code (hidden by default) is written as `` `r "Chloroplast" %in% tax_table(ps)` ``. ::: - Are Mitochondria present? `r "Mitochondria" %in% tax_table(ps)` - Are Chloroplast present? `r "Chloroplast" %in% tax_table(ps)` - Are Eukaryota present? `r "Eukaryota" %in% tax_table(ps)` Let's remove these taxa---Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from plants. We must do each of these in turn using phyloseq and it gets a little messy. Why messy? The `subset_taxa` command removes anything that is `NA` for the specified taxonomic level or above. For example, lets say you run the `subset_taxa` command using `Family != "Mitochondria`". Seems like you should get a phyloseq object with everything except Mitochondria. But actually the command not only gets rid of Mitochondria but everything else that has `NA` for Family and above. In my experience this is not well documented and I had to dig through the files to figure out what was happening. Anyway, to remove the taxa we do the following: * Subset the taxa and generate a `ps` object of just the taxa of interest, * Select the ASV column only, turn it into a factor, and use this to remove <INSERT TAXA> from the `ps` object. ### Remove Mitochondria ASVs Remember the original data set contained `r ntaxa(ps)` ASVs. Here we generate a file with mitochondria ASVs only. ```{r} MT1 <- subset_taxa(ps, Family == "Mitochondria") MT1 <- as(tax_table(MT1), "matrix") MT1 <- MT1[, 8] MT1df <- as.factor(MT1) goodTaxa <- setdiff(taxa_names(ps), MT1df) ps_no_mito <- prune_taxa(goodTaxa, ps) ps_no_mito ``` ```{r} #| echo: false #| eval: true ps_no_mito ``` Looks like this removed **`r ntaxa(ps) - ntaxa(ps_no_mito)` Mitochondria ASVs**. We will duplicate the code block to remove other groups. ### Remove Chloroplast ASVs And again with Chloroplast ASVs only. ```{r} CH1 <- subset_taxa(ps_no_mito, Order == "Chloroplast") CH1 <- as(tax_table(CH1), "matrix") CH1 <- CH1[, 8] CH1df <- as.factor(CH1) goodTaxa <- setdiff(taxa_names(ps_no_mito), CH1df) ps_no_chloro <- prune_taxa(goodTaxa, ps_no_mito) ps_no_chloro ``` ```{r} #| echo: false #| eval: true ps_no_chloro ``` The code removed an additional **`r ntaxa(ps_no_mito) - ntaxa(ps_no_chloro)` Chloroplast ASVs**. ### Remove Eukaryota ASVs And again with Eukaryota ASVs only. ```{r} EU1 <- subset_taxa(ps_no_chloro, Kingdom == "Eukaryota") EU1 <- as(tax_table(EU1), "matrix") EU1 <- EU1[, 8] EU1df <- as.factor(EU1) goodTaxa <- setdiff(taxa_names(ps_no_chloro), EU1df) ps_no_euk <- prune_taxa(goodTaxa, ps_no_chloro) ps_no_euk ``` ```{r} #| echo: false #| eval: true ps_no_euk ``` The code removed an additional **`r ntaxa(ps_no_chloro) - ntaxa(ps_no_euk)` Eukaryota ASVs** from the `ps` object. ### Remove any Kingdom NAs Here we can just use the straight up `subset_taxa` command since we do not need to worry about any ranks above Kingdom also being removed. ```{r} ps_filt <- subset_taxa(ps_no_euk, !is.na(Kingdom)) ``` ```{r} ps_work_o <- ps_filt ``` The code eliminated an additional **`r ntaxa(ps_no_euk) - ntaxa(ps_filt)` Kingdom level NA ASVs** from the phyloseq object. ## Rename NA taxonomic ranks Phyloseq has an odd way of dealing with taxonomic ranks that have no value---in other words, **NA** in the tax table. The first thing we are going to do before moving forward is to change all of the *NA*s to have a value of the next highest classified rank. For example, `ASV26` is not classified at the Genus level but is at Family level (Xanthobacteraceae). So we change the Genus name to *Family_Xanthobacteraceae*. The code for comes from these two posts on the phyloseq GitHub, both by [MSMortensen](https://github.com/MSMortensen): issue [#850](https://github.com/joey711/phyloseq/issues/850#issuecomment-394771087) and issue [#990](https://github.com/joey711/phyloseq/issues/990#issuecomment-424618425). > One thing this code does is reassign the functions `class` and `order` to taxonomic ranks. This can cause issues if you need these functions. So you need to run something like this `rm(class, order, phylum, kingdom)` at the end of the code to remove these as variables. For now, I have not come up with a better solution. ```{r} #| code-fold: true tax.clean <- data.frame(tax_table(ps_filt)) for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])} tax.clean[is.na(tax.clean)] <- "" for (i in 1:nrow(tax.clean)){ if (tax.clean[i,2] == ""){ kingdom <- base::paste("k_", tax.clean[i,1], sep = "") tax.clean[i, 2:6] <- kingdom } else if (tax.clean[i,3] == ""){ phylum <- base::paste("p_", tax.clean[i,2], sep = "") tax.clean[i, 3:6] <- phylum } else if (tax.clean[i,4] == ""){ class <- base::paste("c_", tax.clean[i,3], sep = "") tax.clean[i, 4:6] <- class } else if (tax.clean[i,5] == ""){ order <- base::paste("o_", tax.clean[i,4], sep = "") tax.clean[i, 5:6] <- order } else if (tax.clean[i,6] == ""){ tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_") } } tax_table(ps_filt) <- as.matrix(tax.clean) rank_names(ps_filt) rm(class, order, phylum, kingdom) ``` Still the same ranks. That's good. What about the new groups? Let's take a peak at some families. ```{r} #| eval: true head(get_taxa_unique(ps_filt, "Family"), 16) ``` Nice. Bye-bye *NA* > *Note*. The original code mentioned above is written for data sets that have species-level designations. Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you. Finally, we rename the ps object. This is now our working data set. ```{r} ps_work <- ps_filt ``` ## Add Phylogenetic Tree One final task is to add a phylogenetic tree to the phyloseq object. ```{r} ps_work_tree <- rtree(ntaxa(ps_work), rooted = TRUE, tip.label = taxa_names(ps_work)) ps_work <- merge_phyloseq(ps_work, sample_data, ps_work_tree) ps_work_o_tree <- rtree(ntaxa(ps_work_o), rooted = TRUE, tip.label = taxa_names(ps_work_o)) ps_work_o <- merge_phyloseq(ps_work_o, sample_data, ps_work_o_tree) ``` ## Data Set Summary Now that we have removed the NC and sample with zero reads, we can summarize the data set. Let's start with the in our working phyloseq object using the `summarize_phyloseq` from the [microbiome R package](https://github.com/microbiome/microbiome/) [@lahti2017microbiome] as we did above. ```{r} #| echo: false min_read_ps_work <- min(readcount(ps_work)) max_read_ps_work <- max(readcount(ps_work)) total_reads_ps_work <- sum(readcount(ps_work)) mean_reads_ps_work <- round(mean(readcount(ps_work)), digits = 0) median_reads_ps_work <- median(readcount(ps_work)) total_asvs_ps_work <- ntaxa(ps_work) singleton_ps_work <- tryCatch(ntaxa(rare(ps_work, detection = 1, prevalence = 0)), error=function(err) NA) singleton_ps_work_perc <- tryCatch(round((100*(ntaxa(rare( ps_work, detection = 1, prevalence = 0)) / ntaxa(ps_work))), digits = 3), error = function(err) NA) sparsity_ps_work <- round(length(which( abundances(ps_work) == 0))/length(abundances(ps)), digits = 3) ``` | Metric | Results | |-------------------------------------|-----------------------------------------------------------| | Min. number of reads | `r min_read_ps_work` | | Max. number of reads | `r max_read_ps_work` | | Total number of reads | `r total_reads_ps_work` | | Average number of reads | `r mean_reads_ps_work` | | Median number of reads | `r median_reads_ps_work` | | Total ASVs | `r total_asvs_ps_work` | | Sparsity | `r sparsity_ps_work` | | Any ASVs sum to 1 or less? | `r isTRUE(singleton_ps_work >= 1)` | | Number of singleton ASVs | `r singleton_ps_work` | | Percent of ASVs that are singletons | `r singleton_ps_work_perc` | | Number of sample variables are: | `r length(sample_data(ps_work))` (`r colnames(samdf)`) | We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample. ```{r} #| code-fold: true total_reads <- sample_sums(ps_work) total_reads <- as.data.frame(total_reads, make.names = TRUE) total_reads <- total_reads %>% rownames_to_column("Sample_ID") total_asvs <- estimate_richness(ps_work, measures = "Observed") total_asvs <- total_asvs %>% rownames_to_column("Sample_ID") total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID) ``` And then we merge these two data frames with the sample data frame. ```{r} #| code-fold: true sam_details <- samdf rownames(sam_details) <- NULL colnames(sam_details) <- c("Sample_ID", "Plot", "Depth", "Treatment", "Temp", "Pair") tmp_tab1 <- seq_table tmp_tab1[, c(2:7,11)] <- NULL tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>% left_join(., total_asvs) %>% left_join(., tmp_tab1) reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed tmp_tab2$reads_lost <- reads_lost tmp_tab2$asvs_lost <- asvs_lost seq_table2 <- tmp_tab2[, c(1:8,12,13)] colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH", "TREAT", "TEMP", "PAIR", "total_reads", "total_ASVs", "reads_lost", "ASVs_lost") ``` <small>`r caption_tab("seq_summary_final_ssu")`</small> ```{r} #| echo: false #| eval: true seq_table2 %>% download_this( output_name = "ssu18_sample_summary_final", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table2, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 0), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads") ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ## Review We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have. The full phyloseq object *before* removing contaminants. ```{r} #| eval: true ps head(get_taxa_unique(ps, "Family"), 16) ``` The phyloseq object *before* changing NA ranks. ```{r} #| eval: true ps_work_o head(get_taxa_unique(ps_work_o, "Family"), 16) ``` The phyloseq object *after* changing NA ranks. ```{r} #| eval: true ps_work head(get_taxa_unique(ps_work, "Family"), 16) ``` ## Final Steps The last thing tasks to complete are to **a**) save copies of the taxonomy and sequence tables and **b**) save an image of the workflow for the next section. ```{r} write.table(tax_table(ps_work), "files/data-prep/tables/ssu18_work_tax_table.txt", sep = "\t", quote = FALSE, col.names = NA) write.table(t(otu_table(ps_work)), "files/data-prep/tables/ssu18_work_seq_table.txt", sep = "\t", quote = FALSE, col.names = NA) ``` ```{r} #| echo: false # PAGE BUILD save.image("page_build/data_prep_ssu18_wf.rdata") ``` ```{r} #| echo: false saveRDS(ps, "files/data-prep/rdata/ssu18_ps.rds") saveRDS(ps_work, "files/data-prep/rdata/ssu18_ps_work.rds") saveRDS(ps_work_o, "files/data-prep/rdata/ssu18_ps_work_o.rds") ``` # ITS Data ```{r} #| include: false #| eval: true ## ONLY FOR PAGE BUILD remove(list = ls()) load("page_build/data_prep_its18_wf.rdata") ``` ```{r} #| echo: false #| eval: true #| results: hide # Create the caption(s) with captioner caption_tab <- captioner(prefix = "(ITS) Table", suffix = " |", style = "b") caption_fig <- captioner(prefix = "(ITS) Figure", suffix = " |", style = "b") # Create a function for referring to the tables in text ref <- function(x) str_extract(x, "[^|]*") %>% trimws(which = "right", whitespace = "[ ]") ``` ```{r} #| echo: false #| eval: true source("assets/captions/captions_data_prep_its.R") ``` ## Prerequisites In order to run this workflow, you either need to run the corresponding [DADA2 Workflow for 2018 ITS](dada2-high-temp.html#its-data) or begin with the output from that workflow, `data_prep_its18_wf.rdata`. See the [Data Availability](data-availability.html) page for complete details. Unless otherwise noted, we primarily use [phyloseq](https://joey711.github.io/phyloseq/) [@mcmurdie2013phyloseq] in this section of the workflow to prepare the 2018 ITS data set for community analyses. We prepare the data set by curating samples, removing contaminants, and creating phyloseq objects. ## Read Counts Assessment Before we begin, let's create a summary table containing some basic sample metadata and the read count data from the [DADA2 workflow](dada2-high-temp.html#its-data). We need to inspect how total reads changed from to beginning to the end of the workflow. While we are at it, let's create more intuitive Sample IDs. For more details on how reads changed at each step of the workflow, see the [summary table](dada2-high-temp.html#track-reads-through-workflow-1) at the end of the DADA2 section. Table headers are as follows: | Header | Description | |---------------------------|------------------------------------------------------------------------------------------| | `Sample ID` | the new sample ID based on Plot number, Depth, Treatment, Temperature, & Pair | | `Plot` | the experimental plot number | | `Depth` | the depth where the sample was collected | | `Treatment` | warm or control | | `Temp` | temperature of soil heating (0C, +4C, or +8C) | | `Pair` | control/treatment coupling | | `input` | number of raw reads | | `nochim` | final read count after removing chimeras | | `Remain` | percent of reads remaining from `input` to `nonchim` | | `FastqID` | base name of the fastq file | ```{r} #| echo: false tmp_tab1 <- read.table( "files/dada2/tables/its18_sample_seq_info.txt", header = TRUE, sep = "\t") tmp_tab1$root <- gsub("_.*", "", tmp_tab1$Fastq_ID_Forward) tmp_tab1[9:ncol(tmp_tab1)-1] <- NULL tmp_tab1 <- tmp_tab1[, c(1,8,2:7)] tmp_tab2 <- read.table( "files/dada2/tables/its18_read_changes.txt", header = TRUE, sep = "\t") tmp_tab2[2:8] <- NULL tmp_tab3 <- read.table( "files/dada2/tables/its18_dada_compare_by_samp.txt", header = TRUE, sep = "\t") tmp_tab3 <- tmp_tab3[c(1,7)] seq_table <- left_join(tmp_tab1, tmp_tab2, by = c("Sample_ID" = "Sample_ID")) %>% left_join(., tmp_tab3, by = "Sample_ID") seq_table$per_reads_kept <- round(seq_table$nonchim/seq_table$raw_reads, digits = 3) colnames(seq_table) <- c("Sample_ID", "root name", "Plot", "Treatment", "Temp", "Depth", "Pair", "input reads", "final reads", "no. ASVs", "per. reads remain") rm(list = ls(pattern = "tmp_")) ``` <small>`r caption_tab("seq_summary_its")`</small> ```{r} #| echo: false #| eval: true seq_table %>% download_this( output_name = "its18_sample_summary", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 1), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads") ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) cutoff <- 1000 remove_sam <- seq_table[seq_table$`final reads` <= cutoff, ] ``` The following samples have fewer than **`r cutoff`** reads at the end of the workflow: **`r remove_sam$Sample_ID`**. We can also plot the final read count by the temperature where the sample was collected. ```{r} #| echo: false wong_pal <- c("#D55E00", "#56B4E9", "#009E73", "#0072B2", "#F0E442", "#CC79A7", "#E69F00", "#7F7F7F", "#000000") ``` ## Defining Groups 1. Load the data packet produced in the final step of the DADA2 workflow. This packet (`its18_dada2_wf.rdata`) contains the ASV-by-sample table and the ASV taxonomy table. 2. Rename the samples so names have plot and Depth info. For convenience, we renamed any Negative Controls (NC) to `P00_D00-000_NNN`. 3. After we load the data packet we next need to format sample names and define groups. ```{r} #| code-fold: true seq_table <- seq_table[order(seq_table$`root name`), ] tmp_names <- seq_table$`Sample_ID` load("files/dada2/rdata/its18_dada2_wf.rdata") identical(seq_table$`root name`, rownames(seqtab)) rownames(seqtab) <- tmp_names samples.out <- rownames(seqtab) sample_name <- substr(samples.out, 1, 999) plot <- substr(samples.out, 0, 3) depth <- substr(samples.out, 6, 11) treatment <- substr(samples.out, 13, 13) temp <- substr(samples.out, 14, 14) pair <- substr(samples.out, 15, 15) ``` We have a total of **`r length(unique(sample_name))`** samples, **`r length(unique(pair))`** sample pairs, from **`r length(unique(plot))`** plots (**`r length(unique(depth))`** depth). There are **`r length(unique(treatment))`** treatments corresponding to **`r length(unique(temp))`** different temperature regimes. 4. And finally we define a sample data frame that holds the different groups we extracted from the sample names. ```{r} #| code-fold: true samdf <- data.frame(SamName = sample_name, PLOT = plot, DEPTH = depth, TREAT = treatment, TEMP = temp, PAIR = pair) rownames(samdf) <- samples.out ``` ```{r} #| echo: false #| eval: true example_sam <- "P08_D00_010_C0D" qpair <- samdf[example_sam,]$PAIR qgroup <- samdf %>% dplyr::filter(PAIR == qpair) ``` For example, `r example_sam` is the sample heated at **+`r samdf[example_sam,]$TEMP`C** from plot **`r samdf[example_sam,]$PLOT`**, collected from a depth of `r samdf[example_sam,]$DEPTH`cm. This sample is in group **`r qpair`** which consists of `r qgroup$SamName`. ## Create a Phyloseq Object ### Fix Taxon Names The UNITE database we used for classification adds a little prefix to every rank call, like `k__` for Kingdom, `o__` for Order, etc. To keep in step with our other analyses, we will remove these tags. Of course, we also make a ps object with the original names just in case. At this point we will also remove the Species rank since it is essentially useless. ```{r} tax_mod_o <- data.frame(tax) tax_mod_o$Species <- NULL tax_mod_o <- as.matrix(tax_mod_o) tax_mod <- sub("k__", "", tax_mod_o) tax_mod <- sub("p__", "", tax_mod) tax_mod <- sub("c__", "", tax_mod) tax_mod <- sub("o__", "", tax_mod) tax_mod <- sub("f__", "", tax_mod) tax_mod <- sub("g__", "", tax_mod) ``` **A**. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention---ASV1, ASV2, ASV3, and so on. :::{.callout-note} A phyloseq object contains ASV table (taxa abundances), sample metadata, and taxonomy table (mapping between ASVs and higher-level taxonomic classifications). ::: ```{r} #| code-fold: true # this create the phyloseq object ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE), sample_data(samdf), tax_table(tax_mod)) tax_table(ps) <- cbind(tax_table(ps), rownames(tax_table(ps))) # adding unique ASV names taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps))) tax_table(ps) <- cbind(tax_table(ps), rownames(tax_table(ps))) head(taxa_names(ps)) ``` ```{r} #| echo: false # this create the phyloseq object ps_o <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE), sample_data(samdf), tax_table(tax_mod_o)) tax_table(ps_o) <- cbind(tax_table(ps_o), rownames(tax_table(ps_o))) # adding unique ASV names taxa_names(ps_o) <- paste0("ASV", seq(ntaxa(ps_o))) tax_table(ps_o) <- cbind(tax_table(ps_o), rownames(tax_table(ps_o))) head(taxa_names(ps_o)) ``` ```{r} #| echo: false #| eval: true head(taxa_names(ps)) ``` So the complete data set contains `r ntaxa(ps)` ASVs. We can also use the [microbiome R package](https://github.com/microbiome/microbiome/) [@lahti2017microbiome] to get some additional summary data from the phyloseq object. ```{r} #| echo: false min_read_ps <- min(readcount(ps)) max_read_ps <- max(readcount(ps)) total_reads_ps <- sum(readcount(ps)) mean_reads_ps <- round(mean(readcount(ps)), digits = 0) median_reads_ps <- median(readcount(ps)) total_asvs_ps <- ntaxa(ps) singleton_ps <- tryCatch(ntaxa(rare(ps, detection = 1, prevalence = 0)), error=function(err) NA) singleton_ps_perc <- tryCatch(round((100*(ntaxa(rare(ps, detection = 1, prevalence = 0)) / ntaxa(ps))), digits = 3), error = function(err) NA) sparsity_ps <- round(length(which(abundances(ps) == 0))/length(abundances(ps)), digits = 3) ``` | Metric | Results | |-------------------------------------|------------------------------------------------------| | Min. number of reads | `r min_read_ps` | | Max. number of reads | `r max_read_ps` | | Total number of reads | `r total_reads_ps` | | Average number of reads | `r mean_reads_ps` | | Median number of reads | `r median_reads_ps` | | Total ASVs | `r total_asvs_ps` | | Sparsity | `r sparsity_ps` | | Any ASVs sum to 1 or less? | `r isTRUE(singleton_ps >= 1)` | | Number of singleton ASVs | `r singleton_ps` | | Percent of ASVs that are singletons | `r singleton_ps_perc` | | Number of sample variables are: | `r length(sample_data(ps))` (`r colnames(samdf)`) | **B**. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object. ```{r} colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "ASV_SEQ", "ASV_ID") rank_names(ps) ``` ```{r} #| echo: false #| eval: true ps ``` ```{r} #| echo: false colnames(tax_table(ps_o)) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "ASV_SEQ", "ASV_ID") ps_o rank_names(ps_o) ``` **C**. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix `full` to indicate that these are the *raw* outputs. ```{r} #| code-fold: true write.table(tax_table(ps), "files/data-prep/tables/its18_full_tax_table.txt", sep="\t", quote = FALSE, col.names=NA) write.table(t(otu_table(ps)), "files/data-prep/tables/its18_full_seq_table.txt", sep="\t", quote = FALSE, col.names=NA) ``` ```{r} #| echo: false write.table(tax_table(ps_o), "files/data-prep/tables/its18_full_tax_table_o.txt", sep="\t", quote = FALSE, col.names=NA) write.table(t(otu_table(ps_o)), "files/data-prep/tables/its18_full_seq_table_o.txt", sep="\t", quote = FALSE, col.names=NA) ``` At this point we would normally remove unwanted taxa and/or negative controls. Since we do not have any of these for this dataset, we are going to save the phyloseq object with a new name. That way, if we do decide later on to remove some taxa it will be easier to plug in the code here and then continue with the rest of the workflow. ```{r} ps_filt <- ps ``` ## Remove Low-Count Samples Next, we can remove samples with really low read counts, say less than 500 reads. ```{r} #| results: asis ps_filt_no_low <- prune_samples(sample_sums(ps_filt) > 500, ps_filt) ``` ```{r} #| eval: true ps_filt_no_low ``` We lost `r nsamples(ps) - nsamples(ps_filt_no_low)` sample(s). After removing samples we need to check whether any ASVs ended up with no reads. ```{r} no_reads <- taxa_sums(ps_filt_no_low) == 0 ``` And we lost `r sum(no_reads)` ASV(s). So now we must remove these from the ps object. ```{r} ps_filt_no_nc <- prune_taxa(taxa_sums(ps_filt_no_low) > 0, ps_filt_no_low) ``` ```{r} #| eval: true ps_filt_no_nc ``` ## Rename NA taxonomic ranks Phyloseq has an odd way of dealing with taxonomic ranks that have no value---in other words, *NA* in the tax table. The first thing we are going to do before moving forward is to change all of the *NA*s to have a value of the next highest classified rank. For example, `ASV4` is not classified at the Genus level but is at Family level (Nitrososphaeraceae). So we change the Genus name to *Family_Nitrososphaeraceae*. The code for comes from these two posts on the phyloseq GitHub, both by [MSMortensen](https://github.com/MSMortensen): issue [#850](https://github.com/joey711/phyloseq/issues/850#issuecomment-394771087) and issue [#990](https://github.com/joey711/phyloseq/issues/990#issuecomment-424618425). > One thing this code does is reassign the functions `class` and `order` to taxonomic ranks. This can cause issues if you need these functions. So you need to run something like this `rm(class, order, phylum, kingdom)` at the end of the code to remove these as variables. For now, I have not come up with a better solution. Before doing that, we will save a copy of the phyloseq object with the original taxa names. ```{r} ps_work_o <- ps_filt_no_nc ``` ```{r} #| code-fold: true tax.clean <- data.frame(tax_table(ps_filt_no_nc)) for (i in 1:6){ tax.clean[,i] <- as.character(tax.clean[,i])} tax.clean[is.na(tax.clean)] <- "" for (i in 1:nrow(tax.clean)){ if (tax.clean[i,2] == ""){ kingdom <- base::paste("k_", tax.clean[i,1], sep = "") tax.clean[i, 2:6] <- kingdom } else if (tax.clean[i,3] == ""){ phylum <- base::paste("p_", tax.clean[i,2], sep = "") tax.clean[i, 3:6] <- phylum } else if (tax.clean[i,4] == ""){ class <- base::paste("c_", tax.clean[i,3], sep = "") tax.clean[i, 4:6] <- class } else if (tax.clean[i,5] == ""){ order <- base::paste("o_", tax.clean[i,4], sep = "") tax.clean[i, 5:6] <- order } else if (tax.clean[i,6] == ""){ tax.clean$Genus[i] <- base::paste("f",tax.clean$Family[i], sep = "_") } } tax_table(ps_filt_no_nc) <- as.matrix(tax.clean) rank_names(ps_filt_no_nc) rm(class, order, phylum, kingdom) ``` Still the same ranks. That's good. What about the new groups? Let's take a peak at some families. ```{r} #| eval: true head(get_taxa_unique(ps_filt_no_nc, "Family"), 16) ``` Nice. Adios *NA*. > *Note*. The original code is written for data sets that have species-level designations. Since this data set does not contain species, I modified the code to stop at the genus level. If your data set has species, my modifications will not work for you. Finally, we rename the ps object. This is now our working data set. ```{r} ps_work <- ps_filt_no_nc ``` ## Data Set Summary Now that we have removed the NC and samples with low reads, we can summarize the data in our working phyloseq object. Let's start with the data set as a whole using the `summarize_phyloseq` from the [microbiome R package](https://github.com/microbiome/microbiome/) [@lahti2017microbiome] as we did above. ```{r} #| echo: false min_read_ps_work <- min(readcount(ps_work)) max_read_ps_work <- max(readcount(ps_work)) total_reads_ps_work <- sum(readcount(ps_work)) mean_reads_ps_work <- round(mean(readcount(ps_work)), digits = 0) median_reads_ps_work <- median(readcount(ps_work)) total_asvs_ps_work <- ntaxa(ps_work) singleton_ps_work <- tryCatch(ntaxa(rare(ps_work, detection = 1, prevalence = 0)), error=function(err) NA) singleton_ps_work_perc <- tryCatch(round((100*(ntaxa(rare(ps_work, detection = 1, prevalence = 0)) / ntaxa(ps_work))), digits = 3), error=function(err) NA) sparsity_ps_work <- round(length(which(abundances(ps_work) == 0))/ length(abundances(ps)), digits = 3) ``` | Metric | Results | |-------------------------------------|----------------------------------------------------------| | Min. number of reads | `r min_read_ps_work` | | Max. number of reads | `r max_read_ps_work` | | Total number of reads | `r total_reads_ps_work` | | Average number of reads | `r mean_reads_ps_work` | | Median number of reads | `r median_reads_ps_work` | | Total ASVs | `r total_asvs_ps_work` | | Sparsity | `r sparsity_ps_work` | | Any ASVs sum to 1 or less? | `r isTRUE(singleton_ps_work >= 1)` | | Number of singleton ASVs | `r singleton_ps_work` | | Percent of ASVs that are singletons | `r singleton_ps_work_perc` | | Number of sample variables are: | `r length(sample_data(ps_work))` (`r colnames(samdf)`) | We can also generate a summary table of total reads & ASVs for each sample. You can sort the table, download a copy, or filter by search term. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample. ```{r} #| code-fold: true total_reads <- sample_sums(ps_work) total_reads <- as.data.frame(total_reads, make.names = TRUE) total_reads <- total_reads %>% rownames_to_column("Sample_ID") total_asvs <- estimate_richness(ps_work, measures = "Observed") total_asvs <- total_asvs %>% rownames_to_column("Sample_ID") total_asvs$Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID) ``` And then we merge these two data frames with the sample data frame. ```{r} #| code-fold: true rm(list = ls(pattern = "tmp_")) sam_details <- samdf rownames(sam_details) <- NULL colnames(sam_details) <- c("Sample_ID", "Plot", "Depth", "Treatment", "Temp", "Pair") tmp_tab1 <- seq_table tmp_tab1[, c(2:7,11)] <- NULL tmp_tab2 <- dplyr::left_join(sam_details, total_reads) %>% left_join(., total_asvs) %>% left_join(., tmp_tab1) reads_lost <- tmp_tab2$`final reads` - tmp_tab2$total_reads asvs_lost <- tmp_tab2$`no. ASVs` - tmp_tab2$Observed tmp_tab2$reads_lost <- reads_lost tmp_tab2$asvs_lost <- asvs_lost seq_table2 <- tmp_tab2[, c(1:8,12,13)] colnames(seq_table2) <- c("sample_id", "PLOT", "DEPTH", "TREAT", "TEMP", "PAIR", "total_reads", "total_ASVs", "reads_lost", "ASVs_lost") ``` <small>`r caption_tab("seq_summary_final_its")`</small> ```{r} #| echo: false #| eval: true seq_table2 %>% download_this( output_name = "its18_sample_summary_final", output_extension = ".csv", button_label = "Download data as csv file", button_type = "default", csv2 = FALSE, has_icon = TRUE, icon = "fa fa-save") reactable(seq_table2, defaultColDef = colDef( header = function(value) gsub("_", " ", value, fixed = TRUE), cell = function(value) format(value, nsmall = 0), align = "center", filterable = TRUE, sortable = TRUE, resizable = TRUE, footerStyle = list(fontWeight = "bold") ), columns = list( Sample_ID = colDef(name = "Sample ID", sticky = "left", style = list(borderRight = "1px solid #eee"), headerStyle = list(borderRight = "1px solid #eee"), align = "left", minWidth = 150, footer = "Total reads") ), searchable = TRUE, defaultPageSize = 5, pageSizeOptions = c(5, 10, nrow(seq_table)), showPageSizeOptions = TRUE, highlight = TRUE, bordered = TRUE, striped = TRUE, compact = FALSE, wrap = FALSE, showSortable = TRUE, fullWidth = TRUE, theme = reactableTheme(style = list(fontSize = "0.8em"))) ``` ## Review We have a few different options to play with. At this point it is difficult to say which we will use or whether additional objects need to be created. Here is a summary of the objects we have. The full phyloseq object *before* removing contaminants. ```{r} #| eval: true ps head(get_taxa_unique(ps, "Family"), 16) ``` The phyloseq object *before* removing the prefix of taxon ranks. This still has low coverage samples. ```{r} #| eval: true ps_o head(get_taxa_unique(ps_o, "Family"), 16) ``` The phyloseq object *before* changing NA ranks. ```{r} #| eval: true ps_work_o head(get_taxa_unique(ps_work_o, "Family"), 16) ``` The phyloseq object *after* changing NA ranks. ```{r} #| eval: true ps_work head(get_taxa_unique(ps_work, "Family"), 16) ``` ## Final Steps The last thing tasks to complete are to **a**) save copies of the taxonomy and sequence tables and **b**) save an image of the workflow for the next part of the workflow. ```{r} #| code-fold: true write.table(tax_table(ps_work), "files/data-prep/tables/its18_work_tax_table.txt", sep="\t", quote = FALSE, col.names=NA) write.table(t(otu_table(ps_work)), "files/data-prep/tables/its18_work_seq_table.txt", sep="\t", quote = FALSE, col.names=NA) ``` ```{r} #| echo: false # PAGE BUILD save.image("page_build/data_prep_its18_wf.rdata") ``` ```{r} #| echo: false saveRDS(ps, "files/data-prep/rdata/its18_ps.rds") saveRDS(ps_work, "files/data-prep/rdata/its18_ps_work.rds") saveRDS(ps_work_o, "files/data-prep/rdata/its18_ps_work_o.rds") ``` ```{r} #| include: false #| eval: true remove(list = ls()) ``` # Workflow Output Data products generated in this workflow can be downloaded from figshare. <iframe src="https://widgets.figshare.com/articles/14690739/embed?show_title=1" width="100%" height="351" allowfullscreen frameborder="0"></iframe> That's all for this part! </br> <a href="filtering.html" class="btnnav button_round" style="float: right;">Next workflow: <br> 3. Filtering</a> <a href="dada2.html" class="btnnav button_round" style="float: left;">Previous workflow: <br> 1. DADA2 Workflow</a><br><br> ```{r} #| message: false #| results: hide #| eval: true #| echo: false remove(list = ls()) ### COmmon formatting scripts ### NOTE: captioner.R must be read BEFORE captions_XYZ.R source(file.path("assets", "functions.R")) ``` #### Source Code {.appendix} The source code for this page can be accessed on GitHub `r fa(name = "github")` by [clicking this link](`r source_code()`). #### Data Availability {.appendix} Data generated in this workflow and the `.Rdata` need to run the workflow can be accessed on figshare at [10.25573/data.14690739](https://doi.org/10.25573/data.14690739). #### Last updated on {.appendix} ```{r} #| echo: false #| eval: true Sys.time() ```