Setup
The DADA2 pipeline corrects errors introduced into sequencing data during amplicon sequencing. It is implemented as an open-source R package that takes you through the entire workflow, including steps to filter, dereplicate, identify chimeras, and merge paired-end reads. We will use one external application, cutadapt, to remove primer sequences. The output of this pipeline is a table of amplicon sequence variants (ASVs), as opposed to the traditional OTUs produced by amplicon workflows that cluster similar sequences together. DADA2 can resolve biological differences of as little as 1 or 2 nucleotides, so the ASVs it produces are of higher resolution than OTUs. For a more general and detailed tutorial, please see the dada2 website.
Requirements
- MacOS or Linux. Windows is possible if you use Windows Subsystem for Linux or a virtual machine like VirtualBox.
- A modern desktop or laptop. The memory and CPU requirements for running dada2 are fairly modest. A low-powered computer will still work, but the run will likely take much longer.
Install R
To use DADA2, you must download and install R for your operating system. This can be done here. You should use the latest version of R. It is also highly recommended (but not required) to download and install RStudio, an integrated development environment (IDE) for R programming. This can be done through this link.
Install dada2
Since dada2 is available as part of the Bioconductor project package repository, first install the BiocManager package, which is used to install Bioconductor packages.
install.packages("BiocManager")
To install the latest version of the DADA2 package (1.12.1, from Bioconductor release 3.9), type the following:
BiocManager::install("dada2", version = "3.9")
If you have previously installed R and Bioconductor, you may need to update them to the most recent versions. This tutorial relies on features available only in dada2 version 1.12.1 or greater.
Install cutadapt
To use the DADA2 pipeline, primers must be removed from the amplicon sequence data. In the following tutorial we will use the cutadapt tool, called from within R, to do this. Cutadapt can be installed as a conda package. If you do not already have conda installed, you can do so here. Once you have conda installed, go to Terminal or Command Prompt and type the following:
conda install -c bioconda cutadapt
Download files
This tutorial is run with the files listed on the Analysis page. Please download those and, if you want to follow along exactly, place them in a folder called MiSeq_Run in your home directory. Here we only use two samples to keep the run time short, but feel free to run it with all the example files if desired.
You’ll also need the ITS2 database files formatted for IDTAXA. You can find them at https://doi.org/10.5281/zenodo.3235802. Please download the Nematode_ITS2_1.0.0_idtaxa.fasta file with sequences and the Nematode_ITS2_1.0.0_idtaxa.tax taxonomy file.
Preparing
Once you have installed the above, you are ready to start using the DADA2 pipeline. The following is a sample analysis using the DADA2 pipeline (https://benjjneb.github.io/dada2/tutorial.html). This is meant to be an example and not a definitive workflow. First, we load packages and set a seed for reproducibility.
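A sketch of that setup step (the exact package list here is an assumption based on the steps that follow; the version strings below are printed with packageVersion()):

library(dada2)
library(DECIPHER)
library(ShortRead)
library(Biostrings)
library(phyloseq)
set.seed(100)            # any fixed seed works; it makes the pipeline reproducible
packageVersion("dada2")  # repeat for the other packages to record their versions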
## [1] '2.12.0'
## [1] '1.12.1'
## [1] '1.42.0'
## [1] '2.52.0'
## [1] '3.2.0'
## [1] '1.4.0'
## [1] '1.3.1'
To get started we’ll need to do a little prep. We’ll start by defining the path of the sequences to be processed (using the “~/MiSeq_Run” folder from the previous section). The DADA2 pipeline is intended to be used with demultiplexed, paired-end fastq files. That is to say, the samples must be separated into individual files and ordered such that forward and reverse reads of the same sample are matched. The pattern of the filenames should be the same for all forward and reverse files. In this example the names are formatted as samplename_R1_001.fastq.gz for forward reads and samplename_R2_001.fastq.gz for reverse reads.
We then create a vector of file names, one for the forward reads and one for the reverse. The samples vector contains our sample names, extracted from the file names using a regular expression. For Illumina data, the sample name is often everything before the first underscore.
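A sketch of that prep, following the layout described above:

# Location of the demultiplexed fastq files
path <- "~/MiSeq_Run"
# Sorting ensures forward and reverse reads of each sample are matched up
fwd_files <- sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
rev_files <- sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))
# Sample names: everything before the first underscore
samples <- sub("_.*", "", basename(fwd_files))
names(fwd_files) <- samples
names(rev_files) <- samples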
The primer sequences are defined here as well, because we’ll be clipping those off shortly. The forward primer will be at the beginning of the forward read and the reverse primer will be at the beginning of the reverse read. The reverse complements are included in case the amplicon sequence is shorter than our read length. When this happens the sequence will contain the reverse complement of the opposite primer, i.e. there may be a reverse-complemented reverse primer at the end of the forward read.
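For example (the primer sequences below are placeholders, not necessarily the ones used for this dataset; substitute your own):

fwd_primer <- "ACGTCTGGTTCAGGGTTGTT"  # placeholder: use your forward primer
rev_primer <- "TTAGTTTCTTTTCCTCCGCT"  # placeholder: use your reverse primer
# dada2::rc() returns the reverse complement of a sequence
fwd_primer_rev <- dada2::rc(fwd_primer)
rev_primer_rev <- dada2::rc(rev_primer)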
Before we move on let’s do a quick sanity check to make sure our primer sequences are detected. The code below counts the number of reads with a primer hit in our fastq file. Note that this is a quick and dirty method to be sure that we have the right primer sequence. For proper primer removal we’ll need something more sophisticated, like cutadapt, which we’ll demonstrate next.
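One way to do that quick count, using ShortRead and Biostrings (fixed = FALSE lets IUPAC ambiguity codes in the primer match):

count_primers <- function(primer, filename) {
  # Count the reads in which the primer occurs at least once
  num_hits <- vcountPattern(primer, sread(readFastq(filename)), fixed = FALSE)
  sum(num_hits > 0)
}
count_primers(fwd_primer, fwd_files[[1]])
count_primers(rev_primer, rev_files[[1]])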
## [1] 25357
## [1] 21503
Primer removal
Now we can use cutadapt to remove primers, and check that they have been successfully removed. Cutadapt was designed to remove sequencing adapters that sometimes get left on reads; more generally, its main function is to detect and trim off regions that match a known sequence, which is precisely what we want to do here.
Cutadapt is best installed using conda (conda install cutadapt) and can also be used directly from the command line. Here we demonstrate how to run it directly from R using the system2 command, but the results are the same either way. The first argument to system2 is the program we wish to run and the second is a character vector of arguments and their values.
Next the output filenames are defined, as well as the parameters for cutadapt. The critical parameters are the primer sequences and their orientation. It’s strongly recommended that you review the [cutadapt documentation](https://cutadapt.readthedocs.io/en/stable/guide.html) for full details of each parameter. Briefly:
- -g: sequence to trim off the 5’ end of the forward read (the forward primer)
- -a: sequence to trim off the 3’ end of the forward read (the reverse-complemented reverse primer)
- -G: sequence to trim off the 5’ end of the reverse read (the reverse primer)
- -A: sequence to trim off the 3’ end of the reverse read (the reverse-complemented forward primer)
We’ll also add -m 50 to get rid of very short, junky reads, --max-n 1 to get rid of reads that contain any N’s, and -n 2 so that cutadapt will remove multiple primer hits if there happens to be read-through. The --discard-untrimmed flag is also added so that only reads containing a primer are kept, ensuring we keep only valid amplicons. And finally, -q 15 will trim low-quality bases from the 3’ end.
In our experience this small amount of quality trimming can remove the need for truncating the reads later on, but this may vary from run to run and is an important parameter to consider for your own data.
# Create an output directory to store the clipped files
cut_dir <- file.path(path, "cutadapt")
if (!dir.exists(cut_dir)) dir.create(cut_dir)
fwd_cut <- file.path(cut_dir, basename(fwd_files))
rev_cut <- file.path(cut_dir, basename(rev_files))
names(fwd_cut) <- samples
names(rev_cut) <- samples
# It's good practice to keep some log files so let's create some
# file names that we can use for those
cut_logs <- path.expand(file.path(cut_dir, paste0(samples, ".log")))
cutadapt_args <- c("-g", fwd_primer, "-a", rev_primer_rev,
                   "-G", rev_primer, "-A", fwd_primer_rev,
                   "-m", 50, "--max-n", 1, "-n", 2,
                   "--discard-untrimmed", "-q", 15)
# cutadapt is assumed here to be on your PATH; if not, set this to the full
# path of the executable (e.g. inside your conda environment)
cutadapt <- "cutadapt"

# Loop over the list of files, running cutadapt on each file. If you don't have a
# vector of sample names, or don't want to keep the log files, you can set
# stdout = "" to output to the console or stdout = NULL to discard
for (i in seq_along(fwd_files)) {
system2(cutadapt,
args = c(cutadapt_args,
"-o", fwd_cut[i], "-p", rev_cut[i],
fwd_files[i], rev_files[i]),
stdout = cut_logs[i])
}
# quick check that we got something
head(list.files(cut_dir))
## [1] "G1_S1_L001_R1_001.fastq.gz" "G1_S1_L001_R2_001.fastq.gz"
## [3] "G1.log" "G2_S2_L001_R1_001.fastq.gz"
## [5] "G2_S2_L001_R2_001.fastq.gz" "G2.log"
Quality filtering
Inspect quality scores
Now that primers are removed, we can begin the DADA2 pipeline in full. We start with plotting quality profiles of our sequences to determine suitable quality filtering parameters. For the sake of speed we only show 2 samples here but feel free to look at all your samples (or more than 2 anyway).
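For example, for the forward reads of the first two samples:

plotQualityProfile(fwd_cut[1:2])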

Everything looks pretty good here. The majority of our reads should be 230 bp, given that we trimmed 20 bp of primer off the beginning of each 250 bp read. It looks like a very few reads did not get trimmed properly, but from the red line across the bottom, which shows the number of reads of each length, we can see that this is such a small percentage that we don’t need to worry about it.
Side note
If you’re paying attention you might be asking why, if the primers always appear at the same point in the read, we can’t just clip the reads at that point. In fact, for some datasets, this one included, that would be a valid strategy, given that all the ITS2 sequences are longer than the read length (250 bp). However, this is not always the case and, as described above, if the amplicon length is less than the read length we’ll have to find and remove the opposite primer. So in general it’s good practice to use something like cutadapt, which will ensure the correct sequences are output.
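The reverse reads can be inspected in the same way:

plotQualityProfile(rev_cut[1:2])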

Reverse reads are lower quality, as is typical for Illumina data. However, the average quality is above 20 for most of the length of the read, so we’ll go ahead with the analysis.
Filter reads
For the purposes of our analysis, we chose to conduct quality filtering with a maximum of 2 expected errors for the forward reads and 5 for the reverse, and to truncate each read at the first base with a quality score of 2 or lower.
Make your own decisions about trimming parameters! Your data will be different, possibly very different, from ours, so our parameter choices are likely not at all appropriate for your data. Don’t copy and paste this code - think carefully about your data and choose parameters appropriate for it.
# Same as for the clipping, we create an output directory to store the filtered files
filt_dir <- file.path(path, "filtered")
if (!dir.exists(filt_dir)) dir.create(filt_dir)
fwd_filt <- file.path(filt_dir, basename(fwd_files))
rev_filt <- file.path(filt_dir, basename(rev_files))
names(fwd_filt) <- samples
names(rev_filt) <- samples
filtered_out <- filterAndTrim(
fwd = fwd_cut,
filt = fwd_filt,
rev = rev_cut,
filt.rev = rev_filt,
maxEE = c(2, 5),
truncQ = 2,
rm.phix = TRUE,
compress = TRUE,
multithread = TRUE
)
head(filtered_out)
## reads.in reads.out
## G1_S1_L001_R1_001.fastq.gz 25276 24107
## G2_S2_L001_R1_001.fastq.gz 25908 24533
Main workflow
Learn errors
Here dada2 learns the error profile of your data using an iterative approach. These error profiles are used in the next step to correct errors.
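A sketch of this step, run separately for the forward and reverse reads:

err_fwd <- learnErrors(fwd_filt, multithread = TRUE)
err_rev <- learnErrors(rev_filt, multithread = TRUE)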
## 11162461 total bases in 48640 reads from 2 samples will be used for learning the error rates.
## 11118332 total bases in 48640 reads from 2 samples will be used for learning the error rates.
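The fitted error model can then be plotted, for example:

plotErrors(err_fwd, nominalQ = TRUE)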
The plot shows how well the estimated error rates fit the observed rates. The black line should line up reasonably well with the black points. This looks reasonable so we continue on.
Denoising (aka error-correcting)
We can now denoise our sequences using the DADA algorithm. Note that older versions of dada2 required a separate dereplication step that is now done automatically by the dada function.
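A sketch, using the error models learned above:

dada_fwd <- dada(fwd_filt, err = err_fwd, multithread = TRUE)
dada_rev <- dada(rev_filt, err = err_rev, multithread = TRUE)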
## Sample 1 - 24107 reads in 4939 unique sequences.
## Sample 2 - 24533 reads in 5379 unique sequences.
## Sample 1 - 24107 reads in 15358 unique sequences.
## Sample 2 - 24533 reads in 15463 unique sequences.
Merge pairs
Now that we have accurate sequences we can merge the paired-end reads to reconstruct the full amplicon sequence. Note that most sequences should successfully overlap, unless your trimming parameters were too aggressive (i.e. too much sequence was trimmed off the end) or you have amplicons that are longer than the combined read length (>500 bp, in this case). In the latter case these reads are still usable but will need special handling (more details to come on those cases).
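A sketch of the merging step (recent versions of dada2 accept the filtered file names in place of derep objects):

mergers <- mergePairs(dada_fwd, fwd_filt, dada_rev, rev_filt, verbose = TRUE)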
Sequence table
Before we can assign taxonomy, we construct a sequence table, which lists the number of times each sequence is present in each sample. We can check the dimensions of this table: the number of rows is the number of samples and the number of columns is the total number of unique sequences.
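For example:

seqtab <- makeSequenceTable(mergers)
dim(seqtab)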
## [1] 2 55
Remove chimeras
Don’t be alarmed here if many of your sequences are discarded - the number of chimeras varies but can be quite high in some cases.
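A sketch using the consensus method for chimera removal:

seqtab_nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
dim(seqtab_nochim)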
## [1] 2 19
Assigning taxonomy with IDTAXA
Now that we have a sequence table with the chimeras removed, we can classify/assign taxonomy to the sequence variants. Although there are multiple taxonomic classification methods, we will be using IdTaxa, which is available through the DECIPHER package. You can read more about IdTaxa here.
Training the classifier
The IdTaxa classification method relies on a training set that contains sequences representative of known taxa. Although pre-trained classifiers exist, we will be using the ITS2 nematode database as our classifier. Since it is not trained, we will have to train the classifier using LearnTaxa, which is also part of the DECIPHER package. The train parameter of LearnTaxa contains the sequences included in the classifier as a DNAStringSet. Each sequence is formatted such that the name of each taxonomic level is separated by a semicolon. Once the sequence headers of the classifier are parsed, the taxonomy parameter will contain the taxonomic assignment, separated by semicolons, for each sequence in train.
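A sketch of the training step (file paths assumed from the Download files section; here the semicolon-separated taxonomy strings are taken from the fasta headers as described above, which may need light parsing depending on the exact header format):

train_seqs <- readDNAStringSet("~/MiSeq_Run/Nematode_ITS2_1.0.0_idtaxa.fasta")
tax <- names(train_seqs)  # semicolon-separated taxonomy parsed from the headers
trained_classifier <- LearnTaxa(train = train_seqs, taxonomy = tax)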
## ===========================================================================
##
## Time difference of 16.38 secs
Running IdTaxa
To use IdTaxa, we will need to convert the unique sequence variants that we want to classify into a DNAStringSet object. Now that we have our trained classifier (i.e. training set) and the sequences we want to classify, we can run IdTaxa. You can change various parameters of IdTaxa depending on your data. We recommend reviewing the [IdTaxa documentation](https://rdrr.io/bioc/DECIPHER/man/IdTaxa.html) for further information. Here, we have specified several parameters (a code sketch follows the list below):
strand = "both"
: by setting the strand
parameter to “both”, each sequence variant is classified using both the forward and reverse complement orientation. The sequence varient will be classified as the result with the highest confidence.
threshold = 60
: setting the threshold
to 60 specifies when the taxonomic classifications are truncated. A lower threshold usually results in higher taxonomic level classifications, but lower accuracy (i.e. confidence). A higher threshold usually results in lower taxonomic level classifications, but with higher accuracy.
bootstraps = 100
: this specifies the amount of times bootstrap replicates are performed for each sequence.
processors = NULL
: automatically uses all available processors.
verbose = TRUE
: displays progress.
type = "extended"
: by setting the type
to “extended”, the output for each sequence variant will contain the taxonomic classification, the names of each taxonomic rank, and the confidence of the assignment.
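Putting that together (a sketch; trained_classifier comes from the training step above):

dna_variants <- DNAStringSet(getSequences(seqtab_nochim))
taxa_ids <- IdTaxa(test = dna_variants,
                   trainingSet = trained_classifier,
                   strand = "both",
                   threshold = 60,
                   bootstraps = 100,
                   processors = NULL,
                   verbose = TRUE,
                   type = "extended")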
## ===========================================================================
##
## Time difference of 0.63 secs
Although it contains all the necessary information, the output of IdTaxa is not as intuitive as we would like. It is a list with one element per sequence variant, each containing the taxonomic classification and the confidence for each taxon. To facilitate easier manipulation and comprehension of the data, we will convert this list into a matrix, with the rows as the sequence variants and the columns as their taxonomic classifications. We also drop the “Root” column.
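One way to do this conversion (a sketch; taxa_ids is the IdTaxa output from above):

# Pad each classification out to the deepest assignment so all rows are the same length
n_ranks <- max(sapply(taxa_ids, function(x) length(x$taxon)))
tax_mat <- t(sapply(taxa_ids, function(x) c(x$taxon, rep(NA, n_ranks - length(x$taxon)))))
tax_mat <- tax_mat[, -1, drop = FALSE]  # drop the "Root" column
rownames(tax_mat) <- getSequences(seqtab_nochim)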
Over to phyloseq
The individual pieces of the analysis can be saved to text or Excel files, but if you’ve made it this far then why not continue on in R! Phyloseq is a powerful toolkit for working with these types of data in R and is the recommended next step for analysis. More tutorials on downstream analysis to come!
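A minimal sketch of handing the results over to phyloseq (the single sample variable here is a stand-in; use your real metadata):

sample_df <- data.frame(sample = samples, row.names = samples)
ps <- phyloseq(otu_table(seqtab_nochim, taxa_are_rows = FALSE),
               sample_data(sample_df),
               tax_table(tax_mat))
# Store the full sequences as refseq() and switch to short ASV names
dna <- DNAStringSet(taxa_names(ps))
names(dna) <- taxa_names(ps)
ps <- merge_phyloseq(ps, dna)
taxa_names(ps) <- paste0("ASV", seq_len(ntaxa(ps)))
ps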
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 19 taxa and 2 samples ]
## sample_data() Sample Data: [ 2 samples by 1 sample variables ]
## tax_table() Taxonomy Table: [ 19 taxa by 8 taxonomic ranks ]
## refseq() DNAStringSet: [ 19 reference sequences ]