Sample preparation and bioinformatics
In addition to the above considerations, the study design needs to
account for the sensitivity and error-prone nature of many
molecular-based approaches. Both shotgun metagenomics and DNA
metabarcoding (i.e., amplicon sequencing of marker genes) include
numerous opportunities for introducing false negatives and false
positives throughout the data generation process, from sample collection
through laboratory processing, sequencing, bioinformatics, and data analysis. Details
of these issues are already largely covered in another Molecular
Ecology editorial (Zinger et al., 2019), but to summarize briefly, some
possible pitfalls include sample contamination stemming from the field
or lab environment (de Goffau et al., 2018; Salter et al., 2014),
extraction/PCR amplification biases, and errors generated during PCR and
sequencing. Technical considerations, such as sample volume and choice
of lab reagents, are, in many cases, the result of a compromise between
the research question, logistical feasibility, time, and available funds
(Taberlet et al., 2018). However, any such compromise should still allow
the research question to be addressed appropriately. In
addition, we want to re-emphasize the importance of adequately
describing the whole data production workflow in the methods section of
manuscripts (e.g., primer sequences, polymerase, molecular labelling
strategy). Environmental or lab contamination is a particularly large
problem for samples with low microbial biomass (Eisenhofer et al.,
2019), and the collection of such low biomass samples cannot be avoided
in many study designs (e.g., host-associated microbiomes of small
organisms, or depauperate environmental habitats). The sequencing of
negative controls (and potentially also positive controls and technical
replicates) alongside experimental samples is important for quantifying
errors and artefacts (e.g., Davis et al., 2018), and can improve data
curation procedures through tuned, experiment-specific criteria,
including for samples with low microbial biomass. While there is more
than one way to implement such efforts, a thorough description of the
controls, a rationale for including them, and the ways they are
integrated into data analysis, are essential practices of good
microbiome science (Hakimzadeh et al., 2023).
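As one illustration of how sequenced negative controls can feed into data curation, the sketch below (Python with pandas; table layout, sample names, and the cut-off are all hypothetical) flags taxa whose mean relative abundance in blanks approaches their mean relative abundance in true samples, a prevalence-style rule in the spirit of the contaminant-identification approaches cited above. It is a minimal sketch, not a substitute for dedicated tools, and the threshold would need experiment-specific tuning.

```python
import pandas as pd

def flag_putative_contaminants(counts: pd.DataFrame,
                               negative_controls: list,
                               ratio_threshold: float = 0.5) -> pd.Series:
    """Flag taxa whose mean relative abundance in negative controls
    approaches or exceeds their mean relative abundance in true samples.

    counts: taxa (rows) x samples (columns) table of read counts.
    negative_controls: column names corresponding to sequenced blanks.
    ratio_threshold: illustrative cut-off; tune per experiment.
    """
    rel = counts.div(counts.sum(axis=0), axis=1)   # per-sample relative abundances
    samples = [c for c in counts.columns if c not in negative_controls]
    mean_in_controls = rel[negative_controls].mean(axis=1)
    mean_in_samples = rel[samples].mean(axis=1)
    # A taxon is suspicious if it is (nearly) as abundant in blanks as in samples.
    return mean_in_controls >= ratio_threshold * mean_in_samples

# Toy example: Taxon_A dominates real samples, Taxon_B dominates the blank.
counts = pd.DataFrame(
    {"S1": [900, 10], "S2": [800, 5], "S3": [950, 8], "BLANK1": [2, 12]},
    index=["Taxon_A", "Taxon_B"],
)
print(flag_putative_contaminants(counts, ["BLANK1"]))  # flags Taxon_B only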
One overlooked problem in microbiome studies is cross-contamination
between samples during library preparation procedures (Kim et al., 2017;
Zinger et al., 2019), which can result in an artificial reduction in
beta diversity (i.e., compositional differences between samples) and an
increase in alpha diversity. Such cross-contamination can occur during
the PCR plate preparation process through pipetting errors or aerosol
production. Random positioning of samples in the PCR plates provides a
relatively simple approach to reduce this problem (Minich et al., 2019;
Taberlet et al., 2018). More often, and more insidiously,
cross-contamination can occur during the PCR cycles, a bias referred to
as tag-switches (Carlsen et al., 2012; Esling et al., 2015), tag-jumps
(Schnell et al., 2015), or, more recently in the microbiome literature,
cross-talks (Edgar, 2018; Minich et al., 2019). During this laboratory
step, amplicon molecules from different samples can recombine within
conserved primer sequences, resulting in the production of new molecules
containing the genuine DNA sequence, but the wrong sample/barcode label.
As a consequence, the most abundant taxa will be detected in many
samples, including the negative controls (Esling et al., 2015; Minich et
al., 2019; Taberlet et al., 2018), preventing simple removal of all taxa
occurring in controls as a fix for field and lab contamination.
Alleviating the problem of tag-switches can be achieved with modified
library preparation protocols (e.g., Carøe & Bohmann, 2020),
appropriate sample labelling strategies, and, a posteriori, by using
the information contained in both samples and negative controls (Bohmann
et al., 2022; Hakimzadeh et al., 2023).
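One common a posteriori strategy, sketched below under simplifying assumptions, is to zero out occurrences of a taxon in a sample when its read count falls below a small fraction of that taxon's total count across the run. The fraction used here (0.05%) is a hypothetical tag-jump rate; in practice it would be calibrated from the counts observed in negative controls or unused tag combinations.

```python
import pandas as pd

def filter_tag_jumps(counts: pd.DataFrame, frac: float = 0.0005) -> pd.DataFrame:
    """Zero out per-sample counts of a taxon that fall below a small
    fraction of that taxon's total read count across the whole run.

    counts: taxa (rows) x samples (columns) read counts from one library.
    frac: assumed tag-jump rate (here 0.05%); calibrate it from negative
          controls or unused tag combinations for the run at hand.
    """
    per_taxon_threshold = counts.sum(axis=1) * frac        # one threshold per taxon
    return counts.where(counts.ge(per_taxon_threshold, axis=0), other=0)
```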
One limitation in microbiome studies using either DNA metabarcoding or
metagenomics is the compositional nature of the sequence data (i.e., data
described as proportions or probabilities, not absolute numbers of
molecules). As with other count-based sequencing approaches, this
limitation arises because the concentration of PCR products is
standardised prior to sequencing and the number of molecules read is
limited by the sequencing platform (Gloor et al., 2017). As a result, we
are unable to obtain data on absolute abundances and biomass for the
different microbial members of the community. However, exciting new
developments are emerging to overcome these limitations, relying on
known reference values of DNA molecule abundances, allowing simple
conversion of relative abundance into absolute values. Two broad classes
of methods based on this approach have emerged: (i) quantification of
target markers using q/ddPCR prior to metabarcoding (Barlow et al.,
2020; Callahan et al., 2019; Ji et al., 2019) and (ii) introduction of
exogenous DNA spike-ins (i.e., DNA molecules of known sequence
and quantity used to calibrate measurements). Related to the latter, a
host-associated microbiome PCR approach (HamPCR, Lundberg et al., 2021)
represents a promising method to assess the ratio of microbial
population size to the amount of host tissue (i.e., microbial
load).
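To make these conversions concrete, the sketch below (hypothetical names and values) illustrates both routes: scaling read counts by the copies-per-read estimated from an exogenous spike-in of known quantity, and multiplying per-sample relative abundances by a total marker copy number measured by qPCR/ddPCR. It is a minimal sketch, assuming the spike-in is added in equal amounts to every sample.

```python
import pandas as pd

def absolute_from_spikein(counts: pd.DataFrame,
                          spike_taxon: str,
                          spike_copies_added: float) -> pd.DataFrame:
    """Route (ii): scale read counts by copies-per-read estimated from an
    exogenous spike-in of known quantity (assumed equal across samples)."""
    copies_per_read = spike_copies_added / counts.loc[spike_taxon]   # per sample
    return counts.drop(index=spike_taxon).mul(copies_per_read, axis=1)

def absolute_from_qpcr(counts: pd.DataFrame,
                       total_copies: pd.Series) -> pd.DataFrame:
    """Route (i): multiply per-sample relative abundances by the total
    marker copy number measured for each sample by qPCR/ddPCR."""
    rel = counts.div(counts.sum(axis=0), axis=1)
    return rel.mul(total_copies, axis=1)
```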
Despite obtaining a better estimate of the absolute number of molecules
in a sample, it is still challenging to convert this number into the
actual number of microbial cells. The calculation is often difficult
because most gold-standard barcoding genes used for bacteria, fungi, and
protists have multiple copies in the genome, with precise numbers
varying across taxa and in unpredictable ways (Louca et al., 2018).
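As a minimal numerical illustration of this caveat, the sketch below divides estimated marker-gene copies by an assumed per-taxon copy number; the copy numbers and the community-wide default used here are hypothetical, and the unpredictability highlighted by Louca et al. (2018) enters precisely through these assumed values.

```python
import pandas as pd

# Hypothetical per-taxon 16S rRNA gene copy numbers; real values vary
# widely across taxa and are often unknown (Louca et al., 2018).
ASSUMED_COPY_NUMBERS = {"Taxon_A": 7, "Taxon_B": 1}
DEFAULT_COPY_NUMBER = 4   # crude community-wide average for unannotated taxa

def copies_to_cells(absolute_copies: pd.DataFrame) -> pd.DataFrame:
    """Divide estimated marker-gene copies per sample by an assumed
    per-taxon copy number to approximate cell counts."""
    cn = pd.Series({t: ASSUMED_COPY_NUMBERS.get(t, DEFAULT_COPY_NUMBER)
                    for t in absolute_copies.index})
    return absolute_copies.div(cn, axis=0)
```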
Another problem is that some of the retrieved molecules can be derived
from extracellular DNA or DNA adsorbed on cell debris or particles, i.e.
correspond to non-living organisms (Torti et al., 2015). The proportion
of extracellular DNA in the environment is often not known but can be
estimated with different approaches (reviewed by Nagler et al., 2022).
Further, because extracellular DNA is often degraded, approaches that
target larger genomic regions, such as long-read sequencing, will likely
help overcome this issue. Approaches that are able to quantify or
eliminate extracellular DNA can prove useful when an accurate
snapshot of the living microbial community is crucial (e.g., when studying
short-term processes with repeated observations capturing microbiome
variation within host individuals) but are likely less relevant when
studying processes operating at larger temporal scales (e.g., microbiome
response to climate change).
Lastly, incorporating site-occupancy modelling in microbiome studies
presents an exciting avenue to quantify measurement uncertainty and to
account for imperfect detection (e.g., Ficetola et al., 2015;
McClenaghan et al., 2020; Willoughby et al., 2016). Site-occupancy
models use data collected over multiple visits to sites (or across
multiple technical/biological replicates) to quantify how likely it is
to detect a taxon when it is present. For microbiome studies, including
both biological and technical PCR replicates can enable rigorous
statistical predictions regarding the true or false positive detection
of microbial species within the community. Further, these predictions
can be utilised to improve study design (Fukaya et al., 2022;
McClenaghan et al., 2020). How many replicates of each type are required
remains an open question, although biological replicates may improve
detection probabilities (Willoughby et al., 2016).
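To illustrate the detection-probability component that such models estimate, the sketch below assumes independent replicates and a known per-replicate detection probability p (both simplifications relative to a full occupancy model, which jointly estimates occupancy and detection from replicated detection histories) and asks how many replicates are needed to reach a target cumulative detection probability.

```python
import math

def cumulative_detection(p: float, k: int) -> float:
    """Probability of detecting a taxon at least once in k replicates,
    given that it is truly present and each replicate detects it
    independently with probability p."""
    return 1.0 - (1.0 - p) ** k

def replicates_needed(p: float, target: float = 0.95) -> int:
    """Smallest number of replicates giving cumulative detection >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

# Example: with a per-replicate detection probability of 0.4,
# reaching 95% cumulative detection requires 6 replicates.
print(replicates_needed(0.4))   # -> 6
```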