Sample preparation and bioinformatics
In addition to the above considerations, the study design needs to account for the sensitivity and error-prone nature of many molecular-based approaches. Both shotgun metagenomics and DNA metabarcoding (i.e., amplicon sequencing of marker genes) include numerous opportunities for introducing false negatives and false positives during data generation, from sample collection through laboratory processing, sequencing, bioinformatics, and data analysis. Details of these issues are already largely covered in another Molecular Ecology editorial (Zinger et al., 2019), but to summarize briefly, possible pitfalls include sample contamination stemming from the field or lab environment (de Goffau et al., 2018; Salter et al., 2014), extraction/PCR amplification biases, and errors generated during PCR and sequencing. Technical considerations, such as sample volume and choice of lab reagents, are in many cases the result of a compromise between the research question, logistical feasibility, time, and available funds (Taberlet et al., 2018). However, any compromise of the protocol should still allow one to appropriately address the research question. In addition, we want to re-emphasize the importance of adequately describing the whole data production workflow in the methods section of manuscripts (e.g., primer sequences, polymerase, molecular labelling strategy). Environmental or lab contamination is a particularly large problem for samples with low microbial biomass (Eisenhofer et al., 2019), and the collection of such low-biomass samples cannot be avoided in many study designs (e.g., host-associated microbiomes of small organisms, or depauperate environmental habitats). Sequencing negative controls (and potentially also positive controls and technical replicates) alongside experimental samples is important for quantifying errors and artefacts (e.g., Davis et al., 2018), and can improve data curation procedures through tuned, experiment-specific criteria, including for samples with low microbial biomass. While there is more than one way to implement such efforts, a thorough description of the controls, a rationale for including them, and the ways they are integrated into data analysis are essential practices of good microbiome science (Hakimzadeh et al., 2023).
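To make the use of negative controls concrete, the sketch below illustrates one simple, experiment-specific curation criterion: flagging taxa that are at least as prevalent in the negative controls as in the true samples. It is a minimal illustration in Python, assuming a taxon-by-sample count table stored as a NumPy array; the prevalence-based rule, the threshold and the toy data are illustrative choices rather than a prescribed standard, and dedicated tools exist for this task.

```python
# Minimal sketch: flag putative contaminants from negative controls.
# Assumes a taxon-by-sample count table; thresholds are illustrative only.
import numpy as np

def flag_contaminants(counts, is_control, prevalence_ratio=1.0):
    """Flag taxa that are proportionally at least as prevalent in negative
    controls as in true samples.

    counts           : 2D array (taxa x samples) of read counts
    is_control       : boolean array marking which columns are negative controls
    prevalence_ratio : flag a taxon if its prevalence in controls is at least
                       prevalence_ratio times its prevalence in samples
    """
    is_control = np.asarray(is_control, dtype=bool)
    present = counts > 0
    prev_controls = present[:, is_control].mean(axis=1)
    prev_samples = present[:, ~is_control].mean(axis=1)
    return prev_controls >= prevalence_ratio * prev_samples

# Toy example: 3 taxa x 5 samples, last two columns are negative controls.
counts = np.array([
    [120,  90, 300,   0,   0],   # absent from controls -> not flagged
    [ 15,  10,   8,  20,  25],   # as prevalent in controls -> flagged
    [  0, 500, 450,   1,   0],   # low-level carry-over into one control
])
print(flag_contaminants(counts, is_control=[False, False, False, True, True]))
```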
One overlooked problem in microbiome studies is cross-contamination between samples during library preparation procedures (Kim et al., 2017; Zinger et al., 2019), which can result in an artificial reduction in beta diversity (i.e., compositional differences between samples) and an increase in alpha diversity. Such cross-contamination can occur during the PCR plate preparation process through pipetting errors or aerosol production. Random positioning of samples in the PCR plates provides a relatively simple approach to reduce this problem (Minich et al., 2019; Taberlet et al., 2018). More often, and more insidiously, cross-contamination can occur during the PCR cycles themselves, a bias referred to as tag-switches (Carlsen et al., 2012; Esling et al., 2015), tag-jumps (Schnell et al., 2015) or, more recently in the microbiome literature, cross-talk (Edgar, 2018; Minich et al., 2019). During this laboratory step, amplicon molecules from different samples can recombine within the conserved primer sequences, resulting in new molecules containing the genuine DNA sequence but the wrong sample/barcode label. As a consequence, the most abundant taxa will be detected in many samples, including the negative controls (Esling et al., 2015; Minich et al., 2019; Taberlet et al., 2018), preventing the simple removal of all taxa occurring in controls as a fix for field and lab contamination. The problem of tag-switches can be alleviated with modified library preparation protocols (e.g., Carøe & Bohmann, 2020), appropriate sample labelling strategies and, a posteriori, by using the information contained in both samples and negative controls (Bohmann et al., 2022; Hakimzadeh et al., 2023).
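As a concrete illustration of the a posteriori route, the sketch below applies a simple abundance-based filter in the spirit of the approaches cited above, without reproducing any specific one of them: occurrences contributing less than a small fraction of a taxon's total read count across the run are treated as putative tag-switches and removed. The threshold is an assumption that should be tuned using the negative controls, for example so that they end up essentially empty after filtering.

```python
# Minimal sketch: abundance-based removal of putative tag-switches/cross-talk.
# The fractional threshold t is an assumption to be tuned with negative controls.
import numpy as np

def filter_tag_jumps(counts, t=0.001):
    """Zero out occurrences that are likely tag-switches.

    counts : 2D array (taxa x samples) of read counts
    t      : any occurrence contributing less than a fraction t of that
             taxon's total reads across the run is set to zero
    """
    counts = counts.astype(float)
    totals = counts.sum(axis=1, keepdims=True)   # total reads per taxon
    counts[counts < t * totals] = 0              # drop suspicious occurrences
    return counts

# Toy example: an abundant taxon "leaks" a handful of reads into other samples
# (including a negative control in the last column).
counts = np.array([
    [50000,     6,     3,     4],   # leaked occurrences removed at t = 0.001
    [   10,  8000,  7500,     0],
])
print(filter_tag_jumps(counts, t=0.001))
```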
One limitation of microbiome studies, whether using DNA metabarcoding or metagenomics, is the compositional nature of the sequence data (i.e., described as proportions or probabilities, not absolute numbers of molecules). As with other count-based sequencing approaches, this limitation arises because the concentration of PCR products is standardised prior to sequencing and the number of molecules read is limited by the sequencing platform (Gloor et al., 2017). As a result, we are unable to obtain data on absolute abundances and biomass for the different microbial members of the community. However, exciting new developments are emerging to overcome these limitations, relying on known reference values of DNA molecule abundances that allow a simple conversion of relative abundances into absolute values. Two broad classes of methods based on this approach have emerged: (i) quantification of target markers using q/ddPCR prior to metabarcoding (Barlow et al., 2020; Callahan et al., 2019; Ji et al., 2019) and (ii) introduction of exogenous DNA spike-ins (i.e., DNA molecules of known sequence and quantity used to calibrate measurements). Related to the latter, the host-associated microbiome PCR approach (HamPCR; Lundberg et al., 2021) represents a promising method to assess the size of the microbial population relative to the amount of host tissue (i.e., microbial load).
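The arithmetic behind spike-in calibration is straightforward: the known number of spike-in copies added per sample fixes how many copies each read represents, and that factor can then rescale all other taxa. The sketch below is a minimal illustration of this conversion, assuming a single spike-in of known quantity per sample; the variable names and toy numbers are illustrative and do not reproduce any specific published protocol.

```python
# Minimal sketch: convert relative read counts to absolute copy-number estimates
# using an exogenous spike-in of known quantity (single-spike design assumed).
import numpy as np

def to_absolute(counts, spike_row, spike_copies_added):
    """Rescale a taxon-by-sample count table to estimated absolute copy numbers.

    counts             : 2D array (taxa x samples) including the spike-in row
    spike_row          : row index of the spike-in in `counts`
    spike_copies_added : number of spike-in copies added to each sample
    """
    counts = counts.astype(float)
    spike_reads = counts[spike_row, :]
    copies_per_read = spike_copies_added / spike_reads   # per-sample scaling factor
    absolute = counts * copies_per_read                  # broadcast across taxa
    return np.delete(absolute, spike_row, axis=0)        # drop the spike-in row

# Toy example: 1e6 spike-in copies added to each of three samples.
counts = np.array([
    [ 2000,  500, 1000],   # spike-in reads (row 0)
    [ 4000, 4000, 4000],   # a taxon with the same read count in every sample...
])
print(to_absolute(counts, spike_row=0, spike_copies_added=1e6))
# ...but very different estimated absolute abundances per sample.
```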
Despite obtaining a better estimate of the absolute number of molecules in a sample, it is still challenging to convert this number into the actual number of microbial cells. The calculation is often difficult because most gold-standard barcoding genes used for bacteria, fungi, and protists occur in multiple copies per genome, with copy numbers varying across taxa in unpredictable ways (Louca et al., 2018). Another problem is that some of the retrieved molecules can be derived from extracellular DNA or DNA adsorbed onto cell debris or particles, i.e., correspond to non-living organisms (Torti et al., 2015). The proportion of extracellular DNA in the environment is often not known but can be estimated with different approaches (reviewed by Nagler et al., 2022). Further, because extracellular DNA is often degraded, approaches such as long-read sequencing that target larger genomic regions will likely help to overcome this issue. Approaches that quantify or eliminate extracellular DNA can prove useful when an accurate snapshot of the living microbial community is crucial (e.g., when studying short-term processes with repeated observations capturing microbiome variation within host individuals), but are likely less relevant when studying processes operating at larger temporal scales (e.g., microbiome responses to climate change).
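To illustrate why copy-number variation matters, the short sketch below propagates a plausible range of per-taxon marker copy numbers through the molecules-to-cells conversion; even modest uncertainty in copy number translates into several-fold uncertainty in the estimated number of cells. All values are invented for illustration, and in practice per-taxon copy numbers are, as noted above, rarely known with confidence.

```python
# Minimal sketch: marker-copies-to-cells arithmetic with uncertain copy numbers.
# All numbers are invented; copy-number ranges are illustrative assumptions.
import numpy as np

molecules = np.array([1.2e7, 4.0e6, 9.0e5])               # estimated marker copies per sample aliquot
copy_number_range = np.array([[1, 1], [2, 7], [1, 15]])    # plausible min/max marker copies per genome

cells_high = molecules / copy_number_range[:, 0]           # assuming the minimum copy number
cells_low = molecules / copy_number_range[:, 1]            # assuming the maximum copy number

for i, (lo, hi) in enumerate(zip(cells_low, cells_high)):
    print(f"taxon {i}: {lo:.2e} - {hi:.2e} estimated cells")
```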
Lastly, incorporating site-occupancy modelling in microbiome studies presents an exciting avenue to quantify measurement uncertainty and to account for imperfect detection (e.g., Ficetola et al., 2015; McClenaghan et al., 2020; Willoughby et al., 2016). Site-occupancy models use data collected over multiple visits to sites (or across multiple technical/biological replicates) to quantify how likely a taxon is to be detected when it is present. For microbiome studies, including both biological and technical PCR replicates can enable rigorous statistical estimates of true and false positive detections of microbial species within the community. Further, these estimates can be utilised to improve study design (Fukaya et al., 2022; McClenaghan et al., 2020). How many replicates of each type are required remains an open question, although biological replicates may improve detection probabilities (Willoughby et al., 2016).
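As a minimal illustration, the sketch below fits a basic single-season occupancy model by maximum likelihood, treating samples (or host individuals) as "sites" and PCR replicates as repeat "visits", and jointly estimating the occupancy and per-replicate detection probabilities. The toy detection histories, the covariate-free model and the use of SciPy are illustrative assumptions; dedicated occupancy-modelling software offers far richer options.

```python
# Minimal sketch: single-season occupancy model fitted by maximum likelihood.
# "Sites" = samples or host individuals; "visits" = PCR replicates.
import numpy as np
from scipy.optimize import minimize
from scipy.special import comb, expit

def neg_log_likelihood(params, detections, n_replicates):
    """params = (logit psi, logit p); detections = number of positive replicates per site."""
    psi, p = expit(params)                       # keep probabilities in (0, 1)
    k, d = n_replicates, detections
    # Sites with at least one detection: occupied and detected d times out of k.
    ll_detected = (np.log(psi) + np.log(comb(k, d))
                   + d * np.log(p) + (k - d) * np.log(1 - p))
    # Sites with no detections: occupied but missed every time, or genuinely unoccupied.
    ll_missed = np.log(psi * (1 - p) ** k + (1 - psi))
    return -np.sum(np.where(d > 0, ll_detected, ll_missed))

# Toy data: detections of one taxon out of 3 PCR replicates at 8 sites.
detections = np.array([3, 2, 0, 1, 0, 2, 3, 0])
fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(detections, 3))
psi_hat, p_hat = expit(fit.x)
print(f"occupancy ~ {psi_hat:.2f}, per-replicate detection probability ~ {p_hat:.2f}")
```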