Beyond estimating diversity: Exciting advances in statistics
Modelling advances in community ecology offer exciting opportunities to understand the complex patterns in microbial diversity and complement robust sampling designs (e.g., Grantham et al., 2020; Trego et al., 2022). In addition, novel methods for analyzing amplicon sequencing data are continuously emerging, primarily focused on the human gut microbiome but adaptable to other microbial ecology fields with suitable study designs and datasets (e.g., Trego et al., 2022). Broadly, these tools can be categorized by purpose: quantifying community assembly processes, mapping occurrence networks, capturing spatial/temporal dynamics, integrating multi-omics, identifying differentially abundant taxa, finding species-environment associations, and predicting functional patterns (Trego et al., 2022). However, despite the frequent use of high-throughput sequencing, uptake of these new analytical techniques has been slow, and many studies do not go much beyond basic comparisons of alpha and beta diversity estimates across samples. While important inferences can be made by examining overall patterns of composition and diversity (e.g., Grosser et al., 2019; Motta et al., 2018), such comparisons offer only a starting point toward a more mechanistic understanding of the ecological drivers of microbiome variation (Shade, 2017).
Common analytical approaches to quantify differences in beta diversity across microbiome samples, such as the permutational multivariate analysis of variance (PERMANOVA), are algorithmic (i.e., not based on a statistical model) and do not explicitly account for uncertainty in ecological data (Björk et al., 2018; Warton et al., 2012, 2015). Importantly, making inferences about microbiome variation is often difficult using algorithmic distance-based approaches (Björk et al., 2018; Warton et al., 2012). Model-based approaches such as joint species distribution models (JSDMs) or stacked models (Powell-Romero et al., 2023) are multi-response extensions of generalized linear mixed models (GLMMs) that can overcome some of the limitations of the algorithmic methods to elucidate patterns of microbiome variation (e.g., Björk et al., 2018; Grantham et al., 2020). Often using a Bayesian framework, JSDMs simultaneously analyze multiple species and environmental variables, allowing for the assessment of community-level responses to environmental change and host effects (Björk et al., 2018; Ovaskainen et al., 2017; Pollock et al., 2014). JSDMs can i) incorporate information on species traits and phylogenetic relatedness, improving estimation accuracy and power when there is a phylogenetic signal (Ovaskainen et al., 2017), and ii) analyze patterns of taxon covariance to infer microbial co-occurrence networks (Björk et al., 2018; Fountain-Jones et al., 2020, 2023). Microbial co-occurrence networks are valuable tools in microbiome science, as they offer insights (but see Current gaps and future directions below) into the associations among microbial taxa, enhancing our understanding of microbial community dynamics and functioning.
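To make concrete what "algorithmic" means for a distance-based test such as PERMANOVA, the following is a minimal sketch in Python with NumPy. It is not code from any of the cited studies: the function names (`bray_curtis`, `pseudo_f`, `permanova`), the toy count table, and the choice of Bray-Curtis dissimilarity are illustrative assumptions. The test partitions squared pairwise dissimilarities into among-group and within-group sums of squares and obtains a p-value by shuffling group labels. At no point is a probability model for the counts specified, so the output contains no per-taxon effects or uncertainty estimates.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between samples (rows)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den if den > 0 else 0.0
    return dist


def pseudo_f(dist, groups):
    """Pseudo-F statistic for a one-way design, from a distance matrix."""
    groups = np.asarray(groups)
    n = len(groups)
    labels = np.unique(groups)
    d2 = dist ** 2
    # Total sum of squares from all pairwise squared distances.
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for g in labels:
        idx = np.where(groups == g)[0]
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, groups, n_perm=999, seed=0):
    """Permutation p-value: how often shuffled labels give F >= observed F."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    f_obs = pseudo_f(dist, groups)
    hits = sum(
        pseudo_f(dist, rng.permutation(groups)) >= f_obs for _ in range(n_perm)
    )
    return f_obs, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 8 samples x 3 taxa, two host groups of 4 samples.
counts = np.array([
    [10, 0, 1], [9, 1, 0], [11, 0, 2], [8, 2, 1],
    [0, 10, 1], [1, 9, 0], [0, 11, 2], [2, 8, 1],
])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])

dist = bray_curtis(counts)
f_obs, p_value = permanova(dist, groups)
print(f"pseudo-F = {f_obs:.2f}, p = {p_value:.3f}")
```

The result is a single test statistic and p-value for the grouping as a whole. Questions such as which taxa respond, by how much, and with what uncertainty require the model-based approaches described above.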
JSDM-based co-occurrence networks have the added advantage of being easier to interpret, because the major environmental and host effects shaping microbial occurrence are controlled for (i.e., an inferred association between microbes is then not likely a mere product of a shared environmental response). However, GLMM-based JSDM co-occurrence networks cannot untangle the relative roles of taxon associations and environmental or host effects (Clark et al., 2018; Fountain-Jones et al., 2020), and they tend not to scale well to large datasets (Pichler & Hartig, 2021). Approaches such as conditional random fields (CRF, Clark et al., 2018), multi-response interpretable machine learning (mrIML, Fountain-Jones et al., 2021), MIMIX (Microbiome MIXed Model, Grantham et al., 2020) and scalable JSDMs (sjSDM, Pichler & Hartig, 2021) can overcome these limitations. Importantly, approaches such as mrIML and MIMIX allow predictions and treatment effects to be extracted for individual taxa, which can be useful if researchers have a set of focal taxa. We note that these methods are not appropriate in all situations. For particularly large datasets (thousands of samples), new distance-based methods such as D-MANOVA (Chen & Zhang, 2021) or multivariate distance matrix regression (MDMR, Zapala & Schork, 2012) may be better options. Boshuizen & te Beest (2023) have provided a complete guide to the pitfalls in analyzing amplicon data. While the tools mentioned here represent only a tiny fraction of the potential methods available, we encourage readers to go beyond diversity metrics and differentially abundant taxa to gain more mechanistic insights into microbiome data from wild species.
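The idea that taxon associations should be read after controlling for shared environmental responses can be illustrated with a deliberately simplified sketch in Python with NumPy. This is not a JSDM and not the method of any cited study: it fits one ordinary least-squares regression per taxon (a stacked linear model) and then correlates the residuals. The function name `residual_associations`, the simulated data, and the assumption of Gaussian responses (e.g., transformed abundances) are all hypothetical choices made for illustration.

```python
import numpy as np


def residual_associations(responses, env):
    """Taxon-by-taxon correlations after removing linear environmental effects.

    responses: samples x taxa matrix (assumed roughly Gaussian, e.g. transformed
               abundances); env: samples x covariates (or a 1-D covariate).
    """
    y = np.asarray(responses, dtype=float)
    # Design matrix with an intercept column plus the environmental covariates.
    x = np.column_stack([np.ones(len(y)), np.asarray(env, dtype=float)])
    # One least-squares fit per taxon, solved jointly for all columns of y.
    beta, *_ = np.linalg.lstsq(x, y, rcond=None)
    resid = y - x @ beta
    return np.corrcoef(resid, rowvar=False)


# Simulated example (hypothetical): 500 samples, one covariate, three taxa.
rng = np.random.default_rng(1)
n = 500
env = rng.normal(size=n)
e1, e2, e3 = rng.normal(size=(3, n))

taxon_a = 2.0 * env + e1               # responds to the environment
taxon_b = 2.0 * env + e2               # same response, no direct link to A
taxon_c = -1.0 * env + e1 + 0.5 * e3   # opposite response, but linked to A

y = np.column_stack([taxon_a, taxon_b, taxon_c])
raw = np.corrcoef(y, rowvar=False)
adjusted = residual_associations(y, env)

print("raw A-B:", round(raw[0, 1], 2), "adjusted A-B:", round(adjusted[0, 1], 2))
print("raw A-C:", round(raw[0, 2], 2), "adjusted A-C:", round(adjusted[0, 2], 2))
```

In this simulation the raw correlation between taxa A and B arises only from their shared response to the covariate and should largely disappear after adjustment, whereas the association between A and C is masked in the raw data and emerges in the residuals. A full JSDM estimates these quantities jointly and with uncertainty, which this sketch does not attempt.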
Incorporating some of the methodological advances in bioinformatics and statistics, coupled with robust study design and rigorous laboratory techniques, will improve current research efforts in the field (see Fig. 1 for a summary). Moreover, taking into consideration both the limitations and opportunities of these various approaches allows us to open up exciting new avenues in the field of microbiome ecology research.