Short Communication
Chinese Hamster Ovary (CHO) cells are widely used to manufacture complex
biotherapeutic molecules at large scales. Industrial bioprocesses ensure
high product yield and quality by maintaining favorable growth
conditions in cell culture environments, which requires careful
monitoring and control of nutrient availability. Chemically defined
serum-free media can contain dozens to more than 100
components (Ritacco et al., 2018); key among these nutrients are the
proteinogenic amino acids, which are direct substrates and
regulators (Duarte et al., 2014; Fomina‐Yadlin et al., 2014) of
proliferation and protein synthesis. Unfortunately, conventional methods
for amino acid quantification based on liquid chromatography and mass
spectrometry are time-consuming and therefore difficult to use for timely
decision making and control of cell culture. Alternative spectroscopic
approaches have been sensitive to only a limited number of amino acid
species (Bhatia et al., 2018). Here we present a computational method to forecast time-course
amino acid concentrations from routine bioprocess measurements,
facilitating a timely and anticipatory control of the bioprocess (Figure
1).
At the foundation of our method is a genome-scale metabolic network
model, which accounts for the complex conversion from media nutrients to
biomass and recombinant protein production. Such models have been
increasingly utilized for CHO cells (Hefzi et al., 2016; Calmels et al.,
2019; Huang & Yoon, 2020) and bioprocess applications (Sommeregger et
al., 2017; Zhang & Hua, 2016), such as predicting clonal
performance (Popp et al., 2016), identifying metabolic
bottlenecks (Zhuangrong & Seongkyu, 2020), and optimizing media
formulation (Fouladiha et al., 2020; Traustason et al., 2019). Metabolic
network models can also estimate the amino acid uptake rates necessary to
support experimentally observed proliferation and productivity (Chen et
al., 2019). However, several challenges have limited their practical
application.
First, metabolic network models are typically highly complex but
under-constrained, and therefore are easy to overfit. This is mitigated
by training the model on a variety of bioprocess conditions and
metabolic phenotypes. Second, metabolic network models assume that cells
operate at some metabolic optimum, and thus tend to describe an
idealized metabolism specifically fit to the assumed objective (e.g.,
biomass production (Feist & Palsson, 2010; Szeliova et al., 2020) or
minimization of redox (Savinell & Palsson, 1992)). Third, for the
present purpose, these models need to predict amino acid consumption
fluxes, typically on the order of 10⁻³ mmol·gDW⁻¹·hr⁻¹ (see
Methods), from input data that are orders of magnitude larger, such as
growth rate and glucose consumption (10⁻¹ to
10⁻² mmol·gDW⁻¹·hr⁻¹).
The preceding two challenges increase prediction error. Lastly,
metabolic network models assume a steady state, which reduces the range
of forecast. Typically, input data from one day are used to make
predictions for the same day. However, such predictions cannot be
extended to multiple days or subsequent culture phases, as
cross-temporal shifts in metabolism would violate the steady state
assumption. Thus, model predictions of amino acid concentrations can be
overfit, idealized, and near-sighted – all of which dilute their
practicality for industrial bioprocess control. Here we demonstrate that
these weaknesses can be addressed in a data-driven manner by coupling a
metabolic network model with machine learning.
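For reference, the steady-state and optimality assumptions discussed above are commonly written as the following constraint-based formulation. This is the generic textbook form rather than the specific model used here, and the symbols (stoichiometric matrix S, flux vector v, bounds v^lb and v^ub, objective weights c) are notation assumed for illustration.

```latex
\begin{aligned}
& S\,v = 0
  && \text{(steady state: no net accumulation of intracellular metabolites)} \\
& v^{\mathrm{lb}}_j \le v_j \le v^{\mathrm{ub}}_j
  && \text{(flux bounds; measured fluxes are fixed to their observed values)} \\
& \max_{v}\; c^{\top} v
  && \text{(assumed objective, e.g. biomass production)}
\end{aligned}
```

Because a genome-scale network has far more reactions than metabolites, many flux vectors v satisfy the first two lines, which is the sense in which such models are under-constrained.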
We developed this hybrid approach on a diverse set of 10 CHO clones with
different growth and productivity profiles from two different fed-batch
production processes. These clones were subjected to different
bioprocess conditions and expressed different recombinant antibodies (see Methods),
resulting in a variety of phenotypes and productivity performances (Fig.
S1). For example, several high-performing clones were exceptionally
proliferative or productive, suggesting an efficient conversion from
nutrients to biomass or recombinant protein product. Other clones
performed these conversions at lower rates, suggesting attenuated
metabolic activity or inefficient resource utilization. The CHO cells
adjusted their nutrient uptake according to these various metabolic
phenotypes, leading to diverse amino acid consumption patterns (Fig.
S2). For example, the consumption of glucose and serine differed by
several fold across conditions and time. Furthermore, different clones
varied in their consumption or secretion of key metabolites such as
lactate, alanine, glycine, and glutamine.
We sought to predict these diverse consumption behaviors using a
tailored model of CHO metabolism (Schinn et al., 2020). As input
information, we utilized the following routinely measured industrial
bioprocess data: (1) viable cell density and titer measurements, from
which growth rate and specific productivity are calculated (Methods,
equation 1), and (2) bioreactor concentrations of glucose, lactate,
glutamate and glutamine, from which their respective consumption rates
are calculated. These measurements were used as boundary conditions by
constraining the fluxes of biomass production, recombinant protein
synthesis and consumption of the four metabolites to observed values.
Subsequently, we used Markov chain Monte Carlo sampling of metabolic
fluxes (Schellenberger et al., 2011) to characterize the range and
magnitude of all reaction fluxes under these constraints, and from the
sampled distributions calculated the likely uptake fluxes of the
remaining 18 proteinogenic amino acids (see Methods). These predictions
were made for the 10 CHO clones across 8 days of a 12-day production run
(days 4 to 11), resulting in a total of 80 individual predictions.
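To illustrate how routine measurements can be turned into the boundary conditions described above, the following minimal sketch computes a specific growth rate, a specific productivity, and a specific glucose consumption rate between two sampling times. The formulas (exponential growth between samples, and rates normalized by the trapezoidal integral of viable cell density) are common conventions assumed here and are not necessarily identical to equation 1 in Methods. All function names and numerical values are hypothetical, and feed additions and volume changes are ignored.

```python
import math

def growth_rate(vcd_start, vcd_end, hours):
    """Specific growth rate (1/hr), assuming exponential growth between samples."""
    return math.log(vcd_end / vcd_start) / hours

def integral_viable_cells(vcd_start, vcd_end, hours):
    """Trapezoidal integral of viable cell density over the interval (cells*hr/mL)."""
    return 0.5 * (vcd_start + vcd_end) * hours

def specific_rate(conc_start, conc_end, vcd_start, vcd_end, hours):
    """Concentration change per cell per hour; negative values mean consumption."""
    ivc = integral_viable_cells(vcd_start, vcd_end, hours)
    return (conc_end - conc_start) / ivc

# Hypothetical measurements taken 24 hours apart (assumed values, not from the study).
vcd0, vcd1 = 5.0e6, 8.0e6        # viable cell density, cells/mL
titer0, titer1 = 0.50, 0.80      # product titer, mg/mL
glc0, glc1 = 30.0, 22.0          # glucose, umol/mL

mu = growth_rate(vcd0, vcd1, 24.0)                       # 1/hr
qp = specific_rate(titer0, titer1, vcd0, vcd1, 24.0)     # mg/cell/hr, positive
q_glc = specific_rate(glc0, glc1, vcd0, vcd1, 24.0)      # umol/cell/hr, negative

print(f"growth rate: {mu:.4f} 1/hr")
print(f"specific productivity: {qp:.3e} mg/cell/hr")
print(f"specific glucose rate: {q_glc:.3e} umol/cell/hr")
```

In a workflow like the one described, rates of this kind would then be converted to the model's units and used to fix the corresponding exchange fluxes.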
We evaluated the resulting model predictions in two ways. First, we
examined the differences in model predictions and experimental
measurements of amino acid uptake and secretion (Figure 2A). For most
amino acids, this difference was small compared to the scale of input
data, suggesting that metabolic models can describe the conversion from
nutrients to biomass and recombinant proteins. Second, we examined the
fold changes between model predictions and experimental measurements.
These fold change errors are summarized in Figure 2B by their mean and
variance across the 80 observations. Overall, fold change error varied
significantly across amino acids. For example, the model predicted some
essential amino acids consistently well – e.g. phenylalanine, cysteine
and tryptophan (fold change ≈ 1), but predicted others poorly – e.g.
alanine, lysine, glycine, and methionine (fold change ≈ 0). Overall, the
sizeable fold change errors for many amino acids confirm the difficulty
of using metabolic network models alone to predict amino acid
consumption.
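The fold change summary used in this evaluation can be sketched as follows. The sketch assumes that fold change is defined as the predicted rate divided by the measured rate, so that a value near 1 indicates agreement and a value near 0 indicates underestimation; the exact definition is given in Methods. The two amino acids and all numbers below are hypothetical.

```python
from statistics import mean, variance

def fold_changes(predicted, measured):
    """Ratio of predicted to measured rate; observations with zero measurement are skipped."""
    return [p / m for p, m in zip(predicted, measured) if m != 0]

def summarize(predicted, measured):
    """Mean and sample variance of the fold change error across observations."""
    fc = fold_changes(predicted, measured)
    return mean(fc), variance(fc)

# Hypothetical uptake rates (mmol/gDW/hr) for two amino acids over three observations.
well_predicted = summarize([0.0040, 0.0050, 0.0045], [0.0040, 0.0050, 0.0050])
poorly_predicted = summarize([0.0002, 0.0001, 0.0003], [0.0040, 0.0020, 0.0030])

print("well predicted: mean=%.3f variance=%.4f" % well_predicted)
print("poorly predicted: mean=%.3f variance=%.4f" % poorly_predicted)
```

A low variance combined with a mean far from 1 is the pattern that makes an empirical correction attractive: the error is large but consistent.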
Notably, the model systematically underestimated consumption rates for
almost all amino acids (fold change < 1). This is likely
because the model does not account for certain metabolic inefficiencies –
e.g. CHO cells consume more amino acids than needed for the observed
production of biomass and recombinant protein, and catabolize the excess
into byproducts (Mulukutla et al., 2017). Furthermore, the variance of fold
change error was relatively low (≤1) for most amino acids. This suggests
that the difference between model ideality and biological reality
remained consistent across many clones and conditions.
We hypothesized that this consistent gap could be bridged with data and
statistical modeling. We constructed a series of linear regression
models to ‘correct’ the predictions from metabolic modeling, using
growth rate and the predictions from the metabolic model as explanatory
variables (Methods, equation 2). The 80 observations were randomly
divided into a training dataset and validation dataset, consisting of 48
and 32 observations, respectively. The regression coefficients were
first estimated from the training dataset and then applied to the
validation dataset. According to validation results, the regression
models substantially improved predictions, as fold change error
approached unity for most amino acids (Fig. 3B). As exceptions,
predictions for alanine, glycine and histidine were not reliably
improved (Fig. 3, red). These results were replicated in additional
validation studies involving four distinct clones (Supplementary
Document).
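The correction step can be sketched as an ordinary least-squares fit on a random 48/32 split. The sketch assumes a regression of the form measured ≈ b0 + b1·(growth rate) + b2·(metabolic model prediction) for a single amino acid; the actual form is equation 2 in Methods and may differ. The data are synthetic and noise-free so that the fit can be checked, and all names and coefficients are hypothetical.

```python
import random

def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_correction(growth, predicted, measured):
    """Least-squares coefficients (b0, b1, b2) via the normal equations."""
    rows = [[1.0, g, p] for g, p in zip(growth, predicted)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, measured)) for i in range(3)]
    return solve_linear_system(xtx, xty)

def apply_correction(coeffs, growth, predicted):
    """Corrected uptake rates from growth rate and raw model prediction."""
    b0, b1, b2 = coeffs
    return [b0 + b1 * g + b2 * p for g, p in zip(growth, predicted)]

# Synthetic data: 80 observations generated from an assumed linear relationship.
rng = random.Random(0)
growth = [rng.uniform(0.005, 0.03) for _ in range(80)]
predicted = [rng.uniform(1e-3, 5e-3) for _ in range(80)]
measured = [5e-4 + 0.05 * g + 1.8 * p for g, p in zip(growth, predicted)]

# Random split into 48 training and 32 validation observations.
indices = list(range(80))
rng.shuffle(indices)
train, valid = indices[:48], indices[48:]

coeffs = fit_correction([growth[i] for i in train],
                        [predicted[i] for i in train],
                        [measured[i] for i in train])
corrected = apply_correction(coeffs,
                             [growth[i] for i in valid],
                             [predicted[i] for i in valid])
valid_measured = [measured[i] for i in valid]
print("coefficients:", coeffs)
```

With real measurements the validation error would not vanish as it does here; the sketch only shows the mechanics of estimating coefficients on one subset and applying them to another.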
These results show that our hybrid modeling approach estimates amino
acid consumption well over a short timescale of 1 day, when the steady
state assumption holds true. This assumption is not valid at longer
timescales of multiple days, where nutrient consumption declines
asymptotically as cellular metabolism shifts from exponential growth
phase to stationary phase. However, we found this limitation could be
remedied by modeling the multi-phase decline in amino acid consumption
with a simple sigmoid function (Methods, equation 3; line in Fig. 4),
which can be fitted from only a few datapoints. Specifically, we further
adapted our hybrid modeling approach by first predicting amino acid
consumption rates for several early culture days as described above.
Then, these datapoints were used to fit a sigmoid function that
described the entire consumption profile, including later culture days
(Fig. 4A). Using this approach, we accurately predicted the time-course
consumption rates of 13 out of 18 amino acids (Spearman ρ >
0.65; Fig. 4B), with only a few amino acids remaining difficult to predict
(alanine, glycine, and histidine). Notably, our approach accurately
predicted the consumption profiles of amino acids that are highly
abundant in recombinant antibodies (e.g. serine, valine, and
leucine) (Fan et al., 2015), or that complicate media formulation due to
low solubility (e.g. tyrosine). These results highlight the method’s
value in monitoring and forecasting the bioreactor environment.
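The forecasting step can be sketched as fitting a declining sigmoid to a few early datapoints and then evaluating it on later days. The functional form below, q(t) = q_max / (1 + exp(k·(t − t_half))), is an assumption made for illustration and may differ from equation 3 in Methods. The grid search is a simple stand-in for a proper nonlinear least-squares routine, and the data are synthetic.

```python
import math

def sigmoid_decline(t, q_max, k, t_half):
    """Assumed form: consumption falls from q_max toward zero around day t_half."""
    return q_max / (1.0 + math.exp(k * (t - t_half)))

def fit_sigmoid(days, rates):
    """Grid search over k and t_half; q_max is solved in closed form for each pair."""
    best = None
    for ki in range(1, 31):          # k from 0.1 to 3.0 per day
        for ti in range(40, 121):    # t_half from day 4.0 to day 12.0
            k, t_half = ki / 10, ti / 10
            shape = [sigmoid_decline(t, 1.0, k, t_half) for t in days]
            q_max = (sum(s * y for s, y in zip(shape, rates))
                     / sum(s * s for s in shape))
            sse = sum((q_max * s - y) ** 2 for s, y in zip(shape, rates))
            if best is None or sse < best[0]:
                best = (sse, q_max, k, t_half)
    return best[1:]

# Synthetic early-culture rates generated from assumed parameters (not study data).
true_params = (4e-3, 1.0, 7.0)       # q_max (mmol/gDW/hr), k (1/day), t_half (day)
early_days = [4, 5, 6, 7]
early_rates = [sigmoid_decline(t, *true_params) for t in early_days]

q_max, k, t_half = fit_sigmoid(early_days, early_rates)

# Forecast the full profile, including days not used for fitting.
all_days = list(range(4, 12))
forecast = [sigmoid_decline(t, q_max, k, t_half) for t in all_days]
for day, rate in zip(all_days, forecast):
    print(f"day {day}: {rate:.2e} mmol/gDW/hr")
```

In practice the early datapoints would be the corrected daily predictions rather than values drawn from a known curve, so the fitted parameters would carry uncertainty that this noise-free example does not show.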
In summary, the presented modeling workflow forecasted the entire amino
acid consumption profile from early bioprocess measurements,
facilitating anticipatory and in situ control of bioreactor
nutrient availability. This was realized by a novel combination of
metabolic and statistical models. A metabolic network model estimated
amino acid uptake rates necessary for observed proliferation and
productivity, assuming an ideally efficient metabolism and steady state
conditions. Two subsequent regression models refined these predictions
by offsetting prediction errors empirically and by describing the
time-course relationship of individual predictions. Our efforts are part
of a growing trend of combining metabolic network models with machine
learning methods (Zampieri et al., 2019), and demonstrate the power of
hybrid modeling for on-line control of bioprocesses.