Main Text
Proteomics techniques are increasingly tackling the identification and
quantification of intact proteoforms, via so-called top-down (TD)
mass spectrometry (MS)-based proteomics approaches, in addition to the
more widely used ‘bottom-up’ proteomics analysis. Proteoforms are the
distinct molecular species of an expressed and translated gene,
including the precise combination of sequence variants, alternative
splicing events, post-translational proteolysis (e.g. signal peptide
cleavage) and post-translational modifications (PTMs)
[1].
The increasing popularity of open science practices in proteomics has
resulted in the dramatic growth of publicly available MS-based
proteomics datasets. The PRIDE database
(https://www.ebi.ac.uk/pride/)
[2]
is the largest proteomics data repository worldwide and is one of the
founding members of the ProteomeXchange consortium
[3].
As of May 2023, PRIDE stores approximately 34,000 datasets.
ProteomeXchange resources adopt the FAIR data principles (Findability,
Accessibility, Interoperability and Reusability)
[4].
One of the main benefits of making data publicly available is to enable
data reuse and the reproducibility of the analysis, facilitating an
independent assessment of the results described in the corresponding
publications. Additionally, as is already happening in other proteomics
fields, new knowledge and tools are being generated from data reuse
activities, for instance by applying machine learning techniques
[5].
Furthermore, in the context of data interoperability, it is important to
highlight that ProteomeXchange resources implement the main open data
standards developed by the Proteomics Standards Initiative
[6],
such as mzML (for MS data)
[7]
and mzTab (for peptide/protein identification and quantification)
[8,9].
However, in the context of open science practices, TD proteomics lags
behind other, more widely implemented proteomics fields. For instance,
the total number of TD proteomics datasets submitted to PRIDE is still
small (approximately 231 datasets, of which 50 are still private, as of
May 2023), although the trend is changing slowly. In our view, this is
due to two reasons. On the one hand, the number of TD proteomics
practitioners is still relatively low when compared with bottom-up
approaches. On the other hand, and more importantly, TD datasets are not
being made available at the same pace as in other proteomics fields.
Furthermore, to increase the reuse of TD proteomics data, it is
important to develop open software that enables computational data
(re-)analysis and visualisation. Indeed, advances in topics such as data
integration and the reproducibility of the computational analyses,
including scaling-up analysis capabilities for increasingly large
experiments, are key to matching the developments in other proteomics
fields, and in the life sciences as a whole
[10].
Here we introduce an open, modular and flexible platform called
TopDownApp for the analysis and visualisation of TD proteomics data,
which can be applied, for instance, to public datasets. Since the TD data
analysis methodology is still quite dynamic, a flexible and modular
approach to data analysis was necessary. This is why TopDownApp was
implemented as a set of automated, modular analysis workflows: a
flexible succession of tools whose inputs and outputs are connected
through open data standard formats (as an interoperability layer), with
a modular choice of tools (through software containerisation), all
orchestrated with Nextflow
[11].
In the context of TD data analysis, the most common tasks at present,
generally performed in succession, are: MS raw file access,
deconvolution and proteoform identification.
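
The idea of successive, interchangeable steps connected through data formats can be illustrated with a minimal Python sketch. It is not the TopDownApp or Nextflow implementation. The step names follow the three tasks listed above, but the format label "deconvolved", the use of mzTab as the identification output, the container tags and the helper functions are all hypothetical and chosen only for illustration. Each step declares the format it consumes and produces, so a step can be swapped for another tool as long as the formats still line up.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    name: str
    input_format: str    # format label the step consumes
    output_format: str   # format label the step produces
    container: str       # hypothetical container image tag
    run: Callable[[str], str]  # maps an input path to an output path


def relabel(suffix: str) -> Callable[[str], str]:
    """Stand-in for a real tool: only renames the path, no file I/O."""
    def _run(path: str) -> str:
        return str(PurePosixPath(path).with_suffix(suffix))
    return _run


def check_chain(steps: List[Step]) -> None:
    """Fail early if one step's output format is not the next step's input."""
    for prev, nxt in zip(steps, steps[1:]):
        if prev.output_format != nxt.input_format:
            raise ValueError(
                f"{prev.name} produces {prev.output_format}, "
                f"but {nxt.name} expects {nxt.input_format}"
            )


def run_chain(steps: List[Step], first_input: str) -> str:
    """Validate the chain, then pass each output path to the next step."""
    check_chain(steps)
    path = first_input
    for step in steps:
        path = step.run(path)
    return path


# Hypothetical three-step chain mirroring the tasks named in the text.
STEPS = [
    Step("raw_file_access", "raw", "mzML",
         "example/converter:latest", relabel(".mzML")),
    Step("deconvolution", "mzML", "deconvolved",
         "example/deconv:latest", relabel(".deconv")),
    Step("proteoform_identification", "deconvolved", "mzTab",
         "example/identify:latest", relabel(".mzTab")),
]

if __name__ == "__main__":
    print(run_chain(STEPS, "sample.raw"))
```

In this sketch, replacing the deconvolution tool only requires a new `Step` with the same input and output format labels; `check_chain` rejects a chain in which the formats no longer match.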