Main Text
Proteomics techniques are increasingly tackling the identification and quantification of intact proteoforms, via the so-called top-down (TD) proteomics mass spectrometry (MS)-based approaches, in addition to the more widely used ‘bottom-up’ proteomics analysis. Proteoforms are the distinct molecular species of an expressed and translated gene including the precise combination of sequence variants, alternative splicing events, post-translational proteolysis (e.g. signal peptide cleavage) and post-translational modifications (PTMs) [1].
The increasing popularity of open science practices in proteomics has resulted in the dramatic growth of publicly available MS-based proteomics datasets. The PRIDE database (https://www.ebi.ac.uk/pride/) [2] is the largest proteomics data repository worldwide and is one of the founding members of the ProteomeXchange consortium [3]. As of May 2023, PRIDE stores approximately 34,000 datasets. ProteomeXchange resources adopt the FAIR data principles (Findability, Accessibility, Interoperability and Reusability) [4]. One of the main benefits of making data publicly available is to enable data reuse and the reproducibility of the analysis, facilitating an independent assessment of the results described in the corresponding publications. Additionally, as is already happening in other proteomics fields, new knowledge and tools are being generated from data reuse activities, for instance by applying machine learning techniques [5]. Furthermore, in the context of data interoperability, it is important to highlight that ProteomeXchange resources implement the main open data standards developed by the Proteomics Standards Initiative [6], such as mzML (for MS data) [7] and mzTab (for peptide/protein identification and quantification) [8,9].
However, in the context of open science practices, TD proteomics lags behind other, more widely implemented proteomics fields. For instance, the total number of TD proteomics datasets submitted to PRIDE is still small (approximately 231 datasets, of which 50 are still private, as of May 2023), although the trend is slowly changing. In our view, this is due to two reasons. On the one hand, the number of TD proteomics practitioners is still relatively low compared with bottom-up approaches. On the other hand, and more importantly, TD datasets are not being made available at the same pace as in other proteomics fields. Furthermore, to increase the reuse of TD proteomics data, it is important to develop open software that enables computational data (re-)analysis and visualisation. Indeed, advances in topics such as data integration and the reproducibility of computational analyses, including scaling up analysis capabilities for increasingly large experiments, are key to matching the developments in other proteomics fields, and in the life sciences as a whole [10].
Here we introduce TopDownApp, an open, modular and flexible platform for the analysis and visualisation of TD proteomics data, which can be applied, for instance, to public datasets. Since TD data analysis methodology is still in flux, a flexible and modular approach to data analysis was necessary. This is why TopDownApp was implemented in Nextflow [11] as a set of automated and modular analysis workflows, i.e. a flexible succession of tools whose inputs and outputs are connected through open data standard formats (as an interoperability layer), with a modular choice of tools enabled by software containerisation. In the context of TD data analysis, the most common tasks at present, generally performed in succession, are MS raw file access, deconvolution and proteoform identification.
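To make the idea of a modular succession of tools joined by standard formats more concrete, the following minimal Python sketch models the three tasks named above as interchangeable steps. It is an illustration only, not the TopDownApp implementation (which uses Nextflow and containerised tools). The `Step` class, the functions `convert_raw`, `deconvolve` and `identify`, and the format labels assigned to each step are all hypothetical assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One workflow task with declared input and output formats (hypothetical)."""
    name: str
    consumes: str              # format label this step reads
    produces: str              # format label this step writes
    run: Callable[[str], str]  # maps an input file name to an output file name


def validate(steps: List[Step]) -> None:
    """Check that each step reads the format the previous step writes."""
    for prev, nxt in zip(steps, steps[1:]):
        if prev.produces != nxt.consumes:
            raise ValueError(
                f"{prev.name} produces {prev.produces!r} "
                f"but {nxt.name} expects {nxt.consumes!r}"
            )


def execute(steps: List[Step], artefact: str) -> str:
    """Run the steps in order, passing each output to the next step."""
    validate(steps)
    for step in steps:
        artefact = step.run(artefact)
    return artefact


# Placeholder tools: each only renames the file to show the data flow.
# Real tools would be swapped in here without touching the other steps.
def convert_raw(path: str) -> str:
    return path.rsplit(".", 1)[0] + ".mzML"


def deconvolve(path: str) -> str:
    return path.rsplit(".", 1)[0] + ".deconv.mzML"


def identify(path: str) -> str:
    return path.rsplit(".", 1)[0] + ".mzTab"


# Format labels are assumptions chosen for illustration.
pipeline = [
    Step("raw_access", "vendor_raw", "mzML", convert_raw),
    Step("deconvolution", "mzML", "deconvolved_mzML", deconvolve),
    Step("identification", "deconvolved_mzML", "mzTab", identify),
]

result = execute(pipeline, "sample.raw")
print(result)
```

Because each step declares only the format it reads and the format it writes, any one tool could in principle be replaced by another that honours the same formats, which is the role the open data standards play as an interoperability layer.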