A second benchmarking dataset included erythrocyte data from the human
Blood Proteoform Atlas, specifically datasets PXD026123 (DS-01: 35 LC-MS/MS runs) and PXD026159 (DS-53: 6 LC-MS/MS runs). We applied a minimal modification setting
(Oxidation [unimod:35], Methylation [unimod:34]) and used a
human sequence database from UniProtKB/Swiss-Prot (canonical; release
2023_01), applying a false discovery rate (FDR) cutoff of 1% at both
the spectrum level and the proteoform level.
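To illustrate what a 1% FDR cutoff means in practice, the following is a minimal sketch of score-based filtering. It assumes a target-decoy estimate of the FDR, which is a common approach but is not stated in the text above; the `Prsm` record, its `e_value` and `is_decoy` fields, and the function `filter_by_fdr` are hypothetical names introduced for this example and do not reflect the internals of any identification tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Prsm:
    """Hypothetical proteoform-spectrum match record (illustration only)."""
    spectrum_id: str
    e_value: float   # assumed score: lower means a better match
    is_decoy: bool   # assumed flag: True if matched against a decoy sequence


def filter_by_fdr(prsms: Iterable[Prsm], cutoff: float = 0.01) -> List[Prsm]:
    """Keep target PrSMs whose estimated q-value is at most `cutoff`.

    Assumption: FDR at a given rank is estimated as decoys / targets
    among all matches ranked at least as well.
    """
    ranked = sorted(prsms, key=lambda p: p.e_value)

    # Running FDR estimate at each rank.
    fdrs = []
    targets = decoys = 0
    for p in ranked:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / targets if targets else 1.0)

    # q-value: smallest FDR at this rank or any worse rank (capped at 1.0).
    q_values = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [p for p, q in zip(ranked, q_values)
            if not p.is_decoy and q <= cutoff]
```

The same filter could be applied once to PrSMs and once to proteoform-level representatives to obtain the two cutoff levels; how the tools used here implement this internally is not covered by the sketch.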
Table 2 shows the results obtained with the different identification
tools, broken down by identified proteins (accessions), number of
truncated sequences for PrSMs with or without PTMs, number of
proteoforms, and number of PrSMs. The samples of PXD026123 were
subjected to an enrichment strategy, which explains the overall greater
number of identifications compared with PXD026159. Switching the
identification software from TopMG to TopPIC also noticeably affects the
number of observed identifications. This can be explained by the
different design goals of the two tools. TopMG is
designed “for identifying highly modified proteoforms”, whereas TopPIC
is built for the characterisation of proteoforms at the proteome level
[21].
This is further corroborated by comparing, for both methods, the overlap
of accessions (75 of 209 (TopPIC)/101 (TopMG) in PXD026123 and 7 of
55/24 in PXD026159) and the overlap of proteoform sequences (126 of at
most 4826 unique PrSM sequences and 50 of at most 3085 unique PrSM
sequences); for details see Supplementary Figure S1.
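The overlap comparison described above can be sketched as a simple set intersection. This is an illustrative example only: the function names `overlap_summary` and `shared_fractions` are hypothetical, and the sketch does not reproduce the pipeline that produced the reported numbers.

```python
from typing import Dict, Iterable, Tuple


def overlap_summary(ids_a: Iterable[str], ids_b: Iterable[str]) -> Dict[str, float]:
    """Count unique identifiers per tool and how many are shared.

    `ids_a` and `ids_b` are assumed to be accessions (or proteoform
    sequences) reported by two identification tools for the same dataset.
    """
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    return {
        "n_a": len(a),
        "n_b": len(b),
        "shared": len(shared),
        "frac_a": len(shared) / len(a) if a else 0.0,
        "frac_b": len(shared) / len(b) if b else 0.0,
    }


def shared_fractions(shared: int, n_a: int, n_b: int) -> Tuple[float, float]:
    """Shared fraction relative to each tool, from already-counted totals."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("totals must be positive")
    return shared / n_a, shared / n_b


if __name__ == "__main__":
    # Placeholder identifiers, not real results.
    print(overlap_summary(["ACC1", "ACC2", "ACC3"], ["ACC2", "ACC3", "ACC4"]))
    # Accession counts reported in the text for PXD026123 (TopPIC, TopMG).
    frac_toppic, frac_topmg = shared_fractions(75, 209, 101)
    print(f"{frac_toppic:.3f} {frac_topmg:.3f}")
```

Applied to the accession counts reported for PXD026123, the shared accessions correspond to roughly 36% of the TopPIC accessions and roughly 74% of the TopMG accessions.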