A second benchmarking dataset comprised erythrocyte data from the human Blood Proteoform Atlas, specifically datasets PXD026123 (DS-01: 35 LC-MS/MS runs) and PXD026159 (DS-53: 6 LC-MS/MS runs). We applied a minimal modification setting (Oxidation [unimod:35], Methylation [unimod:34]) and searched against a human sequence database from UniProtKB/Swiss-Prot (canonical; release 2023_01), applying spectrum-level and proteoform-level false discovery rate (FDR) cutoffs of 1%.
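To illustrate what a 1% FDR cutoff means in practice, the following minimal sketch filters a ranked list of matches using a simple target-decoy estimate (decoys divided by targets). This is an assumption made for illustration only: the identification tools apply their own FDR estimation internally, and the function name `filter_by_fdr` and the `(identifier, score, is_decoy)` tuple layout are hypothetical.

```python
def filter_by_fdr(matches, max_fdr=0.01):
    """Keep target matches up to the deepest rank where the estimated FDR <= max_fdr.

    matches: iterable of (identifier, score, is_decoy) tuples (hypothetical layout).
    Higher scores are assumed to be better.
    """
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked matches to keep
    for rank, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Simple target-decoy estimate: FDR ~ decoys / targets
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    # Report only target matches above the cutoff
    return [ident for ident, _, is_decoy in ranked[:cutoff] if not is_decoy]


# Toy input, not real data
example = [("a", 10.0, False), ("b", 9.0, False), ("d1", 8.0, True), ("c", 7.0, False)]
accepted = filter_by_fdr(example, max_fdr=0.01)
```

With the toy input, only the two matches ranked above the first decoy pass the 1% threshold. The same logic would apply at the proteoform level by ranking proteoforms instead of spectra.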
Table 2 shows the analysis results obtained with the different identification tools, broken down by identified proteins (accessions), number of truncated sequences for PrSMs with or without PTMs, number of proteoforms, and number of PrSMs. The samples of PXD026123 were subjected to an enrichment strategy, which explains the overall greater number of identifications compared with PXD026159. Switching the identification software from TopMG to TopPIC also noticeably affects the number of identifications. This can be explained by the different design goals of the two tools: TopMG is designed “for identifying highly modified proteoforms”, whereas TopPIC is built for the characterisation of proteoforms at the proteome level [21]. This is further corroborated by comparing the overlap of accessions (75 of 209 (TopPIC)/101 (TopMG) in PXD026123 and 7 of 55/24 in PXD026159) and the overlap of proteoform sequences (126 of max. 4826 unique PrSM sequences and 50 of max. 3085 unique PrSM sequences) between the two methods (for details see Supplementary Figure S1).
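The overlap figures quoted above are set intersections of the identifiers reported by each tool. A minimal sketch of such a comparison follows; the function name `overlap_summary` is hypothetical, and the placeholder accessions are synthetic, constructed only to reproduce the PXD026123 accession counts given in the text (209 for TopPIC, 101 for TopMG, 75 shared).

```python
def overlap_summary(ids_a, ids_b):
    """Summarise the overlap between two collections of identifiers."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    union = a | b
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(shared),
        "union": len(union),
        # Jaccard index: shared identifiers relative to all identifiers seen
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Synthetic placeholder accessions (not real UniProt entries), sized to match
# the PXD026123 accession counts reported in the text.
toppic_accessions = {f"ACC{i}" for i in range(209)}       # 209 accessions
topmg_accessions = {f"ACC{i}" for i in range(134, 235)}   # 101 accessions, 75 shared

summary = overlap_summary(toppic_accessions, topmg_accessions)
```

For these counts the union holds 235 accessions, of which 75 are shared. The same function could be applied to unique PrSM sequences instead of accessions.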