Ambiguity of protein identifications is an important issue in shotgun proteomics. It is due to the presence of shared peptides, which can generate quite complex peptide-to-protein mapping structures. These structures can be efficiently represented using bipartite graphs, with peptides and proteins as vertices and with edges featuring peptide to protein membership. Ambiguity in protein identifications can be assessed using graph connected components (CCs), i.e., the largest subgraphs in which any two vertices are connected to each other by a path and not connected to any other of the vertices in the supergraph. Proteins sharing one or more peptides are gathered in the same CC (multi-protein CCs), while proteins exclusively identified by specific peptides represent CCs with a single protein vertex (single-protein CCs). The proportion of multi-protein CCs and their size (i.e., number of protein members) can be used to measure the level of ambiguity of protein identifications.
Nowadays, an increasing number of sample-matched proteomic and transcriptomic datasets is available, which can be exploited to deal with ambiguity of protein identifications. Indeed, according to the central dogma of biology, there can be no protein without the corresponding transcript. Hence, protein identifications for which the corresponding transcript is identified in the sample-matched transcriptome are more likely to be correct than protein identifications with no expressed transcript. This rationale can be used to filter proteomic identifications based on transcriptome information and reduce ambiguity of protein identifications.
The net4pg package implements two main functionalities:
1. a pipeline to build, from proteomic identifications, a
bipartite graph of peptide-to-protein mappings and calculate graph CCs
to assess ambiguity of protein identifications. CCs represent a
peptide-centric strategy to group proteins and it is independent from
the variety of protein-centric strategies of protein grouping and
protein inference. As such, it does not require protein inference and it
is widely applicable, reproducible and transparent.
2. a transcriptome-informed filtering strategy which exploits
sample-matched transcriptome information to reduce ambiguity of protein
identifications in shotgun proteomics. This strategy
fundamentally consists in the removal of proteins for which the
corresponding transcript is not detected in the sample-matched
transcriptome. The impact of the filtering strategy can be then assessed
by building a new graph of filtered peptide-to-protein mappings,
calculating the proportion of multi-protein CCs and visually inspecting
peptide-to-protein mappings for ambiguous protein identifications of
interest.
This vignette illustrates how to use the above described methods implemented in this package for the following applications:
The required input are proteomic identifications obtained by shotgun proteomics and, more precisely, all valid peptide identifications and their corresponding proteins, prior to any protein inference strategy. The required input must be in the form of a tab-delimited incidence matrix with peptides along the rows and proteins along the columns and 1 or 0 cell values to indicate whether or not a peptide maps on the corresponding protein. Protein identifiers (column names) are to be provided, one per line, in a separate file and should be in the Ensembl format (i.e. ENSPXXXXXXXXXXX for human, ENSMUSPXXXXXXXXXXX for mouse) or, if protein contaminants, in any format followed by a unique tag to indicate them as contaminants. Peptide identifiers (row names) are to be provided, one per line, in a separate file and can be in any format (amino acid sequence, numeric identifiers, …).
The pipeline to generate a bipartite graph of peptide-to-protein mapping and calculate their connected components consists in the following steps:
Incidence matrices generated from proteomic datasets can be quite
large but they can be easily read in input chunk by chunk using the
read_inc_matrix() function. This function requires three inputs:
* the name of the tab-delimited file containing the incidence matrix
(peptide-to-protein mappings), with no column or row names;
* the name of the file containing the matrix column names (protein
identifiers), one per line;
* the name of the file containing the matrix row names (peptide
identifiers), one per line.
Read the incidence matrix describing peptides identified in a shotgun proteomic experiment and their and corresponding proteins
incM_filename <- system.file("extdata"
, "incM_example"
, package = "net4pg"
, mustWork = TRUE)
rownames_filename <- system.file("extdata"
, "peptideIDs_incM_example"
, package = "net4pg"
, mustWork = TRUE)
colnames_filename <- system.file("extdata"
, "proteinIDs_incM_example"
, package = "net4pg"
, mustWork = TRUE)
incM <- read_inc_matrix(incM_filename = incM_filename
, colnames_filename = colnames_filename
, rownames_filename = rownames_filename)
Check the size of the input incidence matrix:
Reduce the size of the input incidence matrix by removing all proteins not sharing peptides and all peptides exclusively mapping to these proteins. Only ambiguous protein identifications are left, which is proteins connected by shared peptides, hence belonging to a multi-protein CC. Reducing data size allows to decrease the computational cost of calculating graph connected components.
The adjacency matrix is calculated by cross-product of the reduced incidence matrix and it constitutes a more compact representation than an incidence matrix; it scales linearly with the number of proteins and it is independent of the number of peptides, which is convenient to speed up connected components calculation.
Build a graph representing protein-to-protein connections by shared
peptides, as described in the adjacency matrix, and calculate graph
connected components, using the get_cc() function. The function provides
in output a list of two elements:
i. a graph of the protein connections (by shared peptides) described in
the adjacency matrix;
ii. a list of n vectors (one per connected component)
enumerating protein members of each connected component
Extract the number of multi-protein CCs
The proportion of single- or multi-protein CCs and the size of multi-protein CCs represent a measure for the level of ambiguity of protein identifications and they can be obtained by the cc_stats() function, provided with the above calculated CCs and the original incidence matrix. The above calculated CCs exclusively include multi-protein CCs and no single-protein CC, since they were calculated on the reduced incidence matrix from step 2, which only contains proteins sharing peptides. This is specified by the reducedIncM parameter in the cc_stats() function. Single-protein CCs are calculated by cc_stats() as all those proteins from the original incidence matrix which do not belong to multi-protein CCs.
# Calculate CCs size and percentage of single- vs multi-protein CCs
CCstatsOut <- cc_stats(incM = incM
, cc.proteins = cc.multProteins
, reducedIncM = TRUE)
# Number of single-protein CCs:
CCstatsOut$N_singleProtCC
#> [1] 27
# Number of multi-protein CCs
CCstatsOut$N_multiProtCC
#> [1] 12
# Total number of CCs
totCCs <- CCstatsOut$N_singleProtCC + CCstatsOut$N_multiProtCC
totCCs
#> [1] 39
# Percentage of single-protein CCs:
PercSingleP <- round(CCstatsOut$N_singleProtCC / totCCs * 100, digits = 2)
PercSingleP
#> [1] 69.23
# View table of CC size distribution
CCstatsOut$NproteinsDistribution
#> N_proteins N_CCs
#> 1 1 27
#> 2 2 4
#> 3 3 0
#> 4 4 3
#> 5 5 0
#> 6 6 2
#> 7 7 0
#> 8 8 0
#> 9 9 0
#> 10 10 2
#> 11 >10 1
# Plot CC size distribution
plot(factor(CCstatsOut$NproteinsDistribution$N_proteins
, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
, as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC))
, type = "s"
, xlab = "N_proteins"
, ylab = "N_CCs")
The proportion of specific or shared peptides also represents a measure for the level of ambiguity of protein identifications. It can be calculated using the peptide_stats() function.
Given an ambiguous protein identification of interest (a protein with shared peptides), find the multi-protein CC it belong to, identify all protein members of that CC and all their peptides and represent this information as a bipartite graph of peptide-to-protein mappings.
First, extract all peptides and peptide-to-protein mapping
information for each CC using the cc.peptides.incM() function. This
function generates in output a list of two elements:
i. a list of vectors enumerating peptides included in each CC;
ii. a list of incidence matrices (or vectors if only one peptide)
describing peptide-to-protein mappings.
Then plot a bipartite graph representing the CC which includes the protein of interest:
# Generate the bipartite graph
prot <- "ENSP261"
subgraphCC <- plot_cc(prot = prot
, cc.proteins = cc.multProteins
, cc.subincM = cc.peptides.incM$cc.subincM
, incM = incM
, tagProt = "ENSP"
, tagContam = "Contam")
#> [1] "Protein ENSP261 is member of a multi-protein CC"
# Plot it
plot.igraph(subgraphCC$g
, layout = layout_as_bipartite
, edge.width = 1
, edge.arrow.width = 0.3
, vertex.size = 35
, edge.arrow.size = 0.5
, vertex.size2 = 35
, vertex.label.cex = 1
, asp = 0.25
, margin = -0.1) +
title(paste0("Protein ", prot, " in CC#", subgraphCC$cc_id), line = -1)
#> integer(0)
The required input is the same as in section 2: a tab-delimited incidence matrix describing all valid peptide identifications from shotgun proteomics and their corresponding proteins, prior to any protein inference strategy.
Filtering can be performed in three different ways, established by
the remove parameter of the transcriptome_filter()
function:
i. remove all proteins for which no transcript is expressed according to
the sample-matched transcriptome; then remove all (specific or shared)
peptides exclusively mapping on them (remove = “all”);
ii. remove proteins for which no transcript is expressed, if they are
not identified by any specific peptide; then remove all (shared)
peptides exclusively mapping on them (remove = “sharedOnly”);
iii. remove proteins for which no transcript is expressed, if they are
not identified by any specific peptide and if their peptides are shared
with at least one protein for which the corresponding transcript is
expressed (i.e. not filtered out); hence, no peptide needs to
be removed (remove = “sharedNoRemove”).
Read the incidence matrix describing peptides identified in a shotgun proteomic experiment and their corresponding proteins
incM_filename <- system.file("extdata"
, "incM_example"
, package = "net4pg"
, mustWork = TRUE)
rownames_filename <- system.file("extdata"
, "peptideIDs_incM_example"
, package = "net4pg"
, mustWork = TRUE)
colnames_filename <- system.file("extdata"
, "proteinIDs_incM_example"
, package = "net4pg"
, mustWork = TRUE)
incM <- read_inc_matrix(incM_filename = incM_filename
, colnames_filename = colnames_filename
, rownames_filename = rownames_filename)
Check the size of the input incidence matrix:
Perform transcriptome-informed post-hoc filtering using the
transcriptome_filter() function, which requires three inputs:
i. the name of the file containing the incidence matrix of
peptide-to-protein mappings
ii. the name of the file containing identifiers of transcripts detected
in the sample-matched transcriptome
iii. the name of the file containing for each protein identifier the
corresponding transcript identifier. 1
# Read input file names
exprTranscriptsFile <- system.file("extdata"
, "expressed_transcripts.txt"
, package = "net4pg"
, mustWork = TRUE)
protein2transcriptFile <- system.file("extdata"
, "protein_to_transcript"
, package = "net4pg"
, mustWork = TRUE)
# Perform filtering
incM_filt <- transcriptome_filter(incM
, exprTranscriptsFile = exprTranscriptsFile
, proteinToTranscriptFile = protein2transcriptFile
, tagContam = "Contam"
, remove = "sharedOnly")
# Check size after transcriptome-informed filtering
dim(incM_filt)
#> [1] 87 37
To measure the impact of transcriptome-informed filtering on
ambiguity of protein identifications:
3.3.1 Compare the proportion and size of multi-protein CCs obtained
before and after filtering
3.3.2 Compare the proportion of shared peptides obtained before and
after filtering
3.3.3 Visualize peptide-to-protein mappings for ambiguous protein
identifications of interest
Calculate the proportion and size of multi-protein CCs obtained after filtering and compare it to that obtained from the original proteomic identifications, prior to transcriptome-informed filtering.
Calculate CCs on the graph of proteomic identifications obtained after transcriptome-informed filtering.
# Reduce incidence matrix size to accelerate downstream computation
incM_filt_reduced <- reduce_inc_matrix(incM_filt)
# Calculate the adjacency matrix describing protein-to-protein connections
adjM_filt <- get_adj_matrix(incM_filt_reduced)
# Generate a graph of protein-to-protein connections by shared peptides and
# calculate its CCs (i.e., sets of proteins connected by shared peptides
multProteinCC_filt <- get_cc(adjM_filt)
# Extract the list of vectors enumerating protein members in each CC
cc.multProteins_filt <- multProteinCC_filt$ccs
# Calculate CCs size and % of single- vs multi-protein CCs obtained after
# transcriptome-informed filtering
CCstatsOut_filt <- cc_stats(incM = incM_filt
, cc.proteins = multProteinCC_filt$ccs
, reducedIncM = TRUE)
# Number of single-protein CCs
CCstatsOut_filt$N_singleProtCC
#> [1] 15
# Number of multi-protein CCs
CCstatsOut_filt$N_multiProtCC
#> [1] 8
# Total number of CCs
totCCs_filt <- CCstatsOut_filt$N_singleProtCC + CCstatsOut_filt$N_multiProtCC
totCCs_filt
#> [1] 23
# Percentage of single-protein CCs
PercSingleP_filt <- round(CCstatsOut_filt$N_singleProtCC / totCCs_filt * 100
, digits = 2)
# View table of CC size distribution
CCstatsOut_filt$NproteinsDistribution
#> N_proteins N_CCs
#> 1 1 15
#> 2 2 5
#> 3 3 0
#> 4 4 3
#> 5 5 0
#> 6 6 0
#> 7 7 0
#> 8 8 0
#> 9 9 0
#> 10 10 0
#> 11 >10 0
# Plot CC size distribution
plot(factor(CCstatsOut_filt$NproteinsDistribution$N_proteins
, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
, as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC))
, type = "s"
, xlab = "N_proteins"
, ylab = "N_CCs")
Compare the percentage of single-protein CCs obtained before and after transcriptome-informed filtering.
comp <- as.data.frame(cbind(as.character(as.vector(c("before_filter"
, "after_filter")))
, as.numeric(as.vector(c(PercSingleP
, PercSingleP_filt)))))
colnames(comp) <- c("Filter", "Perc_SingleP")
ggplot(data = comp
, aes(x = as.factor(Filter), y = as.numeric(as.vector(Perc_SingleP)))) +
geom_bar(stat = "identity") +
theme_classic() +
xlab("") +
ylab("% single-prot CCs") +
ylim(0, 100) +
coord_flip() +
geom_text(aes(label = as.numeric(as.vector(Perc_SingleP)))
, hjust = 1.5, color = "white", size = 4)
Plot CC size distribution before and after transcriptome-informed filtering.
old.par <- par(no.readonly = TRUE) # save default par values
ymax_before <- as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC))
ymax_after <- as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC))
ymax <- max(max(ymax_before), max(ymax_after))
par(mfrow = c(1, 2))
plot(factor(CCstatsOut$NproteinsDistribution$N_proteins
, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
, as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC))
, type = "s"
, xlab = "N_proteins"
, ylab = "N_CCs"
, ylim = c(0, ymax)
, main = "before filtering")
plot(factor(CCstatsOut_filt$NproteinsDistribution$N_proteins
, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10"))
, as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC))
, type = "s"
, xlab = "N_proteins"
, ylab = "N_CCs"
, ylim = c(0, ymax)
, main = "after filtering")
In the above example, transcriptome-informed filtering does not lead to an increased percentage of single-protein CCs (i.e., percentage of non ambiguous protein identifications); however, it yields smaller multi-protein CCs (i.e., fewer protein members per CC), which corresponds to reduced, although not solved, ambiguity of protein identifications.
To visually inspect the impact of transcriptome-informed filtering on any ambiguous protein identification of interest, plot the bipartite graph representing the CC which contains the protein of interest. Compare the CC before or after transcriptome-informed filtering.
# Extract peptides and peptide-to-protein mappings for each CC after filtering
cc.peptides.incM_filt <- cc_composition(cc.multProteins_filt
, incM = incM_filt)
# Generate a bipartite graph of the CC which contains the protein of interest,
# before and after transcriptome-informed filtering.
prot <- "ENSP261"
subgraphCC_beforeFilter <- plot_cc(prot = prot
, cc.proteins = cc.multProteins
, cc.subincM = cc.peptides.incM$cc.subincM
, incM = incM
, tagProt = "ENSP"
, tagContam = "Contam")
#> [1] "Protein ENSP261 is member of a multi-protein CC"
subgraphCC_afterFilter <- plot_cc(prot = prot
, cc.proteins = cc.multProteins_filt
, cc.subincM = cc.peptides.incM_filt$cc.subincM
, incM = incM_filt
, tagProt = "ENSP"
, tagContam = "Contam")
#> [1] "Protein ENSP261 is member of a multi-protein CC"
# Plot
old.par <- par(no.readonly = TRUE) # save default par values
par(mfrow = c(2, 1))
plot.igraph(subgraphCC_beforeFilter$g
, layout = layout_as_bipartite
, edge.width = 1
, edge.arrow.width = 0.3
, vertex.size = 35
, edge.arrow.size = 0.5
, vertex.size2 = 35
, vertex.label.cex = 1
, asp = 0.45
, margin = -0.1) +
title(paste0("Protein "
, prot
, " in CC #"
, subgraphCC_beforeFilter$cc_id
, " before filtering")
, line = -1)
#> integer(0)
plot.igraph(subgraphCC_afterFilter$g
, layout = layout_as_bipartite
, edge.width = 1
, edge.arrow.width = 0.3
, vertex.size = 35
, edge.arrow.size = 0.5
, vertex.size2 = 35
, vertex.label.cex = 1
, asp = 0.45
, margin = -0.1) +
title(paste0("Protein "
, prot, " in CC #"
, subgraphCC_beforeFilter$cc_id
, " after filtering")
, line = -1)
#> integer(0)
par(old.par) # restore default par values
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 igraph_2.1.1 net4pg_0.1.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-1 gtable_0.3.6 jsonlite_1.8.9
#> [4] compiler_4.4.2 BiocManager_1.30.25 jquerylib_0.1.4
#> [7] scales_1.3.0 yaml_2.3.10 fastmap_1.2.0
#> [10] lattice_0.22-6 R6_2.5.1 labeling_0.4.3
#> [13] generics_0.1.3 knitr_1.49 BiocGenerics_0.53.3
#> [16] graph_1.85.0 tibble_3.2.1 maketools_1.3.1
#> [19] munsell_0.5.1 bslib_0.8.0 pillar_1.9.0
#> [22] rlang_1.1.4 utf8_1.2.4 cachem_1.1.0
#> [25] xfun_0.49 sass_0.4.9 sys_3.4.3
#> [28] cli_3.6.3 withr_3.0.2 magrittr_2.0.3
#> [31] digest_0.6.37 grid_4.4.2 lifecycle_1.0.4
#> [34] vctrs_0.6.5 evaluate_1.0.1 glue_1.8.0
#> [37] data.table_1.16.2 farver_2.1.2 buildtools_1.0.0
#> [40] stats4_4.4.2 fansi_1.0.6 colorspace_2.1-1
#> [43] rmarkdown_2.29 tools_4.4.2 pkgconfig_2.0.3
#> [46] htmltools_0.5.8.1
It can be easily obtained from the Ensembl protein
sequence database file (e.g., Homo_sapiens.GRCh38.pep.all.fa
for the GRCh38 human genome assembly), publicly available on Ensembl.
The following bash command generates a tab-delimited file containing
protein identifier and corresponding transcript identifier:
cat Homo_sapiens.GRCh38.pep.all.fa | grep ">" | awk -F ' ' '{print $1,$5}' | sed 's/>//g' | sed 's/transcript://g' | sed 's/ /\t/' >> transcriptToProteinIDs_GRCh38.txt
↩︎