--- title: "An introduction to net4pg" author: - name: Laura Fancello affiliation: Univ. Grenoble Alpes, CNRS, CEA, Inserm, Profi FR2048, Grenoble, France email: laura.fancello@cea.fr - name: Thomas Burger affiliation: Univ. Grenoble Alpes, CNRS, CEA, Inserm, Profi FR2048, Grenoble, France email: thomas.burger@cea.fr # author: # - Laura Fancello; Univ. Grenoble Alpes, CNRS, CEA, Inserm, Profi FR2048, Grenoble, France; laura.fancello@cea.fr # - Thomas Burger; Univ. Grenoble Alpes, CNRS, CEA, Inserm, Profi FR2048, Grenoble, France; thomas.burger@cea.fr date: "`r Sys.Date()`" output: BiocStyle::html_document fig_width: 10 fig_height: 10 vignette: > %\VignetteIndexEntry{An introduction to net4pg} %\VignetteEngine{knitr::rmarkdown} %\VignetteEngine{knitr::knitr} \usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` Introduction =============================================================================== Ambiguity of protein identifications is an important issue in shotgun proteomics. It is due to the presence of shared peptides, which can generate quite complex peptide-to-protein mapping structures. These structures can be efficiently represented using bipartite graphs, with peptides and proteins as vertices and with edges featuring peptide to protein membership. Ambiguity in protein identifications can be assessed using graph connected components (CCs), *i.e.*, the largest subgraphs in which any two vertices are connected to each other by a path and not connected to any other of the vertices in the supergraph. Proteins sharing one or more peptides are gathered in the same CC (multi-protein CCs), while proteins exclusively identified by specific peptides represent CCs with a single protein vertex (single-protein CCs). The proportion of multi-protein CCs and their size (*i.e.*, number of protein members) can be used to measure the level of ambiguity of protein identifications. Nowadays, an increasing number of sample-matched proteomic and transcriptomic datasets is available, which can be exploited to deal with ambiguity of protein identifications. Indeed, according to the central dogma of biology, there can be no protein without the corresponding transcript. Hence, protein identifications for which the corresponding transcript is identified in the sample-matched transcriptome are more likely to be correct than protein identifications with no expressed transcript. This rationale can be used to filter proteomic identifications based on transcriptome information and reduce ambiguity of protein identifications. The net4pg package implements two main functionalities: **1. a pipeline to build, from proteomic identifications, a bipartite graph of peptide-to-protein mappings and calculate graph CCs to assess ambiguity of protein identifications**. CCs represent a peptide-centric strategy to group proteins and it is independent from the variety of protein-centric strategies of protein grouping and protein inference. As such, it does not require protein inference and it is widely applicable, reproducible and transparent. **2. a transcriptome-informed filtering strategy which exploits sample-matched transcriptome information to reduce ambiguity of protein identifications in shotgun proteomics**. This strategy fundamentally consists in the removal of proteins for which the corresponding transcript is not detected in the sample-matched transcriptome. The impact of the filtering strategy can be then assessed by building a new graph of filtered peptide-to-protein mappings, calculating the proportion of multi-protein CCs and visually inspecting peptide-to-protein mappings for ambiguous protein identifications of interest. This vignette illustrates how to use the above described methods implemented in this package for the following applications: * [Build a graph from shotgun proteomic identifications and calculate its connected components (CCs) to quantify ambiguity of protein identifications;](#partA) * [Perform a transcriptome-informed filtering of shotgun proteomic identifications to reduce ambiguity of protein identifications;](#partB) ```{r message = FALSE, include = FALSE} library(net4pg) library(igraph) library(ggplot2) ``` Build a graph from shotgun proteomic identifications and calculate its connected components (CCs) to quantify ambiguity of protein identifications {#partA} ================================================================================ The required input are proteomic identifications obtained by shotgun proteomics and, more precisely, all valid peptide identifications and their corresponding proteins, prior to any protein inference strategy. The required input must be in the form of a tab-delimited incidence matrix with peptides along the rows and proteins along the columns and 1 or 0 cell values to indicate whether or not a peptide maps on the corresponding protein. Protein identifiers (column names) are to be provided, one per line, in a separate file and should be in the Ensembl format (i.e. ENSPXXXXXXXXXXX for human, ENSMUSPXXXXXXXXXXX for mouse) or, if protein contaminants, in any format followed by a unique tag to indicate them as contaminants. Peptide identifiers (row names) are to be provided, one per line, in a separate file and can be in any format (amino acid sequence, numeric identifiers, ...). The pipeline to generate a bipartite graph of peptide-to-protein mapping and calculate their connected components consists in the following steps: 1. [Read in input the incidence matrix describing peptide-to-protein mappings;](#step1) 2. [Reduce the size of the incidence matrix](#step2); 3. [Calculate the adjacency matrix describing protein-to-protein connections via shared peptides;](#step3) 4. [Generate a graph of protein-to-protein connections and calculate its connected components (CCs);](#step4) 5. [Assess the ambiguity of protein identifications based on the proportion of multi-protein CCs and shared peptides;](#step5) 6. [Visualize peptide-to-protein mappings for ambiguous protein identifications of interest;](#step6) ## Read in input the incidence matrix describing peptide-to-protein mappings{#step1} Incidence matrices generated from proteomic datasets can be quite large but they can be easily read in input chunk by chunk using the read_inc_matrix() function. This function requires three inputs: * the name of the tab-delimited file containing the incidence matrix (peptide-to-protein mappings), with no column or row names; * the name of the file containing the matrix column names (protein identifiers), one per line; * the name of the file containing the matrix row names (peptide identifiers), one per line. Read the incidence matrix describing peptides identified in a shotgun proteomic experiment and their and corresponding proteins ```{r} incM_filename <- system.file("extdata" , "incM_example" , package = "net4pg" , mustWork = TRUE) rownames_filename <- system.file("extdata" , "peptideIDs_incM_example" , package = "net4pg" , mustWork = TRUE) colnames_filename <- system.file("extdata" , "proteinIDs_incM_example" , package = "net4pg" , mustWork = TRUE) incM <- read_inc_matrix(incM_filename = incM_filename , colnames_filename = colnames_filename , rownames_filename = rownames_filename) ``` Check the size of the input incidence matrix: ```{r} dim(incM) ``` ## Reduce the size of the incidence matrix{#step2} Reduce the size of the input incidence matrix by removing all proteins not sharing peptides and all peptides exclusively mapping to these proteins. Only ambiguous protein identifications are left, which is proteins connected by shared peptides, hence belonging to a multi-protein CC. Reducing data size allows to decrease the computational cost of calculating graph connected components. ```{r} incM_reduced <- reduce_inc_matrix(incM) dim(incM_reduced) # check the size of the reduced incidence matrix ``` ## Calculate the adjacency matrix describing protein-to-protein connections via shared peptides{#step3} The adjacency matrix is calculated by cross-product of the reduced incidence matrix and it constitutes a more compact representation than an incidence matrix; it scales linearly with the number of proteins and it is independent of the number of peptides, which is convenient to speed up connected components calculation. ```{r} adjM <- get_adj_matrix(incM_reduced) dim(adjM) # check the size of the adjacency matrix: ``` ## Generate a graph of protein-to-protein connections and calculate its connected components (CCs){#step4} Build a graph representing protein-to-protein connections by shared peptides, as described in the adjacency matrix, and calculate graph connected components, using the get_cc() function. The function provides in output a list of two elements: i. a graph of the protein connections (by shared peptides) described in the adjacency matrix; ii. a list of *n* vectors (one per connected component) enumerating protein members of each connected component ```{r} multProteinCC <- get_cc(adjM) ``` Extract the number of multi-protein CCs ```{r} cc.multProteins <- multProteinCC$ccs length(cc.multProteins) ``` ## Assess the ambiguity of protein identifications based on the proportion of multi-protein CCs and shared peptides{#step5} The proportion of single- or multi-protein CCs and the size of multi-protein CCs represent a measure for the level of ambiguity of protein identifications and they can be obtained by the cc_stats() function, provided with the above calculated CCs and the original incidence matrix. The above calculated CCs exclusively include multi-protein CCs and no single-protein CC, since they were calculated on the reduced incidence matrix from step 2, which only contains proteins sharing peptides. This is specified by the *reducedIncM* parameter in the cc_stats() function. Single-protein CCs are calculated by cc_stats() as all those proteins from the original incidence matrix which do not belong to multi-protein CCs. ```{r} # Calculate CCs size and percentage of single- vs multi-protein CCs CCstatsOut <- cc_stats(incM = incM , cc.proteins = cc.multProteins , reducedIncM = TRUE) # Number of single-protein CCs: CCstatsOut$N_singleProtCC # Number of multi-protein CCs CCstatsOut$N_multiProtCC # Total number of CCs totCCs <- CCstatsOut$N_singleProtCC + CCstatsOut$N_multiProtCC totCCs # Percentage of single-protein CCs: PercSingleP <- round(CCstatsOut$N_singleProtCC / totCCs * 100, digits = 2) PercSingleP # View table of CC size distribution CCstatsOut$NproteinsDistribution # Plot CC size distribution plot(factor(CCstatsOut$NproteinsDistribution$N_proteins , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10")) , as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC)) , type = "s" , xlab = "N_proteins" , ylab = "N_CCs") ``` The proportion of specific or shared peptides also represents a measure for the level of ambiguity of protein identifications. It can be calculated using the peptide_stats() function. ```{r} peptideStatsOut <- peptide_stats(incM = incM) # Number of shared peptides peptideStatsOut$nbShared # Number of specific peptides peptideStatsOut$nbSpecific # Percentage of specific peptides peptideStatsOut$percSpecific ``` ## Visualize peptide-to-protein mappings for ambiguous protein identifications of interest{#step6} Given an ambiguous protein identification of interest (a protein with shared peptides), find the multi-protein CC it belong to, identify all protein members of that CC and all their peptides and represent this information as a bipartite graph of peptide-to-protein mappings. First, extract all peptides and peptide-to-protein mapping information for each CC using the cc.peptides.incM() function. This function generates in output a list of two elements: i. a list of vectors enumerating peptides included in each CC; ii. a list of incidence matrices (or vectors if only one peptide) describing peptide-to-protein mappings. ```{r} cc.peptides.incM <- cc_composition(cc.multProteins, incM = incM) ``` Then plot a bipartite graph representing the CC which includes the protein of interest: ```{r, fig.height=7} # Generate the bipartite graph prot <- "ENSP261" subgraphCC <- plot_cc(prot = prot , cc.proteins = cc.multProteins , cc.subincM = cc.peptides.incM$cc.subincM , incM = incM , tagProt = "ENSP" , tagContam = "Contam") # Plot it plot.igraph(subgraphCC$g , layout = layout_as_bipartite , edge.width = 1 , edge.arrow.width = 0.3 , vertex.size = 35 , edge.arrow.size = 0.5 , vertex.size2 = 35 , vertex.label.cex = 1 , asp = 0.25 , margin = -0.1) + title(paste0("Protein ", prot, " in CC#", subgraphCC$cc_id), line = -1) ``` Perform a transcriptome-informed filtering of shotgun proteomic identifications to reduce ambiguity of protein identifications {#partB} =============================================================================== The required input is the same as in section 2: a tab-delimited incidence matrix describing all valid peptide identifications from shotgun proteomics and their corresponding proteins, prior to any protein inference strategy. Filtering can be performed in three different ways, established by the *remove* parameter of the transcriptome_filter() function: i. remove all proteins for which no transcript is expressed according to the sample-matched transcriptome; then remove all (specific or shared) peptides exclusively mapping on them (remove = "all"); ii. remove proteins for which no transcript is expressed, if they are not identified by any specific peptide; then remove all (shared) peptides exclusively mapping on them (remove = "sharedOnly"); iii. remove proteins for which no transcript is expressed, if they are not identified by any specific peptide and if their peptides are shared with at least one protein for which the corresponding transcript is expressed (*i.e.* not filtered out); hence, no peptide needs to be removed (remove = "sharedNoRemove"). ## Read in input the incidence matrix describing peptide-to-protein mappings Read the incidence matrix describing peptides identified in a shotgun proteomic experiment and their corresponding proteins ```{r} incM_filename <- system.file("extdata" , "incM_example" , package = "net4pg" , mustWork = TRUE) rownames_filename <- system.file("extdata" , "peptideIDs_incM_example" , package = "net4pg" , mustWork = TRUE) colnames_filename <- system.file("extdata" , "proteinIDs_incM_example" , package = "net4pg" , mustWork = TRUE) incM <- read_inc_matrix(incM_filename = incM_filename , colnames_filename = colnames_filename , rownames_filename = rownames_filename) ``` Check the size of the input incidence matrix: ```{r} dim(incM) ``` ## Perform transcriptome-informed filtering Perform transcriptome-informed post-hoc filtering using the transcriptome_filter() function, which requires three inputs: i. the name of the file containing the incidence matrix of peptide-to-protein mappings ii. the name of the file containing identifiers of transcripts detected in the sample-matched transcriptome iii. the name of the file containing for each protein identifier the corresponding transcript identifier. [^1] [^1]: It can be easily obtained from the Ensembl protein sequence database file (*e.g.*, Homo_sapiens.GRCh38.pep.all.fa for the GRCh38 human genome assembly), publicly available on Ensembl. The following bash command generates a tab-delimited file containing protein identifier and corresponding transcript identifier: `cat Homo_sapiens.GRCh38.pep.all.fa | grep ">" | awk -F ' ' '{print $1,$5}' | sed 's/>//g' | sed 's/transcript://g' | sed 's/ /\t/' >> transcriptToProteinIDs_GRCh38.txt` ```{r} # Read input file names exprTranscriptsFile <- system.file("extdata" , "expressed_transcripts.txt" , package = "net4pg" , mustWork = TRUE) protein2transcriptFile <- system.file("extdata" , "protein_to_transcript" , package = "net4pg" , mustWork = TRUE) # Perform filtering incM_filt <- transcriptome_filter(incM , exprTranscriptsFile = exprTranscriptsFile , proteinToTranscriptFile = protein2transcriptFile , tagContam = "Contam" , remove = "sharedOnly") # Check size after transcriptome-informed filtering dim(incM_filt) ``` ## Assess the impact of transcriptome-informed filtering on ambiguity of protein identifications To measure the impact of transcriptome-informed filtering on ambiguity of protein identifications: 3.3.1 Compare the proportion and size of multi-protein CCs obtained before and after filtering 3.3.2 Compare the proportion of shared peptides obtained before and after filtering 3.3.3 Visualize peptide-to-protein mappings for ambiguous protein identifications of interest ### Compare the proportion and size of multi-protein CCs obtained before and after filtering Calculate the proportion and size of multi-protein CCs obtained after filtering and compare it to that obtained from the original proteomic identifications, prior to transcriptome-informed filtering. Calculate CCs on the graph of proteomic identifications obtained after transcriptome-informed filtering. ```{r} # Reduce incidence matrix size to accelerate downstream computation incM_filt_reduced <- reduce_inc_matrix(incM_filt) # Calculate the adjacency matrix describing protein-to-protein connections adjM_filt <- get_adj_matrix(incM_filt_reduced) # Generate a graph of protein-to-protein connections by shared peptides and # calculate its CCs (i.e., sets of proteins connected by shared peptides multProteinCC_filt <- get_cc(adjM_filt) # Extract the list of vectors enumerating protein members in each CC cc.multProteins_filt <- multProteinCC_filt$ccs # Calculate CCs size and % of single- vs multi-protein CCs obtained after # transcriptome-informed filtering CCstatsOut_filt <- cc_stats(incM = incM_filt , cc.proteins = multProteinCC_filt$ccs , reducedIncM = TRUE) # Number of single-protein CCs CCstatsOut_filt$N_singleProtCC # Number of multi-protein CCs CCstatsOut_filt$N_multiProtCC # Total number of CCs totCCs_filt <- CCstatsOut_filt$N_singleProtCC + CCstatsOut_filt$N_multiProtCC totCCs_filt # Percentage of single-protein CCs PercSingleP_filt <- round(CCstatsOut_filt$N_singleProtCC / totCCs_filt * 100 , digits = 2) # View table of CC size distribution CCstatsOut_filt$NproteinsDistribution # Plot CC size distribution plot(factor(CCstatsOut_filt$NproteinsDistribution$N_proteins , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10")) , as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC)) , type = "s" , xlab = "N_proteins" , ylab = "N_CCs") ``` Compare the percentage of single-protein CCs obtained before and after transcriptome-informed filtering. ```{r} comp <- as.data.frame(cbind(as.character(as.vector(c("before_filter" , "after_filter"))) , as.numeric(as.vector(c(PercSingleP , PercSingleP_filt))))) colnames(comp) <- c("Filter", "Perc_SingleP") ggplot(data = comp , aes(x = as.factor(Filter), y = as.numeric(as.vector(Perc_SingleP)))) + geom_bar(stat = "identity") + theme_classic() + xlab("") + ylab("% single-prot CCs") + ylim(0, 100) + coord_flip() + geom_text(aes(label = as.numeric(as.vector(Perc_SingleP))) , hjust = 1.5, color = "white", size = 4) ``` Plot CC size distribution before and after transcriptome-informed filtering. ```{r} old.par <- par(no.readonly = TRUE) # save default par values ymax_before <- as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC)) ymax_after <- as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC)) ymax <- max(max(ymax_before), max(ymax_after)) par(mfrow = c(1, 2)) plot(factor(CCstatsOut$NproteinsDistribution$N_proteins , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10")) , as.numeric(as.vector(CCstatsOut$NproteinsDistribution$N_CC)) , type = "s" , xlab = "N_proteins" , ylab = "N_CCs" , ylim = c(0, ymax) , main = "before filtering") plot(factor(CCstatsOut_filt$NproteinsDistribution$N_proteins , levels = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", ">10")) , as.numeric(as.vector(CCstatsOut_filt$NproteinsDistribution$N_CC)) , type = "s" , xlab = "N_proteins" , ylab = "N_CCs" , ylim = c(0, ymax) , main = "after filtering") par(old.par) # restore default par values ``` In the above example, transcriptome-informed filtering does not lead to an increased percentage of single-protein CCs (i.e., percentage of non ambiguous protein identifications); however, it yields smaller multi-protein CCs (i.e., fewer protein members per CC), which corresponds to reduced, although not solved, ambiguity of protein identifications. ### Compare the proportion of shared peptides obtained before and after filtering Calculate the proportion of shared peptides obtained after transcriptome-informed filtering ```{r} peptideStatsOut_filt <- peptide_stats(incM = incM_filt) ``` Compare the proportion of shared peptides obtained before and after transcriptome-informed filtering ```{r} comp <- as.data.frame(cbind( as.character(as.vector(c("before_filter" , "after_filter"))) , as.numeric(as.vector(c(peptideStatsOut$nbShared , peptideStatsOut_filt$nbShared))))) colnames(comp) <- c("Filter", "Perc_sharedPeptides") ggplot(data = comp , aes(x = as.factor(Filter) , y = as.numeric(as.vector(Perc_sharedPeptides)))) + geom_bar(stat = "identity") + theme_classic() + xlab("") + ylab("% shared peptides") + ylim(0, 100) + coord_flip() + geom_text(aes(label = as.numeric(as.vector(Perc_sharedPeptides))) , hjust = 1.5, color = "white", size = 4) ``` ### Visualize peptide-to-protein mappings for ambiguous protein identifications of interest To visually inspect the impact of transcriptome-informed filtering on any ambiguous protein identification of interest, plot the bipartite graph representing the CC which contains the protein of interest. Compare the CC before or after transcriptome-informed filtering. ```{r, fig.height=14} # Extract peptides and peptide-to-protein mappings for each CC after filtering cc.peptides.incM_filt <- cc_composition(cc.multProteins_filt , incM = incM_filt) # Generate a bipartite graph of the CC which contains the protein of interest, # before and after transcriptome-informed filtering. prot <- "ENSP261" subgraphCC_beforeFilter <- plot_cc(prot = prot , cc.proteins = cc.multProteins , cc.subincM = cc.peptides.incM$cc.subincM , incM = incM , tagProt = "ENSP" , tagContam = "Contam") subgraphCC_afterFilter <- plot_cc(prot = prot , cc.proteins = cc.multProteins_filt , cc.subincM = cc.peptides.incM_filt$cc.subincM , incM = incM_filt , tagProt = "ENSP" , tagContam = "Contam") # Plot old.par <- par(no.readonly = TRUE) # save default par values par(mfrow = c(2, 1)) plot.igraph(subgraphCC_beforeFilter$g , layout = layout_as_bipartite , edge.width = 1 , edge.arrow.width = 0.3 , vertex.size = 35 , edge.arrow.size = 0.5 , vertex.size2 = 35 , vertex.label.cex = 1 , asp = 0.45 , margin = -0.1) + title(paste0("Protein " , prot , " in CC #" , subgraphCC_beforeFilter$cc_id , " before filtering") , line = -1) plot.igraph(subgraphCC_afterFilter$g , layout = layout_as_bipartite , edge.width = 1 , edge.arrow.width = 0.3 , vertex.size = 35 , edge.arrow.size = 0.5 , vertex.size2 = 35 , vertex.label.cex = 1 , asp = 0.45 , margin = -0.1) + title(paste0("Protein " , prot, " in CC #" , subgraphCC_beforeFilter$cc_id , " after filtering") , line = -1) par(old.par) # restore default par values ``` ```{r} sessionInfo() ```