Knowledge-Driven NGS Analysis

Human biological pathway unification

PathCards is an integrated database of human biological pathways and their annotations. Human pathways were clustered into SuperPaths based on gene content similarity. Each PathCard provides information on one SuperPath which represents one or more human pathways. It includes 1,131 SuperPath entries, consolidated from 12 sources.

Publication Details

Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M. and Lancet, D. PathCards: multi-source consolidation of human biological pathways. Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006.

http://pathcards.genecards.org/

PathCards: multi-source consolidation of human biological pathways

Author Affiliations

Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel
*Corresponding author: Tel: +972-89343188; Fax: +972-89344487; Email: Frida.Belinky@weizmann.ac.il
• Revision received January 13, 2015.
• Accepted January 14, 2015.

Introduction

The systematic analysis of biological pathways has ever-increasing significance in an age of growing systems analyses and omics data. Mapping genes onto pathways may contribute to a better understanding of biological and biomedical mechanisms. The literature provides a large collection of pathway definition sources (1). Pathway knowledge bases represent the careful collection of genes and their interactions, mapped onto biological processes. These repositories, which include both academic and commercial resources (Figure 1A), provide lists of pathways and their cellular components, each with an idiosyncratic view of the pathway universe.

Figure 1.

The gene-content network of pathway sources. Eighteen sources are shown, 12 of which (colored) are included in SuperPaths generation. Edge widths are proportional to the pairwise Jaccard similarity coefficient computed for the gene contents of the entire source. The sources, depicted in GeneCards Version 3.12, are: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). White circles correspond to sources not included in the SuperPath generation process: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).

Indeed, the definition of the boundaries of biological pathways differs among sources, as exemplified by the highly studied processes of fatty acid metabolism (2) or the TCA cycle (the tricarboxylic acid cycle) (3). Further, the same pathway name may have widely dissimilar gene content in different sources (4). At present, there is no definitive analysis of pathway similarities, either between or within sources. Thus the multitude of pathway resources can often be confusing when portraying gene-pathway affiliations.

Previous attempts to unify pathways from several sources include NCBI’s Biosystems (5), PathwayCommons (6), PathJam (7), HPD (8), ConsensusPathDB (9), hiPathDB (10) and Pathway Distiller (11). However, none of these efforts entails a standardized method to unify numerous sources into a consolidated global repository.

Here, we describe an approach aimed at generating an integrated view across multiple pathway sources. We applied a combination of nearest neighbor graph and hierarchical clustering, utilizing a gene-content metric, to generate a manageable set of 1073 unified pathways (SuperPaths). These optimally encompass all of the information contained in the individual sources, striving to minimize pathway redundancy while maximizing gene-related pathway informativeness. The resultant SuperPaths are integrated into GeneCards (12), enabling clear portrayal of a gene’s set of unified pathways. Finally, these SuperPaths, together with diverse related biological data, are provided in PathCards—a new pathway-centric online database, enabling quick in-depth analysis of each human SuperPath.

Materials and methods

Pathway mining and comparison

Pathway gene sets were generated based on the GeneCards platform (12), implementing the gene symbolization process that allows comparison of pathway gene sets from 12 different manually curated sources: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). A binary matrix was generated for all 3215 pathways, where each column represents a gene, indicated by 1 for presence in the pathway and 0 for absence. Additionally, six sources were analysed for their cumulative tallying of gene content: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).
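
The binary matrix described above can be sketched in a few lines. This is only an illustration: the function name `build_membership_matrix` and the toy gene sets are assumptions, not part of the GeneCards pipeline.

```python
def build_membership_matrix(pathways):
    """Return (genes, rows): rows[name] is a 0/1 list aligned with genes."""
    # Columns are all genes seen in any pathway, in a fixed (sorted) order.
    genes = sorted({g for members in pathways.values() for g in members})
    rows = {
        name: [1 if g in members else 0 for g in genes]
        for name, members in pathways.items()
    }
    return genes, rows


# Toy input (hypothetical pathway memberships, for illustration only).
toy_pathways = {
    "Pathway A": {"TP53", "MDM2", "CDKN1A"},
    "Pathway B": {"TP53", "ATM"},
}
genes, matrix = build_membership_matrix(toy_pathways)
```

Each row is one pathway and each column one gene, so any set-based similarity can be computed directly from pairs of rows.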

Pathway similarity assessment

In the analyses performed, we utilized gene content overlap to estimate pathway similarity. This was done based on the Jaccard coefficient, which measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sets. To examine the legitimacy of this method, we performed a comparison to an alternative methodology, embodied in MetaPathwayHunter pathway comparison, that incorporates topology in pairwise pathway alignment (25). For this analysis, we used a set of 151 yeast pathways available in MetaPathwayHunter, and computed Jaccard similarity coefficients (J) for all 11 325 pathway pairs. We then selected a sample of 30 pairs containing 28 unique pathways out of a total of 87 pairs with J ≥ 0.3, ensuring maximal representation for larger pathways. Each of the 28 pathways was queried in MetaPathwayHunter against the entire set of 151 with default parameters (a total of 4228 comparisons). We found that 29 out of the 30 sample pathway pairs obtained a significant MetaPathwayHunter alignment (P ≤ 0.01). As only 64 of the 4228 comparisons showed such a P-value, the probability of obtaining this result at random is 1.6 × 10^−53 (Supplementary Table S1). Thus, Jaccard scores appear to be excellent predictors of the results of the more elaborate method. A full account of interpathway pairwise similarity is available upon request.
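
As a minimal sketch of the similarity measure, the Jaccard coefficient and the all-pairs comparison could look as follows. The helper names are illustrative; the two counts at the end simply re-derive the figures quoted in the text.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def pairwise_jaccard(pathways):
    """Jaccard coefficient for every unordered pair of pathways."""
    return {
        (p, q): jaccard(pathways[p], pathways[q])
        for p, q in combinations(sorted(pathways), 2)
    }


# Counts quoted in the text: all pairs among 151 pathways, and
# 28 query pathways each compared against all 151.
n_pairs = 151 * 150 // 2
n_queries = 28 * 151
```

Here `n_pairs` evaluates to 11 325 and `n_queries` to 4228, matching the numbers reported above.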

Clustering algorithm

For the main pathway clustering algorithm, we applied a method described elsewhere (26), which includes the following steps: i) The generation of cluster cores by joining all pathway pairs with Jaccard coefficient ≥T2, the upper cutoff, equivalent to hierarchical clustering. ii) Performing cluster extension by generating new best edges, i.e. joining every pathway to a pathway showing the highest score, as long as it is ≥T1, the lower cutoff, akin to nearest neighbor joining. If two or more target pathways have the same best score, all are joined. Each resultant connected component is defined to be a pathway cluster (SuperPath). Identical pathway sets were joined without considering each other as nearest neighbors (i.e. the best scoring non-identical pathway gene-set is chosen as the nearest neighbor). This clustering algorithm is order independent.
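
The two steps above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: `cluster_pathways` and the union-find helpers are hypothetical names, the toy gene sets are invented, and all gene sets are assumed to be non-empty.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pathways(pathways, t1=0.3, t2=0.7):
    """Cluster cores (J >= t2) plus best-edge extension (best J >= t1)."""
    names = sorted(pathways)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[max(rx, ry)] = min(rx, ry)

    scores = {n: {} for n in names}
    for p, q in combinations(names, 2):
        j = jaccard(pathways[p], pathways[q])
        scores[p][q] = scores[q][p] = j
        if j >= t2:  # step i: cores; identical sets (J = 1) join here
            union(p, q)

    for p in names:  # step ii: join each pathway to its best neighbour(s)
        # Identical gene sets are not considered nearest neighbours.
        candidates = {q: j for q, j in scores[p].items()
                      if pathways[q] != pathways[p]}
        if not candidates:
            continue
        best = max(candidates.values())
        if best >= t1:
            for q, j in candidates.items():
                if j == best:  # ties: join all equally good targets
                    union(p, q)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())


toy_pathways = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g3", "g4", "g5"},
    "C": {"g4", "g5", "g6", "g7", "g8"},
    "D": {"g6", "g7", "g8", "g9"},
    "E": {"g20", "g21"},
}
superpaths = cluster_pathways(toy_pathways)
# -> [['A', 'B'], ['C', 'D'], ['E']]
```

In the toy data, A and B form a core (J = 0.8), C and D are joined only by the best-edge step (J = 0.5), and E stays a singleton. Because the result is a set of connected components, the order in which pathways are visited does not affect it.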

Determination of cutoffs

Uniqueness of a SuperPath, Us, is defined as $\log_{10}\left(\frac{\sum 1/N_p}{N_g}\right)$, where $N_p$ is the number of pathways that include a certain gene, averaging for each pathway over all genes in the SuperPath (divided by the number of genes $N_g$). Uniqueness of genes, Is, is symmetrically defined per SuperPath as $\log_{10}\left(\frac{\sum 1/N_g}{N_p}\right)$, where each $N_g$ is the number of genes included in the relevant pathway, averaging for each gene over all SuperPaths including a gene. To find the best tradeoff between the two scores, we summed the average Us and Is for each set of T1 and T2 cutoff parameters. Thus Us + Is was calculated for each set of parameters to find the two parameters for which the tradeoff between pathway and gene uniqueness would be optimal. Maximizing Us + Is gave the best cutoffs as T1 = 0.3 and T2 ≥ 0.5. Further fine tuning of the upper cutoff was performed by resampling of the data, a technique employed by Levin and Domany (27). We used two dilutions (0.75 and 0.9), i.e. randomly sampling 75% and 90% of the pathways (resampling 100 times for each dilution) and performing the clustering algorithm on each sample, each time calculating the percent of the edges present in the original clustering, i.e. the percent of cases in which two pathways belonged to the same cluster as in the full dataset. In both dilutions, the upper cutoff of 0.7 was found to recover a higher percent of the edges in the original clustering algorithm (Figure 4C).
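
The resampling check can be illustrated with a small helper that measures how much of the full clustering survives in a diluted sample. The exact bookkeeping is not given in the text, so the reading below is an assumption: an "edge" is treated as a pair of sampled pathways that share a cluster in the full dataset. All function names are hypothetical.

```python
import random
from itertools import combinations


def co_clustered_pairs(clusters):
    """All unordered pathway pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def recovered_fraction(full_clusters, sample_clusters, sampled):
    """Fraction of full-data co-clustered pairs (among the sampled
    pathways) that are again co-clustered in the diluted sample."""
    sampled = set(sampled)
    reference = {
        (p, q) for p, q in co_clustered_pairs(full_clusters)
        if p in sampled and q in sampled
    }
    if not reference:
        return 1.0  # nothing to recover
    return len(reference & co_clustered_pairs(sample_clusters)) / len(reference)


def dilute(pathway_names, fraction, rng):
    """Randomly keep the given fraction of pathways (e.g. 0.75 or 0.9)."""
    names = sorted(pathway_names)
    return rng.sample(names, round(fraction * len(names)))


full = [["A", "B", "C"], ["D"]]
diluted = [["A", "B"], ["C"]]
fraction = recovered_fraction(full, diluted, ["A", "B", "C"])
```

In this toy case one of the three reference pairs is recovered, so `fraction` is one third. Repeating the dilution 100 times per level and averaging the fraction for each candidate T2 would mirror the procedure described above.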

Name similarity calculation and concordance with gene similarity

Name similarity was calculated as the Jaccard coefficients of the shared words in the two pathway names, after omitting trivial words and using stemming to identify words with the same root. The cutoff between similar and non-similar names (as well as gene content in regard to comparison with name similarity) was set to J = 0.5. Name similarity was compared with gene content similarity to find the level of concordance between the two.
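
A rough sketch of the name comparison is given below. The list of trivial words and the crude suffix-stripping stemmer are assumptions standing in for whatever word list and stemming method were actually used.

```python
import re

# Hypothetical list of trivial words; the actual list is not given.
TRIVIAL_WORDS = {"of", "the", "and", "in", "by", "to", "pathway", "pathways"}


def crude_stem(word):
    """Very rough stemmer (assumption): strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def name_tokens(name):
    words = re.findall(r"[a-z0-9]+", name.lower())
    return {crude_stem(w) for w in words if w not in TRIVIAL_WORDS}


def name_similarity(name_a, name_b):
    """Jaccard coefficient over the stemmed, non-trivial words."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


example = name_similarity("Meiosis", "Oocyte meiosis")  # 0.5
```

With these choices, "Meiosis" and "Oocyte meiosis" share one of two stemmed words, giving a name similarity of 0.5.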

Shared publications and PPI data

Publication and Protein-Protein Interactions (PPI) data for each gene were obtained from the GeneCards database, including several combined sources. Publication sources of GeneCards include manually curated publications (e.g. UniProtKB/Swiss-Prot) as well as text mining approaches that report connections between a gene and a list of publications. A shared publication between two genes is an association of both genes to the same publication and does not indicate a direct interaction between the genes. PPI scores between pairs of genes are also based on several interaction sources in GeneCards. Unlike shared publications, PPIs reflect direct interactions between the two gene products.

Randomization and comparison

A randomized set of pseudo-SuperPaths was generated, such that the pseudo-SuperPaths are the same size and quantity as the SuperPaths, albeit with genes assigned at random (from the list of genes with any pathway annotation). Gene pairs that belong to at least one SuperPath, but do not belong together in any individual pathway (the test set), were analysed for the number of shared publications and PPI scores for each pair. In comparison, gene pairs that belong to at least one pseudo-SuperPath, but do not belong together in any individual pathway (the control set), were analysed for the same attributes. To compare the two sets, which are of different sizes, a random sample of the larger set (the control set) of the same size as the smaller set (the test set) was compared with the smaller set. A one-sided Kolmogorov–Smirnov test was performed to compare the test and control sets.
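
The randomization step might be sketched as below. The function names and toy data are hypothetical, and the gene universe is taken to be the union of all SuperPath genes, which is an assumption about what "genes with any pathway annotation" covers.

```python
import random


def make_pseudo_superpaths(superpaths, rng):
    """Same number of sets and same sizes, genes drawn at random."""
    universe = sorted({g for genes in superpaths.values() for g in genes})
    return {
        f"pseudo_{i}": set(rng.sample(universe, len(superpaths[name])))
        for i, name in enumerate(sorted(superpaths))
    }


def size_matched_sample(larger_set, smaller_size, rng):
    """Random sample of the larger set, matched to the smaller set's size."""
    return rng.sample(sorted(larger_set), smaller_size)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
toy_superpaths = {"SP1": {"A", "B", "C"}, "SP2": {"C", "D"}}
pseudo = make_pseudo_superpaths(toy_superpaths, rng)
```

The test and control distributions (shared publications or PPI scores per gene pair) would then be compared with a one-sided two-sample Kolmogorov–Smirnov test, which is left out here to keep the sketch free of external dependencies.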

Gene enrichment analysis comparison

Differentially expressed sets of genes were obtained from the GeneCards database (12) containing 830 different embryonic tissues based on manual curation (28). For the comparison of SuperPaths and their pathway constituents, 89 SuperPaths that contained exactly two pathways with Jaccard similarity coefficient <0.6 were chosen, a value selected to include pairs of relatively dissimilar pathways in order to enhance comparative power. Two gene set enrichment analyses were run for all 830 gene sets: one with SuperPaths and the other with their constituent pathways. Whenever both the SuperPath and the constituent pathways received a statistical enrichment score, the difference between negative log P-values was computed.
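
The text does not state which enrichment statistic was used. As a hedged illustration only, the sketch below assumes a hypergeometric upper-tail test and compares a SuperPath score with the mean score of its constituent pathways; both function names are hypothetical.

```python
from math import comb, log10


def hypergeom_pvalue(overlap, set_size, query_size, universe):
    """P(X >= overlap) when query_size genes are drawn from a universe
    containing set_size pathway genes (assumed enrichment test)."""
    total = comb(universe, query_size)
    upper = min(set_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def score_gain(p_superpath, p_constituents):
    """Difference between the SuperPath -log10(P) and the mean
    -log10(P) of its constituent pathways."""
    mean_constituent = sum(-log10(p) for p in p_constituents) / len(p_constituents)
    return -log10(p_superpath) - mean_constituent


p_all_five = hypergeom_pvalue(5, 5, 5, 10)   # 1 / 252
gain = score_gain(1e-6, [1e-3, 1e-5])        # 6 - 4 = 2
```

A positive gain means the SuperPath received a more significant score than its constituents did on average.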

GeneCards and PathCards

SuperPaths have been implemented in GeneCards and are now included in the standard procedure of GeneCards generation. PathCards is an online compendium of human pathways, based on the GeneCards database, presenting SuperPath-related data in each page.

Results

Pathway sources

We analysed 12 pathway sources included in GeneCards (http://www.genecards.org/) (12) with a total of 3215 biological pathways (Table 1 and Figure 1A). The total number of genes covered by these sources is 11 478, nearly twice as large as the gene count in the largest source (Figure 1B), suggesting the power of analysing multiple sources. Asymptotic behavior is observed in the change of total gene count with increasing number of sources. When considering the incorporation of six additional sources (Supplementary Figure S1), we found that the gene count increment is ∼2% of the currently analysed total. This is an indication that the chosen 12 sources provide adequate coverage of human gene-pathway mappings. Switching between the six non-included sources and six included sources of similar size gives a very similar graph, with merely 4% increment in gene count (Supplementary Figure S1).

Analysing the gene repertoires of the four largest sources (Figure 2A), we found that among the 10 770 genes contained within these sources, only 1413 genes were jointly covered by all four sources, and that more than 4000 were unique to one of the four sources. This highlights the notion that source unification is essential to obtain maximal gene coverage. In its simplest embodiment, source unification would entail presenting a unified list of the 3215 pathways included in all 12 sources. This however would ignore the extensive gene-content connectivity embodied in the network representation of this pathway collection (Figure 3A). Further, the original pathway collection has considerable inconsistencies of relations between pathway name and pathway gene content, as exemplified in Figure 2B and C. The summary in Table 2A suggests that only ∼9.4% of all pathway pairs with a similar name have similar gene content, and likewise, only 9.8% of all pathway pairs with similar gene content are named similarly (Supplementary Figure S2).

Figure 2.

Discrepancies between pathway sources. (A) Incomplete gene overlap among sources. Venn diagram (created using VENNY, http://bioinfogp.cnb.csic.es/tools/venny/) showing the number of shared genes among the four largest pathway sources. For a total of 10 770 genes, only 1413 (13%) are shared by all four sources and 609–1791 genes are unique to each of these sources. (B) Inconsistency of names versus content in meiosis-related pathways. A Venn diagram created using BioVenn (29) exemplifies two pathways, ‘Meiosis’ from Reactome and ‘Oocyte meiosis’ from KEGG, with very small gene sharing (7 genes out of 172, J = 0.04). (C) Redundancy in meiosis-related pathways. This is exemplified by the large number of genes (88 of 119, J = 0.74) shared by ‘Meiosis’ and ‘Meiotic recombination’ pathways, both from Reactome, and by the large number of genes (52 of 146, J = 0.36) shared by ‘Oocyte meiosis’ and ‘Progesterone-mediated oocyte maturation’, both from KEGG. (D) Pathway size distribution across sources. The pathway size, in gene count, is distributed differently across the different sources.

Figure 3.

Network representations of the 3215 analyzed pathways. Nodes represent pathways and edges represent Jaccard similarity coefficients (J) using different methods. Network visualizations were performed using Gephi (30). Colors correspond to pathway sources. (A) No clustering. All edges with J ≥ 0.05 are shown. All but 20 pathways form one large connected component with an average degree of 134. (B) SuperPaths. Each is a connected component obtained by the main clustering algorithm, with thresholds T1 (best edges) of J ≥ 0.3 and T2 of J ≥ 0.7. There are 544 singletons and 529 multi-pathway clusters; the size of the largest cluster is 70. (C) Pure hierarchical clustering, with threshold T2 of J ≥ 0.3. There are 544 singletons and 288 multi-pathway clusters; the size of the largest cluster is 1046 pathways.

Figure 4.

Selection of the T1 and T2 thresholds. (A) Distribution of Jaccard coefficients across all pathway pairs. T1 and T2 respectively represent the lower and upper cutoffs used in the algorithm employed. (B) Us + Is scores across combinations of T1 and T2. The diagonal (T1 = T2) represents pure hierarchical clustering with different thresholds. The best scores are attained when T1 = 0.3 and T2 ≥ 0.5. (C) Determination of T2. T2 (upper cutoff) was determined by resampling of the pathway data at two dilution levels (27), 0.75 and 0.9. In both cases J = 0.7 was found to be the optimum in which a higher fraction of the original clustering is recovered.

Table 2.

Gene content versus name similarity of pathways and SuperPaths

Pathway clustering

We performed global pathway analysis aimed at assigning maximally informative pathway-related annotation to every human gene. For this, we converted the pathway compendium into a set of connected components (SuperPaths), each being a limited-size cluster of pathways. We aimed at controlling the size of the resulting SuperPaths, so as to maintain a high measure of annotation specificity and minimize redundancy.

The following two steps were used in the clustering procedure, in which pathways were connected to each other to form SuperPaths. i) Preprocessing of very small pathways: pathways smaller than 20 genes were connected to larger pathways (<200 genes) with a content similarity metric of ≥0.9 relative to the smaller partner. ii) The main pathway clustering algorithm: this was performed using the Jaccard similarity coefficient (J) metric (31) (see Materials and Methods). We used a combination (cf. 26) of modified nearest neighbor graph generation with a threshold T1 and hierarchical clustering with a threshold T2 (Figure 4A and Materials and Methods).
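
The preprocessing step (i) relies on a similarity measured relative to the smaller pathway, which the Jaccard coefficient alone would miss when sizes differ greatly. A minimal sketch follows, with hypothetical function names and with the reading that a "larger" partner is any pathway bigger than the small one and below 200 genes.

```python
def containment(small, large):
    """Fraction of the smaller pathway's genes present in the larger one."""
    return len(small & large) / len(small) if small else 0.0


def preprocess_small_pathways(pathways, small_max=20, large_max=200, cutoff=0.9):
    """Return (small, large) links for near-complete inclusion."""
    links = []
    for s_name in sorted(pathways):
        s_genes = pathways[s_name]
        if len(s_genes) >= small_max:
            continue  # only very small pathways are preprocessed
        for l_name in sorted(pathways):
            l_genes = pathways[l_name]
            if l_name == s_name:
                continue
            if not (len(s_genes) < len(l_genes) < large_max):
                continue
            if containment(s_genes, l_genes) >= cutoff:
                links.append((s_name, l_name))
    return links


toy_pathways = {
    "small": {"g0", "g1", "g2"},
    "mid": {f"g{i}" for i in range(30)},
    "huge": {f"g{i}" for i in range(250)},
}
links = preprocess_small_pathways(toy_pathways)
# -> [('small', 'mid')]
```

In the toy data the 3-gene pathway is fully contained in the 30-gene pathway, yet their Jaccard coefficient is only 0.1, so the main clustering step alone would not join them. The 250-gene pathway is excluded by the size limit.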

To determine the optimal values of the thresholds T1 and T2, we defined two quantitative attributes of the clustering process. The first is US, the overall uniqueness of the set of SuperPaths. US elevation is the result of increasing pathway clustering, and reflects the gradual disappearance of redundancy, i.e. of cases in which certain gene sets are portrayed in multiple SuperPaths. The second parameter is IS, the overall informativeness of the set of SuperPaths. IS is a measure of how revealing a collection of SuperPaths is for annotating individual genes. It decreases with the extent of pathway clustering, reaching an undesirable minimum of one exceedingly large cluster, whereby identical SuperPath annotation is obtained for all genes. We thus sought an optimal degree of clustering whereby US + IS is maximized (Figure 4B and Materials and Methods).

Our procedure pointed to an optimum at T1 = 0.3 and T2 ≥ 0.5. Further fine tuning by data resampling suggested an optimal value of T2 = 0.7 (Figure 4C and Materials and Methods). This procedure resulted in the definition of 1073 SuperPaths, including 529 SuperPaths ranging in size from 2 to 70 pathways, and 544 singletons (one pathway per SuperPath) (Figures 3B and 5A). Each SuperPath had 3 ± 4.3 pathways (Figure 5A) and 82.7 ± 140.6 genes (Supplementary Figure S3A). The resultant set of SuperPaths indeed enhances the uniqueness US as depicted in Figure 5B.

Figure 5.

SuperPaths increase uniqueness while keeping high informativeness. (A) Number of pathways in hierarchical clustering versus SuperPath algorithm. The largest cluster with hierarchical clustering includes 1046 pathways, about 33% of the entire input, causing a great reduction of informativeness. In the SuperPath clustering the maximum cluster size is 70, about 2% of all pathways. (B) Increase in uniqueness (Us) following unification of pathways into SuperPaths.

The unification process resulted in relatively small changes in gene count distribution between the original pathways and the resultant SuperPaths (Supplementary Figure S3), suggesting a substantial preservation of gene groupings. Notably, applying pure hierarchical clustering (T1 = T2 = 0.3) resulted in a single very large cluster with 1046 pathways (Figure 3C) and with the same number of singletons, strongly deviating from the goal of specific pathway annotation for genes (Supplementary Figure S3B). This sub-optimal performance of pure hierarchical clustering is general; any of the examined cases of T1 = T2 (Figure 4B diagonal) shows a Us + Is value lower than that for T1 = 0.3, T2 = 0.7.

Each SuperPath is identified by a textual name derived from one of its constituent pathways selected as the most connected pathway (hub) in the SuperPath cluster. For simplicity, the option of de novo naming was not exercised. Selecting the hub’s name, as opposed to that of the largest pathway, was chosen since this tends to enhance the descriptive value for the entire SuperPath. When more than one pathway has the same maximal number of connections, the larger one is chosen.
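
The naming rule can be expressed compactly. In this sketch, `edges` is assumed to be the list of pathway pairs joined during clustering and `sizes` the gene count of each pathway; both names, and the function name, are illustrative.

```python
def superpath_name(members, edges, sizes):
    """Pick the hub: most connections within the cluster, ties broken
    by the larger gene count."""
    degree = {m: 0 for m in members}
    for p, q in edges:
        if p in degree and q in degree:
            degree[p] += 1
            degree[q] += 1
    return max(sorted(members), key=lambda m: (degree[m], sizes[m]))


hub = superpath_name(
    ["A", "B", "C"],
    [("A", "B"), ("A", "C")],
    {"A": 10, "B": 50, "C": 20},
)
# -> 'A' (two connections, although B is the largest pathway)
```

When two pathways have the same number of connections, for example a two-member cluster with a single edge, the comparison falls through to the gene count and the larger pathway provides the name.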

SuperPaths make important gene connections

One of the major implications of the process of SuperPath generation is elucidating new connections among genes. This happens because genes that were not connected via any pre-unification pathway become connected through belonging to the same SuperPath. The unification into SuperPaths is important in two ways: first, it brings, under one roof, pathway information from 12 sources, each individually contributing ∼9000 to ∼5 million instances of gene pairing, for a total of 7.3 million pairs (Supplementary Figure S4). Second, by unifying into SuperPaths, the number of gene pairs is further enhanced, reaching 8.3 million (Supplementary Figure S4).
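
The notion of gene pairs that exist only at the SuperPath level can be made concrete with a short sketch. The function names and toy data are hypothetical; a SuperPath's gene set is taken to be the union of its constituent pathways' genes.

```python
from itertools import combinations


def gene_pairs(gene_sets):
    """All unordered gene pairs that co-occur in at least one set."""
    pairs = set()
    for genes in gene_sets:
        pairs.update(combinations(sorted(genes), 2))
    return pairs


def superpath_specific_pairs(superpath_members, pathways):
    """Pairs connected by a SuperPath but by none of the pathways."""
    pathway_pairs = gene_pairs(pathways.values())
    superpath_gene_sets = [
        set().union(*(pathways[p] for p in members))
        for members in superpath_members.values()
    ]
    return gene_pairs(superpath_gene_sets) - pathway_pairs


toy_pathways = {"P1": {"A", "B"}, "P2": {"B", "C"}}
toy_superpaths = {"S1": ["P1", "P2"]}
new_pairs = superpath_specific_pairs(toy_superpaths, toy_pathways)
# -> {('A', 'C')}
```

Genes A and C never share a pathway in the toy data, but both fall into SuperPath S1, so the pair (A, C) is a new connection of the kind counted in the text.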

To test the significance of the million new gene–gene connections resulting from SuperPath generation, we checked their correlation with two independent measures of gene pairing. First, a comparison was made to publications shared among gene pairs (Figure 6A). We found that for gene pairs appearing in a SuperPath but not in any of its constituent pathways, there is a 4- to 75-fold increase in instances of >20 shared publications when compared with random pairs of genes with pathway annotation. Added gene pairs have significantly more shared publications than those randomly paired. Second, we performed a similar analysis based on protein–protein interaction information. We found that for the SuperPath-implicated gene pairs there was a 4- to 25-fold increase of PPIs with score >0.2 (Figure 6B) when compared with controls. SuperPaths thus provide significant gene partnering information not conveyed by any of their 3215 constituent individual pathways. This may be seen when performing gene set enrichment analysis on 830 differential expression sets and comparing the scores of SuperPaths to that of their constituent pathways, demonstrating that SuperPaths tend to receive more significant scores compared with their constituent pathways average score (Figure 7A).

Figure 6.

SuperPath-specific gene pairs are informative. (A) Shared publications. SuperPath-specific gene pairs are genes connected only by SuperPaths and not by any of the contained pathways. Enrichment of 10–100 is seen in the high abscissa values. The two distributions are significantly different (Kolmogorov–Smirnov P < 10^−100). There were no random gene pairs with 80–90 publications; this point was treated as having one such publication for computing the ratio. (B) Protein–protein interactions. Experimental interaction score from STRING (32) as depicted in GeneCards (12), for SuperPath versus random gene pairs as in panel A. The two distributions are significantly different (Kolmogorov–Smirnov P < 2.8 × 10^−61).

Figure 7.

SuperPath integration attributes. (A) SuperPaths outperform their constituent pathways in significance scores across 830 differentially expressed gene sets. (B) Number of included sources in non-singleton SuperPaths.

SuperPaths in databases

SuperPath information is available both in the GeneCards pathway section (Supplementary Figure S5A) and in PathCards (http://pathcards.genecards.org/; Supplementary Figure S5B), a GeneCards companion database presenting a web card for each SuperPath. PathCards allows the user to view the pathway network connectivity within a SuperPath, as well as the gene lists of the SuperPath and of each of its constituent pathways. Links to the original pathways are available from the pathway database symbols, placed to the left of pathway names. PathCards has extensive search capacity, including finding any SuperPath that contains a search term within its included pathway names, gene symbols and gene descriptions. Multiple search terms are supported, allowing fine-tuned results. The search results can be expanded to show exactly where in the SuperPath-related text the terms were found. The list of genes in a PathCard utilizes graded coloring to designate the fraction of included pathways containing this gene, providing an assessment of the importance of a gene in a SuperPath. Other features, including gene list sorting and a search tutorial, are under construction. PathCards is updated regularly, together with GeneCards updates. A new version is released 2–3 times a year.
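
The quantity behind the graded coloring, the fraction of constituent pathways that contain each gene, is simple to compute. The sketch below is illustrative only: the function name is hypothetical and the toy memberships are invented rather than taken from PathCards.

```python
def gene_fractions(constituent_pathways):
    """Rank genes by the fraction of constituent pathways containing them."""
    n = len(constituent_pathways)
    counts = {}
    for genes in constituent_pathways.values():
        for g in genes:
            counts[g] = counts.get(g, 0) + 1
    # Highest fraction first; gene symbol breaks ties alphabetically.
    return sorted(
        ((g, c / n) for g, c in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Invented memberships for three constituent pathways of one SuperPath.
toy_superpath = {
    "source 1": {"CS", "ACO2"},
    "source 2": {"CS", "IDH1"},
    "source 3": {"CS", "ACO2", "PDHA1"},
}
ranking = gene_fractions(toy_superpath)
```

Here CS appears in all three pathways (fraction 1.0), ACO2 in two, and IDH1 and PDHA1 in one each, so CS would receive the strongest color.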

Discussion

Pathway source heterogeneity

This study highlights substantial mutual discrepancies among different pathway sources, e.g. with regard to pathway sizes, names and gene contents. The world of human biological pathways consists of many idiosyncratic definitions provided by mostly independent sources that curate publication data and interpret it into sets of genes and their connections. The idiosyncratic view of the different pathway sources is exemplified by the variation in pathway size distribution among sources (Table 1, Figure 2D), where some sources have overrepresentation of large pathways (QIAGEN), while others have mainly small pathways (HumanCyc). In some cases, the large standard deviation in pathway size (Table 1) is easily explained, as exemplified in the case of Reactome, which provides hierarchies of pathways and therefore contains a spectrum of pathway sizes. However, large standard deviations of pathway size are also observed in KEGG and QIAGEN, sources that are not hierarchical by definition. On the other hand, some sources (e.g. HumanCyc, PID and PharmGKB) have very little variation in their pathway sizes, revealing their focus on pathways of particular size. The idiosyncratic view provided by different sources is also evident when examining the genes covered by each source (Figure 2A), where some genes in the gene space are covered by only one source. This causes the unfavorable outcome that when unifying pathways, irrespective of the algorithm chosen, there is a relatively high proportion of single source pathway clusters. To compensate for the difficulty the Jaccard index has in coping with large size differences between pathways, we added a preprocessing step to unify pathways that are almost completely included within other pathways (≥0.9 gene content similarity of the smaller pathway), thereby diminishing the barrier of variable pathway size between sources.

Previously published isolated instances of intersource discrepancies include the lack of pathway source consensus for the TCA cycle (3) and fatty acid metabolism (2). The authors of both papers stress that each of their pathway sources has only a partial view of the pathway. For the TCA cycle example (3) there is an attempt to provide an optimal TCA cycle pathway by identifying genes that appear in multiple sources, but such manual curation is not feasible for a collection of >3000 biological pathways. In our procedure, 11 relevant pathways from four sources are unified into a SuperPath entitled ‘Citric acid cycle (TCA cycle)’ (Supplementary Figure S5). PathCards enables one to then view which genes are more highly represented within the constituent pathways. Our algorithm thus mimics human intervention, and greatly simplifies the task of finding concurrence within and among pathway sources.

Pathway unification

Combining several pathway resources has been attempted before, using different approaches. The first method is to simply aggregate all of the pathways in several knowledge bases into one database, without further processing. This approach is taken, for example, by NCBI’s Biosystems with 2496 human pathways from five sources (5) and by PathwayCommons with 1668 pathways from four sources (6). This was also the approach taken by GeneCards prior to the SuperPaths effort described here, where pathways from six sources were shown separately in every GeneCard. While this approach provides centralized portals with easy access to several pathway sets, it does not reveal interpathway relationships and may result in considerable redundancy. The second unification approach, taken by PathJam (7) and HPD (8), provides protein-versus-pathway tables as search output. This scheme allows useful comparisons as related to specific search terms, but is not leveraged into global analyses of interpathway relations. A third line of action is exemplified by ConsensusPathDB (9), which integrates information from 38 sources, including 26 protein–protein interaction compendia as well as 12 knowledge bases with 4873 pathways. This allows users to observe which interactions are supported by each of the information sources. In turn, hiPathDB (10) integrates protein interactions from four pathway sources (1661 pathways) and creates ad hoc unified superpathways for a query gene, without globally generating consolidated pathway sets. Finally, a fourth methodology is employed by Pathway Distiller (11), which mines 2462 pathways from six pathway databases, and subsequently unifies them into clusters of several predecided sizes between 5 and 500, using hierarchical clustering.

The third method of interaction mapping, taken by ConsensusPathDB and hiPathDB, differs conceptually from the fourth method of clustering: the interaction mapping method provides information on the specific commonalities and discrepancies in protein interactions among sources with regard to specific keywords or genes, while the clustering method suggests which of the pathways are similar enough to be considered for the same cluster. Therefore, the third and fourth methods are complementary approaches aimed at utilization of pathway information at different observation levels, where the fourth (clustering) method is independent of user input or search in resultant consolidation. In the study described herein, we pursued a clustering method similar to the fourth methodology taken by Pathway Distiller, namely consolidation of pathways into clusters. However, in contrast to Pathway Distiller, our aim was to create a single coherent unification of biological pathways, which is essential for having a universal set of descriptors when looking at gene–gene relations. The resulting SuperPaths simplify the pathway-related descriptive space of a gene and reduce it 3-fold. Furthermore, the cutoffs in our algorithm are chosen to optimally adjust the criteria of uniqueness and informativeness, thereby reducing the subjective effect of choosing cutoffs arbitrarily or by predetermining the number of clusters.

SuperPath generation

A crucial element in our SuperPaths generation method is the definition of interpathway relationships. We have opted for the use of gene content, as described by others (11, 33). One could also consider the use of pathway name similarity (11). However, among the 3215 pathways analysed here, only 79 names were shared by more than one pathway, implying that the efficacy of such an approach would have been rather limited. Further, Table 2 and Supplementary Figure S2 indicate a relatively weak concordance between pathway names and their gene content. Specifically, among 79 name-identical pathway groups, 52 remained incompletely unified, again suggesting a limited usefulness for unifying based on pathway names. Many resources, including ConsensusPathDB (9), facilitate the option of finding pathways based on keywords in the name. Name sharing is thus a relatively trivial task to overcome when trying to find similar pathways. The more challenging goal is finding pathways that are similar in the biological process that they convey.

In this article we treated pathways as sets of genes, using gene content as a comparative measure and omitting topology and small molecule information. This approach was previously advocated as a means of greatly reducing the complexity of pathway comparisons (34). Further, most sources used in this study provide only the gene set information, hence topology information was unavailable. Finally, the high concordance between significance of pathway alignment and Jaccard coefficients ≥0.3 (P < 10^-52) indicates that the Jaccard coefficient is a good approximation of the more elaborate pathway alignment procedure (25).
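As a minimal sketch of the gene-content comparison described above, the Jaccard coefficient of two pathways treated as gene sets is the size of their intersection divided by the size of their union. The function name and the two gene sets below are hypothetical and chosen only for illustration; they are not taken from any pathway source.

```python
def jaccard(genes_a, genes_b):
    """Jaccard similarity coefficient between two gene sets."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        # Convention assumed here: two empty sets have similarity 0.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical gene sets, for illustration only.
pathway_x = {"TP53", "BAX", "CASP3", "CASP9", "BCL2"}
pathway_y = {"TP53", "BAX", "CASP3", "ATM"}

j = jaccard(pathway_x, pathway_y)
print(f"J = {j:.2f}")  # 3 shared genes out of 6 distinct genes
```

With these toy sets the coefficient is 0.5, which would pass the ≥0.3 level mentioned in the text; real pathways would of course need their gene lists drawn from the sources themselves.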

SuperPath utility

A central aim of pathway source unification is enhancing the inference of gene-to-gene relations needed for pathway enrichment scrutiny (32, 35–40). To this end, we developed an algorithm for pathway clustering so as to optimize this inference and at the same time minimize redundancy.

Extending pathways into SuperPaths affords two major advantages. The first is augmenting the gene grouping used for such inference. Indeed, SuperPaths have slightly larger sizes than the original pathways, as evident from the SuperPath size distribution (Figure 2D). Nevertheless, comparing SuperPaths to pseudo-SuperPaths of the same size and quantity clearly shows that the increase in size does not account for the addition of true positive gene connections, as evident from the higher PPIs and larger counts of shared publications for SuperPath gene pairs (Figure 6). Consequently, it is not surprising that SuperPaths outperform the average enrichment analysis scores of their constituent pathways (Figure 7A). SuperPaths are currently used in two GeneCards-related novel tools, VarElect (http://varelect.genecards.org/) and GeneAnalytics (http://geneanalytics.genecards.org/). A second advantage of SuperPaths is in the reduction of redundancy, since they provide a smaller, unified pathway set, and thus diminish the necessary statistical correction for multiple testing. We note that ConsensusPathDB (9) also provides an intersource integrated view of interactions. However, gene set analysis in ConsensusPathDB is only allowed for pathways as defined by the original sources. Finally, a third advantage of SuperPaths is their ability to rank genes within a biological mechanism via the multiplicity of constituent pathways within which a gene appears. This can be used not only to gain better functional insight but also to help eliminate suspected false-positive genes appearing in a minority of the pathway versions. A capacity to view such gene ranking is available within the PathCards database.
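The gene ranking mentioned above can be illustrated with a short sketch: count, for each gene, how many constituent pathways of a SuperPath contain it, and sort by that count. The function name, the pathway labels and the gene lists are hypothetical, and the tie-breaking rule (alphabetical) is an assumption made here for reproducibility, not something stated for PathCards.

```python
from collections import Counter


def rank_genes_by_multiplicity(constituent_pathways):
    """Rank genes by the number of constituent pathways that contain them.

    constituent_pathways maps a pathway label to an iterable of gene symbols.
    Returns (gene, count) pairs, highest count first; ties are broken
    alphabetically (an arbitrary choice for this sketch).
    """
    counts = Counter()
    for genes in constituent_pathways.values():
        counts.update(set(genes))  # count each gene once per pathway
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical SuperPath with three constituent pathways.
superpath = {
    "source_A: apoptosis": ["CASP3", "CASP9", "BAX", "TP53"],
    "source_B: apoptosis signaling": ["CASP3", "BAX", "BCL2"],
    "source_C: programmed cell death": ["CASP3", "CASP9", "FAS"],
}

ranking = rank_genes_by_multiplicity(superpath)
for gene, n in ranking:
    print(gene, n)
```

Genes that appear in only one of the three toy pathways end up at the bottom of the list, which is where suspected false positives would be looked for under this scheme.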

Limitations of SuperPaths

The SuperPaths generation procedure appears incomplete, as about half of all SuperPaths are ‘singleton SuperPaths’ (labelled accordingly in PathCards), having only one constituent pathway. This is an outcome of the specific cutoff parameters used. However, this provides a useful indication to the user that a singleton pathway is distinct, differing greatly in its constituent genes from any other pathway.

This SuperPath generation process is intended to reduce redundancies and inconsistencies found when analysing the unified pathways. Although SuperPaths increase uniqueness as compared with the original pathway set (Figure 5B), some redundancy and inconsistency still remain within SuperPaths. There are cases of pathways with similar names, which do not get unified into the same SuperPath. This happens because they have not met the unification criteria employed. We also note that similarity in name does not always indicate similarity in gene content (Figure 2B and C, Supplementary Figure S2B), and such events are faithfully conveyed to the user.

A clarifying example is that of the 40 pathways whose names include the string ‘apoptosis’. The final post-unification list has 10 SuperPaths whose name includes ‘apoptosis’. This obviously provides the user with a greatly simplified view of the apoptosis world. Yet, at the same time the outcome is replete with instances of two name-similar pathways being included in different SuperPaths. Employing a more stringent algorithm would result in over-clustering, which would in turn reduce informativeness (see Figure 3C).

In parallel, there are pathways with overlapping functions that are not consolidated into one SuperPath. For example, the pathway ‘integrated breast cancer pathway’ does not unify with the pathways ‘DNA repair’ and ‘DNA damage response pathway’, despite the strong functional relation of breast cancer with DNA damage and repair (41). This is because the relevant gene content similarity in the original pathway sources is small (J = 0.03 and 0.13, respectively). The need to view information on pathways with low pairwise similarity is addressed in Supplementary Figure S6, and is available as a text file upon request.

Finally, when looking at the number of contributing sources per SuperPath (Figure 7B), it is evident that the majority of SuperPaths draw on either one or two sources, and no SuperPath includes more than five. Although this integration limitation is evident, it mainly arises from the inherent biases in gene coverage for the different information sources (Figure 2A).

PathCards

Biological pathway information has traditionally been a central facet of GeneCards, the database of human genes (12, 42, 43). In previous versions, pathways were presented separately for each of the pathway sources, and it was difficult for users to relate the separate lists to each other. As a result of the consolidation into SuperPaths described herein, this problem has been effectively addressed. Thus, in every GeneCard, a table portrays all of a gene’s SuperPaths, each with its constituent pathways, with links to the original sources (Supplementary Figure S5A).

GeneCards is gene-centric and inherently does not present (Super) pathway-centric annotations. We therefore developed PathCards (http://pathcards.genecards.org/), a database that encompasses and displays such information in greater detail. PathCards has a page for every SuperPath, showing the connectivity of its included pathways, as well as gene lists for the SuperPath and its pathways. For every SuperPath, we also show a STRING gene interaction network (32) for the entire gamut of constituent genes, providing perspective on topological relationships within the SuperPath.

Supplementary Data

Supplementary data are available at Database Online.

Funding

This research is funded by grants from LifeMap Sciences Inc. California (USA) and the SysKid—EU FP7 project (number 241544). Support is also provided by the Crown Human Genome Center at the Weizmann Institute of Science. Funding for open access charge: LifeMap Sciences Inc. California (USA).

Conflict of interest. None declared.

Acknowledgements

We thank Prof. Eitan Domany and Prof. Ron Pinter for helpful discussions, as well as Dr. Noa Rappaport and Dr. Omer Markovich for assistance with clustering and visualization.

Footnotes

• Citation details: Belinky, F., Nativ, N., Stelzer, G., et al. PathCards: multi-source consolidation of human biological pathways. Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

The MIPS Mammalian Protein-Protein Interaction Database

The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.

Search the database

To suit different users' needs, we provide a variety of interfaces to search the database.

Background

Protein-protein interactions (PPI) represent a pivotal aspect of protein function. Almost every cellular process relies on transient or permanent physical binding of two or more proteins in order to accomplish the respective task. Comprehensive databases of PPI in Saccharomyces cerevisiae have proved to be invaluable resources for both bioinformatics and experimental research and are used heavily in the scientific community.

Although yeast is a well established model organism, not all interactions in higher eukaryotes have equivalent counterparts in unicellular systems. Currently, publicly available PPI databases contain comparatively few entries from mammals so we embarked on building a high-quality, manually curated database of protein-protein interactions in mammals.

Conditions of use

You are free to use the database as you please including full download of the dataset for your own analyses as long as you cite the source properly:

Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stümpflen V, Mewes HW, Ruepp A, Frishman D
The MIPS mammalian protein-protein interaction database
Bioinformatics 2005; 21(6):832-834; [Epub 2004 Nov 5]   doi:10.1093/bioinformatics/bti115

Other PPI resources

There are plenty of interesting databases and other sites on protein-protein interactions. Currently we are aware of the following PPI resources:

APID Agile Protein Interaction DataAnalyzer (Cancer Research Center, Salamanca, Spain)
BIND Biomolecular INteraction Network Database at the University of Toronto, Canada. No species restriction
CYGD PPI section of the Comprehensive Yeast Genome Database. Manually curated comprehensive S. cerevisiae PPI database at MIPS
DIP Database of Interacting Proteins at UCLA. No species restriction.
GRID General Repository for Interaction Datasets. Mount Sinai Hospital, Toronto, Canada
HIV Interaction DB Interactions between HIV and host proteins.
HPRD The Human Protein Reference Database. Institute of Bioinformatics, Bangalore, India and Johns Hopkins University, Baltimore, MD, USA.
HPID Human Protein Interaction Database. Department of Computer Science and Information Engineering, Inha University, Inchon, Korea
iHOP iHOP (Information Hyperlinked over Proteins). Protein association network built by literature mining
IntAct Protein interaction database at EBI. No species restriction.
InterDom Database of putative interacting protein domains. Institute for InfoComm Research, Singapore.
JCB PPI site at the Jena Centre for Bioinformatics, Germany
MetaCore Commercial software suite and database. Manually curated human PPIs (among other things). GeneGo
MINT Molecular INTeraction database at the Centro di Bioinformatica Molecolare, Universita di Roma, Italy.
MRC PPI links Commented list of links to PPI databases and resources maintained at the MRC Rosalind Franklin Centre for Genomics Research, Cambridge, UK
OPHID The Online Predicted Human Interaction Database. Ontario Cancer Institute and University of Toronto, Canada.
Pawson Lab Information on protein-interaction domains.
PDZbase Database of PDZ mediated protein-protein interactions.
Predictome Predicted functional associations and interactions. Boston University.
Protein-Protein Interaction Server Analysis of protein-protein interfaces of protein complexes from PDB. University College of London, UK.
PathCalling Proteomics and PPI tool/database. CuraGen Corporation.
PIM Hybrigenics PPI data and tool, H. pylori. Free academic license available
RIKEN Experimental and literature PPIs in mouse.
STRING Protein networks based on experimental data and predictions at EMBL.
YPD “BioKnowledge Library” at Incyte Corporation. Manually curated PPI data from S. cerevisiae. Proprietary.

If we forgot to list your favorite PPI resource or you are providing one yourself please let us know – we will be happy to include it.

PPI related software

aiSee Commercial graph layout software
Cytoscape Open source software for visualization of PPI networks and data integration
graphviz Graph layout software

You can get the full dataset here (PSI-MI format).

Acknowledgements

This work is funded by a grant from the German Federal Ministry of Education and Research. It is part of the initiative “Bioinformatics for the Functional Analysis of Mammalian Genomes” (BFAM).

An evaluation of human protein-protein interaction data in the public domain

BMC Bioinformatics 2006, 7(Suppl 5):S19

DOI: 10.1186/1471-2105-7-S5-S19

Published: 18 December 2006

Abstract

Background

Protein-protein interaction (PPI) databases have become a major resource for investigating biological networks and pathways in cells. A number of repositories for human PPIs are currently publicly available. Each of these databases has its own unique features with a large variation in the type and depth of its annotations.

Results

We analyzed the major publicly available primary databases that contain literature curated PPI information for human proteins. This included BIND, DIP, HPRD, IntAct, MINT, MIPS, PDZBase and Reactome databases. The number of binary non-redundant human PPIs ranged from 101 in PDZBase and 346 in MIPS to 11,367 in MINT and 36,617 in HPRD. The number of genes annotated with at least one interactor was 9,427 in HPRD, 4,975 in MINT, 4,614 in IntAct, 3,887 in BIND and <1,000 in the remaining databases. The number of literature citations for the PPIs included in the databases was 43,634 in HPRD, 11,480 in MINT, 10,331 in IntAct, 8,020 in BIND and <2,100 in the remaining databases.

Conclusion

Given the importance of PPIs, we suggest that submission of PPIs to repositories be made mandatory by scientific journals at the time of manuscript submission as this will minimize annotation errors, promote standardization and help keep the information up to date. We hope that our analysis will help guide biomedical scientists in selecting the most appropriate database for their needs especially in light of the dramatic differences in their content.

Background

Protein-protein interactions (PPI) are essential for almost all cellular functions. Proteins seldom carry out their function in isolation; rather, they operate through a number of interactions with other biomolecules. Experimental elucidation and computational analysis of the complex networks formed by individual protein-protein interactions (PPIs) is one of the major challenges in the post-genomic era. PPI databases have thus become valuable resources for the systematic analysis of the molecular networks of a cell [1, 2]. With the accumulation of PPIs from high-throughput experiments, it is increasingly important to store such data for easy retrieval and analysis [3]. Several databases have compiled protein interactions based on manual curation of the scientific literature, automated text mining of articles or computational predictions. In this review, various features of nine different databases are evaluated, including compliance with emerging data standards such as proteomics standards initiative – molecular interaction (PSI-MI) format [4] and BioPAX [5], which define a unified framework for sharing PPI and pathway information, respectively.

Human protein-protein interaction databases

Protein interaction repositories can be broadly classified into 2 types based on their content: i) Those containing interactions supported by experimental evidence, or, ii) Those containing interactions derived from in silico predictions alone, or, mixed together with experimentally derived PPIs. Here, we evaluate only those databases that exclusively contain experimentally derived PPI data in humans.

Curated literature based repositories have two major mechanisms of incorporating PPIs supported by experimental validation: i) curation by biologists from the literature, or, ii) direct deposit of the experimentally derived PPIs prior to publication by an investigator. Currently, the majority of PPIs in most databases are from curation of the literature. If all scientific journals mandated that PPIs be submitted to repositories as a requirement for publication (as is currently the case with nucleotide sequences), the databases would not only become more comprehensive but perhaps also contain fewer annotation errors. Below, we will briefly describe salient features of nine major PPI databases.

Human Protein Reference Database (HPRD)

HPRD contains annotations pertaining to human proteins based on experimental evidence from the literature [6, 7]. This includes PPIs as well as information about post-translational modifications, subcellular localization, protein domain architecture, tissue expression and association with human diseases. In addition to interactions of proteins with other proteins, HPRD also reports interactions of proteins with nucleic acids and small molecules. The PPI data is subclassified as binary or complex interactions based on topology and the number of participants. Binary PPIs are direct interactions between two proteins while complexes represent interactions with more than 2 participants and the topology of interaction is unknown. Relevant publications are cited for each interaction. The type of experiment is also indicated as in vivo (e.g. coimmunoprecipitation), in vitro (e.g. GST pull-down assays) or yeast two-hybrid. Information about post-translational modifications includes the residue of modification, type of experiment and the upstream enzyme. These modifications can be viewed alongside the protein domain architecture. Each protein is linked to a genome browser, GenProt Viewer [8], which allows protein and transcript information to be visualized in the context of the relevant gene. HPRD is also linked to a compendium of signal transduction pathways, NetPath [9], which is freely available in several different formats. This database includes a tool called PhosphoMotif Finder, which reports the presence of any of over 320 phosphorylation-based motifs curated from the literature in a protein of interest. HPRD also incorporates a new feature, Protein Distributed Annotation System (PDAS) which allows researchers to contribute and share their data with the rest of the community. All interaction information can be downloaded from the website either in PSI-MI format or as tab delimited files.

IntAct

The PPI information in the IntAct database includes a brief description of the interaction, experimental method and the literature citation of human proteins as well as proteins derived from several other species [10, 11]. Whenever possible, PPI information is isoform specific. The database can be accessed by either a basic or advanced search. The latter provides the user with additional querying options such as experimental method or controlled vocabulary terms listed in PSI-MI. IntAct also has a tool which predicts the best baits for pull-down experiments in humans by prioritizing the proteins which have the highest likelihood of being highly connected, or hubs, based on the available data within IntAct for various species; this is termed the Pay-As-You-Go algorithm. Additional software developed as part of the IntAct project includes HierarchView, which depicts interaction networks as 2-dimensional graphs and highlights nodes based on a GO category specified by the user (e.g. cellular component).

Molecular INTeraction database (MINT)

MINT is a repository of experimentally verified protein interactions with special emphasis on mammalian interactions [12, 13]. It also features interactions involving non-protein entities such as promoter regions and mRNA transcripts. PPI information includes binary and complex interactions and is isoform specific. Each interaction is given a confidence score based on the number of interactions and type of experiment and the number of citations provided for each interaction. The interactors can be viewed graphically using the ‘MINT Viewer,’ which permits users to view interactors as a network, and to manipulate it such that only the proteins of interest are shown. Users can expand the network by dragging individual interactors, select and visualize PPIs based on confidence scores, and they can also export the data in flat files, PSI-MI format or to Osprey, a system developed for visualizing and manipulating network data [14]. The interaction data are displayed along with the corresponding Swiss-Prot annotation. Proteins with a role in genetic diseases (according to OMIM (Online Mendelian Inheritance in Man)) are further highlighted. MINT features a separate annotation of human PPIs called HomoMINT, which includes in addition to literature derived data information from other organisms mapped to their human orthologs.

Database of Interacting Proteins (DIP)

PPI data stored in DIP were obtained through manual curation of the scientific literature and include direct and complex interactions [15, 16]. The JDIP is a Java application based visualization tool; it provides a graphical representation of interactions. New high-throughput experimental and predicted PPI data can be evaluated through other services provided by DIP such as Paralogous Verification Method (PVM), Expression Profile Reliability (EPR) [17] and Domain Pair Verification (DPV) [18]. PVM validates interacting pairs by showing the existence of paralogous interactions; EPR validates comparison based on common expression profiles of interactors and DPV validates through domain-domain interaction preferences. Other satellite projects, Live-DIP and DLRP, use the DIP database for accessing the interactions. Live-DIP annotates proteins under different physiological conditions [19] whereas DLRP annotates protein-ligand and protein-receptor pairs known to interact with each other [20].

MIPS Database

The MIPS database consists of mammalian interaction data manually curated from the literature [21, 22], and includes experiment type, description of the interaction and binding regions of interacting partners (where available). Data from mass spectrometry and yeast two-hybrid studies are not included. PPIs can be queried based on interaction partners, experimental method, and functional aspects of the PPIs. The results can be retrieved in 2 formats, long and short. The long format details the interaction, including reference, experimental details, binding sites for each protein and a short comment on each interaction, its functional significance or the immediate outcome of the interaction. The short format is restricted to listing the interacting proteins. Both formats are also linked to visualization tools. Each protein is further linked to the corresponding annotation in the mouse PEDANT genome database developed by the same group, which contains pre-computed bioinformatics analyses of publicly available genomes [23].

Alliance For Cellular Signaling (AfCS)

The AfCS is a multidisciplinary, multi-institutional consortium that studies cellular signaling [24, 25]. “Molecule Pages” in the AfCS database provide qualitative and quantitative information on signaling molecules (mostly murine) and their interactions; these include results of experiments carried out by the Alliance in addition to literature-derived data. The molecule pages contain automated as well as author-entered data. The former integrate DNA/protein sequence information and structural details along with basic biophysical and biochemical properties from external databases, whereas the latter consist of data manually curated from the literature. This is further assessed by AfCS-appointed editorial board members and anonymously peer-reviewed in a process established by the Nature Publishing Group. The curated data includes a textual description of protein function, regulation of activity, subcellular localization, major sites of expression, splice variants and phenotype of knockout animals. The interaction data are derived from murine proteins, or, if they are from other species, the interaction is mapped to the corresponding mouse orthologs. For some proteins, the annotations include descriptions of signaling molecules under different physiological conditions termed ‘states’ (e.g. binding of a phosphorylated protein with another protein). A number of signaling pathway maps are also available in this database. We have not considered this database in our comparison mainly because of its focus on murine, and not human, proteins.

Biomolecular Interaction Network Database (BIND)

BIND is a database of biomolecular associations that are classified into 3 categories: binary molecular interactions, molecular complexes and pathways [26, 27]. In BIND, a molecular complex is a collection of two or more molecules that associate to form a functional unit in a cell. These records are supplemented with additional information such as complex topology and the number of subunits involved in the interaction. Pathways are a collection of two or more interactions that occur in a defined sequence within a living system; currently 8 pathways have been annotated. Data pertaining to 1473 organisms is available in BIND. Information on molecular associations is obtained from the literature. The majority of the interactions in BIND are PPIs although it includes some interactions with nucleic acids and small molecules as well. The function of proteins is depicted using ontoglyphs, a series of symbolic characters representing a high-level summary of Gene Ontology (GO) information, and proteoglyphs, symbols used to represent the structural and binding properties of proteins at the level of conserved domains. Data in BIND can be queried using various database identifiers or by a BLAST search. BIND also stores biomolecular interactions for several other species. For yeast high-throughput PPI datasets, BIND provides a confidence measure based on text mining of publications, existence of homologous interactions, common and related GO annotations, domain composition and phenotypic profiling for the evaluation. The data can be downloaded in flat file and PSI-MI formats and the pathways can be exported to ‘sif’ format which allows visualization by Cytoscape, a software tool developed for visualization and manipulation of pathway data [28]. BIND offers a Simple Object Access Protocol (SOAP) interface for those who wish to access the data from third-party software. BIND also has data imports from FlyBase, MIPS, MGI etc. and entries can be queried through various sources (e.g. Wormbase and KEGG).

Reactome

Reactome is a curated knowledgebase of biological pathways [29, 30]. The goal of Reactome is to develop a curated resource of pathways and biochemical reactions in humans; however many of the reactions are also obtained via transfer from other species. The basic unit of this database is a reaction. Information on reactions is either derived from experiments in the literature or is an electronic inference based on sequence similarity. Reactions are also inferred in humans based on the putative human orthologs for the proteins that participate in the same reaction in other species. In such cases, the model organism reaction is annotated in Reactome, the inferred human reaction is annotated as a separate event, and the inferential link between the two reactions is explicitly noted. Each reaction is detailed with input, output, preceding and following events of the reaction, cellular component of the reaction and species of its occurrence. Each reaction is linked to pathways according to the order of reactions in the corresponding pathway. The available pathways are integrated and represented graphically as a series of constellations in a ‘starry sky.’ This can be used to navigate through the reactions in biological pathways and visualize connections between them. It must be cautioned that the definition of PPIs in Reactome is quite broad: the interactions can be represented as ‘direct complex,’ ‘indirect complex,’ ‘reaction’ or ‘neighboring reaction.’ In a ‘direct complex,’ interactions occur between proteins present in the same complex and are not true pairwise interactions. ‘Indirect complexes’ contain interactions between interactors in different subcomplexes of a complex. ‘Reactions’ are interactions between proteins that participate in a reaction and the interactors are not reported to be in a complex. ‘Neighboring reactions’ represent the interactors that participate in 2 consecutive reactions, i.e. when one reaction produces a product, which is either an input or a catalyst for another reaction. The information is edited by the Reactome staff at Cold Spring Harbor Laboratory and the European Bioinformatics Institute and is then reviewed by other biological researchers for consistency and accuracy. Each reaction or pathway can be exported to Systems Biology Markup Language (SBML) and BioPAX formats. Reactome also provides tools such as Pathfinder and Skypainter. Pathfinder can identify pathways that connect input with output molecules while Skypainter allows the coloring of reaction maps based on user-specified identifiers that have been linked to each pathway. For our analysis, we have considered only the ‘direct complexes’ as they are the category most likely to correspond to true PPIs.

PDZBase

PDZBase is a database that focuses only on PPIs involving proteins with PDZ domains [31, 32]. Only those interactions involving the PDZ domain that have been confirmed by individual in vitro or in vivo biochemical experiments are considered. Thus, interactions discovered solely through high-throughput methods (e.g. yeast two-hybrid or mass spectrometry) are not included in PDZBase. PDZ domains and their ligands can be queried using sequence motifs. Each interaction in PDZBase consists of the residues of the interacting proteins on a 2D-diagram generated by a residue-based-diagram-editor (RBDG). The interacting residues between the PDZ domain and their peptide ligands are predicted based on similarity with the available structures of PDZ-peptide complexes.

Strategy used for comparison of datasets

The datasets were downloaded from the download sites of PPI databases on October 2, 2006 and scripts were used for parsing out the protein pairs involved in PPIs along with the experiment type and literature references, if provided. The PPIs were further parsed to extract binary interactions for those proteins pairs where both proteins were human. Most databases had Swiss-Prot as one of their accession identifiers except BIND which provided RefSeq, GenBank and PDB identifiers. To determine the overlap among databases, the Swiss-Prot or RefSeq identifiers were mapped to the corresponding Entrez Gene identifiers as of October 2, 2006. Scripts were used to convert these PPIs into a non-redundant list of PPIs (if protein A and B interact, the dataset may have two PPIs, A-B and B-A – only one of the PPI was retained for our analyses). All datasets were compared with each other to obtain the overlap at PPI and protein levels. Experiment types extracted for PPIs were mapped with PSI-MI vocabulary list. Disease annotations for genes were obtained from OMIM and mapped to gene symbols to obtain the number of proteins in PPIs corresponding to disease-associated genes.
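The de-duplication and overlap steps described above can be sketched as follows. The original scripts are not part of this article, so this is only an illustration under stated assumptions: interactions are already mapped to gene-level identifiers, the identifiers `G1` to `G5` are placeholders rather than real Entrez Gene IDs, and the function names are hypothetical.

```python
def nonredundant_pairs(interactions):
    """Collapse A-B / B-A duplicates into a set of unordered pairs.

    Each interaction is a 2-tuple of gene identifiers. Sorting the two
    members gives every unordered pair a single canonical form, and
    homodimers such as (A, A) are kept as they are.
    """
    return {tuple(sorted(pair)) for pair in interactions}


def ppi_overlap(db_a, db_b):
    """Binary PPIs present in both datasets, ignoring pair orientation."""
    return nonredundant_pairs(db_a) & nonredundant_pairs(db_b)


def protein_overlap(db_a, db_b):
    """Genes that take part in at least one PPI in both datasets."""
    genes_a = {g for pair in db_a for g in pair}
    genes_b = {g for pair in db_b for g in pair}
    return genes_a & genes_b


# Placeholder datasets (identifiers are hypothetical).
db_one = [("G1", "G2"), ("G2", "G1"), ("G3", "G3"), ("G1", "G4")]
db_two = [("G2", "G1"), ("G4", "G5"), ("G3", "G3")]

print(len(nonredundant_pairs(db_one)))         # unique binary PPIs in db_one
print(sorted(ppi_overlap(db_one, db_two)))     # overlap at the PPI level
print(sorted(protein_overlap(db_one, db_two))) # overlap at the protein level
```

Using a sorted tuple as the canonical key is one simple design choice; a `frozenset` would also work for heterodimers but would need care if the number of participants had to be read back out for homodimers.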

Caveats of comparing PPI data

Assessment of the accuracy of annotation of all PPIs in various publicly available databases is beyond the scope of this article. In this study, we have tried to evaluate parameters that could be measured objectively. Nevertheless, there are still a number of caveats of any analysis comparing PPIs. Below is a list of some of the potential pitfalls and our strategies to tackle them.

1. 1.

Binary interactions including homodimers were considered for this analysis while complex interactions were not. It is not easy to look at complex interactions across databases especially for comparison purposes although ‘spoke’ and ‘matrix’ models have been described previously for comparing protein complexes [33]. In this study, we have chosen not to compare the complex interactions because of predictive nature of these models. However, cases where a protein complex was already converted into binary PPIs by using one of these models (e.g. use of the ‘matrix’ model to computationally predict PPIs in Reactome) were treated as binary interactions.

2. Some of the binary interactions involved proteins that were non-human. Mapping of orthologs is not an easy task and is not standardized. Thus, we did not attempt to map the human orthologs for proteins from any other species that were listed as interacting proteins.

3. We mapped all protein isoforms to a unique gene and then examined the overlaps. This was done because often a given isoform is annotated as an interacting protein although the interaction is not specific to that isoform. For example, this strategy allowed us to correctly capture PPIs as overlapping where a given protein was annotated as interacting with one isoform of another protein in one database and with another isoform of that protein in another database.
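The isoform-to-gene strategy in the last caveat can be illustrated with a small sketch. The accessions, gene names and the mapping table below are invented for the example; in practice the table would come from an identifier-mapping resource.

```python
def to_gene_level(ppis, isoform_to_gene):
    """Re-key protein-level pairs by gene, dropping pairs that cannot be mapped."""
    gene_pairs = set()
    for protein_a, protein_b in ppis:
        gene_a = isoform_to_gene.get(protein_a)
        gene_b = isoform_to_gene.get(protein_b)
        if gene_a is None or gene_b is None:
            continue  # unmapped identifier: leave it out rather than guess
        gene_pairs.add(frozenset((gene_a, gene_b)))
    return gene_pairs


# Hypothetical mapping: two isoforms of GENE_Y, one product of GENE_X.
isoform_to_gene = {"X-1": "GENE_X", "Y-1": "GENE_Y", "Y-2": "GENE_Y"}

db1 = [("X-1", "Y-1")]  # database 1 annotates isoform 1 of Y
db2 = [("Y-2", "X-1")]  # database 2 annotates isoform 2 of Y

isoform_overlap = {frozenset(p) for p in db1} & {frozenset(p) for p in db2}
gene_overlap = to_gene_level(db1, isoform_to_gene) & to_gene_level(db2, isoform_to_gene)

print(len(isoform_overlap), len(gene_overlap))  # 0 1
```

At the isoform level the two databases appear to share nothing; at the gene level the same interaction is recognized in both.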

Results and Discussion

Comparison of PPI data

Table 1 summarizes the salient features of each database including total number of PPIs, total number of proteins, method of detection of PPIs, curation methodology, download options and URL links. The availability of data as a downloadable file is also indicated. Fig. 1A shows the distribution of the number of PPIs in each of the literature-based curated databases considered in our analysis. For each database, the total number of human PPIs present in the statistics page or in the downloaded files is shown along with the number of unique (non-redundant) binary human PPIs calculated by us. For this calculation, we only considered binary PPIs in which both members of an interacting pair were human proteins. As explained above, protein complexes were excluded from this analysis because it is difficult to ascertain the topology (i.e. which protein interacts with which protein in a complex) for determining overlap between datasets. The difference between the total and non-redundant PPIs in HPRD is because of protein complexes whereas in all other databases it is mainly due to the redundancy of PPIs. The distribution of PPI data (Fig. 1A) shows a dramatic variation across these databases.

It is difficult to directly assess the depth of PPIs based on total interactions alone; thus, we analyzed the distribution of the number of proteins in each database according to the number of binary (i.e. direct) interactions per protein. The majority of proteins in all databases have <10 interaction partners (Fig. 1B). The number of PPIs that fall under the 31–40 and 41–50 PPI bins is high in the HPRD and Reactome databases. Although these PPIs are distributed across many types of proteins in HPRD, those in Reactome belong to mainly two classes: proteasomal or ribosomal protein complexes. The number of interactions for these two classes of proteins in Reactome is high because a ‘matrix’ model of interpreting protein complexes is used in which all proteins are considered connected to all proteins within a complex. All other databases show the same trend, with a greater number of proteins in bins with a lower number of PPIs per protein. This does not automatically imply that most proteins truly interact with a small number of interactors. Rather, this is likely because not all proteins have been studied thoroughly and not all published interactions have yet been included in these databases. Additionally, experimental methods are biased in the interactions they can capture (e.g. the yeast two-hybrid system does not generally detect interactions involving integral membrane proteins). Overall, most databases contain a very small number of proteins with >30 PPIs.
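To see why the ‘matrix’ model inflates the number of interactions per protein, the following sketch expands one complex under both models and counts partners. The text only defines the matrix model; the spoke definition used here (a bait linked to each co-purified prey) is the commonly used one and is stated as an assumption. Protein names are placeholders.

```python
from collections import Counter
from itertools import combinations


def matrix_model(members):
    """Every member of the complex is paired with every other member."""
    return {frozenset(pair) for pair in combinations(sorted(set(members)), 2)}


def spoke_model(bait, preys):
    """Only the bait is paired with each of the other members (assumed definition)."""
    return {frozenset((bait, prey)) for prey in set(preys) if prey != bait}


def degree_per_protein(ppis):
    """Count the interactions each protein takes part in."""
    degree = Counter()
    for pair in ppis:
        for protein in pair:
            degree[protein] += 1
    return degree


complex_members = ["A", "B", "C", "D", "E"]

matrix_ppis = matrix_model(complex_members)
spoke_ppis = spoke_model("A", complex_members[1:])

print(len(matrix_ppis), len(spoke_ppis))  # 10 4
print(degree_per_protein(matrix_ppis)["B"], degree_per_protein(spoke_ppis)["B"])  # 4 1
```

For a complex of n proteins the matrix model yields n(n-1)/2 pairs and gives every member n-1 partners, so large complexes such as the ribosome push their subunits into the high-degree bins.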

Comparison of proteins annotated with PPIs

We looked for the total number of unique genes represented in the PPI databases (Fig. 2A). In HPRD, proteins encoded by 9,427 genes have at least one direct PPI annotated (out of ~20,000 proteins annotated in this database) while BIND, IntAct and MINT contain 3,887, 4,614 and 4,975 proteins, respectively. Other databases such as DIP, Reactome, MIPS and PDZBase contain PPIs for <1,000 proteins.

Proteins encoded by disease-associated genes in PPIs

PPIs are attractive as potential targets for small-molecule drugs for treatment of diseases. We checked for proteins encoded by genes listed in the OMIM database that are mutated in inherited genetic disorders (Fig. 2B). HPRD has all human disease-associated genes listed in OMIM, of which 1,463 have at least one protein interactor, while most of the other databases contain significantly fewer proteins encoded by these genes.

Overlap of PPIs and proteins between databases

As discussed above, there is a significant difference in the total number of PPIs in the various databases. However, this statistic does not provide an idea of the extent to which the PPIs actually overlap across databases. As shown in Fig. 3A, HPRD contains a high proportion of human PPIs that are present in other literature-derived curated databases. The overlap between IntAct (10,244 PPIs) and MINT (11,367 PPIs) is 7,362, which is the highest overlap among the remaining literature-derived databases; the overlap between BIND (6,621 PPIs) and MINT (11,367 PPIs) is only 1,463 and there is no overlap between PDZBase and DIP.

To determine whether the overlap is small because of proteins not being annotated in different databases, we looked at the overlap at the protein level between databases. As shown in Fig. 3B, the overlap of proteins between BIND (3,887 proteins) and IntAct (4,614 proteins) is 1,969 but the overlap at PPI level is only 1,167. HPRD contains 76% and MINT contains 51% of proteins in Reactome, although there is a very low overlap at the level of PPIs across these databases. Overall, although there is a good overlap between the databases at the protein level, the PPIs do not overlap as much. The average degree (K) of a protein, i.e. the number of interactions that a protein has with other proteins, is 7.6 for HPRD, while that for MIPS, PDZBase, DIP, BIND, MINT and IntAct ranges from 1.7 to 4.5. Strikingly, the average degree of a protein in Reactome is 12.2, which is because of the interpretation of protein complexes through the ‘matrix’ model as explained above.
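The two levels of overlap and the average degree can be computed from sets of unordered pairs, as in the sketch below. The toy networks are invented, and the average-degree formula (interactions counted once per participating protein, divided by the number of proteins) is an assumption, since the text does not spell out how K was calculated.

```python
def proteins_in(ppis):
    """All proteins that appear in at least one pair."""
    return set().union(*ppis)


def overlap(ppis_a, ppis_b):
    """Overlap between two databases at the PPI level and at the protein level."""
    return {
        "shared_ppis": len(ppis_a & ppis_b),
        "shared_proteins": len(proteins_in(ppis_a) & proteins_in(ppis_b)),
    }


def average_degree(ppis):
    """Mean number of interactions per protein (assumed definition of K)."""
    proteins = proteins_in(ppis)
    if not proteins:
        return 0.0
    # Each pair adds one interaction to every distinct protein it contains.
    return sum(len(pair) for pair in ppis) / len(proteins)


def pair(a, b):
    return frozenset((a, b))


db_a = {pair("A", "B"), pair("A", "C"), pair("B", "C")}
db_b = {pair("A", "B"), pair("A", "D"), pair("C", "D")}

print(overlap(db_a, db_b))   # {'shared_ppis': 1, 'shared_proteins': 3}
print(average_degree(db_a))  # 2.0
print(average_degree(db_b))  # 1.5
```

The toy example reproduces the pattern reported above: all three proteins of the first network are present in the second, yet only one interaction is shared.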

We also carried out a comparison of a test set of proteins to check the distribution of interaction partners of PPIs across different databases (Table 2). The test proteins were selected based on the presence of proteins in four or more databases. We required that the protein be present in four or more databases because there was not even a single protein that was common to all databases. The proteins were further selected to cover proteins that participate in several different types of biological processes to avoid any potential bias in the event that any particular database is especially ‘strong’ in certain types of annotations. As shown in Table 2, Caspase 3 (CASP3) has 126 protein interaction partners annotated in HPRD, while BIND, MINT, IntAct and Reactome contain 15, 6, 3 and 1 interactions, respectively. S-phase kinase-associated protein 1A (SKP1A) has 35 PPIs in HPRD, 11 in BIND, 5 in DIP and 13 in MINT. MIPS and PDZBase do not contain any PPIs for this protein. Nuclear factor kappa-B subunit 3 (RELA) has 98 protein interaction partners in HPRD while BIND, MINT, DIP and IntAct contain 13, 103, 13 and 90 PPIs, respectively. Overall, for most proteins, at least one database, and often several, contains no PPI annotations (Table 2). This again reflects the fact that the databases are still at an early stage of curation and annotation of published PPIs.

Literature citations in literature-derived databases

Literature citations are generally linked to interactions in literature-derived datasets. We checked the total citations in PubMed linked to PPIs in the literature-derived databases (Fig. 4A). HPRD has >43,634 published articles to support the PPI data, while BIND and MINT contain ~8,020 and ~11,480 citations, respectively. Reactome contains a total of ~2,000 citations. Another parameter to assess the extent of curation is to determine the number of citations per interaction. More than one citation for a given PPI indicates that the interaction has been verified by more than one group or method. Conversely, however, the presence of a single citation does not automatically imply that there is only one study describing the interaction because it is quite likely that only one published paper was linked although several studies might have been carried out (i.e. incomplete curation). This is illustrated in the section below where the same PPI is compared across multiple databases. As shown in Fig. 4B, 100% of PPIs in PDZBase and >95% of PPIs in MINT, IntAct and MIPS had one PubMed citation. In contrast, 87% in BIND and DIP and 84% of PPIs in HPRD have only one citation. Notably, ~11% and 7% of PPIs in HPRD and BIND, respectively, have 2 citations and ~2% of PPIs in HPRD, BIND and IntAct have more than 5 citations each. The majority of PPIs in Reactome (~96%) are linked to the same 2 published articles because these PPIs are predicted computationally using a matrix approach (i.e. all against all) to link proteins that were identified in two mass spectrometry-based protein complex pulldown studies on spliceosomes [34, 35].
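The citations-per-interaction measure described above amounts to a simple tally. The sketch below assumes each PPI has already been linked to a list of PubMed identifiers; the identifiers and pair names shown are placeholders, not real citations.

```python
from collections import Counter


def citation_distribution(ppi_to_pmids):
    """Percentage of PPIs supported by 1, 2, 3, ... distinct citations."""
    counts = Counter(len(set(pmids)) for pmids in ppi_to_pmids.values())
    total = sum(counts.values())
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}


# Placeholder data: four interactions, one of them cited twice.
ppi_to_pmids = {
    "A-B": ["pmid_1"],
    "A-C": ["pmid_2"],
    "B-C": ["pmid_3", "pmid_3"],  # duplicate link to the same article
    "C-D": ["pmid_1", "pmid_4"],
}

distribution = citation_distribution(ppi_to_pmids)
print(distribution)  # {1: 75.0, 2: 25.0}
```

Counting distinct identifiers guards against the same article being linked twice to one interaction, which would otherwise look like independent support.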

Comparison of PPI annotations common to multiple databases

Overall statistics of databases might not reflect the breadth and depth of protein annotations from a biologist’s perspective. To provide certain ‘case studies,’ we prepared a list of protein interactions that are common to 4 or more literature-derived databases and then tabulated the number of PPIs in each database. We left out PDZBase because of its small size. Table 3 lists 6 representative PPIs that were common to 4 or more databases along with the article(s) cited for each interaction and the annotation of the experimental methods used to detect the corresponding PPI. As an example, the experimental method annotated for the interaction between transcription factors NFKB1 and NFKB3 reported recently [36] is in vivo (MI:0492) in HPRD, tandem affinity purification (TAP) (MI:0045) in DIP, anti tag coimmunoprecipitation (MI:0109) in MINT and tap tag coip (MI:0007) in IntAct. This example illustrates how databases can describe the same experiment using alternative vocabulary terms. The interaction, TNFRSF1A with TRADD, is annotated as in vivo, in vitro and yeast 2-hybrid with 3 PubMed citations in HPRD, simply ‘experimental’ with 1 PubMed citation in DIP, immunoprecipitation and affinity chromatography with 3 PubMed citations in BIND, co-immunoprecipitation with 1 PubMed citation by MIPS, ‘co-immunoprecipitation, pulldown and two hybrid’ with 2 citations by MINT and ‘anti-bait coip, pulldown and two hybrid’ with 1 citation by IntAct. Together, the 6 databases refer to 8 PubMed citations to describe this interaction while each individual database only uses between 1 and 3 citations. 
For the interaction of FADD with FAS, the HPRD annotation is ‘in vivo, in vitro and yeast 2-hybrid,’ DIP mentions ‘two hybrid test,’ BIND describes it as ‘immunoprecipitation,’ MIPS mentions ‘coip,’ MINT describes it as ‘coimmunoprecipitation and two hybrid’ and IntAct annotates it as ‘coip, pull down, anti tag coip and two hybrid.’ Table 3 highlights how different databases use different published articles for annotating the same PPI. Thus, the mere presence of a PPI in different literature-derived databases does not automatically guarantee that the annotations will be identical. It also illustrates that merging annotations from multiple databases will increase the depth of individual annotations.
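The gain from merging annotations can be made concrete with a short sketch that takes the union of methods and citations for each interaction. The database names, the method labels assigned to them and the citation identifiers below are placeholders chosen for illustration; they do not reproduce Table 3.

```python
def merge_annotations(databases):
    """Union the methods and citations recorded for each PPI across databases."""
    merged = {}
    for db_name, entries in databases.items():
        for (protein_a, protein_b), (methods, pmids) in entries.items():
            key = frozenset((protein_a, protein_b))  # order-independent key
            slot = merged.setdefault(
                key, {"methods": set(), "pmids": set(), "sources": set()}
            )
            slot["methods"].update(methods)
            slot["pmids"].update(pmids)
            slot["sources"].add(db_name)
    return merged


# Placeholder annotations for one interaction seen in two databases.
databases = {
    "db_1": {("TNFRSF1A", "TRADD"): ({"in vivo", "two hybrid"}, {"pmid_1", "pmid_2"})},
    "db_2": {("TRADD", "TNFRSF1A"): ({"coimmunoprecipitation"}, {"pmid_2", "pmid_3"})},
}

merged = merge_annotations(databases)
entry = merged[frozenset(("TNFRSF1A", "TRADD"))]
print(len(entry["pmids"]), len(entry["methods"]), sorted(entry["sources"]))
# 3 3 ['db_1', 'db_2']
```

Note that a plain union keeps synonymous method terms as separate entries; collapsing them would require mapping each term to a controlled vocabulary such as PSI-MI first.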

The Proteomics Standards Initiative (PSI) is a collaborative initiative for standardization of protein-related data including protein-protein interaction and mass spectrometry data. The PSI-molecular interaction (PSI-MI) [37] format is an exchange format, which has already become the standard for PPI data [4]. Table 1 shows that although many databases, such as HPRD, BIND, DIP, MINT, MIPS and IntAct, provide the PPI data in this format, some databases such as AfCS and Reactome do not currently have this option. Reactome also provides data in two pathway-related formats, BioPAX and SBML. The data contained in AfCS is not currently available as a downloadable file.

Although a consensus on the use of standardized vocabulary for denoting PPIs is evolving and is being increasingly used, there is no requirement for use of any particular type of identifiers or database accession numbers for proteins in PPI databases. Different sets of protein database identifiers are used, with many of them being frequently retired, merged or otherwise updated. This creates great difficulties for those who want to combine datasets from different databases. It is not a trivial task to ‘map’ identifiers to a single set of proteins, and doing so creates a bioinformatics pitfall of its own. If this ‘mapping’ is done by purely automated methods, there is a risk of wrong assignment of a protein entry from one database to another. To minimize this, we recommend the use of gene symbols in addition to any ‘favorite’ protein identifier. This allows for a relatively more error-free interpretation of PPI data at the gene level.

Conclusion

There is great interest in protein-protein interactions as a means of understanding the complexities of a cell. Large-scale PPI data derived from high-throughput experiments or literature-derived curated databases have been used to analyze the molecular networks of human cells [38, 39, 40, 41]. Here, our assessment shows that the number of PPIs in databases varies widely from as low as 100 to over 36,600 interactions. Overlap of PPIs within the same category of databases (e.g. within literature-derived databases) is low despite the presence of overlapping proteins. A comparison of the number of PPIs for a test set of proteins confirms that there is indeed a large variation in the number of interactors across the interaction databases. Also, a comparison of annotations for the PPIs that do overlap between the databases reveals differences in annotations through the use of alternative vocabulary terms. This is partly because of the difference in interpretation of the experimental results by the biologists annotating them and partly because of the overlapping meaning of the terms themselves.

A particularly important issue is that of protein isoforms. Often, only one isoform is annotated as an interactor although there is no evidence that the interaction is specific to that isoform. In other experiments such as coimmunoprecipitation experiments, it is almost impossible to discern which isoform binds unless an isoform-specific antibody is used. Because of this difficulty in mapping isoforms, we suggest that groups carrying out interaction studies, especially large-scale studies, map the identity of the proteins to genes and include this in their data submission. We have also previously done this for protein identification studies using mass spectrometry where a similar difficulty exists with regard to identification of particular isoforms [42]. If this is done, then a binary interaction can be interpreted thus: at least one of the gene products of Gene A interacts with at least one of the gene products of Gene B.

The dissemination of PPI datasets is an important aspect for optimal use of the data. Through decades of research, molecular biologists have discovered a large number of PPIs. Collecting this information, storing it and maintaining a database is a valuable task, which is perhaps not adequately appreciated by the scientific community. Our evaluation of human PPI databases highlights the diverse nature of annotation and representation of PPIs in databases. We hope that this review will assist biomedical scientists in making informed decisions about the most appropriate database to suit their needs and to actively participate with the databases to maintain error-free and updated annotations.

List of Abbreviations

PSI-MI: Proteomics Standards Initiative – Molecular Interaction

HPRD: Human Protein Reference Database

BIND: Biomolecular Interaction Network Database

DIP: Database of Interacting Proteins

MINT: Molecular INTeraction database

AfCS: Alliance for Cellular Signaling

Declarations

Acknowledgements

Akhilesh Pandey is supported by a grant from the National Institutes of Health (U54 RR020839). The Human Protein Reference Database was developed with funding from the National Institutes of Health and the Institute of Bioinformatics. Dr. Pandey serves as Chief Scientific Advisor to the Institute of Bioinformatics. Dr. Pandey is entitled to a share of licensing fees paid to the Johns Hopkins University by commercial entities for use of the database. The terms of these arrangements are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.

This article has been published as part of BMC Bioinformatics Volume 7, Supplement 5, 2006: APBioNet – Fifth International Conference on Bioinformatics (InCoB2006). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S5.

References

1. Kemmer D, Huang Y, Shah SP, Lim J, Brumm J, Yuen MM, Ling J, Xu T, Wasserman WW, Ouellette BF: Ulysses – an application for the projection of molecular interactions across species. Genome Biol 2005, 6: R106. 10.1186/gb-2005-6-12-r106
2. Riley R, Lee C, Sabatti C, Eisenberg D: Inferring protein domain interactions from databases of interacting proteins. Genome Biol 2005, 6: R89. 10.1186/gb-2005-6-10-r89
3. Suresh S, Sujatha Mohan S, Mishra G, Hanumanthu GR, Suresh M, Reddy R, Pandey A: Proteomic resources: Integrating biomedical information in humans. Gene 2005, 364: 13–18. 10.1016/j.gene.2005.07.021
4. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, et al.: The HUPO PSI’s molecular interaction format – a community standard for the representation of protein interaction data. Nat Biotechnol 2004, 22: 177–183. 10.1038/nbt926
5. BioPAX [http://www.biopax.org]
6. HPRD Human Protein Reference Database [http://www.hprd.org]
7. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13: 2363–2371. 10.1101/gr.1680803
8. GenProt [http://www.genprot.org]
9. NetPath [http://www.netpath.org]
10. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, et al.: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32: D452–455. 10.1093/nar/gkh052
11. IntAct [http://www.ebi.ac.uk/intact]
12. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Lett 2002, 513: 135–140. 10.1016/S0014-5793(01)03293-8
13. MINT Molecular INTeraction database [http://mint.bio.uniroma2.it/mint]
14. Breitkreutz BJ, Stark C, Tyers M: Osprey: a network visualization system. Genome Biol 2003, 4: R22. 10.1186/gb-2003-4-3-r22
15. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32: D449–451. 10.1093/nar/gkh086
16. DIP Database of Interacting Proteins [http://dip.doe-mbi.ucla.edu]
17. Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1: 349–356. 10.1074/mcp.M100037-MCP200
18. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12: 1540–1548. 10.1101/gr.153002
19. Duan XJ, Xenarios I, Eisenberg D: Describing biological protein interactions in terms of protein states and state transitions: the LiveDIP database. Mol Cell Proteomics 2002, 1: 104–116. 10.1074/mcp.M100026-MCP200
20. Graeber TG, Eisenberg D: Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nat Genet 2001, 29: 295–300. 10.1038/ng755
21. Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, et al.: The MIPS mammalian protein-protein interaction database. Bioinformatics 2005, 21: 832–834. 10.1093/bioinformatics/bti115
22. MIPS Mammalian Protein-Protein Interaction Database [http://mips.gsf.de/proj/ppi]
23. Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D: The PEDANT genome database in 2005. Nucleic Acids Res 2005, 33: D308–310. 10.1093/nar/gki019
24. Gilman AG, Simon MI, Bourne HR, Harris BA, Long R, Ross EM, Stull JT, Taussig R, Bourne HR, Arkin AP, et al.: Overview of the Alliance for Cellular Signaling. Nature 2002, 420: 703–706. 10.1038/nature01304
25. AfCS Alliance for Cellular Signaling [http://www.signaling-gateway.org]
26. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, et al.: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33: D418–424. 10.1093/nar/gki051
27. BIND Biomolecular Interaction Network Database [http://www.bind.ca]
28. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13: 2498–2504. 10.1101/gr.1239303
29. Reactome [http://www.reactome.org]
30. Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, et al.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33: D428–432. 10.1093/nar/gki072
31. PDZBase [http://icb.med.cornell.edu/services/pdz]
32. Beuming T, Skrabanek L, Niv MY, Mukherjee P, Weinstein H: PDZBase: a protein-protein interaction database for PDZ-domains. Bioinformatics 2005, 21: 827–828. 10.1093/bioinformatics/bti098
33. Bader GD, Hogue CW: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 2002, 20: 991–997. 10.1038/nbt1002-991
34. Hartmuth K, Urlaub H, Vornlocher HP, Will CL, Gentzel M, Wilm M, Luhrmann R: Protein composition of human prespliceosomes isolated by a tobramycin affinity-selection method. Proc Natl Acad Sci U S A 2002, 99: 16719–16724. 10.1073/pnas.262483899
35. Rappsilber J, Ryder U, Lamond AI, Mann M: Large-scale proteomic analysis of the human spliceosome. Genome Res 2002, 12: 1231–1245. 10.1101/gr.473902
36. Bouwmeester T, Bauch A, Ruffner H, Angrand PO, Bergamini G, Croughton K, Cruciat C, Eberhard D, Gagneur J, Ghidelli S, et al.: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 2004, 6: 97–105. 10.1038/ncb1086
37. PSI-MI Proteomics Standards Initiative – Molecular Interaction [http://psidev.sourceforge.net/mi/xml/doc/user]
38. Neduva V, Linding R, Su-Angrand I, Stark A, de Masi F, Gibson TJ, Lewis J, Serrano L, Russell RB: Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol 2005, 3: e405. 10.1371/journal.pbio.0030405
39. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al.: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437: 1173–1178. 10.1038/nature04209
40. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, et al.: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122: 957–968. 10.1016/j.cell.2005.08.029
41. Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B, et al.: Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 2006, 38: 285–293. 10.1038/ng1747
42. Muthusamy B, Hanumanthu G, Suresh S, Rekha B, Srinivas D, Karthick L, Vrushabendra BM, Sharma S, Mishra G, Chatterjee P, et al.: Plasma Proteome Database as a resource for proteomics research. Proteomics 2005, 5: 3531–3536. 10.1002/pmic.200401335

http://www.ebi.ac.uk/intact/

IntAct Molecular Interaction Database

IntAct provides a freely available, open source database system and analysis tools for molecular interaction data. All interactions are derived from literature curation or direct user submissions and are freely available. The IntAct Team also produce the Complex Portal.

BioGRID interaction data are 100% freely available to both commercial and academic users and are provided WITHOUT ANY WARRANTY. Publications that make use of this data are requested to cite the contributing authors and, where applicable: Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: A General Repository for Interaction Datasets. Nucleic Acids Res. Jan 1; 34:D535-9.

Syn-Lethality: An Integrative Knowledge Base of Synthetic Lethality towards Discovery of Selective Anticancer Therapies

BioMed Research International
Volume 2014 (2014), Article ID 196034, 7 pages
http://dx.doi.org/10.1155/2014/196034
Research Article

1. Bioinformatics Research Centre (BIRC), School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
2. Institute for Infocomm Research (I2R), 1 Fusionopolis Way, Singapore 138632
3. Genome Institute of Singapore (GIS), Biopolis, Singapore 138672

Received 17 November 2013; Accepted 11 March 2014; Published 22 April 2014

Copyright © 2014 Xue-juan Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Synthetic lethality (SL) is a novel strategy for anticancer therapies, whereby mutations of two genes will kill a cell but mutation of a single gene will not. Therefore, a cancer-specific mutation combined with a drug-induced mutation, if they have SL interactions, will selectively kill cancer cells. While numerous SL interactions have been identified in yeast, only a few are known in humans. There is a pressing need to systematically discover and understand SL interactions specific to human cancer. In this paper, we present Syn-Lethality, the first integrative knowledge base of SL that is dedicated to human cancer. It integrates experimentally discovered and verified human SL gene pairs into a network, associated with annotations of gene function, pathway, and molecular mechanisms. It also includes yeast SL genes from high-throughput screenings which are mapped to orthologous human genes. Such an integrative knowledge base, organized as a relational database with a user interface for searching and network visualization, will greatly expedite the discovery of novel anticancer drug targets based on synthetic lethality interactions. The database can be downloaded as a stand-alone Java application.

1. Introduction

Finding effective anticancer therapies is a major goal of biomedical research. As a devastating human disease, cancer kills millions of people each year. In 2008, the World Health Organization (WHO) predicted that, if new anticancer treatments are not discovered, there will be 26.4 million cancer patients around the world and 17 million cancer deaths by 2030 [1]. The currently prevalent anticancer treatments, chemotherapies, have several limitations, including drug resistance and toxic side effects [2]. Although targeted therapies are being developed, the lack of selectivity (i.e., killing both tumour and healthy cells) remains a major issue for current anticancer therapeutics.

Recently, synthetic lethality (SL) has emerged as a novel anticancer strategy that promises to be highly selective. A pair of genes is defined to have a synthetic lethal interaction if a mutation in either gene alone will not kill the cell but mutations in both genes will lead to cell death [2] (Figure 1). Compared with healthy cells, cancer cells contain many genetic mutations. Hence, an SL partner of a cancer-specific mutation is potentially a selective anticancer drug target. A drug that induces a mutation to the SL partner gene will kill cancer cells but spare normal cells, due to the SL interaction with the cancer-specific mutation that is not present in healthy cells.

Figure 1: The concept of synthetic lethality. (a) If just one gene of the SL pair is mutated, the cell stays alive. A/B: wild-type genes; a/b: mutated genes. (b) Mutation/inhibition of one gene or both genes of an SL gene pair leads to different cell fates [2].
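The definition illustrated in Figure 1 can be written as a small truth-table check. The sketch below is only a restatement of that definition; the function name and the way viability is encoded are choices made for the example.

```python
def is_synthetic_lethal(viability):
    """Check the SL pattern for a gene pair.

    `viability` maps (a_mutated, b_mutated) to True if the cell survives.
    """
    return (
        viability[(False, False)]      # wild type A/B: alive
        and viability[(True, False)]   # only A mutated: alive
        and viability[(False, True)]   # only B mutated: alive
        and not viability[(True, True)]  # both mutated: dead
    )


sl_pair = {
    (False, False): True,
    (True, False): True,
    (False, True): True,
    (True, True): False,
}
essential_gene_case = {
    (False, False): True,
    (True, False): False,  # losing A alone already kills the cell
    (False, True): True,
    (True, True): False,
}

print(is_synthetic_lethal(sl_pair), is_synthetic_lethal(essential_gene_case))
# True False
```

The second case shows why single-gene lethality does not count: if losing one gene alone is lethal, targeting it would not spare healthy cells.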

However, the discovery and clinical application of SL-based anticancer therapies need to overcome several technical obstacles. Most known SL cases were discovered in yeast, and so far only a few SL gene pairs are known in humans. A prevalent technique to discover SL genes is high-throughput screening based on chemical or RNAi libraries [3]. Due to the genetic heterogeneity of cancer cells, the SL identified in one screening might not be repeatable on another platform or in other cancer subtypes. Importantly, screening-based discovery can hardly yield any mechanistic insight into SL interactions. The interpretation of SL candidates is crucial for reliable application of SL-based therapies. To address these issues, systems biology approaches that can uncover the molecular mechanisms of SL in cancer cells are needed.

The concept of SL originated in yeast genetics [4]. Due to its rapid generation time, simple culture, and easy-to-handle genetic manipulation, S. cerevisiae has been extensively used to study SL [5]. Computational methods have also been developed to predict and analyze yeast SL [6]. In contrast, there is a dearth of resources (e.g., data, knowledge, or bioinformatics tools) available about SL in human cancer. Recently, some methods have been developed to infer human SL from yeast SL, considering that the genome integrity and cell-cycle related genes from yeast are highly conserved in humans and closely related to cancer [7]. Massive screening of yeast SL interactions can provide valuable information for SL inference in human cancer. For example, Conde-Pueyo et al. applied the yeast-to-human inference method to obtain potential cancer-related SL targets and identified SL partners of cancer-related genes that are drug targets [8]. It is highly desirable to integrate data on human cancer SL pairs to keep the follow-up experimental research to a manageable size.

In this paper, we present an integrative knowledge base dedicated to SL in human cancer, called Syn-Lethality. From the literature, we collected SL gene pairs that have been experimentally discovered and verified and integrated them into a network (Figure 2), where each node is a gene and each edge represents an SL interaction. We call such a network an SL network. Moreover, we associated the SL network with related gene annotations and pathway information, to facilitate mechanistic understanding of SL. In addition to human-specific SL, we also collected yeast SL, which were mapped to human genes through orthologous correspondence. The information collected as such has been organized into a relational database with a user-friendly interface. When users input cancer genes (e.g., TP53), Syn-Lethality will search for SL partners of the query genes and display related annotations (e.g., pathways, gene functions, and hyperlinks to the related literature). The SL network we constructed serves as a roadmap for the whole knowledge base.

Figure 2: SL network of human cancer constructed from the SL literature. Each node in the network denotes a gene/protein and each edge represents an SL interaction (the arrow leads from the mutated gene to the target gene).
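The query behaviour described above (enter a cancer gene, get back its SL partners with annotations) can be pictured with a minimal in-memory sketch of a directed SL network. This is not the Syn-Lethality implementation, which is a relational database behind a Java application; the class, its methods, and the partner names and annotations below are all hypothetical placeholders.

```python
from collections import defaultdict


class SLNetwork:
    """Directed SL network: edges lead from the mutated gene to the target gene."""

    def __init__(self):
        self._targets = defaultdict(dict)

    def add_pair(self, mutated_gene, target_gene, **annotations):
        """Store one SL interaction together with free-form annotations."""
        self._targets[mutated_gene][target_gene] = annotations

    def partners(self, mutated_gene):
        """Return the SL partners of a query gene with their annotations."""
        return dict(self._targets.get(mutated_gene, {}))


network = SLNetwork()
# Placeholder entries; the partner names and annotations are not real data.
network.add_pair("TP53", "PARTNER_1", pathway="pathway_x", source="citation_1")
network.add_pair("TP53", "PARTNER_2", pathway="pathway_y", source="citation_2")

for target, info in sorted(network.partners("TP53").items()):
    print(target, info["pathway"], info["source"])
```

Keeping the edge direction explicit means a query on the mutated cancer gene returns only candidate drug targets, not genes for which the query gene is itself the target.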

To the best of our knowledge, Syn-Lethality is the first database dedicated to human synthetic lethality. There are few genome-wide screens for SL interactions with human cancer genes, and they focus on a few well-known cancer genes (e.g., TP53 and KRAS). Large-scale screening in human cancer cells is limited by high cost, false positives, and the difficulty of interpreting mechanisms, and the resulting information is scattered across the literature. An integrative approach is indispensable for a systematic and mechanistic understanding of human SL. The Syn-Lethality database is one of the first attempts to integrate knowledge and data about SL in human cancer. We have also integrated data from yeast and will do so in the future for other model organisms. We believe that it will be a valuable resource and framework that facilitates the discovery of potential selective anticancer therapies based on synthetic lethality.

2. Data Integration

2.1. Data Collection and the Literature Search

The primary aim of our Syn-Lethality database is to collect and maintain a high-quality set of SL gene pairs, which serves as a comprehensive, fully classified, and accurately annotated knowledge base for SL-related research. The database also provides extensive cross-references and querying interfaces. The SL pairs in the Syn-Lethality database are collected by two complementary methods, which we describe in more detail next.

The first method for collecting SL pairs is the literature search. We searched the Web of Knowledge and NCBI PubMed databases with keywords such as “synthetic lethality” and then screened the abstracts with the keyword “human cancer/tumour”. In this way, we collected more than one hundred scientific publications. From these articles, we manually extracted more than one hundred SL gene pairs that have been verified by experiments for cancer treatment. Although the number of SL pairs collected by the literature search is limited, they are highly trustworthy and thus lay the foundation for our Syn-Lethality database.

The second source of potential SL pairs is knowledge transfer from the model organism yeast to human by comparative genomics analysis. Currently, a considerable number of SL pairs in yeast have been detected experimentally by various screening techniques. Meanwhile, some human cancer genes (e.g., those related to the cell cycle and DNA repair) are observed to be highly conserved between human and yeast. Therefore, it is possible to infer some SL pairs in human cancers from yeast. We predict a human gene pair to be an SL pair in human cancer based on the following two constraints. First, this human gene pair has a conserved SL interaction in yeast. Second, one of these two genes is a cancer gene. For example, suppose two yeast genes A and B form an SL relationship while two human genes A′ and B′ are orthologs of A and B, respectively. If A′ or B′ is a gene that is observed to be mutated in a certain type of cancer, (A′, B′) is then a predicted SL pair in the human cancer. In this paper, all the yeast SL interactions are downloaded from BioGrid [9] (Table 1). However, we noticed that some of these yeast SL pairs from BioGrid involve essential genes. By the definition of SL (i.e., mutation of one gene should not kill the cell, but mutation of both genes kills the cell), both genes in an SL pair should be nonessential. Therefore, using the lists of essential genes downloaded from the Gerstein Lab at Yale University (http://bioinfo.mbb.yale.edu/genome/yeast/cluster/essential/) and the Saccharomyces Genome Deletion Project (http://www-sequence.stanford.edu/group/yeast_deletion_project/), we removed such pairs and retained 6,613 SL pairs without any essential genes. In addition, 507 human cancer genes are downloaded from COSMIC: Cancer Gene Census via the link http://cancer.sanger.ac.uk/cancergenome/projects/census/. Finally, we inferred 1,114 SL pairs related to human cancers from the yeast data.
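The inference procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `infer_human_sl`, the data structures, and all gene identifiers in the toy data are hypothetical and are not taken from the Syn-Lethality software. It applies the essential-gene filter first, then maps each remaining yeast pair to human orthologs and keeps pairs in which at least one partner is a cancer gene.

```python
from itertools import product

def infer_human_sl(yeast_sl_pairs, essential_yeast_genes, orthologs, cancer_genes):
    """Return predicted human SL pairs as a set of frozensets (hypothetical helper)."""
    predicted = set()
    for y1, y2 in yeast_sl_pairs:
        # By the definition of SL, both partners must be nonessential.
        if y1 in essential_yeast_genes or y2 in essential_yeast_genes:
            continue
        # A yeast gene may have zero, one, or several human orthologs.
        for h1, h2 in product(orthologs.get(y1, ()), orthologs.get(y2, ())):
            if h1 == h2:
                continue
            # Constraint 2: at least one partner is a known cancer gene.
            if h1 in cancer_genes or h2 in cancer_genes:
                predicted.add(frozenset((h1, h2)))
    return predicted

# Toy data with made-up identifiers, for illustration only.
yeast_sl = [("yA", "yB"), ("yC", "yD"), ("yE", "yF")]
essential = {"yC"}
orthologs = {"yA": ["hA"], "yB": ["hB"], "yC": ["hC"],
             "yD": ["hD"], "yE": ["hE"], "yF": ["hF"]}
cancer = {"hA", "hD"}

result = infer_human_sl(yeast_sl, essential, orthologs, cancer)
print(result)
```

In the toy data only the pair (hA, hB) survives: the second yeast pair contains an essential gene, and the third maps to human genes that are not cancer genes.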

Table 1: Representative entries for human cancer Syn-Lethality database.

Based on the above in silico analysis, the Syn-Lethality database contains 113 SL pairs from NCBI PubMed abstracts and 1,114 SL pairs from the model organism of yeast (Table 3). We also provide additional information about the genes/proteins involved in these SL pairs as shown in Table 1, for example, Entrez gene IDs, full gene name, symbols, gene type (oncogene or tumour suppressor gene), cancer type, pathway information, and some remarks on the molecular mechanisms.

2.2. Pathway/Mechanism Analysis of SL Pairs Directly from the Literature

From the list of SL gene pairs, it is interesting to note that a large fraction of SL pairs are involved in the fundamental processes of cell fate, cell cycle, and DNA damage response. We first take the KRAS oncogene as an example. A genome-wide RNAi screen was conducted to identify SL interaction partners of KRAS [10]. We observed that the SL interaction partners of KRAS are involved in mitotic progression, including the subunits of the anaphase-promoting complex/cyclosome (APC/C) (ANAPC1, ANAPC4, CDC16, and CDC27), cyclin A2 (CCNA2), kinesin-like protein 2C (KIF2C), KNL-1 (CASC5), hMis18a and hMis18b (C21ORF45 and OIP5), borealin (CDCA8), SMC4, and polo-like kinase 1 (PLK1). Inhibition of these genes leads to the death of cells in which KRAS has been mutated [10]. TP53 is another example. It is a major downstream effector of DNA-damage kinase pathways. In response to DNA damage, a normal cell activates a complex signaling network to arrest cell-cycle progression and facilitate DNA repair. In contrast, TP53-deficient tumor cells rely on other G2/M checkpoint regulators such as checkpoint kinase 1 (CHK1) to arrest cell-cycle progression. Recently, the SL interactions between mutated TP53 and the ATR/Chk1, WEE1, ATM/Chk2, and MK2 targets have been investigated [11]. As a third example, the myelocytomatosis viral oncogene homolog, MYC, is a multifunctional nuclear phosphoprotein that acts as a transcription factor and plays a role in cell cycle progression, apoptosis, and cellular transformation. Overexpression of MYC sensitizes fibroblasts to agonists of the TNF-related apoptosis-inducing ligand (TRAIL) death receptor DR5. It was shown that MYC mediates increased DR5 expression and signaling as a result of enhanced procaspase-8 autocatalytic activities [12].

The authors of [3, 13] proposed the following four types of mechanisms for SL interactions in human cancers from the perspectives of protein complexes and pathways. First, two complexes may be synthetic lethal when they have an essential function in common and they are uniquely redundant. Second, two units within an essential protein complex may form an SL relationship. Third, two components in a linear essential pathway may be SL partners, because the mutation of each component decreases the flow through the pathway but the pathway still has signal flow, whereas the mutation of both will destroy the pathway. Fourth, two components in two parallel essential pathways may be backups of each other for the lethality. Generally, SL pairs can be interpreted in terms of these four mechanisms. For example, EGFR and BRCA1 are an SL pair because they are in the same essential protein complex [14]. In this paper, we focus on the analysis of SL pairs from the perspective of signalling pathways and provide three SL examples in which the two partners are from two parallel pathways.

First, TANK binding kinase (TBK1) was identified as a synthetic lethal gene of KRAS [15]. TBK1 is a noncanonical inhibitor of κB protein (IκB) that is known to regulate nuclear factor κB (NF-κB) signalling. TBK1 activates NF-κB antiapoptotic signals involving c-Rel and BCL-XL (also known as BCL2L1) that were essential for survival. These findings indicate that TBK1 and NF-κB signalling pathways are essential in KRAS mutant tumours. Second, the inhibition of both the EGFR and Notch signalling pathways is dramatically more effective at suppressing tumor growth than blocking either pathway alone. Normally the activated form of Notch1 restores AKT activity and enables cells to overcome cell death after dual-pathway blockade [16]. Here, the combined EGFR and Notch inhibition significantly decreases AKT activation and thus suppresses tumor growth more effectively. Third, EGFR, a protooncogene, belongs to a family of four transmembrane receptor tyrosine kinases that mediate the growth, differentiation, and survival of cells. It is often overexpressed in aggressive triple negative breast cancers (TNBCs) and is also associated with other aggressive disease phenotypes. Nowsheen's group recently reported that a contextual synthetic lethality can be achieved both in vitro and in vivo with combined EGFR and PARP inhibition with lapatinib and ABT-888, respectively [14]. The mechanism involves a transient deficit of DNA double strand break repair induced by lapatinib and a subsequent activation of the intrinsic pathway of apoptosis. Our Syn-Lethality database contains SL pairs of genes that likely belong to one of the above four mechanisms. The gene function and pathway information in our database will facilitate in silico interpretation of mechanisms.

3. Database Interface

3.1. Usage of SL Database

Our synthetic lethality database stores SL gene pairs in organised form and provides an interface for querying them. Our preliminary database is available for download from http://www.ntu.edu.sg/home/zhengjie/software/Syn-Lethality/. The software is a Java executable file and requires Java to be installed. Version 10 of Java (free) is required and can be installed from http://www.java.com/en/download/index.jsp. Once Java is installed on the local machine, double-clicking the Java executable file launches the database interface. Since the database is distributed as a single setup file, many end users can run queries on it simultaneously (Figure 4). The database includes information such as synthetic lethal gene pairs, type of lethality, type of gene alteration, and target genes for synthetic lethality.

Searching in our database can be divided into the following categories.
(a) Simple Search. The user provides the abbreviation of a gene name. For example, for epidermal growth factor receptor we just need to write EGFR and for cyclin-dependent kinase we just need to write CDK in the search field. This lets the user search for SL gene pair information without typing long gene names.
(b) Batch Search. The user can directly copy and paste the names of several genes (separated by spaces) into each field. Figure 3 shows an example of using KRAS as input to query its related SL pairs. This helps find information for several synthetic lethal gene pairs simultaneously.
(c) Smart Search. Users can combine search terms with Boolean logic by selecting the AND and OR operators from the drop-down menu. This helps in analyzing various combinations of SL gene pairs.
(d) Genetic Alteration Search. The interface lets the user screen the SL pairs by the type of alteration of the gene that is mutated in cancer. The gene alteration types captured in our database include overexpression, mutation, activation, inactivation, and deficiency.
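The four search categories can be illustrated with a small in-memory sketch. It is not the Syn-Lethality implementation (which is a Java application whose internals are not described here); the `SLPair` record, the `search` function, and the field names are hypothetical, and the handful of records only echo gene pairs and alteration types mentioned in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SLPair:
    mutated_gene: str   # gene altered in cancer
    target_gene: str    # SL partner to be targeted
    alteration: str     # e.g. "mutation", "overexpression"

# Illustrative records only; not an export of the real database.
RECORDS = [
    SLPair("KRAS", "TBK1", "mutation"),
    SLPair("KRAS", "PLK1", "mutation"),
    SLPair("MYC", "DR5", "overexpression"),
    SLPair("TP53", "WEE1", "mutation"),
]

def search(records, genes, mode="OR", alteration=None):
    """Simple, batch, Boolean, and alteration-filtered search in one function.

    genes: one symbol (simple search) or several separated by spaces (batch).
    mode: "OR" keeps pairs touching any query gene, "AND" requires all of them.
    alteration: optional filter on the alteration type of the mutated gene.
    """
    wanted = {g.upper() for g in genes.split()}
    hits = []
    for r in records:
        involved = {r.mutated_gene, r.target_gene}
        match = wanted <= involved if mode == "AND" else bool(wanted & involved)
        if match and (alteration is None or r.alteration == alteration):
            hits.append(r)
    return hits

simple = search(RECORDS, "KRAS")
batch = search(RECORDS, "MYC TP53")
both = search(RECORDS, "kras tbk1", mode="AND")
overexpressed = search(RECORDS, "KRAS MYC", alteration="overexpression")
```

Treating every query as a set of upper-cased symbols keeps simple and batch search on one code path; the Boolean mode then reduces to a subset test (AND) or an intersection test (OR).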

Figure 3: An example of KRAS related SL pairs (the alteration types refer to the cancer mutated gene).
Figure 4: SL query interface.

As of now, it is possible to retrieve complete SL gene pair information based on information such as gene names (MYC, EGFR, CDK, and so forth) (Table 2) and types of genetic alterations (overexpression, mutation, activation, and so forth). The relevant research papers for the SL gene pair are provided via web hyperlinks in database search results.

Table 2: List of annotation database links in Syn-Lethality database.
Table 3: Total statistics for human cancer Syn-Lethality database.

3.2. Synthetic Lethality Network

To provide a clearer understanding of SL gene pairs, we constructed a network from the available SL gene pairs (Figure 2). The diagram depicts the synthetic lethal genes and their target genes. For example, the SL pair information for the MYC oncogene is depicted in Figure 5.
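As a rough illustration of how such a network and a per-gene subnetwork (as in Figure 5) might be represented, the sketch below stores the SL network as a directed adjacency map, with edges pointing from the mutated gene to the target gene as in the Figure 2 caption. The function names and the small edge list are assumptions made for this example, not part of the published database.

```python
from collections import defaultdict

def build_sl_network(pairs):
    """Directed adjacency map: mutated gene -> set of SL target genes."""
    network = defaultdict(set)
    for mutated, target in pairs:
        network[mutated].add(target)
    return network

def subnetwork(network, gene):
    """All edges that touch `gene`, either as mutated gene or as target."""
    edges = {(gene, t) for t in network.get(gene, ())}
    edges |= {(m, gene) for m, targets in network.items() if gene in targets}
    return edges

# Illustrative edges drawn from pairs mentioned in the text.
pairs = [("KRAS", "TBK1"), ("KRAS", "PLK1"), ("MYC", "DR5"), ("TP53", "WEE1")]
net = build_sl_network(pairs)

kras_edges = subnetwork(net, "KRAS")
tbk1_edges = subnetwork(net, "TBK1")
print(sorted(kras_edges))
```

Querying by a target gene (here TBK1) returns the incoming edge from KRAS, so the same helper serves both directions of lookup.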

Figure 5: Subnetwork of our SL network for human cancer.

4. Conclusion and Future Perspectives

Syn-Lethality is the first comprehensive database constructed by integrating experimentally validated SL pairs of human cancer with SL pairs inferred from yeast according to the orthologous relation between human and yeast. It is the first attempt to use experimentally verified SL pairs to construct an SL network. In the SL network, each node represents a gene/protein and each edge denotes an SL interaction, which can be easily linked to annotation information including gene/protein alteration type, screening method, pathway, mechanism, and the related literature. It is a valuable resource for better understanding SL mechanisms in human cancer and for developing anticancer medicine.

Considering that our current database only includes predicted SL pairs from yeast, it is desirable to collect and predict more SL pairs from other model organisms, such as Caenorhabditis elegans, zebrafish, and mouse. With the progress of experimental SL screening technology, more SL interactions are expected to be identified. We will continue to collect and curate SL pairs of genes. Additionally, using our SL database, we plan to develop data mining algorithms to quickly extract SL information and mechanistic insights. Moreover, by incorporating the signalling pathways associated with the SL pairs of genes, we will construct a comprehensive and global SL network of human cancer.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Authors’ Contribution

Xue-juan Li and Shital K. Mishra contributed equally to this paper.

Acknowledgments

This research was supported by NTU Start-up Grant (COE-SUG/RSS-1FEB11-1/8), Singapore Ministry of Education (MOE) AcRF Tier 1 Grant RG32/11 (M4010977), and ARC 39/13 (MOE2013-T2-1-079).

References

1. P. Boyle and B. E. Levin, World Cancer Report, IARC Press, 2008.
2. W. G. Kaelin, “Synthetic lethality: a framework for the development of wiser cancer therapeutics,” Genome Medicine, vol. 1, no. 10, article 99, 2009.
3. W. G. Kaelin Jr., “The concept of synthetic lethality in the context of anticancer therapy,” Nature Reviews Cancer, vol. 5, no. 9, pp. 689–698, 2005.
4. L. H. Hartwell, P. Szankasi, C. J. Roberts, A. W. Murray, and S. H. Friend, “Integrating genetic approaches into the discovery of anticancer drugs,” Science, vol. 278, no. 5340, pp. 1064–1068, 1997.
5. M. A. Heiskanen and T. Aittokallio, “Mining high-throughput screens for cancer drug targets: lessons from yeast chemical-genomic profiling and synthetic lethality,” WIREs Data Mining Knowl Discovery, vol. 2, no. 3, pp. 263–272, 2012.
6. M. Wu, X. J. Li, F. Zhang, X. L. Li, C. K. Kwoh, and J. Zheng, Meta-Analysis of Genomic and Proteomic Features To Predict Synthetic Lethality of Yeast and Human Cancer, ACM-BCB, 2013.
7. K. W. Y. Yuen, C. D. Warren, O. Chen, T. Kwok, P. Hieter, and F. A. Spencer, “Systematic genome instability screens in yeast and their potential relevance to cancer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 10, pp. 3925–3930, 2007.
8. N. Conde-Pueyo, A. Munteanu, R. V. Solé, and C. Rodríguez-Caso, “Human synthetic lethal inference as potential anti-cancer target gene detection,” BMC Systems Biology, vol. 3, article 116, 2009.
9. A. Chatr-Aryamontri, B. J. Breitkreutz, S. Heinicke et al., “The BioGRID interaction database: 2013 update,” Nucleic Acids Research, vol. 41, pp. 816–823, 2013.
10. J. Luo, M. J. Emanuele, D. Li et al., “A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene,” Cell, vol. 137, no. 5, pp. 835–848, 2009.
11. S. Morandell and M. B. Yaffe, “Exploiting synthetic lethal interactions between DNA damage signaling, checkpoint control, and p53 for targeted cancer therapy,” Progress in Molecular Biology and Translational Science, vol. 110, pp. 289–314, 2012.
12. Y. Wang, I. H. Engels, D. A. Knee, M. Nasoff, Q. L. Deveraux, and K. C. Quon, “Synthetic lethal targeting of MYC by activation of the DR5 death receptor pathway,” Cancer Cell, vol. 5, no. 5, pp. 501–512, 2004.
13. N. Le Meur and R. Gentleman, “Modeling synthetic lethality,” Genome Biology, vol. 9, no. 9, article 135, 2008.
14. S. Nowsheen, T. Cooper, J. A. Stanley, and E. S. Yang, “Synthetic lethal interactions between EGFR and PARP inhibition in human triple negative breast cancer cells,” PLoS ONE, vol. 7, no. 10, Article ID e46614, 2012.
15. D. A. Barbie, P. Tamayo, J. S. Boehm et al., “Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1,” Nature, vol. 462, no. 7269, pp. 108–112, 2009.
16. Y. Dong, A. Li, J. Wang, J. D. Weber, and L. S. Michel, “Synthetic lethality through combined Notch-epidermal growth factor receptor pathway inhibition in basal-like breast cancer,” Cancer Research, vol. 70, no. 13, pp. 5465–5474, 2010.

Predicting Cancer-Specific Vulnerability via Data-Driven Detection of Synthetic Lethality

Volume 158, Issue 5, 28 August 2014, Pages 1199–1209

Resource

Referred to by
• DAISY: Picking Synthetic Lethals from Cancer Genomes

• Cancer Cell, Volume 26, Issue 3, 8 September 2014, Pages 306-308

Highlights

Genome-scale data-driven identification of synthetic lethality in cancer

Synthetic lethality networks successfully predict cancer gene essentiality

Synthetic lethality networks predict 15 year survival in breast cancer patients

Synthetic dosage lethality networks predict drug response in cancer

Summary

Synthetic lethality occurs when the inhibition of two genes is lethal while the inhibition of each single gene is not. It can be harnessed to selectively treat cancer by identifying inactive genes in a given cancer and targeting their synthetic lethal (SL) partners. We present a data-driven computational pipeline for the genome-wide identification of SL interactions in cancer by analyzing large volumes of cancer genomic data. First, we show that the approach successfully captures known SL partners of tumor suppressors and oncogenes. We then validate SL predictions obtained for the tumor suppressor VHL. Next, we construct a genome-wide network of SL interactions in cancer and demonstrate its value in predicting gene essentiality and clinical prognosis. Finally, we identify synthetic lethality arising from gene overactivation and use it to predict drug efficacy. These results form a computational basis for exploiting synthetic lethality to uncover cancer-specific susceptibilities.

Introduction

Synthetic lethality occurs when the perturbation of two nonessential genes is lethal (Hartwell et al., 1997). This phenomenon offers a unique opportunity to develop selective anticancer drugs that will target a gene whose synthetic lethal (SL) partner is inactive only in the cancer cells (Ashworth et al., 2011 and Hartwell et al., 1997). Toward the realization of this potential, screening technologies have been developed to detect SL interactions in model organisms (Byrne et al., 2007, Costanzo et al., 2010 and Typas et al., 2008) and in human cell lines (Bassik et al., 2013, Brough et al., 2011 and Laufer et al., 2013). However, currently their scope is not sufficiently broad to encompass the large volume of genetic interactions that need to be surveyed across different cancer types. New bioinformatics approaches are hence called for to guide and complement the experimental search for SL interactions in cancer.

Previous computational approaches developed to systematically study genetic interactions have mainly focused on yeast, where there are genome-wide maps of experimentally determined SL interactions (Chipman and Singh, 2009, Kelley and Ideker, 2005, Szappanos et al., 2011 and Wong et al., 2004). In cancer, synthetic lethality has been computationally inferred by mapping SL interactions in yeast to their human orthologs (Conde-Pueyo et al., 2009) and by utilizing metabolic models and evolutionary characteristics of metabolic genes (Folger et al., 2011, Frezza et al., 2011 and Lu et al., 2013). Here, we analyze the rapidly accumulating cancer genomic data to identify candidate SL interactions via the data mining synthetic lethality identification pipeline (DAISY). We show that genome-wide cancer SL networks can be used to successfully predict gene essentiality, drug response, and clinical prognosis.

Results

The DAISY

DAISY is an approach for statistically inferring SL interactions from cancer genomic data of both cell lines and clinical samples. DAISY applies three statistical inference procedures, each tailored to specific cancer data sets.

The first inference strategy, termed genomic survival of the fittest (SoF), is based on the observation that cancer cells that have lost two SL-paired genes do not survive; they are strongly selected against. Accordingly, as cells harboring SL coinactivation are eliminated from the cell population, SL interactions can be identified by analyzing somatic copy number alterations (SCNA) and somatic mutation data and detecting events of gene coinactivation that occur significantly less often than expected. In fact, very similar concepts are already extensively used to analyze the outcomes of small hairpin RNA (shRNA) screens in cell lines, in which essential genes and SL gene pairs are detected by identifying the shRNA probes that have been rapidly eliminated from the cell population (Cheung et al., 2011 and Marcotte et al., 2012). More recently, a related concept was implemented to identify synthetic lethality in glioblastoma (Szczurek et al., 2013).
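One simple way to make "coinactivation that occurs significantly less than expected" concrete is a lower-tail hypergeometric test on sample counts. The sketch below is a swapped-in illustration of the idea, not DAISY's actual statistic, which is defined in the paper's Experimental Procedures and may differ; the function name and the example counts are invented.

```python
from math import comb

def depletion_p_value(n_samples, n_a_inactive, n_b_inactive, n_both_inactive):
    """P(X <= n_both_inactive) when inactivation of A and B is independent.

    X follows a hypergeometric distribution: among n_samples tumors,
    n_a_inactive have gene A inactive, and we draw the n_b_inactive tumors
    in which gene B is inactive. A small value means coinactivation is
    rarer than chance would predict.
    """
    total = comb(n_samples, n_b_inactive)
    upper = min(n_both_inactive, n_a_inactive, n_b_inactive)
    p = 0.0
    for k in range(upper + 1):
        # comb returns 0 when the second argument exceeds the first.
        p += comb(n_a_inactive, k) * comb(n_samples - n_a_inactive, n_b_inactive - k) / total
    return p

# Invented counts: 100 tumors, each gene inactive in 30 of them.
p_never_together = depletion_p_value(100, 30, 30, 0)   # never coinactive
p_as_expected = depletion_p_value(100, 30, 30, 9)      # 9 is the expected overlap
print(p_never_together, p_as_expected)
```

With these counts the expected overlap under independence is 30 × 30 / 100 = 9 tumors, so observing none is strong evidence of depletion, while observing 9 is unremarkable.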

The second inference strategy, shRNA-based functional examination, is based on the notion that the knock down of a synthetically lethal gene is lethal to cancer cells where its SL partner is inactive. Accordingly, the SL pairs of a given gene can be detected by searching for genes whose underexpression and low copy number induce its essentiality. This can be conducted via an integrative analysis of the results obtained in shRNA essentiality screens and their accompanying SCNA and transcriptomic profiles.

The third procedure, pairwise gene coexpression, is based on the notion that SL pairs tend to participate in closely related biological processes and hence are likely to be coexpressed (Costanzo et al., 2010 and Kelley and Ideker, 2005). We show that this trend indeed holds in known SLs that have been experimentally detected in cancer (Figure 2).
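The coexpression criterion amounts to measuring how strongly the expression profiles of two genes track each other across samples. The following minimal sketch uses the Pearson correlation coefficient for this; the choice of Pearson is an assumption made for illustration, since the correlation measure and significance threshold DAISY actually uses are given in the paper's Experimental Procedures. The expression values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up expression levels of three genes across five samples.
gene_a = [2.0, 3.5, 5.0, 6.5, 8.0]
gene_b = [1.0, 1.9, 3.1, 4.0, 5.2]   # rises with gene_a
gene_c = [7.0, 5.5, 4.0, 2.5, 1.0]   # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
print(r_ab, r_ac)
```

Under this sketch, a pair such as (gene_a, gene_b) would pass a positive-coexpression filter, whereas (gene_a, gene_c) would not.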

Given SCNA, somatic mutation, shRNA, and gene expression data of thousands of cancer samples, DAISY traverses over all gene pairs (∼534 million) and examines for every pair if it fulfills each one of the three criteria described above. Gene pairs that fulfill all three criteria in a statistically significant manner are predicted to be SL pairs. Here, we applied DAISY to analyze nine different genome-wide cancer data sets (Barretina et al., 2012, Beroukhim et al., 2010, Cheung et al., 2011, Garnett et al., 2012, Luo et al., 2008, Marcotte et al., 2012 and Cancer Genome Atlas Research Network et al., 2013) (Table S1 available online).

We expanded DAISY to also detect synthetic dosage lethality (Sajesh et al., 2013). While two genes form an SL pair if the inactivation of one gene renders the other essential, two genes form a synthetic dosage lethal (SDL) pair if the overactivity of one of them renders the other gene essential. Importantly, SDL interactions can permit the eradication of cancer cells with overactive oncogenes that are difficult to target directly (such as KRAS), by targeting the SDL partners of such oncogenes. DAISY detects SDL interactions via three inference procedures that are analogous to those outlined above for detecting SL interactions (Figure 1; Experimental Procedures). More specifically, DAISY defines two genes, A and B, as an SDL pair if their expression is correlated and if the overactivity (amplification and overexpression) of gene A induces the essentiality of gene B. Induced essentiality is detected in two ways: first, according to shRNA screens, by examining if gene B becomes essential when gene A is overactive; second, according to SCNA data, by examining if gene B has a higher SCNA level when gene A is overactive.

Evaluating DAISY Based on Experimentally Detected SL Interactions in Cancer

We first examined DAISY based on SL interactions that have been experimentally tested in cancer. We applied DAISY to identify the SL partners of PARP1, the tumor suppressors VHL and MSH2, and the SDL partners of the oncogene KRAS. The predictions were performed for over 7,276 gene pairs that have been experimentally tested in six large scale screens (Bommi-Reddy et al., 2008, Lord et al., 2008, Luo et al., 2009, Martin et al., 2009, Steckel et al., 2012 and Turner et al., 2008). For every gene pair, DAISY returns four p values that denote the significance of the SL or SDL interaction between the two genes according to each one of the three inference strategies described in the previous section and according to all three approaches together (Figure 1; Experimental Procedures). We utilized these p values to examine the predictions along an increasing p value threshold and generate receiver operating characteristic (ROC) curves (Extended Experimental Procedures).
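Sweeping a p value threshold and integrating the resulting ROC curve is equivalent to asking how often a true interaction receives a smaller p value than a non-interaction. The sketch below computes the area under the ROC curve (AUC) in that rank-based form; the function name and the toy p values and labels are invented for illustration and do not reproduce the paper's data.

```python
def auc_from_p_values(p_values, labels):
    """AUC when pairs are ranked by increasing p value.

    labels: True/1 for experimentally confirmed interactions, False/0 otherwise.
    Equals the probability that a random positive has a smaller p value than
    a random negative, counting ties as one half.
    """
    positives = [p for p, is_pos in zip(p_values, labels) if is_pos]
    negatives = [p for p, is_pos in zip(p_values, labels) if not is_pos]
    wins = 0.0
    for p_pos in positives:
        for p_neg in negatives:
            if p_pos < p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: three confirmed pairs, three unconfirmed pairs.
p_values = [0.001, 0.02, 0.3, 0.04, 0.5, 0.9]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_from_p_values(p_values, labels)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the AUC values reported in this section should be read.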

The DAISY predictor obtains an overall AUC of 0.779, which shows the concordance between the predicted and observed SL and SDL pairs (empirical p value <1 × 10^−4; Figure 2A). To assess which of the inference strategies enables DAISY to correctly predict synthetic lethality, we repeated the predictions when using the p values obtained based on only one inference strategy at a time (Figure 2A). An AUC of 0.683 was obtained by predicting SL interactions based only on the SoF approach. These results are further improved by requiring that the gene pairs also be coexpressed, reaching an AUC of 0.770. As shRNA-based functional examination is not predictive on its own (an AUC of 0.477), we modified DAISY to consider the shRNA criterion as a soft constraint (Experimental Procedures). Despite the nonpredictability of the shRNA-based functional examination approach in this task, shRNA data are important for the generation of predictive SDL-networks (Supplemental Information; Figure S6). Importantly, DAISY captures well-established and clinically important SL interactions, including the prominent SL interaction between PARP1 and BRCA1/BRCA2 and the synthetic lethality between MSH2 and DHFR (Figures 2B–2G).

Experimentally Examining the DAISY-Predicted SL Partners of the Tumor Suppressor VHL

We next turned to experimentally test SL predictions of the tumor suppressor VHL that is frequently mutated in cancer, especially in clear cell renal carcinomas (Bommi-Reddy et al., 2008). We applied DAISY to predict the SL partners of VHL and identify among these genes those that are essential in renal carcinoma cells (RCC4) exclusively due to the loss of VHL, resulting in a set of 44 genes (Extended Experimental Procedures).

We performed a small interfering RNA (siRNA) screen to examine if the predicted genes are preferentially essential in VHL−/− renal carcinoma cells compared with isogenic cells in which pVHL function was restored (Extended Experimental Procedures). Overall, compared to the VHL-restored cells, the VHL-deficient cells are significantly more sensitive to the knockdown of the predicted VHL-SL partners (paired t test p value of 8.25 × 10^−4) (Figure 3A, Table S2). Reassuringly, compared to the VHL-restored cells, the VHL-deficient cells are not significantly more sensitive to the knockdown of a control set of 30 randomly selected genes (paired t test p value of 0.255). Compared to another screen that searched for the SL partners of VHL among 88 kinases (Bommi-Reddy et al., 2008), our screen detected 3.83 times more SL genes (Bernoulli p value of 4.76 × 10^−9; Extended Experimental Procedures).
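The paired t test used here compares, gene by gene, the viability after knockdown in VHL-deficient cells with that in VHL-restored cells. The sketch below computes the paired t statistic and its degrees of freedom with the standard library only; converting the statistic to a p value would additionally need the t distribution, which is omitted. The viability numbers are invented and are not taken from Table S2.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Invented relative viabilities after knocking down four predicted SL partners.
vhl_deficient = [0.40, 0.55, 0.30, 0.62]
vhl_restored = [0.70, 0.80, 0.65, 0.85]

t_stat, dof = paired_t_statistic(vhl_deficient, vhl_restored)
print(t_stat, dof)
```

A strongly negative statistic, as in this toy example, indicates that the deficient cells are consistently less viable than their restored counterparts after the same knockdowns.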

We then measured the response of the renal cells to nine drugs whose targets were predicted by DAISY to be selectively essential in the VHL-deficient renal cells. Of note, these drugs are not currently administered to treat cancer, but are Food and Drug Administration (FDA)-approved to treat other clinical conditions, such as hypertension and depression. We managed to identify effects on cell growth for six out of the nine drugs. As predicted, the VHL-deficient cells were significantly more sensitive to each one of these six drugs (higher percentage of inhibition at mideffective concentration) (Figure 3B; Table S2). Reassuringly, this specificity was not observed with the negative control drug Staurosporine, indicating that the selective effect is not due to a general susceptibility of the VHL-deficient cells.

Applying DAISY to Construct Genome-wide Networks of SL and SDL Interactions in Cancer

We applied DAISY to identify all gene pairs that are likely to be synthetically lethal in cancer, resulting in an SL network of 2,077 genes and 2,816 SL interactions (Figure 4), and an SDL network of 3,158 genes and 3,635 SDL interactions (Table S3). As each of the nine data sets examined were analyzed separately to identify SL (SDL) pairs, we tested the mutual overlap between the resulting SL (SDL) sets and found it to be significantly higher than expected (Figure S1).

Both networks display scale-free-like characteristics and are enriched with known cancer-associated genes and biological functions (Figures S1 and S2; Table S4). The genes included in the networks are significantly overexpressed both in normal tissues and especially in cancers (Wilcoxon rank sum p values <6.29 × 10^−8). Interestingly, the network genes are significantly associated with cancer proliferation and less associated with normal proliferation (Waldman et al., 2013). They are highly enriched with human orthologs of mouse essential genes (hypergeometric p values <1 × 10^−30) and are evolutionarily conserved (Wilcoxon rank sum p values <2.99 × 10^−17). Moreover, each one of these properties is further emphasized in genes that have a higher degree in the SL or SDL networks (Supplemental Information; Figure S2).

The SL and SDL pairs are highly enriched with genes that interact in the protein-protein interaction (PPI) network (hypergeometric p values <4.02 × 10^−7). Testifying to their importance, genes included in the SL or SDL networks have a higher degree in the PPI network compared to other genes, especially if their degree in the SL or SDL network is high (Wilcoxon rank sum p values <5.79 × 10^−22; Figure S2D). Examining the genomic location of the SL and SDL pairs, we find that while SL pairs tend to reside on different chromosomes, or at a large distance from each other on the same chromosome, the SDL gene pairs show the opposite behavior. The latter trend is observed also when identifying SDL interactions without considering the SoF approach. Discarding SDL gene pairs that reside close to each other depreciates the predictive signal of the network (Supplemental Information; Figure S3).

As a direct experimental validation of the predicted SL and SDL interactions is yet impossible on a genome scale, we tested the interactions by examining their utility in three fundamental prediction assignments, the prediction of gene essentiality, clinical prognosis, and drug efficacy. In all tasks, the networks are utilized to generate cancer-specific predictions given a genomic characterization of a specific cancer cell line or clinical sample.

SL-Based Prediction of Gene Essentiality in Cancer Cell Lines

Predicting gene essentiality based on the SL network is cell-line-specific. Indeed, examining the results of shRNA screens, the majority of genes are essential in very few cancer cell lines (Supplemental Information; Figure S4A). As we examined the predictions based on the results obtained in shRNA gene knockdown screens, we constructed an SL network without any shRNA data to avoid potential circularity. Based on this SL network and the genomic profiles of the cell lines, we predicted a gene as essential in a given cell line if one or more of its SL partners is inactive in that cell line (see Supplemental Information for further details, analyses, and results).
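The prediction rule above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the data layout (SL pairs as tuples, inactive genes as a set), and the toy gene names are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def predict_essential_genes(
    sl_pairs: Iterable[Tuple[str, str]],
    inactive_genes: Set[str],
) -> Dict[str, int]:
    """Map each predicted-essential gene to its number of inactive SL partners.

    A gene appears in the result if at least one of its SL partners is
    inactive in the cell line described by `inactive_genes`.
    """
    counts: Dict[str, int] = {}
    for a, b in sl_pairs:
        # SL pairs are treated as symmetric: losing either partner
        # makes the other one a candidate essential gene.
        if a in inactive_genes:
            counts[b] = counts.get(b, 0) + 1
        if b in inactive_genes:
            counts[a] = counts.get(a, 0) + 1
    return counts


# Hypothetical toy network and cell-line profile (placeholder gene names).
toy_pairs = [("G1", "G2"), ("G1", "G3"), ("G4", "G1"), ("G5", "G6")]
toy_inactive = {"G2", "G3"}
predicted = predict_essential_genes(toy_pairs, toy_inactive)
```

Keeping the count of inactive partners, rather than a yes/no flag, makes it easy to rank genes by how many inactive SL partners support the prediction.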

Overall, we predicted gene essentiality in 129 different cancer cell lines and examined the predictions based on the results of two large-scale gene essentiality screens (Cheung et al., 2011 and Marcotte et al., 2012). Per cell line, the predicted essential genes are significantly enriched with genes that were found experimentally to be essential in the pertaining cell line (empirical p value < 2.52 × 10⁻⁴; Supplemental Information; Figure 5A; Table S5). Furthermore, the higher the number of predicted inactive SL partners a gene has, the more essential it is according to the experimental data (Figures 5B and 5C). Of note, the SL network succeeds more in predicting gene essentiality in cell lines with a higher number of gene deletions (Supplemental Information; Figures S4B and S4C; Table S5). Indeed, in such cell lines it is more likely that gene essentiality arises due to synthetic lethality. Finally, we predicted gene essentiality based on gene pairs that are human orthologs of yeast SLs (Conde-Pueyo et al., 2009). This, however, leads to markedly inferior performance, testifying to the value of the DAISY-inferred SLs (Supplemental Information; Figures S4D and S4E; Table S5).

We improved the unsupervised SL-based gene essentiality predictions described above by considering additional features that describe the state of a specific gene in a given cell line according to the SL network (e.g., the average SCNA level of its SL partners). Using these features, we trained neural network models on gene essentiality data (Extended Experimental Procedures). The performances of these supervised prediction models on unseen test sets resulted in ROC curves with AUCs of 0.755 and 0.854 for the Marcotte et al. (2012) and Achilles (Cheung et al., 2011) data, respectively (Figures 5D and 5E). For comparison, we considered the nine cell lines that were tested in both screens and utilized the shRNA scores obtained in one screen to predict gene essentiality according to the other screen (Extended Experimental Procedures). Using the Achilles screen to predict gene essentiality as reported in the Marcotte screen, or vice versa, results in inferior prediction performance, with AUCs of 0.663 and 0.706, respectively.

To further examine the SL-based gene essentiality predictions, we conducted a whole genome siRNA screen in the breast cancer cell line BT549 under normoxia and hypoxia (Extended Experimental Procedures; Table S6). We defined a refined set of essential genes, composed of genes that are essential in BT549 according to our siRNA screen under both conditions and according to the shRNA screen of Marcotte et al. (2012). Indeed, the performance of the SL-based predictor (that was not trained on gene essentiality data of BT549) is further improved when tested on this refined set of essential genes, obtaining an AUC of 0.951 (Figures 5F and S4F–S4K; Supplemental Information).

Counderexpression of SL Pairs Is a Marker of Better Prognosis in Breast Cancer

To examine the SL network in a clinical setting, we analyzed gene expression and 15 year survival data in a cohort of 1,586 breast cancer patients (Curtis et al., 2012). We postulated that counderexpression of two SL-paired genes would increase tumor vulnerability and result in better prognosis. To test this hypothesis, we classified the patients according to each SL pair into two groups: patients whose tumors counderexpressed the two SL-paired genes (SL− group) and patients whose tumors expressed at least one of these genes (SL+ group). For each SL pair, we computed a signed Kaplan-Meier (KM) score (Extended Experimental Procedures). The higher the signed KM score is, the better the prognosis of the SL− group is compared to the SL+ group. Indeed, the signed KM scores of the SL pairs are significantly higher than those of randomly selected gene pairs (one-sided Wilcoxon rank sum p value of 3.09 × 10⁻⁵⁹). To examine if this result arises from the mere essentiality of genes in the SL network rather than the interaction between them, we repeated the analysis with randomly selected gene pairs involving genes from the SL network that are not connected by SL interactions. Reassuringly, the SL pairs have significantly higher signed KM scores also compared to these random SL network gene pairs (one-sided Wilcoxon rank sum p value of 2.00 × 10⁻⁹). Highly significant KM plots were obtained based on 271 SL pairs (log rank and Cox regression p values <0.05, following multiple hypotheses testing correction) (Figure 6A; Table S7).

Next, we classified the patients according to all the SL pairs in the network together. For each sample, we computed a global SL score that denotes the number of SL pairs it counderexpressed. As predicted, samples that counderexpressed a high number of SL pairs had a significantly better prognosis compared to those that counderexpressed a low number of SL pairs (log rank p value of 1.482 × 10⁻⁷; Figures 6B and 6C). Again, we examined if this result is due to the mere essentiality of the SL network genes rather than due to the specific SL interactions; we repeated this analysis using 10,000 topology preserving randomized networks consisting of the breast cancer essential genes (Marcotte et al., 2012) (Extended Experimental Procedures). Reassuringly, none of these random networks managed to predict patient survival as significantly as the SL network.
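As a rough illustration of the global SL score, the sketch below counts the SL pairs for which both genes are underexpressed in a sample. The function name and the toy inputs are hypothetical, and how underexpression is called per gene is left to the caller.

```python
from typing import Iterable, Set, Tuple


def global_sl_score(
    sl_pairs: Iterable[Tuple[str, str]],
    underexpressed: Set[str],
) -> int:
    """Number of SL pairs whose two genes are both underexpressed in a sample."""
    return sum(
        1
        for a, b in sl_pairs
        if a in underexpressed and b in underexpressed
    )


# Hypothetical toy SL network and two tumor samples (placeholder gene names).
toy_network = [("G1", "G2"), ("G2", "G3"), ("G4", "G5")]
sample_high = {"G1", "G2", "G3"}  # counderexpresses two SL pairs
sample_low = {"G1", "G4"}         # counderexpresses none

score_high = global_sl_score(toy_network, sample_high)
score_low = global_sl_score(toy_network, sample_low)
```

Under the hypothesis tested in the text, a sample like `sample_high` would be expected to have the better prognosis of the two.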

Because breast cancer is a highly heterogeneous disease, we examined whether higher global SL scores are associated with improved prognosis in specific and more homogeneous groups of patients, each consisting of the same subtype, grade, or genomic instability level (Bilal et al., 2013). This is indeed the case for all groups except one: grade 1 patients. The global SL scores provide the most significant separation in the grade 2, normal-like subtype, and moderate genomic instability groups (log rank p values of 8.64 × 10⁻⁵, 1.01 × 10⁻³, and 1.25 × 10⁻⁴, respectively). As expected, the global SL score is significantly negatively correlated with the tumor grade and genomic instability level (Spearman correlation coefficients of −0.407 and −0.267, p values of 2.58 × 10⁻⁶² and 2.43 × 10⁻²⁷, respectively) and highly associated with the tumor subtype (ANOVA p value of 4.25 × 10⁻¹⁰²; Figure S5). Normal-like tumors have the highest global SL scores, while basal tumors have the lowest scores (Figure S5E). Notably, the prognostic value of the global SL score is significant even when accounting for the tumor grade, subtype, or genomic instability level (Cox p values of 7.18 × 10⁻⁴, 3.12 × 10⁻⁷, and 4.37 × 10⁻⁸, respectively). Lastly, the prognostic value of the global SL scores is superior to that obtained by using genomic instability levels (Figures S5I and S5J).

Harnessing SDL Interactions to Predict Drug Efficacy

We utilized the SDL network to predict the response of various cancer cell lines to anticancer drugs. As these drugs mainly target oncogenes, we used the SDL network to predict their efficacy rather than the SL network, whose performance is indeed inferior in this task (Supplemental Information). Based on the SDL network and the genomic profiles of the cancer cell lines, we predicted for each drug which cell lines are sensitive and which are resistant to its administration (Extended Experimental Procedures). More specifically, if one of the drug targets had more than one overexpressed SDL partner in a given cell line, the cell line was predicted to be sensitive to the drug administration (Supplemental Information).
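The sensitivity rule in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout mapping each target to its SDL partners, and the drug and gene names are hypothetical, and "more than one" is encoded as a threshold of at least two overexpressed partners.

```python
from typing import Dict, Iterable, Set


def predict_sensitive(
    drug_targets: Iterable[str],
    sdl_partners: Dict[str, Set[str]],
    overexpressed: Set[str],
    min_partners: int = 2,
) -> bool:
    """Return True if any drug target has at least `min_partners`
    overexpressed SDL partners in the cell line."""
    for target in drug_targets:
        partners = sdl_partners.get(target, set())
        if len(partners & overexpressed) >= min_partners:
            return True
    return False


# Hypothetical SDL network: target gene -> genes whose overactivity
# is predicted to make the target essential.
toy_sdl = {"T1": {"P1", "P2", "P3"}, "T2": {"P4"}}
toy_drug_targets = ["T1", "T2"]

cell_line_a = {"P1", "P3", "P9"}  # two overexpressed partners of T1
cell_line_b = {"P2", "P4"}        # only one partner per target

sensitive_a = predict_sensitive(toy_drug_targets, toy_sdl, cell_line_a)
sensitive_b = predict_sensitive(toy_drug_targets, toy_sdl, cell_line_b)
```

Cell lines for which the function returns False would be the predicted resistant group in this simplified view.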

To test this approach, we utilized two data sets of drug efficacies that were measured in a panel of cancer cell lines: (1) the Cancer Genome Project (CGP) data (Garnett et al., 2012), and (2) the Cancer Therapeutics Response Portal (CTRP) data (Basu et al., 2013). The SDL network enabled us to predict the response of 593 cancer cell lines to 23 drugs and of 241 cancer cell lines to 33 additional drugs when utilizing the CGP and CTRP data sets to test the predictions, respectively. Overall, drugs are significantly more effective in the predicted sensitive cell lines than in the predicted resistant cell lines (empirical p values <5.34 × 10⁻⁴; Figures 7A and 7B; Table S8). Considering only the predictions that were obtained for drugs with a sufficiently high number of SDL interactions increases the fraction of drugs that are significantly predicted (Figure 7C). As predicted, the efficacies of drugs increase with the number of overexpressed SDL partners that their targets have in a given cell line (Figure 7D). Exceptions to this trend may be explained by noting that drug efficacy is determined only partially by the essentiality of the drug targets, and additional factors, like the drug membrane permeability, may affect drug efficacies. For comparison, we predicted drug response by applying two other well established approaches: (1) based on the mutation and copy-number status of the drug target(s), and (2) based on the genomic instability index of the cancer cells. The SDL network generates significant predictions for more than twice as many drugs compared to these competing predictors (Supplemental Information).

Focusing on the drugs that were most accurately predicted by using the SDL network, we found that each one of the SDL interactions involving the targets of these drugs is sufficient, on its own, to accurately predict the response to the pertaining drug (Figure 7E; Extended Experimental Procedures). Among these interactions is the predicted SDL interaction between EGFR and IGFBP3, whose overexpression should accordingly induce sensitivity to drugs targeting EGFR. Reassuringly, it has been shown that IGFBP3 is underexpressed in Gefitinib-resistant cells, and the addition of recombinant IGFBP3 restored the ability of Gefitinib to inhibit cell growth (Guix et al., 2008). Another interesting example is the predicted SDL interaction between PARP1 and MDC1. The latter contains two BRCA1 C-terminal motifs and also regulates BRCA1 localization and phosphorylation in DNA damage checkpoint control (Lou et al., 2003). Indeed, BRCA1/BRCA2 are known to be synthetically lethal with PARP1 (Lord et al., 2008).

In a manner analogous to that described earlier for predicting gene essentiality, we utilized the SDL network to build supervised neural network predictors of drug efficacies in cancer cell lines (Extended Experimental Procedures). Using only 53 features, we predicted drug efficacies with Spearman correlation coefficients of 0.721 and 0.547 and p values <1 × 10⁻³⁵⁰ for the CGP and CTRP data, respectively (Figures 7F–7I). We further examined our SDL-based predictors by analyzing results of a large pharmacological screen carried out recently by the same team as CTRP. In this study, the efficacies of ∼500 compounds were measured across >850 cancer cell lines (P.A.C., personal communication). One hundred and twenty-six of the tested compounds have at least one target in the SDL network, making it possible to predict the response to their administration. Based on the SDL network and the genomic profiles of these cell lines (Barretina et al., 2012), we predicted the efficacies of these drugs by using the unsupervised and supervised predictors (trained on the CTRP data). The SDL-based predictors obtained significant predictions (p value < 0.05) of drug efficacy for 83 (65.87%) and 70 (55.6%) drugs, when applying the unsupervised or supervised approach, respectively.

Discussion

DAISY is a genome-scale, data-driven approach for the identification of cancer SL and SDL interactions. As shown, DAISY successfully captures the results obtained in key large-scale experimental studies exploring SLs in cancer, identifies valid SL interactions, and enables the prediction of gene essentiality, drug efficacy, and clinical prognosis in cancer.

DAISY presents a complementary effort to genetic and chemical screens, narrowing down the number of gene pairs that need to be examined experimentally to detect SL and SDL interactions in cancer. Based on the ROC curve presented in Figure 2A, an experimental screen for discovering SL interactions could be designed to check the SL pairs predicted by DAISY such that 5%, 25%, 50%, or 70% of all existing SL interactions will be detected by examining only 0.25%, 4%, 14%, or 24% of all possible gene pairs, respectively. Hence, testing only the most confident predictions will make it possible to find up to 20 times more SL pairs than expected at random. Likewise, by applying DAISY to design an siRNA screen for detecting the SL interactions of VHL, we identified almost four times as many SL interactions compared to a screen that was designed by applying biological reasoning. In light of these results, DAISY could facilitate a more rapid and rational discovery of SL interactions in cancer by guiding focused experimental screens.
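The "up to 20 times" figure follows from dividing the fraction of SL interactions detected by the fraction of gene pairs examined at each operating point quoted above. A short sketch of that arithmetic (variable names are illustrative only):

```python
# Operating points quoted in the text, in percent.
detected_pct = [5, 25, 50, 70]     # share of all SL interactions recovered
examined_pct = [0.25, 4, 14, 24]   # share of all gene pairs that must be tested

# A random screen recovers SL pairs in proportion to the pairs it examines,
# so the fold enrichment over random is detected / examined.
fold_enrichment = [d / e for d, e in zip(detected_pct, examined_pct)]

for d, e, f in zip(detected_pct, examined_pct, fold_enrichment):
    print(f"detect {d}% by testing {e}% of pairs: {f:.2f}x over random")
```

The enrichment is highest (20-fold) at the most stringent operating point and falls to roughly 3-fold at the most permissive one, which is why the text speaks of testing only the most confident predictions.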

Nonetheless, DAISY has several limitations one needs to account for. First, it is restricted to the identification of SL interactions in cancer, as it is based on unique cancer-specific data that captures the genomic instability of cancer cells (e.g., SCNA). As such, DAISY cannot be tested by applying it to identify SL interactions in model microorganisms such as yeast. Second, DAISY identifies SL interactions based on large-scale genomic data and shRNA screens, which are at times noisy and inaccurate (Bhinder et al., 2014). Third, as DAISY is based on the identification of gene inactivation, additional mechanisms of gene inactivation, such as epigenetic and posttranscriptional regulation, should be accounted for in the future. Fourth, the genomic location of genes may result in false-negative and false-positive predictions of SL and SDL interactions, respectively (see Supplemental Information for further analysis). Last, the ability of the SL network to accurately predict gene essentiality in vivo remains to be determined.

We have shown that SL and SDL interactions have a marked cumulative effect (Figures 5B, 5C, and 7D). Thus, a gene can form a useful drug target due to the (possibly partial) inactivation or overactivation of several of its SL or SDL partners, respectively. SL-based treatment can therefore be especially effective in targeting genetically unstable tumors that harbor many gene deletions and amplifications. Furthermore, a drug may be able to kill a broad array of genomically heterogeneous cells, each sensitive to the drug due to the inactivity (overactivity) of a different subset of the SL (SDL) partners of the drug targets. Targeting a gene with many inactive SL and/or overactive SDL partners may hence counteract the development of treatment resistance, especially if the SL/SDL partners reside on different chromosomes or in distant genomic locations. Moreover, SL-based treatment can induce the reactivation of a tumor suppressor or the inactivation of an oncogene by targeting its SL or SDL pair, respectively.

Four main translational challenges could potentially be tackled by utilizing SL and SDL networks: (1) ranking existing treatments for a given patient based on the genomic characteristics of the tumor, as initially shown here in cell lines; (2) repurposing approved drugs that are currently used to treat other diseases to treat cancer, as shown here for treating a VHL-deficient cancer; (3) systematically identifying new drug targets; and (4) predicting cancer prognosis, as shown here for breast cancer. Taken together, SL and SDL network-based analysis combined with personalized genomics can provide an important future tool for assessing response to treatment and for developing more selective and effective personalized therapeutics.

Experimental Procedures

Description of DAISY

DAISY identifies candidate SL and SDL interactions by applying three separate statistical inference procedures. Each procedure has its own input and outputs a set of candidate SL or SDL pairs. Gene pairs that are identified as candidate SL or SDL pairs by all three procedures are identified by DAISY as SL or SDL pairs, respectively. The three inference procedures are described below (comments in parentheses refer to changes made to identify SDL pairs):

(1) The genomic SoF procedure analyzes a set of input data sets denoted as SoF data sets. Each data set includes SCNA profiles of cancer samples and optionally their mRNA and somatic mutation profiles. For every pair of genes, denoted as A and B, and every data set S in SoF data sets, a Wilcoxon rank sum test is conducted to examine if gene B has a significantly higher SCNA level in samples in which gene A is inactive (overactive) than in the rest of the samples. The output consists of gene pairs that, according to at least one of the data sets in SoF data sets, pass the test described above in a statistically significant manner (a Wilcoxon rank sum p value <0.05 following Bonferroni correction for multiple hypotheses testing).

(2) The shRNA-based functional examination procedure analyzes a set of data sets denoted as shRNA data sets. Each data set includes the results obtained in a gene essentiality (shRNA) screen together with the SCNA and gene expression profiles of the cancer cell lines examined in that screen. For every pair of genes, denoted as A and B, and every data set S in shRNA data sets, a Wilcoxon rank sum test is conducted to examine if gene B has significantly lower shRNA scores in samples in which gene A is inactive (overactive) than in the rest of the samples (the lower the shRNA score is, the more essential the gene is). The output consists of gene pairs that, according to at least one of the data sets in shRNA data sets, pass the test described above in a statistically significant manner (a Wilcoxon rank sum p value <0.05).

(3) The pairwise gene coexpression procedure is given a set of transcriptomic data sets of cancer samples and returns gene pairs whose expression, in at least one of the data sets, is significantly positively correlated (a Spearman correlation coefficient ≥ Rmin and a p value < 0.05 following Bonferroni correction for multiple hypotheses testing).

The candidate SL or SDL pairs that are identified in the first and third procedures are obtained with highly stringent statistical cutoffs, a p value <0.05 following Bonferroni correction. The data obtained in shRNA screens has a low statistical power and is hence utilized (in the second procedure) only to further refine the already highly statistically significant SL and SDL sets identified in the first and third procedures.
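To make the statistical test behind the first procedure more concrete, here is a minimal, standard-library Python sketch of a one-sided rank sum test. It is an illustration under stated assumptions, not DAISY's code: it uses the normal approximation to the Wilcoxon rank sum (Mann-Whitney U) statistic with midranks for ties and no tie or continuity correction, and the function name and toy SCNA values are hypothetical.

```python
import math
from typing import Sequence


def rank_sum_p_greater(x: Sequence[float], y: Sequence[float]) -> float:
    """One-sided p value for 'values in x tend to be higher than values in y'.

    Normal approximation to the rank sum statistic; ties receive midranks.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one sample")

    # Pool the two groups, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability


# Hypothetical SCNA levels of gene B, split by the state of gene A.
scna_b_when_a_inactive = [0.4, 0.5, 0.6, 0.7, 0.8]
scna_b_in_other_samples = [-0.2, -0.1, 0.0, 0.1, 0.2]
p_value = rank_sum_p_greater(scna_b_when_a_inactive, scna_b_in_other_samples)
```

For the shRNA-based procedure the same function could be called with the two groups swapped, since that test asks for lower scores when gene A is inactive. In either case a Bonferroni correction would multiply the p value by the number of tests before comparing it with 0.05.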

The first and second procedures are based on the detection of gene inactivation and overactivation in the samples analyzed. A gene is defined as inactive in a sample if it is underexpressed and its SCNA is below −0.3 or if it is mutated with a deleterious mutation. The latter refers to nonsense and frame-shift mutations. Likewise, a gene is defined as overactive in a sample if it is overexpressed and its SCNA is above 0.3. Of note, the SCNA parameters (−0.3 and 0.3) used here are more stringent cutoffs compared to those used in the literature to define gene amplification and deletion (Beroukhim et al., 2010). A gene is defined as underexpressed in a given sample if its expression level is below the 10th percentile of its expression levels across all samples in the data set, and similarly, as overexpressed if its expression level is above its 90th percentile. In the third procedure we set Rmin to 0.5.
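The activity definitions above translate into a few threshold checks. The sketch below is a hedged illustration: the function and constant names are hypothetical, and the percentile method (the "inclusive" method of `statistics.quantiles`) is an assumption, since the text does not say how percentiles were computed.

```python
import statistics
from typing import Sequence, Tuple

SCNA_DELETION = -0.3       # SCNA cutoff for inactivation, from the text
SCNA_AMPLIFICATION = 0.3   # SCNA cutoff for overactivation, from the text


def expression_cutoffs(expression: Sequence[float]) -> Tuple[float, float]:
    """Return the 10th and 90th percentiles of a gene's expression
    across all samples in a data set (percentile method is an assumption)."""
    deciles = statistics.quantiles(expression, n=10, method="inclusive")
    return deciles[0], deciles[-1]


def is_inactive(expr: float, scna: float, low_cutoff: float,
                has_deleterious_mutation: bool = False) -> bool:
    """Underexpressed with SCNA below -0.3, or carrying a nonsense or
    frame-shift mutation."""
    return (expr < low_cutoff and scna < SCNA_DELETION) or has_deleterious_mutation


def is_overactive(expr: float, scna: float, high_cutoff: float) -> bool:
    """Overexpressed with SCNA above 0.3."""
    return expr > high_cutoff and scna > SCNA_AMPLIFICATION


# Hypothetical expression levels of one gene across 11 samples.
toy_expression = list(range(1, 12))
low, high = expression_cutoffs(toy_expression)
```

Note that both the expression and the SCNA condition must hold for a gene to be called inactive or overactive, whereas a deleterious mutation alone is sufficient for inactivation.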

To find the candidate pairs and construct the SL and SDL networks, we applied DAISY with the data sets listed in Table S1 and traversed all ∼535 million gene pairings. To do so efficiently, DAISY was implemented and run on the HTCondor architecture, which enables parallel computing (Thain et al., 2005).

Network Availability and Visualization

Interactive maps of the networks are accessible through http://www.cs.tau.ac.il/~livnatje/SL_network.zip and can be explored using the freely available Cytoscape software (Cline et al., 2007). The maps include different gene properties and annotations, as well as alternative views that dissect the network hubs or genes with specific characteristics. We clustered the SL and SDL networks by applying the Girvan-Newman fast greedy algorithm as implemented by the GLay Cytoscape plug-in (Morris et al., 2011 and Su et al., 2010) and performed gene annotation enrichment analysis for every network and every network cluster via DAVID (Huang et al., 2009).

Author Contributions

E.R. supervised the research. E.R. and L.J.A. conceived and designed the computational approach, analyzed the data, and wrote the paper. L.J.A. performed the statistical and machine learning analyses. E.G. designed and supervised the siRNA screens performed in his lab by N.P., L.M., D.J., and E.S. P.A.C. and B.S.-L. provided and analyzed pharmacological screening data. L.J.A. and Y.Y.W. performed the clinical survival analysis. Y.Y.W. performed the evolutionary and PPI network analysis. A.W. preprocessed the SCNA data. T.G. and E.G. provided insights regarding the biological aspects of the work. T.G. and Y.Y.W. assisted in writing the paper.

Acknowledgments

We thank A. Wagner, D. Horn, D. Steinberg, E. Halperin, I. Meilijson, L. Wolf, M. Kupiec, M. Oberhardt, and R. Sharan for their help and comments. We thank E. MacKenzie for technical support. L.J.A. and A.W. are partially funded by the Edmond J. Safra bioinformatics center and the Israeli Center of Research Excellence program (I-CORE, Gene Regulation in Complex Human Disease Center No 41/11). L.J.A. was also funded by the Dan David foundation and by the Adams Fellowship Program of the Israel Academy of Sciences and Humanities. Y.Y.W. was supported in part by Eshkol fellowship (the Israeli Ministry of Science and Technology). E.R.’s research in cancer is supported by grants from the Israeli Science Foundation (ISF) and Israeli Cancer Research Fund (ICRF). E.R. and T.G. are supported by the I-CORE program.

References

• Ashworth et al., 2011
• Genetic interactions in cancer progression and treatment
• Cell, 145 (2011), pp. 30–38
• Barretina et al., 2012
• The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity
• Nature, 483 (2012), pp. 603–607
• Bassik et al., 2013
• A systematic mammalian genetic interaction map reveals pathways underlying ricin susceptibility
• Cell, 152 (2013), pp. 909–922
• Basu et al., 2013
• An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules
• Cell, 154 (2013), pp. 1151–1161
• Beroukhim et al., 2010
• The landscape of somatic copy-number alteration across human cancers
• Nature, 463 (2010), pp. 899–905
• Bhinder et al., 2014
• Comparative analysis of RNAi screening technologies at genome-scale reveals an inherent processing inefficiency of the plasmid-based shRNA hairpin
• Comb. Chem. High Throughput Screen., 17 (2014), pp. 98–113
• Bilal et al., 2013
• Improving breast cancer survival analysis through competition-based multidimensional modeling
• PLoS Comput. Biol., 9 (2013), p. e1003047
• Bommi-Reddy et al., 2008
• Kinase requirements in human cells: III. Altered kinase requirements in VHL-/- cancer cells detected in a pilot synthetic lethal screen
• Proc. Natl. Acad. Sci. USA, 105 (2008), pp. 16484–16489
• Brough et al., 2011
• Searching for synthetic lethality in cancer
• Curr. Opin. Genet. Dev., 21 (2011), pp. 34–41
• Byrne et al., 2007
• A global analysis of genetic interactions in Caenorhabditis elegans
• J. Biol., 6 (2007), p. 8
• Cancer Genome Atlas Research Network et al., 2013
• The Cancer Genome Atlas Pan-Cancer analysis project
• Nat. Genet., 45 (2013), pp. 1113–1120
• Cheung et al., 2011
• Systematic investigation of genetic vulnerabilities across cancer cell lines reveals lineage-specific dependencies in ovarian cancer
• Proc. Natl. Acad. Sci. USA, 108 (2011), pp. 12372–12377
• Chipman and Singh, 2009
• Predicting genetic interactions with random walks on biological networks
• BMC Bioinformatics, 10 (2009), p. 17
• Cline et al., 2007
• Integration of biological networks and gene expression data using Cytoscape
• Nat. Protoc., 2 (2007), pp. 2366–2382
• Conde-Pueyo et al., 2009
• Human synthetic lethal inference as potential anti-cancer target gene detection
• BMC Syst. Biol., 3 (2009), p. 116
• Costanzo et al., 2010
• The genetic landscape of a cell
• Science, 327 (2010), pp. 425–431
• Curtis et al., 2012
• The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups
• Nature, 486 (2012), pp. 346–352
• Folger et al., 2011
• Predicting selective drug targets in cancer through metabolic networks
• Mol. Syst. Biol., 7 (2011), p. 501
• Frezza et al., 2011
• Haem oxygenase is synthetically lethal with the tumour suppressor fumarate hydratase
• Nature, 477 (2011), pp. 225–228
• Garnett et al., 2012
• Systematic identification of genomic markers of drug sensitivity in cancer cells
• Nature, 483 (2012), pp. 570–575
• Guix et al., 2008
• Acquired resistance to EGFR tyrosine kinase inhibitors in cancer cells is mediated by loss of IGF-binding proteins
• J. Clin. Invest., 118 (2008), pp. 2609–2619
• Hartwell et al., 1997
• Integrating genetic approaches into the discovery of anticancer drugs
• Science, 278 (1997), pp. 1064–1068
• Huang et al., 2009
• Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists
• Nucleic Acids Res., 37 (2009), pp. 1–13
• Kelley and Ideker, 2005
• Systematic interpretation of genetic interactions using protein networks
• Nat. Biotechnol., 23 (2005), pp. 561–566
• Laufer et al., 2013
• Mapping genetic interactions in human cancer cells with RNAi and multiparametric phenotyping
• Nat. Methods, 10 (2013), pp. 427–431
• Lord et al., 2008
• A high-throughput RNA interference screen for DNA repair determinants of PARP inhibitor sensitivity
• DNA Repair (Amst.), 7 (2008), pp. 2010–2019
• Lou et al., 2003
• Mediator of DNA damage checkpoint protein 1 regulates BRCA1 localization and phosphorylation in DNA damage checkpoint control
• J. Biol. Chem., 278 (2003), pp. 13599–13602
• Lu et al., 2013
• Genome evolution predicts genetic interactions in protein complexes and reveals cancer drug targets
• Nat. Commun., 4 (2013), p. 2124
• Luo et al., 2008
• Highly parallel identification of essential genes in cancer cells
• Proc. Natl. Acad. Sci. USA, 105 (2008), pp. 20380–20385
• Luo et al., 2009
• A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene
• Cell, 137 (2009), pp. 835–848
• Marcotte et al., 2012
• Essential gene profiles in breast, pancreatic, and ovarian cancer cells
• Cancer Discov., 2 (2012), pp. 172–189
• Martin et al., 2009
• Methotrexate induces oxidative DNA damage and is selectively lethal to tumour cells with defects in the DNA mismatch repair gene MSH2
• EMBO Mol. Med., 1 (2009), pp. 323–337
• Morris et al., 2011
• clusterMaker: a multi-algorithm clustering plugin for Cytoscape
• BMC Bioinformatics, 12 (2011), p. 436
• Sajesh et al., 2013
• Synthetic genetic targeting of genome instability in cancer
• Cancers, 5 (2013), pp. 739–761
• Steckel et al., 2012
• Determination of synthetic lethal interactions in KRAS oncogene-dependent cancer cells reveals novel therapeutic targeting strategies
• Cell Res., 22 (2012), pp. 1227–1245
• Su et al., 2010
• GLay: community structure analysis of biological networks
• Bioinformatics, 26 (2010), pp. 3135–3137
• Szappanos et al., 2011
• An integrated approach to characterize genetic interaction networks in yeast metabolism
• Nat. Genet., 43 (2011), pp. 656–662
• Szczurek et al., 2013
• Synthetic sickness or lethality points at candidate combination therapy targets in glioblastoma
• Int. J. Cancer, 133 (2013), pp. 2123–2132
• Thain et al., 2005
• Distributed computing in practice: the Condor experience
• Concurr. Comp-Pract. E., 17 (2005), pp. 323–356
• Turner et al., 2008
• A synthetic lethal siRNA screen identifying genes mediating sensitivity to a PARP inhibitor
• EMBO J., 27 (2008), pp. 1368–1377
• Typas et al., 2008
• High-throughput, quantitative analyses of genetic interactions in E. coli
• Nat. Methods, 5 (2008), pp. 781–787
• Waldman et al., 2013
• A genome-wide systematic analysis reveals different and predictive proliferation expression signatures of cancerous vs. non-cancerous cells
• PLoS Genet., 9 (2013), p. e1003806
• Wong et al., 2004
• Combining biological networks to predict genetic interactions
• Proc. Natl. Acad. Sci. USA, 101 (2004), pp. 15682–15687

Analysis of biological processes and diseases using text mining approaches.

Methods Mol Biol. 2010;593:341-82. doi: 10.1007/978-1-60327-194-3_16.


Abstract

A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. 
Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.

PMID: 19957157 [PubMed – indexed for MEDLINE]

PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool)

Recently, I found a good research paper called “PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool)”. It may be useful for scientists who are interested in this topic.

Sci Rep. 2015 May 19;5:10021. doi: 10.1038/srep10021.

PALM-IST: Pathway Assembly from Literature Mining–an Information Search Tool.

Abstract

Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword-based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological process, cellular component, etc.) information from the extracted text and subsequently mines various databases to provide their comprehensive inter-relations (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction networks and pathway information data relevant to the text search using multiple data mining tools and assembles them to create a meta-interaction network. It also analyzes scientific collaboration by extracting and creating a “co-authorship network” for a given search context. Hence, this useful combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interactions (PPI), to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST is available at www.hpppi.iicb.res.in/ctm.

PMID: 25989388 [PubMed – indexed for MEDLINE]
PMCID: PMC4437304 (free PMC article)

http://www.hpppi.iicb.res.in/ctm/

PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool) is a computational platform that lets users explore the biomedical literature resource PubMed using multiple keywords and extract information centered on gene/protein names, drugs and diseases, along with their relations and interactions, from text and databases. PALM-IST provides users a platform where data and literature mining are performed simultaneously. Combined structured data (from data mining) and unstructured data (from text mining) can be used to extract novel associations and interactions between biological entities such as proteins, diseases, or drugs, to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST also enables users to assemble human pathways and protein-protein interaction networks (PPIN) using information extracted from text and databases.

FEATURES

1. Real time search in PubMed.
2. Identification and highlighting of genes, drugs and diseases extracted from searched abstracts.
3. Interactive co-occurrence based network of gene-disease, gene-drug, drug-disease from literature.
4. Functional annotation by mapping expression information on to human pathway proteins and their interactors.
5. Platform to merge protein-protein interaction of multiple human genes/proteins.
6. Platform to find cross-talk genes/proteins from merged pathways result.
7. Interactive display of pathways overlaid with protein-protein interaction information.
8. Interactive display of collaborative network between biomedical experts.

KH Coder is a free software for quantitative content analysis or text data mining

https://sourceforge.net/projects/khc/

Description

KH Coder is a free software for quantitative content analysis or text data mining. It is also utilized for computational linguistics. You can analyze Japanese, English, French, German, Italian, Portuguese and Spanish text with KH Coder. Chinese (simplified, UTF-8), Korean and Russian (UTF-8) language data can also be analyzed with the latest alpha version.

KH Coder provides various kinds of search and statistical analysis functions using back-end tools such as Stanford POS Tagger, FreeLing, Snowball stemmer, MySQL and R.

KH Coder Web Site

http://www.sciencedirect.com/science/article/pii/S1672022916000401

Figure 1.

Translational Bioinformatics in context

The Y axis depicts the “central dogma” of informatics, converting data to information and information to knowledge. Along the X axis is the translational spectrum from bench to bedside. Translational bioinformatics spans the data to knowledge spectrum, and bridges the gap between bench research and application to human health. The figure was reproduced from [1] with permission from Springer.

In the general phase of text mining for cancer systems biology, we first obtain related biomedical text from the many available sources, such as PubMed. A number of literature databases provide a bulk download service. Although convenient, the packaged text is not updated in a timely manner, and the quantity of text is limited. Many literature database systems offer an application programming interface through which scripts can download text automatically. For example, through the E-utilities of PubMed [64] and [101], users can easily obtain up-to-date text.
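As an illustration of scripted retrieval, here is a minimal Python sketch of the two-step E-utilities pattern: ESearch to get PubMed IDs for a query, then EFetch to download the abstracts. The function names and the sample query are my own choices for illustration, not something taken from the review. Heavy use of the service should follow NCBI's usage guidelines (request rate, API key), which this sketch does not handle.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20):
    """Build an ESearch URL that returns PubMed IDs matching `term` as XML."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "xml"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(pmids):
    """Build an EFetch URL that returns plain-text abstracts for the given PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def parse_esearch_ids(xml_text):
    """Extract the PMIDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.iter("Id")]


def download(url):
    """Fetch a URL and return the body as text (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical query; any PubMed search expression can be used here.
    ids = parse_esearch_ids(download(build_esearch_url("breast cancer text mining", 5)))
    print(download(build_efetch_url(ids)))
```

Keeping URL construction and XML parsing separate from the network call makes the pieces easy to check offline before running a large download.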

Named entity recognition tools can then be used to extract biomedical mentions from the text obtained. The mentions usually include terms such as gene names, protein names, mRNA (messenger RNA) names, miRNA (microRNA) names, metabolism-related terms, and cell terms. After finding the biomedical terms, we can build a gene–gene interaction network, metabolic pathways, and other networks. Resources such as Gene Ontology can be used for network building. MicroRNAs are considered to be connected with cancer, so we can investigate how miRNAs act in gene–gene interactions. In the next phase, we can study how components and structures change in dynamic contexts. Certain networks and their variations, such as protein–protein interaction networks [102] and variations in metabolic networks, can be built from text. Because of the high false negative rate in text mining-based networks, we can employ validation and inference algorithms to correct and optimize the network. In each phase, we can use many resources to validate the network, such as homology, co-expression data, rich domain data, and co-biological process data, as well as other information. Through validation, nodes and interactions with strong evidence are strengthened, whereas false ones are removed or updated. Consequently, we can develop a protein–protein interactome based on multiple sources of interaction evidence [47]. Finally, all the networks and components can be used for further studies.
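To make the workflow concrete, here is a toy Python sketch of the simplest version of this idea: dictionary-based entity recognition, a sentence-level co-occurrence network, and a crude evidence filter. Everything in it is an assumption for illustration. The gene lexicon is a tiny hand-picked set, co-occurrence stands in for real relation extraction, and the `min_support` threshold stands in for the validation resources (homology, co-expression, and so on) that the review describes.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-lexicon; a real system would use a curated gene dictionary.
GENE_LEXICON = {"TP53", "BRCA1", "MDM2", "EGFR"}


def find_mentions(sentence, lexicon=GENE_LEXICON):
    """Return the set of lexicon genes mentioned in a sentence (case-insensitive)."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return {tok.upper() for tok in tokens if tok.upper() in lexicon}


def cooccurrence_network(sentences, lexicon=GENE_LEXICON):
    """Count how often each pair of genes appears in the same sentence."""
    edges = Counter()
    for sentence in sentences:
        genes = sorted(find_mentions(sentence, lexicon))
        for pair in combinations(genes, 2):
            edges[pair] += 1
    return edges


def filter_edges(edges, min_support=2):
    """Keep only edges supported by at least `min_support` sentences."""
    return {pair: n for pair, n in edges.items() if n >= min_support}


if __name__ == "__main__":
    text = [
        "MDM2 binds TP53 and promotes its degradation.",
        "TP53 is regulated by MDM2 in breast cancer cells.",
        "BRCA1 and EGFR were both measured in the cohort.",
    ]
    network = cooccurrence_network(text)
    print(network)                # raw co-occurrence counts
    print(filter_edges(network))  # edges with repeated support only
```

Sentence-level co-occurrence is easy to compute but noisy, which is exactly why the validation and inference step matters before such a network is used.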

Signaling pathway reconstruction plays a significant role in understanding the molecular mechanisms of cancer. Signaling pathway maps are usually obtained from manual literature search, automated text mining, or canonical pathway databases [103]. Pena-Hernandez et al. implemented an extraction tool to find gene relationships and up-to-date pathways from the literature [104].

5.2. Examples of integrated biomedical text mining tools

An integrated biomedical text mining system is supposed to provide the functionalities described above. Many such tools are prominent in cancer research. However, blindly using the results of text mining tools is unwise, because information and knowledge derived from uncurated text are error-prone. Many tools therefore rely on manual curation of the text by experts. In the following, we briefly introduce the three most popular commercial tools: Pathway Studio [105], GeneGO [106] and Ingenuity [107].

With Pathway Studio [105], we can analyze pathways, gene regulation networks, and protein interaction maps, and navigate molecular networks. Its background knowledge database contains more than 100,000 events of regulation, interaction and modification between proteins, cell processes and small molecules. Its natural language processing module, MedScan, performs entity identification and then applies handcrafted context-free grammar (CFG) rules to extract relationships. Pathway Studio can access the entire PubMed database as well as online resources, full-text journals, literature, and experimental and electronic notebooks. Pathways and networks are built from the facts and interactions extracted from the retrieved text. Many algorithms, such as Find direct interactions, Find shortest paths, Find common targets and Find common regulators, are available.
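The kinds of network queries named above are standard graph operations. The following Python sketch shows generic versions of shortest path, common targets and common regulators on a small directed graph. It is not Pathway Studio's implementation or API. The `REGULATION` graph is a hand-written toy example for illustration, not curated pathway data.

```python
from collections import deque

# Toy directed graph: regulator -> set of direct targets (illustrative only).
REGULATION = {
    "EGFR": {"RAS", "PI3K"},
    "RAS": {"RAF"},
    "RAF": {"MEK"},
    "MEK": {"ERK"},
    "ERK": {"MYC"},
    "PI3K": {"AKT"},
    "AKT": {"MDM2"},
    "MDM2": {"TP53"},
    "ATM": {"TP53"},
}


def shortest_path(graph, source, target):
    """Breadth-first search for a shortest directed path; None if unreachable."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def common_targets(graph, regulators):
    """Nodes that are direct targets of every listed regulator."""
    target_sets = [graph.get(r, set()) for r in regulators]
    return set.intersection(*target_sets) if target_sets else set()


def common_regulators(graph, targets):
    """Nodes that directly regulate every listed target."""
    wanted = set(targets)
    if not wanted:
        return set()
    return {src for src, dsts in graph.items() if wanted <= dsts}


if __name__ == "__main__":
    print(shortest_path(REGULATION, "EGFR", "TP53"))
    print(common_targets(REGULATION, ["MDM2", "ATM"]))
    print(common_regulators(REGULATION, ["RAS", "PI3K"]))
```

Breadth-first search suffices here because the edges are unweighted. If edges carried confidence scores, a weighted shortest-path algorithm would be the natural replacement.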

MetaCore, one of the key products of GeneGO [106], is an integrated knowledge database and software suite for pathway analysis of experimental data and gene lists. The knowledge base of MetaCore is a manually curated database derived from extensive full-text literature annotation. MetaMiner of GeneGO, which mainly includes MetaMiner Disease Platforms, MetaMiner Stem Cells, MetaMiner Prostate Cancer and MetaMiner Cystic Fibrosis, offers knowledge mining and data analysis platforms for oncology. Its most important function, disease reconstruction, is based on three fundamentals: manual annotation of all gene–disease associations, reconstruction of disease pathways, and functional data and knowledge mining of OMICs experimental studies published in a disease area. GeneGO also provides an API for third-party software development.

Ingenuity [107] helps researchers model, analyze, and understand complex biomedical, biological and chemical systems by integrating data from a variety of experimental platforms. One application example of Ingenuity Systems is the analysis of CD44hi breast cancer stem cell-like subpopulations using Ingenuity iReport. The knowledge base of Ingenuity is also extracted by experts from the full text of the scientific literature, including findings about genes, drugs, biomarkers, chemicals, cellular and disease processes, and signaling and metabolic pathways. Researchers can search the scientific literature and find the insights most relevant to the desired experimental model or question, build dynamic pathway models, and gain confidence in hypotheses and conclusions.

6. Future work and challenges

With the development of next-generation sequencing technologies, high-throughput experimental methods are rapidly revolutionizing the life sciences. The widespread adoption of cloud computing is also accelerating the application of text mining technology in frontier life science research. Here we discuss the future work and challenges in applying text mining to cancer research.

The first challenge is to apply biomedical text mining technologies to the development of personalized medicine. It is well known that cancer is a complex disease. Many factors such as race, gender, age and environment may correlate with the risk of cancer [108], [109], [110], [111], [112], [113] and [114]. Personalized medicine is becoming a trend, and therapies will be tailored to individual patients based on their collected and analyzed biomedical information. Ando et al. applied text mining to qualitatively identify differences in the focus of life review interviews by patient age, gender, disease age and stage [115]. Ahmed et al. integrated cancer-related compound–target relationships by text mining and presented the spectrum of research on personalized medicine and compound–target interactions [116]. Personalized medicine in cancer will need to take all these important aspects into consideration during text mining [117]. One solution is to categorize the data before text mining rather than treat them together without any pre-processing. However, categorizing data by individual-level features is a very hard task. In addition, one of the negative consequences of categorization is that it becomes harder for text mining to find a good biomarker for all cases.

The second challenge is the complexity of cancer molecular mechanisms. The same cancer phenotype can be caused by different genes or gene sets from the same pathway or network. To study the complex mechanisms of cancer, we need to mine text from a hierarchical network view rather than from a single level. Systems biomedicine carries out analysis at different levels, including motif [118] and [119], pathway [120], [121] and [122], module [123], [124] and [125] and network [126] and [127]. The resulting hierarchical data provide valuable material for text mining at different levels. However, how to correctly categorize text within a hierarchical network, and how to integrate text mining results from different levels and discover new knowledge with a systems biomedicine view, remain hard problems.

The third challenge is to apply text mining techniques in translational medicine research. Translational medicine, an emerging field of biomedicine, involves the transformation of laboratory findings into novel diagnosis and treatment of patients [128]. Pre-clinical knowledge can be used in the clinic to improve treatment. Translational medicine facilitates the prediction, prevention, diagnosis and treatment of diseases. Bioinformatics will be a driver rather than a passenger for translational biomedical research [128]. This adds difficult tasks for text mining, since translational biomedical text mining should consider various stages of information and various sources of evidence, and integrate omics and clinical data sets to find novel knowledge for both the biology and medicine domains. There are many applications of this kind; for example, the data integration and data mining platform presented by Liekens et al. [129] could retrospectively confirm recently discovered disease genes and identify potential susceptibility genes.

The fourth challenge for text mining will be the integration of text information at the molecule, cell, tissue, organ, individual and even population levels to understand complex biological systems. Most current text mining studies focus on the molecular level, and very little text mining work has been reported at higher levels, which in fact have a close relationship with cancer phenotypes. Mining text at higher levels and integrating the text information across all these levels will be a big challenge for cancer studies, and will also provide opportunities for successful cancer diagnosis and treatment.

The last challenge will be the de-noising and testing of text mining results. Text mining results often contain noise and false positives, since natural language text is often inconsistent. It contains ambiguities caused by semantics, slang and syntax, and it can also suffer from noise and errors in the text itself. As a result, the mined information cannot be used blindly. Several methods have been developed to address the problem. The first is to manually read and understand the contexts, analyze them, and then add semantic tags. This pre-processing in effect turns the unstructured text into structured text with semantic tags, so that the developed tools can easily achieve high precision. However, the approach is very restricted, as it requires vast human effort and is very time consuming. As a result, the data source for mining may be modest in size, which limits mining ability. The second method is to carry out text mining on vast amounts of biomedical text, and then analyze the results and screen out the final results using prior domain experience. During the mining process, domain knowledge is usually employed to improve mining efficiency as well as the quality of the mined knowledge. Although its results may contain more errors, this approach is more powerful for knowledge discovery than the first. The two approaches differ in how they treat the text to be mined: the first ensures correctness by careful manual pre-processing, while the second selects the correct results through post-processing by experts. The third approach is a compromise between pre-processing and post-processing, in which statistical analysis is first used to roughly clean the data, and mining is then conducted on the cleaned data.

7. Conclusions

Currently, there is a huge body of biomedical text, and its rapid growth makes it impossible for researchers to process the information manually. Researchers can use biomedical text mining to discover new knowledge. We have reviewed the important research issues related to text mining in the biomedical field. We also provided a review of the state-of-the-art applications and datasets used for text mining in cancer research, thereby providing researchers with the necessary resources to apply or develop text mining tools in their research. We introduced the general workflow of text mining to support cancer systems biology and illustrated each phase in detail. Text mining has been used widely in cancer research. However, to fully utilize text mining, it is still necessary to develop new methods for full-text mining and for highly complex text, as well as platforms for integrating other biomedical knowledge bases.

In spite of the huge potential of applying text mining to biomedicine, it still needs further development. Biomedical text mining systems are not yet standard tools for biomedical researchers in the way that retrieval systems and sequencing tools are. The next important mission for text mining is to develop applications that are really helpful to biomedical research, so that researchers can become more productive and make more progress in an era of rapidly growing information. To achieve this goal, more attention should be paid to helping biomedical scientists remove the obstacles that block progress, rather than to discussions unrelated to actual demands. One of the hottest topics in text mining is coordination and cooperation across multiple disciplines. That is, biomedical text mining, coupled with other data and means, should yield consistent, measurable, and testable results.

Next Generation Sequencing

Next-generation sequencing technologies are revolutionising genomics and their effects are becoming increasingly widespread. Many tools and algorithms relevant to next-generation sequencing applications have been published in Bioinformatics, and so to celebrate this contribution we have gathered these together in this ‘Bioinformatics for Next Generation Sequencing’ virtual issue. This will be a living resource that we will continually update to include the very latest papers in this area to help researchers keep abreast of the latest developments.

Source: http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

Editorial - Bioinformatics for Next Generation Sequencing
Alex Bateman and John Quackenbush
Bioinformatics (2009) 25: 429 Full Text

A Report on the 2009 SIG on Short Read Sequencing and Algorithms (Short-SIG)
Michael Brudno et al.
Bioinformatics (2009) 25: 2863–2864 Full Text

Alignment

Optimal spliced alignments of short sequence reads
Fabio De Bona et al.
Bioinformatics (2008) 24: i174-80 Full Text

PatMaN: rapid alignment of short sequences to large databases
Kay Prüfer et al.
Bioinformatics (2008) 24: 1530-1 Full Text

SeqMap: mapping massive amount of oligonucleotides to the genome
Hui Jiang and Wing Wong
Bioinformatics (2008) 24: 2395-6 Full Text

ZOOM! Zillions of oligos mapped
Hao Lin et al.
Bioinformatics (2008) 24: 2431-7 Full Text

Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications
Brian Ondov et al.
Bioinformatics (2008) 24: 2776-7 Full Text

SOAP: short oligonucleotide alignment
Ruiqiang Li et al.
Bioinformatics (2008) 24: 713-4 Full Text

Annotation of metagenome short reads using Proxygenes
Daniel Dalevi et al.
Bioinformatics (2008) 24: i7-13 Full Text

Optimal pooling for genome re-sequencing with ultra-high-throughput short-read technologies
Iman Hajirasouliha
Bioinformatics (2008) 24: i32-40 Full Text

PASS: a Program to Align Short Sequences
Davide Campagna et al.
Bioinformatics (2009) 25: 967–968 Full Text

MOM: Maximum Oligonucleotide Mapping
Hugh Eaves and Yuan Gao
Bioinformatics (2009) 25: 969–970 Full Text

ProbeMatch: Rapid alignment of oligonucleotides to a genome allowing both gaps and mismatches
Jignesh Patel et al.
Advanced Access publication: 7 April 2009 Full Text

Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform
Heng Li and Richard Durbin
Advanced Access publication: 18 May 2009 Full Text

CloudBurst: highly sensitive read mapping with MapReduce
Michael Schatz
Bioinformatics (2009) 25: 1363–1369 Full Text

SOAP2: an improved ultrafast tool for short read alignment
Ruiqiang Li
Advanced Access publication: 3 June 2009 Full Text

A Fast Hybrid Short Read Fragment Assembly Algorithm
Bertil Schmidt et al.
Advanced Access publication: 17 June 2009 Full Text

PerM: Efficient Mapping of Short Sequencing Reads with Periodic Full Sensitive Spaced Seeds
Yangho Chen et al
Advanced Access publication: 12 August 2009 Full Text

Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data
Jacob Degner et al.
Advanced Access publication: 6 October 2009 Full Text

Andrew Smith et al.
Bioinformatics (2009) 25: 2841–2842 Full Text

Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite
Takehiro Hashimoto et al.
Bioinformatics (2009) 25: 2613-4 Full Text

Classification of DNA sequences using Bloom filters
Henrik Stranneheim et al.
Bioinformatics (2010) 26: 1595–1600 Full Text

The Sequence Alignment/Map (SAM) Format and SAMtools
Heng Li et al.
Advanced Access publication: 8 June 2009 Full Text

Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite
Geoffrey Faulkner et al.
Advanced Access publication: 15 July 2009 Full Text

MicroRazerS: Rapid alignment of small RNA reads
Anne-Katrin Emde et al.
Bioinformatics (2010) 26: 123-124 Full Text

The GNUMAP Algorithm: Unbiased Probabilistic Mapping of Oligonucleotides from Next-Generation Sequencing
Nathan Clement et al.
Bioinformatics (2010) 26: 38-45 Full Text

A Probabilistic Framework for Aligning Paired-end RNA-seq Data
Yin Hu et al.
Advanced Access publication: 23 July 2009 Full Text

An alignment algorithm for bisulfite sequencing using the Applied Biosystems SOLiD System
Brian Ondov et al.
Bioinformatics (2010) 26: 1901-1902 Full Text

GASSST: global alignment short sequence search tool
Guillaume Rizk and Dominique Lavenier
Bioinformatics (2010) 26: 2534–2540 Full Text

Anatomy of a hash-based long read sequence mapping algorithm for next generation DNA sequencing
Sanchit Misra et al
Bioinformatics (2011) 27: 189-195 Full Text

Fast and SNP-tolerant detection of complex variants and splicing in short reads
Thomas Wu and Serban Nacu
Advanced Access publication: 10 February 2010 Full text

RRBSMAP: A Fast, Accurate and User-friendly Alignment Tool for Reduced Representation Bisulfite Sequencing
Yuanxin Xi et al
Bioinformatics (2012) 28: 430-432 Full Text

B-SOLANA: An approach for the analysis of two-base encoding bisulfite sequencing data
Benjamin Kreck et al
Bioinformatics (2012) 28: 428-429 Full Text

Assembly

Aggressive Assembly of Pyrosequencing Reads with Mates
Jason Miller et al.
Bioinformatics (2008) 24: 2818-24 Full Text

Assembly reconciliation
Aleksey Zimin et al.
Bioinformatics (2008) 24: 42-5 Full Text

Consensus Generation and Variant Detection by Celera Assembler
Bioinformatics (2008) 24: 1035-40 Full Text

Assembling millions of short DNA sequences using SSAKE
Rene Warren et al.
Bioinformatics (2007) 23: 500-1 Full Text

Extending assembly of short DNA sequences to handle error
William Jeck et al.
Bioinformatics (2007) 23: 2942-4 Full Text

SCARF: Maximizing next-generation EST assemblies for evolutionary and population genomic analyses
Michael Barker et al.
Bioinformatics (2009) 25: 535-536 Full Text

Profiling model T-cell metagenomes with short reads
René Warren et al
Bioinformatics (2009) 25: 458-64 Full Text

A Consistency-based Consensus Algorithm for De Novo and Reference-guided Sequence Assembly of Short Reads.
Tobias Rausch et al.
Bioinformatics (2009) 25: 1118–1124 Full Text

HI: Haplotype Improver using paired-end short
Quan Long et al.
Advanced Access publication: 1 July 2009 Full Text

Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads
Kai Ye et al.
Advanced Access publication: 26 June 2009 Full Text

Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly
Bas Dutilh et al.
Advanced Access publication: 19 June 2009 Full Text

De novo Transcriptome Assembly with ABySS
Inanc Birol et al.
Advanced Access publication: 15 June 2009 Full Text

Gap5 – editing the billion fragment sequence assembly
James Bonfield and Andrew Whitwham
Advanced Access publication: 30 May 2010 Full text

Efficient construction of an assembly string graph using the FM-index
Jared Simpson and Richard Durbin
Bioinformatics (2010) 26: i367–i373 Full Text

Integrating genome assemblies with MAIA
Jurgen Nijkamp et al
Bioinformatics (2010) 26: i433–i439 Full Text

Scaffolding pre-assembled contigs using SSPACE
Marten Boetzer et al
Bioinformatics (2011) 27: 578-579 Full Text

Scoring-and-unfolding trimmed tree assembler: concepts, constructs and comparisons
Giuseppe Narzisi and Bud Mishra
Bioinformatics (2011) 27: 153-160 Full Text

QuRe: Software for viral quasispecies reconstruction from next-generation sequencing data
Mattia Prosperi and Marco Salemi
Bioinformatics (2012) 28: 132-133 Full Text

Graph accordance of next-generation sequence assemblies
Guohui Yao et al
Bioinformatics (2012) 28: 13-16 Full Text

Fast Scaffolding with Small Independent Mixed Integer Programs
Leena Salmela et al
Bioinformatics (2011) 27: 3259–3265 Full Text

Bambus 2: Scaffolding Metagenomes
Sergey Koren et al
Bioinformatics (2011) 27: 2964–2971 Full Text

Tanga Magoc and Steven Salzberg
Bioinformatics (2011) 27: 2957-2963 Full Text

Mauve Assembly Metrics
Aaron Darling et al
Bioinformatics (2011) 27: 2756–2757 Full Text

Gee Fu: a sequence version and web-services database tool for genomic assembly, genome feature and NGS data
Ricardo Ramirez-Gonzalez et al
Bioinformatics (2011) 27: 2754–2755 Full Text

Paired-end RAD-seq for de-novo assembly and marker design without available reference
Eva-Maria Willing et al
Bioinformatics (2011) 27: 2187–2193 Full Text

Comparative Studies of de novo Assembly Tools for Next-generation Sequencing Technologies
Yong Lin et al
Bioinformatics (2011) 27: 2031–2037 Full Text

Meta-IDBA: A de Novo Assembler for Metagenomic Data
Francis Y. L. Chin
Bioinformatics (2011) 27: i94–i101 Full Text

Base Calling

SHREC: A short-read error correction method
Bertil Schmidt et al.
Advanced Access publication: 19 June 2009 Full Text

Swift: Primary Data Analysis for the Illumina
Nava Whiteford et al.
Advanced Access publication: 23 June 2009 Full Text

TagDust – A program to eliminate artifacts from next generation sequencing data
Timo Lassmann et al.
Bioinformatics (2009) 25: 2839–2840 Full Text

Correction of sequencing errors in a mixed set of reads
Leena Salmela
Bioinformatics (2010) 26: 1284-1290 Full Text

Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology
Thomas Otto
Bioinformatics (2010) 26: 1704-1707 Full Text

Reptile: representative tiling for short read error correction
Xiao Yang et al
Bioinformatics (2010) 26: 2526–2533 Full Text

Transformations for the Compression of FASTQ Quality Scores of Next Generation Sequencing Data
Raymond Wan et al
Advanced Access publication: 13 December 2011 Full Text

ChIP-seq

FindPeaks 3.1: A Tool for Identifying Areas of Enrichment from Massively Parallel Short-Read Sequencing Technology
Anthony Fejes et al.
Bioinformatics (2008) 24: 1729-30 Full Text

F-Seq: A Feature Density Estimator for High-Throughput Sequence Tags
Alan Boyle et al.
Bioinformatics (2008) 24: 2537-8 Full Text

Hierarchical Hidden Markov Model with Application to Joint Analysis of ChIP-chip and ChIP-seq data
Hyungwon Choi et al.
Advanced Access publication: 14 May 2009 Full Text

A clustering approach for identification of enriched domains from histone modification ChIP-Seq data
Weiqun Peng et al.
Advanced Access publication: 8 June 2009 Full Text

Detecting differential binding of transcription factors with ChIP-seq
Kun Liang and Sunduz Keles
Bioinformatics (2012) 28: 121-122 Full Text

TIP: A Probabilistic Method for identifying Transcription Factor Target Genes from ChIP-Seq Binding Profiles
Chao Cheng et al
Bioinformatics (2011) 27: 3221-3227 Full Text

Diagnosis

Statistical Model for Whole Genome Sequencing and Its Application to Minimally Invasive Diagnosis of Fetal Genetic Disease
Tianjiao Chu et al.
Bioinformatics (2009) 25: 1244–1250 Full Text

ISOLATE: A computational strategy for identifying the primary origin of cancers using high throughput sequencing
Gerald Quon and Quaid Morris
Advanced Access publication: 19 June 2009 Full Text

Identity-By-Descent Filtering of Exome Sequence data for Disease-Gene Identification in Autosomal Recessive Disorders
Christian Rödelsperger et al
Advanced Access publication: 28 January 2011 Full Text

Miscellaneous

FrameDP: sensitive peptide detection on noisy matured sequences
Jérôme Gouzy, Sébastien Carrere and Thomas Schiex
Bioinformatics (2009) 25: 670–671 Full Text

G-SQZ: Compact Encoding of Genomic Sequence and Quality Data
Waibhav Tembe et al
Advanced Access publication: 6 July 2009 Full Text

ART: a next-generation sequencing read simulator
Weichun Huang et al
Advanced Access publication: 23 December 2011 Full text

Detection of microRNAs in color-space
Antonio Marco and Sam Griffiths-Jones
Bioinformatics (2012) 28: 318-323 Full Text

Identifying small interfering RNA loci from high-throughput sequencing data
Thomas Hardcastle et al
Advanced Access publication: 9 December 2011 Full text

ART: a next-generation sequencing read simulator
Weichun Huang et al
Bioinformatics (2012) 28: 593–594 Full Text

Pipeline

PIQA: Pipeline for Illumina G1 Genome Analyzer Data Quality Assessment
Antonio Martinez-Alcantara et al.
Advanced Access publication: 14 July 2009 Full Text

ShortRead: A Bioconductor package for input, quality assessment, and exploration of high throughput sequence data
Martin Morgan et al.
Advanced Access publication: 3 August 2009 Full Text

inGAP, an integrated next-gen genome analysis pipeline
Ji Qi et al.
Bioinformatics (2010) 26: 127-139 Full Text

Manipulation of FASTQ data with Galaxy
Daniel Blankenberg et al.
Bioinformatics (2010) 26: 1783-1785 Full Text

GAMES identifies and annotates mutations in next-generation sequencing projects
Maria Elena Sana et al
Advanced Access publication: 22 October 2010 Full text

SAMStat: monitoring biases in next generation sequencing data
Timo Lassmann et al
Bioinformatics (2011) 27: 130-131 Full Text

PASSion: A Pattern Growth Algorithm Based Pipeline for Splice Junction Detection in Paired-end RNA-Seq Data
Yanju Zhang et al
Advanced Access publication: 4 January 2012 Full text

MeQA: A pipeline for MeDIP-seq data quality assessment and analysis
Jinyan Huang et al
Advanced Access publication: 22 December 2011 Full text

PGAP: Pan-Genomes Analysis Pipeline
Yongbing Zhao et al
Bioinformatics (2012) 28: 416-418 Full Text

GenomicTools: a computational platform for developing high-throughput analytics in genomics
Aristotelis Tsirigos et al
Bioinformatics (2012) 28: 282–283 Full Text

Knime4Bio: a set of custom nodes for the interpretation of Next Generation Sequencing data with KNIME
Pierre Lindenbaum et al
Bioinformatics (2011) 27: 3200-3201 Full Text

NARWHAL, a primary analysis pipeline for NGS data
Rutger Brouwer
Bioinformatics (2012) 28: 284-285 Full Text

Pyicos: A versatile toolkit for the analysis of high-throughput sequencing data
Sonja Althammer et al
Bioinformatics (2011) 27: 3333-3340 Full Text

RNA-Seq

Statistical Inferences for Isoform Expression in RNA-Seq.
Hui Jiang and Wing Wong
Bioinformatics (2009) 25: 1026–1032 Full Text

A toolkit for analysing large-scale plant small RNA datasets
Simon Moxon et al.
Bioinformatics (2008) 24: 2252-2253 Full Text

TopHat: discovering splice junctions with RNA-Seq
Cole Trapnell et al.
Bioinformatics (2009) 25: 1105–1111 Full Text

RNA-MATE: A recursive mapping strategy for high-throughput RNA-sequencing data
Nicole Cloonan et al.
Bioinformatics (2009) 25: 2615-2616 Full Text

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data
Likun Wang et al.
Bioinformatics (2010) 26: 136-138 Full Text

Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data.
Jacob Degner et al.
Bioinformatics (2009) 25: 3207-3212 Full Text

Supersplat – spliced RNA-seq alignment
Douglas Bryant Jnr. et al.
Bioinformatics (2010) 26: 1500–1505 Full Text

RNA-Seq gene expression estimation with read mapping uncertainty
Bo Li et al.
Bioinformatics (2010) 26: 518-528 Full Text

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
Mark Robinson et al.
Bioinformatics (2010) 26: 139–140 Full Text

Length Bias Correction for RNA-seq Data in Gene Set Analyses
Liyan Gao et al
Bioinformatics (2011) 27: 662–669 Full Text

Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq
Zhengpeng Wu et al
Bioinformatics (2011) 27: 502-508 Full Text

RSEQtools: a modular framework to analyze RNA-Seq data using compact, anonymized data summaries
Lukas Habegger et al
Bioinformatics (2011) 27: 281-283 Full Text

htSeqTools: High-Throughput Sequencing Quality Control, Processing and Visualization in R
Evarist Planet et al
Advanced Access publication: 22 December 2011 Full text

Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq
Ming Hu et al
Bioinformatics (2012) 28: 63-68 Full Text

deepBlockAlign: A tool for aligning RNA-seq profiles of read block patterns
David Langenberger et al
Bioinformatics (2012) 28: 17-24 Full Text

RNA-Seq Analysis in MeV
Eleanor Howe et al
Bioinformatics (2011) 27: 3209-3210 Full Text

Variant detection

VarScan: Variant detection in massively parallel
Daniel Koboldt
Advanced Access publication: 19 June 2009 Full Text

SNP-o-matic
Heinrich Manske and Dominic Kwiatkowski
Advanced Access publication: 2 July 2009 Full Text

Slider – Maximum use of probability information for alignment of short sequence reads and SNP detection
Nawar Malhis et al.
Bioinformatics (2009) 25: 6-13 Full Text

Detecting SNPs and estimating allele frequencies in clonal bacterial populations by sequencing pooled DNA
Kathryn Holt et al.
Bioinformatics (2009) 25: 2074-2075 Full Text

Copy number variant detection in inbred strains from short read sequence data
Jared Simpson et al.
Advanced Access publication: 18 December 2009 Full Text

Microindel detection in short-read sequence data
Peter Krawitz et al
Advanced Access publication: 9 February 2010 Full text

SNVMix: predicting single nucleotide variants from next generation sequencing of tumors
Rodrigo Goya et al
Advanced Access publication: 3 February 2010 Full Text

Structural Variation Analysis with Strobe Reads
Anna Ritz et al.
Bioinformatics (2010) 26: 1291-1298 Full Text

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing
Iman Hajirasouliha et al.
Bioinformatics (2010) 26: 1277–1283 Full Text

Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery
Fereydoun Hormozdiari et al.
Bioinformatics (2010) 26: i350–i357 Full Text

VARiD: a variation detection framework for color-space and letter-space platforms
Bioinformatics (2010) 26: i343–i349 Full Text

A statistical method for the detection of variants from next-generation resequencing of DNA pools
Vikas Bansal
Bioinformatics (2010) 26: i318–i324 Full Text

SLOPE: a quick and accurate method for locating non-SNP structural
Haley Abel et al
Bioinformatics (2010) 26: 2684–2688 Full Text

SeqEM: an adaptive genotype-calling approach for next-generation
E. R. Martin et al
Bioinformatics (2010) 26: 2803–2810 Full Text

SVDetect: a tool to identify genomic structural variations from paired-end
Bruno Zeitouni et al
Bioinformatics (2010) 26: 1895–1896 Full Text

ACCUSA – accurate SNP calling on draft genomes
Sebastian Fröhler and Christoph Dieterich
Bioinformatics (2010) 26: 1364–1365 Full Text

Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format
Michael Edmonson et al
Advanced Access publication: 28 January 2011 Full Text

MU2A – reconciling the genome and transcriptome to determine the effects of base substitutions
Vijay Garla et al
Bioinformatics (2011) 27: 416–418 Full Text

VarSifter: Visualizing and analyzing exome-scale sequence variation data on a desktop computer
Jamie Teer et al
Advanced Access publication: 30 December 2011 Full text

Read Count approach for DNA copy number variants detection
Alberto Magi et al
Advanced Access publication: 23 December 2011 Full text

Control-FREEC: a tool for assessing copy number and allelic content using next generation sequencing data
Valentina Boeva et al
Bioinformatics (2012) 28: 423–425 Full Text

SVseq: an approach for detecting exact breakpoints of deletions with low-coverage sequence data
Jin Zhang and Yufeng Wu
Bioinformatics (2011) 27: 3228-3234 Full Text

Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools
Francis San Lucas et al
Bioinformatics (2012) 28: 421-422 Full Text

TREAT: A Bioinformatics Tool for Variant Annotations and Visualizations in Targeted and Exome Sequencing Data
Yan Asmann et al
Bioinformatics (2012) 28: 277-278 Full Text

SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data
David Larson et al
Bioinformatics (2012) 28: 311-317 Full Text

Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next generation sequence data
Arief Gusnanto et al
Bioinformatics (2012) 28: 40-47 Full Text

Visualisation

NGSView: an extensible open source editor for next-generation sequencing data
Erik Arner et al.
Bioinformatics (2010) 26: 125-126 Full Text

Tablet – Next Generation Sequence Assembly Visualization
Iain Milne et al.
Advanced Access publication: 4 December 2009 Full Text

CisGenome Browser: A flexible tool for genomic data visualization
Hui Jiang et al.
Advanced Access publication: 30 May 2010 Full text

Savant: Genome Browser for High Throughput Sequencing Data
Marc Fiume et al.
Advanced Access publication: 20 June 2010 Full Text

girafe – an R/Bioconductor package for functional exploration of aligned
Joern Toedling et al
Bioinformatics (2010) 26: 2902–2903 Full Text

Artemis: An integrated platform for visualisation and analysis of high-throughput sequence-based experimental data
Tim Carver et al
Advanced Access publication: 22 December 2011 Full Text

Visualization and quality assessment of de novo genome assemblies
Oksana Riba-Grognuz et al
Bioinformatics (2011) 27: 3425-3426 Full Text