
# Knowledge-Driven NGS Analysis

## Human biological pathway unification

PathCards is an integrated database of human biological pathways and their annotations. Human pathways were clustered into SuperPaths based on gene content similarity. Each PathCard provides information on one SuperPath which represents one or more human pathways. It includes 1,131 SuperPath entries, consolidated from 12 sources.

Publication Details

Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M. and Lancet, D. PathCards: multi-source consolidation of human biological pathways. Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006.

http://pathcards.genecards.org/

# PathCards: multi-source consolidation of human biological pathways

Author Affiliations

1. Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel

*Corresponding author: Tel: +972-89343188; Fax: +972-89344487; Email: Frida.Belinky@weizmann.ac.il

• Revision received January 13, 2015.
• Accepted January 14, 2015.

## Introduction

The systematic analysis of biological pathways has ever-increasing significance in an age of growing systems analyses and omics data. Mapping genes onto pathways may contribute to a better understanding of biological and biomedical mechanisms. The literature provides a large collection of pathway definition sources (1). Pathway knowledge bases represent the careful collection of genes and their interactions, mapped onto biological processes. These repositories, which include both academic and commercial resources (Figure 1A), provide lists of pathways and their cellular components, each with an idiosyncratic view of the pathway universe.

Figure 1.

The gene-content network of pathway sources. Eighteen sources are shown, 12 of which (colored) are included in SuperPaths generation. Edge widths are proportional to the pairwise Jaccard similarity coefficient computed for the gene contents of the entire source. The sources, depicted in GeneCards Version 3.12, are: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). White circles correspond to sources not included in the SuperPath generation process: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).

Indeed, the definition of the boundaries of biological pathways differs among sources, as exemplified by the highly studied processes of fatty acid metabolism (2) or the TCA cycle (the tricarboxylic acid cycle) (3). Further, the same pathway name may have widely dissimilar gene content in different sources (4). At present, there is no definitive analysis of pathway similarities, either between or within sources. Thus the multitude of pathway resources can often be confusing when portraying gene-pathway affiliations.

Previous attempts to unify pathways from several sources include NCBI’s Biosystems (5), PathwayCommons (6), PathJam (7), HPD (8), ConsensusPathDB (9), hiPathDB (10) and Pathway Distiller (11). However, none of these efforts entails a standardized method to unify numerous sources into a consolidated global repository.

Here, we describe an approach aimed at generating an integrated view across multiple pathway sources. We applied a combination of nearest neighbor graph and hierarchical clustering, utilizing a gene-content metric, to generate a manageable set of 1073 unified pathways (SuperPaths). These optimally encompass all of the information contained in the individual sources, striving to minimize pathway redundancy while maximizing gene-related pathway informativeness. The resultant SuperPaths are integrated into GeneCards (12), enabling clear portrayal of a gene’s set of unified pathways. Finally, these SuperPaths, together with diverse related biological data, are provided in PathCards—a new pathway-centric online database, enabling quick in-depth analysis of each human SuperPath.

## Materials and methods

### Pathway mining and comparison

Pathway gene sets were generated based on the GeneCards platform (12), implementing the gene symbolization process that allows comparison of pathway gene sets from 12 different manually curated sources: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). A binary matrix was generated for all 3215 pathways, where each column represents a gene, with 1 indicating presence in the pathway and 0 indicating absence. Additionally, six sources were analysed for their cumulative tallying of gene content: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).
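As a minimal sketch of the binary matrix described above (not the authors' code), the snippet below builds a pathway-by-gene 0/1 table from gene sets. The function name `membership_matrix` and the toy pathways are hypothetical and serve only as an illustration.

```python
def membership_matrix(pathways):
    """Return (genes, rows): rows[name] is a 0/1 list aligned with genes."""
    # Columns are the sorted union of all gene symbols seen in any pathway.
    genes = sorted(set().union(*pathways.values()))
    rows = {
        name: [1 if gene in members else 0 for gene in genes]
        for name, members in pathways.items()
    }
    return genes, rows


# Toy input; pathway names and gene contents are illustrative only.
toy_pathways = {
    "pathway_a": {"TP53", "MDM2", "CDKN1A"},
    "pathway_b": {"TP53", "ATM"},
}
genes, rows = membership_matrix(toy_pathways)
for name, row in rows.items():
    print(name, row)
```

Each row is one pathway and each column one gene, so any pairwise gene-content comparison can be computed directly from two rows.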

### Pathway similarity assessment

In the analyses performed, we utilized gene content overlap to estimate pathway similarity. This was done based on the Jaccard coefficient, which measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sets. To examine the legitimacy of this method, we performed a comparison to an alternative methodology, embodied in MetaPathwayHunter pathway comparison, that incorporates topology in pairwise pathway alignment (25). For this analysis, we used a set of 151 yeast pathways available in MetaPathwayHunter, and computed Jaccard similarity coefficients (J) for all 11 325 pathway pairs. We then selected a sample of 30 pairs containing 28 unique pathways out of a total of 87 pairs with J ≥ 0.3, ensuring maximal representation for larger pathways. Each of the 28 pathways was queried in MetaPathwayHunter against the entire gamut of 151 with default parameters (a total of 4228 comparisons). We found that 29 out of the 30 sample pathway pairs obtained a significant MetaPathwayHunter alignment (P ≤ 0.01). As only 64 of the 4228 comparisons showed such a P-value, the probability of obtaining this result at random is 1.6 × 10^−53 (Supplementary Table S1). Thus, Jaccard scores appear to be excellent predictors of the results of the more elaborate method. A full account of interpathway pairwise similarity is available upon request.
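The Jaccard coefficient and the pair count quoted above can be checked with a few lines of Python. This is an illustrative sketch; the function name `jaccard` and the toy sets are assumptions, not part of the original analysis.

```python
from math import comb


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# 151 pathways yield 151 * 150 / 2 unordered pathway pairs.
n_pairs = comb(151, 2)

# Toy gene sets: 2 shared genes out of 5 distinct genes in total.
j_toy = jaccard({"A", "B", "C"}, {"B", "C", "D", "E"})
print(n_pairs, j_toy)
```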

### Clustering algorithm

For the main pathway clustering algorithm, we applied a method described elsewhere (26), which includes the following steps: i) The generation of cluster cores by joining all pathway pairs with Jaccard coefficient ≥T2, the upper cutoff, equivalent to hierarchical clustering. ii) Performing cluster extension by generating new best edges, i.e. joining every pathway to a pathway showing the highest score, as long as it is ≥T1, the lower cutoff, akin to nearest neighbor joining. If two or more target pathways have the same best score, all are joined. Each resultant connected component is defined to be a pathway cluster (SuperPath). Identical pathway sets were joined without considering each other as nearest neighbors (i.e. the best scoring non-identical pathway gene-set is chosen as the nearest neighbor). This clustering algorithm is order independent.
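The two steps above can be sketched as follows. This is a simplified reading of the description, not the implementation from reference (26): cores are formed by joining pairs with J ≥ T2, each pathway is then joined to its best-scoring non-identical neighbor(s) if that score is ≥ T1, and connected components are collected with a union-find structure. The function names and the toy pathways are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_superpaths(pathways, t1=0.3, t2=0.7):
    """Return connected components (sets of pathway names)."""
    names = list(pathways)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    def union(a, b):
        parent[find(a)] = find(b)

    scores = {}
    for a, b in combinations(names, 2):
        scores[(a, b)] = scores[(b, a)] = jaccard(pathways[a], pathways[b])

    # Step i: cluster cores, joining every pair with J >= T2.
    for a, b in combinations(names, 2):
        if scores[(a, b)] >= t2:
            union(a, b)

    # Step ii: best edges, joining each pathway to its top-scoring
    # non-identical neighbor(s) when that score is >= T1 (ties all join).
    for a in names:
        candidates = [
            (scores[(a, b)], b)
            for b in names
            if b != a and pathways[b] != pathways[a]
        ]
        if not candidates:
            continue
        best = max(score for score, _ in candidates)
        if best >= t1:
            for score, b in candidates:
                if score == best:
                    union(a, b)

    components = {}
    for n in names:
        components.setdefault(find(n), set()).add(n)
    return sorted(components.values(), key=sorted)


# Toy gene sets (illustrative only).
toy = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"a", "b", "c", "d", "e"},  # J(p1, p2) = 0.8, a core edge
    "p3": {"c", "d", "e", "f", "g"},  # best neighbor is p2, J = 3/7
    "p4": {"x", "y", "z"},            # no overlap, stays a singleton
}
clusters = cluster_superpaths(toy)
pure_hierarchical = cluster_superpaths(toy, t1=0.7, t2=0.7)
print(clusters)
```

Because both steps only add edges determined by the full score table, the result does not depend on the order in which pathways are listed, in line with the order independence noted above.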

### Determination of cutoffs

Uniqueness of a SuperPath, Us, is defined as $U_s = \log_{10}\left(\frac{\sum 1/N_p}{N_g}\right)$, where $N_p$ is the number of pathways that include a certain gene, averaging for each pathway over all genes in the SuperPath (divided by the number of genes $N_g$). Uniqueness of genes, Is, is symmetrically defined per SuperPath as $I_s = \log_{10}\left(\frac{\sum 1/N_g}{N_p}\right)$, where each $N_g$ is the number of genes included in the relevant pathway, averaging for each gene over all SuperPaths including a gene. To find the best tradeoff between the two scores, we summed the average Us and Is for each set of T1 and T2 cutoff parameters. Us + Is was thus calculated for each set of parameters to find the two parameters for which the tradeoff between pathway and gene uniqueness would be optimal. The best cutoffs obtained by maximizing Us + Is were T1 = 0.3 and T2 ≥ 0.5. Further fine tuning of the upper cutoff was performed by resampling of the data, a technique employed by Levin and Domany (27). We used two dilutions (0.75 and 0.9), i.e. randomly sampling 75% and 90% of the pathways (resampling 100 times for each dilution) and performing the clustering algorithm on each sample, each time calculating the percent of the edges present in the original clustering, i.e. the percent of cases in which two pathways belonged to the same cluster as in the full dataset. In both dilutions, the upper cutoff of 0.7 was found to recover a higher percent of the edges in the original clustering algorithm (Figure 4C).
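To make the uniqueness score more concrete, here is a small sketch under one possible reading of the definition: for each gene set, take the mean of 1/Np over its genes (Np being the number of sets in the collection that contain the gene), apply log10, and average over all sets. This interpretation, the function name `mean_uniqueness` and the toy sets are assumptions and may differ from the exact computation used in the study.

```python
import math
from collections import Counter


def mean_uniqueness(gene_sets):
    """Average over sets of log10(mean over genes of 1 / Np)."""
    # Np: in how many sets of the collection each gene appears.
    n_p = Counter(gene for genes in gene_sets for gene in set(genes))
    scores = []
    for genes in gene_sets:
        mean_inverse = sum(1 / n_p[gene] for gene in genes) / len(genes)
        scores.append(math.log10(mean_inverse))
    return sum(scores) / len(scores)


# Two overlapping sets versus the same genes merged into one set.
split = mean_uniqueness([{"A", "B"}, {"B", "C"}])
merged = mean_uniqueness([{"A", "B", "C"}])
print(split, merged)
```

In this toy case the score rises to its maximum of 0 once the redundant gene "B" is no longer shared between sets, which matches the idea that merging redundant pathways increases uniqueness.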

### Name similarity calculation and concordance with gene similarity

Name similarity was calculated as the Jaccard coefficients of the shared words in the two pathway names, after omitting trivial words and using stemming to identify words with the same root. The cutoff between similar and non-similar names (as well as gene content in regard to comparison with name similarity) was set to J = 0.5. Name similarity was compared with gene content similarity to find the level of concordance between the two.
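A rough sketch of this name comparison is shown below. The list of trivial words and the suffix-stripping stemmer are placeholders chosen for illustration; the original study does not specify them here, and a real stemmer (for example a Porter stemmer) would behave differently on some words.

```python
import re

# Assumed list of trivial words; the actual list used is not given in the text.
TRIVIAL_WORDS = {"of", "the", "and", "in", "by", "to", "pathway", "pathways"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def name_tokens(name):
    words = re.findall(r"[a-z0-9]+", name.lower())
    return {crude_stem(w) for w in words if w not in TRIVIAL_WORDS}


def name_similarity(name_a, name_b):
    """Jaccard coefficient over the stemmed, non-trivial words of two names."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sim_meiosis = name_similarity("Meiosis", "Oocyte meiosis")
sim_fatty = name_similarity("Fatty acid metabolism", "Metabolism of fatty acids")
print(sim_meiosis, sim_fatty)
```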

### Shared publications and PPI data

Publication and Protein-Protein Interactions (PPI) data for each gene were obtained from the GeneCards database, including several combined sources. Publications sources of GeneCards include both manually curated publications (e.g. UniProtKB/Swiss-Prot) as well as text mining approaches that report connections between a gene and a list of publications. A shared publication between two genes is an association of both genes to the same publication and does not indicate a direct interaction between the genes. PPI scores between pairs of genes are also based on several interaction sources in GeneCards. Unlike shared publications, PPIs reflect direct interactions between the two gene products.

### Randomization and comparison

A randomized set of pseudo-SuperPaths was generated, such that the pseudo-SuperPaths are the same size and quantity as the SuperPaths, albeit with genes assigned at random (from the list of genes with any pathway annotation). Gene pairs that belong to at least one SuperPath, but do not belong together in any individual pathway (the test set), were analysed for the number of shared publications and PPI scores for each pair. In comparison, gene pairs that belong to at least one pseudo-SuperPath, but do not belong together in any individual pathway (the control set), were analysed for the same attributes. To compare the two sets, which are of different sizes, a random sample of the larger set (the control set), of the same size as the smaller set (the test set), was compared with the smaller set. A one-sided Kolmogorov–Smirnov test was performed to compare the test and control sets.
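The construction of the test and control sets can be sketched as follows. This is an illustration under simplifying assumptions: the gene pool is taken as the union of the SuperPath genes, and all function names and toy data are hypothetical.

```python
import random
from itertools import combinations


def make_pseudo_superpaths(superpaths, seed=0):
    """Same number and sizes as the input sets, with genes drawn at random."""
    rng = random.Random(seed)
    pool = sorted(set().union(*superpaths))  # assumed pool of annotated genes
    return [set(rng.sample(pool, len(sp))) for sp in superpaths]


def gene_pairs(gene_sets):
    """All unordered gene pairs that co-occur in at least one set."""
    pairs = set()
    for genes in gene_sets:
        pairs.update(combinations(sorted(genes), 2))
    return pairs


def novel_pairs(cluster_sets, pathway_sets):
    """Pairs sharing a cluster but not any individual pathway."""
    return gene_pairs(cluster_sets) - gene_pairs(pathway_sets)


# Toy data (illustrative only).
pathways = [{"A", "B"}, {"B", "C"}, {"D", "E"}]
superpaths = [{"A", "B", "C"}, {"D", "E"}]

test_set = novel_pairs(superpaths, pathways)
pseudo = make_pseudo_superpaths(superpaths, seed=1)
control_set = novel_pairs(pseudo, pathways)
print(test_set, control_set)
```

The one-sided test itself could then be run on the per-pair attributes with a statistics library, for example `scipy.stats.ks_2samp` with its `alternative` argument, after subsampling the larger set to the size of the smaller one.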

### Gene enrichment analysis comparison

Differentially expressed sets of genes were obtained from the GeneCards database (12) containing 830 different embryonic tissues based on manual curation (28). For the comparison of SuperPaths and their pathway constituents, 89 SuperPaths that contained exactly two pathways with Jaccard similarity coefficient <0.6 were chosen, a value selected to include pairs of relatively dissimilar pathways in order to enhance comparative power. Two gene set enrichment analyses were run for all 830 gene sets: one with SuperPaths and the other with their constituent pathways. Whenever both the SuperPath and the constituent pathways received a statistical enrichment score, the difference between negative log P-values was computed.
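The score comparison can be illustrated with a short sketch. It assumes base-10 logarithms and compares the SuperPath against the average of its constituent pathways; both choices, the function names and the P-values below are assumptions made for illustration, not values from the study.

```python
import math


def neg_log10(p_value):
    return -math.log10(p_value)


def enrichment_gain(p_superpath, p_constituents):
    """Difference between the SuperPath score and the mean constituent score."""
    mean_constituent = sum(neg_log10(p) for p in p_constituents) / len(p_constituents)
    return neg_log10(p_superpath) - mean_constituent


# Made-up P-values: a positive gain means the SuperPath scored as more significant.
gain = enrichment_gain(1e-6, [1e-3, 1e-5])
print(gain)
```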

### GeneCards and PathCards

SuperPaths have been implemented in GeneCards and are now included in the standard procedure of GeneCards generation. PathCards is an online compendium of human pathways, based on the GeneCards database, presenting SuperPath-related data in each page.

## Results

### Pathway sources

We analysed 12 pathway sources included in GeneCards (http://www.genecards.org/) (12) with a total of 3215 biological pathways (Table 1 and Figure 1A). The total number of genes covered by these sources is 11 478, nearly twice as large as the gene count in the largest source (Figure 1B), suggesting the power of analysing multiple sources. Asymptotic behavior is observed in the change of total gene count with increasing number of sources. When considering the incorporation of six additional sources (Supplementary Figure S1), we found that the gene count increment is ∼2% of the currently analysed total. This is an indication that the chosen 12 sources provide adequate coverage of human gene-pathway mappings. Switching between the six non-included sources and six included sources of similar size gives a very similar graph, with merely 4% increment in gene count (Supplementary Figure S1).

Analysing the gene repertoires of the four largest sources (Figure 2A), we found that among the 10 770 genes contained within these sources, only 1413 genes were jointly covered by all four sources, and that more than 4000 were unique to one of the four sources. This highlights the notion that source unification is essential to obtain maximal gene coverage. In its simplest embodiment, source unification would entail presenting a unified list of the 3215 pathways included in all 12 sources. This however would ignore the extensive gene-content connectivity embodied in the network representation of this pathway collection (Figure 3A). Further, the original pathway collection has considerable inconsistencies of relations between pathway name and pathway gene content, as exemplified in Figure 2B and C. The summary in Table 2A suggests that only ∼9.4% of all pathway pairs with a similar name have similar gene content, and likewise, only 9.8% of all pathway pairs with similar gene content are named similarly (Supplementary Figure S2).

Figure 2.

Discrepancies between pathway sources. (A) Incomplete gene overlap among sources. Venn diagram (created using VENNY, http://bioinfogp.cnb.csic.es/tools/venny/) showing the number of shared genes among the four largest pathway sources. For a total of 10 770 genes, only 1413 (13%) are shared by all four sources and 609–1791 genes are unique to each of these sources. (B) Inconsistency of names versus content in meiosis-related pathways. A Venn diagram created using BioVenn (29) exemplifies two pathways, ‘Meiosis’ from Reactome and ‘Oocyte meiosis’ from KEGG, with very small gene sharing (7 genes out of 172, J = 0.04). (C) Redundancy in meiosis-related pathways. This is exemplified by the large number of genes (88 of 119, J = 0.74) shared by ‘Meiosis’ and ‘Meiotic recombination’ pathways, both from Reactome, and by the large number of genes (52 of 146, J = 0.36) shared by ‘Oocyte meiosis’ and ‘Progesterone-mediated oocyte maturation’, both from KEGG. (D) Pathway size distribution across sources. Pathway size, measured in gene count, is distributed differently across the different sources.

Figure 3.

Network representations of the 3215 analyzed pathways. Nodes represent pathways and edges represent Jaccard similarity coefficients (J) using different methods. Network visualizations were performed using Gephi (30). Colors correspond to pathway sources. (A) No clustering. All edges with J ≥ 0.05 are shown. All but 20 pathways form one large connected component with an average degree of 134. (B) SuperPaths. Each is a connected component obtained by the main clustering algorithm, with thresholds T1 (best edges) of J ≥ 0.3 and T2 of J ≥ 0.7. There are 544 singletons and 529 multi-pathway clusters; the size of the largest cluster is 70. (C) Pure hierarchical clustering, with threshold T2 of J ≥ 0.3. There are 544 singletons and 288 multimembered clusters; the size of the largest cluster is 1046 pathways.

Figure 4.

Selection of the T1 and T2 thresholds. (A) Distribution of Jaccard coefficients across all pathway pairs. T1 and T2 respectively represent the lower and upper cutoffs used in the algorithm employed. (B) Us + Is scores across combinations of T1 and T2. The diagonal (T1 = T2) represents pure hierarchical clustering with different thresholds. The best scores are attained when T1 = 0.3 and T2 ≥ 0.5. (C) Determination of T2. T2 (upper cutoff) was determined by resampling of the pathway data at two dilution levels (27), 0.75 and 0.9. In both cases J = 0.7 was found to be the optimum in which a higher fraction of the original clustering is recovered.


Table 2.

Gene content versus name similarity of pathways and SuperPaths

### Pathway clustering

We performed global pathway analysis aimed at assigning maximally informative pathway-related annotation to every human gene. For this, we converted the pathway compendium into a set of connected components (SuperPaths), each being a limited-size cluster of pathways. We aimed at controlling the size of the resulting SuperPaths, so as to maintain a high measure of annotation specificity and minimize redundancy.

The following two steps were used in the clustering procedure, in which pathways were connected to each other to form SuperPaths. i) Preprocessing of very small pathways: pathways smaller than 20 genes were connected to larger pathways (<200 genes) with a content similarity metric of ≥0.9 relative to the smaller partner. ii) The main pathway clustering algorithm: this was performed using the Jaccard similarity coefficient (J) metric (31) (see Materials and Methods). We used a combination (cf. 26) of modified nearest neighbor graph generation with a threshold T1 and hierarchical clustering with a threshold T2 (Figure 4A and Materials and Methods).
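The preprocessing step can be sketched as a containment check relative to the smaller pathway. The reading below (small pathway of fewer than 20 genes, larger partner of fewer than 200 genes, overlap divided by the size of the smaller set) is an interpretation of the text; the function names, parameter names and toy gene sets are hypothetical.

```python
def containment(small, large):
    """Fraction of the smaller pathway's genes that also occur in the larger one."""
    return len(small & large) / len(small)


def find_absorptions(pathways, small_max=20, large_max=200, threshold=0.9):
    """Return (small_name, large_name) links for nearly contained small pathways."""
    links = []
    for s_name, s_genes in pathways.items():
        if not s_genes or len(s_genes) >= small_max:
            continue
        for l_name, l_genes in pathways.items():
            is_larger = len(s_genes) < len(l_genes) < large_max
            if is_larger and containment(s_genes, l_genes) >= threshold:
                links.append((s_name, l_name))
    return links


# Toy gene sets (illustrative only).
small = {f"g{i}" for i in range(10)}
large = {f"g{i}" for i in range(9)} | {f"x{i}" for i in range(20)}
other = {f"g{i}" for i in range(5)} | {f"y{i}" for i in range(30)}

links = find_absorptions({"small": small, "large": large, "other": other})
print(links)
```

In this toy case the small pathway shares 9 of its 10 genes with the large one, so the containment is 0.9 even though the Jaccard coefficient of the same pair is only 9/30 = 0.3, which is the size-difference effect this step is meant to address.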

To determine the optimal values of the thresholds T1 and T2, we defined two quantitative attributes of the clustering process. The first is US, the overall uniqueness of the set of SuperPaths. US elevation is the result of increasing pathway clustering, and reflects the gradual disappearance of redundancy, i.e. of cases in which certain gene sets are portrayed in multiple SuperPaths. The second parameter is IS, the overall informativeness of the set of SuperPaths. IS is a measure of how revealing a collection of SuperPaths is for annotating individual genes. It decreases with the extent of pathway clustering, reaching an undesirable minimum of one exceedingly large cluster, whereby identical SuperPath annotation is obtained for all genes. We thus sought an optimal degree of clustering whereby US + IS is maximized (Figure 4B and Materials and Methods).

Our procedure pointed to an optimum at T1 = 0.3 and T2 ≥ 0.5. Further fine tuning by data resampling suggested an optimal value of T2 = 0.7 (Figure 4C and Materials and Methods). This procedure resulted in the definition of 1073 SuperPaths, including 529 SuperPaths ranging in size from 2 to 70 pathways, and 544 singletons (one pathway per SuperPath) (Figures 3B and 5A). Each SuperPath had 3 ± 4.3 pathways (Figure 5A) and 82.7 ± 140.6 genes (Supplementary Figure S3A). The resultant set of SuperPaths indeed enhances the uniqueness US as depicted in Figure 5B.

Figure 5.

SuperPaths increase uniqueness while keeping high informativeness. (A) Number of pathways in hierarchical clustering versus SuperPath algorithm. The largest cluster with hierarchical clustering includes 1046 pathways, about 33% of the entire input, causing a great reduction of informativeness. In the SuperPath clustering the maximum cluster size is 70, about 2% of all pathways. (B) Increase in uniqueness (Us) following unification of pathways into SuperPaths.

The unification process resulted in relatively small changes in gene count distribution between the original pathways and the resultant SuperPaths (Supplementary Figure S3), suggesting a substantial preservation of gene groupings. Notably, applying pure hierarchical clustering (T1 = T2 = 0.3) resulted in a single very large cluster with 1046 pathways (Figure 3C) and with the same number of singletons, strongly deviating from the goal of specific pathway annotation for genes (Supplementary Figure S3B). This sub-optimal performance of pure hierarchical clustering is general; any of the examined cases of T1 = T2 (Figure 4B diagonal) shows a Us + Is value lower than that for T1 = 0.3, T2 = 0.7.

Each SuperPath is identified by a textual name derived from one of its constituent pathways, selected as the most connected pathway (hub) in the SuperPath cluster. For simplicity, the option of de novo naming was not exercised. The hub’s name, as opposed to that of the largest pathway, was chosen since this tends to enhance the descriptive value for the entire SuperPath. When more than one pathway has the same maximal number of connections, the larger one is chosen.
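The naming rule can be sketched in a few lines: count the connections of each pathway within the cluster, pick the most connected one, and break ties by gene count. The function name `superpath_name`, the edge list and the sizes are hypothetical inputs used for illustration.

```python
def superpath_name(members, edges, sizes):
    """Pick the hub: highest degree within the cluster, ties broken by size."""
    degree = {m: 0 for m in members}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return max(members, key=lambda m: (degree[m], sizes[m]))


# Toy cluster: B is connected to both A and C, so it is the hub.
hub = superpath_name(
    members=["A", "B", "C"],
    edges=[("A", "B"), ("B", "C")],
    sizes={"A": 50, "B": 30, "C": 80},
)

# Tie on degree: the larger pathway (C) is chosen.
tie_break = superpath_name(
    members=["A", "C"],
    edges=[("A", "C")],
    sizes={"A": 50, "C": 80},
)
print(hub, tie_break)
```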

### SuperPaths make important gene connections

One of the major implications of the process of SuperPath generation is elucidating new connections among genes. This happens because genes that were not connected via any pre-unification pathway become connected through belonging to the same SuperPath. The unification into SuperPaths is important in two ways: first, it brings, under one roof, pathway information from 12 sources, each individually contributing ∼9000 to ∼5 million instances of gene pairing, for a total of 7.3 million pairs (Supplementary Figure S4). Second, by unifying into SuperPaths, the number of gene pairs is further enhanced, reaching 8.3 million (Supplementary Figure S4).

To test the significance of the million new gene–gene connections resulting from SuperPath generation, we checked their correlation with two independent measures of gene pairing. First, a comparison was made to publications shared among gene pairs (Figure 6A). We found that for gene pairs appearing in a SuperPath but not in any of its constituent pathways, there is a 4- to 75-fold increase in instances of >20 shared publications when compared with random pairs of genes with pathway annotation. Added gene pairs have significantly more shared publications than those randomly paired. Second, we performed a similar analysis based on protein–protein interaction information. We found that for the SuperPath-implicated gene pairs there was a 4- to 25-fold increase of PPIs with score >0.2 (Figure 6B) when compared with controls. SuperPaths thus provide significant gene partnering information not conveyed by any of their 3215 constituent individual pathways. This may be seen when performing gene set enrichment analysis on 830 differential expression sets and comparing the scores of SuperPaths to those of their constituent pathways, demonstrating that SuperPaths tend to receive more significant scores compared with their constituent pathways’ average score (Figure 7A).

Figure 6.

SuperPath-specific gene pairs are informative. (A) Shared publications. SuperPath-specific gene pairs are genes connected only by SuperPaths and not by any of the contained pathways. Enrichment of 10–100 is seen in the high abscissa values. The two distributions are significantly different (Kolmogorov–Smirnov P < 10^−100). There were no random gene pairs with 80–90 publications; this point was treated as having one such publication for computing the ratio. (B) Protein–protein interactions. Experimental interaction score from STRING (32) as depicted in GeneCards (12), for SuperPath versus random gene pairs as in panel A. The two distributions are significantly different (Kolmogorov–Smirnov P < 2.8 × 10^−61).

Figure 7.

SuperPath integration attributes. (A) SuperPaths outperform their constituent pathways in significance scores across 830 differentially expressed gene sets. (B) Number of included sources in non-singleton SuperPaths.

### SuperPaths in databases

SuperPath information is available both in the GeneCards pathway section (Supplementary Figure S5A) and in PathCards (Supplementary Figure S5B; http://pathcards.genecards.org/), a GeneCards companion database presenting a web card for each SuperPath. PathCards allows the user a view of the pathway network connectivity within a SuperPath, as well as the gene lists of the SuperPath and of each of its constituent pathways. Links to the original pathways are available from the pathway database symbols, placed to the left of pathway names. PathCards has extensive search capacity, including finding any SuperPath that contains a search term within its included pathway names, gene symbols and gene descriptions. Multiple search terms are afforded, allowing fine-tuned results. The search results can be expanded to show exactly where in the SuperPath-related text the terms were found. The list of genes in a PathCard utilizes graded coloring to designate the fraction of included pathways containing this gene, providing an assessment of the importance of a gene in a SuperPath. Other features, including gene list sorting and a search tutorial, are under construction. PathCards is updated regularly, together with GeneCards updates. A new version is released 2–3 times a year.
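The quantity behind the graded coloring, i.e. the fraction of constituent pathways that contain each gene, can be computed as in the sketch below. The function name `gene_fractions` and the toy gene lists are illustrative assumptions and do not reflect the actual PathCards implementation or content.

```python
from collections import Counter


def gene_fractions(constituent_pathways):
    """Map each gene to the fraction of constituent pathways that contain it."""
    counts = Counter(gene for genes in constituent_pathways for gene in set(genes))
    n_pathways = len(constituent_pathways)
    return {gene: count / n_pathways for gene, count in counts.items()}


# Toy SuperPath with three constituent pathways (gene lists are made up).
fractions = gene_fractions([
    {"CS", "ACO2", "IDH1"},
    {"CS", "ACO2"},
    {"CS", "PDHA1"},
])

# Rank genes from most to least represented, breaking ties alphabetically.
ranked = sorted(fractions, key=lambda g: (-fractions[g], g))
print(ranked)
```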

## Discussion

### Pathway source heterogeneity

This study highlights substantial mutual discrepancies among different pathway sources, e.g. with regard to pathway sizes, names and gene contents. The world of human biological pathways consists of many idiosyncratic definitions provided by mostly independent sources that curate publication data and interpret it into sets of genes and their connections. The idiosyncratic view of the different pathway sources is exemplified by the variation in pathway size distribution among sources (Table 1, Figure 2D), where some sources have overrepresentation of large pathways (QIAGEN), while others have mainly small pathways (HumanCyc). In some cases, the large standard deviation in pathway size (Table 1) is easily explained, as exemplified in the case of Reactome, which provides hierarchies of pathways and therefore contains a spectrum of pathway sizes. However, large standard deviations of pathway size are also observed in KEGG and QIAGEN—sources that are not hierarchical by definition. On the other hand, some sources (e.g. HumanCyc, PID and PharmGKB) have very little variation in their pathway sizes, revealing their focus on pathways of particular size. The idiosyncratic view provided by different sources is also evident when examining the genes covered by each source (Figure 2A), where some genes in the gene space are covered by only one source. This causes the unfavorable outcome that when unifying pathways, irrespective of the algorithm chosen, there is a relatively high proportion of single source pathway clusters. In order to account for the drawback of the Jaccard index to cope with large size differences between pathways, we added a preprocessing step to unify pathways that are almost completely included within other pathways (≥0.9 gene content similarity of the smaller pathway), thereby diminishing the barrier of variable pathway size between sources. 

Previously published isolated instances of intersource discrepancies include the lack of pathway source consensus for the TCA cycle (3) and fatty acid metabolism (2). The authors of both papers stress that each of their pathway sources has only a partial view of the pathway. For the TCA cycle example (3) there is an attempt to provide an optimal TCA cycle pathway by identifying genes that appear in multiple sources, but such manual curation is not feasible for a collection of >3000 biological pathways. In our procedure, 11 relevant pathways from four sources are unified into a SuperPath entitled ‘Citric acid cycle (TCA cycle)’ (Supplementary Figure S5). PathCards then enables one to view which genes are more highly represented within the constituent pathways. Our algorithm thus mimics human intervention, and greatly simplifies the task of finding concurrence within and among pathway sources.

### Pathway unification

Combining several pathway resources has been attempted before, using different approaches. The first method is to simply aggregate all of the pathways in several knowledge bases into one database, without further processing. This approach is taken, for example, by NCBI’s Biosystems with 2496 human pathways from five sources (5) and by PathwayCommons with 1668 pathways from four sources (6). This was also the approach taken by GeneCards prior to the SuperPaths effort described here, where pathways from six sources were shown separately in every GeneCard. While this approach provides centralized portals with easy access to several pathway sets, it does not reveal interpathway relationships and may result in considerable redundancy. The second unification approach, taken by PathJam (7) and HPD (8), provides protein-versus-pathway tables as search output. This scheme allows useful comparisons as related to specific search terms, but is not leveraged into global analyses of interpathway relations. A third line of action is exemplified by ConsensusPathDB (9), which integrates information from 38 sources, including 26 protein–protein interaction compendia as well as 12 knowledge bases with 4873 pathways. This allows users to observe which interactions are supported by each of the information sources. In turn, hiPathDB (10) integrates protein interactions from four pathway sources (1661 pathways) and creates ad hoc unified superpathways for a query gene, without globally generating consolidated pathway sets. Finally, a fourth methodology is employed by Pathway Distiller (11), which mines 2462 pathways from six pathway databases, and subsequently unifies them into clusters of several predecided sizes between 5 and 500, using hierarchical clustering.

The third method of interaction mapping, taken by ConsensusPathDB and hiPathDB, differs conceptually from the fourth method of clustering: the interaction mapping method provides information on the specific commonalities and discrepancies in protein interactions among sources with regard to specific keywords or genes, while the clustering method suggests which of the pathways are similar enough to be considered for the same cluster. Therefore, the third and fourth methods are complementary approaches aimed at utilization of pathway information at different observation levels, where the fourth (clustering) method is independent of user input or search in the resultant consolidation. In the study described herein, we pursued a clustering method similar to the fourth methodology taken by Pathway Distiller, namely consolidation of pathways into clusters. However, in contrast to Pathway Distiller, our aim was to create a single coherent unification of biological pathways, which is essential for having a universal set of descriptors when looking at gene–gene relations. The resulting SuperPaths simplify the pathway-related descriptive space of a gene and reduce it 3-fold. Furthermore, the cutoffs in our algorithm are chosen to optimally adjust the criteria of uniqueness and informativeness, thereby reducing the subjective effect of choosing cutoffs arbitrarily or by predetermining the number of clusters.

### SuperPath generation

A crucial element in our SuperPaths generation method is the definition of interpathway relationships. We have opted for the use of gene content, as described by others (11, 33). One could also consider the use of pathway name similarity (11). However, among the 3215 pathways analysed here, only 79 names were shared by more than one pathway, implying that the efficacy of such an approach would have been rather limited. Further, Table 2 and Supplementary Figure S2 indicate a relatively weak concordance between pathway names and their gene content. Specifically, among 79 name-identical pathway groups, 52 remained incompletely unified, again suggesting a limited usefulness for unifying based on pathway names. Many resources, including ConsensusPathDB (9), facilitate the option of finding pathways based on keywords in the name. Name sharing is thus a relatively trivial task to overcome when trying to find similar pathways. The more challenging goal is finding pathways that are similar in the biological process that they convey.

In this article we treated pathways as sets of genes, using gene content as a comparative measure and omitting topology and small molecule information. This approach was previously advocated as a means of greatly reducing the complexity of pathway comparisons (34). Further, most sources used in this study provide only the gene set information, hence topology information was unavailable. Finally, the high concordance between significance of pathway alignment and Jaccard coefficients ≥0.3 (P < 10⁻⁵²) indicates that the Jaccard coefficient is a good approximation of the more elaborate pathway alignment procedure (25).
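To make the gene-content measure concrete, the following is a minimal sketch of the Jaccard similarity coefficient between two pathways treated as gene sets. The function name `jaccard` and the two example gene lists are illustrative assumptions, not data or code from the study.

```python
def jaccard(genes_a, genes_b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| for two gene sets."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Two empty sets share nothing; return 0.0 instead of dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical pathways; the gene lists are illustrative only.
pathway_x = {"TP53", "BRCA1", "ATM", "CHEK2"}
pathway_y = {"TP53", "ATM", "CHEK2", "MDM2", "CDKN1A"}

score = jaccard(pathway_x, pathway_y)
print(f"J = {score:.2f}")
```

Here the two sets share three genes out of six distinct genes in total, so this pair would lie above the 0.3 level mentioned in the text.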

### SuperPath utility

A central aim of pathway source unification is enhancing the inference of gene-to-gene relations needed for pathway enrichment scrutiny (32, 35–40). To this end, we developed an algorithm for pathway clustering so as to optimize this inference and at the same time minimize redundancy.

Extending pathways into SuperPaths affords three major advantages. The first is augmenting the gene grouping used for such inference. Indeed, SuperPaths have slightly larger sizes than the original pathways, as evident from the SuperPath size distribution (Figure 2D). Nevertheless, comparing SuperPaths to pseudo-SuperPaths of the same size and quantity clearly shows that the increase in size does not account for the addition of true positive gene connections, as evident from the higher PPIs and larger counts of shared publications for SuperPath gene pairs (Figure 6). Consequently, it is not surprising that SuperPaths outperform their average pathway constituent’s enrichment analysis scores (Figure 7A). SuperPaths are currently used in two GeneCards-related novel tools, VarElect (http://varelect.genecards.org/) and GeneAnalytics (http://geneanalytics.genecards.org/). A second advantage of SuperPaths is in the reduction of redundancy, since they provide a smaller, unified pathway set, and thus diminish the necessary statistical correction for multiple testing. We note that ConsensusPathDB (9) also provides an intersource integrated view of interactions. However, gene set analysis in ConsensusPathDB is only allowed for pathways as defined by the original sources. Finally, a third advantage of SuperPaths is their ability to rank genes within a biological mechanism via the multiplicity of constituent pathways within which a gene appears. This can be used not only to gain better functional insight but also to help eliminate suspected false-positive genes appearing in a minority of the pathway versions. A capacity to view such gene ranking is available within the PathCards database.
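The gene ranking described above, by the number of constituent pathways in which a gene appears, can be sketched as follows. This is an illustrative assumption of how such a ranking might be computed, not the PathCards implementation; the function name `rank_genes`, the source labels and the gene lists are hypothetical.

```python
from collections import Counter


def rank_genes(constituent_pathways):
    """Rank genes by how many constituent pathways of a SuperPath contain them.

    constituent_pathways: dict mapping a pathway label to an iterable of genes.
    Returns (gene, count, fraction) tuples, most widely shared genes first;
    ties are broken alphabetically so the order is deterministic.
    """
    counts = Counter()
    for genes in constituent_pathways.values():
        counts.update(set(genes))  # count each gene once per pathway
    total = len(constituent_pathways)
    ranked = [(gene, n, n / total) for gene, n in counts.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical SuperPath with three constituent pathways (illustrative genes).
superpath = {
    "source_A": {"CASP3", "CASP9", "BAX"},
    "source_B": {"CASP3", "BAX", "BCL2"},
    "source_C": {"CASP3", "TNF"},
}

ranking = rank_genes(superpath)
for gene, n, fraction in ranking:
    print(f"{gene}\t{n}\t{fraction:.2f}")
```

Genes that appear in only one of the constituent pathways fall to the bottom of such a list, which is where suspected false positives would be expected to sit.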

## Limitations of SuperPaths

The SuperPaths generation procedure appears incomplete, as about half of all SuperPaths are ‘singleton SuperPaths’ (labelled accordingly in PathCards), each having only one constituent pathway. This is an outcome of the specific cutoff parameters used. However, this provides a useful indication to the user that a singleton pathway is distinct, differing greatly in its constituent genes from any other pathway.

This SuperPath generation process is intended to reduce redundancies and inconsistencies found when analysing the unified pathways. Although SuperPaths increase uniqueness as compared with the original pathway set (Figure 5B), some redundancy and inconsistency still remain within SuperPaths. There are cases of pathways with similar names, which do not get unified into the same SuperPath. This happens because they have not met the unification criteria employed. We also note that similarity in name does not always indicate similarity in gene content (Figure 2B and C, Supplementary Figure S2B), and such events are faithfully conveyed to the user.

A clarifying example is that of the 40 pathways whose names include the string ‘apoptosis’. The final post-unification list has 10 SuperPaths whose name includes ‘apoptosis’. This obviously provides the user with a greatly simplified view of the apoptosis world. Yet, at the same time the outcome is replete with instances of two name-similar pathways being included in different SuperPaths. Employing a more stringent algorithm would result in over-clustering, which would in turn reduce informativeness (see Figure 3C).

In parallel, there are pathways with overlapping functions that are not consolidated into one SuperPath. For example, the pathway ‘integrated breast cancer pathway’ does not unify with the pathways ‘DNA repair’ and ‘DNA damage response pathway’, despite the strong functional relation of breast cancer with DNA damage and repair (41). This is because the relevant gene content similarity in the original pathway sources is small, respectively, J = 0.03 and 0.13. The need to view information on pathways with low pairwise similarity is addressed in Supplementary Figure S6, and is available as a text file upon request.

Finally, when looking at the number of contributing sources per SuperPath (Figure 7B), it is evident that the majority of SuperPaths are composed of either one or two sources, and no SuperPath includes more than five. Although this integration limitation is evident, it mainly arises from the inherent biases in gene coverage for the different information sources (Figure 2A).

### PathCards

Biological pathway information has traditionally been a central facet of GeneCards, the database of human genes (12, 42, 43). In previous versions, pathways were presented separately for each of the pathway sources, and it was difficult for users to relate the separate lists to each other. As a result of the consolidation into SuperPaths described herein, this problem has been effectively addressed. Thus, in every GeneCard, a table portrays all of a gene’s SuperPaths, each with its constituent pathways, with links to the original sources (Supplementary Figure S5A).

GeneCards is gene-centric and inherently does not present (Super) pathway-centric annotations. We therefore developed PathCards (http://pathcards.genecards.org/), a database that encompasses and displays such information in greater detail. PathCards has a page for every SuperPath, showing the connectivity of its included pathways, as well as gene lists for the SuperPath and its pathways. For every SuperPath, we also show a STRING gene interaction network (32) for the entire gamut of constituent genes, providing perspective on topological relationships within the SuperPath.

## Supplementary Data

Supplementary data are available at Database Online.

## Funding

This research is funded by grants from LifeMap Sciences Inc. California (USA) and the SysKid—EU FP7 project (number 241544). Support is also provided by the Crown Human Genome Center at the Weizmann Institute of Science. Funding for open access charge: LifeMap Sciences Inc. California (USA).

Conflict of interest. None declared.

## Acknowledgements

We thank Prof. Eitan Domany and Prof. Ron Pinter for helpful discussions, as well as Dr. Noa Rappaport and Dr. Omer Markovich for assistance with clustering and visualization.

## Footnotes

• Citation details: Belinky, F., Nativ, N., Stelzer, G., et al. PathCards: multi-source consolidation of human biological pathways. Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

# The MIPS Mammalian Protein-Protein Interaction Database

The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.

## Search the database

To suit different users' needs we provide a variety of interfaces to search the database:

## Background

Protein-protein interactions (PPI) represent a pivotal aspect of protein function. Almost every cellular process relies on transient or permanent physical binding of two or more proteins in order to accomplish the respective task. Comprehensive databases of PPI in Saccharomyces cerevisiae have proved to be invaluable resources for both bioinformatics and experimental research and are used heavily in the scientific community.

Although yeast is a well-established model organism, not all interactions in higher eukaryotes have equivalent counterparts in unicellular systems. Currently, publicly available PPI databases contain comparatively few entries from mammals, so we embarked on building a high-quality, manually curated database of protein-protein interactions in mammals.

## Conditions of use

You are free to use the database as you please including full download of the dataset for your own analyses as long as you cite the source properly:

Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stümpflen V, Mewes HW, Ruepp A, Frishman D
The MIPS mammalian protein-protein interaction database
Bioinformatics 2005; 21(6):832-834; [Epub 2004 Nov 5]   doi:10.1093/bioinformatics/bti115

## Other PPI resources

There are plenty of interesting databases and other sites on protein-protein interactions. Currently we are aware of the following PPI resources:

| Resource | Description |
| --- | --- |
| APID | Agile Protein Interaction DataAnalyzer (Cancer Research Center, Salamanca, Spain) |
| BIND | Biomolecular INteraction Network Database at the University of Toronto, Canada. No species restriction. |
| CYGD | PPI section of the Comprehensive Yeast Genome Database. Manually curated comprehensive S. cerevisiae PPI database at MIPS |
| DIP | Database of Interacting Proteins at UCLA. No species restriction. |
| GRID | General Repository for Interaction Datasets. Mount Sinai Hospital, Toronto, Canada |
| HIV Interaction DB | Interactions between HIV and host proteins. |
| HPRD | The Human Protein Reference Database. Institute of Bioinformatics, Bangalore, India and Johns Hopkins University, Baltimore, MD, USA. |
| HPID | Human Protein Interaction Database. Department of Computer Science and Information Engineering, Inha University, Inchon, Korea |
| iHOP | iHOP (Information Hyperlinked over Proteins). Protein association network built by literature mining |
| IntAct | Protein interaction database at EBI. No species restriction. |
| InterDom | Database of putative interacting protein domains. Institute for InfoComm Research, Singapore. |
| JCB | PPI site at the Jena Centre for Bioinformatics, Germany |
| MetaCore | Commercial software suite and database. Manually curated human PPIs (among other things). GeneGo |
| MINT | Molecular INTeraction database at the Centro di Bioinformatica Moleculare, Universita di Roma, Italy. |
| MRC PPI links | Commented list of links to PPI databases and resources maintained at the MRC Rosalind Franklin Centre for Genomics Research, Cambridge, UK |
| OPHID | The Online Predicted Human Interaction Database. Ontario Cancer Institute and University of Toronto, Canada. |
| Pawson Lab | Information on protein-interaction domains. |
| PDZbase | Database of PDZ mediated protein-protein interactions. |
| Predictome | Predicted functional associations and interactions. Boston University. |
| Protein-Protein Interaction Server | Analysis of protein-protein interfaces of protein complexes from PDB. University College of London, UK. |
| PathCalling | Proteomics and PPI tool/database. CuraGen Corporation. |
| PIM | Hybrigenics PPI data and tool, H. pylori. Free academic license available |
| RIKEN | Experimental and literature PPIs in mouse. |
| STRING | Protein networks based on experimental data and predictions at EMBL. |
| YPD | “BioKnowledge Library” at Incyte Corporation. Manually curated PPI data from S. cerevisiae. Proprietary. |

If we forgot to list your favorite PPI resource or you are providing one yourself please let us know – we will be happy to include it.

### PPI related software

| Software | Description |
| --- | --- |
| aiSee | Commercial graph layout software |
| Cytoscape | Open source software for visualization of PPI networks and data integration |
| graphviz | Graph layout software |

You can get the full dataset here (PSI-MI format).

## Acknowledgements

This work is funded by a grant from the German Federal Ministry of Education and Research. It is part of the initiative “Bioinformatics for the Functional Analysis of Mammalian Genomes” (BFAM).

# An evaluation of human protein-protein interaction data in the public domain

BMC Bioinformatics 2006, 7(Suppl 5):S19

DOI: 10.1186/1471-2105-7-S5-S19

Published: 18 December 2006

## Abstract

### Background

Protein-protein interaction (PPI) databases have become a major resource for investigating biological networks and pathways in cells. A number of repositories for human PPIs are currently publicly available. Each of these databases has its own unique features, with large variation in the type and depth of annotations.

### Results

We analyzed the major publicly available primary databases that contain literature curated PPI information for human proteins. This included BIND, DIP, HPRD, IntAct, MINT, MIPS, PDZBase and Reactome databases. The number of binary non-redundant human PPIs ranged from 101 in PDZBase and 346 in MIPS to 11,367 in MINT and 36,617 in HPRD. The number of genes annotated with at least one interactor was 9,427 in HPRD, 4,975 in MINT, 4,614 in IntAct, 3,887 in BIND and <1,000 in the remaining databases. The number of literature citations for the PPIs included in the databases was 43,634 in HPRD, 11,480 in MINT, 10,331 in IntAct, 8,020 in BIND and <2,100 in the remaining databases.

### Conclusion

Given the importance of PPIs, we suggest that submission of PPIs to repositories be made mandatory by scientific journals at the time of manuscript submission as this will minimize annotation errors, promote standardization and help keep the information up to date. We hope that our analysis will help guide biomedical scientists in selecting the most appropriate database for their needs especially in light of the dramatic differences in their content.

## Background

Protein-protein interactions (PPI) are essential for almost all cellular functions. Proteins seldom carry out their function in isolation; rather, they operate through a number of interactions with other biomolecules. Experimental elucidation and computational analysis of the complex networks formed by individual protein-protein interactions (PPIs) is one of the major challenges in the post-genomic era. PPI databases have thus become valuable resources for the systematic analysis of the molecular networks of a cell [1, 2]. With the accumulation of PPIs from high-throughput experiments, it is increasingly important to store such data for easy retrieval and analysis [3]. Several databases have compiled protein interactions based on manual curation of the scientific literature, automated text mining of articles or computational predictions. In this review, various features of nine different databases are evaluated, including compliance with emerging data standards such as proteomics standards initiative – molecular interaction (PSI-MI) format [4] and BioPAX [5], which define a unified framework for sharing PPI and pathway information, respectively.

### Human protein-protein interaction databases

Protein interaction repositories can be broadly classified into 2 types based on their content: i) those containing interactions supported by experimental evidence, and ii) those containing interactions derived from in silico predictions, either alone or mixed together with experimentally derived PPIs. Here, we evaluate only those databases that exclusively contain experimentally derived PPI data in humans.

Curated literature based repositories have two major mechanisms of incorporating PPIs supported by experimental validation: i) curation by biologists from the literature, or, ii) direct deposit of the experimentally derived PPIs prior to publication by an investigator. Currently, the majority of PPIs in most databases are from curation of the literature. If all scientific journals mandated that PPIs be submitted to repositories as a requirement for publication (as is currently the case with nucleotide sequences), the databases would not only become more comprehensive but perhaps also contain fewer annotation errors. Below, we will briefly describe salient features of nine major PPI databases.

#### Human Protein Reference Database (HPRD)

HPRD contains annotations pertaining to human proteins based on experimental evidence from the literature [6, 7]. This includes PPIs as well as information about post-translational modifications, subcellular localization, protein domain architecture, tissue expression and association with human diseases. In addition to interactions of proteins with other proteins, HPRD also reports interactions of proteins with nucleic acids and small molecules. The PPI data is subclassified as binary or complex interactions based on topology and the number of participants. Binary PPIs are direct interactions between two proteins, while complexes represent interactions with more than 2 participants in which the topology of interaction is unknown. Relevant publications are cited for each interaction. The type of experiment is also indicated as in vivo (e.g. coimmunoprecipitation), in vitro (e.g. GST pull-down assays) or yeast two-hybrid. Information about post-translational modifications includes the residue of modification, type of experiment and the upstream enzyme. These modifications can be viewed alongside the protein domain architecture. Each protein is linked to a genome browser, GenProt Viewer [8], which allows protein and transcript information to be visualized in the context of the relevant gene. HPRD is also linked to a compendium of signal transduction pathways, NetPath [9], which is freely available in several different formats. This database includes a tool called PhosphoMotif Finder, which reports the presence of any of over 320 phosphorylation-based motifs curated from the literature in a protein of interest. HPRD also incorporates a new feature, Protein Distributed Annotation System (PDAS), which allows researchers to contribute and share their data with the rest of the community. All interaction information can be downloaded from the website either in PSI-MI format or as tab-delimited files.

#### IntAct

The PPI information in the IntAct database includes a brief description of the interaction, experimental method and the literature citation of human proteins as well as proteins derived from several other species [10, 11]. Whenever possible, PPI information is isoform specific. The database can be accessed by either a basic or advanced search. The latter provides the user with additional querying options such as experimental method or controlled vocabulary terms listed in PSI-MI. IntAct also has a tool which predicts the best baits for pull-down experiments in humans by prioritizing the proteins which have the highest likelihood of being highly connected, or hubs, based on the available data within IntAct for various species; this is termed the Pay-As-You-Go algorithm. Additional software developed as part of the IntAct project includes HierarchView, which depicts interaction networks as 2-dimensional graphs and highlights nodes based on a GO category specified by the user (e.g. cellular component).

#### Molecular INTeraction database (MINT)

MINT is a repository of experimentally verified protein interactions with special emphasis on mammalian interactions [12, 13]. It also features interactions involving non-protein entities such as promoter regions and mRNA transcripts. PPI information includes binary and complex interactions and is isoform specific. Each interaction is given a confidence score based on the number of interactions, the type of experiment and the number of citations provided for each interaction. The interactors can be viewed graphically using the ‘MINT Viewer,’ which permits users to view interactors as a network, and to manipulate it such that only the proteins of interest are shown. Users can expand the network by dragging individual interactors, select and visualize PPIs based on confidence scores, and export the data in flat files, in PSI-MI format or to Osprey, a system developed for visualizing and manipulating network data [14]. The interaction data are displayed along with the corresponding Swiss-Prot annotation. Proteins with a role in genetic diseases (according to OMIM, Online Mendelian Inheritance in Man) are further highlighted. MINT features a separate annotation of human PPIs called HomoMINT, which includes, in addition to literature-derived data, information from other organisms mapped to their human orthologs.

#### Database of Interacting Proteins (DIP)

PPI data stored in DIP were obtained through manual curation of the scientific literature and include direct and complex interactions [15, 16]. The JDIP is a Java application based visualization tool; it provides a graphical representation of interactions. New high-throughput experimental and predicted PPI data can be evaluated through other services provided by DIP such as Paralogous Verification Method (PVM), Expression Profile Reliability (EPR) [17] and Domain Pair Verification (DPV) [18]. PVM validates interacting pairs by showing the existence of paralogous interactions; EPR validates comparison based on common expression profiles of interactors and DPV validates through domain-domain interaction preferences. Other satellite projects, Live-DIP and DLRP, use the DIP database for accessing the interactions. Live-DIP annotates proteins under different physiological conditions [19] whereas DLRP annotates protein-ligand and protein-receptor pairs known to interact with each other [20].

#### MIPS Database

The MIPS database consists of mammalian interaction data manually curated from the literature [21, 22], and includes experiment type, description of the interaction and binding regions of interacting partners (where available). Data from mass spectrometry and yeast two-hybrid studies are not included. PPIs can be queried based on interaction partners, experimental method, and functional aspects of the PPIs. The results can be retrieved in 2 formats, long and short. The long format details the interaction, including reference, experimental details, binding sites for each protein and a short comment on each interaction, its functional significance or the immediate outcome of the interaction. The short format is restricted to listing the interacting proteins. Both formats are also linked to visualization tools. Each protein is further linked to the corresponding annotation in the mouse PEDANT genome database developed by the same group, which contains pre-computed bioinformatics analyses of publicly available genomes [23].

#### Alliance For Cellular Signaling (AfCS)

The AfCS is a multidisciplinary, multi-institutional consortium that studies cellular signaling [24, 25]. “Molecule Pages” in the AfCS database provide qualitative and quantitative information on signaling molecules (mostly murine) and their interactions; these include results of experiments carried out by the Alliance in addition to literature-derived data. The molecule pages contain automated as well as author-entered data. The former integrate DNA/protein sequence information and structural details along with basic biophysical and biochemical properties from external databases, whereas the latter consist of data manually curated from the literature. This is further assessed by AfCS-appointed editorial board members and anonymously peer-reviewed in a process established by the Nature Publishing Group. The curated data include a textual description of protein function, regulation of activity, subcellular localization, major sites of expression, splice variants and phenotype of knockout animals. The interaction data are derived from murine proteins, or, if they are from other species, the interaction is mapped to the corresponding mouse orthologs. For some proteins, the annotations include descriptions of signaling molecules under different physiological conditions termed ‘states’ (e.g. binding of a phosphorylated protein with another protein). A number of signaling pathway maps are also available in this database. We have not considered this database in our comparison mainly because of its focus on murine, and not human, proteins.

#### Biomolecular Interaction Network Database (BIND)

BIND is a database of biomolecular associations that are classified into 3 categories: binary molecular interactions, molecular complexes and pathways [26, 27]. In BIND, a molecular complex is a collection of two or more molecules that associate to form a functional unit in a cell. These records are supplemented with additional information such as complex topology and the number of subunits involved in the interaction. Pathways are a collection of two or more interactions that occur in a defined sequence within a living system; currently 8 pathways have been annotated. Data pertaining to 1473 organisms are available in BIND. Information on molecular associations is obtained from the literature. The majority of the interactions in BIND are PPIs although it includes some interactions with nucleic acids and small molecules as well. The function of proteins is depicted using ontoglyphs, a series of symbolic characters representing a high-level summary of Gene Ontology (GO) information, and proteoglyphs, symbols used to represent the structural and binding properties of proteins at the level of conserved domains. Data in BIND can be queried using various database identifiers or by a BLAST search. BIND also stores biomolecular interactions for several other species. For yeast high-throughput PPI datasets, BIND provides a confidence measure based on text mining of publications, existence of homologous interactions, common and related GO annotations, domain composition and phenotypic profiling for the evaluation. The data can be downloaded in flat file and PSI-MI formats and the pathways can be exported to ‘sif’ format, which allows visualization by Cytoscape, a software tool developed for visualization and manipulation of pathway data [28]. BIND offers a Simple Object Access Protocol (SOAP) interface for those who wish to access the data from third-party software. BIND also has data imports from FlyBase, MIPS, MGI, etc., and entries can be queried through various sources (e.g. Wormbase and KEGG).

#### Reactome

Reactome is a curated knowledgebase of biological pathways [29, 30]. The goal of Reactome is to develop a curated resource of pathways and biochemical reactions in humans; however, many of the reactions are also obtained via transfer from other species. The basic unit of this database is a reaction. Information on reactions is either derived from experiments in the literature or is an electronic inference based on sequence similarity. Reactions are also inferred in humans based on the putative human orthologs for the proteins that participate in the same reaction in other species. In such cases, the model organism reaction is annotated in Reactome, the inferred human reaction is annotated as a separate event, and the inferential link between the two reactions is explicitly noted. Each reaction is detailed with input, output, preceding and following events of the reaction, cellular component of the reaction and species of its occurrence. Each reaction is linked to pathways according to the order of reactions in the corresponding pathway. The available pathways are integrated and represented graphically as a series of constellations in a ‘starry sky.’ This can be used to navigate through the reactions in biological pathways and visualize connections between them. It must be cautioned that the definition of PPIs in Reactome is quite broad: the interactions can be represented as ‘direct complex,’ ‘indirect complex,’ ‘reaction’ or ‘neighboring reaction.’ In a ‘direct complex,’ interactions occur between proteins present in the same complex and are not true pairwise interactions. ‘Indirect complexes’ contain interactions between interactors in different subcomplexes of a complex. ‘Reactions’ are interactions between proteins that participate in a reaction and the interactors are not reported to be in a complex. ‘Neighboring reactions’ represent the interactors that participate in 2 consecutive reactions, i.e. when one reaction produces a product, which is either an input or a catalyst for another reaction. The information is edited by the Reactome staff at Cold Spring Harbor Laboratory and the European Bioinformatics Institute and is then reviewed by other biological researchers for consistency and accuracy. Each reaction or pathway can be exported to Systems Biology Markup Language (SBML) and BioPAX formats. Reactome also provides tools such as Pathfinder and Skypainter. Pathfinder can identify pathways that connect input with output molecules while Skypainter allows the coloring of reaction maps based on user-specified identifiers that have been linked to each pathway. For our analysis, we have considered only the ‘direct complexes’ as they are the category most likely to correspond to true PPIs.

#### PDZBase

PDZBase is a database that focuses only on PPIs involving proteins with PDZ domains [31, 32]. Only those interactions involving the PDZ domain that have been confirmed by individual in vitro or in vivo biochemical experiments are considered. Thus, interactions discovered solely through high-throughput methods (e.g. yeast two-hybrid or mass spectrometry) are not included in PDZBase. PDZ domains and their ligands can be queried using sequence motifs. Each interaction in PDZBase consists of the residues of the interacting proteins on a 2D-diagram generated by a residue-based-diagram-editor (RBDG). The interacting residues between the PDZ domain and their peptide ligands are predicted based on similarity with the available structures of PDZ-peptide complexes.

### Strategy used for comparison of datasets

The datasets were downloaded from the download sites of PPI databases on October 2, 2006 and scripts were used for parsing out the protein pairs involved in PPIs along with the experiment type and literature references, if provided. The PPIs were further parsed to extract binary interactions for those protein pairs where both proteins were human. Most databases had Swiss-Prot as one of their accession identifiers except BIND, which provided RefSeq, GenBank and PDB identifiers. To determine the overlap among databases, the Swiss-Prot or RefSeq identifiers were mapped to the corresponding Entrez Gene identifiers as of October 2, 2006. Scripts were used to convert these PPIs into a non-redundant list of PPIs (if proteins A and B interact, the dataset may have two PPIs, A-B and B-A; only one of the two was retained for our analyses). All datasets were compared with each other to obtain the overlap at PPI and protein levels. Experiment types extracted for PPIs were mapped to the PSI-MI vocabulary list. Disease annotations for genes were obtained from OMIM and mapped to gene symbols to obtain the number of proteins in PPIs corresponding to disease-associated genes.
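The non-redundancy and overlap steps described above can be sketched as follows. This is not the authors' script: the function name `nonredundant_pairs`, the accession strings and the gene identifiers are hypothetical, and skipping pairs with an unmapped accession is an assumption made here for illustration.

```python
def nonredundant_pairs(pairs, accession_to_gene):
    """Collapse protein-level PPIs into a non-redundant set of gene-level pairs.

    pairs: iterable of (accession_a, accession_b) tuples.
    accession_to_gene: dict mapping protein accessions to gene identifiers.
    A-B and B-A collapse to one entry because each pair is stored sorted.
    Pairs with an unmapped accession are skipped (an assumption of this sketch).
    """
    result = set()
    for acc_a, acc_b in pairs:
        gene_a = accession_to_gene.get(acc_a)
        gene_b = accession_to_gene.get(acc_b)
        if gene_a is None or gene_b is None:
            continue
        result.add(tuple(sorted((gene_a, gene_b))))
    return result


# Hypothetical accession-to-gene mapping; two isoforms map to gene 1.
mapping = {"ACC_A": 1, "ACC_A_iso2": 1, "ACC_B": 2, "ACC_C": 3}

db1 = [("ACC_A", "ACC_B"), ("ACC_B", "ACC_A"), ("ACC_C", "ACC_C")]
db2 = [("ACC_A_iso2", "ACC_B"), ("ACC_B", "ACC_C"), ("ACC_X", "ACC_A")]

nr1 = nonredundant_pairs(db1, mapping)
nr2 = nonredundant_pairs(db2, mapping)
overlap = nr1 & nr2  # PPIs present in both datasets

print(sorted(nr1), sorted(nr2), sorted(overlap))
```

Storing each pair as a sorted tuple keeps homodimers such as (3, 3) intact while treating the two orientations of a heterodimer as the same interaction.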

### Caveats of comparing PPI data

Assessment of the accuracy of annotation of all PPIs in various publicly available databases is beyond the scope of this article. In this study, we have tried to evaluate parameters that could be measured objectively. Nevertheless, there are still a number of caveats of any analysis comparing PPIs. Below is a list of some of the potential pitfalls and our strategies to tackle them.

1. Binary interactions including homodimers were considered for this analysis while complex interactions were not. It is not easy to look at complex interactions across databases, especially for comparison purposes, although ‘spoke’ and ‘matrix’ models have been described previously for comparing protein complexes [33]. In this study, we have chosen not to compare the complex interactions because of the predictive nature of these models. However, cases where a protein complex was already converted into binary PPIs by using one of these models (e.g. use of the ‘matrix’ model to computationally predict PPIs in Reactome) were treated as binary interactions.

2. Some of the binary interactions involved proteins that were non-human. Mapping of orthologs is not an easy task and is not standardized. Thus, we did not attempt to map the human orthologs for proteins from any other species that were listed as interacting proteins.

3. We mapped all protein isoforms to a unique gene and then examined the overlaps. This was done because often a given isoform is annotated as an interacting protein although the interaction is not specific to that isoform. For example, this strategy allowed us to correctly capture PPIs as overlapping where a given protein was annotated as interacting with one isoform of another protein in one database and with another isoform of that protein in another database.
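The effect of the isoform-to-gene mapping in the third caveat can be shown with a toy example. The sketch below is only an illustration under assumed names: the isoform labels, the `isoform_to_gene` table and the two single-entry "databases" are hypothetical.

```python
def gene_level(ppis, isoform_to_gene):
    """Rewrite isoform-level pairs as unordered gene-level pairs."""
    return {
        frozenset((isoform_to_gene[a], isoform_to_gene[b]))
        for a, b in ppis
    }


# Hypothetical isoforms: X-1 and X-2 are products of gene X.
isoform_to_gene = {"X-1": "X", "X-2": "X", "Y-1": "Y"}
db1 = {("Y-1", "X-1")}  # database 1 annotates isoform 1 of X
db2 = {("X-2", "Y-1")}  # database 2 annotates isoform 2 of X

isoform_overlap = {frozenset(p) for p in db1} & {frozenset(p) for p in db2}
gene_overlap = gene_level(db1, isoform_to_gene) & gene_level(db2, isoform_to_gene)
print(len(isoform_overlap), len(gene_overlap))
```

At the isoform level the two entries look unrelated, whereas at the gene level they are recognized as the same PPI.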

## Results and Discussion

### Comparison of PPI data

Table 1 summarizes the salient features of each database including total number of PPIs, total number of proteins, method of detection of PPIs, curation methodology, download options and URL links. The availability of data as a downloadable file is also indicated. Fig. 1A shows the distribution of the number of PPIs in each of the literature-based curated databases considered in our analysis. For each database, the total number of human PPIs present in the statistics page or in the downloaded files is shown along with the number of unique (non-redundant) binary human PPIs calculated by us. For this calculation, we only considered binary PPIs in which both members of an interacting pair were human proteins. As explained above, protein complexes were excluded from this analysis because it is difficult to ascertain the topology (i.e. which protein interacts with which protein in a complex) for determining overlap between datasets. The difference between the total and non-redundant PPIs in HPRD is due to protein complexes, whereas in all other databases it is mainly due to the redundancy of PPIs. The distribution of PPI data (Fig. 1A) shows a dramatic variation across these databases.

It is difficult to directly assess the depth of PPIs based on total interactions alone; thus, we analyzed the distribution of the number of proteins in each database according to the number of binary (i.e. direct) interactions per protein. The majority of proteins in all databases have <10 interaction partners (Fig. 1B). The number of PPIs that fall under the 31–40 and 41–50 PPI bins is high in the HPRD and Reactome databases. Although these PPIs are distributed across many types of proteins in HPRD, those in Reactome belong mainly to two classes: proteasomal or ribosomal protein complexes. The number of interactions for these two classes of proteins in Reactome is high because a ‘matrix’ model of interpreting protein complexes is used, in which all proteins are considered connected to all proteins within a complex. All other databases show the same trend, with a greater number of proteins in bins with a lower number of PPIs per protein. This does not automatically imply that most proteins truly interact with a small number of interactors. Rather, this is likely due to the fact that not all proteins have been studied thoroughly and because not all published interactions have yet been included in these databases. Additionally, there is a bias of experimental methods in capturing all interactions (e.g. the yeast two-hybrid system does not generally detect interactions involving integral membrane proteins). Overall, most databases contain a very small number of proteins with >30 PPIs.
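To see why the ‘matrix’ model inflates the number of interactions per protein, it helps to expand a single complex under both the ‘matrix’ and ‘spoke’ interpretations. The following Python sketch is a generic illustration, not code from any of the databases; the function names and the five-member complex `P1`–`P5` are hypothetical.

```python
from itertools import combinations


def matrix_model(members):
    """Connect every member of a complex to every other member."""
    return {tuple(sorted(pair)) for pair in combinations(sorted(set(members)), 2)}


def spoke_model(bait, preys):
    """Connect only the bait to each co-purified protein."""
    return {tuple(sorted((bait, prey))) for prey in set(preys) if prey != bait}


def degrees(ppis):
    """Count the number of binary interactions per protein."""
    counts = {}
    for a, b in ppis:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


# Hypothetical five-member complex with P1 as the bait.
complex_members = ["P1", "P2", "P3", "P4", "P5"]
matrix = matrix_model(complex_members)
spoke = spoke_model("P1", complex_members[1:])
print(len(matrix), degrees(matrix))
print(len(spoke), degrees(spoke))
```

Under the matrix model a complex of n proteins yields n(n − 1)/2 pairs and gives every member n − 1 partners, so a single large complex is enough to place all of its subunits in the high-degree bins. Under the spoke model only the bait accumulates partners.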

### Comparison of proteins annotated with PPIs

We looked for the total number of unique genes represented in the PPI databases (Fig. 2A). In HPRD, proteins encoded by 9,427 genes have at least one direct PPI annotated (out of ~20,000 proteins annotated in this database) while BIND, IntAct and MINT contain 3,887, 4,614 and 4,975 proteins, respectively. Other databases such as DIP, Reactome, MIPS and PDZ Base contain PPIs for <1,000 proteins.

### Proteins encoded by disease-associated genes in PPIs

PPIs are attractive as potential targets for small-molecule drugs for treatment of diseases. We checked for proteins encoded by genes listed in the OMIM database that are mutated in inherited genetic disorders (Fig. 2B). HPRD has all human disease-associated genes listed in OMIM, of which 1,463 have at least one protein interactor, while most of the other databases contain significantly fewer proteins encoded by these genes.

### Overlap of PPIs and proteins between databases

As discussed above, there is a significant difference in the total number of PPIs in the various databases. However, this statistic does not provide an idea of the extent to which the PPIs actually overlap across databases. As shown in Fig. 3A, HPRD contains a high proportion of human PPIs that are present in other literature-derived curated databases. The overlap between IntAct (10,244 PPIs) and MINT (11,367 PPIs) is 7,362, which is the highest overlap among the remaining literature-derived databases; the overlap between BIND (6,621 PPIs) and MINT (11,367 PPIs) is only 1,463 and there is no overlap between PDZBase and DIP.

To determine whether the overlap is small because of proteins not being annotated in different databases, we looked at the overlap at the protein level between databases. As shown in Fig. 3B, the overlap of proteins between BIND (3,887 proteins) and IntAct (4,614 proteins) is 1,969 but the overlap at PPI level is only 1,167. HPRD contains 76% and MINT contains 51% of proteins in Reactome, although there is a very low overlap at the level of PPIs across these databases. Overall, although at protein level there is a good overlap between the databases, the PPIs do not overlap as much. The average degree (K) of a protein, i.e. the number of interactions that a protein has with other proteins, is 7.6 for HPRD, while that for MIPS, PDZ Base, DIP, BIND, MINT and IntAct ranges from 1.7 to 4.5. Strikingly, the average degree of a protein in Reactome is 12.2, which is because of the interpretation of protein complexes through the ‘matrix’ model as explained above.
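The distinction between overlap at the protein level and at the PPI level, and the notion of average degree, can be made concrete with a small Python sketch. It assumes that each database has already been reduced to a set of sorted gene-level pairs; the function names and the two toy datasets are hypothetical and do not reproduce any of the reported numbers.

```python
def proteins(ppis):
    """All proteins that take part in at least one PPI."""
    return {p for pair in ppis for p in pair}


def overlaps(ppis_a, ppis_b):
    """Return (number of shared PPIs, number of shared proteins)."""
    return len(ppis_a & ppis_b), len(proteins(ppis_a) & proteins(ppis_b))


def average_degree(ppis):
    """Mean number of distinct interaction partners per protein."""
    partners = {}
    for a, b in ppis:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return sum(len(s) for s in partners.values()) / len(partners)


# Two toy databases covering the same four proteins.
db_a = {("A", "B"), ("C", "D")}
db_b = {("A", "B"), ("A", "C"), ("B", "D")}

shared_ppis, shared_proteins = overlaps(db_a, db_b)
print(shared_ppis, shared_proteins)
print(average_degree(db_a), average_degree(db_b))
```

Here both toy databases annotate the same four proteins yet share only one PPI, which mirrors the pattern described above: good overlap at the protein level does not imply good overlap at the PPI level.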

We also carried out a comparison of a test set of proteins to check the distribution of interaction partners of PPIs across different databases (Table 2). The test proteins were selected based on the presence of proteins in four or more databases. We required that the protein be present in four or more databases because there was not even a single protein that was common to all databases. The proteins were further selected to cover proteins that participate in several different types of biological processes to avoid any potential bias in the event that any particular database is especially ‘strong’ in certain types of annotations. As shown in Table 2, Caspase 3 (CASP3) has 126 protein interaction partners annotated in HPRD, while BIND, MINT, IntAct and Reactome contain 15, 6, 3 and 1 interaction, respectively. S-phase kinase-associated protein 1A (SKP1A) has 35 PPIs in HPRD, 11 in BIND, 5 in DIP and 13 in MINT. MIPS and PDZBase do not contain any PPIs for this protein. Nuclear factor kappa-B subunit 3 (RELA) has 98 protein interaction partners in HPRD while BIND, MINT, DIP and IntAct contain 13, 103, 13 and 90 PPIs. Overall, for most proteins, there is at least one, and often several, databases that do not contain any PPI annotations (Table 2). This again reflects the fact that the databases are still at an early stage of curation and annotation of published PPIs.

### Literature citations in literature-derived databases

Literature citations are generally linked to interactions in literature-derived datasets. We checked the total citations in PubMed linked to PPIs in the literature-derived databases (Fig. 4A). HPRD has >43,634 published articles to support the PPI data, while BIND and MINT contain ~8,020 and ~11,480 citations, respectively. Reactome contains a total of ~2,000 citations. Another parameter to assess the extent of curation is to determine the number of citations per interaction. More than one citation for a given PPI indicates that the interaction has been verified by more than one group or method. Conversely, however, the presence of a single citation does not automatically imply that there is only one study describing the interaction because it is quite likely that only one published paper was linked although several studies might have been carried out (i.e. incomplete curation). This is illustrated in the section below where the same PPI is compared across multiple databases. As shown in Fig. 4B, 100% of PPIs in PDZBase and >95% of PPIs in MINT, IntAct and MIPS had one PubMed citation. In contrast, 87% in BIND and DIP and 84% of PPIs in HPRD have only one citation. Notably, ~11% and 7% of PPIs in HPRD and BIND, respectively, have 2 citations and ~2% of PPIs in HPRD, BIND and IntAct have more than 5 citations each. The majority of PPIs in Reactome (~96%) are linked to the same 2 published articles because these PPIs are predicted computationally using a matrix approach (i.e. all against all) to link proteins that were identified in two mass spectrometry-based protein complex pulldown studies on spliceosomes [34, 35].
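The citations-per-interaction statistic can be computed with a few lines of Python. The sketch below is illustrative only: `citation_distribution` is a hypothetical helper, and the PPIs and `PMID_n` strings are placeholders rather than real records.

```python
from collections import Counter


def citation_distribution(ppi_to_pmids):
    """Fraction of PPIs supported by 1, 2, 3, ... distinct citations."""
    counts = Counter(len(set(pmids)) for pmids in ppi_to_pmids.values())
    total = sum(counts.values())
    return {n: counts[n] / total for n in sorted(counts)}


# Placeholder PPIs and citation identifiers.
ppi_to_pmids = {
    ("A", "B"): ["PMID_1"],
    ("A", "C"): ["PMID_1", "PMID_2"],
    ("B", "C"): ["PMID_3"],
    ("C", "D"): ["PMID_4", "PMID_4"],  # duplicate entry counted once
}
dist = citation_distribution(ppi_to_pmids)
print(dist)
```

As noted in the text, a PPI that falls into the single-citation bin is not necessarily supported by only one study; it may simply reflect incomplete curation.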

### Comparison of PPI annotations common to multiple databases

Overall statistics of databases might not reflect the breadth and depth of protein annotations from a biologist’s perspective. To provide certain ‘case studies,’ we prepared a list of protein interactions that are common to 4 or more literature-derived databases and then tabulated the number of PPIs in each database. We left out PDZBase because of its small size. Table 3 lists 6 representative PPIs that were common to 4 or more databases along with the article(s) cited for each interaction and the annotation of the experimental methods used to detect the corresponding PPI. As an example, the experimental method annotated for the interaction between transcription factors NFKB1 and NFKB3 reported recently [36] is in vivo (MI:0492) in HPRD, tandem affinity purification (TAP) (MI:0045) in DIP, anti tag coimmunoprecipitation (MI:0109) in MINT and tap tag coip (MI:0007) in IntAct. This example illustrates how databases can describe the same experiment using alternative vocabulary terms. The interaction, TNFRSF1A with TRADD, is annotated as in vivo, in vitro and yeast 2-hybrid with 3 PubMed citations in HPRD, simply ‘experimental’ with 1 PubMed citation in DIP, immunoprecipitation and affinity chromatography with 3 PubMed citations in BIND, co-immunoprecipitation with 1 PubMed citation by MIPS, ‘co-immunoprecipitation, pulldown and two hybrid’ with 2 citations by MINT and ‘anti-bait coip, pulldown and two hybrid’ with 1 citation by IntAct. Together, the 6 databases refer to 8 PubMed citations to describe this interaction while each individual database only uses between 1 and 3 citations. 
For the interaction of FADD with FAS, HPRD annotation is ‘in vivo, in vitro and yeast 2-hybrid,’ DIP mentions ‘two hybrid test,’ BIND describes it as ‘immunoprecipitation’, MIPS mentions ‘coip,’ MINT describes it as ‘coimmunoprecipitation and two hybrid’ and IntAct annotates it as ‘coip, pull down, anti tag coip and two hybrid.’ Table 3 highlights how different databases use different published articles for annotating the same PPI. Thus, mere presence of a PPI in different literature-derived databases does not automatically guarantee that the annotations will be identical. It also illustrates that merging of annotations from multiple databases will lead to an increase in the depth of individual annotations.
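The gain from merging annotations can be sketched for the TNFRSF1A–TRADD example. In the Python illustration below, the method terms and the number of citations per database follow the text, but the citation identifiers `p1`–`p8` are placeholders and the choice of which citations are shared between databases is an assumption made only so that the union contains 8 entries.

```python
def merge_annotations(per_database):
    """Union the citations and method terms reported for one PPI."""
    citations, methods = set(), set()
    for record in per_database.values():
        citations |= set(record["pmids"])
        methods |= set(record["methods"])
    return {"pmids": citations, "methods": methods}


# Placeholder citation IDs; which ones are shared is assumed, not sourced.
tnfrsf1a_tradd = {
    "HPRD": {"pmids": ["p1", "p2", "p3"],
             "methods": ["in vivo", "in vitro", "yeast 2-hybrid"]},
    "DIP": {"pmids": ["p1"], "methods": ["experimental"]},
    "BIND": {"pmids": ["p2", "p4", "p5"],
             "methods": ["immunoprecipitation", "affinity chromatography"]},
    "MIPS": {"pmids": ["p6"], "methods": ["co-immunoprecipitation"]},
    "MINT": {"pmids": ["p7", "p8"],
             "methods": ["co-immunoprecipitation", "pulldown", "two hybrid"]},
    "IntAct": {"pmids": ["p3"],
               "methods": ["anti-bait coip", "pulldown", "two hybrid"]},
}
merged = merge_annotations(tnfrsf1a_tradd)
print(len(merged["pmids"]), sorted(merged["methods"]))
```

A plain union increases the depth of citations, but the merged method list still contains near-synonymous terms; these would have to be mapped to a shared vocabulary such as PSI-MI before they could be compared.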

The Proteomics Standards Initiative (PSI) is a collaborative initiative for standardization of protein-related data including protein-protein interaction and mass spectrometry data. The PSI-molecular interaction (PSI-MI) [37] format is an exchange format that has already become the standard for PPI data [4]. Table 1 shows that although many databases, such as HPRD, BIND, DIP, MINT, MIPS and IntAct, provide PPI data in this format, some databases such as AfCS and Reactome do not currently have this option. Reactome also provides data in two pathway-related formats, BioPAX and SBML. The data contained in AfCS are not currently available as a downloadable file.

Although a consensus on the use of standardized vocabulary for denoting PPIs is evolving and is being increasingly used, there is no requirement for use of any particular type of identifiers or database accession numbers for proteins in PPI databases. Different sets of protein database identifiers are used, with many of them being frequently retired, merged or otherwise updated. This creates great difficulties for those who want to combine datasets from different databases. ‘Mapping’ identifiers to a single set of proteins is not a trivial task and creates a bioinformatics pitfall of its own. If this ‘mapping’ is done by purely automated methods, there is a risk of wrong assignment of a protein entry from one database to another. To minimize this, we recommend the use of gene symbols in addition to any ‘favorite’ protein identifier. This allows for a relatively more error-free interpretation of PPI data at the gene level.

## Conclusion

There is great interest in protein-protein interactions as a means of understanding the complexities of a cell. Large-scale PPI data derived from high-throughput experiments or literature-derived curated databases have been used to analyze the molecular networks of human cells [38, 39, 40, 41]. Here, our assessment shows that the number of PPIs in databases varies widely from as low as 100 to over 36,600 interactions. Overlap of PPIs within the same category of databases (e.g. within literature-derived databases) is low despite the presence of overlapping proteins. A comparison of the number of PPIs for a test set of proteins confirms that there is indeed a large variation in the number of interactors across the interaction databases. Also, a comparison of annotations for the PPIs that do overlap between the databases reveals differences in annotations through the use of alternative vocabulary terms. This is partly because of the difference in interpretation of the experimental results by the biologists annotating them and partly because of the overlapping meaning of the terms themselves.

A particularly important issue is that of protein isoforms. Often, only one isoform is annotated as an interactor although there is no evidence that the interaction is specific to that isoform. In other experiments such as coimmunoprecipitation experiments, it is almost impossible to discern which isoform binds unless an isoform-specific antibody is used. Because of this difficulty in mapping isoforms, we suggest that groups carrying out interaction studies, especially large-scale studies, map the identity of the proteins to genes and include this in their data submission. We have also previously done this for protein identification studies using mass spectrometry where a similar difficulty exists with regard to identification of particular isoforms [42]. If this is done, then a binary interaction can be interpreted thus: at least one of the gene products of Gene A interacts with at least one of the gene products of Gene B.

The dissemination of PPI datasets is an important aspect for optimal use of the data. Through decades of research, molecular biologists have discovered a large number of PPIs. Collecting this information, storing it and maintaining a database is a valuable task, which is perhaps not adequately appreciated by the scientific community. Our evaluation of human PPI databases highlights the diverse nature of annotation and representation of PPIs in databases. We hope that this review will assist biomedical scientists in making informed decisions about the most appropriate database to suit their needs and to actively participate with the databases to maintain error-free and updated annotations.

## List of Abbreviations

PSI-MI: Proteomics Standards Initiative – Molecular Interaction

HPRD: Human Protein Reference Database

BIND: Biomolecular Interaction Network Database

DIP: Database of Interacting Proteins

MINT: Molecular INTeraction database

AfCS: Alliance for Cellular Signaling

## Declarations

### Acknowledgements

Akhilesh Pandey is supported by a grant from the National Institutes of Health (U54 RR020839). The Human Protein Reference Database was developed with funding from the National Institutes of Health and the Institute of Bioinformatics. Dr. Pandey serves as Chief Scientific Advisor to the Institute of Bioinformatics. Dr. Pandey is entitled to a share of licensing fees paid to the Johns Hopkins University by commercial entities for use of the database. The terms of these arrangements are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.

This article has been published as part of BMC Bioinformatics Volume 7, Supplement 5, 2006: APBioNet – Fifth International Conference on Bioinformatics (InCoB2006). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S5.

## References

1. Kemmer D, Huang Y, Shah SP, Lim J, Brumm J, Yuen MM, Ling J, Xu T, Wasserman WW, Ouellette BF: Ulysses – an application for the projection of molecular interactions across species. Genome Biol 2005, 6: R106. 10.1186/gb-2005-6-12-r106
2. Riley R, Lee C, Sabatti C, Eisenberg D: Inferring protein domain interactions from databases of interacting proteins. Genome Biol 2005, 6: R89. 10.1186/gb-2005-6-10-r89
3. Suresh S, Sujatha Mohan S, Mishra G, Hanumanthu GR, Suresh M, Reddy R, Pandey A: Proteomic resources: Integrating biomedical information in humans. Gene 2005, 364: 13–18. 10.1016/j.gene.2005.07.021
4. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, et al.: The HUPO PSI’s molecular interaction format – a community standard for the representation of protein interaction data. Nat Biotechnol 2004, 22: 177–183. 10.1038/nbt926
5. BioPAX [http://www.biopax.org]
6. HPRD Human Proteins Reference Database [http://www.hprd.org]
7. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13: 2363–2371. 10.1101/gr.1680803
8. GenProt [http://www.genprot.org]
9. NetPath [http://www.netpath.org]
10. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, et al.: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32: D452–455. 10.1093/nar/gkh052
11. IntAct [http://www.ebi.ac.uk/intact]
12. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Lett 2002, 513: 135–140. 10.1016/S0014-5793(01)03293-8
13. MINT Molecular INTeraction database [http://mint.bio.uniroma2.it/mint]
14. Breitkreutz BJ, Stark C, Tyers M: Osprey: a network visualization system. Genome Biol 2003, 4: R22. 10.1186/gb-2003-4-3-r22
15. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32: D449–451. 10.1093/nar/gkh086
16. DIP Database of Interacting Proteins [http://dip.doe-mbi.ucla.edu]
17. Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1: 349–356. 10.1074/mcp.M100037-MCP200
18. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12: 1540–1548. 10.1101/gr.153002
19. Duan XJ, Xenarios I, Eisenberg D: Describing biological protein interactions in terms of protein states and state transitions: the LiveDIP database. Mol Cell Proteomics 2002, 1: 104–116. 10.1074/mcp.M100026-MCP200
20. Graeber TG, Eisenberg D: Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nat Genet 2001, 29: 295–300. 10.1038/ng755
21. Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, et al.: The MIPS mammalian protein-protein interaction database. Bioinformatics 2005, 21: 832–834. 10.1093/bioinformatics/bti115
22. MIPS Mammalian Protein-Protein Interaction Database [http://mips.gsf.de/proj/ppi]
23. Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D: The PEDANT genome database in 2005. Nucleic Acids Res 2005, 33: D308–310. 10.1093/nar/gki019
24. Gilman AG, Simon MI, Bourne HR, Harris BA, Long R, Ross EM, Stull JT, Taussig R, Bourne HR, Arkin AP, et al.: Overview of the Alliance for Cellular Signaling. Nature 2002, 420: 703–706. 10.1038/nature01304
25. AfCS Alliance for Cellular Signaling [http://www.signaling-gateway.org]
26. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, et al.: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33: D418–424. 10.1093/nar/gki051
27. BIND Biomolecular Interaction Network Database [http://www.bind.ca]
28. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13: 2498–2504. 10.1101/gr.1239303
29. Reactome [http://www.reactome.org]
30. Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, et al.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33: D428–432. 10.1093/nar/gki072
31. PDZBase [http://icb.med.cornell.edu/services/pdz]
32. Beuming T, Skrabanek L, Niv MY, Mukherjee P, Weinstein H: PDZBase: a protein-protein interaction database for PDZ-domains. Bioinformatics 2005, 21: 827–828. 10.1093/bioinformatics/bti098
33. Bader GD, Hogue CW: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 2002, 20: 991–997. 10.1038/nbt1002-991
34. Hartmuth K, Urlaub H, Vornlocher HP, Will CL, Gentzel M, Wilm M, Luhrmann R: Protein composition of human prespliceosomes isolated by a tobramycin affinity-selection method. Proc Natl Acad Sci U S A 2002, 99: 16719–16724. 10.1073/pnas.262483899
35. Rappsilber J, Ryder U, Lamond AI, Mann M: Large-scale proteomic analysis of the human spliceosome. Genome Res 2002, 12: 1231–1245. 10.1101/gr.473902
36. Bouwmeester T, Bauch A, Ruffner H, Angrand PO, Bergamini G, Croughton K, Cruciat C, Eberhard D, Gagneur J, Ghidelli S, et al.: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 2004, 6: 97–105. 10.1038/ncb1086
37. PSI-MI Proteomics Standards Initiative – Molecular Interaction [http://psidev.sourceforge.net/mi/xml/doc/user]
38. Neduva V, Linding R, Su-Angrand I, Stark A, de Masi F, Gibson TJ, Lewis J, Serrano L, Russell RB: Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol 2005, 3: e405. 10.1371/journal.pbio.0030405
39. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al.: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437: 1173–1178. 10.1038/nature04209
40. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, et al.: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122: 957–968. 10.1016/j.cell.2005.08.029
41. Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B, et al.: Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 2006, 38: 285–293. 10.1038/ng1747
42. Muthusamy B, Hanumanthu G, Suresh S, Rekha B, Srinivas D, Karthick L, Vrushabendra BM, Sharma S, Mishra G, Chatterjee P, et al.: Plasma Proteome Database as a resource for proteomics research. Proteomics 2005, 5: 3531–3536. 10.1002/pmic.200401335

http://www.ebi.ac.uk/intact/

## IntAct Molecular Interaction Database

IntAct provides a freely available, open source database system and analysis tools for molecular interaction data. All interactions are derived from literature curation or direct user submissions and are freely available. The IntAct Team also produce the Complex Portal.

BioGRID interaction data are 100% freely available to both commercial and academic users and are provided WITHOUT ANY WARRANTY. Publications that make use of this data are requested to cite the contributing authors and, where applicable: Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: A General Repository for Interaction Datasets. Nucleic Acids Res. 2006 Jan 1; 34:D535-9.

### Syn-Lethality: An Integrative Knowledge Base of Synthetic Lethality towards Discovery of Selective Anticancer Therapies

BioMed Research International
Volume 2014 (2014), Article ID 196034, 7 pages
http://dx.doi.org/10.1155/2014/196034
Research Article

1. Bioinformatics Research Centre (BIRC), School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798
2. Institute for Infocomm Research (I2R), 1 Fusionopolis Way, Singapore 138632
3. Genome Institute of Singapore (GIS), Biopolis, Singapore 138672

Received 17 November 2013; Accepted 11 March 2014; Published 22 April 2014

Copyright © 2014 Xue-juan Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Synthetic lethality (SL) is a novel strategy for anticancer therapies, whereby mutations of two genes will kill a cell but mutation of a single gene will not. Therefore, a cancer-specific mutation combined with a drug-induced mutation, if they have SL interactions, will selectively kill cancer cells. While numerous SL interactions have been identified in yeast, only a few are known in humans. There is a pressing need to systematically discover and understand SL interactions specific to human cancer. In this paper, we present Syn-Lethality, the first integrative knowledge base of SL that is dedicated to human cancer. It integrates experimentally discovered and verified human SL gene pairs into a network, associated with annotations of gene function, pathway, and molecular mechanisms. It also includes yeast SL genes from high-throughput screenings which are mapped to orthologous human genes. Such an integrative knowledge base, organized as a relational database with a user interface for searching and network visualization, will greatly expedite the discovery of novel anticancer drug targets based on synthetic lethality interactions. The database can be downloaded as a stand-alone Java application.

#### 1. Introduction

Finding effective anticancer therapies is a major goal of biomedical research. As a devastating human disease, cancer kills millions of people each year. In 2008, the World Health Organization (WHO) predicted that, if new anticancer treatments are not discovered, there will be 26.4 million cancer patients around the world and 17 million cancer deaths by 2030 [1]. The currently prevalent anticancer treatments, chemotherapies, have several limitations, including drug resistance and toxic side effects [2]. Although targeted therapies are being developed, the lack of selectivity (i.e., killing both tumour and healthy cells) remains a major issue for current anticancer therapeutics.

Recently, synthetic lethality (SL) has emerged as a novel anticancer strategy that promises to be highly selective. A pair of genes is defined as having a synthetic lethal interaction if a mutation in either gene alone does not kill the cell but mutations in both genes lead to cell death [2] (Figure 1). Compared with healthy cells, cancer cells contain many genetic mutations. Hence, an SL partner of a cancer-specific mutation is potentially a selective anticancer drug target. A drug that induces a mutation in the SL partner gene will kill cancer cells but spare normal cells, because the cancer-specific mutation it interacts with is not present in healthy cells.

Figure 1: The concept of synthetic lethality. (a) If just one of the SL pair genes is mutated, then the cell is alive. A/B: wild-type genes; a/b: mutated genes; (b) mutation/inhibition of one gene or both genes of an SL gene pair leads to different cell fates [2].
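The definition of synthetic lethality amounts to a simple truth table, which the following Python sketch spells out. It is a deliberately idealized model for illustration: the function `cell_survives` and its Boolean treatment of gene function are assumptions, not part of the Syn-Lethality database.

```python
from itertools import product


def cell_survives(a_functional, b_functional):
    """For an SL pair, the cell dies only when both genes are lost."""
    return a_functional or b_functional


# Full truth table for an SL gene pair (A, B).
for a, b in product([True, False], repeat=2):
    print(f"A functional={a!s:5}  B functional={b!s:5}  survives={cell_survives(a, b)}")

# Selectivity: gene A is already mutated in the tumour; the drug inhibits B.
normal_cell_with_drug = cell_survives(a_functional=True, b_functional=False)
cancer_cell_with_drug = cell_survives(a_functional=False, b_functional=False)
print(normal_cell_with_drug, cancer_cell_with_drug)
```

The last two lines capture the therapeutic idea: inhibiting B is tolerated by a normal cell, which still has a functional A, but is lethal to a tumour cell that has already lost A.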

However, the discovery and clinical application of SL-based anticancer therapies must overcome several technical obstacles. Most known SL cases were discovered in yeast, and so far only a few SL gene pairs are known in humans. A prevalent technique to discover SL genes is high-throughput screening based on chemical or RNAi libraries [3]. Owing to the genetic heterogeneity of cancer cells, an SL interaction identified in one screen might not be reproducible on another platform or in another cancer subtype. Importantly, screening-based discovery can hardly yield any mechanistic insight into SL interactions. The interpretation of SL candidates is crucial for reliable application of SL-based therapies. To address these issues, systems biology approaches that can uncover the molecular mechanisms of SL in cancer cells are needed.

The concept of SL originated in yeast genetics [4]. Owing to its rapid generation time, simple culture, and ease of genetic manipulation, S. cerevisiae has been extensively used to study SL [5]. Computational methods have also been developed to predict and analyze yeast SL [6]. In contrast, there is a dearth of resources (e.g., data, knowledge, or bioinformatics tools) available about SL in human cancer. Recently, some methods have been developed to infer human SL from yeast SL, considering that genome integrity and cell-cycle-related genes in yeast are highly conserved in human and closely related to cancer [7]. Massive screening of yeast SL interactions can provide valuable information for SL inference in human cancer. For example, Conde-Pueyo et al. applied the yeast-to-human inference method to obtain potential cancer-related SL targets and identified SL partners of cancer-related genes that are drug targets [8]. It is highly desirable to integrate data on human cancer SL pairs to reduce follow-up experimental research to a manageable size.
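The general idea of projecting yeast SL pairs onto human genes through orthology can be sketched in Python. This is a generic illustration and not the method of Conde-Pueyo et al.: the function `infer_human_sl`, the ortholog table and the gene names (`yA`, `hB1`, and so on) are hypothetical.

```python
from itertools import product


def infer_human_sl(yeast_sl_pairs, orthologs):
    """Project yeast SL pairs onto human genes via an ortholog table."""
    candidates = set()
    for y1, y2 in yeast_sl_pairs:
        # A yeast gene may have zero, one or several human orthologs.
        for h1, h2 in product(orthologs.get(y1, ()), orthologs.get(y2, ())):
            if h1 != h2:
                candidates.add(tuple(sorted((h1, h2))))
    return candidates


# Hypothetical yeast SL pairs and ortholog assignments.
yeast_sl_pairs = [("yA", "yB"), ("yA", "yC")]
orthologs = {"yA": ["hA"], "yB": ["hB1", "hB2"]}  # yC has no human ortholog

candidates = infer_human_sl(yeast_sl_pairs, orthologs)
print(sorted(candidates))
```

Pairs obtained this way are only candidates: one-to-many orthology multiplies the number of inferred pairs, and each still requires experimental verification in human cells.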

In this paper, we present an integrative knowledge base dedicated to SL in human cancer, called Syn-Lethality. From the literature, we collected SL gene pairs that have been experimentally discovered and verified and integrated them into a network (Figure 2), where each node is a gene and each edge represents an SL interaction. We call such a network an SL network. Moreover, we associated the SL network with related gene annotations and pathway information to facilitate mechanistic understanding of SL. In addition to human-specific SL, we also collected yeast SL pairs, which were mapped to human genes through orthologous correspondence. The information collected in this way has been organized into a relational database with a user-friendly interface. When users input cancer genes (e.g., TP53), Syn-Lethality will search for SL partners of the query genes and display related annotations (e.g., pathways, gene functions, and hyperlinks to the related literature). The SL network we constructed serves as a roadmap for the whole knowledge base.

Figure 2: SL network of human cancer constructed from the SL literature. Each node in the network denotes a gene/protein and each edge represents an SL interaction (the arrow points from the mutated gene to the target gene).

To the best of our knowledge, Syn-Lethality is the first database dedicated to human synthetic lethality. There are few genome-wide screens for SL interactions with human cancer genes, and they focus on a few well-known oncogenes (e.g., TP53 and KRAS). Large-scale screening in human cancer cells is limited by high cost, false positives, and the difficulty of interpreting mechanisms, and the resulting information is scattered across the literature. An integrative approach is indispensable for a systematic and mechanistic understanding of human SL. The Syn-Lethality database is one of the first attempts to integrate knowledge and data about SL in human cancer. We have also integrated data from yeast and will do so in the future for other model organisms. We believe it will be a valuable resource and framework that facilitates the discovery of selective anticancer therapies based on synthetic lethality.

#### 2. Data Integration

##### 2.1. Data Collection and the Literature Search

The primary aim of our Syn-Lethality database is to collect and maintain a high-quality set of SL gene pairs, which serves as a comprehensive, fully classified, and accurately annotated knowledge base for SL-related research. The database also provides extensive cross-references and querying interfaces. The SL pairs in the Syn-Lethality database are collected by two alternative methods, which we describe in more detail below.

The first method for collecting SL pairs is the literature search. We searched the Web of Knowledge and NCBI PubMed databases with keywords such as “synthetic lethality” and then screened the abstracts with the keyword “human cancer/tumour”. In this way, we collected more than one hundred scientific publications. From these articles, we manually extracted more than one hundred SL gene pairs that have been experimentally verified in the context of cancer treatment. Although the number of SL pairs collected by the literature search is limited, they are highly trustworthy and thus lay the foundation for our Syn-Lethality database.

The second source of potential SL pairs is knowledge transfer from the model organism yeast to human by comparative genomics analysis. Currently, a large number of SL pairs in yeast have been detected experimentally by various screening techniques. Meanwhile, some human cancer genes (e.g., those related to the cell cycle and DNA repair) are highly evolutionarily conserved with their yeast counterparts. Therefore, it is possible to infer some SL pairs in human cancers from yeast on the basis of human-yeast conservation. We predict a human gene pair to be an SL pair in human cancer based on the following two constraints. First, this human gene pair has a conserved SL interaction in yeast. Second, one of these two genes is a cancer gene. For example, suppose two yeast genes A and B form an SL relationship while two human genes A′ and B′ are orthologs of A and B, respectively. If A′ or B′ is a gene that is observed to be mutated in a certain type of cancer, (A′, B′) is then a predicted SL pair in that human cancer. In this paper, all the yeast SL interactions were downloaded from BioGRID [9] (Table 1). However, we noticed that some of these yeast SL pairs from BioGRID involve essential genes. By the definition of SL (i.e., mutation of one gene should not kill the cell, but mutation of both genes kills the cell), both genes in an SL pair should be nonessential. Therefore, using the lists of essential genes downloaded from the Gerstein Lab at Yale University (http://bioinfo.mbb.yale.edu/genome/yeast/cluster/essential/) and the Saccharomyces Genome Deletion Project (http://www-sequence.stanford.edu/group/yeast_deletion_project/), we collected 6,613 SL pairs without any essential genes. In addition, 507 human cancer genes were downloaded from COSMIC: Cancer Gene Census via the link http://cancer.sanger.ac.uk/cancergenome/projects/census/. Finally, we inferred 1,114 human cancer-related SL pairs from yeast.
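
The two constraints above, together with the essential-gene filter, can be sketched as a small set-based procedure. This is a minimal illustration and not the authors' implementation: the function name, the input shapes, and the toy gene identifiers (`yA`, `hA`, and so on) are hypothetical, and it assumes an ortholog table that may map one yeast gene to several human genes.

```python
from itertools import product


def infer_human_sl(yeast_sl_pairs, essential_yeast, orthologs, cancer_genes):
    """Transfer yeast SL pairs to human gene pairs (illustrative sketch).

    yeast_sl_pairs  : iterable of (yeast_gene, yeast_gene) tuples
    essential_yeast : set of yeast genes that are essential on their own
    orthologs       : dict mapping a yeast gene to a list of human orthologs
    cancer_genes    : set of human genes observed to be mutated in cancer
    """
    predicted = set()
    for y1, y2 in yeast_sl_pairs:
        # By definition, neither member of an SL pair may be essential alone.
        if y1 in essential_yeast or y2 in essential_yeast:
            continue
        # Constraint 1: the human pair has a conserved SL interaction in yeast.
        for h1, h2 in product(orthologs.get(y1, ()), orthologs.get(y2, ())):
            if h1 == h2:
                continue
            # Constraint 2: at least one of the two human genes is a cancer gene.
            if h1 in cancer_genes or h2 in cancer_genes:
                predicted.add(frozenset((h1, h2)))
    return predicted


# Toy inputs with placeholder identifiers (not real gene symbols).
yeast_sl = [("yA", "yB"), ("yC", "yD"), ("yE", "yF")]
essential = {"yC"}
ortholog_table = {"yA": ["hA"], "yB": ["hB"], "yC": ["hC"],
                  "yD": ["hD"], "yE": ["hE"], "yF": ["hF"]}
cancer = {"hA", "hD"}

human_sl = infer_human_sl(yeast_sl, essential, ortholog_table, cancer)
print(human_sl)
```

In the toy data only the pair derived from (`yA`, `yB`) survives: the second yeast pair contains an essential gene, and the third has no cancer gene among its orthologs.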

Table 1: Representative entries for human cancer Syn-Lethality database.

Based on the above in silico analysis, the Syn-Lethality database contains 113 SL pairs from NCBI PubMed abstracts and 1,114 SL pairs from the model organism of yeast (Table 3). We also provide additional information about the genes/proteins involved in these SL pairs as shown in Table 1, for example, Entrez gene IDs, full gene name, symbols, gene type (oncogene or tumour suppressor gene), cancer type, pathway information, and some remarks on the molecular mechanisms.

##### 2.2. Pathway/Mechanism Analysis of SL Pairs Directly from the Literature

From the list of SL gene pairs, it is interesting to note that a large fraction of SL pairs are involved in fundamental processes of cell fate, the cell cycle, and the DNA damage response. We first take the KRAS oncogene as an example. A genome-wide RNAi screen was conducted to identify SL interaction partners of KRAS [10]. We observed that the SL interaction partners of KRAS are involved in mitotic progression, including subunits of the anaphase-promoting complex/cyclosome (APC/C) (ANAPC1, ANAPC4, CDC16, and CDC27), cyclin A2 (CCNA2), kinesin-like protein 2C (KIF2C), KNL-1 (CASC5), hMis18a and hMis18b (C21ORF45 and OIP5), borealin (CDCA8), SMC4, and polo-like kinase 1 (PLK1). Inhibition of these genes leads to the death of cells in which KRAS is mutated [10]. TP53 is another example. It is a major downstream effector of DNA-damage kinase pathways. In response to DNA damage, a normal cell activates a complex signaling network to arrest cell-cycle progression and facilitate DNA repair. In contrast, TP53-deficient tumor cells rely on other G2/M checkpoint regulators such as checkpoint kinase 1 (CHK1) to arrest cell-cycle progression. Recently, the SL interactions between mutated TP53 and the targets ATR/Chk1, WEE1, ATM/Chk2, and MK2 have been investigated [11]. As a further example, the myelocytomatosis viral oncogene homolog, MYC, is a multifunctional nuclear phosphoprotein that acts as a transcription factor and plays a role in cell cycle progression, apoptosis, and cellular transformation. Overexpression of MYC sensitizes fibroblasts to agonists of the TNF-related apoptosis-inducing ligand (TRAIL) death receptor DR5. It was shown that MYC mediates increased DR5 expression and signaling as a result of enhanced procaspase-8 autocatalytic activities [12].

The authors of [3, 13] proposed the following four types of mechanisms for SL interactions in human cancers from the perspectives of protein complexes and pathways. First, two complexes may be synthetic lethal when they have an essential function in common and they are uniquely redundant. Second, two units within an essential protein complex may form an SL relationship. Third, two components in a linear essential pathway may be SL partners, because the mutation of either component decreases the flow through the pathway while still leaving some signal flow, whereas the mutation of both destroys the pathway. Fourth, two components in two parallel essential pathways may back each other up, so that losing both is lethal. Generally, SL pairs can be interpreted in terms of these four mechanisms. For example, EGFR and BRCA1 are an SL pair because they are in the same essential protein complex [14]. In this paper, we focus on the analysis of SL pairs from the perspective of signalling pathways and provide three SL examples in which the two partners come from two parallel pathways.

First, TANK-binding kinase 1 (TBK1) was identified as a synthetic lethal partner of KRAS [15]. TBK1 is a noncanonical inhibitor of κB (IκB) kinase that is known to regulate nuclear factor κB (NF-κB) signalling. TBK1 activates NF-κB antiapoptotic signals involving c-Rel and BCL-XL (also known as BCL2L1) that are essential for survival. These findings indicate that TBK1 and NF-κB signalling are essential in KRAS mutant tumours. Second, the inhibition of both the EGFR and Notch signalling pathways is dramatically more effective at suppressing tumor growth than blocking either pathway alone. Normally the activated form of Notch1 restores AKT activity and enables cells to overcome cell death after dual-pathway blockade [16]. Here, the combined EGFR and Notch inhibition significantly decreases AKT activation and thus suppresses tumor growth more effectively. Third, EGFR, a protooncogene, belongs to a family of four transmembrane receptor tyrosine kinases that mediate the growth, differentiation, and survival of cells. It is often overexpressed in aggressive triple negative breast cancers (TNBCs) and is also associated with other aggressive disease phenotypes. Nowsheen et al. recently reported that a contextual synthetic lethality can be achieved both in vitro and in vivo with combined EGFR and PARP inhibition using lapatinib and ABT-888, respectively [14]. The mechanism involves a transient deficit of DNA double strand break repair induced by lapatinib and a subsequent activation of the intrinsic pathway of apoptosis. Our Syn-Lethality database contains SL pairs of genes that likely belong to one of the above four mechanisms. The gene function and pathway information in our database will facilitate in silico interpretation of mechanisms.

#### 3. Database Interface

##### 3.1. Usage of SL Database

Our synthetic lethality database contains SL gene pairs in organised form and provides an interface for querying the database. Our preliminary database is available for download from http://www.ntu.edu.sg/home/zhengjie/software/Syn-Lethality/. The software is a Java executable file and requires Java to be installed. The required version is Java 10 (free), which can be installed from http://www.java.com/en/download/index.jsp. Once Java is installed on the local machine, double-clicking the Java executable file launches the database interface. Since the database is distributed as a single setup file, it can be used simultaneously by many end users for performing queries (Figure 4). The database includes information such as synthetic lethal gene pairs, type of lethality, type of gene alteration, and target genes for synthetic lethality.

Searching in our database can be divided into the following categories.

(a) Simple Search. The user is required to provide abbreviations for gene names. For example, for epidermal growth factor receptor we just need to write EGFR and for cyclin-dependent kinase we just need to write CDK in the search field. This lets the user search for SL gene pair information without typing long gene names.

(b) Batch Search. The user can directly copy and paste the names of several genes (separated by spaces) into each field. Figure 3 shows an example of using KRAS as input to query its related SL pairs. This helps find information for several synthetic lethal gene pairs simultaneously.

(c) Smart Search. Users can search SL gene pairs with Boolean logic by selecting the logical AND and OR operators from the drop-down menu. This helps in analyzing various combinations of SL gene pairs.

(d) Genetic Alteration Search. The interface of our database lets the user screen the SL pairs by the type of gene alteration, which refers to the gene mutated in cancer. The gene alteration types captured in our database include overexpression, mutation, activation, inactivation, and deficiency.
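
The search categories above can be mimicked with a few list filters. The sketch below is only a hypothetical model of such an interface, not the Syn-Lethality Java application: the `SLRecord` fields, the function names, and the alteration labels attached to the example rows are assumptions, although the gene pairings themselves are taken from the examples discussed earlier in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLRecord:
    mutated_gene: str   # gene altered in the cancer
    target_gene: str    # its SL partner
    alteration: str     # e.g. "mutation", "overexpression"


# Illustrative rows only; alteration labels are assumed for the example.
RECORDS = [
    SLRecord("KRAS", "TBK1", "mutation"),
    SLRecord("KRAS", "PLK1", "mutation"),
    SLRecord("TP53", "WEE1", "mutation"),
    SLRecord("MYC", "DR5", "overexpression"),
]


def _involves(record, gene):
    return gene in (record.mutated_gene, record.target_gene)


def batch_search(records, query):
    """Simple/batch search: one or more space-separated gene abbreviations."""
    wanted = {g.upper() for g in query.split()}
    return [r for r in records if any(_involves(r, g) for g in wanted)]


def smart_search(records, gene_a, gene_b, operator="AND"):
    """Boolean search combining two genes with AND or OR."""
    a, b = gene_a.upper(), gene_b.upper()
    if operator == "AND":
        return [r for r in records if _involves(r, a) and _involves(r, b)]
    if operator == "OR":
        return [r for r in records if _involves(r, a) or _involves(r, b)]
    raise ValueError("operator must be 'AND' or 'OR'")


def alteration_search(records, alteration):
    """Filter by the alteration type of the cancer-mutated gene."""
    return [r for r in records if r.alteration == alteration.lower()]


kras_hits = batch_search(RECORDS, "kras")
both = smart_search(RECORDS, "KRAS", "TBK1", "AND")
overexpressed = alteration_search(RECORDS, "Overexpression")
```

Treating a single-gene query as a batch of size one keeps the simple and batch searches in one function, which is a design choice of this sketch rather than a documented behavior of the tool.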

Figure 3: An example of KRAS related SL pairs (the alteration types refer to the cancer mutated gene).
Figure 4: SL query interface.

As of now, it is possible to retrieve complete SL gene pair information based on information such as gene names (MYC, EGFR, CDK, and so forth) (Table 2) and types of genetic alterations (overexpression, mutation, activation, and so forth). The relevant research papers for the SL gene pair are provided via web hyperlinks in database search results.

Table 2: List of annotation database links in Syn-Lethality database.
Table 3: Total statistics for human cancer Syn-Lethality database.

##### 3.2. Synthetic Lethality Network

To provide a clearer understanding of SL gene pairs, we constructed a network of the available SL gene pairs (Figure 2). The diagram depicts the synthetic lethal genes and the target genes. For example, the SL pair information for the MYC oncogene is depicted in Figure 5.

Figure 5: Subnetwork of our SL network for human cancer.
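
One way to picture the SL network and a gene-centred subnetwork is a directed adjacency map in which each edge leads from the mutated gene to its target gene. The following is a minimal sketch under that assumption; the function names are hypothetical and the edge list is a small illustrative subset drawn from the examples in Section 2.2, not the full network.

```python
from collections import defaultdict


def build_sl_network(pairs):
    """Directed adjacency map: mutated gene -> set of SL target genes."""
    network = defaultdict(set)
    for mutated, target in pairs:
        network[mutated].add(target)
    return network


def subnetwork(network, gene):
    """All edges touching `gene`, as sorted (mutated, target) tuples."""
    edges = {(gene, target) for target in network.get(gene, ())}
    edges |= {(mutated, gene)
              for mutated, targets in network.items() if gene in targets}
    return sorted(edges)


# Illustrative edges only.
sl_network = build_sl_network([
    ("KRAS", "TBK1"),
    ("KRAS", "PLK1"),
    ("TP53", "WEE1"),
    ("TP53", "ATR"),
])

kras_edges = subnetwork(sl_network, "KRAS")
print(kras_edges)
```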

#### 4. Conclusion and Future Perspectives

Syn-Lethality is the first comprehensive database constructed by integrating experimentally validated SL pairs of human cancer with SL pairs inferred from yeast according to the orthologous relation between human and yeast. It is the first attempt to use experimentally verified SL pairs to construct an SL network. In the SL network, each node represents a gene/protein and each edge denotes an SL interaction, which can be easily linked to annotation information including gene/protein alteration type, screening method, pathway, mechanism, and the related literature. It is a valuable resource for better understanding SL mechanisms in human cancer and for developing useful information for anticancer medicine.

Considering that our current database only includes predicted SL pairs from yeast, it is desirable to collect and predict more SL pairs from other model organisms, such as Caenorhabditis elegans, zebrafish, and mouse. As SL experimental screening technology progresses, more SL interactions are expected to be identified. We will continue to collect and curate SL pairs of genes. Additionally, using our SL database, we plan to develop data mining algorithms to quickly extract SL information and mechanistic insights. Moreover, by incorporating the signalling pathways associated with the SL pairs of genes, we will construct a comprehensive and global SL network for human cancer.

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Authors’ Contribution

Xue-juan Li and Shital K. Mishra contributed equally to this paper.

#### Acknowledgments

This research was supported by NTU Start-up Grant (COE-SUG/RSS-1FEB11-1/8), Singapore Ministry of Education (MOE) AcRF Tier 1 Grant RG32/11 (M4010977), and ARC 39/13 (MOE2013-T2-1-079).

#### References

1. P. Boyle and B. E. Levin, World Cancer Report, IARC Press, 2008.
2. W. G. Kaelin, “Synthetic lethality: a framework for the development of wiser cancer therapeutics,” Genome Medicine, vol. 1, no. 10, article 99, 2009.
3. W. G. Kaelin Jr., “The concept of synthetic lethality in the context of anticancer therapy,” Nature Reviews Cancer, vol. 5, no. 9, pp. 689–698, 2005.
4. L. H. Hartwell, P. Szankasi, C. J. Roberts, A. W. Murray, and S. H. Friend, “Integrating genetic approaches into the discovery of anticancer drugs,” Science, vol. 278, no. 5340, pp. 1064–1068, 1997.
5. M. A. Heiskanen and T. Aittokallio, “Mining high-throughput screens for cancer drug targets: lessons from yeast chemical-genomic profiling and synthetic lethality,” WIREs Data Mining Knowl Discovery, vol. 2, no. 3, pp. 263–272, 2012.
6. M. Wu, X. J. Li, F. Zhang, X. L. Li, C. K. Kwoh, and J. Zheng, Meta-Analysis of Genomic and Proteomic Features To Predict Synthetic Lethality of Yeast and Human Cancer, ACM-BCB, 2013.
7. K. W. Y. Yuen, C. D. Warren, O. Chen, T. Kwok, P. Hieter, and F. A. Spencer, “Systematic genome instability screens in yeast and their potential relevance to cancer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 10, pp. 3925–3930, 2007.
8. N. Conde-Pueyo, A. Munteanu, R. V. Solé, and C. Rodríguez-Caso, “Human synthetic lethal inference as potential anti-cancer target gene detection,” BMC Systems Biology, vol. 3, article 116, 2009.
9. A. Chatr-Aryamontri, B. J. Breitkreutz, S. Heinicke et al., “The BioGRID interaction database: 2013 update,” Nucleic Acids Research, vol. 41, pp. 816–823, 2013.
10. J. Luo, M. J. Emanuele, D. Li et al., “A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene,” Cell, vol. 137, no. 5, pp. 835–848, 2009.
11. S. Morandell and M. B. Yaffe, “Exploiting synthetic lethal interactions between DNA damage signaling, checkpoint control, and p53 for targeted cancer therapy,” Progress in Molecular Biology and Translational Science, vol. 110, pp. 289–314, 2012.
12. Y. Wang, I. H. Engels, D. A. Knee, M. Nasoff, Q. L. Deveraux, and K. C. Quon, “Synthetic lethal targeting of MYC by activation of the DR5 death receptor pathway,” Cancer Cell, vol. 5, no. 5, pp. 501–512, 2004.
13. N. Le Meur and R. Gentleman, “Modeling synthetic lethality,” Genome Biology, vol. 9, no. 9, article 135, 2008.
14. S. Nowsheen, T. Cooper, J. A. Stanley, and E. S. Yang, “Synthetic lethal interactions between EGFR and PARP inhibition in human triple negative breast cancer cells,” PLoS ONE, vol. 7, no. 10, Article ID e46614, 2012.
15. D. A. Barbie, P. Tamayo, J. S. Boehm et al., “Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1,” Nature, vol. 462, no. 7269, pp. 108–112, 2009.
16. Y. Dong, A. Li, J. Wang, J. D. Weber, and L. S. Michel, “Synthetic lethality through combined Notch-epidermal growth factor receptor pathway inhibition in basal-like breast cancer,” Cancer Research, vol. 70, no. 13, pp. 5465–5474, 2010.

### Predicting Cancer-Specific Vulnerability via Data-Driven Detection of Synthetic Lethality

Volume 158, Issue 5, 28 August 2014, Pages 1199–1209

Resource


## Highlights

Genome-scale data-driven identification of synthetic lethality in cancer

Synthetic lethality networks successfully predict cancer gene essentiality

Synthetic lethality networks predict 15 year survival in breast cancer patients

Synthetic dosage lethality networks predict drug response in cancer

## Summary

Synthetic lethality occurs when the inhibition of two genes is lethal while the inhibition of each single gene is not. It can be harnessed to selectively treat cancer by identifying inactive genes in a given cancer and targeting their synthetic lethal (SL) partners. We present a data-driven computational pipeline for the genome-wide identification of SL interactions in cancer by analyzing large volumes of cancer genomic data. First, we show that the approach successfully captures known SL partners of tumor suppressors and oncogenes. We then validate SL predictions obtained for the tumor suppressor VHL. Next, we construct a genome-wide network of SL interactions in cancer and demonstrate its value in predicting gene essentiality and clinical prognosis. Finally, we identify synthetic lethality arising from gene overactivation and use it to predict drug efficacy. These results form a computational basis for exploiting synthetic lethality to uncover cancer-specific susceptibilities.

## Introduction

Synthetic lethality occurs when the perturbation of two nonessential genes is lethal (Hartwell et al., 1997). This phenomenon offers a unique opportunity to develop selective anticancer drugs that will target a gene whose synthetic lethal (SL) partner is inactive only in the cancer cells (Ashworth et al., 2011 and Hartwell et al., 1997). Toward the realization of this potential, screening technologies have been developed to detect SL interactions in model organisms (Byrne et al., 2007, Costanzo et al., 2010 and Typas et al., 2008) and in human cell lines (Bassik et al., 2013, Brough et al., 2011 and Laufer et al., 2013). However, currently their scope is not sufficiently broad to encompass the large volume of genetic interactions that need to be surveyed across different cancer types. New bioinformatics approaches are hence called for to guide and complement the experimental search for SL interactions in cancer.

Previous computational approaches developed to systematically study genetic interactions have mainly focused on yeast, where there are genome-wide maps of experimentally determined SL interactions (Chipman and Singh, 2009, Kelley and Ideker, 2005, Szappanos et al., 2011 and Wong et al., 2004). In cancer, synthetic lethality has been computationally inferred by mapping SL interactions in yeast to their human orthologs (Conde-Pueyo et al., 2009) and by utilizing metabolic models and evolutionary characteristics of metabolic genes (Folger et al., 2011, Frezza et al., 2011 and Lu et al., 2013). Here, we analyze the rapidly accumulating cancer genomic data to identify candidate SL interactions via the data mining synthetic lethality identification pipeline (DAISY). We show that genome-wide cancer SL networks can be used to successfully predict gene essentiality, drug response, and clinical prognosis.

## Results

### The DAISY

DAISY is an approach for statistically inferring SL interactions from cancer genomic data of both cell lines and clinical samples. DAISY applies three statistical inference procedures, each tailored to specific cancer data sets.

The first inference strategy, termed genomic survival of the fittest (SoF), is based on the observation that cancer cells that have lost two SL-paired genes do not survive; they are strongly selected against. Accordingly, as cells harboring SL coinactivation are eliminated from the cell population, SL interactions can be identified by analyzing somatic copy number alterations (SCNA) and somatic mutation data and detecting events of gene coinactivation that occur significantly less often than expected. In fact, very similar concepts are already extensively used to analyze the outcomes of small hairpin RNA (shRNA) screens in cell lines, in which essential genes and SL gene pairs are detected by identifying the shRNA probes that have been rapidly eliminated from the cell population (Cheung et al., 2011 and Marcotte et al., 2012). More recently, a related concept was implemented to identify synthetic lethality in glioblastoma (Szczurek et al., 2013).
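
The idea of coinactivation that occurs "significantly less than expected" can be illustrated with a lower-tail hypergeometric test on sample counts. This excerpt does not name the statistical test DAISY uses for SoF, so the test below is an assumed stand-in, and the function name and the example counts are hypothetical.

```python
from math import comb


def coinactivation_depletion_pvalue(n_samples, n_a_inactive, n_b_inactive, n_both):
    """P(X <= n_both) when X is hypergeometric.

    X models the number of samples with both genes inactive if the two
    inactivation events were independent. A small value means that
    coinactivation is rarer than expected by chance.
    """
    total = comb(n_samples, n_b_inactive)
    lowest = max(0, n_a_inactive + n_b_inactive - n_samples)
    highest = min(n_both, n_a_inactive, n_b_inactive)
    favourable = 0
    for k in range(lowest, highest + 1):
        favourable += (comb(n_a_inactive, k)
                       * comb(n_samples - n_a_inactive, n_b_inactive - k))
    return favourable / total


# Hypothetical counts: 100 samples, each gene inactive in 30 of them,
# but only 2 samples have lost both (about 9 would be expected by chance).
p_depleted = coinactivation_depletion_pvalue(100, 30, 30, 2)
print(p_depleted)
```

A real analysis over hundreds of millions of gene pairs would also need a multiple-testing correction, which is left out here.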

The second inference strategy, shRNA-based functional examination, is based on the notion that the knock down of a synthetically lethal gene is lethal to cancer cells where its SL partner is inactive. Accordingly, the SL pairs of a given gene can be detected by searching for genes whose underexpression and low copy number induce its essentiality. This can be conducted via an integrative analysis of the results obtained in shRNA essentiality screens and their accompanying SCNA and transcriptomic profiles.
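
As a rough illustration of this second strategy, one can compare the shRNA essentiality scores of a candidate gene B between samples in which gene A is inactive and samples in which it is not. The sketch below uses a one-sided Mann-Whitney U test with a normal approximation (no tie correction); both the choice of test and the convention that lower scores mean stronger essentiality are assumptions of this example, and the scores are made up.

```python
from statistics import NormalDist


def induced_essentiality_pvalue(scores_a_inactive, scores_a_active):
    """One-sided test that gene B is more essential when gene A is inactive.

    Inputs are shRNA scores of gene B (assumed: lower = more essential) in
    samples where gene A is inactive versus samples where it is active.
    """
    n1, n2 = len(scores_a_inactive), len(scores_a_active)
    # U counts sample pairs in which the A-inactive sample has the lower score.
    u = sum((x < y) + 0.5 * (x == y)
            for x in scores_a_inactive for y in scores_a_active)
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 1 - NormalDist().cdf(z)


# Hypothetical scores for gene B.
inactive_scores = [-2.1, -1.8, -2.5, -1.9, -2.2]
active_scores = [-0.3, 0.1, -0.5, 0.2, -0.1]

p_induced = induced_essentiality_pvalue(inactive_scores, active_scores)
print(p_induced)
```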

The third procedure, pairwise gene coexpression, is based on the notion that SL pairs tend to participate in closely related biological processes and hence are likely to be coexpressed (Costanzo et al., 2010 and Kelley and Ideker, 2005). We show that this trend indeed holds in known SLs that have been experimentally detected in cancer (Figure 2).
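
Coexpression of a gene pair can be quantified with a correlation coefficient computed across samples. The excerpt does not say which coefficient DAISY uses, so the Pearson correlation below is an assumption, as are the function name and the toy expression values.

```python
def pearson_correlation(x, y):
    """Pearson correlation between two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


# Hypothetical expression levels of two genes across five samples.
gene_a_expr = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b_expr = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_correlation(gene_a_expr, gene_b_expr)
print(r)
```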

Given SCNA, somatic mutation, shRNA, and gene expression data of thousands of cancer samples, DAISY traverses all gene pairs (∼534 million) and examines for every pair whether it fulfills each of the three criteria described above. Gene pairs that fulfill all three criteria in a statistically significant manner are predicted to be SL pairs. Here, we applied DAISY to analyze nine different genome-wide cancer data sets (Barretina et al., 2012, Beroukhim et al., 2010, Cheung et al., 2011, Garnett et al., 2012, Luo et al., 2008, Marcotte et al., 2012 and Cancer Genome Atlas Research Network et al., 2013) (Table S1 available online).
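
The traversal described above amounts to enumerating unordered gene pairs and keeping those that pass every criterion. The following sketch shows that control flow only: the significance threshold `alpha`, the lookup-table criteria, and the placeholder gene names are assumptions, and no multiple-testing correction is applied.

```python
from itertools import combinations


def predict_sl_pairs(genes, criteria, alpha=0.05):
    """Return gene pairs whose p value is below alpha for every criterion.

    criteria : list of callables (gene_a, gene_b) -> p value
    """
    predicted = []
    for gene_a, gene_b in combinations(sorted(genes), 2):
        if all(criterion(gene_a, gene_b) < alpha for criterion in criteria):
            predicted.append((gene_a, gene_b))
    return predicted


def table_lookup(table):
    """Wrap a {frozenset({a, b}): p} table as a criterion; missing pairs get p = 1."""
    return lambda a, b: table.get(frozenset((a, b)), 1.0)


# Hypothetical p values for three placeholder genes.
sof = table_lookup({frozenset(("G1", "G2")): 0.001,
                    frozenset(("G1", "G3")): 0.010})
shrna = table_lookup({frozenset(("G1", "G2")): 0.020,
                      frozenset(("G1", "G3")): 0.400})
coexpression = table_lookup({frozenset(("G1", "G2")): 0.030,
                             frozenset(("G1", "G3")): 0.001})

sl_pairs = predict_sl_pairs({"G1", "G2", "G3"}, [sof, shrna, coexpression])
print(sl_pairs)
```

Here (`G1`, `G3`) is rejected because it fails the shRNA criterion even though it passes the other two.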

We expanded DAISY to also detect synthetic dosage lethality (Sajesh et al., 2013). While two genes form an SL pair if the inactivation of one gene renders the other essential, two genes form a synthetic dosage lethal (SDL) pair if the overactivity of one of them renders the other gene essential. Importantly, SDL interactions can permit the eradication of cancer cells with overactive oncogenes that are difficult to target directly (such as KRAS), by targeting the SDL partners of such oncogenes. DAISY detects SDL interactions via three inference procedures that are analogous to those outlined above for detecting SL interactions (Figure 1; Experimental Procedures). More specifically, DAISY defines two genes, A and B, as an SDL pair if their expression is correlated and if the overactivity (amplification and overexpression) of gene A induces the essentiality of gene B. Induced essentiality is detected in two ways: first, according to shRNA screens, by examining if gene B becomes essential when gene A is overactive; second, according to SCNA data, by examining if gene B has a higher SCNA level when gene A is overactive.

### Evaluating DAISY Based on Experimentally Detected SL Interactions in Cancer

We first examined DAISY based on SL interactions that have been experimentally tested in cancer. We applied DAISY to identify the SL partners of PARP1, the tumor suppressors VHL and MSH2, and the SDL partners of the oncogene KRAS. The predictions were performed for over 7,276 gene pairs that have been experimentally tested in six large scale screens (Bommi-Reddy et al., 2008, Lord et al., 2008, Luo et al., 2009, Martin et al., 2009, Steckel et al., 2012 and Turner et al., 2008). For every gene pair, DAISY returns four p values that denote the significance of the SL or SDL interaction between the two genes according to each one of the three inference strategies described in the previous section and according to all three approaches together (Figure 1; Experimental Procedures). We utilized these p values to examine the predictions along an increasing p value threshold and generate receiver operating characteristic (ROC) curves (Extended Experimental Procedures).

The DAISY predictor obtains an overall AUC of 0.779, which shows the concordance between the predicted and observed SL and SDL pairs (empirical p value <1 × 10⁻⁴; Figure 2A). To assess which of the inference strategies enables DAISY to correctly predict synthetic lethality, we repeated the predictions when using the p values obtained based on only one inference strategy at a time (Figure 2A). An AUC of 0.683 was obtained by predicting SL interactions based only on the SoF approach. These results are further improved by requiring that the gene pairs also be coexpressed, reaching an AUC of 0.770. As shRNA-based functional examination is not predictive on its own (an AUC of 0.477), we modified DAISY to consider the shRNA criterion as a soft constraint (Experimental Procedures). Despite the nonpredictability of the shRNA-based functional examination approach in this task, shRNA data are important for the generation of predictive SDL-networks (Supplemental Information; Figure S6). Importantly, DAISY captures well-established and clinically important SL interactions, including the prominent SL interaction between PARP1 and BRCA1/BRCA2 and the synthetic lethality between MSH2 and DHFR (Figures 2B–2G).
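
For readers unfamiliar with how an AUC is obtained from p values, the area under the ROC curve equals the probability that a randomly chosen true pair receives a smaller p value than a randomly chosen false pair, with ties counted as one half. The sketch below computes it that way; the function name and the toy values are hypothetical and unrelated to the reported results.

```python
def auc_from_pvalues(pvalues, labels):
    """AUC of a predictor that ranks pairs by increasing p value.

    labels : truthy for experimentally confirmed pairs, falsy otherwise.
    """
    positives = [p for p, is_true in zip(pvalues, labels) if is_true]
    negatives = [p for p, is_true in zip(pvalues, labels) if not is_true]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p < q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example: four tested pairs, two of them confirmed.
toy_pvalues = [0.01, 0.20, 0.30, 0.50]
toy_labels = [True, False, True, False]

toy_auc = auc_from_pvalues(toy_pvalues, toy_labels)
print(toy_auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking, which is why the value of 0.477 reported for the shRNA criterion alone is described as not predictive.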

### Experimentally Examining the DAISY-Predicted SL Partners of the Tumor Suppressor VHL

We next turned to experimentally test SL predictions of the tumor suppressor VHL that is frequently mutated in cancer, especially in clear cell renal carcinomas (Bommi-Reddy et al., 2008). We applied DAISY to predict the SL partners of VHL and identify among these genes those that are essential in renal carcinoma cells (RCC4) exclusively due to the loss of VHL, resulting in a set of 44 genes (Extended Experimental Procedures).

We performed a small interfering RNA (siRNA) screen to examine if the predicted genes are preferentially essential in VHL−/− renal carcinoma cells compared with isogenic cells in which pVHL function was restored (Extended Experimental Procedures). Overall, compared to the VHL-restored cells, the VHL-deficient cells are significantly more sensitive to the knockdown of the predicted VHL-SL partners (paired t test p value of 8.25 × 10⁻⁴) (Figure 3A, Table S2). Reassuringly, compared to the VHL-restored cells, the VHL-deficient cells are not significantly more sensitive to the knockdown of a control set of 30 randomly selected genes (paired t test p value of 0.255). Compared to another screen that searched for the SL partners of VHL among 88 kinases (Bommi-Reddy et al., 2008), our screen detected 3.83 times more SL genes (Bernoulli p value of 4.76 × 10⁻⁹; Extended Experimental Procedures).
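
The paired t test mentioned above compares, gene by gene, the effect of a knockdown in the two isogenic cell lines. As a minimal sketch, the statistic is the mean of the per-gene differences divided by its standard error; the function name and the numbers below are hypothetical, and converting the statistic to a p value (which requires the t distribution) is left to a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(values_a, values_b):
    """Return (t, degrees_of_freedom) for paired measurements."""
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have the same length")
    differences = [a - b for a, b in zip(values_a, values_b)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical growth inhibition (%) after knockdown of four genes.
inhibition_vhl_deficient = [42.0, 55.0, 38.0, 61.0]
inhibition_vhl_restored = [30.0, 41.0, 35.0, 44.0]

t_stat, dof = paired_t_statistic(inhibition_vhl_deficient, inhibition_vhl_restored)
print(t_stat, dof)
```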

We then measured the response of the renal cells to nine drugs whose targets were predicted by DAISY to be selectively essential in the VHL-deficient renal cells. Of note, these drugs are not currently administered to treat cancer, but are Food and Drug Administration (FDA)-approved to treat other clinical conditions, such as hypertension and depression. We managed to identify effects on cell growth for six out of the nine drugs. As predicted, the VHL-deficient cells were significantly more sensitive to each one of these six drugs (higher percentage of inhibition at mideffective concentration) (Figure 3B; Table S2). Reassuringly, this specificity was not observed with the negative control drug Staurosporine, indicating that the selective effect is not due to a general susceptibility of the VHL-deficient cells.

### Applying DAISY to Construct Genome-wide Networks of SL and SDL Interactions in Cancer

We applied DAISY to identify all gene pairs that are likely to be synthetically lethal in cancer, resulting in an SL network of 2,077 genes and 2,816 SL interactions (Figure 4), and an SDL network of 3,158 genes and 3,635 SDL interactions (Table S3). As each of the nine data sets examined was analyzed separately to identify SL (SDL) pairs, we tested the mutual overlap between the resulting SL (SDL) sets and found it to be significantly higher than expected (Figure S1).

Both networks display scale-free-like characteristics and are enriched with known cancer-associated genes and biological functions (Figures S1 and S2; Table S4). The genes included in the networks are significantly overexpressed both in normal tissues and especially in cancers (Wilcoxon rank sum p values <6.29 × 10⁻⁸). Interestingly, the network genes are significantly associated with cancer proliferation and less associated with normal proliferation (Waldman et al., 2013). They are highly enriched with human orthologs of mouse essential genes (hypergeometric p values <1 × 10⁻³⁰) and are evolutionarily conserved (Wilcoxon rank sum p values <2.99 × 10⁻¹⁷). Moreover, each one of these properties is further emphasized in genes that have a higher degree in the SL or SDL networks (Supplemental Information; Figure S2).

The SL and SDL pairs are highly enriched with genes that interact in the protein-protein interaction (PPI) network (hypergeometric p values <4.02 × 10⁻⁷). Testifying to their importance, genes included in the SL or SDL networks have a higher degree in the PPI network compared to other genes, especially if their degree in the SL or SDL network is high (Wilcoxon rank sum p values <5.79 × 10⁻²²; Figure S2D). Examining the genomic location of the SL and SDL pairs we find that while SL pairs tend to reside on different chromosomes, or at a large distance from each other on the same chromosome, the SDL gene pairs show the opposite behavior. The latter trend is observed also when identifying SDL interactions without considering the SoF approach. Discarding SDL gene pairs that reside close to each other depreciates the predictive signal of the network (Supplemental Information; Figure S3).

As direct experimental validation of the predicted SL and SDL interactions is not yet possible on a genome scale, we tested the interactions by examining their utility in three fundamental prediction tasks: the prediction of gene essentiality, clinical prognosis, and drug efficacy. In all tasks, the networks are utilized to generate cancer-specific predictions given a genomic characterization of a specific cancer cell line or clinical sample.

### SL-Based Prediction of Gene Essentiality in Cancer Cell Lines

Predicting gene essentiality based on the SL network is cell-line-specific. Indeed, examining the results of shRNA screens, the majority of genes are essential in very few cancer cell lines (Supplemental Information; Figure S4A). As we examined the predictions based on the results obtained in shRNA gene knockdown screens, we constructed an SL network without any shRNA data to avoid potential circularity. Based on this SL network and the genomic profiles of the cell lines, we predicted a gene as essential in a given cell line if one or more of its SL partners is inactive in that cell line (see Supplemental Information for further details, analyses, and results).
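The prediction rule stated above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, the dictionary layout of the SL network, and the toy gene names are all assumptions made for the example.

```python
from typing import Dict, Set


def predict_essential_genes(
    sl_partners: Dict[str, Set[str]],  # hypothetical layout: gene -> its SL partners
    inactive_genes: Set[str],          # genes called inactive in one cell line
) -> Dict[str, int]:
    """Return each predicted-essential gene with its number of inactive SL partners."""
    predictions = {}
    for gene, partners in sl_partners.items():
        n_inactive = len(partners & inactive_genes)
        # Rule from the text: essential if one or more SL partners is inactive.
        if n_inactive >= 1:
            predictions[gene] = n_inactive
    return predictions


# Toy data (hypothetical gene names).
sl_network = {
    "GENE_A": {"GENE_B", "GENE_C"},
    "GENE_B": {"GENE_A"},
    "GENE_D": {"GENE_E"},
}
inactive_in_cell_line = {"GENE_B", "GENE_C"}
predicted = predict_essential_genes(sl_network, inactive_in_cell_line)
print(predicted)
```

Returning the count of inactive partners, and not only a yes/no call, keeps the information needed to rank genes by how many of their SL partners are inactive.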

Overall, we predicted gene essentiality in 129 different cancer cell lines and examined the predictions based on the results of two large-scale gene essentiality screens (Cheung et al., 2011 and Marcotte et al., 2012). Per cell line the predicted essential genes are significantly enriched with genes that were found experimentally to be essential in the pertaining cell line (empirical p value < 2.52 × 10⁻⁴; Supplemental Information; Figure 5A; Table S5). Furthermore, the higher the number of predicted inactive SL partners a gene has, the more essential it is according to the experimental data (Figures 5B and 5C). Of note, the SL network succeeds more in predicting gene essentiality in cell lines with a higher number of gene deletions (Supplemental Information; Figures S4B and S4C; Table S5). Indeed, in such cell lines it is more likely that gene essentiality arises due to synthetic lethality. Finally, we predicted gene essentiality based on gene pairs that are human orthologs of yeast SLs (Conde-Pueyo et al., 2009). This, however, leads to markedly inferior performance, testifying to the value of the DAISY-inferred SLs (Supplemental Information; Figures S4D and S4E; Table S5).

We improved the unsupervised SL-based gene essentiality predictions described above by considering additional features that describe the state of a specific gene in a given cell line according to the SL network (e.g., the average SCNA level of its SL partners). Using these features, we trained neural network models on gene essentiality data (Extended Experimental Procedures). The performances of these supervised prediction models on unseen test sets resulted in ROC curves with AUCs of 0.755 and 0.854 for the Marcotte et al. (2012) and Achilles (Cheung et al., 2011) data, respectively (Figures 5D and 5E). For comparison, we considered the nine cell lines that were tested in both screens and utilized the shRNA scores obtained in one screen to predict gene essentiality according to the other screen (Extended Experimental Procedures). Using the Achilles screen to predict gene essentiality as reported in the Marcotte screen, or vice versa, results in inferior prediction performance, with AUCs of 0.663 and 0.706, respectively.

To further examine the SL-based gene essentiality predictions, we conducted a whole genome siRNA screen in the breast cancer cell line BT549 under normoxia and hypoxia (Extended Experimental Procedures; Table S6). We defined a refined set of essential genes, composed of genes that are essential in BT549 according to our siRNA screen under both conditions and according to the shRNA screen of Marcotte et al. (2012). Indeed, the performance of the SL-based predictor (that was not trained on gene essentiality data of BT549) is further improved when tested on this refined set of essential genes, obtaining an AUC of 0.951 (Figures 5F and S4F–S4K; Supplemental Information).

### Counderexpression of SL Pairs Is a Marker of Better Prognosis in Breast Cancer

To examine the SL network in a clinical setting, we analyzed gene expression and 15-year survival data in a cohort of 1,586 breast cancer patients (Curtis et al., 2012). We postulated that counderexpression of two SL-paired genes would increase tumor vulnerability and result in better prognosis. To test this hypothesis, we classified the patients according to each SL pair into two groups: patients whose tumors counderexpressed the two SL-paired genes (SL− group) and patients whose tumors expressed at least one of these genes (SL+ group). For each SL pair, we computed a signed Kaplan-Meier (KM) score (Extended Experimental Procedures). The higher the signed KM score is, the better the prognosis of the SL− group is compared to the SL+ group. Indeed, the signed KM scores of the SL pairs are significantly higher than those of randomly selected gene pairs (one-sided Wilcoxon rank sum p value of 3.09 × 10⁻⁵⁹). To examine if this result arises from the mere essentiality of genes in the SL network rather than the interaction between them, we repeated the analysis with randomly selected gene pairs involving genes from the SL network that are not connected by SL interactions. Reassuringly, the SL pairs have significantly higher signed KM scores also compared to these random SL network gene pairs (one-sided Wilcoxon rank sum p value of 2.00 × 10⁻⁹). Highly significant KM plots were obtained based on 271 SL pairs (log rank and Cox regression p values <0.05, following multiple hypotheses testing correction) (Figure 6A; Table S7).

Next, we classified the patients according to all the SL pairs in the network together. For each sample, we computed a global SL score that denotes the number of SL pairs it counderexpressed. As predicted, samples that counderexpressed a high number of SL pairs had a significantly better prognosis compared to those that counderexpressed a low number of SL pairs (log rank p value of 1.482 × 10⁻⁷; Figures 6B and 6C). Again, we examined if this result is due to the mere essentiality of the SL network genes rather than due to the specific SL interactions; we repeated this analysis using 10,000 topology preserving randomized networks consisting of the breast cancer essential genes (Marcotte et al., 2012) (Extended Experimental Procedures). Reassuringly, none of these random networks managed to predict patient survival as significantly as the SL network.
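As a minimal sketch, the global SL score of a sample can be computed by counting the SL pairs whose two genes are both underexpressed in that sample. The function name, the data layout, and the toy values below are hypothetical, and the underexpression calls are taken as given here rather than derived from expression data.

```python
from typing import Iterable, Set, Tuple


def global_sl_score(
    underexpressed: Set[str],            # genes called underexpressed in one sample
    sl_pairs: Iterable[Tuple[str, str]]  # SL network as a collection of gene pairs
) -> int:
    """Number of SL pairs in which both genes are underexpressed in the sample."""
    return sum(1 for a, b in sl_pairs if a in underexpressed and b in underexpressed)


# Toy SL network and two toy samples (hypothetical gene names).
sl_pairs = [("G1", "G2"), ("G1", "G3"), ("G4", "G5")]
samples = {
    "sample_1": {"G1", "G2", "G3"},
    "sample_2": {"G4"},
}
scores = {name: global_sl_score(genes, sl_pairs) for name, genes in samples.items()}
print(scores)
```

Samples could then be split into high-score and low-score groups for a survival comparison; the grouping threshold is not specified in this section, so it is left out of the sketch.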

Because breast cancer is a highly heterogeneous disease, we examined whether higher global SL scores are associated with improved prognosis in specific and more homogeneous groups of patients, all consisting of the same subtype, grade, or genomic instability level (Bilal et al., 2013). This is indeed the case for all groups except one: grade 1 patients. The global SL scores provide the most significant separation in the grade 2, normal-like subtype, and moderate genomic instability groups (log rank p values of 8.64 × 10⁻⁵, 1.01 × 10⁻³, and 1.25 × 10⁻⁴, respectively). As expected, the global SL score is significantly negatively correlated with the tumor grade and genomic instability level (Spearman correlation coefficients of −0.407 and −0.267, p values of 2.58 × 10⁻⁶² and 2.43 × 10⁻²⁷, respectively) and highly associated with the tumor subtype (ANOVA p value of 4.25 × 10⁻¹⁰²; Figure S5). Normal-like tumors have the highest global SL scores, while basal tumors have the lowest scores (Figure S5E). Notably, the prognostic value of the global SL score is significant even when accounting for the tumor grade, subtype, or genomic instability level (Cox p values of 7.18 × 10⁻⁴, 3.12 × 10⁻⁷, and 4.37 × 10⁻⁸, respectively). Lastly, the prognostic value of the global SL scores is superior to that obtained by using genomic instability levels (Figures S5I and S5J).

### Harnessing SDL Interactions to Predict Drug Efficacy

We utilized the SDL network to predict the response of various cancer cell lines to anticancer drugs. As these drugs mainly target oncogenes, we used the SDL network to predict their efficacy rather than the SL network, whose performance is indeed inferior in this task (Supplemental Information). Based on the SDL network and the genomic profiles of the cancer cell lines, we predicted for each drug which cell lines are sensitive and which are resistant to its administration (Extended Experimental Procedures). More specifically, if one of the drug targets had more than one overexpressed SDL partner in a given cell line, the cell line was predicted to be sensitive to the drug administration (Supplemental Information).
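The sensitivity rule described above can be sketched as follows. This is an illustrative reading of the rule as worded in the text (at least one drug target with more than one overexpressed SDL partner); the function name, data layout, and toy identifiers are assumptions, not the authors' code.

```python
from typing import Dict, Set


def predict_sensitive(
    drug_targets: Set[str],
    sdl_partners: Dict[str, Set[str]],  # hypothetical layout: target -> its SDL partners
    overexpressed: Set[str],            # genes called overexpressed in one cell line
) -> bool:
    """True if any drug target has more than one overexpressed SDL partner."""
    return any(
        len(sdl_partners.get(target, set()) & overexpressed) > 1
        for target in drug_targets
    )


# Toy example (hypothetical target and partner names).
sdl_network = {"TARGET_1": {"P1", "P2", "P3"}}
call_line_a = predict_sensitive({"TARGET_1"}, sdl_network, {"P1", "P2"})
call_line_b = predict_sensitive({"TARGET_1"}, sdl_network, {"P1"})
print(call_line_a, call_line_b)
```

Cell lines for which the function returns False would be the predicted resistant group in this sketch.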

To test this approach, we utilized two data sets of drug efficacies that were measured in a panel of cancer cell lines: (1) the Cancer Genome Project (CGP) data (Garnett et al., 2012), and (2) the Cancer Therapeutics Response Portal (CTRP) data (Basu et al., 2013). The SDL network enabled us to predict the response of 593 cancer cell lines to 23 drugs and of 241 cancer cell lines to 33 additional drugs when utilizing the CGP and CTRP data sets to test the predictions, respectively. Overall, drugs are significantly more effective in the predicted sensitive cell lines than in the predicted resistant cell lines (empirical p values <5.34 × 10⁻⁴; Figures 7A and 7B; Table S8). Considering only the predictions that were obtained for drugs with a sufficiently high number of SDL interactions increases the fraction of drugs that are significantly predicted (Figure 7C). As predicted, the efficacies of drugs increase with the number of overexpressed SDL partners that their targets have in a given cell line (Figure 7D). Exceptions to this trend may be explained by noting that drug efficacy is determined only partially by the essentiality of the drug targets, and additional factors, like the drug membrane permeability, may affect drug efficacies. For comparison, we predicted drug response by applying two other well-established approaches: (1) based on the mutation and copy-number status of the drug target(s), and (2) based on the genomic instability index of the cancer cells. The SDL network generates significant predictions for more than twice as many drugs compared to these competing predictors (Supplemental Information).

Focusing on the drugs that were most accurately predicted by using the SDL network, we found that each one of the SDL interactions involving the targets of these drugs is sufficient, on its own, to accurately predict the response to the pertaining drug (Figure 7E; Extended Experimental Procedures). Among these interactions is the predicted SDL interaction between EGFR and IGFBP3, whose overexpression should accordingly induce sensitivity to drugs targeting EGFR. Reassuringly, it has been shown that IGFBP3 is underexpressed in Gefitinib-resistant cells, and the addition of recombinant IGFBP3 restored the ability of Gefitinib to inhibit cell growth (Guix et al., 2008). Another interesting example is the predicted SDL interaction between PARP1 and MDC1. The latter contains two BRCA1 C-terminal motifs and also regulates BRCA1 localization and phosphorylation in DNA damage checkpoint control (Lou et al., 2003). Indeed, BRCA1/BRCA2 are known to be synthetically lethal with PARP1 (Lord et al., 2008).

In a manner analogous to that described earlier for predicting gene essentiality, we utilized the SDL network to build supervised neural network predictors of drug efficacies in cancer cell lines (Extended Experimental Procedures). Using only 53 features, we predicted drug efficacies with Spearman correlation coefficients of 0.721 and 0.547 and p values <1 × 10⁻³⁵⁰ for the CGP and CTRP data, respectively (Figures 7F–7I). We further examined our SDL-based predictors by analyzing results of a large pharmacological screen carried out recently by the same team as CTRP. In this study, the efficacies of ∼500 compounds were measured across >850 cancer cell lines (P.A.C., personal communication). One hundred and twenty-six of the tested compounds have at least one target in the SDL network, making it possible to predict the response to their administration. Based on the SDL network and the genomic profiles of these cell lines (Barretina et al., 2012), we predicted the efficacies of these drugs by using the unsupervised and supervised predictors (trained on the CTRP data). The SDL-based predictors obtained significant predictions (p value < 0.05) of drug efficacy for 83 (65.87%) and 70 (55.6%) drugs, when applying the unsupervised or supervised approach, respectively.

## Discussion

DAISY is a genome-scale, data-driven approach for the identification of cancer SL and SDL interactions. As shown, DAISY successfully captures the results obtained in key large-scale experimental studies exploring SLs in cancer, identifies valid SL interactions, and enables the prediction of gene essentiality, drug efficacy, and clinical prognosis in cancer.

DAISY presents a complementary effort to genetic and chemical screens, narrowing down the number of gene pairs that need to be examined experimentally to detect SL and SDL interactions in cancer. Based on the ROC curve presented in Figure 2A, an experimental screen for discovering SL interactions could be designed to check the SL pairs predicted by DAISY such that 5%, 25%, 50%, or 70% of all existing SL interactions will be detected by examining only 0.25%, 4%, 14%, or 24% of all possible gene pairs, respectively. Hence, testing only the most confident predictions makes it possible to find up to 20 times more SL pairs than expected at random. Likewise, by applying DAISY to design an siRNA screen for detecting the SL interactions of VHL, we identified almost four times as many SL interactions compared to a screen that was designed by applying biological reasoning. In light of these results, DAISY could facilitate a more rapid and rational discovery of SL interactions in cancer by guiding focused experimental screens.
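The "up to 20 times" figure follows directly from the operating points quoted above. A random screen that examines x% of all gene pairs is expected to recover x% of the SL interactions, so the fold enrichment at each operating point is the detected fraction divided by the examined fraction. The short sketch below just reproduces that arithmetic; the variable names are illustrative.

```python
# (percent of SL interactions detected, percent of gene pairs examined),
# taken from the operating points quoted in the text.
operating_points = [(5, 0.25), (25, 4), (50, 14), (70, 24)]

# Fold enrichment over a random screen of the same size.
fold_enrichment = [detected / examined for detected, examined in operating_points]

for (detected, examined), fold in zip(operating_points, fold_enrichment):
    print(f"{detected}% detected from {examined}% of pairs: {fold:.2f}-fold")
```

The enrichment is highest for the most confident predictions (20-fold) and falls toward roughly 3-fold as a larger share of pairs is screened.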

Nonetheless, DAISY has several limitations one needs to account for. First, it is restricted to the identification of SL interactions in cancer, as it is based on unique cancer-specific data that captures the genomic instability of cancer cells (e.g., SCNA). As such, DAISY cannot be tested by applying it to identify SL interactions in model microorganisms such as yeast. Second, DAISY identifies SL interactions based on large-scale genomic data and shRNA screens, which are at times noisy and inaccurate (Bhinder et al., 2014). Third, as DAISY is based on the identification of gene inactivation, additional mechanisms of gene inactivation, such as epigenetic and posttranscriptional regulation, should be accounted for in the future. Fourth, the genomic location of genes may result in false-negative and false-positive predictions of SL and SDL interactions, respectively (see Supplemental Information for further analysis). Last, the ability of the SL network to accurately predict gene essentiality in vivo remains to be determined.

We have shown that SL and SDL interactions have a marked cumulative effect (Figures 5B, 5C, and 7D). Thus, a gene can form a useful drug target due to the (possibly partial) inactivation or overactivation of several of its SL or SDL partners, respectively. SL-based treatment can therefore be especially effective in targeting genetically unstable tumors that harbor many gene deletions and amplifications. Furthermore, a drug may be able to kill a broad array of genomically heterogeneous cells, each sensitive to the drug due to the inactivity (overactivity) of a different subset of the SL (SDL) partners of the drug targets. Targeting a gene with many inactive SL and/or overactive SDL partners may hence counteract the development of treatment resistance, especially if the SL/SDL partners reside on different chromosomes or in distant genomic locations. Moreover, SL-based treatment can induce the reactivation of a tumor suppressor or the inactivation of an oncogene by targeting its SL or SDL pair, respectively.

Four main translational challenges could potentially be tackled by utilizing SL and SDL networks: (1) ranking existing treatments for a given patient based on the genomic characteristics of the tumor, as initially shown here in cell lines; (2) repurposing approved drugs that are currently used to treat other diseases to treat cancer, as shown here for treating a VHL-deficient cancer; (3) systematically identifying new drug targets; and (4) predicting cancer prognosis, as shown here for breast cancer. Taken together, SL and SDL network-based analysis combined with personalized genomics can provide an important future tool for assessing response to treatment and for developing more selective and effective personalized therapeutics.

## Experimental Procedures

### Description of DAISY

DAISY identifies candidate SL and SDL interactions by applying three separate statistical inference procedures. Each procedure has its own input and outputs a set of candidate SL or SDL pairs. Gene pairs that are identified as candidate SL or SDL pairs by all three procedures are identified by DAISY as SL or SDL pairs, respectively. The three inference procedures are described below (comments in parentheses refer to changes made to identify SDL pairs):

(1) The genomic SoF procedure analyzes a set of input data sets denoted as SoF data sets. Each data set includes SCNA profiles of cancer samples and optionally their mRNA and somatic mutation profiles. For every pair of genes, denoted as A and B, and every data set S in SoF data sets, a Wilcoxon rank sum test is conducted to examine if gene B has a significantly higher SCNA level in samples in which gene A is inactive (overactive) than in the rest of the samples. The output consists of gene pairs that, according to at least one of the data sets in SoF data sets, pass the test described above in a statistically significant manner (a Wilcoxon rank sum p value <0.05 following Bonferroni correction for multiple hypotheses testing).

(2) The shRNA-based functional examination procedure analyzes a set of data sets denoted as shRNA data sets. Each data set includes the results obtained in a gene essentiality (shRNA) screen together with the SCNA and gene expression profiles of the cancer cell lines examined in that screen. For every pair of genes, denoted as A and B, and every data set S in shRNA data sets, a Wilcoxon rank sum test is conducted to examine if gene B has significantly lower shRNA scores in samples in which gene A is inactive (overactive) than in the rest of the samples (the lower the shRNA score is, the more essential the gene is). The output consists of gene pairs that, according to at least one of the data sets in shRNA data sets, pass the test described above in a statistically significant manner (a Wilcoxon rank sum p value <0.05).

(3) The pairwise gene coexpression procedure is given a set of transcriptomic data sets of cancer samples and returns gene pairs whose expression, in at least one of the data sets, is significantly positively correlated (a Spearman correlation coefficient ≥Rmin and a p value < 0.05 following Bonferroni correction for multiple hypotheses testing).
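The one-sided rank sum comparison used in procedures (1) and (2) can be sketched with the Python standard library alone. This is a simplified illustration, not the authors' implementation: it uses the normal approximation without a tie correction in the variance, and the function names and toy SCNA values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_pvalue_greater(x: Sequence[float], y: Sequence[float]) -> float:
    """One-sided p value for 'x tends to be larger than y' (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                        # rank sum of the first group
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    z = (w - mean_w) / sd_w
    return 1 - NormalDist().cdf(z)


# Toy SCNA levels of gene B, split by whether gene A is inactive in the sample.
b_scna_when_a_inactive = [0.5, 0.6, 0.7, 0.8, 0.9]
b_scna_in_other_samples = [-0.2, -0.1, 0.0, 0.1, 0.2]
p_sof = rank_sum_pvalue_greater(b_scna_when_a_inactive, b_scna_in_other_samples)
print(f"one-sided p value: {p_sof:.4f}")
```

For the shRNA procedure the direction is reversed (lower scores when gene A is inactive), which in this sketch amounts to swapping the two arguments. In practice an established statistics library would be preferable to a hand-written test.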

The candidate SL or SDL pairs that are identified in the first and third procedures are obtained with highly stringent statistical cutoffs, a p value <0.05 following Bonferroni correction. The data obtained in shRNA screens has a low statistical power and is hence utilized (in the second procedure) only to further refine the already highly statistically significant SL and SDL sets identified in the first and third procedures.

The first and second procedures are based on the detection of gene inactivation and overactivation in the samples analyzed. A gene is defined as inactive in a sample if it is underexpressed and its SCNA is below −0.3 or if it is mutated with a deleterious mutation. The latter refers to nonsense and frame-shift mutations. Likewise, a gene is defined as overactive in a sample if it is overexpressed and its SCNA is above 0.3. Of note, the SCNA parameters (−0.3 and 0.3) used here are more stringent cutoffs compared to those used in the literature to define gene amplification and deletion (Beroukhim et al., 2010). A gene is defined as underexpressed in a given sample if its expression level is below the 10th percentile of its expression levels across all samples in the data set, and similarly, as overexpressed if its expression level is above its 90th percentile. In the third procedure we set Rmin to 0.5.
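The definitions above translate into a small classification routine. The sketch below is illustrative only: the function name and toy values are hypothetical, and the percentile method (`statistics.quantiles` with its default interpolation) is an assumption, since the text does not say how percentiles were computed.

```python
from statistics import quantiles
from typing import List, Sequence

SCNA_DELETION_CUTOFF = -0.3      # from the text
SCNA_AMPLIFICATION_CUTOFF = 0.3  # from the text


def gene_states(
    expression: Sequence[float],  # one gene's expression across all samples
    scna: Sequence[float],        # the same gene's SCNA level per sample
    deleterious: Sequence[bool],  # nonsense or frame-shift mutation per sample
) -> List[str]:
    """Label each sample as 'inactive', 'overactive', or 'neutral' for one gene."""
    deciles = quantiles(expression, n=10)  # 9 cut points: 10th..90th percentile
    low, high = deciles[0], deciles[-1]
    states = []
    for expr, cn, mutated in zip(expression, scna, deleterious):
        if (expr < low and cn < SCNA_DELETION_CUTOFF) or mutated:
            states.append("inactive")
        elif expr > high and cn > SCNA_AMPLIFICATION_CUTOFF:
            states.append("overactive")
        else:
            states.append("neutral")
    return states


# Toy data for one gene across 11 samples.
expression = list(range(1, 12))
scna = [-0.5] + [0.0] * 9 + [0.6]
deleterious = [False] * 11
deleterious[5] = True  # one sample carries a deleterious mutation
states = gene_states(expression, scna, deleterious)
print(states)
```

Here the first sample is inactive through underexpression plus deletion, the sixth through mutation alone, and the last is overactive.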

To find the candidate pairs and construct the SL and SDL networks, we applied DAISY with the data sets listed in Table S1 and traversed all ∼535 million gene pairings. To do so efficiently, DAISY was implemented and run on the HTCondor architecture, which enables parallel computing (Thain et al., 2005).
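To give a sense of how stringent a Bonferroni correction is at this scale, the sketch below computes the per-test threshold under the assumption that the number of tests equals the ∼535 million gene pairings. That assumption is ours for illustration; the text does not state the exact number of hypotheses used in the correction.

```python
ALPHA = 0.05                 # family-wise significance level from the text
N_GENE_PAIRS = 535_000_000   # approximate number of gene pairings from the text

# Bonferroni: an individual test must reach alpha / (number of tests).
per_test_threshold = ALPHA / N_GENE_PAIRS
print(f"per-test p value threshold: {per_test_threshold:.3e}")
```

Under this assumption a gene pair would need a raw p value below roughly 9.3 × 10⁻¹¹ to pass.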

### Network Availability and Visualization

Interactive maps of the networks are accessible through http://www.cs.tau.ac.il/∼livnatje/SL_network.zip and can be explored using the freely available Cytoscape software (Cline et al., 2007). The maps include different gene properties and annotations, as well as alternative views that dissect the network hubs or genes with specific characteristics. We clustered the SL and SDL networks by applying the Girvan-Newman fast greedy algorithm as implemented by the GLay Cytoscape plug-in (Morris et al., 2011 and Su et al., 2010) and performed gene annotation enrichment analysis for every network and every network cluster via DAVID (Huang et al., 2009).

## Author Contributions

E.R. supervised the research. E.R. and L.J.A. conceived and designed the computational approach, analyzed the data, and wrote the paper. L.J.A. performed the statistical and machine learning analyses. E.G. designed and supervised the siRNA screens performed in his lab by N.P., L.M., D.J., and E.S. P.A.C. and B.S.-L. provided and analyzed pharmacological screening data. L.J.A. and Y.Y.W. performed the clinical survival analysis. Y.Y.W. performed the evolutionary and PPI network analysis. A.W. preprocessed the SCNA data. T.G. and E.G. provided insights regarding the biological aspects of the work. T.G. and Y.Y.W. assisted in writing the paper.

## Acknowledgments

We thank A. Wagner, D. Horn, D. Steinberg, E. Halperin, I. Meilijson, L. Wolf, M. Kupiec, M. Oberhardt, and R. Sharan for their help and comments. We thank E. MacKenzie for technical support. L.J.A. and A.W. are partially funded by the Edmond J. Safra bioinformatics center and the Israeli Center of Research Excellence program (I-CORE, Gene Regulation in Complex Human Disease Center No 41/11). L.J.A. was also funded by the Dan David foundation and by the Adams Fellowship Program of the Israel Academy of Sciences and Humanities. Y.Y.W. was supported in part by Eshkol fellowship (the Israeli Ministry of Science and Technology). E.R.’s research in cancer is supported by grants from the Israeli Science Foundation (ISF) and Israeli Cancer Research Fund (ICRF). E.R. and T.G. are supported by the I-CORE program.

## References

• Ashworth et al., 2011
• Genetic interactions in cancer progression and treatment
• Cell, 145 (2011), pp. 30–38
• Barretina et al., 2012
• The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity
• Nature, 483 (2012), pp. 603–607
• Bassik et al., 2013
• A systematic mammalian genetic interaction map reveals pathways underlying ricin susceptibility
• Cell, 152 (2013), pp. 909–922
• Basu et al., 2013
• An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules
• Cell, 154 (2013), pp. 1151–1161
• Beroukhim et al., 2010
• The landscape of somatic copy-number alteration across human cancers
• Nature, 463 (2010), pp. 899–905
• Bhinder et al., 2014
• Comparative analysis of RNAi screening technologies at genome-scale reveals an inherent processing inefficiency of the plasmid-based shRNA hairpin
• Comb. Chem. High Throughput Screen., 17 (2014), pp. 98–113
• Bilal et al., 2013
• Improving breast cancer survival analysis through competition-based multidimensional modeling
• PLoS Comput. Biol., 9 (2013), p. e1003047
• Bommi-Reddy et al., 2008
• Kinase requirements in human cells: III. Altered kinase requirements in VHL-/- cancer cells detected in a pilot synthetic lethal screen
• Proc. Natl. Acad. Sci. USA, 105 (2008), pp. 16484–16489
• Brough et al., 2011
• Searching for synthetic lethality in cancer
• Curr. Opin. Genet. Dev., 21 (2011), pp. 34–41
• Byrne et al., 2007
• A global analysis of genetic interactions in Caenorhabditis elegans
• J. Biol., 6 (2007), p. 8
• Cancer Genome Atlas Research Network et al., 2013
• The Cancer Genome Atlas Pan-Cancer analysis project
• Nat. Genet., 45 (2013), pp. 1113–1120
• Cheung et al., 2011
• Systematic investigation of genetic vulnerabilities across cancer cell lines reveals lineage-specific dependencies in ovarian cancer
• Proc. Natl. Acad. Sci. USA, 108 (2011), pp. 12372–12377
• Chipman and Singh, 2009
• Predicting genetic interactions with random walks on biological networks
• BMC Bioinformatics, 10 (2009), p. 17
• Cline et al., 2007
• Integration of biological networks and gene expression data using Cytoscape
• Nat. Protoc., 2 (2007), pp. 2366–2382
• Conde-Pueyo et al., 2009
• Human synthetic lethal inference as potential anti-cancer target gene detection
• BMC Syst. Biol., 3 (2009), p. 116
• Costanzo et al., 2010
• The genetic landscape of a cell
• Science, 327 (2010), pp. 425–431
• Curtis et al., 2012
• The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups
• Nature, 486 (2012), pp. 346–352
• Folger et al., 2011
• Predicting selective drug targets in cancer through metabolic networks
• Mol. Syst. Biol., 7 (2011), p. 501
• Frezza et al., 2011
• Haem oxygenase is synthetically lethal with the tumour suppressor fumarate hydratase
• Nature, 477 (2011), pp. 225–228
• Garnett et al., 2012
• Systematic identification of genomic markers of drug sensitivity in cancer cells
• Nature, 483 (2012), pp. 570–575
• Guix et al., 2008
• Acquired resistance to EGFR tyrosine kinase inhibitors in cancer cells is mediated by loss of IGF-binding proteins
• J. Clin. Invest., 118 (2008), pp. 2609–2619
• Hartwell et al., 1997
• Integrating genetic approaches into the discovery of anticancer drugs
• Science, 278 (1997), pp. 1064–1068
• Huang et al., 2009
• Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists
• Nucleic Acids Res., 37 (2009), pp. 1–13
• Kelley and Ideker, 2005
• Systematic interpretation of genetic interactions using protein networks
• Nat. Biotechnol., 23 (2005), pp. 561–566
• Laufer et al., 2013
• Mapping genetic interactions in human cancer cells with RNAi and multiparametric phenotyping
• Nat. Methods, 10 (2013), pp. 427–431
• Lord et al., 2008
• A high-throughput RNA interference screen for DNA repair determinants of PARP inhibitor sensitivity
• DNA Repair (Amst.), 7 (2008), pp. 2010–2019
• Lou et al., 2003
• Mediator of DNA damage checkpoint protein 1 regulates BRCA1 localization and phosphorylation in DNA damage checkpoint control
• J. Biol. Chem., 278 (2003), pp. 13599–13602
• Lu et al., 2013
• Genome evolution predicts genetic interactions in protein complexes and reveals cancer drug targets
• Nat. Commun., 4 (2013), p. 2124
• Luo et al., 2008
• Highly parallel identification of essential genes in cancer cells
• Proc. Natl. Acad. Sci. USA, 105 (2008), pp. 20380–20385
• Luo et al., 2009
• A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene
• Cell, 137 (2009), pp. 835–848
• Marcotte et al., 2012
• Essential gene profiles in breast, pancreatic, and ovarian cancer cells
• Cancer Discov., 2 (2012), pp. 172–189
• Martin et al., 2009
• Methotrexate induces oxidative DNA damage and is selectively lethal to tumour cells with defects in the DNA mismatch repair gene MSH2
• EMBO Mol. Med., 1 (2009), pp. 323–337
• Morris et al., 2011
• clusterMaker: a multi-algorithm clustering plugin for Cytoscape
• BMC Bioinformatics, 12 (2011), p. 436
• Sajesh et al., 2013
• Synthetic genetic targeting of genome instability in cancer
• Cancers, 5 (2013), pp. 739–761
• Steckel et al., 2012
• Determination of synthetic lethal interactions in KRAS oncogene-dependent cancer cells reveals novel therapeutic targeting strategies
• Cell Res., 22 (2012), pp. 1227–1245
• Su et al., 2010
• GLay: community structure analysis of biological networks
• Bioinformatics, 26 (2010), pp. 3135–3137
• Szappanos et al., 2011
• An integrated approach to characterize genetic interaction networks in yeast metabolism
• Nat. Genet., 43 (2011), pp. 656–662
• Szczurek et al., 2013
• Synthetic sickness or lethality points at candidate combination therapy targets in glioblastoma
• Int. J. Cancer, 133 (2013), pp. 2123–2132
• Thain et al., 2005
• Distributed computing in practice: the Condor experience
• Concurr. Comp-Pract. E., 17 (2005), pp. 323–356
• Turner et al., 2008
• A synthetic lethal siRNA screen identifying genes mediating sensitivity to a PARP inhibitor
• EMBO J., 27 (2008), pp. 1368–1377
• Typas et al., 2008
• High-throughput, quantitative analyses of genetic interactions in E. coli
• Nat. Methods, 5 (2008), pp. 781–787
• Waldman et al., 2013
• A genome-wide systematic analysis reveals different and predictive proliferation expression signatures of cancerous vs. non-cancerous cells
• PLoS Genet., 9 (2013), p. e1003806
• Wong et al., 2004
• Combining biological networks to predict genetic interactions
• Proc. Natl. Acad. Sci. USA, 101 (2004), pp. 15682–15687

### Analysis of biological processes and diseases using text mining approaches.

Methods Mol Biol. 2010;593:341-82. doi: 10.1007/978-1-60327-194-3_16.

### Abstract

A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. 
Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.

PMID: 19957157 [PubMed – indexed for MEDLINE]

# Review

Nature Reviews Cancer 5, 689-698 (September 2005) |doi:10.1038/nrc1691

## The Concept of Synthetic Lethality in the Context of Anticancer Therapy

William G. Kaelin, Jr

Two genes are synthetic lethal if mutation of either alone is compatible with viability but mutation of both leads to death. So, targeting a gene that is synthetic lethal to a cancer-relevant mutation should kill only cancer cells and spare normal cells. Synthetic lethality therefore provides a conceptual framework for the development of cancer-specific cytotoxic agents. This paradigm has not been exploited in the past because there were no robust methods for systematically identifying synthetic lethal genes. This is changing as a result of the increased availability of chemical and genetic tools for perturbing gene function in somatic cells.

The bottleneck to the development of safe and effective anticancer drugs does not lie in an inability to identify chemicals that will kill cancer cells. In fact, thousands of compounds have been identified over the past 50 years that will accomplish this feat. Instead, the bottleneck lies in our inability to identify chemicals that will kill cancer cells at concentrations that do not harm patients. Most of the chemotherapeutic agents used today have remarkably low THERAPEUTIC INDICES and narrow THERAPEUTIC WINDOWS. The therapeutic window is influenced by a number of factors, including the shape of the curve that relates the intended biological effect of the drug to changes in the activity of its intended target (‘on-target’), and the propensity of the drug to affect unintended targets (‘off-targets’) at higher doses. Off-target effects can cause toxicity and, in some cases, antagonize on-target biological effects. Most anticancer drugs in use today were discovered based on their ability to kill rapidly dividing cancer cells in vitro. Predictably, when administered to patients, many of these drugs also injure rapidly dividing normal cells, such as bone-marrow haematopoietic precursors and gastrointestinal mucosal epithelial cells. In addition, many of these drugs are toxic to normal cells that are not rapidly dividing. Examples include doxorubicin (toxic to the heart), bleomycin (toxic to the lung) and cytarabine (toxic to the cerebellum). These other forms of organ damage become particularly important (dose-limiting) in settings in which toxicity to rapidly dividing cells can be partially ameliorated through supportive-care measures (such as bone-marrow transplantation). For these reasons, it is imperative that anticancer drugs be developed that can kill cancer cells at clinically achievable concentrations, with therapeutic indices that are higher than those of classic cytotoxic agents.

#### Therapeutic index

Many factors influence the therapeutic index of a drug. Some relate to the quality of the drug itself — for example, its ability to distinguish between intended and unintended targets. Others relate to the nature of its target — for example, its distribution, its normal function(s), and the degree to which those functions must be altered to achieve the desired effect. Most antibacterial agents are remarkably safe because their targets are present in the organisms they are designed to kill but not in normal host cells. However, many other relatively ‘safe’ drugs — such as anti-hypertensives, anti-anxiety drugs and cholesterol-lowering agents — inhibit normal cellular proteins. These drugs are clinically useful because their effects are titratable (through changes in dose and schedule), and quantitative changes in the activities of their targets lead to the desired changes in host physiology.

Two paths can be envisioned to arrive at an anticancer drug that would selectively kill cancer cells. The first, which is modelled on the development of anti-infectious agents, would be to identify drug targets that are essential for the viability of cancer cells but are not present in normal cells (the so-called ‘target-driven therapeutic index’)1, 2 (Fig. 1). The fusion proteins generated by cancer-associated chromosomal translocations might, at first glance, seem to be ideal in this regard. However, this presumes that drugs can be developed that will discriminate between a particular protein (or functional subdomain) in its normal context and in its pathogenic, fused state. This might be difficult. For example, it is fallacious to argue that the efficacy and safety of imatinib mesylate (Glivec) for the treatment of chronic myelogenous leukaemia (CML) stems from the fact that its target, breakpoint cluster region (BCR)–Abelson murine leukaemia viral oncogene homologue (ABL), is unique to CML cells because imatinib mesylate inhibits the kinase activities of both BCR–ABL and ABL (in addition to several other cellular kinases)3. So, the relatively high therapeutic index of imatinib mesylate cannot be explained by the restriction of its target(s) to CML cells (see below for potential alternative explanations). Similarly, it might be difficult to develop drugs that directly inhibit oncoproteins that result from point mutations without affecting their normal counterparts.

##### Figure 1 | Framework for developing anticancer drugs with a high therapeutic index.

An anticancer drug might have a high therapeutic index because its target is uniquely present in cancer cells (a), or because the requirement for its target is quantitatively or qualitatively different in cancer cells than in normal cells (b and c). This differential requirement might be because of intrinsic differences in the cells (b), such as genetic (red) and epigenetic (blue) differences, or extrinsic differences in the cells (c), such as loss of survival signals provided by normal cell–cell and cell–matrix interactions. Modified with permission from Ref. 2 © (2002) Elsevier Science.

A second way to achieve enhanced cancer-cell selectivity, however, would be to identify situations where the requirement for a particular target was enhanced in the context of a cancer cell compared with normal cells (the so-called ‘context-driven therapeutic index’)1, 2 (Fig. 1). The requirement for a particular target might be increased because of changes that are intrinsic to the cancer cell (for example, through epigenetic or genetic changes), extrinsic to the cancer cell (for example, as a result of microenvironmental changes leading to altered cell–matrix and cell–cell interactions), or both.

All of the anticancer drugs in use today affect targets that are shared between normal cells and cancer cells, including enzymes involved in fundamental processes such as DNA replication. The fact that their therapeutic indices, however small, exceed unity, coupled with the observation that they can, in certain settings, induce striking remissions and occasionally cures (for example, cisplatinum-based regimens for testicular cancer), indicates that contextual differences between normal cells and cancer cells are therapeutically exploitable. So, can our growing knowledge of cancer genetics, coupled with a more sophisticated understanding of gene–gene interactions, be used to identify drug targets that have enhanced therapeutic indices by virtue of such contextual differences? Studies of gene–gene interactions in model organisms have provided a conceptual framework for this task.

#### Synthetic lethality

Two genes (‘A’ and ‘B’) are said to be ‘synthetic lethal’ if mutation of either gene alone is compatible with viability but simultaneous mutation of both genes causes death4, 5, 6, 7, 8, 9 (Fig. 2). This concept can be extended to situations in which simultaneous mutation of two genes impairs cellular fitness more than mutation of either gene alone (‘synthetic sick’). In either of these two situations, A buffers the effect of changes in B and vice-versa, but this buffering is lost when both A and B are mutated at the same time4, 6, 10. Synthetic lethal interactions have most commonly been described for loss-of-function alleles, but can also involve gain-of-function alleles. For example, gene B might become essential for survival when a particular gene A is overexpressed (known as synthetic dosage lethality)11, 12, 13. Approximately 20% of genes in the budding yeast Saccharomyces cerevisiae are individually essential, but genetic screens in this organism suggest that synthetic lethal interactions are common among the remaining 80% (perhaps on the order of 10 interactions per gene)10, 14, 15.
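
The definitions above can be sketched as a small classifier. This is a minimal illustration and not a method from the review: the function name `classify_interaction`, the relative-fitness scale (wild type = 1.0, death = 0.0), and the `lethal_cutoff` and `margin` thresholds are all assumptions chosen for the example.

```python
def classify_interaction(fit_a: float, fit_b: float, fit_ab: float,
                         lethal_cutoff: float = 0.05,
                         margin: float = 0.1) -> str:
    """Classify a gene pair from relative fitness values (wild type = 1.0).

    fit_a, fit_b: fitness of the single mutants a and b.
    fit_ab:       fitness of the double mutant.
    lethal_cutoff and margin are hypothetical thresholds, not published values.
    """
    # Synthetic lethality requires that each single mutant is viable.
    if min(fit_a, fit_b) <= lethal_cutoff:
        return "not applicable: a single mutant is already lethal"
    # Both singles viable, double mutant dead.
    if fit_ab <= lethal_cutoff:
        return "synthetic lethal"
    # Double mutant is clearly less fit than either single mutant.
    if fit_ab < min(fit_a, fit_b) - margin:
        return "synthetic sick"
    return "no synthetic interaction"
```

For instance, single mutants with fitness 0.9 and 0.95 and a double mutant with fitness 0.0 would be classified as synthetic lethal, whereas a double mutant with fitness 0.4 would be classified as synthetic sick under these assumed thresholds.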

##### Figure 2 | Gene–gene interactions: synthetic lethal and suppressive interactions for two genes.

Two genes (‘A’ and ‘B’) are said to be ‘synthetic lethal’ if mutation of either gene alone is compatible with viability but simultaneous mutation of both genes causes death. B is an extragenic suppressor of A if mutation of B suppresses the phenotype observed when A is mutated. A lowercase letter denotes a mutant.

Loss-of-function alleles having a synthetic lethal (or synthetic sick) relationship can often, but not always, be easily rationalized based on the functions of their protein products. They might, for example, be uniquely redundant with respect to an essential function (as occurs in some PARALOGUES), be two subunits of an essential multiprotein complex, be two interconnected components in an essential linear pathway (with each mutation decreasing the flux through the pathway), or participate in parallel pathways that are together essential for survival (for example, a crucial metabolic pathway and an alternative or salvage pathway). The concept of synthetic lethality can be further extended to embrace the situation where mutation of A is lethal only in combination with mutations that affect several non-essential genes B, C, D and so on2, 6.

It has been suggested that the concept of synthetic lethality could be used to choose anticancer drug targets1, 7, 16. In particular, the protein products of genes that are synthetic lethal to known cancer-causing mutations, if amenable to pharmacological attack (for example, if they encode an enzyme), should theoretically represent excellent targets for anticancer therapy. This approach simultaneously tackles two vexing problems in cancer pharmacology. The first relates to the fact that many cancer-associated mutations, like most drugs, induce a loss of function1, 2. Therefore, it is not immediately obvious how to pharmacologically approach cancer cells in which, for example, a particular tumour-suppressor protein is crippled (or worse yet, absent). Targeting a protein that is synthetic lethal to such a lost or crippled protein provides an elegant solution to this problem. The second problem relates to whether it is possible to achieve selectivity by inhibiting proteins that are also important for cellular homeostasis. If A and B are synthetic lethal (or synthetic sick), then inhibitors of B should selectively kill (or inhibit) cancer cells with mutant A. In the ideal situation, complete neutralization of B, genetically or pharmacologically, would have no effect on normal cells, and even partial inhibition of B in cancer cells would cause death (because of mutant A; Fig. 3, left panel). However, B inhibitors might display a significant therapeutic index even when these ideal conditions are not met. This would require that the A mutation shifts or alters the fitness dose–response curve of the B inhibitor such that keeping B activity below a certain threshold selectively impairs cells with mutant A (Fig. 3, middle and right panels).

##### Figure 3 | Theoretical fitness curves for wild-type and A-/- cells in response to a drug that inhibits the B gene product.

A reading of 0% fitness denotes death, whereas 100% fitness denotes the wild-type state (for simplicity, fitness >100% is not considered in these examples). In the middle panel, a therapeutic window is created by a shift in the fitness curve when gene A is absent. In the left and right panels the therapeutic window is created by changes in the shapes of the fitness curves when gene A is absent.
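
The shifted-curve case (middle panel) can be illustrated numerically. The sketch below assumes, purely for illustration, that fitness follows a Hill-type curve, fitness = 100 / (1 + (dose / IC50)^h); the review does not specify a functional form, and the names `dose_for_fitness` and `therapeutic_window` as well as the 80% and 20% thresholds are hypothetical.

```python
from typing import Optional, Tuple


def dose_for_fitness(fitness_pct: float, ic50: float, hill: float) -> float:
    """Invert fitness = 100 / (1 + (dose / ic50) ** hill) to get the dose."""
    return ic50 * (100.0 / fitness_pct - 1.0) ** (1.0 / hill)


def therapeutic_window(wt_ic50: float, mut_ic50: float, hill: float = 2.0,
                       spare_pct: float = 80.0,
                       kill_pct: float = 20.0) -> Optional[Tuple[float, float]]:
    """Return the dose range that spares wild-type cells but impairs A-/- cells.

    Assumed criteria (not from the review): wild-type fitness stays at or
    above spare_pct, and A-/- fitness falls to or below kill_pct.
    """
    # Lowest dose at which the A-/- cells drop to kill_pct fitness.
    low = dose_for_fitness(kill_pct, mut_ic50, hill)
    # Highest dose at which wild-type cells still retain spare_pct fitness.
    high = dose_for_fitness(spare_pct, wt_ic50, hill)
    return (low, high) if low < high else None
```

With an assumed tenfold shift in IC50 (10 for wild-type cells, 1 for A-/- cells, in arbitrary dose units), a window exists; with identical curves, the function returns `None`, meaning there is no dose that satisfies both criteria.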

It could be argued that some (and perhaps most) anticancer drugs in use today are, at least in hindsight, exploiting synthetic lethal, or synthetic sick, interactions. For example, synthetic lethal relationships between DNA-replication genes (such as certain DNA polymerases) and DNA-repair genes (such as mismatch-repair genes) are well documented in model organisms7, 16. It seems likely that the efficacy of the many anticancer drugs that interfere with DNA synthesis is due, at least in some cases, to the presence of tumour-associated mutations that affect DNA repair or the response to DNA damage. Another example of synthetic interactions is provided by certain chemotherapeutic agents and mutations that directly or indirectly compromise the function of the retinoblastoma protein (pRB, encoded by the RB1 gene) tumour suppressor. Inactivation of pRB has been documented in many cancers and leads to an increase in E2F activity, which, in turn, activates various genes involved in S-phase entry17. One of these, topoisomerase II, causes DNA strand breaks and apoptosis when bound to topoisomerase inhibitors such as etoposide. As would be predicted, pRB-pathway mutations sensitize cells to drugs that inhibit topoisomerase II (Refs 18–21). In addition, E2F1, like the oncoprotein MYC, increases the expression of many pro-apoptotic genes, including the p53 paralogue p73, which might sensitize pRB-defective cells to drugs that elicit additional apoptotic signals (such as DNA-damaging agents)22, 23, 24, 25.

Two newer anticancer agents also exploit contextual differences between cancer cells and normal cells. Studies in model organisms suggest that mutations affecting chaperones that are involved in protein folding can unmask the deleterious consequences of various mutations26. Preclinical data indicate that HSP90 (heat-shock protein of 90 kDa) inhibitors have anticancer activity, and that certain mutant oncoproteins, such as mutant BRAF and mutant EGFR (epidermal growth factor receptor), have an increased requirement for HSP90 function27, 28, 29. One HSP90 inhibitor, 17AAG, has completed phase I testing and is entering phase II studies. The accumulation of mutated and/or misfolded proteins might also alter the requirement of a cell for proteasomal function30. The proteasomal inhibitor bortezomib is well tolerated in humans and was recently approved for the treatment of multiple myeloma31.

#### Discovery of human synthetic lethal interactions

Our knowledge of the molecular networks that are established in normal cells and cancer cells is too rudimentary to allow reliable predictions of the genes that will be synthetic lethal to a given cancer gene. Nonetheless, a few ideas have been put forward for how synthetic lethal combinations might be achieved, based on first principles. Many oncoproteins, including E2F1 and MYC, represent a double-edged sword for cancer cells because they deliver both pro-mitogenic and pro-apoptotic signals. A counterintuitive approach to treating cancer cells that have hyperactive oncoproteins such as these would be treating them with drugs that enhance their action further, in the hope of crossing an apoptotic threshold. For example, E2F1 is negatively regulated by both pRB and cyclin A32, 33, 34, 35. Loss of the pRB pathway establishes a positive-feedback loop in which E2F1 activates its own promoter36, and blocking the remaining interaction of cyclin A with E2F1 kills transformed cells but not their normal counterparts37, 38, 39. Unfortunately, inhibiting the activity of the cyclin-A partner CDK2 (cyclin-dependent kinase 2) does not have the same effect40, possibly because another catalytic partner can substitute for CDK2 in its absence41, 42. Synthetic lethal interactions might also be predicted based on the loss of particular cell-cycle checkpoints16. For example, S-phase cells, in contrast to G1 cells, can be induced to undergo premature chromosomal condensation under certain conditions, such as treatment with caffeine at doses that inhibit ATR (ataxia telangiectasia and RAD3-related protein)43, 44, 45, 46. Cells that lack p53, which has a role in G1 control, are more susceptible to caffeine than their wild-type counterparts47.

There are now multiple examples of cancers that seem to be dependent on or ‘addicted’ to certain activated oncogenes (gene-replacement experiments suggest that tumour cells can also become addicted to the inactivation of tumour-suppressor genes). Oncogene addiction might underlie the success of the kinase inhibitor imatinib mesylate for CML (in which the oncogene is BCR–ABL) and gastrointestinal stromal tumours (in which the oncogene is KIT)3 and of the EGFR inhibitor gefitinib for EGFR-mutated non-small-cell lung cancer48, 49, 50, 51. Bernard Weinstein, who coined the term ‘oncogene addiction’, initially envisioned that this phenomenon was related to the ability of such oncogenes, which can be viewed as nodes in complex molecular networks, to simultaneously deliver proliferative and antiproliferative signals52 (Fig. 4A). As long as the oncogene signal is sustained, the proliferative signal, which might promote mitogenesis, survival, or both, would dominate. However, if the oncogene is acutely silenced, the antiproliferative signal dominates, leading to cessation of growth or cell death (in this scenario it must be invoked that the antiproliferative signal ‘decays’ more slowly than the proliferative signal when the oncogene is inhibited)2.

##### Figure 4 | Models of oncogene addiction.

a | Many oncogenes paradoxically induce pro-mitogenic signals as well as anti-mitogenic (or pro-apoptotic) signals. Growth stimulation results from oncogene activation presumably because the former is dominant to the latter. However, acute inactivation of the oncogene might cause growth cessation or death if the anti-mitogenic/pro-apoptotic signals decay more slowly than the mitogenic signals (for example, because of differences in mRNA and protein half-life). Adapted from Ref. 53. b | Oncogene dependency due to gene–gene interactions. Cancer cells accumulate mutations (arrows) over time that cumulatively lead to a transformed phenotype. Selection favours acquisition of mutations that are neutral or beneficial (adaptive) in the context of the mutations that preceded them. However, some of these changes might be deleterious (red arrow) were it not for the changes that preceded them. If true, correcting early genetic changes (yellow arrow) will unmask these deleterious effects. In this model, cancer cells behave like a molecular ‘house of cards’. c | Activation (indicated by bold arrow) of an oncogenic pathway diminishes selection pressure to maintain collateral signalling pathways. Silencing of these collateral pathways over time, because of genetic or epigenetic changes, leads to oncogene dependency. Adapted from Ref. 57.

Superimposed on the network abnormalities that are induced by activated oncogenes are network abnormalities that are induced by mutations at other loci. The resulting abnormalities in molecular circuitry create additional opportunities for oncogene addiction1, 2, 53, 54, including those that arise as a result of gene–gene interactions, such as synthetic lethality and extragenic suppression. Cancers arise through sequential genetic changes that ultimately convert a normal cell to a fully transformed one. These mutations are under selective pressure to be adaptive or neutral, from the point of view of the cancer, in the context of the mutations that preceded them (Fig. 4B). It seems likely, a priori, that some of the mutations that occur late in the evolution of a cancer cell might only be advantageous, or indeed even tolerated, because of the mutations that preceded them (or put another way, these mutations would be deleterious if not for the mutations that had preceded them). In the extreme case, an early A mutation might be an extragenic suppressor of the lethality that would otherwise be caused by a late B mutation (Fig. 2, right panel). If this is true, correcting the A mutation should cause death because of the acquisition of the B mutation. For example, RB1 inactivation, as described above, leads to increased E2F activity, which can stimulate S-phase entry but can also promote p53-dependent apoptosis55, 56. So, a tumour in which TP53 was already mutated might derive an additional benefit from mutating RB1 but at the price of becoming addicted to p53 loss (in the sense that restoring p53 function would lead to apoptosis).

Similarly, Mills and colleagues have suggested that oncogene addiction might arise because of the loss of collateral signalling pathways. This is due to genomic instability coupled with the loss of selection pressure to maintain the collateral signalling pathways57, a process referred to as ‘genetic streamlining’58 (Fig. 4C). Collectively, these ideas suggest that the pathways that are activated early in the course of tumour progression (owing to oncogene activation or tumour-suppressor-gene inactivation) are likely to be excellent therapeutic targets because of synthetic interactions with the mutational changes that followed them. Silencing these pathways should reveal the deleterious consequences of these subsequent changes, whether these changes did or did not contribute to tumour progression. The potential interrelationship between oncogene addiction and synthetic lethality is illustrated by the phosphatase and tensin homologue (PTEN) tumour-suppressor protein, which negatively regulates the phosphatidylinositol 3-kinase (PI3K) pathway, and mTOR (mammalian target of rapamycin). PTEN-/- cells are reported to be more sensitive to the antiproliferative effects of mTOR inhibitors than their wild-type counterparts59. This observation indicates that PTEN-/- cells are ‘addicted’ to PI3K–mTOR signalling, and that PTEN and mTOR have a synthetic sick relationship.

Chromosomal deletions in cancer cells lead to the loss of one or both copies of many genes. Frei suggested that cancer-cell vulnerabilities to pharmacological attack might also be gleaned by examining the functions of contiguous genes that are homozygously deleted along with tumour-suppressor genes60. For example, the gene encoding methylthioadenosine phosphorylase (MTAP) — which has a role in a salvage pathway for adenosine biosynthesis — is often co-deleted with the adjacent CDKN2A locus, which encodes the tumour-suppressor proteins INK4A and ARF on 9p21 (Ref. 61). As would be predicted, cells that lack MTAP have increased sensitivity to L-alanosine — a potent inhibitor of de novo AMP synthesis — and to an inhibitor of de novo purine-nucleotide synthesis, 6-methylmercaptopurine riboside (MMPR)62.

Kamb suggested that expression databases be mined for paralogous genes in which one or more members were underexpressed in cancer cells relative to normal cells (for example, as a result of haploinsufficiency or homozygous deletion)58. A drug that inhibited the remaining paralogue(s), but not the differentially expressed paralogue, would, theoretically, be cancer-cell selective. This approach, however, presumes that it is possible to develop drugs that can discriminate between paralogous proteins. Moreover, synthetic lethal screens in yeast indicate that paralogous pairs represent a minority of the potential synthetic lethal combinations in a cell10, 15, 63. Therefore, unbiased chemical and genetic screens are likely to be the most fruitful methods for identifying novel synthetic lethal relationships on which to base new cancer treatments.

#### Screens for synthetic lethal interactors

The example of topoisomerase II inhibitors, as cited above, demonstrates that proteins bound to drugs might have effects that are very different from those predicted by true null mutations, or by techniques such as RNA interference (RNAi) that cause quantitative reductions in protein abundance. For example, a drug might interfere with one function of a multifunctional protein, or cause a protein to act in a dominant-negative or dominant-positive manner. For this reason, screens for synthetic lethality that are carried out using libraries of chemical compounds are likely to be complementary to screens that are carried out using genetic tools (such as RNAi or short interfering RNA; siRNA).

Chemical screens. Hartwell and Friend pioneered the idea of screening for drug-like chemicals that specifically kill yeast deletion mutants with defects in cell-cycle checkpoints or DNA repair16, 64. This paradigm can be extended to human cells. A number of groups have identified chemicals from collections of pure compounds, or that are present in complex mixtures (for example, extracts or broths), that selectively inhibit cells with cancer-relevant genetic alterations using isogenic human cell-line pairs grown in multiwell plates (Fig. 5). Schreiber and co-workers identified marine sponge extracts that preferentially inhibited the proliferation of Trp53-/- mouse embryonic fibroblasts, as determined by BROMODEOXYURIDINE (BRDU) INCORPORATION, relative to wild-type mouse embryonic fibroblasts65. However, the chemical entities responsible for these effects were not identified. Kinzler and co-workers co-cultured KRAS-mutated colon cancer cells (engineered to produce blue fluorescent protein) with a subclone in which the mutant KRAS allele was eliminated by homologous recombination (and engineered to produce yellow fluorescent protein), and monitored differential killing using the ratio of blue/yellow fluorescence66 (Fig. 6A). Several chemical entities, including a novel cytidine nucleoside, were found that selectively killed cells containing mutant KRAS. A fluorescence-based mammalian synthetic lethal assay, which was modelled after earlier yeast assays67, was also developed by Canaani and colleagues68, 69 (Fig. 6B). Leder and co-workers discovered a small molecule called F16, which selectively kills ERBB2 (also known as HER2/NEU)-overexpressing mammary epithelial cells, compared with their normal counterparts70, 71. The toxicity of F16 correlates with its selective uptake in, and disruption of, mitochondria of cells that are transformed with ERBB2.
Stockwell and co-workers identified a number of compounds that preferentially killed primary human cells that were transformed in vitro with human telomerase reverse transcriptase (TERT), RAS, and oncoproteins that affect pRB, p53 and/or protein phosphatase 2A (PP2A)21. Included among these were clinically useful inhibitors of topoisomerase I and II. In a focused screen of pro-apoptotic agents, Quon and colleagues discovered that human cells overexpressing MYC displayed increased sensitivity to the death receptor DR5 agonist tumour-necrosis-factor-related apoptosis-inducing ligand (TRAIL) in vitro and in vivo, and linked this to p53-independent induction of DR5 by MYC72. Recent studies suggest that it is possible to screen pairwise combinations of drugs against ISOGENIC cell lines to uncover novel drug–gene and drug–drug interactions73, 74.

##### Figure 5 | Synthetic lethal screening with chemical or interfering RNA libraries.

Isogenic cell-line pairs that do or do not harbour a cancer-relevant mutation (in the case illustrated, the cell-line pair differs only with respect to a particular tumour-suppressor gene (TSG)) are grown in multiwell plates to which different chemical or genetic (short interfering RNAs, short hairpin RNAs or other interfering RNAs) perturbants are added. In time, such assays might be carried out using microarrays spotted with chemicals or siRNA species104, 105. A ‘hit’ is a perturbant that is cytostatic or cytotoxic to the cell with the cancer-relevant mutation (arrow). It should be noted that the interpretation of such assays needs to consider potentially confounding effects, such as differences in proliferation rate and cell-cycle distribution.
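
The hit criterion described in this caption can be sketched as a simple filter over per-well viability readings. This is an illustrative sketch only: the function name `call_hits`, the input layout, and the two thresholds are assumptions, and it does not correct for the confounders noted above (proliferation rate, cell-cycle distribution).

```python
from typing import Dict, List, Tuple


def call_hits(viability: Dict[str, Tuple[float, float]],
              max_mutant: float = 0.5,
              min_wild_type: float = 0.8) -> List[str]:
    """Return perturbants that selectively impair the mutant cell line.

    viability maps a perturbant name to (wild_type, mutant) viability,
    each expressed as a fraction of the untreated control.
    The thresholds are hypothetical and would need tuning per assay.
    """
    hits = []
    for name, (wild_type, mutant) in viability.items():
        # A hit impairs the mutant line while sparing its isogenic partner.
        if mutant <= max_mutant and wild_type >= min_wild_type:
            hits.append(name)
    return sorted(hits)
```

A perturbant that reduces both lines to low viability is deliberately not called a hit here, since general cytotoxicity does not indicate a genotype-specific interaction.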

##### Figure 6 | Fluorescence-based mammalian synthetic lethal assay.

a | The Kinzler method66. Isogenic cell-line pairs that do/do not harbour a cancer-relevant mutation are engineered to produce blue fluorescent protein (BFP) and yellow fluorescent protein (YFP), respectively, and are co-cultured in multiwell plates to which different chemicals are added. Selective killing of blue cells is indicative of a synthetic lethal interaction (yellow well). b | The Canaani method68, 69. Cells lacking a tumour-suppressor gene (TSG) are engineered to stably produce a green fluorescent protein (GFP) with an emission wavelength of ‘1’. These cells are transfected with an unstable episomal plasmid encoding the TSG along with a GFP that has a different emission wavelength (‘2’). Retention of the episomal plasmid after exposure to chemical or genetic perturbants is indicative of a synthetic lethal relationship. WT, wild type.
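
The read-out in panel a, a blue/yellow fluorescence ratio per well, can be sketched as follows. The normalization to an untreated control well, the function name `selective_wells`, and the twofold cut-off are assumptions made for this example and are not taken from the original assay.

```python
from typing import Dict, Tuple


def selective_wells(readings: Dict[str, Tuple[float, float]],
                    control: Tuple[float, float],
                    min_fold_drop: float = 2.0) -> Dict[str, float]:
    """Flag wells where the blue (mutant) signal drops relative to yellow.

    readings maps a well ID to (blue, yellow) fluorescence.
    control is the (blue, yellow) fluorescence of an untreated well.
    Returns well ID -> blue/yellow ratio normalized to the control ratio.
    """
    control_ratio = control[0] / control[1]
    flagged = {}
    for well, (blue, yellow) in readings.items():
        if yellow <= 0:
            continue  # no yellow signal: the ratio is undefined, so skip
        normalized = (blue / yellow) / control_ratio
        # A normalized ratio well below 1 means blue cells were lost selectively.
        if normalized <= 1.0 / min_fold_drop:
            flagged[well] = normalized
    return flagged
```

A well in which both populations are killed equally keeps a normalized ratio near 1 and is not flagged, which is the point of using a ratio rather than the blue signal alone.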

The use of isogenic cell-line pairs to identify compounds that selectively kill cancer cells as a result of synthetic interactions is a powerful approach for the following reason. It is not uncommon for 1% of the compounds in a chemical library to inhibit the growth of human cancer cells at the concentrations used in typical high-throughput screens. This translates into thousands of potential anticancer drugs from a screen conducted with 10^5 to 10^6 compounds (such as might be found at a large pharmaceutical company or public consortium). Without the use of a filter, such as differential killing in a genotype-specific manner, there are too many ‘hits’ to pursue. In the past, this has led to ‘hits’ being prioritized on the basis of factors such as ease of synthesis, potency, intellectual-property issues and the likelihood of having desirable absorption, distribution, metabolism and excretion (ADME) properties based on accepted criteria such as LIPINSKI’S RULES75, 76. Although they are important, none of these latter considerations address selectivity. Furthermore, these factors can sometimes be addressed by modifying the chemical structure of the initial compound (medicinal chemistry). It would be ironic if chemicals that can selectively kill cancer cells through synthetic lethal interactions were present but missed for this reason during the countless cytotoxic screens that have been conducted since the mid-twentieth century.
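
The scale of the problem follows directly from the numbers quoted in this paragraph. A minimal calculation, with `expected_hits` as a hypothetical helper name:

```python
def expected_hits(library_size: int, hit_rate_percent: int = 1) -> int:
    """Number of growth-inhibiting compounds expected from an unfiltered screen."""
    # Integer arithmetic keeps the result exact for whole-percent hit rates.
    return library_size * hit_rate_percent // 100
```

At a 1% hit rate, a library of 10^5 compounds gives 1,000 unfiltered hits and a library of 10^6 gives 10,000, which is why a genotype-specific filter is needed before follow-up.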

A generic problem for cell-based screening of libraries of chemical compounds relates to successful target identification. In some cases, it is possible to use a chemical entity identified in such a screen to capture its protein target by affinity chromatography77,78. For chemicals that induce a phenotype in yeast, mutants that display increased or decreased resistance (fitness) can be sought79, 80. Such mutants often provide clues as to the pathways that are affected by a compound, and therefore its potential target (or targets). A conceptually attractive approach to target identification would be to generate compendia of molecular signatures (for example, gene-expression profiles) for various loss-of-function mutations in a suitable host (for example, yeast or human cells)81. The signature generated by the compound of interest could then be compared in silico to the compendium, with the rationale that the compound signature and target-disruption signature should be near(est) neighbours in an ideal situation. The search for targets of chemicals identified in cell-based synthetic lethal screens should also be expedited by a knowledge of the genes that score as synthetic lethal in genetic screens carried out in model organisms and human cells, as described below.
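
The in silico comparison described above can be sketched as a nearest-neighbour search. The choice of Pearson correlation as the similarity measure, the function names, and the gene labels in the usage note are assumptions for illustration; the review does not prescribe a particular metric.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / sqrt(var_x * var_y)


def rank_candidate_targets(compound_signature: Sequence[float],
                           compendium: Dict[str, Sequence[float]]
                           ) -> List[Tuple[str, float]]:
    """Rank loss-of-function signatures by similarity to the compound signature.

    compendium maps a disrupted gene to its expression signature, measured
    over the same genes and in the same order as compound_signature.
    """
    scores = {gene: pearson(compound_signature, signature)
              for gene, signature in compendium.items()}
    # Most similar signature first: the best candidate target under this model.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Given a compendium with hypothetical entries such as `geneX` and `geneY`, the first element of the returned list is the disruption whose signature most resembles that of the compound. A high rank is only a hypothesis about the target and would still need experimental confirmation.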

Genetic screens. In the past, genetic screens for synthetic lethal interactors have been largely relegated to model organisms such as yeast, the fruitfly Drosophila melanogaster and the worm Caenorhabditis elegans that are amenable to forward-genetic approaches. Typically, these approaches have combined random mutagenesis with phenotypic screens, reflecting the retention of the query gene linked to a suitable reporter. Synthetic lethal screens in yeast have been invaluable for elucidating certain principles surrounding synthetic lethal interactions. Unfortunately, many tumour-suppressor genes and oncogenes do not have clear yeast orthologues. Although forward-genetic screens are more cumbersome in fruitflies and worms than in yeast, they offer the advantage that their genomes do contain orthologues of most human cancer genes. In worms the RB1 orthologue, lin-35, has been well studied in the context of vulvar development82. Fay and co-workers reported that a gene encoding a ubiquitin-conjugating enzyme related to human UBCH7 is synthetic lethal to lin-35 (Ref. 83), as is the worm homologue of CDH1 (Ref. 84). Using a fruitfly-based screen in which the fruitfly RB1-like gene Rbf1 was conditionally inactivated in the eye, Belvin and co-workers discovered that RBF1 is synthetic lethal to a novel prolyl isomerase85. It is not yet known whether these synthetic lethal interactions will hold true in all cell types, nor whether they will hold true across species.

However, forward-genetic approaches such as these are now giving way to genome-wide reverse-genetic approaches. Successful studies have been carried out in yeast (Box 1) but, for the reasons cited above, metazoan models are usually more appropriate than yeast for synthetic lethal screens for human cancer genes.

##### Box 1 | DNA bar code screens for genes that alter fitness

RNAi is a powerful method for silencing genes in worms and fruitflies, and collections of interfering RNAs have been created to facilitate high-throughput genome-wide screens in these organisms86, 87, 88, 89 (for an excellent review, see Ref. 90). RNAi can be conveniently achieved in wild-type or mutant worms by growing them on lawns of Escherichia coli carrying a plasmid that produces the interfering RNAs of interest, which are then ingested. Alternatively, interfering RNAs can be delivered to worms by soaking them in a solution that contains the appropriate molecules. An interfering RNA that exacerbated the mutant phenotype without affecting wild-type animals would indicate a synthetic lethal, or synthetic sick, interaction. High-throughput screens have also been conducted to identify interfering RNAs that inhibit the proliferation of fruitfly cells grown in multiwell plates88. Such screens could easily be adapted to carry out synthetic lethal screens. In this scenario, the identification of interfering RNAs that do not affect wild-type fruitfly cells but kill fruitfly cells in which the gene of interest was mutated or silenced would be desired. If required, silencing could be accomplished by simultaneously administering two interfering RNAs (one corresponding to the query gene and one corresponding to the gene of interest).

Many cancer-relevant genes are linked to specific types of cancer despite being ubiquitously expressed and performing functions that are thought to be generic rather than tissue specific. In addition, there are now many examples where different phenotypes have been observed following heterozygous inactivation of a particular tumour-suppressor gene in both mice and humans. These observations indicate that context, with respect to cell-type and species, is important. As a corollary, they indicate that synthetic lethal relationships ultimately need to be discovered or validated in relevant human cells, and that caution needs to be exercised when extrapolating cell-culture results to intact organisms. In the past, the use of RNAi in mammalian cells was problematic because double-stranded RNA elicits an antiviral response on entry into mammalian cells. In 2001, however, Tuschl and co-workers showed that siRNAs can be used to silence genes in mammalian cells without triggering a nonspecific host response91. Soon thereafter several groups showed that the actions of siRNAs in cells can be mimicked with short hairpin RNAs (shRNAs) encoded by plasmid or viral vectors92, 93, 94,95, 96. siRNA libraries and shRNA vector libraries are being created, and proof-of-concept experiments indicate that these libraries can be used to carry out genome-wide phenotypic screens in mammalian cells (including human cells)97, 98, 99,100. In theory, these libraries could be used to carry out synthetic lethal screens using isogenic cell-line pairs, scoring for siRNA (or shRNA) species that specifically kill cells with a cancer-relevant mutation in a one well/one siRNA (or shRNA) species format (Fig. 5). Alternatively, several groups are incorporating DNA ‘bar codes’ (Box 1) into shRNA vectors, modelled after the use of DNA bar codes in yeast and E. coli (or have used the shRNA sequence itself as a bar code)97, 98. 
If successful, it should be possible to infect isogenic cell-line pairs with pools of vectors encoding different shRNAs, and then identify those shRNAs that cause a fitness defect specifically in those cells that harbour the cancer-relevant mutation under investigation.
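The pooled, bar-coded screen described above amounts to comparing the abundance of each shRNA (or its bar code) before and after growth in both members of an isogenic cell-line pair. The following is a minimal sketch in Python, assuming hypothetical read counts and arbitrary log2 fold-change thresholds; neither comes from the original article.

```python
import math

def log2_fold_change(before, after, pseudocount=1.0):
    """Log2 change in bar-code abundance; the pseudocount avoids log(0)."""
    return math.log2((after + pseudocount) / (before + pseudocount))

def synthetic_lethal_candidates(counts, depleted_below=-1.0, tolerated_above=-0.5):
    """Return shRNAs depleted in the mutant line but tolerated in the wild-type line.

    counts maps shRNA id -> {"wild_type": (before, after), "mutant": (before, after)}.
    The two thresholds are illustrative assumptions, not published cut-offs.
    """
    hits = []
    for shrna, by_line in counts.items():
        wild_type = log2_fold_change(*by_line["wild_type"])
        mutant = log2_fold_change(*by_line["mutant"])
        if mutant <= depleted_below and wild_type >= tolerated_above:
            hits.append(shrna)
    return sorted(hits)

# Hypothetical bar-code read counts (before, after) for three shRNAs.
counts = {
    "shRNA_1": {"wild_type": (1000, 950), "mutant": (1000, 120)},   # mutant-specific loss
    "shRNA_2": {"wild_type": (800, 90), "mutant": (800, 85)},       # lost in both lines
    "shRNA_3": {"wild_type": (1200, 1250), "mutant": (1200, 1180)}, # neutral
}

hits = synthetic_lethal_candidates(counts)
print(hits)
```

An shRNA that drops out of both lines targets a generally essential gene rather than a synthetic lethal partner, which is why the wild-type line must tolerate it for the shRNA to count as a hit.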

#### Combination therapy

Random mutations that lead to gene inactivation should theoretically decrease the genetic buffering capacity of an individual cancer cell. As outlined above, therapies predicated on synthetic lethal relationships are one way to exploit this. At the same time, random mutations and genome plasticity, viewed at the level of a tumour, markedly increase the likelihood that rare therapy-resistant subclones will emerge. Decades of clinical experience, including recent examples of imatinib mesylate resistance (Refs 101, 102), as well as tumour models incorporating the use of conditionally expressed oncogenes (Ref. 103), support this view. A 1-cm³ tumour already contains >10⁹ cells. So, the likelihood of clinical success will increase with early diagnosis (to minimize the number of cells in the pool from which resistant cells might arise) and the use of effective drug combinations. The use of drug combinations to minimize chemotherapeutic resistance is a well-established pharmaceutical principle. It is based on the knowledge that the probability of a given cell being simultaneously resistant to a combination of non-cross-resistant drugs varies as the product of the probabilities of becoming resistant to each of the individual components. The choice of which drugs to combine might be based on a knowledge of cancer molecular biology (for example, by simultaneously targeting two or more cancer-relevant mutations), empirical testing (for example, by systematically testing combinations of active agents for additive or synergistic effects) or both.
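The product rule for non-cross-resistant drugs can be made concrete with a short calculation. The sketch below, in Python, uses the tumour size quoted in the text (on the order of 10^9 cells) together with a purely hypothetical per-cell resistance probability of one in a million for each drug; the real probabilities depend on the drug and the tumour.

```python
def expected_resistant_cells(n_cells, per_drug_resistance_probs):
    """Expected number of cells resistant to every drug in the combination.

    Assumes the drugs are non-cross-resistant, so the joint probability is
    the product of the individual probabilities.
    """
    joint_probability = 1.0
    for p in per_drug_resistance_probs:
        joint_probability *= p
    return n_cells * joint_probability

tumour_cells = 1e9   # roughly a 1-cm^3 tumour, as stated in the text
p_resist = 1e-6      # hypothetical per-cell probability of resistance to one drug

single_agent = expected_resistant_cells(tumour_cells, [p_resist])
two_agents = expected_resistant_cells(tumour_cells, [p_resist, p_resist])
print(single_agent, two_agents)
```

Under these assumed numbers a single agent leaves on the order of a thousand pre-existing resistant cells, whereas a two-drug combination brings the expected number well below one. Fewer cells at diagnosis lowers both figures proportionally.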

#### Implications and future directions

Over the decades, the medical therapy of metastatic cancer has, with a few notable exceptions, been a frustrating and often futile exercise. This has contributed to the view that each mutation within a cancer cell is another plate of armour that serves as a barrier to successful therapy. However, our empirical knowledge of the susceptibility of cancer cells to drugs in humans stems from an armamentarium that was largely discovered and developed using the same paradigm. Moreover, there is every reason to believe that certain genetic changes within cancer cells will create liabilities under the appropriate conditions. There are now tools to systematically search for mutated oncogenes that encode molecules, such as kinases, that can be targeted by drugs, as well as the tools to reveal vulnerabilities created by synthetic lethal interactions. Understanding how the phenotypes created by cancer genotypes (for example, tumour type and resistance to therapy), as well as synthetic lethal relationships, are influenced by contextual differences (for example, cell type and species) remains a formidable task. Nonetheless, we are clearly poised to move away from empirically discovered cytotoxics and towards new agents that are based on a knowledge of cancer genetics and a more sophisticated view of gene–gene interactions.

### Acknowledgements

I would like to thank S. Elledge, A. Reddy, M. Tyers, P. Silver and M. Vidal for their critical reading of this manuscript and/or helpful comments. I apologize to colleagues whose work was not cited due to space limitations or my ignorance. Dedicated to the memory of Nancy P. Kaelin.

#### Competing interests statement

The author declares no competing financial interests.

### References

1. Kaelin, W. G. Jr. Choosing anticancer drug targets in the postgenomic era. J. Clin. Invest. 104, 1503–1506 (1999).

2. Reddy, A. & Kaelin, W. G. Jr. Using cancer genetics to guide the selection of anticancer drug targets. Curr. Opin. Pharmacol. 2, 366–373 (2002).

3. Kaelin, W. G. Jr. Gleevec: prototype or outlier? Sci. STKE 2004, PE12 (2004).
References 1–3 provide counter-arguments to naysayers who suggest that genetically complex cancers will never be successfully treated with drugs.

4. Hartman, J. T., Garvik, B. & Hartwell, L. Principles for the buffering of genetic variation. Science 291, 1001–1004 (2001).

5. Guarente, L. Synthetic enhancement in gene interaction: a genetic tool come of age. Trends Genet. 9, 362–366 (1993).

6. Kamb, A. Mutation load, functional overlap, and synthetic lethality in the evolution and treatment of cancer. J. Theor. Biol. 223, 205–213 (2003).
This paper and reference 58 are thoughtful essays on maladaptive genetic changes in cancer cells that might render them vulnerable to pharmacological attack.

7. Friend, S. & Oliff, A. Emerging uses for genomic information in drug discovery. N. Engl. J. Med. 338, 125–126 (1998).

8. Dobzhansky, T. Genetics of natural populations. XIII. Recombination and variability in populations of Drosophila pseudoobscura. Genetics 31, 269–290 (1946).

9. Lucchesi, J. C. Synthetic lethality and semi-lethality among functionally related mutants of Drosophila melanogaster. Genetics 59, 37–44 (1968).

10. Sharom, J. R., Bellows, D. S. & Tyers, M. From large networks to small molecules. Curr. Opin. Chem. Biol. 8, 81–90 (2004).
Excellent introduction to systems biology as applied to cancer and cancer pharmacology.

11. Kroll, E. S., Hyland, K. M., Hieter, P. & Li, J. J. Establishing genetic interactions by a synthetic dosage lethality phenotype. Genetics 143, 95–102 (1996).

12. Measday, V. & Hieter, P. Synthetic dosage lethality. Methods Enzymol. 350, 316–326 (2002).

13. Li, J. J. & Herskowitz, I. Isolation of ORC6, a component of the yeast origin recognition complex by a one-hybrid system. Science 262, 1870–1874 (1993).

14. Tong, A. H. et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364–2368 (2001).

15. Tong, A. H. et al. Global mapping of the yeast genetic interaction network. Science 303, 808–813 (2004).
References 14 and 15 provide a glimpse into the complexity of synthetic lethal networks in yeast.

16. Hartwell, L., Szankasi, P., Roberts, C., Murray, A. & Friend, S. Integrating genetic approaches into the discovery of anticancer drugs. Science 278, 1064–1068 (1997).
This seminal paper argues that synthetic lethal interactions be exploited to arrive at safer, more efficacious cancer drugs.

17. Sellers, W. R. & Kaelin, W. G. Jr. Role of the retinoblastoma protein in the pathogenesis of human cancer. J. Clin. Oncol. 15, 3301–3312 (1997).

18. Nip, J. et al. E2F-1 cooperates with topoisomerase II inhibition and DNA damage to selectively augment p53-independent apoptosis. Mol. Cell. Biol. 17, 1049–1056 (1997).

19. Almasan, A. et al. Deficiency of retinoblastoma protein leads to inappropriate S-phase entry, activation of E2F-responsive genes, and apoptosis. Proc. Natl Acad. Sci. USA 92, 5436–5440 (1995).

20. Banerjee, D. et al. Role of E2F-1 in chemosensitivity. Cancer Res. 58, 4292–4296 (1998).

21. Dolma, S., Lessnick, S. L., Hahn, W. C. & Stockwell, B. R. Identification of genotype-selective antitumor agents using synthetic lethal chemical screening in engineered human tumor cells. Cancer Cell 3, 285–296 (2003).

22. Evan, G. I. & Vousden, K. H. Proliferation, cell cycle and apoptosis in cancer. Nature 411, 342–348 (2001).

23. Zaika, A., Irwin, M., Sansome, C. & Moll, U. M. Oncogenes induce and activate endogenous p73 protein. J. Biol. Chem. 276, 11310–11316 (2001).

24. Meng, R., Phillips, P. & El-Deiry, W. p53-independent increase in E2F-1 expression enhances the cytotoxic effects of etoposide and of adriamycin. Intl J. Oncol. 14, 5–14 (1999).

25. Irwin, M. S. et al. Chemosensitivity linked to p73 function. Cancer Cell 3, 403–410 (2003).

26. Rutherford, S. L. & Lindquist, S. HSP90 as a capacitor for morphological evolution. Nature 396, 336–342 (1998).

27. Isaacs, J. S., Xu, W. & Neckers, L. Heat shock protein 90 as a molecular target for cancer therapeutics. Cancer Cell 3, 213–217 (2003).

28. Workman, P. Altered states: selectively drugging the HSP90 cancer chaperone. Trends Mol. Med. 10, 47–51 (2004).

29. Neckers, L. & Neckers, K. Heat-shock protein 90 inhibitors as novel cancer chemotherapeutics – an update. Expert Opin. Emerg. Drugs 10, 137–149 (2005).

30. Goldberg, A. L. Protein degradation and protection against misfolded or damaged proteins. Nature 426, 895–899 (2003).

31. Rajkumar, S. V., Richardson, P. G., Hideshima, T. & Anderson, K. C. Proteasome inhibition as a novel therapeutic target in human cancer. J. Clin. Oncol. 23, 630–639 (2005).

32. Krek, W., Xu, G. & Livingston, D. M. Cyclin A-kinase regulation of E2F1 DNA binding function underlies suppression of an S phase checkpoint. Cell 83, 1149–1158 (1995).

33. Dynlacht, B. D., Flores, O., Lees, J. A. & Harlow, E. Differential regulation of E2F transactivation by cyclin/CDK complexes. Genes Dev. 8, 1772–1786 (1994).

34. Krek, W. et al. Negative regulation of the growth-promoting transcription factor E2F-1 by a stably bound cyclin A-dependent protein kinase. Cell 78, 161–172 (1994).

35. Xu, M., Sheppard, K. A., Peng, C-Y., Yee, A. S. & Piwnica-Worms, H. Cyclin A/CDK2 binds directly to E2F1 and inhibits the DNA-binding activity of E2F1/DP1 by phosphorylation. Mol. Cell. Biol. 14, 8420–8431 (1994).

36. Parr, M. J. et al. Tumor-selective transgene expression in vivo mediated by an E2F-responsive adenoviral vector. Nature Med. 3, 1145–1149 (1997).

37. Chen, Y. et al. Selective killing of transformed cells by cyclin/cyclin-dependent kinase 2 antagonists. Proc. Natl Acad. Sci. USA 96, 4325–4329 (1999).

38. Chen, W., Lee, J., Cho, S. Y. & Fine, H. A. Proteasome-mediated destruction of the cyclin A/cyclin-dependent kinase 2 complex suppresses tumor cell growth in vitro and in vivo. Cancer Res. 64, 3949–3957 (2004).

39. Mendoza, N. et al. Selective cyclin-dependent kinase 2/cyclin A antagonists that differ from ATP site inhibitors block tumor growth. Cancer Res. 63, 1020–1024 (2003).

40. Tetsu, O. & McCormick, F. Proliferation of cancer cells despite CDK2 inhibition. Cancer Cell 3, 233–245 (2003).

41. Berthet, C., Aleem, E., Coppola, V., Tessarollo, L. & Kaldis, P. CDK2 knockout mice are viable. Curr. Biol. 13, 1775–1785 (2003).

42. Ortega, S. et al. Cyclin-dependent kinase 2 is essential for meiosis but not for mitotic cell division in mice. Nature Genet. 35, 25–31 (2003).

43. Schlegel, R. & Pardee, A. B. Caffeine-induced uncoupling of mitosis from the completion of DNA replication in mammalian cells. Science 232, 1264–1266 (1986).

44. Nishimoto, T., Ishida, R., Ajiro, K., Yamamoto, S. & Takahashi, T. The synthesis of protein(s) for chromosome condensation may be regulated by a post-transcriptional mechanism. J. Cell. Physiol. 109, 299–308 (1981).

45. Hall-Jackson, C. A., Cross, D. A., Morrice, N. & Smythe, C. ATR is a caffeine-sensitive, DNA-activated protein kinase with a substrate specificity distinct from DNA-PK. Oncogene 18, 6707–6713 (1999).

46. Sarkaria, J. N. et al. Inhibition of ATM and ATR kinase activities by the radiosensitizing agent, caffeine. Cancer Res. 59, 4375–4382 (1999).

47. Nghiem, P., Park, P., Kim, Y., Vaziri, C. & Schreiber, S. ATR inhibition selectively sensitizes G1 checkpoint-deficient cells to lethal premature chromatin condensation. Proc. Natl Acad. Sci. USA 98, 9092–9097 (2001).

48. Sordella, R., Bell, D. W., Haber, D. A. & Settleman, J. Gefitinib-sensitizing EGFR mutations in lung cancer activate anti-apoptotic pathways. Science 305, 1163–1167 (2004).

49. Lynch, T. J. et al. Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N. Engl. J. Med. 350, 2129–2139 (2004).

50. Paez, J. G. et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science 304, 1497–1500 (2004).

51. Pao, W. et al. EGF receptor gene mutations are common in lung cancers from ‘never smokers’ and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc. Natl Acad. Sci. USA 101, 13306–13311 (2004).

52. Weinstein, I. B. et al. Disorders in cell circuitry associated with multistage carcinogenesis: exploitable targets for cancer prevention and therapy. Clin. Cancer Res. 3, 2696–2702 (1997).

53. Weinstein, I. B. Disorders in cell circuitry during multistage carcinogenesis: the role of homeostasis. Carcinogenesis 21, 857–864 (2000).

54. Weinstein, I. B. Cancer. Addiction to oncogenes — the Achilles heal of cancer. Science 297, 63–64 (2002).
References 52–54 introduced the term ‘oncogene addiction’.

55. Adams, P. & Kaelin, W. G. Jr. The cellular effects of E2F overexpression. Curr. Top. Microbiol. Immunol. 208, 79–93 (1996).

56. Sherr, C. The Pezcoller lecture: cancer cell cycles revisited. Cancer Res. 60, 3689–3695 (2000).

57. Mills, G., Lu, Y. & Kohn, E. Linking molecular therapeutics to molecular diagnostics: inhibition of the FRAP/RAFT/TOR component of the PI3K pathway preferentially blocks PTEN mutant cells in vitro and in vivo. Proc. Natl Acad. Sci. USA 98, 10031–10033 (2001).

58. Kamb, A. Consequences of nonadaptive alterations in cancer. Mol. Biol. Cell 14, 2201–2205 (2003).

59. Neshat, M. et al. Enhanced sensitivity of PTEN-deficient tumors to inhibition of FRAP/mTOR. Proc. Natl Acad. Sci. USA 98, 10314–10319 (2001).

60. Frei, E. D. Gene deletion: a new target for cancer chemotherapy. Lancet 342, 662–664 (1993).

61. Cairns, P. et al. Frequency of homozygous deletion at p16/CDKN2 in primary human tumours. Nature Genet. 11, 210–212 (1995).

62. Li, W. et al. Status of methylthioadenosine phosphorylase and its impact on cellular response to L-alanosine and methylmercaptopurine riboside in human soft tissue sarcoma cells. Oncol. Res. 14, 373–379 (2004).

63. Wong, S. L. et al. Combining biological networks to predict genetic interactions. Proc. Natl Acad. Sci. USA 101, 15682–15687 (2004).

64. Simon, J. A. et al. Differential toxicities of anticancer agents among DNA repair and checkpoint mutants of Saccharomyces cerevisiae. Cancer Res. 60, 328–333 (2000).

65. Stockwell, B., Haggarty, S. & Schreiber, S. High-throughput screening of small molecules in miniaturized mammalian cell-based assays involving post-translational modifications. Chem. Biol. 6, 71–83 (1999).
References 64 and 65 are two early examples of using isogenic cell lines to isolate compounds that kill cells in a genotype-specific manner.

66. Torrance, C., Agrawal, V., Vogelstein, B. & Kinzler, K. Use of isogenic human cancer cells for high-throughput screening and drug discovery. Nature Biotechnol. 19, 940–945 (2001).

67. Bender, A. & Pringle, J. R. Use of a screen for synthetic lethal and multicopy suppressee mutants to identify two new genes involved in morphogenesis in Saccharomyces cerevisiae. Mol. Cell. Biol. 11, 1295–1305 (1991).

68. Simons, A., Dafni, N., Dotan, I., Oron, Y. & Canaani, D. Establishment of a chemical synthetic lethality screen in cultured human cells. Genome Res. 11, 266–273 (2001).

69. Simons, A., Dafni, N., Dotan, I., Oron, Y. & Canaani, D. Genetic synthetic lethality screen at the single gene level in cultured human cells. Nucleic Acids Res. 29, E100 (2001).

70. Fantin, V. R. & Leder, P. F16, a mitochondriotoxic compound, triggers apoptosis or necrosis depending on the genetic background of the target carcinoma cell. Cancer Res. 64, 329–336 (2004).

71. Fantin, V. R., Berardi, M. J., Scorrano, L., Korsmeyer, S. J. & Leder, P. A novel mitochondriotoxic small molecule that selectively inhibits tumor cell growth. Cancer Cell 2, 29–42 (2002).

72. Wang, Y. et al. Synthetic lethal targeting of MYC by activation of the DR5 death receptor pathway. Cancer Cell 5, 501–512 (2004).

73. Haggarty, S. J., Clemons, P. A. & Schreiber, S. L. Chemical genomic profiling of biological networks using graph theory and combinations of small molecule perturbations. J. Am. Chem. Soc. 125, 10543–10545 (2003).

74. Borisy, A. A. et al. Systematic discovery of multicomponent therapeutics. Proc. Natl Acad. Sci. USA 100, 7977–7982 (2003).

75. Lipinski, C. A., Lombardo, F., Dominy, B. W. & Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 46, 3–26 (2001).

76. Lipinski, C. A. Drug-like properties and the causes of poor solubility and poor permeability. J. Pharmacol. Toxicol. Methods 44, 235–249 (2000).

77. Knockaert, M. et al. Intracellular targets of cyclin-dependent kinase inhibitors: identification by affinity chromatography using immobilised inhibitors. Chem. Biol. 7, 411–422 (2000).

78. Hultsch, T., Albers, M. W., Schreiber, S. L. & Hohman, R. J. Immunophilin ligands demonstrate common features of signal transduction leading to exocytosis or transcription. Proc. Natl Acad. Sci. USA 88, 6229–6233 (1991).

79. Baetz, K. et al. Yeast genome-wide drug-induced haploinsufficiency screen to determine drug mode of action. Proc. Natl Acad. Sci. USA 101, 4525–4530 (2004).

80. Giaever, G. et al. Genomic profiling of drug sensitivities via induced haploinsufficiency. Nature Genet. 21, 278–283 (1999).

81. Marton, M. et al. Drug target validation and identification of secondary drug target effects using DNA microarrays. Nature Med. 4, 1293–1301 (1998).

82. Lu, X. & Horvitz, H. R. lin-35 and lin-53, two genes that antagonize a C. elegans Ras pathway, encode proteins similar to RB and its binding protein RBAp48. Cell 95, 981–991 (1998).

83. Fay, D. S., Large, E., Han, M. & Darland, M. lin-35/Rb and ubc-18, an E2 ubiquitin-conjugating enzyme, function redundantly to control pharyngeal morphogenesis in C. elegans. Development 130, 3319–3330 (2003).

84. Fay, D. S., Keenan, S. & Han, M. fzr-1 and lin-35/Rb function redundantly to control cell proliferation in C. elegans as revealed by a nonbiased synthetic screen. Genes Dev. 16, 503–517 (2002).

85. Edgar, K. A. et al. Synthetic lethality of retinoblastoma mutant cells in the Drosophila eye by mutation of a novel peptidyl prolyl isomerase gene. Genetics 170, 161–171 (2005).

86. Kamath, R. S. et al. Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature 421, 231–237 (2003).

87. Ashrafi, K. et al. Genome-wide RNAi analysis of Caenorhabditis elegans fat regulatory genes. Nature 421, 268–272 (2003).

88. Cherry, S. et al. Genome-wide RNAi screen reveals a specific sensitivity of IRES-containing RNA viruses to host translation inhibition. Genes Dev. 19, 445–452 (2005).

89. Rual, J. F. et al. Toward improving Caenorhabditis elegans phenome mapping with an ORFeome-based RNAi library. Genome Res. 14, 2162–2168 (2004).

90. Willingham, A. T., Deveraux, Q. L., Hampton, G. M. & Aza-Blanc, P. RNAi and HTS: exploring cancer by systematic loss-of-function. Oncogene 23, 8392–8400 (2004).

91. Elbashir, S. et al. Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411, 494–498 (2001).

92. Brummelkamp, T. R., Bernards, R. & Agami, R. Stable suppression of tumorigenicity by virus-mediated RNA interference. Cancer Cell 2, 243–247 (2002).

93. Brummelkamp, T. R., Bernards, R. & Agami, R. A system for stable expression of short interfering RNAs in mammalian cells. Science 296, 550–553 (2002).

94. Lee, N. S. et al. Expression of small interfering RNAs targeted against HIV-1 rev transcripts in human cells. Nature Biotechnol. 20, 500–505 (2002).

95. Paddison, P. J., Caudy, A. A., Bernstein, E., Hannon, G. J. & Conklin, D. S. Short hairpin RNAs (shRNAs) induce sequence-specific silencing in mammalian cells. Genes Dev. 16, 948–958 (2002).

96. Sui, G. et al. A DNA vector-based RNAi technology to suppress gene expression in mammalian cells. Proc. Natl Acad. Sci. USA 99, 5515–5520 (2002).

97. Berns, K. et al. A large-scale RNAi screen in human cells identifies new components of the p53 pathway. Nature 428, 431–437 (2004).

98. Paddison, P. J. et al. A resource for large-scale RNA-interference-based screens in mammals. Nature 428, 427–431 (2004).
References 97 and 98 suggest that it should eventually be possible to carry out synthetic lethal screens in isogenic human cell-line pairs using bar-coded shRNA libraries.

99. Shirane, D. et al. Enzymatic production of RNAi libraries from cDNAs. Nature Genet. 36, 190–196 (2004).

100. Aza-Blanc, P. et al. Identification of modulators of TRAIL-induced apoptosis via RNAi-based phenotypic screening. Mol. Cell 12, 627–637 (2003).

101. Gorre, M. et al. Clinical resistance to STI-571 cancer therapy caused by BCR–ABL gene mutation or amplification. Science 293, 876–880 (2001).

102. Shah, N. P. et al. Multiple BCR–ABL kinase domain mutations confer polyclonal resistance to the tyrosine kinase inhibitor imatinib (STI571) in chronic phase and blast crisis chronic myeloid leukemia. Cancer Cell 2, 117–125 (2002).

103. Jonkers, J. & Berns, A. Oncogene addiction: sometimes a temporary slavery. Cancer Cell 6, 535–538 (2004).

104. Bailey, S. N., Sabatini, D. M. & Stockwell, B. R. Microarrays of small molecules embedded in biodegradable polymers for use in mammalian cell-based screens. Proc. Natl Acad. Sci. USA 101, 16144–16149 (2004).

105. Wheeler, D. B. et al. RNAi living-cell microarrays for loss-of-function screens in Drosophila melanogaster cells. Nature Methods 1, 127–132 (2004).

106. Ooi, S. L., Shoemaker, D. D. & Boeke, J. D. DNA helicase gene interaction network defined using synthetic lethality analyzed by microarray. Nature Genet. 35, 277–286 (2003).
Describes the use of DNA bar codes coupled with oligonucleotide microarrays to conduct synthetic lethal assays in yeast.

107. Shoemaker, D. D., Lashkari, D. A., Morris, D., Mittmann, M. & Davis, R. W. Quantitative phenotypic analysis of yeast deletion mutants using a highly parallel molecular bar-coding strategy. Nature Genet. 14, 450–456 (1996).

108. Winzeler, E. A. et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901–906 (1999).

109. Eason, R. G. et al. Characterization of synthetic DNA bar codes in Saccharomyces cerevisiae gene-deletion strains. Proc. Natl Acad. Sci. USA 101, 11046–11051 (2004).

110. Hensel, M. et al. Simultaneous identification of bacterial virulence genes by negative selection. Science 269, 400–403 (1995).

### Author affiliations

1. Howard Hughes Medical Institute, 44 Binney Street, Mayer 457, Boston, Massachusetts 02115, USA.
Email: william_kaelin@dfci.harvard.edu

# SynLethDB: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets

Jing Guo¹, Hui Liu¹,²,* and Jie Zheng¹,³,*

Author Affiliations

1. School of Computer Engineering, Nanyang Technological University, Singapore 639798, Singapore
2. Lab of Information Management, Changzhou University, Jiangsu 213164, China
3. Genome Institute of Singapore (GIS), Biopolis, Singapore 138672, Singapore

*To whom correspondence should be addressed. Tel: +65 6790-4287; Fax: +65 6792-6559; Email: ZhengJie@ntu.edu.sg
Correspondence may also be addressed to Hui Liu. Tel: +86 519-8633-0316; Email: hliu@cczu.edu.cn
• Revision received October 6, 2015.
• Accepted October 9, 2015.

## Abstract

Synthetic lethality (SL) is a type of genetic interaction between two genes such that simultaneous perturbations of the two genes result in cell death or a dramatic decrease of cell viability, while a perturbation of either gene alone is not lethal. SL reflects the biologically endogenous difference between cancer cells and normal cells, and thus the inhibition of SL partners of genes with cancer-specific mutations could selectively kill cancer cells but spare normal cells. Therefore, SL is emerging as a promising anticancer strategy that could potentially overcome the drawbacks of traditional chemotherapies by reducing severe side effects. Researchers have developed experimental technologies and computational prediction methods to identify SL gene pairs in human and a few model species. However, there has not been a comprehensive database dedicated to collecting SL pairs and related knowledge. In this paper, we present a comprehensive database, SynLethDB (http://histone.sce.ntu.edu.sg/SynLethDB/), which contains SL pairs collected from biochemical assays, other related databases, computational predictions and text mining results for human and four model species, i.e. mouse, fruit fly, worm and yeast. For each SL pair, a confidence score was calculated by integrating individual scores derived from different evidence sources. We also developed a statistical analysis module to estimate the druggability and sensitivity of cancer cells upon drug treatments targeting human SL partners, based on large-scale genomic data, gene expression profiles and drug sensitivity profiles on more than 1000 cancer cell lines. To help users access and mine the wealth of the data, we developed other practical functionalities, such as search and filtering, orthology search and gene set enrichment analysis. Furthermore, a user-friendly web interface has been implemented to facilitate data analysis and interpretation.
With the integrated data sets and analytics functionalities, SynLethDB would be a useful resource for the biomedical research community and the pharmaceutical industry.

## BACKGROUND

Two genes are said to be in a synthetic lethality (SL) relationship if a perturbation of either gene alone is not lethal but perturbations of both genes lead to cell death or a dramatic decrease in cell viability (1). For example, the mutation of a given gene (a loss-of-function or gain-of-function defect) renders another gene essential so that this pair of genes form an SL relationship. Synthetic lethal interactions provide functional buffering and robustness, thereby enabling cells to maintain homeostasis in the face of diverse genetic and environmental challenges (2). By exposing the critical endogenous differences between cancer cells and normal cells, SL suggests a promising anticancer strategy. For instance, chemical inhibition of the SL partners of oncogenic genes would selectively kill cancer cells but spare normal cells (3). Therefore, SL-based therapeutics has the potential to overcome the drawbacks of traditional chemotherapies including severe side effects (4,5).

Since SL was first described in the studies on Drosophila melanogaster models (6), it has been most extensively explored in human and other model species. Two projects of genome-wide quantitative mapping of synthetic lethal interactions have been conducted for Saccharomyces cerevisiae, and the resulting SL networks provide valuable resources for understanding the functional relationships among genes (7,8). Recognizing the great potential of SL in anticancer therapies, researchers have developed experimental methods to detect SL interactions in cancer cells (9,10). For example, high-throughput pooled shRNA screening for gene essentiality has been developed, by which cell lines are infected with short hairpin RNA libraries targeting genome-wide mRNA. Then, the cells are cultured to allow the depletion of those cells containing shRNAs that target essential genes, after which synthetic lethal interactions can be identified by examining whether a gene is essential in the perturbed cell line but non-essential in the control cell line using microarray or deep sequencing (11).

However, pooled shRNA screening still cannot cover the large number of genetic interactions that need to be surveyed across different cancer types. Hence, a few computational approaches have been proposed to complement the experimental screening for identifying SL interactions (12–14). Most in silico methods depend on comparative genomics to search for orthologous genes of the SL pairs in yeast that have been experimentally validated (14), or exploit other features such as evolutionary characteristics, metabolic networks and signaling pathways (15–17). Recently, a data-driven method, named DAISY, used somatic copy number alterations, shRNA-based essentiality screens and co-expression patterns across hundreds of cancer cell lines to detect SL pairs in human (13).

With the increasing amount of SL-related data, a comprehensive database is urgently needed to gather SL gene pairs and relevant genomic and functional annotations. Estimating the druggability of SL gene pairs as drug targets, and the efficacy of inhibiting cancer cell viability, is also important for the development of anticancer treatments. In this paper, we present SynLethDB, a comprehensive database dedicated to collecting SL pairs identified in various species, and integrating genomic and drug sensitivity data to conduct statistical estimation on druggability and efficacy. As a substantial extension of our previously proposed SL knowledge base, Syn-Lethality (18), we collected SL pairs from biochemical assays, other related databases, computational predictions and text mining results. For each SL pair, we computed a confidence score by integrating individual scores derived from different types of evidence. We also developed a statistical analysis module to estimate the druggability and efficacy of drug molecules for human SL pairs, based on genomic data (e.g. mutations, copy number alterations and gene expression profiles), drug–protein interactions and drug sensitivity profiles on more than 1000 cancer cell lines. To help users explore the wealth of data, we developed other practical functionalities, such as query and filtering, orthologous gene search and gene set enrichment analysis. Furthermore, we implemented a user-friendly web interface, including an interactive network and tabular viewer, statistical diagrams and graphical visualization plugins, to facilitate data display and interpretation. To the best of our knowledge, SynLethDB is the first comprehensive database that harbors a large set of SLs, and also contains data resources for systematic evaluation of SLs in anticancer drug discovery and development.
We believe that SynLethDB would greatly facilitate and accelerate the discovery of selective and sensitive anticancer drug targets, based on the SL mechanism.

## SOURCES OF DATA

The first source of data in SynLethDB is the set of manually curated SL pairs from research papers that study SL via biochemical experiments. Our previous SL knowledge base, Syn-Lethality (18), which contains manually collected SL pairs from the experimental literature, was integrated. We also collected SL pairs identified from high-throughput screening experiments, such as pooled shRNA screens, bi-specific shRNA screens (from the DECIPHER Project), and combinatorial RNAi and drug screens. For the combinatorial RNAi and drug screening, the SL pairs were detected by pairing the essential genes identified by RNAi with the drug's primary target genes deposited in the DrugBank database (19). Secondly, a large number of genetic interactions annotated as SL pairs in BioGRID (20) were integrated into SynLethDB. Some gene pairs were also annotated as SL in GenomeRNAi (21), a database devoted to collecting phenotypes from RNAi screens for Drosophila and Homo sapiens, and these gene pairs have therefore been added to our database. Thirdly, we included some human SL pairs computationally predicted by DAISY (13), in order to enrich our data set of human SL candidates that are potentially valuable for the discovery of anticancer drug targets. Figure 1 illustrates the various types of sources from which we collected SL pairs.

Figure 1.

Schematic diagram of the data resources, functional modules and graphical visualization components included in SynLethDB. The SL sources include manual curations from publications, three related databases (Syn-lethality, BioGRID and GenomeRNAi), bi-specific SL shRNA screens (DECIPHER), computational predictions (DAISY) and text mining results. Genomic data (mutations, copy number alterations and gene expression profiles from COSMIC), drug targets (DrugBank, STITCH and KIBA) and three drug sensitivity data sets (CCLE, GDSC and NCI-60) are integrated, so that we can conduct Wilcoxon rank-sum tests to estimate the druggability and sensitivity of cancer cells upon drug treatments targeting human SL partners of genes mutated in the cancer cells. Six functional modules are developed to explore the data resources, and graphical visualization components are also implemented to facilitate data display and interpretation.

To extend the coverage of our database, we employed text mining tools to search for SL pairs scattered across the literature. Using ‘synthetic lethal’ and ‘synthetic lethality’ as query keywords, we searched the whole PubMed database, and obtained 331 distinct publications with titles including either of the two keywords. As the contents of these publications focus on synthetic lethality, we used their abstracts as the training set for the literature ranking tool MedlineRanker (22), which ranks the biomedical literature according to its relevance to a topic learned from the training set. The trained MedlineRanker was used to rank the PubMed publications from the last 10 years, and the top 1000 publications were selected for the following text mining procedures.

Next, we adopted PESCADOR (23), an information extraction tool for mining co-occurrences of concepts and gene/protein pairs from the literature, to extract genes/proteins associated with the concept of SL from the abstracts of the 1000 publications. In particular, the discriminative words identified by MedlineRanker, including ‘lethality’, ‘lethal’, ‘viability’, ‘apoptosis’, ‘cell death’, ‘synthetic lethality’ and ‘synthetic lethal’, were used as customized concepts that were taken as input by PESCADOR to discover concept-related word co-occurrences. According to the semantic structure of each sentence and the whole abstract, the gene/protein pairs co-occurring with the customized concepts are likely SLs reported in the literature. Furthermore, an appealing characteristic of PESCADOR is that the gene/protein pairs are categorized into four graded relevance degrees according to the scope (abstract or sentence) of the co-occurrence with the customized concept: gene/protein pairs and customized concepts co-occurring in an abstract (type 4), in a sentence (type 3), in a sentence with a biointeraction term (e.g. activates, induces, inhibits) (type 2) or in a sentence with a biointeraction term between the bioentity names (type 1). Based on the degree of relevance to the customized concepts, we regarded the gene/protein pairs as SL and set their confidence scores to 0.2, 0.5, 0.7 and 0.9 for types 4, 3, 2 and 1, respectively. Finally, we manually curated the 337 PubMed publications whose titles include the terms ‘synthetic lethality’ or ‘synthetic lethal’, to ensure that we would not miss the SL pairs that have been explicitly reported by these studies.
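The type-to-score assignment described above (types 4, 3, 2 and 1 mapping to 0.2, 0.5, 0.7 and 0.9) can be sketched as a simple lookup. The function name and the rule of keeping the strongest type when a pair is seen several times are assumptions made for illustration; the text does not say how repeated co-occurrences are resolved.

```python
# Confidence score per PESCADOR co-occurrence type, as listed in the text.
TYPE_SCORES = {1: 0.9, 2: 0.7, 3: 0.5, 4: 0.2}


def text_mining_score(cooccurrence_types):
    """Return a text-mining score for one gene/protein pair.

    `cooccurrence_types` is an iterable of PESCADOR types (1-4) observed
    for the pair. ASSUMPTION: when several types are observed, the
    strongest (highest-scoring) one is kept.
    """
    types = list(cooccurrence_types)
    if not types:
        raise ValueError("at least one co-occurrence type is required")
    return max(TYPE_SCORES[t] for t in types)
```

For instance, a pair seen once at abstract level (type 4) and once in a sentence with a biointeraction term (type 2) would receive 0.7 under this assumed rule.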

In summary, the current version of SynLethDB contains 34 089 SL pairs that comprise 19 952 of Homo sapiens, 366 of Mus musculus, 423 of Drosophila melanogaster, 107 of Caenorhabditis elegans and 13 241 of Saccharomyces cerevisiae. More than 200 types of diseases and information of over 3314 publications have been deposited in SynLethDB. For each collected SL pair, we annotated its supporting evidence (e.g. mutations, RNAi screenings or predictions), species, diseases, references to PubMed and other relevant information, so that users can access the detailed information to explore the SL gene pairs. Furthermore, to prioritize SL pairs according to their reliability, we developed a scoring scheme to compute an integrative confidence score for each SL pair based on the annotations, as described in the following section.

## INTEGRATIVE CONFIDENCE SCORES

The SL pairs in our database were collected from different types of sources, including biochemical assays, other related databases, computational predictions and text mining results. Furthermore, biochemical assays were based on different experimental technologies and platforms, such as genetic mutation and transfection, RNA interference and drug inhibition. As multiple types of evidence contribute to the identification of a specific SL, an integrative confidence score combining scores from all these evidence sources can give an overall estimation of the reliability of an SL interaction. In principle, we assume that (i) experimental evidence contributes more significantly to the confidence score than evidence derived from predictive algorithms or text mining, and (ii) the SL pairs supported by more evidence sources should be given higher confidence scores than those supported by fewer evidence sources.

Due to the lack of a gold-standard set of SL pairs for validating the confidence scores, we aimed to develop a scoring scheme that does not rely on comparison with any third-party data but rather on the available annotations associated with each SL pair. We developed a two-step procedure, i.e. quantification and integration, to compute the confidence scores. A large number of SL pairs collected from wet-lab experiments and other related databases have only qualitative annotation evidence (such as ‘high-throughput’ or ‘low-throughput’), or technological descriptions of the wet-lab experiments (such as ‘shRNA screening’ or ‘mutation’), hence the quantification step is necessary to assign quantitative scores to those SL pairs before the calculation of integrative scores. Similar to the scoring scheme for protein–protein interactions (PPI) proposed by Cao et al. (25), we assigned the quantitative scores based on the experimental methods that were used to perturb SL partners, as shown in Table 1. For instance, ‘Mutant & Mutant’ means that the pair of SL genes are both perturbed via mutations induced by transgenic or genetic deletions, and ‘RNA interference & Mutant’ means that one gene is perturbed by RNAi and the other is perturbed via mutation. In general, results from low-throughput experiments, due to a lower false positive rate, are considered more reliable than results from high-throughput experiments, hence we assigned a higher confidence score to low-throughput evidence than to high-throughput evidence. RNA interference experiments, such as shRNA, siRNA and dsRNA, frequently manifest considerable variability in knockdown efficacy and off-target effects; drug inhibitors also tend to show limited inhibition of target proteins and off-target effects, which may lead to false positives. Accordingly, they are assigned relatively low confidence scores compared to the scores of mutation or transfection experiments.

Table 1. Quantitative scores assigned to SLs according to the experimental methods annotated in evidence sources

If there exist multiple pieces of evidence of the same type (e.g. experimental evidence) supporting a specific SL pair, we adopted the probability disjunction formula to combine the individual scores as follows:

$$s = 1 - \prod_{i=1}^{n} (1 - p_i), \tag{1}$$

in which $s$ represents the integrative score corresponding to the experimental evidence, $p_i$ is the individual score and $n$ is the total number of pieces of experimentally supporting evidence. For example, an SL with one ‘RNA interference & Mutant’ evidence and one ‘bi-specific RNA interference’ screening evidence will lead to the combined score of 0.875, i.e. 1 − (1 − 0.75)(1 − 0.5) = 0.875. Note that the probability disjunction formula has been frequently used to calculate combined scores in the case that multiple pieces of evidence exist, such as in STITCH (26) and ComPPI (27).
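The probability disjunction formula is short enough to check numerically. The sketch below reproduces the worked example from the text (individual scores 0.75 and 0.5 giving 0.875); the function name is chosen here for illustration only.

```python
from math import prod


def combine_disjunction(scores):
    """Combine same-type evidence scores as s = 1 - prod(1 - p_i)."""
    scores = list(scores)
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"score {p} is outside [0, 1]")
    # With no evidence the empty product is 1, so the combined score is 0.
    return 1.0 - prod(1.0 - p for p in scores)


combined = combine_disjunction([0.75, 0.5])  # worked example from the text
```

Because every factor (1 − p) lies between 0 and 1, adding another piece of evidence can only keep the combined score the same or raise it, which matches the stated principle that more supporting evidence should yield higher confidence.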

In the integration step, we introduced weight factors to reflect the importance of different types of evidence. To obtain a normalized score between 0 and 1, such that a score closer to 1 represents higher confidence, we computed the normalized weighted sum as:

$$S = \frac{w_m s_m + w_d s_d + w_p s_p + w_t s_t}{w_m + w_d + w_p + w_t}, \tag{2}$$

in which $S$ represents the integrative confidence score; $w_m$, $w_d$, $w_p$ and $w_t$ are the weight factors of biochemical experiment, other related databases, computational prediction and text mining-based evidence; $s_m$, $s_d$, $s_p$ and $s_t$ are the corresponding individual scores. Following the convention that evidence from biochemical experiments is the most reliable, followed by other related databases and in silico predictions, and text mining-based evidence is the least reliable, we set the weight factors $w_m$, $w_d$, $w_p$ and $w_t$ to 0.8, 0.5, 0.3 and 0.2, respectively.
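As a rough illustration of the integration step, the normalized weighted sum can be written as below, using the weights 0.8, 0.5, 0.3 and 0.2 from the text. The dictionary keys are labels invented for this sketch, and treating a missing evidence type as a score of 0 is an assumption; the text does not state how absent evidence types are handled.

```python
# Weight factors from the text; the key names are illustrative labels.
WEIGHTS = {
    "experiment": 0.8,   # w_m: biochemical experiments
    "database": 0.5,     # w_d: other related databases
    "prediction": 0.3,   # w_p: computational predictions
    "text_mining": 0.2,  # w_t: text mining
}


def integrative_score(type_scores):
    """Normalized weighted sum S of per-type scores, in [0, 1].

    ASSUMPTION: an evidence type missing from `type_scores` counts as 0.
    """
    weighted = sum(w * type_scores.get(k, 0.0) for k, w in WEIGHTS.items())
    return weighted / sum(WEIGHTS.values())
```

Under this reading, a pair supported only by experimental evidence with a combined score of 0.875 would receive 0.8 × 0.875 / 1.8 ≈ 0.39, so high integrative scores require support from several evidence types.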

## STATISTICAL ANALYSIS OF DRUG SENSITIVITY

Although perturbing an SL pair via genetic mutation or RNAi inhibition can induce cell death with a high probability, drug treatments may produce only low sensitivity or even no lethal response. One reason may be that the proteins encoded by the SL partners are not accessible to drug molecules (i.e. lack of druggability), or that their biological functions are not completely blocked by small drug molecules (i.e. low efficacy). Insufficient response to drug treatments could hinder the practical application of the SL concept to anticancer drug design.

To give a preliminary evaluation of the SL pairs as potential anticancer drug targets, we developed a statistical analysis module to evaluate the druggability and efficacy of SL pairs upon drug treatments, based on the large-scale drug sensitivity data sets. In particular, we built a set of high-quality drug–protein interactions from the drug targets in DrugBank (19), drug–protein interactions with experimentally supportive scores >0.9 in STITCH (26), and the drug–kinase binding affinity profiles, referred to as KIBA (28), which were integrated from three drug bioactivity assays (29–31) and ChEMBL (32). We also integrated three large-scale drug sensitivity data sets, i.e. CCLE (33), GDSC (34) and NCI-60 (35), together with genome-wide gene expression profiles, copy number alterations (CNA) and mutations obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC) database (36). Overall, these data sets contain drug sensitivity values (represented as the half maximal inhibitory concentration values, i.e. IC50) of 19 017 unique approved and experimental drugs on more than 1000 cancer cell lines. The large amount of data allows us to carry out powerful statistical tests to examine whether a specific SL can induce significant cancer cell death or reduce cancer cell viability when perturbed by a drug. Formally, for each SL pair, denoted as A and B, a Wilcoxon rank-sum test can be conducted to examine whether inhibiting gene B by drugs yields significantly higher drug sensitivity in samples in which gene A is inactive (or overactive) than in the rest of the samples. It is worth noting that such a statistical test was also used by the DAISY method to detect SL pairs from somatic copy number alterations and shRNA essentiality screening data (13).
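To make the test concrete, here is a minimal, standard-library sketch of a two-sided Wilcoxon rank-sum test using the normal approximation without tie correction. It is not the SynLethDB implementation; the function names and the IC50 values are made up for illustration, and a statistics library would normally be preferred for real data.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p_value). A negative z means group_a tends to hold the
    smaller values, e.g. lower IC50 and therefore higher sensitivity.
    """
    n1, n2 = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n1])                      # rank sum of group_a
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up IC50 values of a drug targeting gene B (illustration only).
ic50_gene_a_inactive = [0.2, 0.3, 0.25, 0.4, 0.35]
ic50_other_samples = [1.5, 2.0, 1.1, 0.9, 1.8, 2.4]
z, p = rank_sum_test(ic50_gene_a_inactive, ic50_other_samples)
```

A small p-value together with a negative z would suggest that cell lines with gene A inactive respond more strongly to the drug, which is the pattern expected for a druggable SL pair.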

## FUNCTIONALITIES

We have developed six functional modules to help users explore the wealth of data. The query, filtering and ranking module takes as input one or more gene names to search for all associated SL partners, and the SL pairs are presented in both network and tabular viewers. To provide users with a biological context, the network also includes the SL relationships between the genes associated with query genes. In the network viewer, the widths of the edges are proportional to the integrative confidence scores of the corresponding SL pairs, and users can filter the query results by specifying different thresholds for the confidence score and the number of SLs, as shown in Figure 2. Each gene is linked to public resources such as UniProt (37), Ensembl (38) and NCBI GenBank (39). In the tabular viewer, the species, diseases and integrative confidence scores are displayed for each SL pair. Detailed information about the evidence sources and individual scores can be displayed by clicking the hyperlinks of evidence sources. With the ranking function of the tabular viewer, users can easily pick out high-confidence SL pairs according to the integrative confidence scores, as shown in Figure 3.

Figure 2.

Screenshot of the main page of the SynLethDB database, which displays the search result for the query gene Fen1 in human. This network shows all human SL pairs collected by our database. Users can update the network by setting a different threshold for the confidence score and the number of SL pairs to be displayed via the network viewer. On the right part of the page, statistics about the percentages of evidence sources, reference number and confidence score curve are displayed.

Figure 3.

Screenshot of a tabular viewer that displays all the SL partners of Fen1 deposited in SynLethDB, along with the corresponding evidence sources, species, diseases, confidence scores and PubMed references associated with each SL pair. Users can rank the SL pairs according to integrative confidence scores by clicking the column name. Also, one click can launch the statistical analysis of the responses of cancer cells upon drug treatments targeting human SL genes.

As comparative genomic analysis has been successfully used to predict SL by searching for orthologous genes across species, we collected the orthologs among the five organisms identified by four leading methods, i.e. InParanoid (release 8.0) (40), HomoloGene (build 68), Ensembl Compara (41) and PhylomeDB v4 (42). The four methods differ from each other in the underlying rationales for orthology inference and thus complement each other, allowing us to construct a comprehensive set of orthologs (43,44). For any SL pair of interest in one species, users can search for the orthologous genes in the other four species. This functionality could potentially extend the coverage of our SL database. In particular, if any pair of orthologs found in other species has already been annotated as SL, this could strengthen our confidence in the SL pair, although we have not yet considered its contribution to the integrative confidence score.
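The ortholog search can be pictured as two small steps: taking the union of the ortholog calls made by the different methods, then mapping both partners of an SL pair into the target species. The sketch below assumes a simple gene-to-orthologs dictionary per method; the data layout, function names and gene identifiers are hypothetical placeholders and do not reflect how SynLethDB stores its orthologs.

```python
def merge_ortholog_maps(*maps):
    """Union the ortholog calls of several inference methods."""
    merged = {}
    for ortholog_map in maps:
        for gene, orthologs in ortholog_map.items():
            merged.setdefault(gene, set()).update(orthologs)
    return merged


def orthologous_pairs(sl_pair, ortholog_map):
    """List candidate gene pairs in the target species for one SL pair."""
    gene_a, gene_b = sl_pair
    return sorted(
        (a, b)
        for a in ortholog_map.get(gene_a, ())
        for b in ortholog_map.get(gene_b, ())
    )


# Hypothetical placeholder identifiers, for illustration only.
method_1 = {"yeastA": {"humanA1"}, "yeastB": {"humanB1"}}
method_2 = {"yeastA": {"humanA1", "humanA2"}}
merged = merge_ortholog_maps(method_1, method_2)
candidates = orthologous_pairs(("yeastA", "yeastB"), merged)
```

Any candidate pair that is itself already annotated as SL in the target species would then count as supporting evidence of the kind the paragraph describes.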

For human SL pairs, we developed the statistical analysis of drug sensitivity module to test the druggability and efficacy of drugs targeting SL partners, based on the collected large-scale drug sensitivity data sets. For each SL pair, one click can launch the statistical analysis procedure, and the statistical significance (measured by P-value) will be calculated. To facilitate data interpretation, graphical representations with interactive features, such as scatter plots and statistical boxplots, are employed. In these plots, drug names, sensitivity values and cancer cell lines are interactively displayed. Also, the drugs targeting the SL partners of interest can be viewed via the drug-SL partner interaction query functionality. All displayed drugs are linked to the PubChem database (45), which provides detailed properties and chemical structures.

Furthermore, as gene set enrichment analysis (GSEA) is helpful for understanding the molecular mechanisms of SL interactions, we carried out gene set enrichment analysis to find statistically significant pathways and GO (gene ontology) functional annotation terms, based on the subset of genes constituting SL relationships with each specific gene. For the identified pathways and GO terms, links to external databases, such as KEGG (46), Reactome (47) and Gene Ontology (48), are provided.
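The text does not specify which enrichment statistic is used. One common choice for testing whether the SL partners of a gene are over-represented in a pathway or GO term is the one-sided hypergeometric test, sketched below as an assumption rather than as the SynLethDB method; the function and parameter names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(universe, gene_set_size, partners, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    universe      -- number of background genes
    gene_set_size -- genes in the pathway / GO term within the background
    partners      -- number of SL partners of the query gene
    overlap       -- SL partners that fall inside the gene set
    """
    upper = min(gene_set_size, partners)
    total = comb(universe, partners)
    tail = sum(
        comb(gene_set_size, k) * comb(universe - gene_set_size, partners - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

In practice the resulting P-values would also need correction for the number of pathways and GO terms tested.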

## CONCLUSION AND FUTURE DEVELOPMENT

In this paper, we presented SynLethDB, a comprehensive database of synthetic lethality. SL pairs were collected from multiple sources, including biochemical assays, other related databases, computational predictions and text-mining outputs for five species. To extend the coverage of SL gene pairs, we adopted text mining tools to analyze the PubMed literature related to synthetic lethality. To facilitate data interpretation and evaluation, we developed functional modules such as orthology search, query and filtering, statistical analysis of drug sensitivity and gene set enrichment analysis. As the first comprehensive database dedicated to synthetic lethality, an emerging anticancer strategy that promises to be selective and sensitive, SynLethDB can be a valuable resource to facilitate the discovery of new anticancer drug targets.

In the future, we will expand the coverage of data types and species, building on the rapidly increasing number of studies focused on SL screening and sensitivity analysis of cancer cells to drugs. We will continuously increase the number of manually curated SL pairs to ensure the reliability of the data, and build a gold standard for human SL, which would be very helpful for the biomedical research community in validating and evaluating results produced with both experimental and computational approaches. In addition, we will incorporate new SL pairs from other sources, such as more computational predictions and text mining results, to complement the manual curations.

Furthermore, it has been realized that the cellular response of cancer cells to drug treatments depends strongly on the genetic context, such as spectrum of mutations, copy number alterations and epigenetic modifications (49). We will go on to identify cancer-specific SL pairs by integrating the genomic and epigenetic features into our database. Also, we will develop more functional modules and data visualization tools to analyze and display the data.

## FUNDING

MOE AcRF Tier 2 [ARC 39/13 (MOE2013-T2-1-079)]; Ministry of Education, Singapore. Funding for open access charge: MOE AcRF Tier 2 [ARC 39/13 (MOE2013-T2-1-079)]; Ministry of Education, Singapore.

Conflict of interest statement. None declared.

## Acknowledgments

The SL pairs identified by bi-specific shRNA screening from the DECIPHER Project were kindly provided by Cellecta based on NIH-funded research grants 44RR024095 and 44HG003355. We would like to thank Oliver Pelz for kindly answering our questions about the usage of GenomeRNAi.

## Footnotes

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

### PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool)

Recently, I found this good research paper called “PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool)”. It may be useful for scientists who are interested in this topic.

Sci Rep. 2015 May 19;5:10021. doi: 10.1038/srep10021.

# PALM-IST: Pathway Assembly from Literature Mining–an Information Search Tool.

### Abstract

Manual curation of biomedical literature has become extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, newer and more efficient mining tool is required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological processes, cellular component, etc.) information from the extracted text and subsequently mines various databases to provide their comprehensive inter-relation (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction network and pathway information data relevant to the text search using multiple data mining tools and assembles them to create a meta-interaction network. It also analyzes scientific collaboration by extraction and creation of “co-authorship network,” for a given search context. Hence, this useful combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interaction (PPI), to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST is available at www.hpppi.iicb.res.in/ctm.

PMID: 25989388; PMCID: PMC4437304 (free PMC article)

http://www.hpppi.iicb.res.in/ctm/

PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool) is a computational platform for exploring the biomedical literature resource PubMed using multiple keywords, and for extracting information centered on gene/protein names, drugs and diseases, along with their relations and interactions, from text and databases. PALM-IST provides users with a platform where data mining and literature mining are performed simultaneously. Combined structured data (from data mining) and unstructured data (from text mining) can be used to extract novel associations or interactions between biological entities such as proteins, diseases or drugs, to generate meta-pathways, and further to identify key crosstalk and bottleneck proteins. PALM-IST also enables users to assemble human pathways and protein-protein interaction networks (PPIN) using information extracted from text and databases.

FEATURES

1. Real time search in PubMed.
2. Identification and highlighting of genes, drugs and diseases extracted from searched abstracts.
3. Interactive co-occurrence based network of gene-disease, gene-drug, drug-disease from literature.
4. Functional annotation by mapping expression information on to human pathway proteins and their interactors.
5. Platform to merge protein-protein interaction of multiple human genes/proteins.
6. Platform to find cross-talk genes/proteins from merged pathways result.
7. Interactive display of pathways overlaid with protein-protein interaction information.
8. Interactive display of collaborative network between biomedical experts.

### KH Coder is a free software for quantitative content analysis or text data mining

https://sourceforge.net/projects/khc/

## Description

KH Coder is a free software for quantitative content analysis or text data mining. It is also utilized for computational linguistics. You can analyze Japanese, English, French, German, Italian, Portuguese and Spanish text with KH Coder. Chinese (simplified, UTF-8), Korean and Russian (UTF-8) language data can also be analyzed with the latest alpha version.

KH Coder provides various kinds of search and statistical analysis functions using back-end tools such as Stanford POS Tagger, FreeLing, Snowball stemmer, MySQL and R.

KH Coder Web Site

http://www.sciencedirect.com/science/article/pii/S1672022916000401

Figure 1.

Translational Bioinformatics in context

The Y axis depicts the “central dogma” of informatics, converting data to information and information to knowledge. Along the X axis is the translational spectrum from bench to bedside. Translational bioinformatics spans the data to knowledge spectrum, and bridges the gap between bench research and application to human health. The figure was reproduced from [1] with permission from Springer.

In the general phase of text mining for cancer systems biology, we first obtain related biomedical text from the many available sources, such as PubMed. A number of literature databases provide a bulk data download service. Although this is convenient, the included text is not updated in a timely manner, and the quantity of text is limited. Many literature database systems offer an application programming interface, through which scripts can download the text automatically. For example, through the E-utilities of PubMed [64] and [101], users can easily get up-to-date text.
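As a small illustration of scripted access, the sketch below builds an NCBI E-utilities `esearch` request for PubMed and, when called, returns the matching PubMed IDs from the JSON response. The helper names and the example query are chosen here for illustration; NCBI's usage guidelines (request rate limits, optional API key) should be consulted before running anything at scale.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build an esearch URL that asks PubMed for IDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def fetch_pmids(term, retmax=20):
    """Run the search over the network and return a list of PMID strings."""
    with urlopen(build_esearch_url(term, retmax), timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Example query string (illustrative): titles containing "synthetic lethal".
example_url = build_esearch_url('"synthetic lethal"[Title]', retmax=5)
```

The returned IDs can then be passed to the `efetch` utility to download the corresponding abstracts.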

Named entity recognition tools can then be used to extract biomedical mentions from the text obtained. The mentions usually include terms such as gene names, protein names, mRNA (messenger RNA) names, miRNA (microRNA) names, metabolism-related terms, and cell terms. After finding the biomedical terms, we can build gene–gene interaction networks, metabolic pathways, and other networks. Resources such as Gene Ontology can be used for network building. MicroRNAs are considered to be connected with cancer, so we can investigate how miRNAs work in gene–gene interactions. In the next phase, we can study how components and structures change in dynamic contexts. Certain networks and their variations, such as protein–protein interaction networks [102] and variations in metabolic networks, can be built from text. Due to the high false negative rate in text mining-based networks, we can employ validation and inference algorithms to correct and optimize the network. In each phase, we can use many resources to validate the network, such as homology, co-expression data, rich domain data, and co-biological process data, as well as other information. Through validation, nodes and interactions with strong evidence will be strengthened, whereas false ones will be removed or updated. Consequently, we can develop a protein–protein interactome based on multiple sources of interaction evidence [47]. Finally, all the networks and components can be used for further studies.
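A very reduced version of this pipeline, going from text to a co-occurrence network, might look as follows. Dictionary lookup stands in for a real named entity recognition tool, sentence splitting is naive, and the gene names and sentences are placeholders; all of these are simplifying assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(text, gene_names):
    """Count how often two known gene names appear in the same sentence.

    Returns a Counter mapping (gene_a, gene_b) tuples, sorted
    alphabetically within each pair, to sentence-level counts.
    """
    names = {g.upper() for g in gene_names}
    edges = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence.upper()))
        found = sorted(names & tokens)
        for pair in combinations(found, 2):
            edges[pair] += 1
    return edges


# Placeholder text and gene symbols, for illustration only.
toy_text = (
    "GENE1 interacts with GENE2. "
    "Loss of GENE2 sensitizes GENE1 mutant cells. "
    "GENE3 is frequently mutated."
)
toy_edges = cooccurrence_edges(toy_text, ["GENE1", "GENE2", "GENE3"])
```

The edge counts can serve as raw weights that later validation steps (homology, co-expression and so on) strengthen or remove.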

Signaling pathway reconstruction plays a significant role in understanding the molecular mechanisms of cancer. Signaling pathway maps are usually obtained from manual literature search, automated text mining, or canonical pathway databases [103]. Pena-Hernandez et al. implemented an extraction tool to find gene relationships and up-to-date pathways from the literature [104].

### 5.2. Examples of integrated biomedical text mining tools

An integrated biomedical text mining system is supposed to provide the functionalities described above. Many such tools are widely used in cancer research. However, blindly using the results from text mining tools is not wise, because the information and knowledge derived from uncurated text are error-prone. Many tools therefore rely on manual curation of text by experts. In the following, we briefly introduce the three most popular commercial tools, i.e., Pathway Studio [105], GeneGO [106] and Ingenuity [107].

With Pathway Studio [105], we can analyze pathways, gene regulation networks and protein interaction maps, and navigate molecular networks. Its background knowledge database contains more than 100,000 events of regulation, interaction and modification between proteins, cell processes and small molecules. Its natural language processing module, MedScan, identifies entities and then applies handcrafted context-free grammar (CFG) rules to extract relationships. Pathway Studio can access the entire PubMed database as well as online resources, full-text journals, other literature, and experimental and electronic notebooks. Pathways and networks are built from the facts and interactions extracted from the retrieved text. Many algorithms, such as Find direct interactions, Find shortest paths, Find common targets and Find common regulators, are available.

MetaCore, one of the key products of GeneGO [106], is an integrated knowledge database and software suite for pathway analysis of experimental data and gene lists. The knowledge base of MetaCore is a manually curated database derived from extensive full-text literature annotation. MetaMiner of GeneGO, which mainly includes MetaMiner Disease Platforms, MetaMiner Stem Cells, MetaMiner Prostate Cancer and MetaMiner Cystic Fibrosis, offers knowledge mining and data analysis platforms for oncology. Its most important function, disease reconstruction, is based on three fundamentals: manual annotation of all gene–disease associations, reconstruction of disease pathways, and functional data and knowledge mining of OMICs experimental studies published in a disease area. GeneGO also provides an API for third-party software development.

Ingenuity [107] helps researchers model, analyze and understand complex biomedical, biological and chemical systems by integrating data from a variety of experimental platforms. One application example of Ingenuity Systems is the analysis of CD44hi breast cancer stem cell-like subpopulations using Ingenuity iReport. The knowledge base of Ingenuity is also extracted by experts from the full text of the scientific literature, including findings about genes, drugs, biomarkers, chemicals, cellular and disease processes, and signaling and metabolic pathways. Researchers can search the scientific literature and find the insights most relevant to the desired experimental model or question, build dynamic pathway models, and gain confidence in hypotheses and conclusions.

## 6. Future work and challenges

With the development of next-generation sequencing technologies, high-throughput experimental methods are rapidly revolutionizing the life sciences. The widespread adoption of cloud computing is also accelerating the application of text mining technology in frontier life science research. Here we discuss the work ahead and the challenges for the future application of text mining in cancer research.

The first challenge is to apply biomedical text mining technologies to the development of personalized medicine. It is well known that cancer is a complex disease. Many factors such as race, gender, age and environment may correlate with the risk of cancer [108], [109], [110], [111], [112], [113] and [114]. Personalized medicine is becoming a trend, and therapies will be tailored to individual patients once their biomedical information has been collected and analyzed. Ando et al. applied text mining to qualitatively identify differences in the focus of life review interviews by patient age, gender, disease age and stage [115]. Ahmed et al. integrated cancer-related compound–target relationships by text mining and presented the spectrum of research on personalized medicine and compound–target interactions [116]. Personalized medicine in cancer will need to take all these aspects into consideration during text mining [117]. One solution is to categorize data before text mining rather than treating them together without any pre-processing. However, categorizing data by individual-level features is a tough task. In addition, one negative consequence of categorization is that it becomes harder for text mining to find a good biomarker for all cases.

The second challenge is the complexity of cancer molecular mechanisms. The same cancer phenotype can be caused by different genes or gene sets from the same pathway or network. To study the complex mechanisms of cancer, we need to mine text from a hierarchical network view rather than at a single level. Systems biomedicine carries out analysis at different levels, including motif [118] and [119], pathway [120], [121] and [122], module [123], [124] and [125] and network [126] and [127]. The resulting hierarchical data provide valuable material for text mining at different levels. However, correctly assigning text to the levels of a hierarchical network, integrating text mining results from different levels, and discovering new knowledge from a systems biomedicine view all remain hard problems.

The third challenge is to apply text mining techniques in translational medicine research. Translational medicine, an emerging field of biomedicine, involves the transformation of laboratory findings into novel diagnoses and treatments for patients [128]. Pre-clinical knowledge can be used in the clinic to improve treatment, and translational medicine facilitates the prediction, prevention, diagnosis, and treatment of diseases. Bioinformatics will be a driver rather than a passenger for translational biomedical research [128]. There are many applications of this kind; for example, the data integration and data mining platform presented by Liekens et al. [129] could retrospectively confirm recently discovered disease genes and identify potential susceptibility genes. This adds difficult tasks for text mining, since translational biomedical text mining should consider various stages of information and various sources of evidence, and integrate omics and clinical data sets to uncover novel knowledge for both the biology and medicine domains.

The fourth challenge for text mining will be the integration of text information at the molecule, cell, tissue, organ, individual and even population levels to understand complex biological systems. Most current text mining studies focus on the molecular level, and very little text mining work has been reported at higher levels, which in fact have a close relationship with cancer phenotypes. Performing text mining at higher levels and integrating the text information across all these levels will be a big challenge for cancer studies, and will also provide opportunities for successful cancer diagnosis and treatment.

The last challenge will be the de-noising and testing of text mining results. Text mining results often contain noise and false positives because natural language text is often inconsistent. It contains ambiguities caused by semantics, slang and syntax, and it can also suffer from noise and errors in the text itself. As a result, the mined information cannot be used blindly. Several approaches have been developed to address the problem. The first is to manually read and understand the contexts, analyze them, and then add semantic tags. This pre-processing in effect turns unstructured text into structured text with semantic tags, so that tools can then extract information with high precision. However, the approach is very restricted because it requires vast human effort and is very time consuming. As a result, the data source for mining tends to be modest in size, which limits mining ability. The second approach is to carry out text mining on vast amounts of biomedical text, and then analyze the output and screen out the final results using prior domain experience. During the mining process, domain knowledge is usually employed to improve mining efficiency as well as the quality of the mined knowledge. Although its results may contain more errors, this approach is more powerful for knowledge discovery than the first. The two approaches differ in how they treat the text to be mined: the first ensures correctness through careful manual pre-processing, while the second selects correct results through post-processing by experts. The third approach is a compromise between pre-processing and post-processing, in which some advanced statistical analysis is used to roughly clean the data in a first stage before mining is conducted on them.
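As a hedged illustration of statistical de-noising, the Python sketch below counts sentence-level co-occurrences between entity names and keeps only pairs that pass a minimum count and a pointwise mutual information (PMI) threshold. PMI is swapped in here as one simple statistic; the source does not name a specific method. The entity names (`GENEA`, `tumorX`, and so on), the toy sentences, the thresholds and the function names are all hypothetical.

```python
import math
import re
from collections import Counter
from itertools import product


def cooccurrence_pmi(sentences, genes, diseases):
    """Return {(gene, disease): (count, pmi)} from sentence-level co-occurrence."""
    n = len(sentences)
    gene_counts, disease_counts, pair_counts = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        found_genes = {g for g in genes if g.lower() in tokens}
        found_diseases = {d for d in diseases if d.lower() in tokens}
        gene_counts.update(found_genes)
        disease_counts.update(found_diseases)
        pair_counts.update(product(found_genes, found_diseases))
    scores = {}
    for (gene, disease), count in pair_counts.items():
        # PMI = log2( P(gene, disease) / (P(gene) * P(disease)) )
        pmi = math.log2(count * n / (gene_counts[gene] * disease_counts[disease]))
        scores[(gene, disease)] = (count, pmi)
    return scores


def filter_pairs(scores, min_count=2, min_pmi=0.5):
    """Keep pairs that are both frequent enough and statistically associated."""
    return {
        pair: value
        for pair, value in scores.items()
        if value[0] >= min_count and value[1] >= min_pmi
    }


# Invented toy corpus with placeholder entity names (not real findings).
sentences = [
    "GENEA is mutated in tumorX",
    "GENEA expression is elevated in tumorX",
    "GENEB was measured in tumorX",
    "GENEB is mutated in tumorY",
    "GENEC is amplified in tumorY",
    "GENEC drives growth of tumorY",
]
scores = cooccurrence_pmi(sentences, ["GENEA", "GENEB", "GENEC"], ["tumorX", "tumorY"])
kept = filter_pairs(scores)

for pair, (count, pmi) in sorted(kept.items()):
    print(pair, count, round(pmi, 2))
```

In this toy corpus the two pairs involving `GENEB` occur only once each and have a PMI of zero, so they are discarded as likely noise, while the two pairs seen twice survive. Real thresholds would need to be tuned and validated against curated data.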

## 7. Conclusions

Currently, there is a huge body of biomedical text, and its rapid growth makes it impossible for researchers to process the information manually. Researchers can use biomedical text mining to discover new knowledge. We have reviewed the important research issues related to text mining in the biomedical field. We also provided a review of the state-of-the-art applications and datasets used for text mining in cancer research, thereby providing researchers with the necessary resources to apply or develop text mining tools in their research. We introduced the general workflow of text mining to support cancer systems biology and illustrated each phase in detail. Text mining has clearly been used widely in cancer research. However, to fully utilize text mining, it is still necessary to develop new methods for full-text mining and for highly complex text, as well as platforms for integrating other biomedical knowledge bases.

Despite the huge potential of applying text mining to biomedicine, it still needs further development. Biomedical text mining systems are not yet gold-standard tools for biomedical researchers in the way that retrieval systems and sequencing tools are. Our next important mission for text mining is to develop applications that are really helpful to biomedical research, so that researchers can become more productive and make more progress in an era of rapidly growing information. To achieve this goal, more attention should be given to helping biological and biomedical scientists remove the obstacles that block development, rather than to discussions unrelated to actual demands. One of the hottest topics in text mining is coordination and cooperation across multiple disciplines. That is, biomedical text mining, coupled with other data and means, should yield consistent, measurable, and testable results.