Many efforts are still devoted to the discovery of genes involved with specific phenotypes, in particular, diseases. High-throughput techniques are thus applied frequently to detect dozens or even hundreds of candidate genes. However, the experimental validation of many candidates is often an expensive and time-consuming task. Therefore, a great variety of computational approaches has been developed to support the identification of the most promising candidates for follow-up studies. The biomedical knowledge already available about the disease of interest and related genes is commonly exploited to find new gene–disease associations and to prioritize candidates. In this review, we highlight recent methodological advances in this research field of candidate gene prioritization. We focus on approaches that use network information and integrate heterogeneous data sources. Furthermore, we discuss current benchmarking procedures for evaluating and comparing different prioritization methods. WIREs Syst Biol Med 2012. doi: 10.1002/wsbm.1177
COMPLEX DISEASES AND THE IDENTIFICATION OF RELEVANT GENES
Many common diseases are complex and polygenic, involving dozens of human genes that might predispose to, be causative of, or modify the respective disease phenotype.1–3 This intricate interplay of disease genotypes and phenotypes still renders the identification of all relevant disease genes difficult.4–6 Therefore, a number of experimental techniques exist to discover disease genes. In particular, high-throughput methods such as genome-wide association (GWA) studies2,7,8 and large-scale RNA interference screens6,9,10 yield lists of up to hundreds of candidate disease genes. As validating the actual disease relevance of candidate genes in experimental follow-up studies is a time-consuming and expensive task, many methods and web services for the computational prioritization of candidate disease genes have already been developed.11–19
The concrete problem of candidate gene prioritization can be formulated as follows: Given a disease (or, generally speaking, a specific phenotype) of interest and some list of candidate genes, identify potential gene–disease associations by ranking the candidate genes in decreasing order of their relevance to the disease phenotype. When abstracting from the methodological details, the vast majority of computational approaches to this prioritization problem work in a similar manner. Most of them rely on the biological information already available for the disease phenotype of interest and the known, already verified, disease genes as well as for the additional candidate genes. In this context, functional information, particularly manually curated or automatically derived functional annotations, often provides strong evidence for establishing links between diseases and relevant genes and proteins.20–27 Many prioritization methods use protein interaction data as a rich information source for finding relationships between gene products of candidate genes and disease genes.11,16,18,25,28–45 In addition, the phenotypic similarity of diseases can help to increase the total number of known disease genes for less studied disease phenotypes.35,46–55 Other sources of biological information frequently used by prioritization approaches are sequence properties, gene expression data, molecular pathways, functional orthology between organisms, and relevant biomedical literature.12,14,19
These data then serve as input for statistical learning methods or are integrated into network representations, which are further analyzed by network scoring algorithms. Although individual data sources such as functional annotations or protein interactions provide quite powerful information for prioritizing candidate genes, the integration of multiple data sources has been reported to increase the performance even more.23–25,35,47–69 However, a generally accepted and consistent benchmarking strategy for all the diverse prioritization methods has not emerged yet, which complicates performance evaluation and comparison.
Therefore, this advanced review does not only highlight recent prioritization approaches (published until the end of 2011), but also discusses different benchmarking strategies applied by authors and the need of standardized procedures for performance measurement. Other computational tasks such as the structural and functional interpretation as well as prioritization of disease-associated nucleotide and amino acid changes are not discussed in this article, but are reviewed elsewhere.3,70–74 In the following, the various prioritization methods are categorized according to the biological data and their representation that are primarily considered when scoring and ranking candidate disease genes: gene and protein characteristics, network information on molecular interactions, and integrated biomedical knowledge.
PRIORITIZATION METHODS USING GENE AND PROTEIN CHARACTERISTICS
The first computational approaches in the hunt for disease genes focused on molecular characteristics of disease genes, which discriminate them from non-disease genes. As described below, researchers developed methods related to individual gene and protein sequence properties75–77 as well as functional annotations of gene products.20–22,26,27,78 In principle, if a candidate satisfies certain characteristics as derived from known disease genes and proteins, its disease relevance is considered to be higher than otherwise.
Gene and Protein Sequence Properties
López-Bigas and Ouzounis75 derived several important characteristics of disease genes from the amino acid sequence of their gene products. In comparison with other proteins encoded in the human genome, disease proteins tend to be longer, to exhibit a wider phylogenetic extent, that is, to have more homologs in both vertebrates and invertebrates, to possess a low number of close paralogs, and to be more evolutionarily conserved. Using these sequence properties as input to a decision-tree algorithm, the researchers performed a genome-wide identification of genes involved in (hereditary) diseases.
Similarly, Adie et al.76 developed PROSPECTR, a method for candidate disease gene prioritization based on an alternating decision-tree algorithm. However, their approach examined a broader set of sequence features and thus produced a more successful classifier. In particular, Adie et al. found that disease genes tend to have different nucleotide compositions at the end of the sequence, a higher number of CpG islands at the 5′ end, and longer 3′ untranslated regions.
By demonstrating the strong correlation between gene and protein function and disease features, such as age of onset, Jimenez-Sanchez et al.78 motivated prioritization approaches that exploit the functional annotation of known disease genes for ranking candidates.20–22,26,27
Perez-Iratxeta et al.20 applied text mining on biomedical literature to relate disease phenotypes with functional annotations using Medical Subject Headings (MeSH)79 and Gene Ontology (GO) terms.80 They ranked the candidate genes according to the characteristic functional annotations shared with the disease of interest. In a similar fashion, Freudenberg and Propping21 identified candidate genes based on their annotated GO terms that are shared with groups of known disease genes associated with similar phenotypes. In contrast, the approach POCUS22 assesses the shared over-representation of functional annotation terms between genes in different loci for the same disease.
Recently, Schlicker et al.26 developed MedSim, a prioritization method that makes use of the similarity between the functional annotations of disease genes and candidates. In contrast to approaches that consider solely identical functional annotations or compute only GO term enrichments, MedSim automatically derives functional profiles for each disease phenotype from the GO term annotation of known disease genes and, optionally, of their orthologs or interaction partners. Candidate genes are then scored and ranked according to the functional similarity of their annotation profiles to a disease profile. In addition, Ramírez et al.27 introduced the BioSim method for discovering biological relationships between genes or proteins. While MedSim is based only on GO term annotations, BioSim quantifies functional gene and protein similarity according to multiple data sources of functional annotations and can also be applied to rank candidate genes based on their functional similarity to known disease genes.
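To illustrate the general idea of annotation-based scoring, the sketch below ranks candidates by the Jaccard overlap between their GO annotations and a profile derived from known disease genes. This is a deliberate simplification: MedSim and BioSim use semantic similarity measures over the GO graph rather than plain set overlap, and all gene names and GO term IDs here are made up.

```python
def functional_profile(disease_genes, annotations):
    """Disease profile: union of GO terms annotated to known disease genes."""
    profile = set()
    for gene in disease_genes:
        profile |= annotations.get(gene, set())
    return profile

def rank_candidates(candidates, profile, annotations):
    """Rank candidates by Jaccard overlap of their annotations with the profile."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0
    scored = [(c, jaccard(annotations.get(c, set()), profile)) for c in candidates]
    return sorted(scored, key=lambda cs: cs[1], reverse=True)

# Made-up genes and GO term IDs
annotations = {
    "known1": {"GO:0006281", "GO:0000724"},   # DNA repair terms
    "known2": {"GO:0007049", "GO:0000724"},   # cell cycle terms
    "candX":  {"GO:0006281", "GO:0007049"},
    "candY":  {"GO:0006412"},                 # translation
}
profile = functional_profile(["known1", "known2"], annotations)
ranking = rank_candidates(["candX", "candY"], profile, annotations)
```

A candidate sharing terms with the profile (candX) outranks one with unrelated annotations (candY); replacing the Jaccard score with a GO-graph-based semantic similarity recovers the spirit of the published methods.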
The success of the presented studies also shows that phenotypically similar diseases often involve common molecular mechanisms and thus functionally related genes. This also explains the frequent use of functional annotations as important biological evidence in integrative prioritization approaches.23–25,56–58,61–66,68,69 Notably, the information value of functional annotations can be further increased by improved scoring of functional similarity, reaching the performance of complex integrative methods based on multiple data sources.26
PRIORITIZATION METHODS USING NETWORK INFORMATION
In the last decade, molecular interaction networks have become an indispensable tool and a valuable information source in the study of human diseases. Regarding methods for prioritization of candidate disease genes, it was repeatedly observed that protein interaction networks are among the most powerful data sources in addition to functional annotations.11,16,18,28,29,81 As in the case of sequence properties, disease genes and their products have discriminatory network properties that allow their distinction from non-disease genes. In particular, molecular interactions naturally support the application of the guilt-by-association principle to identify disease genes. In the following, we highlight a representative selection of network-based prioritization approaches.
Local Network Information
Early prioritization methods have focused on local network information such as the close network neighborhood of a node representing a candidate gene or protein (see Box 1). This can be explained by the observation that disease proteins tend to cluster and interact with each other.30,49,82–84 Molecular triangulation is one of the first methods that used protein interaction networks to rank candidates; it scores candidate nodes with respect to their shortest path distances to nodes of known disease proteins.31 An evidence score such as the MLS score corresponding to linkage peak association85 is assigned to each disease protein node and transferred to its neighbor nodes. The candidates are then ranked according to the accumulated sum of evidence scores. This means that candidates represented by nodes close to several disease protein nodes with good evidence scores are considered to be the most promising ones.
Local network information refers to the topological neighborhood of a node, and corresponding measures are less sensitive to the overall network topology. Examples are the node degree kn (number of edges linked to node n) and the shortest path length dnm (minimum number of edges between the nodes n and m). In disease gene networks, Xu et al.34 define, for each node n, the 1N index kdn/kn and the 2N index (1/|Nn|) Σm∈Nn kdm/km. Here, kdn is the number of edges between node n and disease genes, and Nn is the set of direct neighbors of n. Given the set of disease genes M, the average shortest path distance of a node n to disease genes is d̄n = (1/|M|) Σm∈M dnm.
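These local measures are straightforward to compute. The following sketch implements the node degree, the 1N index, and the average shortest path distance to disease genes on a toy adjacency-set network (all node names are hypothetical; the 2N index would average the 1N indices over a node's neighbors in the same fashion):

```python
from collections import deque

def degree(adj, n):
    """Node degree kn: number of edges linked to node n."""
    return len(adj[n])

def one_n_index(adj, n, disease_genes):
    """1N index: fraction of n's direct neighbors that are disease genes."""
    return sum(1 for m in adj[n] if m in disease_genes) / len(adj[n])

def bfs_distances(adj, source):
    """Shortest path lengths dnm from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def avg_distance_to_disease_genes(adj, n, disease_genes):
    """Average shortest path distance from n to the disease gene set M."""
    dist = bfs_distances(adj, n)
    return sum(dist[m] for m in disease_genes) / len(disease_genes)

# Hypothetical toy network with two known disease genes
adj = {
    "cand": {"d1", "d2", "x"},
    "d1": {"cand", "d2"},
    "d2": {"cand", "d1"},
    "x": {"cand", "y"},
    "y": {"x"},
}
disease = {"d1", "d2"}
```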
Global network information relates to the overall network topology and measures that characterize the role of a node in the whole network. Common centrality measures are shortest path closeness and betweenness as well as random-walk-related properties such as hitting time, visit frequency and stationary distribution. For instance, closeness centrality indicates how distant a node is to the other network nodes, and it is calculated as Cn = |Vn| / Σm∈Vn dnm, with Vn denoting the set of nodes reachable from n. The random walk with restart40 is defined as pt+1 = (1 − r)Wpt + rp0. Here, W is the column-normalized adjacency matrix of the network, pt contains the probability of being at each node at time step t, p0 denotes the initial probability vector, and r is the restart probability. The steady-state probability vector p∞ can be obtained by performing iterations until the change between pt and pt+1 falls below some significance threshold, e.g., 10−6.40
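The random walk with restart can be implemented directly from this update rule. The sketch below iterates the equation on a toy undirected network stored as adjacency sets (node names are invented); column normalization of W corresponds to dividing each neighbor's probability by its degree:

```python
def random_walk_with_restart(adj, seeds, r=0.3, tol=1e-6):
    """Iterate p_{t+1} = (1 - r) * W * p_t + r * p_0 on an undirected
    network given as adjacency sets; W is column-normalized, so each
    neighbor m contributes p[m] / degree(m)."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    while True:
        nxt = {n: (1 - r) * sum(p[m] / len(adj[m]) for m in adj[n]) + r * p0[n]
               for n in nodes}
        if max(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt

# Toy network: seed disease protein "d1"; "a" is closer to it than "c"
adj = {"d1": {"a", "b"}, "a": {"d1", "b"}, "b": {"d1", "a", "c"}, "c": {"b"}}
steady = random_walk_with_restart(adj, {"d1"})
```

In the steady state, nodes near the seed accumulate more probability mass than distant ones, which is exactly the ranking criterion used by the network-based methods discussed below.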
In a related approach, Karni et al.32 identified the minimal set of candidates so that there is a path between the products of known disease genes in the protein interaction network. Oti et al.33 proposed an even simpler method for a genome-wide prediction of disease genes. For each known disease protein, they identified its interaction partners and the chromosomal locations of the encoding genes. A gene is then considered to be relevant for a disease of interest if it resides within a known disease locus and its gene product shares an interaction with a protein known to be associated with the same disease.
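The rule of Oti et al. reduces to two set-membership tests and can be sketched in a few lines (all loci, genes, and proteins below are invented for illustration):

```python
def predict_disease_genes(candidates, disease_loci, interactions, disease_proteins):
    """Oti-style filter: a candidate is predicted as disease-relevant if it
    resides in a known disease locus AND its product interacts with a
    protein already associated with the disease."""
    hits = []
    for gene, locus in candidates.items():
        in_locus = locus in disease_loci
        interacts = any(p in disease_proteins for p in interactions.get(gene, ()))
        if in_locus and interacts:
            hits.append(gene)
    return hits

# Hypothetical candidates mapped to chromosomal loci
candidates = {"geneA": "2q33", "geneB": "2q33", "geneC": "7p11"}
disease_loci = {"2q33"}                       # linkage regions for the disease
interactions = {"geneA": {"dpX"}, "geneB": {"other"}, "geneC": {"dpX"}}
disease_proteins = {"dpX"}
hits = predict_disease_genes(candidates, disease_loci, interactions, disease_proteins)
```

Only geneA passes both tests: geneC interacts with a disease protein but lies outside the disease loci, and geneB lies in a locus but lacks a disease interaction.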
To make the most out of the potential of local network measures, Xu and Li34 computed multiple topological properties for three different molecular networks consisting of literature-curated, experimentally derived, and predicted protein–protein interactions. The considered properties are the node degree, the average distance to known disease genes, the 1N and 2N node indices (see Box 1 and Figure 1), and the positive topological coefficient.86 The authors then trained a k-nearest-neighbor classifier using the aforementioned topological properties of known disease genes and achieved comparable performance for all three networks. They also detected a possible bias in the literature-curated network because disease genes tend to be studied more extensively.
In addition to local network information, Lage et al.35 incorporated phenotypic data into the disease gene prioritization. Each candidate and its direct interaction partners are considered as a candidate complex. All disease proteins in a candidate complex are assigned phenotypic similarity scores, which are used as input to a Bayesian predictor. Thus, a candidate gene obtains a high score if the other proteins in the complex are involved in phenotypes very similar to the disease of interest. Care et al.36 elaborated on this approach by combining it with deleterious SNP predictions for the candidate gene products and their interaction partners. Using the method by Lage et al., Berchtold et al.37 successfully prioritized proteins associated with type 1 diabetes (T1D). Further studies of protein interaction networks underlying specific diseases such as breast cancer38 and T1D39 also deal with the application of similar network-based prioritization approaches.
Global Network Information
Beyond local network information that ignores potential network-mediated effects from distant nodes, the utilization of global network measures can considerably improve the performance of prioritization methods for candidate disease genes.40,41,44 Especially for the study of polygenic diseases, network topology analysis can provide more insight into multiple paths of long-range protein interactions and their impact on the functionality and interplay of disease genes.
Köhler et al.40 demonstrated that random-walk analysis of protein–protein interaction networks outperforms local network-based methods such as shortest path distances and direct interactions as well as sequence-based methods like PROSPECTR.76 In their method, the authors ranked the gene products in a given network according to the steady-state probability of a random walk, which starts at known disease proteins and can restart with a predefined probability (see Box 1). Although the ranking criterion is the proximity of candidates to known disease proteins, this approach is more discriminative than local measures because it accounts for the global network structure.
In a similar manner, Chen et al.41 adapted three sophisticated algorithms from social network and web analysis for the problem of disease gene prioritization. To this end, they analyzed a protein–protein interaction network using modified versions of the random-walk-based methods PageRank,87,88 Hyperlink-Induced Topic Search (HITS),88,89 and K-Step Markov method (KSMM).88 PageRank, HITS, and KSMM consider the global network topology and compute the relevance of all nodes representing candidates with regard to the set of known disease proteins in the network. All three methods achieved comparable performance to each other.
To address the issue of finding causal genes within expression quantitative trait loci (eQTL),90,91 Suthram et al.42 introduced the eQTL electrical diagrams method (eQED). They modeled confidence weights of protein interactions as conductances, while the P-values of associations between genetic loci and the expression of candidate genes served as current sources. The best candidate gene is the one passed by the highest current. The currents in electric circuits can be determined efficiently using random-walk computations.92,93
Network Centrality Measures
For many years, global centrality measures such as closeness or betweenness have been used in social sciences to assess how important individual nodes are for the overall network connectivity. Recently, such measures have been applied to several problems in bioinformatics including disease gene prioritization. For example, Dezső et al.43 applied an adapted version of shortest path betweenness to prioritize candidates in a protein–protein interaction network. A candidate is scored more relevant to the disease of interest if it lies on significantly more shortest paths connecting nodes of known disease proteins than other nodes in the network.
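Dezső et al. use a statistically adjusted shortest path betweenness; a stripped-down variant of the same idea simply checks, for every pair of disease proteins (s, t), whether a candidate c lies on at least one shortest path between them, i.e., whether d(s,c) + d(c,t) = d(s,t). The toy network and scoring below are illustrative only:

```python
from collections import deque
from itertools import combinations

def bfs_distances(adj, source):
    """Shortest path lengths from source via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def shortest_path_score(adj, candidate, disease_nodes):
    """Fraction of disease-protein pairs (s, t) for which the candidate lies
    on at least one shortest s-t path, i.e., d(s,c) + d(c,t) == d(s,t)."""
    dist_c = bfs_distances(adj, candidate)
    dist = {s: bfs_distances(adj, s) for s in disease_nodes}
    pairs = list(combinations(sorted(disease_nodes), 2))
    hits = sum(1 for s, t in pairs
               if dist[s].get(t) == dist_c.get(s, float("inf"))
                                    + dist_c.get(t, float("inf")))
    return hits / len(pairs)

# Toy network: "hub" connects all disease proteins, "leaf" hangs off d3
adj = {"d1": {"hub"}, "d2": {"hub"}, "d3": {"hub", "leaf"},
       "hub": {"d1", "d2", "d3"}, "leaf": {"d3"}}
disease = {"d1", "d2", "d3"}
```

The hub node, which mediates all disease-protein connections, receives the maximal score, while the peripheral leaf node receives none; the published method additionally assesses the statistical significance of such path counts.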
In a recent case study on primary immunodeficiencies (PIDs), Ortutay and Vihinen25 integrated functional GO annotations with protein interaction networks to discover novel PID genes. The authors conducted a topological analysis on an immunome network consisting of all essential proteins related to the human immune system and their interactions. In particular, they used the node degree as well as the global centrality measures of vulnerability and closeness to assess the importance of candidate genes in the network (see Box 1). Additionally, they performed functional enrichment analysis to determine genes with PID-related GO terms. With some modifications, the described prioritization method could be generalized to other diseases of interest.
Combining Network Measures
Recently, Navlakha and Kingsford44 compared different network-based prioritization methods. The authors observed that random-walk-based measures40 outperform measures focused on the local network neighborhood33 or clustering.94–96 A consensus method that uses a random-forest classifier to combine all methods yielded the most accurate ranking. Therefore, apart from stressing the potential of protein interaction data, Navlakha and Kingsford also showed that disease gene prioritization can benefit from the integration of multiple information sources.
In summary, as many other studies have also demonstrated, molecular interaction networks, in particular those based on protein interactions, provide valuable biological knowledge for ranking candidate disease genes.11,16,18,28,29 It has also become clear that global network measures achieve better results in comparison to local measures.40,41,44 Nevertheless, the performance of such prioritization approaches depends heavily on the quality of the network data. Protein interaction data are well known to be biased toward extensively studied proteins and subject to inherent noise.34,97,98 Therefore, it is often suggested that existing methods will perform better when more accurate data become available. Furthermore, Erten et al.45 pointed out that network-based methods can also be improved by integrating statistical adjustments for the skewed degree distribution of protein interaction networks.
PRIORITIZATION METHODS USING INTEGRATED KNOWLEDGE
Network information on molecular interactions as well as individual gene and protein characteristics such as sequence properties and functional annotations are major sources of biological evidence for scoring and ranking candidate disease genes. However, a prioritization approach based on a single information source alone usually achieves only limited performance due to noisy and incomplete datasets. To address this problem, the integration of multiple sources of biological knowledge has proven to be a good solution in bioinformatics. Different types of data can complement each other well to increase the amount of available information and its overall quality. While some of the methods presented above already make successful use of relatively simple integration procedures for a few different sources of functional information and annotations, this section will focus on more sophisticated methods for knowledge integration and the prioritization of candidate disease genes.
Complementing Molecular Interactions with Phenotypic Network Information
In recent years, several groups have investigated the similarities and differences between disease phenotypes. The main finding was that similar phenotypes often share underlying genes or even pathways.46,99,100 In particular, van Driel et al.46 classified all human phenotypes contained in the Online Mendelian Inheritance in Man database (OMIM)101 by defining a measure of phenotypic similarity based on text mining of the corresponding OMIM records. Such phenotypic knowledge can be very useful to discover new potential disease genes by transferring known gene–phenotype associations to similar diseases and phenotypes.
Therefore, phenotypic similarity has become another major data source exploited by computational methods for prioritization of candidate disease genes.35,47–55 In this context, a two-layered heterogeneous data network is typically constructed so that the phenome layer consists of connections between similar phenotypes, while the interactome layer contains protein–protein interactions. The two network layers are then linked by known gene–phenotype associations.
To demonstrate the importance of the additional phenotype network layer for identifying novel gene–phenotype associations and disease–disease relationships, Li et al.48 extended the random-walk algorithm used by Köhler et al.,40 as described in the previous section, to heterogeneous networks. Both the candidate genes and the disease phenotypes are prioritized simultaneously. In contrast, Yao et al.49 estimated the closeness of a candidate gene to a disease of interest by computing the hitting time of a random walk that starts at the corresponding disease phenotype and ends at the candidate. This approach also allows the genome-wide identification of potential disease genes for phenotypic disease subtypes.
Chen et al.50 reformulated the candidate gene prioritization problem as a maximum flow problem on a heterogeneous network. They represented the capacities of connections between phenotypes by their phenotypic similarity. Capacities on edges within the interactome and on edges bridging the phenome and interactome were estimated during the evaluation procedure. By calculating a maximum flow from a phenotype of interest through the interactome, the authors ranked candidate genes with regard to the amount of efflux.
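The maximum flow itself can be computed with any standard algorithm. The compact Edmonds-Karp sketch below runs on a toy phenome-interactome network; the capacities, node names, and the single source/sink construction are illustrative and not the exact setup of Chen et al.:

```python
from collections import deque

def max_flow(cap, source, sink):
    """Edmonds-Karp maximum flow; `cap` maps u -> {v: capacity}."""
    residual = {}
    for u in cap:
        for v, c in cap[u].items():
            residual.setdefault(u, {})[v] = residual.get(u, {}).get(v, 0) + c
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edge
    total = 0.0
    while True:
        # breadth-first search for an augmenting path in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # trace the augmenting path and find its bottleneck capacity
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= bottleneck
            residual[w][u] += bottleneck
        total += bottleneck

# phenotype of interest -> similar phenotypes -> candidate genes -> sink
cap = {
    "P0": {"P1": 1.0, "P2": 0.5},
    "P1": {"g1": 1.0},
    "P2": {"g2": 0.5},
    "g1": {"sink": 1.0},
    "g2": {"sink": 1.0},
}
```

In the published method, candidates are then ranked by the amount of flow passing through their gene nodes; here, g1 receives twice the flow of g2 because it is reached via the more similar phenotype.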
A computationally simpler approach based on the same network type was suggested by Guo et al.,51 who computed the association score between a gene and a disease as the weighted sum of all association scores between similar diseases and between neighboring genes in the interaction layer. To this end, the authors formulated an iterative matrix multiplication of disease–gene–association matrices and disease-similarity matrices corresponding to the network structure. While the maximum flow problem solved by Chen et al.50 already accounts for the phenotypic overlap between diseases, the approach by Guo et al.51 additionally considers the genetic overlap of diseases. Going one step further, the recent PhenomeNET52 is a cross-species network of phenotypic similarities between genotypes and diseases based on a uniform representation of different phenotype and anatomy ontologies. In particular, it can be used to perform whole-phenome discovery of genes for diseases with unknown molecular basis.
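The flavor of such an iterative update can be sketched as follows. The exact matrix formulation of Guo et al. differs; the weighting below (averaging a disease-layer and a gene-layer contribution and damping with the initial associations) and all disease/gene names are simplifications chosen for illustration:

```python
def propagate(assoc0, dis_sim, gene_adj, alpha=0.5, iters=20):
    """Iteratively refine gene-disease association scores: each score mixes
    (i) scores of the same gene for similar diseases, (ii) scores of
    interacting genes for the same disease, and (iii) the initial score."""
    diseases, genes = sorted(dis_sim), sorted(gene_adj)
    assoc = {d: dict(assoc0[d]) for d in diseases}
    for _ in range(iters):
        nxt = {}
        for d in diseases:
            nxt[d] = {}
            for g in genes:
                from_dis = sum(dis_sim[d][e] * assoc[e][g]
                               for e in diseases if e != d)
                from_gene = sum(assoc[d][h] for h in gene_adj[g]) / len(gene_adj[g])
                nxt[d][g] = (alpha * (from_dis + from_gene) / 2
                             + (1 - alpha) * assoc0[d][g])
        assoc = nxt
    return assoc

# Two similar diseases; gene chain g1 - g2 - g3; only the D1-g1 link is known
dis_sim = {"D1": {"D2": 0.8}, "D2": {"D1": 0.8}}
gene_adj = {"g1": {"g2"}, "g2": {"g1", "g3"}, "g3": {"g2"}}
assoc0 = {"D1": {"g1": 1.0, "g2": 0.0, "g3": 0.0},
          "D2": {"g1": 0.0, "g2": 0.0, "g3": 0.0}}
scores = propagate(assoc0, dis_sim, gene_adj)
```

After propagation, the similar disease D2 inherits the strongest association for g1, and scores decay along the interaction chain, reflecting both phenotypic and genetic overlap.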
Two other studies used iterative network flow propagation on a heterogeneous network to identify protein complexes related to disease. Vanunu et al.53 developed a prioritization method that propagates flow from a phenotype of interest through the whole network and identifies dense subnetworks around high-scored genes as potential phenotype-related protein complexes. In contrast, Yang et al.54 modified the described heterogeneous network and included an additional layer of protein complexes. In the resulting network, phenotypes are connected to protein complexes, and complexes are linked with each other according to the protein interactions in the interactome layer. The method derives novel gene–phenotype associations by propagating the network flow within the protein complex layer.
Disease gene prioritization methods usually rank candidate genes relative to a phenotype of interest. However, the discovery of gene–phenotype associations can also be approached the other way around. Hwang et al.55 devised a method to identify the phenotype that could result from a given set of candidate genes. For that purpose, the authors considered a gene network and a phenotype similarity network. In both networks, the nodes were ranked separately with graph Laplacian scores, and a rank coherence was calculated from the score differences between genes and phenotypes connected by known associations. Hwang et al. showed that their approach is suitable to predict the resulting phenotype for a given set of candidate genes.
Integrating Heterogeneous Data Sources of Biological Knowledge
Two distinct approaches to disease gene prioritization that exploit multiple data sources are highlighted in the following as representative examples (Figure 2). The first approach considers each data source separately when assessing the molecular and phenotypic relationships of candidate genes with the disease of interest, and aggregates the resulting multiple ranking lists into a final ranking of the candidates. The alternative approach combines all biological information into a network representation and subsequently applies network measures to score and rank candidates with regard to their network proximity to nodes representing known disease genes.
In detail, the prioritization method Endeavour56,102 utilizes more than 20 data sources such as ontologies and functional annotations, protein–protein interactions, cis-regulatory information, gene expression data, sequence information, and text-mining results. For each data source, candidate genes are first ranked separately based on their similarity to a profile derived from known disease genes. Afterwards, all individual candidate rankings are merged into a final overall ranking using rank order statistics. The authors showed that this approach is quite successful in finding potential disease genes as well as genes involved in specific pathways or biological functions. Recently, Endeavour has also been benchmarked using various disease marker sets and pathway maps103 to confirm that it performs very well if sufficient data is available for the disease or pathway of interest and the candidate genes. Furthermore, Li et al.57 proposed a discounted rating system, an algorithm for integrating multiple rank lists, and compared it with the rank aggregation procedure used by Endeavour.
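Endeavour's rank fusion relies on order statistics (a Q-statistic over rank ratios); a simpler stand-in that conveys the idea is to average the rank ratios across sources, as sketched below with made-up gene symbols:

```python
def aggregate_rankings(rankings):
    """Average the rank ratio of each gene across all sources; genes absent
    from a source are penalized with the worst possible ratio of 1.0."""
    genes = set().union(*rankings)
    def rank_ratio(gene, ranking):
        if gene not in ranking:
            return 1.0
        return (ranking.index(gene) + 1) / len(ranking)
    scored = {g: sum(rank_ratio(g, r) for r in rankings) / len(rankings)
              for g in genes}
    return sorted(scored, key=scored.get)  # best (lowest mean ratio) first

# Three hypothetical per-source rankings (best gene first)
rankings = [["GENE_A", "GENE_B", "GENE_C"],
            ["GENE_B", "GENE_A", "GENE_C"],
            ["GENE_A", "GENE_C", "GENE_B"]]
final = aggregate_rankings(rankings)
```

A gene ranked highly by most sources (GENE_A) rises to the top even if a single source disagrees, which is the robustness argument behind rank-based fusion.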
Like Endeavour, the method MetaRanker59 also combines many heterogeneous data sources and forms separate evidence layers from SNP-to-phenotype associations, candidate protein interactions, linkage study data, quantitative disease similarity, and gene expression information. For each layer, all genes in the human genome are ranked with regard to the probability of being associated with the phenotype of interest. The overall score of a gene is the product of its rank scores for each layer. The evaluation of MetaRanker indicates that it is particularly suited to uncover associations in complex polygenic diseases and that the integration of multiple data layers improves the identification of weak contributions to the phenotype of interest in comparison to the use of only a few data sources.
Another combination of network-based methods with score aggregation has been proposed by Chen et al.60 The authors generate an individual network for each data source and quantify potential gene–disease relationships in each network using a global network measure based on diffusion kernels. The final candidate ranking considers only the most informative network score for each candidate gene. Furthermore, an alternative way of integrating information from multiple data sources is the application of machine learning techniques. Here, each data source can be represented as one or more individual features and used as input for the training of supervised learning methods. In particular, support vector machines,61–63 decision-tree-based classifiers,64 and PU learning65 (machine learning from positive and unlabeled examples) have been applied to prioritize candidate disease genes using multiple data sources.
In contrast, one of the first alternative approaches that integrate information from multiple data sources into a network representation has been Prioritizer.24 Its authors constructed a comprehensive functional human gene network based on a number of datasets from molecular pathway and interaction databases such as KEGG,104 BIND,105 HPRD,106 Reactome107 as well as from GO annotations,80 yeast-2-hybrid screens, gene expression experiments, and protein interaction predictions. In this network, positional candidates from different disease loci are ranked according to the length of the shortest paths between them. In functional networks as used by Prioritizer, the main assumption is that relevant genes are involved in specific disease-related pathways and cluster together in the network even if their products are not closely linked by physical protein interactions.
Building upon Prioritizer, several research groups have assembled different types of integrated networks as biological evidence for candidate disease gene prioritization. One example is the two-layered network by Li et al.48 presented in the previous section that combines protein interactions and phenotypic similarity. Another method was presented by Linghu et al.,66 who employed naïve Bayes integration of diverse functional genomics datasets to generate a weighted functional linkage network and to prioritize candidate genes based on their shortest path distance to known disease genes. Similarly, Huttenhower et al.67 incorporated information from several thousand genomic experiments to generate a functional relationship network. From this network, the authors could derive functional maps of different phenotypes and showed in a case study for macroautophagy that these maps can be used successfully to find novel gene associations.
Recently, Lee et al.68 also provided a large-scale human network of functional gene–gene associations and evaluated the performance of six different network-based methods using it. Similar to the findings by Navlakha and Kingsford,44 the authors concluded that the strongest overall performance is achieved with algorithms that account for the global network structure such as Google’s PageRank. A more general view of the relationships between phenotypes and genes is introduced by BioGraph,69 a heterogeneous network containing diverse biomedical entities and relations between them, which are extracted from over 20 publicly available databases. By computing random walks on this network, the authors aim at the automated generation of functional hypotheses between different concepts, in particular, of candidate genes and diseases.
EVALUATION AND BENCHMARKING OF PRIORITIZATION METHODS
To show the biological applicability and scientific value of disease gene prioritization methods, their authors are normally expected to conduct an extensive performance evaluation and, if possible, a thorough comparison with other methods. To this end, authors usually benchmark their methods on disease phenotypes from OMIM. Depending on the requirements of a method, only phenotypes with at least two or three known disease genes may be suitable. Hence, the number of evaluated diseases can vary from tens to hundreds, with hundreds to thousands of corresponding genes. The range of disease phenotypes and genes for which a given method is applicable depends on the data used by the method. For instance, only about 10% of all human protein–protein interactions have probably been described so far,108 only about 10% of all human genes have at least one known disease association,101 and only about every second gene or protein is functionally annotated.109
Leave-one-out cross-validation is a widely used and generally accepted test of how a method might perform on previously unseen data. In each run, one of the known disease genes, the so-called target disease gene, is removed from the training data. The remaining disease genes are used to identify the omitted gene from a test set of genes that are not known to be associated with the disease of interest. In the best case, the top rank should be assigned to the target disease gene and lower ranks to the other test genes. Since cross-validation is a standard performance test, a number of suitable measures of predictive power exist, for example, sensitivity and specificity, receiver operating characteristic (ROC) curve, precision and recall, enrichment, and mean rank ratio (see Box 2). Unfortunately, none of these measures is considered the default, which renders the comparison between different methods of disease gene prioritization difficult. In particular, it would be useful to report the performance for the top-ranked candidate genes, e.g., the first 10 or 20 genes, because only a few candidates can usually be considered for further validation experiments.
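A minimal leave-one-out harness along these lines might look as follows, with a toy GO-annotation-overlap scorer standing in for a real prioritization method. Gene names and annotations are hypothetical.

```python
def leave_one_out_ranks(disease_genes, candidate_pool, prioritize):
    """For each known disease gene (the target), train on the remaining
    disease genes and record the rank the method assigns to the target
    within a test set of target + candidates (rank 1 is best)."""
    ranks = []
    for target in disease_genes:
        training = [g for g in disease_genes if g != target]
        test_set = [target] + [g for g in candidate_pool if g not in disease_genes]
        scored = prioritize(training, test_set)  # gene -> score, higher is better
        ordering = sorted(test_set, key=lambda g: scored[g], reverse=True)
        ranks.append(ordering.index(target) + 1)
    return ranks

# Illustrative scorer: overlap of (hypothetical) GO annotations with training genes.
annotations = {
    "D1": {"go:1", "go:2"}, "D2": {"go:2", "go:3"}, "D3": {"go:1", "go:3"},
    "C1": {"go:9"}, "C2": {"go:2"}, "C3": set(),
}

def go_overlap(training, test_set):
    train_terms = set().union(*(annotations[g] for g in training))
    return {g: len(annotations[g] & train_terms) for g in test_set}

ranks = leave_one_out_ranks(["D1", "D2", "D3"], ["C1", "C2", "C3"], go_overlap)
```

The list of ranks collected over all runs is exactly what the performance measures of Box 2 are computed from.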
Another important aspect of the benchmarking strategy is the choice of genes in the test set, i.e., the candidate genes that are prioritized together with the target disease gene. One common input for prioritization methods is a set of susceptibility loci as determined by GWA studies. These loci typically contain up to several hundreds of possible disease genes. Therefore, authors have followed different strategies to derive useful test sets, namely, the definition of artificial gene loci, the random selection of genes, the use of the whole genome, and the small-scale choice of genes.
Here, we briefly describe frequently used measures for evaluating the performance of disease gene prioritization methods. A simple measure is the mean rank ratio, defined as the average of rank ratios for all tested disease genes.110 One speaks of n/m-fold enrichment on average if disease genes are ranked in the top m% of all genes in n% of the linkage intervals.47 Other performance measures are calculated using the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) at a specific rank or score cut-off that separates predicted genes from unpredicted ones. Positives are disease genes, while negatives are candidate genes without disease association. For instance, sensitivity is the percentage of correctly identified disease genes among all genes above the cut-off (TP/(TP + FN)), while specificity is the percentage of correctly dismissed candidate genes among all genes below the cut-off (TN/(TN + FP)). Plotting sensitivity versus specificity while varying the cut-off yields a ROC curve. The area under the ROC curve (AUC) is a standard measure for the overall performance of binary classification methods (here, disease genes vs. others). The AUC is 100% in case of perfect prioritization and 50% if the disease genes were ranked randomly. In some cases, the authors of prioritization methods give the percentage of disease genes ranked in the top 1% and 5% of all genes, which corresponds to reporting the sensitivity at 99% or 95% specificity, respectively. The percentage of correctly prioritized disease genes among all predicted genes is defined as precision (TP/(TP + FP)), while recall is equal to sensitivity. Thus, the plot of a precision-recall curve can also be used to evaluate method performance.44
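The cut-off-based measures and the rank-based AUC described above can be computed for a toy ranking as follows; gene names and the ranking itself are hypothetical.

```python
def confusion_at_cutoff(ranking, positives, k):
    """Treat the top-k ranked genes as predictions and compute
    sensitivity, specificity, and precision at that cut-off."""
    predicted = set(ranking[:k])
    tp = len(predicted & positives)
    fp = len(predicted - positives)
    fn = len(positives - predicted)
    tn = len(ranking) - k - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity, precision

def rank_auc(ranking, positives):
    """AUC equals the probability that a randomly chosen disease gene
    is ranked above a randomly chosen negative candidate."""
    negatives = [g for g in ranking if g not in positives]
    wins = sum(1 for p in positives for n in negatives
               if ranking.index(p) < ranking.index(n))
    return wins / (len(positives) * len(negatives))

# Toy prioritization: two disease genes among five ranked candidates.
ranking = ["g1", "g2", "g3", "g4", "g5"]
disease = {"g1", "g3"}
sens, spec, prec = confusion_at_cutoff(ranking, disease, k=2)
auc = rank_auc(ranking, disease)
```

Sweeping the cut-off k over all ranks and plotting the resulting (sensitivity, specificity) pairs yields the ROC curve whose area the second function computes directly.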
Endeavour56 and several related methods61,103,111 were evaluated with a test set containing 99 candidate genes chosen at random from the whole genome in addition to the target disease gene. Other methods were also benchmarked with this strategy, primarily in order to be comparable with Endeavour.26,41,47,58,60,110,112 However, since similar genes tend to cluster in chromosomal neighborhoods,113 another, presumably more difficult, setting for performance benchmarking, and one especially relevant for GWA studies, is the definition of artificial linkage intervals with genes that surround the disease gene on the chromosome.22,24,26,35,40,45,47,48,50,53,57,60,76,112,114 The size of such intervals, as found in the relevant literature, ranges from the 100 nearest genes to 300 genes on average if a 10 Mb genomic neighborhood is considered.26 The average gene number of linkage intervals associated with diseases according to OMIM is estimated to be 108.35
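Constructing an artificial linkage interval around a target disease gene, as in the benchmarks above, can be sketched as taking the nearest chromosomal neighbors from a position-ordered gene list. The gene list below is hypothetical.

```python
def artificial_interval(genes_by_position, target, size=100):
    """Return the target gene plus its `size` nearest chromosomal
    neighbors, mimicking an artificial linkage interval.
    `genes_by_position` lists gene names ordered by start coordinate
    on one chromosome."""
    i = genes_by_position.index(target)
    half = size // 2
    lo = max(0, i - half)
    hi = min(len(genes_by_position), i + half + 1)
    return genes_by_position[lo:hi]

# Hypothetical ordered gene list for one chromosome.
chrom = [f"gene{n}" for n in range(300)]
interval = artificial_interval(chrom, "gene150", size=100)
```

The prioritization method is then asked to rank the target against the other genes in the interval, a harder task than ranking it against random genes because chromosomal neighbors tend to be functionally similar.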
The third option for assembling a test set is the use of all genes in the genome except for the known disease genes in the training set.44,47–49,110 This setting is chosen only by the few methods that are capable of performing genome-wide disease gene prioritization. Finally, prioritization methods that consider, for instance, gene expression data are evaluated only on a smaller scale because there is not enough data for a comprehensive benchmarking over many disease phenotypes.38,43,68,77,115–117 Therefore, the authors commonly choose only a few diseases that have, for example, the required experimental data available.
In this review, we gave an overview of different approaches to the prioritization of candidate disease genes. We described how disease genes can be identified by their molecular characteristics based on sequence properties, functional annotations, and network information. In particular, we presented recent approaches, which make use of phenotypic information and comprehensive knowledge integration. Finally, we discussed common benchmarking strategies of prioritization methods.
Many disease gene prioritization methods exploit discriminative gene and protein properties and successfully rank candidate genes according to their functional and phenotypic similarity or network proximity to known disease genes. Further improvement of the prioritization performance can be achieved by integrating biological information contained in multiple data sources. Many integrative methods first combine heterogeneous datasets and then apply specific analysis techniques. However, in the course of such analysis, the very useful insight into which data source provides the most relevant biological information for the prioritization is usually lost. Therefore, it is also beneficial to follow the alternative approach that first analyzes each data source separately using the most suitable techniques and then combines the resulting ranking lists using sophisticated rank aggregation algorithms. This procedure also facilitates backtracking the origin of the most relevant information.
Among the most widely used data sources for disease gene prioritization are functional annotations and protein interactions as well as phenotypic similarity. In particular, performance evaluations of methods such as Endeavour, MedSim, and Prioritizer consistently demonstrated that functional GO term annotations constitute one of the most useful biological evidence sources for candidate prioritization.24,26,56 Further performance gain can be attributed to comprehensive knowledge integration, which reduces the noise in the integrated data and provides additional information from data sources that is not (yet) captured by GO term annotations. Even greater performance increases can be expected as the data sources used become more complete and exhibit high quality without significant bias toward intensively studied genes and proteins.
Currently, the multitude of benchmarking strategies pursued by different researchers considerably hampers the performance comparison of disease gene prioritization methods. Moreover, some methods make use of only small test datasets due to the lack of the required training data and the limited amount of known disease genes. Nevertheless, established procedures to derive test sets and the application of different standard performance measures should form part of every benchmarking strategy to evaluate new prioritization methods comprehensively with respect to other well-performing methods. To facilitate future performance comparisons, the training and test datasets should always be made publicly available together with the published work. In the end, since follow-up validation experiments tend to be expensive and time-consuming, it is vital that the correct disease genes are found on the few top ranks of the prioritization list.
Part of this study was financially supported by the BMBF through the German National Genome Research Network (NGFN) and the Greifswald Approach to Individualized Medicine (GANI_MED). The research was also conducted in the context of the DFG-funded Cluster of Excellence for Multimodal Computing and Interaction.
The rapid development of high-throughput technologies and computational frameworks enables the examination of biological systems in unprecedented detail. The ability to study biological phenomena at omics levels is, in turn, expected to lead to significant advances in personalized and precision medicine, in which patients are treated according to their own molecular characteristics. Individual omes as well as the integrated profiles of multiple omes, such as the genome, the epigenome, the transcriptome, the proteome, the metabolome, the antibodyome, and other omics information, are expected to be valuable for health monitoring, preventative measures, and precision medicine. Moreover, omics technologies have the potential to transform medicine from traditional symptom-oriented diagnosis and treatment of diseases toward disease prevention and early diagnostics. We discuss here the advances and challenges in systems biology-powered personalized medicine at its current stage, and close with a prospective view of future personalized health care. WIREs Syst Biol Med 2013, 5:73–82. doi: 10.1002/wsbm.1198
Conflict of interest: M.S. serves as founder and consultant for Personalis, a member of the scientific advisory board of GenapSys, and a consultant for Illumina.
For further resources related to this article, please visit the WIREs website.
Personalized or precision medicine is expected to become the paradigm of future health care, owing to the substantial improvement of high-throughput technologies and systems approaches in the past two decades.1,2 Conventional symptom-oriented disease diagnosis and treatment has a number of significant limitations: for example, it focuses only on late/terminal symptoms and generally neglects preclinical pathophenotypes or risk factors; it generally disregards the underlying mechanisms of the symptoms; the disease descriptions are often quite broad, so that they may actually include multiple diseases with shared symptoms; and the reductionist approach to identifying therapeutic targets in traditional medicine may over-simplify the complex nature of most diseases.3 Advances in the ability to perform large-scale genetic and molecular profiling are expected to overcome these limitations by addressing individualized differences in diagnosis and treatment in unprecedented detail.
The rapid development of high-throughput technologies also drives modern biological and medical research from traditional hypothesis-driven designs toward data-driven studies. Modern high-throughput technologies, such as high-throughput DNA sequencing and mass spectrometry, have enabled the facile monitoring of thousands of molecules simultaneously, instead of just the few components analyzed in traditional research, thus generating a huge amount of data to document the real-time molecular details of a given biological system. Ultimately, when enough knowledge is gained, these molecular signatures, as well as the biological networks they form, may be associated with the physiological state/phenotype of the biological system at the very moment the sample is taken.
Future personalized health care is expected to benefit from the combined personal omics data, which should include genomic information as well as longitudinal documentation of all possible molecular components. This combined information not only determines the genetic susceptibility of the person, but also monitors his/her real-time physiological states, as our integrative Personal Omics Profile (iPOP) study exemplified.4 In this review we will cover recent advances in systems biology and personalized medicine. We will also discuss limitations and concerns in applying omics approaches to individualized, precision health care.
GENOMICS IN DISEASE-ORIENTED MEDICINE
The revolution of omics profiling technologies significantly benefited disease-oriented studies and health care, especially in disease mechanism elucidation, molecular diagnosis, and personalized treatment. These new technologies greatly facilitated the development of genomics, transcriptomics, proteomics, and metabolomics, which have become powerful tools for disease studies. Today, molecular disease analyses using large-scale approaches are pursued by an increasing number of physicians and pathologists.5,6
Initially, genome-wide association studies (http://gwas.nih.gov/) were launched in search of associations of common genetic variants with certain phenotypes of interest; they typically assayed more than 500,000 single nucleotide polymorphisms (SNPs) and/or copy number variations (CNVs) with DNA microarrays in thousands to hundreds of thousands of participants.7 To date, 1,355 publications are listed in the National Human Genome Research Institute (NHGRI) GWAS Catalog, reporting the association of 7,226 SNPs with 710 complex traits.7 The studied complex traits vary vastly, from cancers (e.g., prostate cancer and breast cancer) and complex diseases (e.g., type 1 and type 2 diabetes (T2D), Crohn’s disease) to common traits (e.g., height and body mass index). These findings greatly broadened our knowledge of disease loci and can potentially benefit disease risk prediction and drug treatments (as discussed in the section Integrative Omics in Preventative Medicine). Although powerful, GWA studies have proven difficult for most complex diseases, as typically a large number of loci are identified, each contributing a small fraction of the genetic risk. These studies have many limitations, including the small fraction of the genome that is analyzed and the failure to account for gene–gene interactions, epistasis, and environmental factors.8
Whole genome sequencing (WGS) and whole exome sequencing (WES) have become more and more affordable for genomic studies and are rapidly replacing DNA microarrays. Single-base analysis of a genome/exome can be achieved, which allows scientists to investigate the genetic basis of health and disease in unprecedented detail. Assigning variants to paternal and maternal chromosomes, i.e., ‘phasing’, can be accomplished through the analysis of families9 or other methods.1,10,11 With the generation of massive amounts of whole genome and exome data from diseased and healthy populations, the understanding of both human population variation and genetic diseases, especially complex diseases, has been brought to a new level.1,12
One field that has significantly benefited from WGS technologies is cancer-related research. A large number of cancer genomes have been sequenced through individual or collaborative efforts, such as the International Cancer Genome Consortium (http://www.icgc.org/) and the Cancer Genome Atlas (http://cancergenome.nih.gov/). The DNA from many types of cancer has been sequenced, including breast cancer,13–15 chronic lymphocytic leukaemia,16 hepatocellular carcinoma,17 pediatric glioblastoma,13 melanoma,18 ovarian cancer,19 small-cell lung cancer,20 and Sonic-Hedgehog medulloblastoma,21 and databases have been established, such as the Cancer Cell Line Encyclopedia.22 In addition, cancer genomes have also been investigated at the single-cell level by WES for clear cell renal cell carcinoma23 and JAK2-negative myeloproliferative neoplasm.24 Somatic mutations and subtyping molecular markers were identified from these genomes. These different studies have revealed that nearly every tumor is different, with distinct types of potential ‘driver’ mutations. Importantly, cancer genome sequencing often reveals potential targets that may suggest precision cancer treatment for specific patients. As an example, a novel spontaneous germline mutation in the p53 gene was identified by WGS in a female patient, which accounted for the three types of cancer she developed in merely 5 years.25 An attempt has also been made recently to treat a female patient with T-cell lymphoma based on the target gene, CTLA4, identified by whole genome sequencing.26 The patient’s cancer was suppressed for two months with the anti-CTLA4 drug ipilimumab, although she died of recurrence soon after.
Whole genome and exome sequencing can also facilitate the identification of possible causal genes for hereditary genetic diseases and are increasingly used in attempts to understand the basis of these ‘mystery diseases’ once obvious candidates are ruled out. In one successful example, whole genome sequencing of a fraternal twin pair with dopa (3,4-dihydroxyphenylalanine)-responsive dystonia helped identify a pair of compound heterozygous mutations in the gene SPR that accounted for the disease in both individuals.27 Importantly, based on the genome information, the authors supplemented the l-dopa therapy with 5-hydroxytryptophan (an SPR-dependent serotonin precursor) and significantly improved the health of both patients. In another example, Roach et al. sequenced the whole genomes of a family quartet and identified rare mutations in the genes DHODH and DNAH5 responsible for the two recessive disorders in both children, Miller syndrome and primary ciliary dyskinesia.28
Pharmacogenomics is another important application of genomic sequencing. It is known that the same drug may have different effects on different individuals due to their personal genomic backgrounds and living habits.8,29 Genetic information can be used to assign drug doses and reduce side effects. For example, genetic variants are known to affect patients’ response to antipsychotic drugs.30 Based on pharmacogenomic trials, genetic tests for four drugs are required by the US Food and Drug Administration (FDA) before their administration to patients, including the anti-cancer drugs cetuximab, trastuzumab, and dasatinib, and the anti-HIV drug maraviroc; more are recommended, such as the anticoagulant warfarin and the anti-HIV drug abacavir.8
OTHER OMICS TECHNOLOGIES AND MEDICINE
Other omics technologies are also likely to impact medicine. High-throughput sequencing technologies have enabled whole transcriptome (cDNA) sequencing, abbreviated as RNA-Seq.31 RNA-Seq has become a powerful tool for disease-related studies, as it has greater accuracy and sensitivity than microarray technology and can also detect splicing isoforms.32 As RNA profiles reflect actual gene activity, they are closer to the real phenotype than the genomic sequence is. With RNA-Seq, Shah et al. discovered varied clonal preference and allelic abundance in 104 cases of primary triple-negative breast cancers, and observed that ∼36% of the genomic mutations were actually expressed.33 Combining such information with genomic information may be valuable in the treatment of cancer and other diseases. Moreover, RNA-Seq also captures more complex aspects of the transcriptome, such as splicing isoforms34 and editing events,35 which are generally overlooked by hybridization-based methods. Splicing variants have now been associated with several distinct types of cancer and with cancer prognosis.36–40
Although proteins have long been deemed the executors of most biological functions, clinical proteomics is still a relatively young field due to the technological limitations of profiling the complexity of the proteome with high sensitivity and accuracy. Since the development of new soft desorption methods that enabled the analysis of biological macromolecules with mass spectrometry, proteomics has advanced significantly in the past decade.41,42 With current mass spectrometry technology, one can now quantify thousands of proteins in a single sample. For example, we were able to reliably detect 6,280 proteins in the human peripheral blood mononuclear cell proteome.4 Mass spectrometry also allows the detection of expressed mutations, allele-specific sequences, and editing events in the human proteome,4,43 as well as profiling of the phosphoproteome.44 Also of note is the MALDI-TOF (matrix-assisted laser desorption/ionization-time of flight) mass spectrometry-based imaging technology (MALDI-MSI) developed by Cornett et al., which allows spatial proteome profiling in defined two-dimensional laser-shot areas using tissue sections.45 Using MALDI-MSI, Kang et al. identified immunoglobulin heavy constant α2 as a novel potential marker for breast cancer metastasis.46
The field of metabolomics has also advanced significantly with the improvement of mass spectrometry. Both hydrophilic and hydrophobic metabolites can be profiled in specific samples.4,47 As the metabolome reflects the real-time energy status as well as the metabolism of the living organism, it is expected that certain metabolome profiles may be associated with different diseases.48 Therefore, metabolomic profiles are becoming an important aspect of personalized medicine.49,50 Jamshidi et al. profiled the metabolome of a female patient with Hereditary Hemorrhagic Telangiectasia (HHT) along with four healthy controls and identified differences that highlighted the nitric oxide synthase pathway.51 The authors then treated the patient with bevacizumab, which shifted her metabolomic profile toward those of the healthy controls and improved the patient’s health. In addition, branched-chain amino acids such as isoleucine have been associated with T2D and may ultimately prove to be valuable biomarkers.52 Finally, since some metabolites bind and directly regulate the activity of other biomolecules (e.g., kinases),53 there is significant potential to modulate cellular pathways using diet and metabolic analogs that serve as agonists or antagonists of protein function.
INTEGRATIVE OMICS IN PREVENTATIVE MEDICINE
The concept of personalized medicine emphasizes not only personalized diagnosis and treatment, but also personalized disease susceptibility assessment, health monitoring, and preventative medicine. Because disease is easier to manage prior to its onset or at its early stages, risk assessment and early detection will be transformative in personalized medicine. Systems biology has the potential to capture real-time molecular phenotypes of a biological system, which enables the detection of subtle network perturbations that precede the actual development of clinical symptoms.
Disease susceptibility and drug response can be assessed with a person’s genomic information.8 This information may serve as a guideline for monitoring the health of a particular patient to achieve personalized health care, as showcased by Ashley et al.54 Whole genome sequencing revealed variants both for high-penetrance Mendelian disorders, such as HTT (Huntington’s disease55) and PAH (phenylketonuria56), and for common, complex diseases, such as the disease-associated genetic variants reported in GWA studies.57 Disease risks can be evaluated for a given person, and an increase or decrease in disease risk compared with the population risk (of the same ethnicity, age, and gender) can be estimated (Figure 1). In the study by Ashley et al., the genome of a patient was analyzed and increased post-test probability risks for myocardial infarction and coronary artery disease were estimated.54 Their estimation matched the fact that the patient, although generally healthy, had a family history of vascular disease as well as early sudden death.58 Genetic variants associated with heart-related morbidities as well as drug response were identified in the patient’s genome, information that, as the authors stated, may direct the future health care for this particular patient. Similarly, Dewey et al. further extended this work by analyzing a family quartet using a major allele reference sequence and identified high-risk genes for familial thrombophilia, obesity, and psoriasis.59
To further explore the variation and power of the full human genome, projects and databases (such as the Personal Genome Project60) are being launched to help advance this field. However, genomic information alone is usually not adequate to predict disease onset, and other factors such as the environment are expected to play a critical role in this process.61,62 The predictive capability of whole genome sequence was assessed by Roberts et al. by modeling 24 disease risks in monozygotic twins.63 For each disease, the authors modeled the genotype distribution in the twin population according to the observed concordance/discordance, and discovered that most individuals would test negative for elevated relative risk of most diseases, and that, even in the best-case scenario, any given individual could be forewarned of only one or a few diseases. The results of Roberts et al. are not surprising, as disease manifestation is probabilistic and not deterministic. Nonetheless, whole genome information by itself is expected to have partial value in disease prediction for complex diseases. In addition, from a systems point of view, peripheral components of the biological network are more likely to contribute to complex diseases, as perturbation of the main nodes, which usually represent essential genes, would be lethal.64 Therefore, it is more difficult to identify the exact contributors to complex diseases. Moreover, as stated above, non-genomic factors may also exist and further complicate the situation. As an example, multiple sclerosis is known to have genetic components; however, Baranzini et al. failed to identify genomic, epigenomic, or transcriptomic contributors in discordant monozygotic twins, which may indicate the existence of other factors, such as the environment.65
Current technologies, especially high-throughput sequencing and mass spectrometry, enable the monitoring of at least 10^5 molecular components, including DNA, RNA, proteins, and metabolites, in the human body. It is therefore now feasible to identify the profiles of these components that correlate with various physiological states of the body, and the profile alterations that result from physiological state changes and diseases. Compared with genomic sequences alone, the profiles of the transcriptome, proteome, and metabolome are closer indicators of the real-time phenotype; therefore, collecting this omics information in a longitudinal manner allows monitoring of an individual’s physiological states. To test this concept, we implemented a study following a generally healthy participant for 14 (now 32) months with integrative Personal Omics Profile (iPOP) analysis, incorporating information from the participant’s genome with longitudinal data from the person’s transcriptome, proteome, metabolome, and autoantibodyome.4 As blood constantly circulates through the human body, exchanges biological material with local tissues, and is routinely analyzed in medical tests, we chose to monitor the participant’s physiological states by profiling the blood components (PBMCs, serum, and plasma) with iPOP analysis. The genome of this individual was sequenced with two WGS (Illumina and Complete Genomics) and three WES (Agilent, Roche Nimblegen, and Illumina) platforms to achieve high accuracy, and was further analyzed for disease risks and drug efficacy. The identified elevated risks included coronary artery disease, basal-cell carcinoma, hypertriglyceridemia, and T2D, and the participant was estimated to have a favorable response to rosiglitazone and metformin, both antidiabetic medications.
Although the participant had a known family history for some of the high-risk diseases (but not T2D), he was free from most of them (except for hypertriglyceridemia, for which he used medication) and had a normal body mass index at the start of our study. Nonetheless, these elevated disease risks served as a guideline for monitoring his personal health with iPOP analysis. We profiled the transcriptome, proteome, and metabolome at 20 time points over the 14 months and monitored molecular profile changes during physiological state change events, including two viral infections. The subject also acquired T2D during the study, immediately after one of the viral (respiratory syncytial virus) infections. Two types of changes were observed in our iPOP data: autocorrelated trends that reflect chronic changes, and spikes, which include significantly up/down-regulated genes and pathways, especially at the onset of each event. With our iPOP approach, we acquired a comprehensive picture of the detailed molecular differences between different physiological states, as well as during disease onset. In particular, interesting changes in glucose and insulin signaling pathways were observed during the onset of T2D. We also obtained other important information from our omics data, such as dynamic changes in allele-specific expression and RNA-editing events, as well as personalized autoantibody profiles. Overall, this study demonstrated an important application of genomics and other omics profiling for personalized disease risk estimation and precision medicine, as we discovered the increased T2D risk, monitored its early onset, and helped the participant effectively control and eventually reverse the phenotype through proactive interventions (diet change and physical exercise).
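Detecting the spike-type changes described above can be illustrated with a crude z-score outlier test on one molecular time series. This is an illustrative sketch only, not the actual analysis of the iPOP study, and the expression values are hypothetical.

```python
def spikes(series, z=2.0):
    """Flag time points whose value deviates from the series mean by
    more than z standard deviations -- a crude stand-in for the spike
    detection described in the text."""
    n = len(series)
    mean = sum(series) / n
    sd = (sum((x - mean) ** 2 for x in series) / n) ** 0.5
    return [i for i, x in enumerate(series) if abs(x - mean) > z * sd]

# Hypothetical normalized expression of one gene across 10 time points,
# with a spike at the onset of an infection (index 6).
expr = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 3.5, 1.0, 1.1, 0.9]
hit = spikes(expr)
```

Autocorrelated trends, the other change type, would instead require comparing each point to a sliding window of its recent history rather than to the global mean.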
Another important feature of our study is that samples are collected in a longitudinal fashion so that aberrant/disease states can be compared to healthy states of the same individual. One other advantage of our iPOP approach is its modularity, as other omics and quantifiable information can also be included in the iPOP profile, which can be readily tailored to monitor any biological or pathological event of interest (Figure 2). Examples of other information are: epigenome,66 gut microbiome,67 microRNA profiles68 and immune receptor repertoire.69 Moreover, quantifiable behavioral parameters such as nutrition, exercise, stress control and sleep may also be added to the profile.70
THE IMPORTANCE OF DATA MINING AND RE-MINING
One important aspect of systems biology is data mining. Data management and access can become a daunting task given the tremendous amount of data generated with current high-throughput technologies, and the data size is constantly increasing with time.71 Computational challenges exist at each step to handle, process, and annotate high-throughput data, integrate data from different sources and platforms, and pursue clinical interpretation of the data.72 These steps can be quite computationally intensive and require significant computational hardware; for example, mapping short reads to achieve 30× coverage of the human genome typically requires 13 CPU days,72 although these times are rapidly decreasing. Moreover, as a biological system acts as more than just the sum of its individual parts, knowledge from multiple levels (such as epistasis, interaction, localization, and activation status) should be considered to capture the underlying highly organized networks for functional annotations.73 Ultimately, it will be important to have a comprehensive database that contains electronic health records (including treatment information), genome sequences with variant calls, and as much molecular information as possible. In principle, with appropriate algorithms, such a database could be mined by physicians to make data-driven medical decisions.
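The scale behind a figure such as 30× coverage follows from the standard coverage relation, coverage = (number of reads × read length) / genome size. A quick sketch, using a rough 3.2 Gb human genome size and 100 bp reads as assumed example values:

```python
def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of reads required for a target average coverage,
    using the standard relation coverage = N * L / G."""
    return int(coverage * genome_size_bp / read_length_bp)

# 30x coverage of a ~3.2 Gb genome with 100 bp short reads.
n = reads_needed(3_200_000_000, 100, 30)
```

Nearly a billion short reads must be mapped, which makes the CPU-day-scale alignment cost quoted above unsurprising.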
Many current high-throughput datasets of similar types (e.g., expression and genome-wide association data collected from different populations with the same disease) were generated in smaller, separate studies. Combining these publicly available datasets bioinformatically may therefore provide more statistical power and lead to clearer conclusions than could be reached in the individual studies. The work by Roberts et al. mentioned above serves as one example.63 To test the capacity of whole-genome information, the authors combined monozygotic twin-pair data from a total of five sources in 13 publications to obtain a much larger dataset for their test. Similarly, Butte and colleagues combined the results of 130 functional microarray experiments for T2D and re-mined the data for repeatedly appearing candidate genes.74 They identified CD44 as the top candidate gene associated with T2D. In a related effort, by analyzing curated data of 2,510 individuals from 74 populations, the group led by Butte also discovered that T2D risk alleles are unevenly distributed across human populations, with higher risk in African and lower risk in Asian populations.75
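One standard way to pool evidence across independent studies of this kind is Fisher's method, which combines per-study p-values into a single statistic. The sketch below is a generic illustration with made-up p-values, not the procedure used in the cited studies:

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null; returns the combined p-value."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    df = 2 * len(p_values)
    # The chi-square survival function has an exact closed form for even df.
    k, term, total = df // 2, 1.0, 1.0
    for i in range(1, k):
        term *= (stat / 2) / i
        total += term
    return math.exp(-stat / 2) * total

# Three weak, independent signals combine into a stronger one (~0.016),
# smaller than any of the individual p-values.
print(fisher_combined_p([0.04, 0.09, 0.11]))
```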
CONCERNS AND LIMITATIONS
Personalized health monitoring and precision medicine are accelerating at a rapid pace thanks to developments in systems biology. As noted above, multiple efforts in both technology development and biological application have been made, and an increasing number of researchers and physicians alike share this vision. Hood et al. termed this approach 'P4 Medicine', for predictive, preventive, personalized and participatory medicine.12
Nevertheless, many concerns also exist, and guidelines on translational omics research have been recommended by the Institute of Medicine.76 Khoury et al. suggested that 'a fifth P', the population perspective, be added to personalized medicine77 and that systems-level results be validated in populations with strong evidence before clinical application. Many disease-associated genetic variants discovered in GWAS still need to be functionally validated.78 Khoury et al. also raised the concern that limited health care resources might be wasted on unneeded disease screening or subclassification with systems approaches, rather than lowering health care costs. However, with the rapid drop in technology costs and carefully designed pilot studies, the optimal screening frequencies and levels of subclassification necessary for precision medicine could be determined and costs kept at affordable levels. It is worth noting that generating personalized omics data with appropriate interpretation can greatly benefit our understanding of physiological events in health and disease, and precision health care, as we gain more knowledge in this field. In addition to personalized diagnosis and treatment, the future of precision medicine with omics approaches should emphasize personalized health monitoring, molecular symptoms, early detection and preventive medicine, a paradigm shift from traditional health care.
As the human body is a highly organized, complex system with multiple organs and tissues, it is important to select the correct sample type for studying a specific biological problem. However, because many sample types are unavailable (e.g., brain tissue) or not regularly accessible (e.g., biopsy samples from internal organs) in living individuals, the scope of personalized health monitoring is restricted. Therefore, systems biology results, especially iPOP results, should not be over-interpreted. Although iPOP data from blood components may indicate changes in other parts of the human body, the actual profiles of the tissue of interest might be underrepresented in blood or delayed in phase.
It is still not clear who will develop and deliver personalized treatments if they are not available as conventional medication. The cost of developing personalized drugs that accurately address individual specificity may become prohibitive, and such drugs may face further hurdles such as Food and Drug Administration approval. However, advances in high-throughput drug discovery will help accelerate this field.
In addition, personalized medicine using omics approaches relies heavily on technology development for biological research. This includes advances in both research instrumentation and computational frameworks. For example, it is still not possible to accurately determine the entire sequence of a genome owing to limitations of current WGS/WES methods,79,80 even after computational improvement of the signal-to-noise ratio.81,82 Low sequencing error rates are claimed by both the Illumina HiSeq (for 2 × 100 bp reads, more than 80% of the bases have a quality score above Q30, or 99.9% accuracy, http://www.illumina.com/documents//products/datasheets/datasheet_hiseq_systems.pdf) and the Complete Genomics platform (1 × 10^−5 at the time of our study80 and 2 × 10^−6 as of October 8th, 2012, www.completegenomics.com); however, the per-variant error rate is still high (15.50% and 9.08% for Illumina and Complete Genomics, respectively, with no filter, and 1.01% and 1.12% after multiple filters), as reported by Reumers et al.,81 which agrees with our observation that only 88.1% of SNP calls overlapped when the same genome was sequenced on the two platforms.80 Thus, possible disease-associated variants in platform-specific regions might be overlooked or misinterpreted. Another issue lies in the storage and processing of omics data: petabytes of data can easily be generated by a small iPOP study of 200 participants, and demanding computing resources will be needed for data analysis. Therefore, interdisciplinary efforts by biologists, computer scientists and hardware engineers should be organized to ensure the continued improvement of this field.
The era of personalized precision medicine is about to emerge. The steady improvement of high-throughput technologies greatly facilitates this process by enabling the profiling of various omes, such as the whole genome, epigenome, transcriptome, proteome and metabolome, which convey detailed information about the human body. Integrated profiles of these omes should reflect the physiological status of the host at the time the samples are collected. The personalized omics approach catalyzes precision medicine at two levels: for diseases and biological processes whose mechanisms are still unclear, it will facilitate research that greatly advances our understanding; and once the mechanisms are clarified, individualized health care can be provided through health monitoring, preventive medicine and personalized treatment. This would be especially helpful for complex diseases such as autism83 and Alzheimer's disease,84 in which multiple factors are responsible for the phenotypes. Furthermore, the omics approach also facilitates the development of other less-emphasized but important health-related fields, such as nutritional systems biology, which studies personalized diet and its relationship to health from a systems point of view.85 With the rapid decrease in the cost of omics profiling, we anticipate an increasing number of personalized medicine applications in many aspects of health care beyond our proof-of-principle study. This will significantly improve the health of the general public and cut health care costs. Scientists, governments, pharmaceutical companies and patients should work closely together to ensure the success of this transformation.86
This work is supported by funding from the Stanford University Department of Genetics and the National Institutes of Health. We thank Drs. George I. Mias and Hogune Im for their help in proofreading the article and for insightful discussions.
Preparing scientists to harness and apply the power of data to understand the complexities of human biology
Bioinformatics (BI) is a curriculum pathway within the Biological and Medical Informatics Graduate Program at the University of California, San Francisco (UCSF). The program offers two optional designated emphases in:
- Computational Biology and Bioinformatics (CBBI).
- Complex Biological Systems (CBS).
We prepare scientists to use tools from mathematics to physics and from chemistry to biology to gather, store, analyze, predict, and disseminate information about biology. The field is essential: without quantitative analysis, the massive and growing amounts of biological data generated by systems biology and -omics cannot be interpreted or exploited.
The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester.
On our website, you can find pointers to sources of information about text mining such as links to
- text mining services provided by NaCTeM
- software tools, both those developed by the NaCTeM team and by other text mining groups
- seminars, general events, conferences and workshops
- tutorials and demonstrations
- text mining publications
NaCTeM Software Tools
The National Centre for Text Mining bases its service systems on a number of text mining software tools.
- Part-of-speech (POS) taggers
- Named entities/terms
- AnatomyTagger — an open-source entity mention tagger for anatomical entities
- Named-entity Recognizer — Part of the GENIA Tagger
- NEMine — Recognizes gene/protein names in text.
- Yeast MetaboliNER — Recognizes yeast metabolite names in text.
- ACELA — Tool for efficient annotation of named entities
- Smart dictionary lookup — machine learning-based gene/protein name lookup
- Smart Dictionary Lookup Tool Web Service — Looks up term variations of a given gene/protein name based on an automatically trained similarity measure
- Term Normalization Tool — Normalizes terms with string rewriting rules automatically generated based on a dictionary.
- DECA — A species disambiguation system for biological named entities
- RF-TermAlign — a bilingual dictionary extraction tool that uses a Random Forest method to learn string similarity of terms between a source and target language.
- Other tools
- EventMine — A machine learning-based event extraction system.
- brat — A free, open-source, web-based tool for text annotation visualisation and editing.
- Cafetiere — An easy-to-use system for carrying out text mining on your own document collection
- Sentence and paragraph breaker — An accurate sentence and paragraph detector based on heuristic rules
- Clinical Document Classification — automatic document classification demo
- Sentiment Analysis Tool — Analyses sentiment of input text.
| Database | URL | Formats |
| --- | --- | --- |
| KEGG1 | http://www.genome.jp/kegg/ | BioPAX, png, KGML |
| Reactome2 | http://www.reactome.org/ | BioPAX, png, pdf |
| Pathway Commons7 | http://www.pathwaycommons.org/ | BioPAX, Sif, png |
| PANTHER pathway3 | http://www.pantherdb.org/pathway/ | BioPAX, SBML |
| WikiPathways5 | http://www.wikipathways.org/ | BioPAX, svg, png, pdf, gpml |
| Nature/NCI Pathway Interaction Database63 | http://pid.nci.nih.gov/ | BioPAX, jpg, svg |
| BioCyc4 | http://biocyc.org/ | BioPAX, png, SBML |
| INOH84 | http://inoh.hgc.jp/ | BioPAX, INOH (xml) |
| NetPath85 | http://www.netpath.org/ | BioPAX, SBML, PSI-MI |
| PharmGKB86 | http://www.pharmgkb.org/ | BioPAX, pdf, gpml |
Abbreviations: BioPAX, Biological Pathway Exchange; KGML, KEGG Markup Language; PSI-MI, Proteomics Standards Initiative Molecular Interaction; SBML, Systems Biology Markup Language; NCI, National Cancer Institute; INOH, Integrating Network Objects with Hierarchies; PharmGKB, Pharmacogenomics Knowledge Base; KEGG, Kyoto Encyclopedia of Genes and Genomes.
Tools for visualization and analysis of molecular networks, pathways, and -omics data
Welcome to the Biological General Repository for Interaction Datasets
BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Our current index is version 3.4.137; it searches 56,733 publications for 1,067,443 protein and genetic interactions, 27,501 chemical associations and 38,559 post-translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats.
© STRING CONSORTIUM 2016
STRING is a database of known and predicted protein-protein interactions. The database contains information from numerous sources, including experimental repositories, computational prediction methods and public text collections. STRING is regularly updated and gives a comprehensive view of the protein-protein interactions currently available.
Pathway Viewer Web
PCViz is an open-source web-based network visualization tool that helps users query Pathway Commons and obtain details about genes and their interactions extracted from multiple pathway data resources.
It allows interactive exploration of the gene networks where users can:
- expand the network by adding new genes of interest
- reduce the size of the network by filtering genes or interactions based on different criteria
- load cancer context to see the overall frequency of alteration for each gene in the network
- download networks in various formats for further analysis or use in publication
BioPAX Editor Desktop
Ethan G. Cerami, Benjamin E. Gross, […], and Chris Sander
Pathway Commons (http://www.pathwaycommons.org) is a collection of publicly available pathway data from multiple organisms. Pathway Commons provides a web-based interface that enables biologists to browse and search a comprehensive collection of pathways from multiple sources represented in a common language, a download site that provides integrated bulk sets of pathway information in standard or convenient formats and a web service that software developers can use to conveniently query and access all data. Database providers can share their pathway data via a common repository. Pathways include biochemical reactions, complex assembly, transport and catalysis events and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons aims to collect and integrate all public pathway data available in standard formats. Pathway Commons currently contains data from nine databases with over 1400 pathways and 687 000 interactions and will be continually expanded and updated.
| Data Source | Format | Size | Updated | Focus (species) | Reference or URL |
| --- | --- | --- | --- | --- | --- |
| BioGRID | PSI–MI 2.5 | 347 508 interactions | August 2010 (3.0.67) | Model organisms | (20) |
| Cancer Cell Map | BioPAX L2 | 10 pathways | May 2006 | Human | http://cancer.cellmap.org |
| HPRD | PSI–MI 2.5 | 40 618 interactions | 13 April 2010, Version 9 | Human | (21) |
| HumanCyc | BioPAX L2 | 266 pathways | 16 June 2010, Version 14.1 | Human | (22) |
| IMID | BioPAX L2 | 1729 interactions | March 2009 | Human | http://www.sbcny.org/ |
| IntAct | PSI–MI 2.5 | 154 567 interactions | 8 August 2010, Version 3.1, r14760 | All | (23) |
| MINT | PSI–MI 2.5 | 117 202 interactions | 28 July 2010 | All | (24) |
| NCI/Nature PID | BioPAX L2 | 186 pathways, 13 879 interactions | 10 August 2010 | Human | (25) |
| Reactome | BioPAX L2 | 1015 pathways | 18 June 2010, Version 33 | Human | (5) |
| All Integrated | BioPAX L2 | 1477 pathways, 687 883 interactions | Multiple | | http://www.pathwaycommons.org |
New sources are periodically added and listed on the Pathway Commons website. Note that pathway and interaction statistics represent non-unique counts from source databases, as these records are not currently merged from multiple sources (only molecules are currently merged).
Data Sources (http://www.pathwaycommons.org/pc2/datasources)
Warehouse data (canonical molecules, ontologies) are converted to BioPAX utility classes, such as EntityReference, ControlledVocabulary, EntityFeature sub-classes, and saved as the initial BioPAX model, which forms the foundation for integrating pathway data and for id-mapping.
Pathway and binary interaction data (interactions, participants) are normalized next and merged into the database. Original reference molecules are replaced with the corresponding BioPAX warehouse objects.
Links to the access summary for Warehouse data sources are not provided below; however, the total number of requests minus errors will be a fair estimate. Access statistics are computed from January 2014, except for unique IP addresses, which are computed from November 2014.
The Pathway Commons team greatly appreciates the fundamental contributions of all the data providers, authors, Identifiers.org, the open biological ontologies, and the open-source projects and standards that made the creation of this integrated BioPAX web service and database feasible.
ConsensusPathDB—a database for integrating human functional interaction networks
The MIPS Mammalian Protein-Protein Interaction Database
The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.
Other PPI resources
There are plenty of interesting databases and other sites on protein-protein interactions. Currently we are aware of the following PPI resources:
| Resource | Description |
| --- | --- |
| APID | Agile Protein Interaction DataAnalyzer (Cancer Research Center, Salamanca, Spain) |
| BIND | Biomolecular INteraction Network Database at the University of Toronto, Canada. No species restriction |
| CYGD | PPI section of the Comprehensive Yeast Genome Database. Manually curated comprehensive S. cerevisiae PPI database at MIPS |
| DIP | Database of Interacting Proteins at UCLA. No species restriction |
| GRID | General Repository for Interaction Datasets. Mount Sinai Hospital, Toronto, Canada |
| HIV Interaction DB | Interactions between HIV and host proteins |
| HPRD | The Human Protein Reference Database. Institute of Bioinformatics, Bangalore, India and Johns Hopkins University, Baltimore, MD, USA |
| HPID | Human Protein Interaction Database. Department of Computer Science and Information Engineering, Inha University, Inchon, Korea |
| iHOP | iHOP (Information Hyperlinked over Proteins). Protein association network built by literature mining |
| IntAct | Protein interaction database at EBI. No species restriction |
| InterDom | Database of putative interacting protein domains. Institute for InfoComm Research, Singapore |
| JCB | PPI site at the Jena Centre for Bioinformatics, Germany |
| MetaCore | Commercial software suite and database. Manually curated human PPIs (among other things). GeneGo |
| MINT | Molecular INTeraction database at the Centro di Bioinformatica Molecolare, Università di Roma, Italy |
| MRC PPI links | Commented list of links to PPI databases and resources maintained at the MRC Rosalind Franklin Centre for Genomics Research, Cambridge, UK |
| OPHID | The Online Predicted Human Interaction Database. Ontario Cancer Institute and University of Toronto, Canada |
| Pawson Lab | Information on protein-interaction domains |
| PDZbase | Database of PDZ-mediated protein-protein interactions |
| Predictome | Predicted functional associations and interactions. Boston University |
| Protein-Protein Interaction Server | Analysis of protein-protein interfaces of protein complexes from PDB. University College London, UK |
| PathCalling | Proteomics and PPI tool/database. CuraGen Corporation |
| PIM | Hybrigenics PPI data and tool, H. pylori. Free academic license available |
| RIKEN | Experimental and literature PPIs in mouse |
| STRING | Protein networks based on experimental data and predictions at EMBL |
| YPD | "BioKnowledge Library" at Incyte Corporation. Manually curated PPI data from S. cerevisiae. Proprietary |
Knowledge-Driven NGS Analysis
The GeneCards Suite-powered end-to-end NGS analysis and interpretation platform
Human biological pathway unification
PathCards is an integrated database of human biological pathways and their annotations. Human pathways were clustered into SuperPaths based on gene content similarity. Each PathCard provides information on one SuperPath which represents one or more human pathways. It includes 1,131 SuperPath entries, consolidated from 12 sources.
Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M. and Lancet, D. PathCards: multi-source consolidation of human biological pathways. Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006. [PDF]
PathCards: multi-source consolidation of human biological pathways
- Frida Belinky*,
- Noam Nativ,
- Gil Stelzer,
- Shahar Zimmerman,
- Tsippi Iny Stein,
- Marilyn Safran and
- Doron Lancet
- Received September 22, 2014.
- Revision received January 13, 2015.
- Accepted January 14, 2015.
The study of biological pathways is key to a large number of systems analyses. However, many relevant tools consider a limited number of pathway sources, missing out on many genes and gene-to-gene connections. Simply pooling several pathway sources would result in redundancy and a lack of systematic pathway interrelations. To address this, we applied a combination of hierarchical clustering and nearest neighbor graph representation, with judiciously selected cutoff values, thereby consolidating 3215 human pathways from 12 sources into a set of 1073 SuperPaths. Our unification algorithm finds a balance between reducing redundancy and optimizing the level of pathway-related informativeness for individual genes. We show a substantial enhancement of the SuperPaths' capacity to infer gene-to-gene relationships when compared with individual pathway sources, separately or taken together. Further, we demonstrate that the chosen 12 sources entail nearly exhaustive gene coverage. The computed SuperPaths are presented in a new online database, PathCards, showing each SuperPath, its constituent network of pathways, and its contained genes. This provides researchers with a rich, searchable systems analysis resource. Database URL: http://pathcards.genecards.org/
The systematic analysis of biological pathways has ever-increasing significance in an age of growing systems analyses and omics data. Mapping genes onto pathways may contribute to a better understanding of biological and biomedical mechanisms. The literature provides a large collection of pathway definition sources (1). Pathway knowledge bases represent the careful collection of genes and their interactions, mapped onto biological processes. These repositories, which include both academic and commercial resources (Figure 1A), provide lists of pathways and their cellular components, each with an idiosyncratic view of the pathway universe.
Indeed, the definition of the boundaries of biological pathways differs among sources, as exemplified by the highly studied processes of fatty acid metabolism (2) or the TCA cycle (the tricarboxylic acid cycle) (3). Further, the same pathway name may have widely dissimilar gene content in different sources (4). At present, there is no definitive analysis of pathway similarities, either between or within sources. Thus the multitude of pathway resources can often be confusing when portraying gene-pathway affiliations.
Previous attempts to unify pathways from several sources include NCBI's Biosystems (5), PathwayCommons (6), PathJam (7), HPD (8), ConsensusPathDB (9), hiPathDB (10) and Pathway Distiller (11). However, none of these efforts entails a standardized method to unify numerous sources into a consolidated global repository.
Here, we describe an approach aimed at generating an integrated view across multiple pathway sources. We applied a combination of nearest neighbor graph and hierarchical clustering, utilizing a gene-content metric, to generate a manageable set of 1073 unified pathways (SuperPaths). These optimally encompass all of the information contained in the individual sources, striving to minimize pathway redundancy while maximizing gene-related pathway informativeness. The resultant SuperPaths are integrated into GeneCards (12), enabling clear portrayal of a gene’s set of unified pathways. Finally, these SuperPaths, together with diverse related biological data, are provided in PathCards—a new pathway-centric online database, enabling quick in-depth analysis of each human SuperPath.
Materials and methods
Pathway mining and comparison
Pathway gene sets were generated based on the GeneCards platform (12), implementing the gene symbolization process that allows comparison of pathway gene sets from 12 different manually curated sources: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). A binary matrix was generated for all 3125 pathways, where each column represents a gene, indicated by 1 for presence in the pathway and 0 for absence. Additionally, six further sources were analysed for their cumulative gene content: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).
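The binary pathway-by-gene membership matrix described above can be sketched as follows; the pathway names and gene symbols here are toy examples, not the actual source data:

```python
# Toy pathway gene sets standing in for the curated sources.
pathways = {
    "Glycolysis (source A)": {"HK1", "PFKM", "PKM"},
    "Glycolysis (source B)": {"HK1", "PFKM", "ALDOA"},
    "Apoptosis":             {"TP53", "CASP3"},
}

# Columns: every gene seen in any pathway; rows: one 0/1 vector per pathway.
genes = sorted(set().union(*pathways.values()))
matrix = {name: [1 if g in members else 0 for g in genes]
          for name, members in pathways.items()}

for name, row in matrix.items():
    print(name, row)
```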
Pathway similarity assessment
In the analyses performed, we utilized gene content overlap to estimate pathway similarity. This was done with the Jaccard coefficient, which measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sets. To examine the legitimacy of this method, we performed a comparison with an alternative methodology, embodied in MetaPathwayHunter pathway comparison, which incorporates topology in pairwise pathway alignment (25). For this analysis, we used a set of 151 yeast pathways available in MetaPathwayHunter and computed Jaccard similarity coefficients (J) for all 11 325 pathway pairs. We then selected a sample of 30 pairs containing 28 unique pathways out of a total of 87 pairs with J ≥ 0.3, ensuring maximal representation for larger pathways. Each of the 28 pathways was queried in MetaPathwayHunter against the entire gamut of 151 with default parameters (a total of 4228 comparisons). We found that 29 out of the 30 sampled pathway pairs obtained a significant MetaPathwayHunter alignment (P ≤ 0.01). As only 64 of the 4228 comparisons showed such a P-value, the probability of obtaining this result at random is 1.6 × 10^−53 (Supplementary Table S1). Thus, Jaccard scores appear to be excellent predictors of the results of the more elaborate method. A full account of interpathway pairwise similarity is available upon request.
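The Jaccard coefficient used throughout is straightforward to compute; a minimal sketch with toy gene sets:

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Two toy pathways sharing 2 of 4 distinct genes give J = 0.5,
# above the lower clustering cutoff T1 = 0.3 used in the paper.
print(jaccard({"TP53", "MDM2", "CDKN1A"}, {"TP53", "MDM2", "BAX"}))
```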
For the main pathway clustering algorithm, we applied a method described elsewhere (26), which includes the following steps: i) generation of cluster cores by joining all pathway pairs with Jaccard coefficient ≥ T2, the upper cutoff, equivalent to hierarchical clustering; ii) cluster extension by generating new best edges, i.e. joining every pathway to the pathway showing the highest score, as long as it is ≥ T1, the lower cutoff, akin to nearest neighbor joining. If two or more target pathways have the same best score, all are joined. Each resultant connected component is defined to be a pathway cluster (SuperPath). Identical pathway sets were joined without considering each other as nearest neighbors (i.e. the best-scoring non-identical pathway gene set is chosen as the nearest neighbor). This clustering algorithm is order independent.
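The two-cutoff procedure can be sketched with a union-find over toy pathway sets. This simplified version joins each pathway to a single best neighbor (the paper also joins all tied best neighbors and handles identical sets specially) and uses illustrative data:

```python
def superpath_clusters(pathways, t1=0.3, t2=0.7):
    """Cluster pathways: pairs with Jaccard >= t2 form cores (step i);
    each pathway then joins its best neighbor if that score is >= t1
    (step ii). Connected components are returned as clusters."""
    jaccard = lambda a, b: len(a & b) / len(a | b)
    names = list(pathways)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Step i: cluster cores from all high-similarity pairs.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(pathways[a], pathways[b]) >= t2:
                union(a, b)

    # Step ii: nearest-neighbor extension with the lower cutoff.
    for a in names:
        best = max(((jaccard(pathways[a], pathways[b]), b)
                    for b in names if b != a), default=(0.0, None))
        if best[0] >= t1:
            union(a, best[1])

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

demo = {"A": {"g1", "g2", "g3"},
        "B": {"g1", "g2", "g3", "g4"},   # J(A, B) = 0.75 -> same core
        "C": {"g8", "g9"}}               # no similar neighbor -> singleton
print(superpath_clusters(demo))          # two clusters: {A, B} and {C}
```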
Determination of cutoffs
Uniqueness of a SuperPath is defined as Us = (1/Ng) Σ (1/Np), where Np is the number of pathways that include a certain gene and the average is taken over all Ng genes in the SuperPath. Uniqueness of genes is symmetrically defined per SuperPath as Is = (1/Np) Σ (1/Ng), where each Ng is the number of genes included in the relevant pathway, averaging for each gene over all SuperPaths including that gene. To then find the best tradeoff between the two scores, we summed the average Us and Is for each set of T1 and T2 cutoff parameters. Thus Us + Is was calculated for each parameter set to find the two parameters for which the tradeoff between pathway and gene uniqueness would be optimal. The best cutoffs maximizing Us + Is were T1 = 0.3 and T2 ≥ 0.5. Further fine-tuning of the upper cutoff was performed by resampling of the data, a technique employed by Levin and Domany (27). We used two dilutions (0.75 and 0.9), i.e. randomly sampling 75% and 90% of the pathways (resampling 100 times for each dilution) and performing the clustering algorithm on each sample, each time calculating the percent of edges present in the original clustering, that is, the percent of cases in which two pathways belonged to the same cluster as in the full dataset. For both dilutions, an upper cutoff of 0.7 was found to recover a higher percent of the edges in the original clustering (Figure 4C).
Name similarity calculation and concordance with gene similarity
Name similarity was calculated as the Jaccard coefficient of the shared words in the two pathway names, after omitting trivial words and using stemming to identify words with the same root. The cutoff between similar and non-similar names (and likewise for gene content when compared with name similarity) was set to J = 0.5. Name similarity was then compared with gene content similarity to determine the level of concordance between the two.
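The word-level Jaccard computation can be sketched as below; the suffix-stripping "stemmer" and the stopword list are crude stand-ins for whatever stemming and trivial-word filtering the authors actually used:

```python
def stem(word: str) -> str:
    # Hypothetical toy stemmer: strip a few common English suffixes.
    for suffix in ("ing", "ion", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Illustrative list of "trivial" words to omit before comparison.
TRIVIAL = {"of", "the", "and", "in", "by", "via", "pathway"}

def name_similarity(name_a: str, name_b: str) -> float:
    """Jaccard coefficient over stemmed, non-trivial words of two names."""
    tokens = lambda s: {stem(w) for w in s.lower().split() if w not in TRIVIAL}
    a, b = tokens(name_a), tokens(name_b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(name_similarity("Fatty acid metabolism", "Metabolism of fatty acids"))
```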
Shared publications and PPI data
Publication and protein-protein interaction (PPI) data for each gene were obtained from the GeneCards database, which combines several sources. Publication sources of GeneCards include both manually curated publications (e.g. UniProtKB/Swiss-Prot) and text mining approaches that report connections between a gene and a list of publications. A shared publication between two genes is an association of both genes with the same publication and does not indicate a direct interaction between the genes. PPI scores between pairs of genes are also based on several interaction sources in GeneCards. Unlike shared publications, PPIs reflect direct interactions between the two gene products.
Randomization and comparison
A randomized set of pseudo-SuperPaths was generated, such that the pseudo-SuperPaths are of the same size and quantity as the SuperPaths, albeit with genes assigned at random (from the list of genes with any pathway annotation). Gene pairs that belong to at least one SuperPath but do not belong together in any individual pathway (the test set) were analysed for the number of shared publications and the PPI score of each pair. In comparison, gene pairs that belong to at least one pseudo-SuperPath but do not belong together in any individual pathway (the control set) were analysed for the same attributes. To compare the two sets, which are of different sizes, a random sample of the larger set (the control set), equal in size to the smaller set (the test set), was compared with the smaller set. A one-sided Kolmogorov–Smirnov test was performed to compare the test and control sets.
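The two-sample Kolmogorov-Smirnov statistic underlying that comparison is the maximum gap between the two empirical cumulative distribution functions; a minimal sketch with made-up values (in practice a library routine such as scipy's `ks_2samp`, which also supplies the p-value, would be used):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Identical samples give D = 0; fully separated samples give D = 1.
print(ks_statistic([1, 2, 3], [1, 2, 3]), ks_statistic([1, 2], [5, 6]))
```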
Gene enrichment analysis comparison
Differentially expressed sets of genes were obtained from the GeneCards database (12), containing 830 different embryonic tissues based on manual curation (28). For the comparison of SuperPaths and their pathway constituents, 89 SuperPaths that contained exactly two pathways with Jaccard similarity coefficient <0.6 were chosen, a value selected to include pairs of relatively dissimilar pathways in order to enhance comparative power. Two gene set enrichment analyses were run for all 830 gene sets: one with SuperPaths and the other with their constituent pathways. Whenever both a SuperPath and its constituent pathways received a statistical enrichment score, the difference between the negative log P-values was computed.
GeneCards and PathCards
SuperPaths have been implemented in GeneCards and are now part of the standard GeneCards generation procedure. PathCards is an online compendium of human pathways, based on the GeneCards database, presenting SuperPath-related data on each page.
We analysed 12 pathway sources included in GeneCards (http://www.genecards.org/) (12), with a total of 3215 biological pathways (Table 1 and Figure 1A). The total number of genes covered by these sources is 11 478, nearly twice the gene count of the largest single source (Figure 1B), suggesting the power of analysing multiple sources. The total gene count behaves asymptotically as the number of sources increases. When considering the incorporation of six additional sources (Supplementary Figure S1), we found that the gene count increment is ∼2% of the currently analysed total, an indication that the chosen 12 sources provide adequate coverage of human gene–pathway mappings. Swapping the six non-included sources for six included sources of similar size gives a very similar curve, with a mere 4% increment in gene count (Supplementary Figure S1).
Analysing the gene repertoires of the four largest sources (Figure 2A), we found that of the 10 770 genes contained within these sources, only 1413 were jointly covered by all four, and more than 4000 were unique to a single source. This highlights the notion that source unification is essential for maximal gene coverage. In its simplest embodiment, source unification would entail presenting a unified list of the 3215 pathways included in all 12 sources. This, however, would ignore the extensive gene-content connectivity embodied in the network representation of this pathway collection (Figure 3A). Further, the original pathway collection has considerable inconsistencies between pathway name and pathway gene content, as exemplified in Figure 2B and C. The summary in Table 2A suggests that only ∼9.4% of all pathway pairs with a similar name have similar gene content and, likewise, only 9.8% of all pathway pairs with similar gene content are named similarly (Supplementary Figure S2).
We performed global pathway analysis aimed at assigning maximally informative pathway-related annotation to every human gene. For this, we converted the pathway compendium into a set of connected components (SuperPaths), each being a limited-size cluster of pathways. We aimed at controlling the size of the resulting SuperPaths, so as to maintain a high measure of annotation specificity and minimize redundancy.
The clustering procedure, in which pathways were connected to each other to form SuperPaths, comprised two steps. (i) Preprocessing of very small pathways: pathways smaller than 20 genes were connected to larger pathways (<200 genes) with a gene-content similarity of ≥0.9 relative to the smaller partner. (ii) The main pathway clustering algorithm: this was performed using the Jaccard similarity coefficient (J) (31) (see Materials and Methods). We used a combination (cf. 26) of modified nearest-neighbour graph generation with a threshold T1 and hierarchical clustering with a threshold T2 (Figure 4A and Materials and Methods).
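A simplified sketch of the gene-content clustering follows, assuming pathways are represented as Python gene sets. It implements only the Jaccard metric and a single-threshold nearest-neighbour pass whose connected components become clusters; the paper's actual procedure adds the small-pathway preprocessing and a second, hierarchical-clustering threshold T2.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient J between two pathway gene sets."""
    return len(a & b) / len(a | b)

def cluster_pathways(pathways, t1=0.3):
    """Connect each pathway to its nearest neighbour (by J) when J >= t1,
    and return the connected components as clusters. A simplification of
    the paper's two-threshold procedure."""
    parent = list(range(len(pathways)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, genes in enumerate(pathways):
        best, best_j = None, 0.0      # nearest neighbour by Jaccard
        for k, other in enumerate(pathways):
            if k != i:
                j = jaccard(genes, other)
                if j > best_j:
                    best, best_j = k, j
        if best is not None and best_j >= t1:
            parent[find(i)] = find(best)

    clusters = {}
    for i in range(len(pathways)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```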
To determine the optimal values of the thresholds T1 and T2, we defined two quantitative attributes of the clustering process. The first is US, the overall uniqueness of the set of SuperPaths. US rises with increasing pathway clustering, reflecting the gradual disappearance of redundancy, i.e. of cases in which certain gene sets are portrayed in multiple SuperPaths. The second is IS, the overall informativeness of the set of SuperPaths, a measure of how revealing a collection of SuperPaths is for annotating individual genes. IS decreases with the extent of pathway clustering, reaching an undesirable minimum of one exceedingly large cluster, whereby identical SuperPath annotation is obtained for all genes. We thus sought an optimal degree of clustering whereby US + IS is maximized (Figure 4B and Materials and Methods).
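Conceptually, the threshold choice is a grid search maximizing US + IS. The sketch below is schematic: `uniqueness`, `informativeness` and `cluster` are placeholder callables, since the exact formulas for US and IS are defined in the paper's Materials and Methods.

```python
def grid_search(pathways, thresholds, uniqueness, informativeness, cluster):
    """Scan all (T1, T2) combinations and keep the pair maximizing the
    combined score US + IS. The scoring and clustering callables are
    placeholders for the paper's actual definitions."""
    best = None
    for t1 in thresholds:
        for t2 in thresholds:
            clusters = cluster(pathways, t1, t2)
            score = uniqueness(clusters) + informativeness(clusters)
            if best is None or score > best[0]:
                best = (score, t1, t2)
    return best  # (best score, best T1, best T2)
```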
Our procedure pointed to an optimum at T1 = 0.3 and T2 ≥ 0.5. Further fine-tuning by data resampling suggested an optimal value of T2 = 0.7 (Figure 4C and Materials and Methods). This procedure yielded 1073 SuperPaths: 529 SuperPaths ranging in size from 2 to 70 pathways, and 544 singletons (one pathway per SuperPath) (Figures 3B and 5A). On average, each SuperPath had 3 ± 4.3 pathways (Figure 5A) and 82.7 ± 140.6 genes (Supplementary Figure S3A). The resultant set of SuperPaths indeed enhances the uniqueness US, as depicted in Figure 5B.
The unification process resulted in relatively small changes in gene count distribution between the original pathways and the resultant SuperPaths (Supplementary Figure S3), suggesting a substantial preservation of gene groupings. Notably, applying pure hierarchical clustering (T1 = T2 = 0.3) resulted in a single very large cluster of 1046 pathways (Figure 3C), with the same number of singletons, strongly deviating from the goal of specific pathway annotation for genes (Supplementary Figure S3B). This sub-optimal performance of pure hierarchical clustering is general: every examined case of T1 = T2 (Figure 4B diagonal) shows a US + IS value lower than that for T1 = 0.3, T2 = 0.7.
Each SuperPath is identified by a textual name derived from one of its constituent pathways, selected as the most connected pathway (hub) in the SuperPath cluster. For simplicity, the option of de novo naming was not exercised. The hub's name, rather than that of the largest pathway, was chosen because it tends to enhance the descriptive value for the entire SuperPath. When more than one pathway has the same maximal number of connections, the larger one is chosen.
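The naming rule reduces to picking the hub by (degree, size). A minimal sketch, with hypothetical input structures: an edge list of intra-cluster pathway connections, plus size and name lookups.

```python
def superpath_name(cluster_edges, pathway_sizes, pathway_names):
    """Name a SuperPath after its hub: the constituent pathway with the
    most connections inside the cluster, ties broken by gene count."""
    degree = {}
    for a, b in cluster_edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # max connections first, then larger pathway wins ties
    hub = max(degree, key=lambda p: (degree[p], pathway_sizes[p]))
    return pathway_names[hub]
```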
SuperPaths make important gene connections
One of the major implications of the process of SuperPath generation is elucidating new connections among genes. This happens because genes that were not connected via any pre-unification pathway become connected through belonging to the same SuperPath. The unification into SuperPaths is important in two ways: first, it brings, under one roof, pathway information from 12 sources, each individually contributing ∼9000 to ∼5 million instances of gene pairing, for a total of 7.3 million pairs (Supplementary Figure S4). Second, by unifying into SuperPaths, the number of gene pairs is further enhanced, reaching 8.3 million (Supplementary Figure S4).
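Counting the pairs added by unification amounts to set arithmetic over unordered gene pairs: pairs implied by the SuperPath's gene union minus pairs already implied by any constituent pathway. A minimal sketch (helper names are illustrative):

```python
from itertools import combinations

def pairs(gene_set):
    """All unordered gene pairs implied by one pathway or SuperPath."""
    return {frozenset(p) for p in combinations(sorted(gene_set), 2)}

def added_pairs(superpath_genes, constituent_pathways):
    """Gene pairs created by unification: present in the SuperPath's
    gene union but absent from every constituent pathway."""
    old = set()
    for pathway in constituent_pathways:
        old |= pairs(pathway)
    return pairs(superpath_genes) - old
```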
To test the significance of the ∼1 million new gene–gene connections resulting from SuperPath generation, we checked their correlation with two independent measures of gene pairing. First, a comparison was made to publications shared among gene pairs (Figure 6A). We found that for gene pairs appearing in a SuperPath but not in any of its constituent pathways, there is a 4- to 75-fold increase in instances of >20 shared publications when compared with random pairs of genes with pathway annotation. Thus, added gene pairs have significantly more shared publications than randomly paired genes. Second, we performed a similar analysis based on protein–protein interaction information. We found that for the SuperPath-implicated gene pairs there was a 4- to 25-fold increase in PPIs with a score >0.2 (Figure 6B) when compared with controls. SuperPaths thus provide significant gene-partnering information not conveyed by any of their 3215 constituent individual pathways. This is also seen when performing gene set enrichment analysis on 830 differential expression sets and comparing the scores of SuperPaths to those of their constituent pathways: SuperPaths tend to receive more significant scores than the average of their constituent pathways (Figure 7A).
SuperPaths in databases
SuperPath information is available both in the GeneCards pathway section (Supplementary Figure S5A) and in PathCards (Supplementary Figure S5B; http://pathcards.genecards.org/), a GeneCards companion database presenting a web card for each SuperPath. PathCards lets the user view the pathway network connectivity within a SuperPath, as well as the gene lists of the SuperPath and of each of its constituent pathways. Links to the original pathways are available from the pathway database symbols, placed to the left of the pathway names. PathCards has extensive search capacity, including finding any SuperPath that contains a search term within its included pathway names, gene symbols and gene descriptions. Multiple search terms are supported, allowing fine-tuned results. Search results can be expanded to show exactly where in the SuperPath-related text the terms were found. The gene list in a PathCard uses graded coloring to designate the fraction of included pathways containing each gene, providing an assessment of the gene's importance in the SuperPath. Other features, including gene list sorting and a search tutorial, are under construction. PathCards is updated regularly, together with GeneCards updates; a new version is released 2–3 times a year.
Pathway source heterogeneity
This study highlights substantial mutual discrepancies among different pathway sources, e.g. with regard to pathway sizes, names and gene contents. The world of human biological pathways consists of many idiosyncratic definitions provided by mostly independent sources that curate publication data and interpret it into sets of genes and their connections. This idiosyncrasy is exemplified by the variation in pathway size distribution among sources (Table 1, Figure 2D): some sources overrepresent large pathways (QIAGEN), while others have mainly small pathways (HumanCyc). In some cases, the large standard deviation in pathway size (Table 1) is easily explained, as in Reactome, which provides hierarchies of pathways and therefore contains a spectrum of pathway sizes. However, large standard deviations of pathway size are also observed in KEGG and QIAGEN, sources that are not hierarchical by definition. On the other hand, some sources (e.g. HumanCyc, PID and PharmGKB) have very little variation in their pathway sizes, revealing a focus on pathways of a particular size. The idiosyncratic view of the different sources is also evident in the genes covered by each source (Figure 2A), where some genes are covered by only one source. This causes the unfavorable outcome that, when unifying pathways, irrespective of the algorithm chosen, there is a relatively high proportion of single-source pathway clusters. To compensate for the Jaccard index's limited ability to cope with large size differences between pathways, we added a preprocessing step that unifies pathways almost completely included within other pathways (≥0.9 gene-content similarity relative to the smaller pathway), thereby diminishing the barrier of variable pathway size between sources.
Previously published isolated instances of intersource discrepancies include the lack of pathway source consensus for the TCA cycle (3) and fatty acid metabolism (2). The authors of both papers stress that each of their pathway sources has only a partial view of the pathway. For the TCA cycle example (3) there is an attempt to provide an optimal TCA cycle pathway by identifying genes that appear in multiple sources, but such manual curation is not feasible for a collection of >3000 biological pathways. In our procedure, 11 relevant pathways from four sources are unified into a SuperPath entitled ‘Citric acid cycle (TCA cycle)’ (Supplementary Figure S5). PathCards enables one to then view which genes are more highly represented within the constituent pathways. Our algorithm thus mimics human intervention, and greatly simplifies the task of finding concurrence within and among pathway sources.
Combining several pathway resources has been attempted before, using different approaches. The first is simply to aggregate all of the pathways in several knowledge bases into one database, without further processing. This approach is taken, for example, by NCBI's Biosystems, with 2496 human pathways from five sources (5), and by PathwayCommons, with 1668 pathways from four sources (6). It was also the approach taken by GeneCards prior to the SuperPaths effort described here, where pathways from six sources were shown separately in every GeneCard. While this approach provides centralized portals with easy access to several pathway sets, it does not reveal interpathway relationships and may result in considerable redundancy. The second unification approach, taken by PathJam (7) and HPD (8), provides protein-versus-pathway tables as search output. This scheme allows useful comparisons for specific search terms, but is not leveraged into global analyses of interpathway relations. A third line of action is exemplified by ConsensusPathDB (9), which integrates information from 38 sources, including 26 protein–protein interaction compendia as well as 12 knowledge bases with 4873 pathways. This allows users to observe which interactions are supported by each of the information sources. In turn, hiPathDB (10) integrates protein interactions from four pathway sources (1661 pathways) and creates ad hoc unified superpathways for a query gene, without globally generating consolidated pathway sets. Finally, a fourth methodology is employed by Pathway Distiller (11), which mines 2462 pathways from six pathway databases and unifies them into clusters of several predetermined sizes between 5 and 500, using hierarchical clustering.
The third method (interaction mapping, taken by ConsensusPathDB and hiPathDB) differs conceptually from the fourth (clustering): interaction mapping reveals the specific commonalities and discrepancies in protein interactions among sources with regard to specific keywords or genes, whereas clustering suggests which pathways are similar enough to be assigned to the same cluster. The third and fourth methods are thus complementary approaches that utilize pathway information at different levels of observation, the clustering method being independent of user input or search in its resultant consolidation. In the study described herein, we pursued a clustering method similar to that of Pathway Distiller, namely consolidation of pathways into clusters. However, in contrast to Pathway Distiller, our aim was to create a single coherent unification of biological pathways, which is essential for having a universal set of descriptors when looking at gene–gene relations. The resulting SuperPaths simplify the pathway-related descriptive space of a gene, reducing it 3-fold. Furthermore, the cutoffs in our algorithm are chosen to optimally balance the criteria of uniqueness and informativeness, thereby reducing the subjective effect of choosing cutoffs arbitrarily or by predetermining the number of clusters.
A crucial element in our SuperPaths generation method is the definition of interpathway relationships. We have opted for the use of gene content, as described by others (11, 33). One could also consider the use of pathway name similarity (11). However, among the 3215 pathways analysed here, only 79 names were shared by more than one pathway, implying that the efficacy of such an approach would have been rather limited. Further, Table 2 and Supplementary Figure S2 indicate a relatively weak concordance between pathway names and their gene content. Specifically, among the 79 name-identical pathway groups, 52 remained incompletely unified, again suggesting a limited usefulness for unifying based on pathway names. Many resources, including ConsensusPathDB (9), facilitate the option of finding pathways based on keywords in the name. Name sharing is thus a relatively trivial obstacle when trying to find similar pathways; the more challenging goal is finding pathways that are similar in the biological process they convey.
In this article we treated pathways as sets of genes, using gene content as a comparative measure and omitting topology and small molecule information. This approach was previously advocated as a means of reducing the complexity of pathway comparisons greatly (34). Further, most sources used in this study provide only the gene set information, hence topology information was unavailable. Finally, the high concordance between significance of pathway alignment and Jaccard coefficients ≥0.3 (P < 10−52) indicates that the Jaccard coefficient is a good approximation of the more elaborate pathway alignment procedure (25).
A central aim of pathway source unification is enhancing the inference of gene-to-gene relations needed for pathway enrichment scrutiny (32, 35–40). To this end, we developed an algorithm for pathway clustering so as to optimize this inference and at the same time minimize redundancy.
Extending pathways into SuperPaths affords three major advantages. The first is augmenting the gene grouping used for such inference. Indeed, SuperPaths have slightly larger sizes than the original pathways, as evidenced by the SuperPath size distribution (Figure 2D). Nevertheless, comparing SuperPaths to pseudo-SuperPaths of the same size and quantity clearly shows that the increase in size does not account for the addition of true-positive gene connections, as evidenced by the higher PPI scores and larger counts of shared publications for SuperPath gene pairs (Figure 6). It is therefore not surprising that SuperPaths outperform the average enrichment analysis scores of their pathway constituents (Figure 7A). SuperPaths are currently used in two novel GeneCards-related tools, VarElect (http://varelect.genecards.org/) and GeneAnalytics (http://geneanalytics.genecards.org/). A second advantage of SuperPaths is the reduction of redundancy: they provide a smaller, unified pathway set, and thus diminish the necessary statistical correction for multiple testing. We note that ConsensusPathDB (9) also provides an intersource integrated view of interactions; however, gene set analysis in ConsensusPathDB is only allowed for pathways as defined by the original sources. Finally, a third advantage of SuperPaths is their ability to rank genes within a biological mechanism via the multiplicity of constituent pathways within which a gene appears. This can be used not only to gain better functional insight but also to help eliminate suspected false-positive genes appearing in a minority of the pathway versions. A capacity to view such gene ranking is available within the PathCards database.
Limitations of SuperPaths
The SuperPaths generation procedure appears incomplete, as about half of all SuperPaths are 'singleton SuperPaths' (labelled accordingly in PathCards), having only one constituent pathway. This is an outcome of the specific cutoff parameters used. However, it provides a useful indication to the user that a singleton pathway is distinct, differing greatly in its constituent genes from any other pathway.
This SuperPath generation process is intended to reduce the redundancies and inconsistencies found when analysing the unified pathways. Although SuperPaths increase uniqueness compared with the original pathway set (Figure 5B), some redundancy and inconsistency remain within SuperPaths. There are cases of pathways with similar names that do not get unified into the same SuperPath, because they have not met the unification criteria employed. We also note that similarity in name does not always indicate similarity in gene content (Figure 2B and C, Supplementary Figure S2B), and such events are faithfully conveyed to the user.
A clarifying example is that of the 40 pathways whose names include the string 'apoptosis'. The final post-unification list has 10 SuperPaths whose names include 'apoptosis', providing the user with a greatly simplified view of the apoptosis world. Yet, at the same time, the outcome is replete with instances of two name-similar pathways being included in different SuperPaths. Employing a more stringent algorithm would result in over-clustering, which would in turn reduce informativeness (see Figure 3C).
In parallel, there are pathways with overlapping functions that are not consolidated into one SuperPath. For example, the pathway ‘integrated breast cancer pathway’ does not unify with the pathways ‘DNA repair’ and ‘DNA damage response pathway’, despite the strong functional relation of breast cancer with DNA damage and repair (41). This is because the relevant gene content similarity in the original pathway sources is small, respectively, J = 0.03 and 0.13. The need to view information on pathways with low pairwise similarity is addressed in Supplementary Figure S6, and is available as a text file upon request.
Finally, when looking at the number of contributing sources per SuperPath (Figure 7B), it is evident that the majority of SuperPaths draw on only one or two sources, and no SuperPath includes more than five. Although this integration limitation is evident, it arises mainly from the inherent biases in gene coverage of the different information sources (Figure 2A).
Biological pathway information has traditionally been a central facet of GeneCards, the database of human genes (12, 42, 43). In previous versions, pathways were presented separately for each of the pathway sources, and it was difficult for users to relate the separate lists to each other. As a result of the consolidation into SuperPaths described herein, this problem has been effectively addressed. Thus, in every GeneCard, a table portrays all of a gene’s SuperPaths, each with its constituent pathways, with links to the original sources (Supplementary Figure S5A).
GeneCards is gene-centric and inherently does not present (Super)pathway-centric annotations. We therefore developed PathCards (http://pathcards.genecards.org/), a database that encompasses and displays such information in greater detail. PathCards has a page for every SuperPath, showing the connectivity of its included pathways, as well as gene lists for the SuperPath and its pathways. For every SuperPath, we also show a STRING gene interaction network (32) for the entire gamut of constituent genes, providing perspective on topological relationships within the SuperPath.
Supplementary data are available at Database Online.
This research is funded by grants from LifeMap Sciences Inc. California (USA) and the SysKid—EU FP7 project (number 241544). Support is also provided by the Crown Human Genome Center at the Weizmann Institute of Science. Funding for open access charge: LifeMap Sciences Inc. California (USA).
Conflict of interest. None declared.
We thank Prof. Eitan Domany and Prof. Ron Pinter for helpful discussions, as well as Dr. Noa Rappaport and Dr. Omer Markovich for assistance with clustering and visualization.
Citation details: Belinky,F., Nativ,N., Stelzer,G., et al. PathCards: multi-source consolidation of human biological pathways.Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006
© The Author(s) 2015. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.