Recent approaches to the prioritization of candidate disease genes
Many efforts are still devoted to the discovery of genes involved with specific phenotypes, in particular, diseases. High-throughput techniques are thus applied frequently to detect dozens or even hundreds of candidate genes. However, the experimental validation of many candidates is often an expensive and time-consuming task. Therefore, a great variety of computational approaches has been developed to support the identification of the most promising candidates for follow-up studies. The biomedical knowledge already available about the disease of interest and related genes is commonly exploited to find new gene–disease associations and to prioritize candidates. In this review, we highlight recent methodological advances in this research field of candidate gene prioritization. We focus on approaches that use network information and integrate heterogeneous data sources. Furthermore, we discuss current benchmarking procedures for evaluating and comparing different prioritization methods. WIREs Syst Biol Med 2012. doi: 10.1002/wsbm.1177
For further resources related to this article, please visit the WIREs website.
COMPLEX DISEASES AND THE IDENTIFICATION OF RELEVANT GENES
Many common diseases are complex and polygenic, involving dozens of human genes that might predispose to, be causative of, or modify the respective disease phenotype.1–3 This intricate interplay of disease genotypes and phenotypes still renders the identification of all relevant disease genes difficult.4–6 Therefore, a number of experimental techniques exist to discover disease genes. In particular, high-throughput methods such as genome-wide association (GWA) studies2,7,8 and large-scale RNA interference screens6,9,10 yield lists of up to hundreds of candidate disease genes. As validating the actual disease relevance of candidate genes in experimental follow-up studies is a time-consuming and expensive task, many methods and web services for the computational prioritization of candidate disease genes have already been developed.11–19
The concrete problem of candidate gene prioritization can be formulated as follows: Given a disease (or, generally spoken, a specific phenotype) of interest and some list of candidate genes, identify potential gene–disease associations by ranking the candidate genes in decreasing order of their relevance to the disease phenotype. When abstracting from the methodological details, the vast majority of computational approaches to this prioritization problem work in a similar manner. Most of them rely on the biological information already available for the disease phenotype of interest and the known, already verified, disease genes as well as for the additional candidate genes. In this context, functional information, particularly, manually curated or automatically derived functional annotation, often provides strong evidence for establishing links between diseases and relevant genes and proteins.20–27 Many prioritization methods use protein interaction data as rich information source for finding relationships between gene products of candidate genes and disease genes.11,16,18,25,28–45 In addition, the phenotypic similarity of diseases can help to increase the total number of known disease genes for less studied disease phenotypes.35,46–55 Other sources of biological information frequently used by prioritization approaches are sequence properties, gene expression data, molecular pathways, functional orthology between organisms, and relevant biomedical literature.12,14,19
These data then serve as input for statistical learning methods or are integrated into network representations, which are further analyzed by network scoring algorithms. Although individual data sources such as functional annotations or protein interactions provide quite powerful information for prioritizing candidate genes, the integration of multiple data sources has been reported to increase the performance even more.23–25,35,47–69 However, a generally accepted and consistent benchmarking strategy for all the diverse prioritization methods has not emerged yet, which complicates performance evaluation and comparison.
Therefore, this advanced review does not only highlight recent prioritization approaches (published until the end of 2011), but also discusses different benchmarking strategies applied by authors and the need of standardized procedures for performance measurement. Other computational tasks such as the structural and functional interpretation as well as prioritization of disease-associated nucleotide and amino acid changes are not discussed in this article, but are reviewed elsewhere.3,70–74 In the following, the various prioritization methods are categorized according to the biological data and their representation that are primarily considered when scoring and ranking candidate disease genes: gene and protein characteristics, network information on molecular interactions, and integrated biomedical knowledge.
PRIORITIZATION METHODS USING GENE AND PROTEIN CHARACTERISTICS
The first computational approaches in the hunt for disease genes focused on molecular characteristics of disease genes, which discriminate them from non-disease genes. As described below, researchers developed methods related to individual gene and protein sequence properties75–77 as well as functional annotations of gene products.20–22,26,27,78 In principle, if a candidate satisfies certain characteristics as derived from known disease genes and proteins, its disease relevance is considered to be higher than otherwise.
Gene and Protein Sequence Properties
López-Bigas and Ouzounis75 derived several important characteristics of disease genes from the amino acid sequence of their gene products. In comparison with other proteins encoded in the human genome, disease proteins tend to be longer, to exhibit a wider phylogenetic extent, that is, to have more homologs in both vertebrates and invertebrates, to possess a low number of close paralogs, and to be more evolutionarily conserved. Using these sequence properties as input to a decision-tree algorithm, the researchers performed a genome-wide identification of genes involved in (hereditary) diseases.
Similarly, Adie et al.76 developed PROSPECTR, a method for candidate disease gene prioritization based on an alternating decision-tree algorithm. However, their approach examined a broader set of sequence features and thus produced a more successful classifier. In particular, Adie et al. found that disease genes tend to have different nucleotide compositions at the end of the sequence, a higher number of CpG islands at the 5′ end, and longer 3′ untranslated regions.
By demonstrating the strong correlation between gene and protein function and disease features, such as age of onset, Jimenez-Sanchez et al.78motivated prioritization approaches that exploit the functional annotation of known disease genes for ranking candidates.20–22,26,27
Perez-Iratxeta et al.20 applied text mining on biomedical literature to relate disease phenotypes with functional annotations using Medical Subject Headings (MeSH)79 and Gene Ontology (GO) terms.80 They ranked the candidate genes according to the characteristic functional annotations shared with the disease of interest. In a similar fashion, Freudenberg and Propping21identified candidate genes based on their annotated GO terms that are shared with groups of known disease genes associated with similar phenotypes. In contrast, the approach POCUS22 assesses the shared over-representation of functional annotation terms between genes in different loci for the same disease.
Recently, Schlicker et al.26 developed a prioritization method that makes use of the similarity between the functional annotations of disease genes and candidates. In contrast to the approaches that consider solely identical functional annotations or compute only GO term enrichments, MedSim automatically derives functional profiles for each disease phenotype from the GO term annotation of known disease genes and, optionally, of their orthologs or interaction partners. Candidate genes are then scored and ranked according to the functional similarity of their annotation profiles to a disease profile. In addition, Ramírez et al.27 introduced the BioSim method for discovering biological relationships between genes or proteins. While MedSim is based only on GO term annotations, BioSim quantifies functional gene and protein similarity according to multiple data sources of functional annotations and can also be applied to rank candidate genes based on their functional similarity to known disease genes.
The success of the presented studies also shows that phenotypically similar diseases often involve common molecular mechanisms and thus functionally related genes. This also explains the frequent use of functional annotations as important biological evidence in integrative prioritization approaches.23–25,56–58,61–66,68,69 Notably, the information value of functional annotations can be further increased by improved scoring of functional similarity, reaching the performance of complex integrative methods based on multiple data sources.26
PRIORITIZATION METHODS USING NETWORK INFORMATION
In the last decade, molecular interaction networks have become an indispensable tool and a valuable information source in the study of human diseases. Regarding methods for prioritization of candidate disease genes, it was repeatedly observed that protein interaction networks are among the most powerful data sources in addition to functional annotations.11,16,18,28,29,81 As in case of sequence properties, disease genes and their products have discriminatory network properties that allow their distinction from non-disease genes. In particular, molecular interactions naturally support the application of the guilt-by-association principle to identify disease genes. In the following, we highlight a representative selection of network-based prioritization approaches.
Local Network Information
Early prioritization methods have focused on local network information such as close network neighborhood of a node representing a candidate gene or protein (see Box 1). This can be explained by the observation that disease proteins tend to cluster and interact with each other.30,49,82–84 Molecular triangulation is one of the first methods that used protein interaction networks to rank candidates and their network nodes with respect to their shortest path distances to nodes of known disease proteins.31 An evidence score such as the MLS score corresponding to linkage peak association85 is assigned to each disease protein node and transferred to its neighbor nodes. The candidates are then ranked according to the accumulated sum of evidence scores. This means that candidates represented by nodes close to several disease protein nodes with good evidence scores are considered to be the most promising ones.
Local network information refers to the topological neighborhood of a node, and corresponding measures are less sensitive to the overall network topology. Examples are the node degree kn (number of edges linked to node n) and the shortest path length dnm (minimum number of edges between the nodes n and m). In disease gene networks, Xu et al.34 define, for each node n, the 1N index and the 2N index . Here, is the number of edges between node nand disease genes, and Nn is the set of direct neighbors of n. Given the set of disease genes M, the average shortest path distance of a node n to disease genes is .
Global network information relates to the overall network topology and measures that characterize the role of a node in the whole network. Common centrality measures are shortest path closeness and betweenness as well as random-walk-related properties such as hitting time, visit frequency and stationary distribution. For instance, closeness centrality indicates how distant a node is to the other network nodes, and it is calculated as with Vn denoting the set of nodes reachable from n. The random walk with restart40 is defined as pt+1 = (1 − r)Wpt + rp0. Here, W is the column-normalized adjacency matrix of the network, pt contains the probability of being at each node at time step t, p0 denotes the initial probability vector, and r is restart probability. The steady-state probability vector p∞ can be obtained by performing iterations until the change between pt and pt+1 falls below some significance threshold, e.g., 10−6.40
In a related approach, Karni et al.32 identified the minimal set of candidates so that there is a path between the products of known disease genes in the protein interaction network. Oti et al.33 proposed an even simpler method for a genome-wide prediction of disease genes. For each known disease protein, they identified its interaction partners and the chromosomal locations of the encoding genes. A gene is then considered to be relevant for a disease of interest if it resides within a known disease locus and its gene product shares an interaction with a protein known to be associated with the same disease.
To make the most out of the potential of local network measures, Xu and Li34computed multiple topological properties for three different molecular networks consisting of literature-curated, experimentally derived, and predicted protein–protein interactions. The considered properties are the node degree, the average distance to known disease genes, the 1N and 2N node indices (see Box 1 and Figure 1), and the positive topological coefficient.86 The authors then trained a k-nearest-neighbor classifier using the aforementioned topological properties of known disease genes and achieved comparable performance for all three networks. They also detected a possible bias in the literature-curated network because disease genes tend to be studied more extensively.
In addition to local network information, Lage et al.35 incorporated phenotypic data into the disease gene prioritization. Each candidate and its direct interaction partners are considered as a candidate complex. All disease proteins in a candidate complex are assigned phenotypic similarity scores, which are used as input to a Bayesian predictor. Thus, a candidate gene obtains a high score if the other proteins in the complex are involved in phenotypes very similar to the disease of interest. Care et al.36 elaborated on this approach by combining it with deleterious SNP predictions for the candidate gene products and their interaction partners. Using the method by Lage et al., Berchtold et al.37 successfully prioritized proteins associated with type 1 diabetes (T1D). Further studies of protein interaction networks underlying specific diseases such as breast cancer38 and T1D39 also deal with the application of similar network-based prioritization approaches.
Global Network Information
Beyond local network information that ignores potential network-mediated effects from distant nodes, the utilization of global network measures can considerably improve the performance of prioritization methods for candidate disease genes.40,41,44 Especially for the study of polygenic diseases, network topology analysis can provide more insight into multiple paths of long-range protein interactions and their impact on the functionality and interplay of disease genes.
Köhler et al.40 demonstrated that random-walk analysis of protein–protein interaction networks outperforms local network-based methods such as shortest path distances and direct interactions as well as sequence-based methods like PROSPECTR.76 In their method, the authors ranked the gene products in a given network according to the steady-state probability of a random walk, which starts at known disease proteins and can restart with a predefined probability (see Box 1). Although the ranking criterion is the proximity of candidates to known disease proteins, this approach is more discriminative than local measures because it accounts for the global network structure.
In a similar manner, Chen et al.41 adapted three sophisticated algorithms from social network and web analysis for the problem of disease gene prioritization. To this end, they analyzed a protein–protein interaction network using modified versions of the random-walk-based methods PageRank,87,88Hyperlink-Induced Topic Search (HITS),88,89 and K-Step Markov method (KSMM).88 PageRank, HITS, and KSMM consider the global network topology and compute the relevance of all nodes representing candidates with regard to the set of known disease proteins in the network. All three methods achieved comparable performance to each other.
To address the issue of finding causal genes within expression quantitative trait loci (eQTL),90,91 Suthram et al.42 introduced the eQTL electrical diagrams method (eQED). They modeled confidence weights of protein interactions as conductances, while the P-values of associations between genetic loci and the expression of candidate genes served as current sources. The best candidate gene is the one passed by the highest current. The currents in electric circuits can be determined efficiently using random-walk computations.92,93
Network Centrality Measures
For many years, global centrality measures such as closeness or betweenness have been used in social sciences to assess how important individual nodes are for the overall network connectivity. Recently, such measures have been applied to several problems in bioinformatics including disease gene prioritization. For example, Dezső et al.43 applied an adapted version of shortest path betweenness to prioritize candidates in a protein–protein interaction network. A candidate is scored more relevant to the disease of interest if it lies on significantly more shortest paths connecting nodes of known disease proteins than other nodes in the network.
In a recent case study on primary immunodeficiencies (PIDs), Ortutay and Vihinen25 integrated functional GO annotations with protein interaction networks to discover novel PID genes. The authors conducted a topological analysis on an immunome network consisting of all essential proteins related to the human immune system and their interactions. In particular, they used the node degree as well as the global centrality measures of vulnerability and closeness to assess the importance of candidate genes in the network (see Box 1). Additionally, they performed functional enrichment analysis to determine genes with PID-related GO terms. With some modifications, the described prioritization method could be generalized to other diseases of interest.
Combining Network Measures
Recently, Navlakha and Kingsford44 compared different network-based prioritization methods. The authors observed that random-walk-based measures40 outperform measures focused on the local network neighborhood33 or clustering.94–96 A consensus method that uses a random-forest classifier to combine all methods yielded the most accurate ranking. Therefore, apart from stressing the potential of protein interaction data, Navlakha and Kingsford also showed that disease gene prioritization can benefit from the integration of multiple information sources.
In summary, as many other studies have also demonstrated, molecular interaction networks, in particular, based on protein interactions, provide valuable biological knowledge for ranking candidate disease genes.11,16,18,28,29 It has also become clear that global network measures achieve better results in comparison to local measures.40,41,44 Nevertheless, the performance of such prioritization approaches depends heavily on the quality of the network data. Protein interaction data are well known to be biased toward extensively studied proteins and subject to inherent noise.34,97,98 Therefore, it is often suggested that existing methods will perform better when more accurate data become available. Furthermore, Erten et al.45 pointed out that network-based methods can also be improved by integrating statistical adjustments for the skewed degree distribution of protein interaction networks.
PRIORITIZATION METHODS USING INTEGRATED KNOWLEDGE
Network information on molecular interactions as well as individual gene and protein characteristics such as sequence properties and functional annotations are major sources of biological evidence for scoring and ranking candidate disease genes. However, a prioritization approach based on a single information source alone usually achieves only limited performance due to noisy and incomplete datasets. To address this problem, the integration of multiple sources of biological knowledge has proven to be a good solution in bioinformatics. Different types of data can complement each other well to increase the amount of available information and its overall quality. While some of the methods presented above already make successful use of relatively simple integration procedures for a few different sources of functional information and annotations, this section will focus on more sophisticated methods for knowledge integration and the prioritization of candidate disease genes.
Complementing Molecular Interactions with Phenotypic Network Information
In the last years, several groups investigated the similarities and differences between disease phenotypes. The main finding was that similar phenotypes often share underlying genes or even pathways.46,99,100 In particular, van Driel et al.46 classified all human phenotypes contained in the Online Mendelian Inheritance in Man database (OMIM)101 by defining a measure of phenotypic similarity based on text mining of the corresponding OMIM records. Such phenotypic knowledge can be very useful to discover new potential disease genes by transferring known gene–phenotype associations to similar diseases and phenotypes.
Therefore, phenotypic similarity has become another major data source exploited by computational methods for prioritization of candidate disease genes.35,47–55 In this context, a two-layered heterogeneous data network is typically constructed so that the phenome layer consists of connections between similar phenotypes, while the interactome layer contains protein–protein interactions. The two network layers are then linked by known gene–phenotype associations.
To demonstrate the importance of the additional phenotype network layer for identifying novel gene–phenotype associations and disease–disease relationships, Li et al.48 extended the random-walk algorithm used by Köhler et al.,40 as described in the previous section, to heterogeneous networks. Both the candidate genes and the disease phenotypes are prioritized simultaneously. In contrast, Yao et al.49 estimated the closeness of a candidate gene to a disease of interest by computing the hitting time of a random walk that starts at the corresponding disease phenotype and ends at the candidate. This approach also allows the genome-wide identification of potential disease genes for phenotypic disease subtypes.
Chen et al.50 reformulated the candidate gene prioritization problem as a maximum flow problem on a heterogeneous network. They represented the capacities of connections between phenotypes by their phenotypic similarity. Capacities on edges within the interactome and on edges bridging the phenome and interactome were estimated during the evaluation procedure. By calculating a maximum flow from a phenotype of interest through the interactome, the authors ranked candidate genes with regard to the amount of efflux.
A computationally simpler approach based on the same network type was suggested by Guo et al.,51 who computed the association score between a gene and a disease as the weighted sum of all association scores between similar diseases and between neighboring genes in the interaction layer. To this end, the authors formulated an iterative matrix multiplication of disease–gene–association matrices and disease-similarity matrices corresponding to the network structure. While the maximum flow problem solved by Chen et al.50 already accounts for the phenotypic overlap between diseases, the approach by Guo et al.51 additionally considers the genetic overlap of diseases. The recent PhenomeNET52 is even a cross-species network of phenotypic similarities between genotypes and diseases based on a uniform representation of different phenotype and anatomy ontologies. In particular, it can be used to perform whole-phenome discovery of genes for diseases with unknown molecular basis.
Two other studies used iterative network flow propagation on a heterogeneous network to identify protein complexes related to disease. Vanunu et al.53 developed a prioritization method that propagates flow from a phenotype of interest through the whole network and identifies dense subnetworks around high-scored genes as potential phenotype-related protein complexes. In contrast, Yang et al.54 modified the described heterogeneous network and included an additional layer of protein complexes. In the resulting network, phenotypes are connected to protein complexes, and complexes are linked with each other according to the protein interactions in the interactome layer. The method derives novel gene–phenotype associations by propagating the network flow within the protein complex layer.
Disease gene prioritization methods usually rank candidate genes relative to a phenotype of interest. However, the discovery of gene–phenotype associations can also be approached the other way around. Hwang et al.55devised a method to identify the phenotype that could result from a given set of candidate genes. For that purpose, the authors considered a gene network and a phenotype similarity network. In both networks, the nodes were ranked separately with graph Laplacian scores, and a rank coherence was calculated from the score differences between genes and phenotypes connected by known associations. Hwang et al. showed that their approach is suitable to predict the resulting phenotype for a given set of candidate genes.
Integrating Heterogeneous Data Sources of Biological Knowledge
Two distinct approaches to disease gene prioritization that exploit multiple data sources are exemplarily highlighted in the following (Figure 2). The first approach considers each data source separately when assessing the molecular and phenotypic relationships of candidate genes with the disease of interest, and aggregates the resulting multiple ranking lists into a final ranking of the candidates. The alternative approach combines all biological information into a network representation and subsequently applies network measures to score and rank candidates with regard to their network proximity to nodes representing known disease genes.
In detail, the prioritization method Endeavour56,102 utilizes more than 20 data sources such as ontologies and functional annotations, protein–protein interactions, cis-regulatory information, gene expression data, sequence information, and text-mining results. For each data source, candidate genes are first ranked separately based on their similarity to a profile derived from known disease genes. Afterwards, all individual candidate rankings are merged into a final overall ranking using rank order statistics. The authors showed that this approach is quite successful in finding potential disease genes as well as genes involved in specific pathways or biological functions. Recently, Endeavour has also been benchmarked using various disease marker sets and pathway maps103 to confirm that it performs very well if sufficient data is available for the disease or pathway of interest and the candidate genes. Furthermore, Li et al.57 proposed a discounted rating system, an algorithm for integrating multiple rank lists, and compared it with the rank aggregation procedure used by Endeavour.
Like Endeavour, the method MetaRanker59 also combines many heterogeneous data sources and forms separate evidence layers from SNP-to-phenotype associations, candidate protein interactions, linkage study data, quantitative disease similarity, and gene expression information. For each layer, all genes in the human genome are ranked with regard to their probability to be associated with the phenotype of interest. The overall score of a gene is the product of its rank scores for each layer. The evaluation of MetaRanker indicates that it is particularly suited to uncover associations in complex polygenic diseases and that the integration of multiple data layers improves the identification of weak contributions to the phenotype of interest in comparison to the use of only few data sources.
Another combination of network-based methods with score aggregation has been proposed by Chen et al.60 The authors generate an individual network for each data source and quantify potential gene–disease relationships in each network using a global network measure based on diffusion kernels. The final candidate ranking considers only the most informative network score for each candidate gene. Furthermore, an alternative way of integrating information from multiple data sources is the application of machine learning techniques. Here, each data source can be represented as one or more individual features and used as input for the training of supervised learning methods. In particular, support vector machines,61–63 decision-tree-based classifiers,64 and PU learning65 (machine learning from positive and unlabeled examples) have been applied to prioritize candidate disease genes using multiple data sources.
In contrast, one of the first alternative approaches that integrate information from multiple data sources into a network representation has been Prioritizer.24 Its authors constructed a comprehensive functional human gene network based on a number of datasets from molecular pathway and interaction databases such as KEGG,104 BIND,105 HPRD,106 Reactome107as well as from GO annotations,80 yeast-2-hybrid screens, gene expression experiments, and protein interaction predictions. In this network, positional candidates from different disease loci are ranked according to the length of the shortest paths between them. In functional networks as used by Prioritizer, the main assumption is that relevant genes are involved in specific disease-related pathways and cluster together in the network even if their products are not closely linked by physical protein interactions.
Building upon Prioritizer, several research groups have assembled different types of integrated networks as biological evidence for candidate disease gene prioritization. One example is the two-layered network by Li et al.48presented in the previous section that combines protein interactions and phenotypic similarity. Another method was presented by Linghu et al.66 who employed naïve Bayes integration of diverse functional genomics datasets to generate a weighted functional linkage network and to prioritize candidate genes based on their shortest path distance to known disease genes. Similarly, Huttenhower et al.67 incorporated information from several thousands genomic experiments to generate a functional relationship network. From this network, the authors could derive functional maps of different phenotypes and showed in a case study for macroautophagy that these maps can be used successfully to find novel gene associations.
Recently, Lee et al.68 also provided a large-scale human network of functional gene–gene associations and evaluated the performance of six different network-based methods using it. Similar to the findings by Navlakha and Kingsford,44 the authors concluded that the strongest overall performance is achieved with algorithms that account for the global network structure such as Google’s PageRank. A more general view of the relationships between phenotypes and genes is introduced by BioGraph,69 a heterogeneous network containing diverse biomedical entities and relations between them, which are extracted from over 20 publicly available databases. By computing random walks on this network, the authors aim at the automated generation of functional hypotheses between different concepts, in particular, of candidate genes and diseases.
EVALUATION AND BENCHMARKING OF PRIORITIZATION METHODS
To show the biological applicability and scientific value of disease gene prioritization methods, their authors are normally expected to conduct an extensive performance evaluation and, if possible, a thorough comparison with other methods. To this end, many authors usually benchmark disease phenotypes from OMIM. Depending on the requirements of their method, only phenotypes with at least two or three known disease genes may be suitable. Hence, the number of evaluated diseases can vary from tens to hundreds with hundreds to thousands corresponding genes. The range of disease phenotypes and genes, for which a given method is applicable, depends on the data used by the method. For instance, only about 10% of all human protein–protein interactions have probably been described so far,108only about 10% of all human genes have at least one known disease association,101 and only about every second gene or protein is functionally annotated.109
Leave-one-out cross-validation is a widely used and generally accepted test for how a method might perform on previously unseen data. In each run, one of the known disease genes, the so-called target disease gene, is removed from the training data. The remaining disease genes are used to identify the omitted gene from a test set of genes that are not known to be associated with the disease of interest. In the best case, the top rank should be assigned to the target disease gene and lower ranks to the other test genes. Since cross-validation is a standard performance test, a number of suitable measures of predictive power exist, for example, sensitivity and specificity, receiver operating characteristic (ROC) curve, precision and recall, enrichment and mean rank ratio (see Box 2). Unfortunately, none of these measures is considered as default, which renders the comparison between different methods of disease gene prioritization difficult. In particular, it would be useful to report the performance for the top-ranked candidate genes, e.g., the first 10 or 20 genes, because only a few candidates can usually be considered for further validation experiments.
Another important aspect of the benchmarking strategy is the choice of genes in the test set, i.e., the candidate genes that are prioritized together with the target disease gene. One usual input for prioritization methods is a set of susceptibility loci as determined by GWA studies. These loci typically contain up to several hundreds of possible disease genes. Therefore, different strategies have been followed by authors to derive useful test sets, i.e., the definition of artificial gene loci, the random selection of genes, the use of the whole genome, and the small-scale choice of genes.
Here, we briefly describe frequently used measures for evaluating the performance of disease gene prioritization methods. A simple measure is the mean rank ratio defined as the average of rank ratios for all tested disease genes.110 One speaks of n/m-fold enrichment on average if disease genes are ranked in the top m% of all genes in n% of the linkage intervals.47 Other performance measures are calculated using the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) at a specific rank or score cut-off that discriminates predicted from not predicted ones. Positives are disease genes, while negatives are candidate genes without disease association. For instance, sensitivity is the percentage of correctly identified disease genes among all genes above the cut-off (TP/(TP + FN)), while specificity is the percentage of correctly dismissed candidate genes among all genes below the cut-off (TN/(TN + FP)). Plotting sensitivity versus specificity while varying the cut-off yields a ROC curve. The area-under-the-ROC curve (AUC) is a standard measure for the overall performance of binary classification methods (here, disease genes vs. others). The AUC is 100% in case of perfect prioritization and 50% if the disease genes were ranked randomly. In some cases, the authors of prioritization methods give the percentage of disease genes ranked in the top 1% and 5% of all genes, which corresponds to reporting the sensitivity at 99% or 95% specificity, respectively. The percentage of correctly prioritized disease genes among all disease genes is defined as precision (TP/(TP + FP)), while recall is equal to sensitivity. Thus, the plot of a precision-recall curvecan also be used to evaluate method performance.44
Endeavour56 and several related methods61,103,111 were evaluated with a test set containing 99 candidate genes chosen at random from the whole genome in addition to the target disease gene. Other methods were also benchmarked with this strategy, primarily, in order to be comparable with Endeavour.26,41,47,58,60,110,112 However, since similar genes tend to cluster in chromosomal neighborhoods,113 another, presumably, more difficult setting for performance benchmarking and especially more relevant for GWA studies is the definition of artificial linkage intervals with genes that surround the disease gene on the chromosome.22,24,26,35,40,45,47,48,50,53,57,60,76,112,114 The size of such intervals, as found in the relevant literature, ranges from the 100 nearest genes to 300 genes on average if a 10 Mb genomic neighborhood is considered.26 The average gene number of linkage intervals associated with diseases according to OMIM is estimated to be 108.35
The third option for assembling a test set is the use of all genes in the genome except for the known disease genes in the training set.44,47–49,110This setting is chosen only by the few methods that are capable of performing genome-wide disease gene prioritization. Finally, prioritization methods that consider, for instance, gene expression data are evaluated only on a smaller scale because there is not enough data for a comprehensive benchmarking over many disease phenotypes.38,43,68,77,115–117 Therefore, the authors commonly choose only few diseases that have, for example, the required experimental data available.
In this review, we gave an overview of different approaches to the prioritization of candidate disease genes. We described how disease genes can be identified by their molecular characteristics based on sequence properties, functional annotations, and network information. In particular, we presented recent approaches, which make use of phenotypic information and comprehensive knowledge integration. Finally, we discussed common benchmarking strategies of prioritization methods.
Many disease gene prioritization methods exploit discriminative gene and protein properties and successfully rank candidate genes according to their functional and phenotypic similarity or network proximity to known disease genes. Further improvement of the prioritization performance can be achieved by integrating biological information contained in multiple data sources. Many integrative methods first combine heterogeneous datasets and then apply specific analysis techniques. However, in the course of such analysis, the very useful insight which data source provides the most relevant biological information for the prioritization is usually lost. Therefore, it is also beneficial to follow the alternative approach that first analyzes each data source separately using the most suitable techniques and then combines the resulting ranking lists using sophisticated rank aggregation algorithms. This procedure also facilitates backtracking the origin of the most relevant information.
Among the most widely used data sources for disease gene prioritization are functional annotations and protein interactions as well as phenotypic similarity. In particular, performance evaluations of methods such as Endeavour, MedSim, and Prioritizer demonstrated consistently that functional GO term annotations constitute by far one of the most useful biological evidence sources for candidate prioritization.24,26,56 Further performance gain can be attributed to comprehensive knowledge integration, which reduces the noise in the integrated data and provides additional information from data sources that is not captured (yet) by GO term annotations. Even more performance increase can be expected when the used data sources become more and more complete and exhibit high quality without significant bias toward intensively studied genes and proteins.
Currently, the multitude of benchmarking strategies pursued by different researchers considerably hampers the performance comparison of disease gene prioritization methods. Moreover, some methods make use of only small test datasets due to the lack of the required training data and the limited amount of known disease genes. Nevertheless, established procedures to derive test sets and the application of different standard performance measures should form part of every benchmarking strategy to evaluate new prioritization methods comprehensively with respect to other well-performing methods. To facilitate future performance comparisons, the training and test datasets should always be made publicly available together with the published work. In the end, since follow-up validation experiments tend to be expensive and time-consuming, it is vital that the correct disease genes are found on the few top ranks of the prioritization list.
Part of this study was financially supported by the BMBF through the German National Genome Research Network (NGFN) and the Greifswald Approach to Individualized Medicine (GANI_MED). The research was also conducted in the context of the DFG-funded Cluster of Excellence for Multimodal Computing and Interaction.