Recent approaches to the prioritization of candidate disease genes

Abstract

Many efforts are still devoted to the discovery of genes involved with specific phenotypes, in particular, diseases. High-throughput techniques are thus applied frequently to detect dozens or even hundreds of candidate genes. However, the experimental validation of many candidates is often an expensive and time-consuming task. Therefore, a great variety of computational approaches has been developed to support the identification of the most promising candidates for follow-up studies. The biomedical knowledge already available about the disease of interest and related genes is commonly exploited to find new gene–disease associations and to prioritize candidates. In this review, we highlight recent methodological advances in this research field of candidate gene prioritization. We focus on approaches that use network information and integrate heterogeneous data sources. Furthermore, we discuss current benchmarking procedures for evaluating and comparing different prioritization methods. WIREs Syst Biol Med 2012. doi: 10.1002/wsbm.1177

For further resources related to this article, please visit the WIREs website.

COMPLEX DISEASES AND THE IDENTIFICATION OF RELEVANT GENES

Many common diseases are complex and polygenic, involving dozens of human genes that might predispose to, be causative of, or modify the respective disease phenotype.1–3 This intricate interplay of disease genotypes and phenotypes still renders the identification of all relevant disease genes difficult.4–6 Therefore, a number of experimental techniques exist to discover disease genes. In particular, high-throughput methods such as genome-wide association (GWA) studies2,7,8 and large-scale RNA interference screens6,9,10 yield lists of up to hundreds of candidate disease genes. As validating the actual disease relevance of candidate genes in experimental follow-up studies is a time-consuming and expensive task, many methods and web services for the computational prioritization of candidate disease genes have already been developed.11–19

The concrete problem of candidate gene prioritization can be formulated as follows: Given a disease (or, more generally, a specific phenotype) of interest and a list of candidate genes, identify potential gene–disease associations by ranking the candidate genes in decreasing order of their relevance to the disease phenotype. When abstracting from the methodological details, the vast majority of computational approaches to this prioritization problem work in a similar manner. Most of them rely on the biological information already available for the disease phenotype of interest and the known, already verified, disease genes as well as for the additional candidate genes. In this context, functional information, particularly manually curated or automatically derived functional annotation, often provides strong evidence for establishing links between diseases and relevant genes and proteins.20–27 Many prioritization methods use protein interaction data as a rich information source for finding relationships between gene products of candidate genes and disease genes.11,16,18,25,28–45 In addition, the phenotypic similarity of diseases can help to increase the total number of known disease genes for less studied disease phenotypes.35,46–55 Other sources of biological information frequently used by prioritization approaches are sequence properties, gene expression data, molecular pathways, functional orthology between organisms, and relevant biomedical literature.12,14,19

These data then serve as input for statistical learning methods or are integrated into network representations, which are further analyzed by network scoring algorithms. Although individual data sources such as functional annotations or protein interactions provide quite powerful information for prioritizing candidate genes, the integration of multiple data sources has been reported to increase the performance even more.23–25,35,47–69 However, a generally accepted and consistent benchmarking strategy for all the diverse prioritization methods has not emerged yet, which complicates performance evaluation and comparison.

Therefore, this advanced review not only highlights recent prioritization approaches (published until the end of 2011), but also discusses different benchmarking strategies applied by authors and the need for standardized procedures for performance measurement. Other computational tasks such as the structural and functional interpretation as well as prioritization of disease-associated nucleotide and amino acid changes are not discussed in this article, but are reviewed elsewhere.3,70–74 In the following, the various prioritization methods are categorized according to the biological data and their representation that are primarily considered when scoring and ranking candidate disease genes: gene and protein characteristics, network information on molecular interactions, and integrated biomedical knowledge.

PRIORITIZATION METHODS USING GENE AND PROTEIN CHARACTERISTICS

The first computational approaches in the hunt for disease genes focused on molecular characteristics of disease genes, which discriminate them from non-disease genes. As described below, researchers developed methods related to individual gene and protein sequence properties75–77 as well as functional annotations of gene products.20–22,26,27,78 In principle, if a candidate satisfies certain characteristics as derived from known disease genes and proteins, its disease relevance is considered to be higher than otherwise.

Gene and Protein Sequence Properties

López-Bigas and Ouzounis75 derived several important characteristics of disease genes from the amino acid sequence of their gene products. In comparison with other proteins encoded in the human genome, disease proteins tend to be longer, to exhibit a wider phylogenetic extent, that is, to have more homologs in both vertebrates and invertebrates, to possess a low number of close paralogs, and to be more evolutionarily conserved. Using these sequence properties as input to a decision-tree algorithm, the researchers performed a genome-wide identification of genes involved in (hereditary) diseases.

Similarly, Adie et al.76 developed PROSPECTR, a method for candidate disease gene prioritization based on an alternating decision-tree algorithm. However, their approach examined a broader set of sequence features and thus produced a more successful classifier. In particular, Adie et al. found that disease genes tend to have different nucleotide compositions at the end of the sequence, a higher number of CpG islands at the 5′ end, and longer 3′ untranslated regions.

Functional Annotations

By demonstrating the strong correlation between gene and protein function and disease features, such as age of onset, Jimenez-Sanchez et al.78 motivated prioritization approaches that exploit the functional annotation of known disease genes for ranking candidates.20–22,26,27

Perez-Iratxeta et al.20 applied text mining on biomedical literature to relate disease phenotypes with functional annotations using Medical Subject Headings (MeSH)79 and Gene Ontology (GO) terms.80 They ranked the candidate genes according to the characteristic functional annotations shared with the disease of interest. In a similar fashion, Freudenberg and Propping21 identified candidate genes based on their annotated GO terms that are shared with groups of known disease genes associated with similar phenotypes. In contrast, the approach POCUS22 assesses the shared over-representation of functional annotation terms between genes in different loci for the same disease.

Recently, Schlicker et al.26 developed a prioritization method that makes use of the similarity between the functional annotations of disease genes and candidates. In contrast to the approaches that consider solely identical functional annotations or compute only GO term enrichments, MedSim automatically derives functional profiles for each disease phenotype from the GO term annotation of known disease genes and, optionally, of their orthologs or interaction partners. Candidate genes are then scored and ranked according to the functional similarity of their annotation profiles to a disease profile. In addition, Ramírez et al.27 introduced the BioSim method for discovering biological relationships between genes or proteins. While MedSim is based only on GO term annotations, BioSim quantifies functional gene and protein similarity according to multiple data sources of functional annotations and can also be applied to rank candidate genes based on their functional similarity to known disease genes.
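The idea of scoring candidates against a disease profile can be illustrated with a minimal Python sketch. It is not MedSim or BioSim: those methods rely on dedicated functional similarity measures, whereas this sketch substitutes plain Jaccard overlap of annotation sets as a stand-in. The disease profile is assumed to be the union of the terms of all known disease genes, and the term labels, gene names, and function names are hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Jaccard overlap of two annotation sets (stand-in for a functional similarity measure)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def rank_by_profile_similarity(disease_gene_terms, candidate_terms):
    """Rank candidates by the similarity of their annotations to a disease profile.

    Assumption of this sketch: the disease profile is the union of the
    annotation terms of all known disease genes.
    """
    profile = set().union(*disease_gene_terms.values())
    scores = {gene: jaccard(terms, profile) for gene, terms in candidate_terms.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical annotation terms; real input would be GO term identifiers.
known = {
    "D1": {"term_a", "term_b"},
    "D2": {"term_b", "term_c"},
}
candidates = {
    "C1": {"term_a", "term_b", "term_c"},
    "C2": {"term_c", "term_x"},
    "C3": {"term_y"},
}
ranking = rank_by_profile_similarity(known, candidates)
```

In this toy input, the candidate that shares all profile terms ranks first and the candidate without any shared term ranks last.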

The success of the presented studies also shows that phenotypically similar diseases often involve common molecular mechanisms and thus functionally related genes. This also explains the frequent use of functional annotations as important biological evidence in integrative prioritization approaches.23–25,56–58,61–66,68,69 Notably, the information value of functional annotations can be further increased by improved scoring of functional similarity, reaching the performance of complex integrative methods based on multiple data sources.26

PRIORITIZATION METHODS USING NETWORK INFORMATION

In the last decade, molecular interaction networks have become an indispensable tool and a valuable information source in the study of human diseases. Regarding methods for prioritization of candidate disease genes, it was repeatedly observed that protein interaction networks are among the most powerful data sources in addition to functional annotations.11,16,18,28,29,81 As in the case of sequence properties, disease genes and their products have discriminatory network properties that allow their distinction from non-disease genes. In particular, molecular interactions naturally support the application of the guilt-by-association principle to identify disease genes. In the following, we highlight a representative selection of network-based prioritization approaches.

Local Network Information

Early prioritization methods have focused on local network information such as the close network neighborhood of a node representing a candidate gene or protein (see Box 1). This can be explained by the observation that disease proteins tend to cluster and interact with each other.30,49,82–84 Molecular triangulation is one of the first methods that used protein interaction networks to rank candidates and their network nodes with respect to their shortest path distances to nodes of known disease proteins.31 An evidence score such as the MLS score corresponding to linkage peak association85 is assigned to each disease protein node and transferred to its neighbor nodes. The candidates are then ranked according to the accumulated sum of evidence scores. This means that candidates represented by nodes close to several disease protein nodes with good evidence scores are considered to be the most promising ones.
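The accumulation of evidence scores can be sketched in a few lines of Python. This is a deliberately reduced illustration, not the published molecular triangulation procedure: it assumes that each disease protein passes its full evidence score to its direct neighbors only, without any distance weighting. The network, the score values, and the function name are hypothetical.

```python
def accumulate_evidence(adj, evidence):
    """Sum, for every non-disease node, the evidence scores of adjacent disease nodes.

    adj: dict mapping each node to the set of its neighbors (undirected network).
    evidence: dict mapping each known disease protein to its evidence score.
    """
    totals = {}
    for disease_node, score in evidence.items():
        for neighbor in adj.get(disease_node, ()):
            if neighbor in evidence:
                continue  # known disease proteins are not ranked as candidates
            totals[neighbor] = totals.get(neighbor, 0.0) + score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy network: D* are known disease proteins, C* are candidates.
adj = {
    "D1": {"C1", "C2"},
    "D2": {"C1"},
    "D3": {"C3"},
    "C1": {"D1", "D2"},
    "C2": {"D1"},
    "C3": {"D3"},
}
evidence = {"D1": 2.5, "D2": 3.0, "D3": 1.0}  # made-up evidence scores
evidence_ranking = accumulate_evidence(adj, evidence)
```

Candidate C1 ranks first here because it neighbors two disease proteins with comparatively high evidence scores.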

Box 1

NETWORK MEASURES

Local network information refers to the topological neighborhood of a node, and corresponding measures are less sensitive to the overall network topology. Examples are the node degree $k_n$ (number of edges linked to node $n$) and the shortest path length $d_{nm}$ (minimum number of edges between the nodes $n$ and $m$). In disease gene networks, Xu and Li34 define, for each node $n$, the 1N index $\mathrm{1N}_n = k_n^{D} / k_n$ and the 2N index $\mathrm{2N}_n = \frac{1}{|N_n|} \sum_{m \in N_n} \mathrm{1N}_m$. Here, $k_n^{D}$ is the number of edges between node $n$ and disease genes, and $N_n$ is the set of direct neighbors of $n$. Given the set of disease genes $M$, the average shortest path distance of a node $n$ to disease genes is $\bar{d}_n = \frac{1}{|M|} \sum_{m \in M} d_{nm}$.

Global network information relates to the overall network topology and measures that characterize the role of a node in the whole network. Common centrality measures are shortest path closeness and betweenness as well as random-walk-related properties such as hitting time, visit frequency and stationary distribution. For instance, closeness centrality indicates how distant a node is to the other network nodes, and it is calculated from the shortest path lengths $d_{nm}$ between $n$ and the nodes $m \in V_n$, with $V_n$ denoting the set of nodes reachable from $n$. The random walk with restart40 is defined as $p_{t+1} = (1 - r) W p_t + r p_0$. Here, $W$ is the column-normalized adjacency matrix of the network, $p_t$ contains the probability of being at each node at time step $t$, $p_0$ denotes the initial probability vector, and $r$ is the restart probability. The steady-state probability vector $p$ can be obtained by performing iterations until the change between $p_t$ and $p_{t+1}$ falls below some significance threshold, e.g., $10^{-6}$.40

In a related approach, Karni et al.32 identified the minimal set of candidates so that there is a path between the products of known disease genes in the protein interaction network. Oti et al.33 proposed an even simpler method for a genome-wide prediction of disease genes. For each known disease protein, they identified its interaction partners and the chromosomal locations of the encoding genes. A gene is then considered to be relevant for a disease of interest if it resides within a known disease locus and its gene product shares an interaction with a protein known to be associated with the same disease.
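The rule described for Oti et al. is simple enough to express directly. The following Python sketch assumes that genes and their products share one identifier and that locus membership is already available as a set; the identifiers, the interaction list, and the function name are hypothetical.

```python
def predict_locus_genes(interactions, disease_proteins, locus_genes):
    """Return genes inside the disease locus whose product interacts with a known disease protein.

    interactions: iterable of (protein_a, protein_b) pairs, treated as undirected.
    disease_proteins: set of proteins already associated with the disease.
    locus_genes: set of genes located within a known locus of the same disease.
    """
    predicted = set()
    for a, b in interactions:
        if a in disease_proteins and b in locus_genes:
            predicted.add(b)
        if b in disease_proteins and a in locus_genes:
            predicted.add(a)
    # Genes that are already known disease genes are not new predictions.
    return sorted(predicted - set(disease_proteins))


# Hypothetical toy data.
interactions = [("D1", "G1"), ("G2", "G3"), ("G4", "D2"), ("D1", "G5")]
disease_proteins = {"D1", "D2"}
locus_genes = {"G1", "G2", "G4"}
predicted = predict_locus_genes(interactions, disease_proteins, locus_genes)
```

G2 lies in the locus but has no disease interaction partner, and G5 interacts with D1 but lies outside the locus, so neither is predicted.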

To make the most out of the potential of local network measures, Xu and Li34 computed multiple topological properties for three different molecular networks consisting of literature-curated, experimentally derived, and predicted protein–protein interactions. The considered properties are the node degree, the average distance to known disease genes, the 1N and 2N node indices (see Box 1 and Figure 1), and the positive topological coefficient.86 The authors then trained a k-nearest-neighbor classifier using the aforementioned topological properties of known disease genes and achieved comparable performance for all three networks. They also detected a possible bias in the literature-curated network because disease genes tend to be studied more extensively.
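Four of these local properties can be computed with a short Python sketch. It rests on assumptions that should be checked against the original publication: the 1N index is taken as the fraction of a node's edges that lead to disease genes, the 2N index as the mean 1N index over the node's direct neighbors, and the average distance as the mean shortest path length to the reachable disease genes. The toy network and all names are hypothetical, and the positive topological coefficient is omitted.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths (edge counts) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def one_n(adj, disease_genes, node):
    """Assumed 1N index: fraction of the node's edges that lead to disease genes."""
    neighbors = adj[node]
    if not neighbors:
        return 0.0
    return sum(1 for m in neighbors if m in disease_genes) / len(neighbors)


def local_measures(adj, disease_genes, node):
    """Degree, 1N index, 2N index, and average distance to disease genes for one node."""
    neighbors = adj[node]
    degree = len(neighbors)
    # Assumed 2N index: mean 1N index over the direct neighbors.
    two_n = sum(one_n(adj, disease_genes, m) for m in neighbors) / degree if degree else 0.0
    dist = bfs_distances(adj, node)
    reachable = [dist[m] for m in disease_genes if m in dist and m != node]
    avg_dist = sum(reachable) / len(reachable) if reachable else float("inf")
    return {"degree": degree, "1N": one_n(adj, disease_genes, node),
            "2N": two_n, "avg_dist": avg_dist}


# Hypothetical toy network: D* are disease genes, C* candidates, X another gene.
adj = {
    "C1": {"D1", "D2", "X"},
    "C2": {"X"},
    "D1": {"C1"},
    "D2": {"C1", "X"},
    "X": {"C1", "C2", "D2"},
}
disease = {"D1", "D2"}
c1 = local_measures(adj, disease, "C1")
c2 = local_measures(adj, disease, "C2")
```

In this toy network, C1 has degree 3, a 1N index of 2/3, and an average distance of 1.0 to the disease genes, while C2 has degree 1, a 1N index of 0, and an average distance of 2.5. Feature vectors of this kind could then be passed to any classifier, such as the k-nearest-neighbor method mentioned above.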

Figure 1.

Exemplary molecular network of candidate genes and known disease genes. Red nodes represent known disease genes, and green nodes correspond to candidate genes. For candidate genes C1 and C2, the table lists the node degree, the 1N and 2N indices, and the average network distance to disease genes (see also Box 1).

In addition to local network information, Lage et al.35 incorporated phenotypic data into the disease gene prioritization. Each candidate and its direct interaction partners are considered as a candidate complex. All disease proteins in a candidate complex are assigned phenotypic similarity scores, which are used as input to a Bayesian predictor. Thus, a candidate gene obtains a high score if the other proteins in the complex are involved in phenotypes very similar to the disease of interest. Care et al.36 elaborated on this approach by combining it with deleterious SNP predictions for the candidate gene products and their interaction partners. Using the method by Lage et al., Berchtold et al.37 successfully prioritized proteins associated with type 1 diabetes (T1D). Further studies of protein interaction networks underlying specific diseases such as breast cancer38 and T1D39 also deal with the application of similar network-based prioritization approaches.

Global Network Information

Beyond local network information that ignores potential network-mediated effects from distant nodes, the utilization of global network measures can considerably improve the performance of prioritization methods for candidate disease genes.40,41,44 Especially for the study of polygenic diseases, network topology analysis can provide more insight into multiple paths of long-range protein interactions and their impact on the functionality and interplay of disease genes.

Random-Walk Measures

Köhler et al.40 demonstrated that random-walk analysis of protein–protein interaction networks outperforms local network-based methods such as shortest path distances and direct interactions as well as sequence-based methods like PROSPECTR.76 In their method, the authors ranked the gene products in a given network according to the steady-state probability of a random walk, which starts at known disease proteins and can restart with a predefined probability (see Box 1). Although the ranking criterion is the proximity of candidates to known disease proteins, this approach is more discriminative than local measures because it accounts for the global network structure.
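The random walk with restart from Box 1 can be sketched in plain Python as follows. The update implements p(t+1) = (1 − r)·W·p(t) + r·p(0) with a column-normalized adjacency matrix, expressed here as each node distributing its probability evenly among its neighbors. Several details are assumptions of this sketch rather than of the original method: the restart probability of 0.3, the uniform start distribution over the seed nodes, the L1 norm as the measure of change, and the toy network.

```python
def random_walk_with_restart(adj, seeds, r=0.3, tol=1e-6, max_iter=1000):
    """Iterate p_next = (1 - r) * W * p + r * p0 until the L1 change drops below tol.

    adj: dict mapping each node to the set of its neighbors (undirected network).
    seeds: known disease proteins; the walk starts and restarts at these nodes.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: r * p0[n] for n in nodes}
        for n in nodes:
            degree = len(adj[n])
            if degree == 0:
                continue
            # Column normalization: node n splits its probability among its neighbors.
            share = (1.0 - r) * p[n] / degree
            for neighbor in adj[n]:
                nxt[neighbor] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical toy network: D* are known disease proteins, C* candidates.
adj = {
    "C1": {"D1", "D2", "X"},
    "C2": {"X"},
    "D1": {"C1"},
    "D2": {"C1", "X"},
    "X": {"C1", "C2", "D2"},
}
seeds = {"D1", "D2"}
scores = random_walk_with_restart(adj, seeds)
candidate_ranking = sorted((n for n in adj if n not in seeds),
                           key=lambda n: scores[n], reverse=True)
```

Candidates are ranked by their steady-state probability, so C1, which interacts directly with both seed proteins, is placed ahead of C2, which is only reachable via X.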

In a similar manner, Chen et al.41 adapted three sophisticated algorithms from social network and web analysis for the problem of disease gene prioritization. To this end, they analyzed a protein–protein interaction network using modified versions of the random-walk-based methods PageRank,87,88 Hyperlink-Induced Topic Search (HITS),88,89 and K-Step Markov method (KSMM).88 PageRank, HITS, and KSMM consider the global network topology and compute the relevance of all nodes representing candidates with regard to the set of known disease proteins in the network. All three methods achieved comparable performance to each other.

To address the issue of finding causal genes within expression quantitative trait loci (eQTL),90,91 Suthram et al.42 introduced the eQTL electrical diagrams method (eQED). They modeled confidence weights of protein interactions as conductances, while the P-values of associations between genetic loci and the expression of candidate genes served as current sources. The best candidate gene is the one passed by the highest current. The currents in electric circuits can be determined efficiently using random-walk computations.92,93

Network Centrality Measures

For many years, global centrality measures such as closeness or betweenness have been used in social sciences to assess how important individual nodes are for the overall network connectivity. Recently, such measures have been applied to several problems in bioinformatics including disease gene prioritization. For example, Dezső et al.43 applied an adapted version of shortest path betweenness to prioritize candidates in a protein–protein interaction network. A candidate is scored more relevant to the disease of interest if it lies on significantly more shortest paths connecting nodes of known disease proteins than other nodes in the network.
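A betweenness score restricted to shortest paths between known disease proteins can be sketched as follows. This is an illustration of the general idea only: it sums, for every candidate node, the fraction of shortest paths between each pair of disease nodes that pass through it, and it omits the statistical significance assessment mentioned above. The toy network and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return shortest path lengths and numbers of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                sigma[neighbor] = 0
                queue.append(neighbor)
            if dist[neighbor] == dist[node] + 1:
                sigma[neighbor] += sigma[node]
    return dist, sigma


def disease_betweenness(adj, disease_nodes):
    """Sum over disease pairs of the fraction of shortest paths passing through each candidate."""
    info = {s: bfs_counts(adj, s) for s in disease_nodes}
    scores = {v: 0.0 for v in adj if v not in disease_nodes}
    for s, t in combinations(sorted(disease_nodes), 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue  # the two disease nodes are not connected
        for v in scores:
            on_path = v in dist_s and v in dist_t and dist_s[v] + dist_t[v] == dist_s[t]
            if on_path:
                scores[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return scores


# Hypothetical toy network: D* are known disease proteins, C* candidates.
adj = {
    "D1": {"C1", "C2"},
    "C1": {"D1", "D2"},
    "C2": {"D1", "D2"},
    "D2": {"C1", "C2", "C3"},
    "C3": {"D2", "D3", "C4"},
    "D3": {"C3"},
    "C4": {"C3"},
}
betweenness = disease_betweenness(adj, {"D1", "D2", "D3"})
```

C3 obtains the highest score because every shortest path to D3 passes through it, C1 and C2 share the paths between D1 and the other disease proteins, and the peripheral node C4 lies on none of them.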

In a recent case study on primary immunodeficiencies (PIDs), Ortutay and Vihinen25 integrated functional GO annotations with protein interaction networks to discover novel PID genes. The authors conducted a topological analysis on an immunome network consisting of all essential proteins related to the human immune system and their interactions. In particular, they used the node degree as well as the global centrality measures of vulnerability and closeness to assess the importance of candidate genes in the network (see Box 1). Additionally, they performed functional enrichment analysis to determine genes with PID-related GO terms. With some modifications, the described prioritization method could be generalized to other diseases of interest.

Combining Network Measures

Recently, Navlakha and Kingsford44 compared different network-based prioritization methods. The authors observed that random-walk-based measures40 outperform measures focused on the local network neighborhood33 or clustering.94–96 A consensus method that uses a random-forest classifier to combine all methods yielded the most accurate ranking. Therefore, apart from stressing the potential of protein interaction data, Navlakha and Kingsford also showed that disease gene prioritization can benefit from the integration of multiple information sources.

In summary, as many other studies have also demonstrated, molecular interaction networks, in particular, based on protein interactions, provide valuable biological knowledge for ranking candidate disease genes.11,16,18,28,29 It has also become clear that global network measures achieve better results in comparison to local measures.40,41,44 Nevertheless, the performance of such prioritization approaches depends heavily on the quality of the network data. Protein interaction data are well known to be biased toward extensively studied proteins and subject to inherent noise.34,97,98 Therefore, it is often suggested that existing methods will perform better when more accurate data become available. Furthermore, Erten et al.45 pointed out that network-based methods can also be improved by integrating statistical adjustments for the skewed degree distribution of protein interaction networks.

PRIORITIZATION METHODS USING INTEGRATED KNOWLEDGE

Network information on molecular interactions as well as individual gene and protein characteristics such as sequence properties and functional annotations are major sources of biological evidence for scoring and ranking candidate disease genes. However, a prioritization approach based on a single information source alone usually achieves only limited performance due to noisy and incomplete datasets. To address this problem, the integration of multiple sources of biological knowledge has proven to be a good solution in bioinformatics. Different types of data can complement each other well to increase the amount of available information and its overall quality. While some of the methods presented above already make successful use of relatively simple integration procedures for a few different sources of functional information and annotations, this section will focus on more sophisticated methods for knowledge integration and the prioritization of candidate disease genes.

Complementing Molecular Interactions with Phenotypic Network Information

In recent years, several groups investigated the similarities and differences between disease phenotypes. The main finding was that similar phenotypes often share underlying genes or even pathways.46,99,100 In particular, van Driel et al.46 classified all human phenotypes contained in the Online Mendelian Inheritance in Man database (OMIM)101 by defining a measure of phenotypic similarity based on text mining of the corresponding OMIM records. Such phenotypic knowledge can be very useful to discover new potential disease genes by transferring known gene–phenotype associations to similar diseases and phenotypes.

Therefore, phenotypic similarity has become another major data source exploited by computational methods for prioritization of candidate disease genes.35,47–55 In this context, a two-layered heterogeneous data network is typically constructed so that the phenome layer consists of connections between similar phenotypes, while the interactome layer contains protein–protein interactions. The two network layers are then linked by known gene–phenotype associations.

To demonstrate the importance of the additional phenotype network layer for identifying novel gene–phenotype associations and disease–disease relationships, Li et al.48 extended the random-walk algorithm used by Köhler et al.,40 as described in the previous section, to heterogeneous networks. Both the candidate genes and the disease phenotypes are prioritized simultaneously. In contrast, Yao et al.49 estimated the closeness of a candidate gene to a disease of interest by computing the hitting time of a random walk that starts at the corresponding disease phenotype and ends at the candidate. This approach also allows the genome-wide identification of potential disease genes for phenotypic disease subtypes.

Chen et al.50 reformulated the candidate gene prioritization problem as a maximum flow problem on a heterogeneous network. They represented the capacities of connections between phenotypes by their phenotypic similarity. Capacities on edges within the interactome and on edges bridging the phenome and interactome were estimated during the evaluation procedure. By calculating a maximum flow from a phenotype of interest through the interactome, the authors ranked candidate genes with regard to the amount of efflux.

A computationally simpler approach based on the same network type was suggested by Guo et al.,51 who computed the association score between a gene and a disease as the weighted sum of all association scores between similar diseases and between neighboring genes in the interaction layer. To this end, the authors formulated an iterative matrix multiplication of disease–gene–association matrices and disease-similarity matrices corresponding to the network structure. While the maximum flow problem solved by Chen et al.50 already accounts for the phenotypic overlap between diseases, the approach by Guo et al.51 additionally considers the genetic overlap of diseases. The recent PhenomeNET52 is even a cross-species network of phenotypic similarities between genotypes and diseases based on a uniform representation of different phenotype and anatomy ontologies. In particular, it can be used to perform whole-phenome discovery of genes for diseases with unknown molecular basis.

Two other studies used iterative network flow propagation on a heterogeneous network to identify protein complexes related to disease. Vanunu et al.53 developed a prioritization method that propagates flow from a phenotype of interest through the whole network and identifies dense subnetworks around high-scored genes as potential phenotype-related protein complexes. In contrast, Yang et al.54 modified the described heterogeneous network and included an additional layer of protein complexes. In the resulting network, phenotypes are connected to protein complexes, and complexes are linked with each other according to the protein interactions in the interactome layer. The method derives novel gene–phenotype associations by propagating the network flow within the protein complex layer.

Disease gene prioritization methods usually rank candidate genes relative to a phenotype of interest. However, the discovery of gene–phenotype associations can also be approached the other way around. Hwang et al.55 devised a method to identify the phenotype that could result from a given set of candidate genes. For that purpose, the authors considered a gene network and a phenotype similarity network. In both networks, the nodes were ranked separately with graph Laplacian scores, and a rank coherence was calculated from the score differences between genes and phenotypes connected by known associations. Hwang et al. showed that their approach is suitable to predict the resulting phenotype for a given set of candidate genes.

Integrating Heterogeneous Data Sources of Biological Knowledge

Two distinct approaches to disease gene prioritization that exploit multiple data sources are exemplarily highlighted in the following (Figure 2). The first approach considers each data source separately when assessing the molecular and phenotypic relationships of candidate genes with the disease of interest, and aggregates the resulting multiple ranking lists into a final ranking of the candidates. The alternative approach combines all biological information into a network representation and subsequently applies network measures to score and rank candidates with regard to their network proximity to nodes representing known disease genes.

Figure 2.

Integrative approaches to disease gene prioritization. The typical workflow of integrative prioritization approaches based on multiple data sources consists of three major steps. The first step involves preparing the input data consisting of two different sets of genes, the known disease genes and the candidate genes. For each gene, further biomedical knowledge is retrieved from various data sources such as functional annotations from the Gene Ontology and molecular pathways from the KEGG database. In the second step, the collected information is integrated using a network representation (top) or evaluated individually for each data source, resulting in different ranking lists (bottom). The third step computes a final ranking list of candidate genes based on network measures or rank aggregation. The candidate genes are thus prioritized by their relevance to the disease of interest.

In detail, the prioritization method Endeavour56,102 utilizes more than 20 data sources such as ontologies and functional annotations, protein–protein interactions, cis-regulatory information, gene expression data, sequence information, and text-mining results. For each data source, candidate genes are first ranked separately based on their similarity to a profile derived from known disease genes. Afterwards, all individual candidate rankings are merged into a final overall ranking using rank order statistics. The authors showed that this approach is quite successful in finding potential disease genes as well as genes involved in specific pathways or biological functions. Recently, Endeavour has also been benchmarked using various disease marker sets and pathway maps103 to confirm that it performs very well if sufficient data is available for the disease or pathway of interest and the candidate genes. Furthermore, Li et al.57 proposed a discounted rating system, an algorithm for integrating multiple rank lists, and compared it with the rank aggregation procedure used by Endeavour.

Like Endeavour, the method MetaRanker59 also combines many heterogeneous data sources and forms separate evidence layers from SNP-to-phenotype associations, candidate protein interactions, linkage study data, quantitative disease similarity, and gene expression information. For each layer, all genes in the human genome are ranked with regard to their probability to be associated with the phenotype of interest. The overall score of a gene is the product of its rank scores for each layer. The evaluation of MetaRanker indicates that it is particularly suited to uncover associations in complex polygenic diseases and that the integration of multiple data layers improves the identification of weak contributions to the phenotype of interest in comparison to the use of only few data sources.
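The layer-wise combination by multiplying rank scores can be illustrated with a small Python sketch. How MetaRanker derives its rank scores is not specified here, so the sketch assumes a rank score of (N − i)/N for the gene at zero-based position i among N genes, which gives the top-ranked gene a score of 1. It also assumes that every layer covers the same genes without ties. Layer names, gene names, and evidence values are hypothetical.

```python
def rank_scores(layer):
    """Map raw evidence values to rank scores in (0, 1]; the best gene receives 1.0."""
    ordered = sorted(layer, key=layer.get, reverse=True)
    n = len(ordered)
    return {gene: (n - position) / n for position, gene in enumerate(ordered)}


def combine_layers(layers):
    """Multiply the per-layer rank scores of each gene and return the genes best first."""
    genes = set.intersection(*(set(layer) for layer in layers.values()))
    combined = {gene: 1.0 for gene in genes}
    for layer in layers.values():
        scores = rank_scores({gene: layer[gene] for gene in genes})
        for gene in genes:
            combined[gene] *= scores[gene]
    ranking = sorted(combined, key=combined.get, reverse=True)
    return ranking, combined


# Hypothetical evidence layers with made-up values (higher means stronger evidence).
layers = {
    "ppi": {"G1": 0.9, "G2": 0.4, "G3": 0.1},
    "expression": {"G1": 0.2, "G2": 0.8, "G3": 0.5},
    "linkage": {"G1": 0.7, "G2": 0.6, "G3": 0.3},
}
final_ranking, combined = combine_layers(layers)
```

Because the scores are multiplied, a gene with consistently good ranks (G2) can overtake a gene that ranks first in two layers but last in a third (G1).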

Another combination of network-based methods with score aggregation has been proposed by Chen et al.60 The authors generate an individual network for each data source and quantify potential gene–disease relationships in each network using a global network measure based on diffusion kernels. The final candidate ranking considers only the most informative network score for each candidate gene. Furthermore, an alternative way of integrating information from multiple data sources is the application of machine learning techniques. Here, each data source can be represented as one or more individual features and used as input for the training of supervised learning methods. In particular, support vector machines,61–63 decision-tree-based classifiers,64 and PU learning65 (machine learning from positive and unlabeled examples) have been applied to prioritize candidate disease genes using multiple data sources.

In contrast, one of the first alternative approaches to integrate information from multiple data sources into a network representation was Prioritizer.24 Its authors constructed a comprehensive functional human gene network based on a number of datasets from molecular pathway and interaction databases such as KEGG,104 BIND,105 HPRD,106 and Reactome107 as well as from GO annotations,80 yeast-2-hybrid screens, gene expression experiments, and protein interaction predictions. In this network, positional candidates from different disease loci are ranked according to the length of the shortest paths between them. In functional networks as used by Prioritizer, the main assumption is that relevant genes are involved in specific disease-related pathways and cluster together in the network even if their products are not closely linked by physical protein interactions.

Building upon Prioritizer, several research groups have assembled different types of integrated networks as biological evidence for candidate disease gene prioritization. One example is the two-layered network by Li et al.48 presented in the previous section that combines protein interactions and phenotypic similarity. Another method was presented by Linghu et al.66 who employed naïve Bayes integration of diverse functional genomics datasets to generate a weighted functional linkage network and to prioritize candidate genes based on their shortest path distance to known disease genes. Similarly, Huttenhower et al.67 incorporated information from several thousand genomic experiments to generate a functional relationship network. From this network, the authors could derive functional maps of different phenotypes and showed in a case study for macroautophagy that these maps can be used successfully to find novel gene associations.

Recently, Lee et al.68 also provided a large-scale human network of functional gene–gene associations and evaluated the performance of six different network-based methods using it. Similar to the findings by Navlakha and Kingsford,44 the authors concluded that the strongest overall performance is achieved with algorithms that account for the global network structure such as Google’s PageRank. A more general view of the relationships between phenotypes and genes is introduced by BioGraph,69 a heterogeneous network containing diverse biomedical entities and relations between them, which are extracted from over 20 publicly available databases. By computing random walks on this network, the authors aim at the automated generation of functional hypotheses between different concepts, in particular, of candidate genes and diseases.
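To make the idea of a global network measure concrete, here is a small sketch of a random walk with restart, a close relative of personalized PageRank. It is not the algorithm of any of the cited methods; the restart probability, the number of iterations, and the toy network are assumptions chosen for illustration.

```python
def random_walk_with_restart(graph, seeds, restart=0.3, iterations=100):
    """Approximate the stationary visiting probabilities of a walker that
    returns to the seed genes with probability `restart` at every step."""
    nodes = sorted(graph)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            neighbours = graph[n]
            if not neighbours:
                # A gene without edges sends its probability back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - restart) * prob[n] / len(seeds)
                continue
            share = (1 - restart) * prob[n] / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        prob = nxt
    return prob

# Hypothetical network: a chain S - A - B - C with one known disease gene S.
graph = {"S": ["A"], "A": ["S", "B"], "B": ["A", "C"], "C": ["B"]}
seeds = {"S"}
scores = random_walk_with_restart(graph, seeds)
candidates = sorted((g for g in graph if g not in seeds), key=scores.get, reverse=True)
print(candidates)
```

Unlike a plain shortest-path distance, the resulting score takes all paths between a candidate and the known disease genes into account, which is what distinguishes global from local network measures.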

EVALUATION AND BENCHMARKING OF PRIORITIZATION METHODS

To show the biological applicability and scientific value of disease gene prioritization methods, their authors are normally expected to conduct an extensive performance evaluation and, if possible, a thorough comparison with other methods. To this end, authors usually benchmark their methods on disease phenotypes from OMIM. Depending on the requirements of their method, only phenotypes with at least two or three known disease genes may be suitable. Hence, the number of evaluated diseases can vary from tens to hundreds with hundreds to thousands of corresponding genes. The range of disease phenotypes and genes, for which a given method is applicable, depends on the data used by the method. For instance, only about 10% of all human protein–protein interactions have probably been described so far,108 only about 10% of all human genes have at least one known disease association,101 and only about every second gene or protein is functionally annotated.109

Leave-one-out cross-validation is a widely used and generally accepted test for how a method might perform on previously unseen data. In each run, one of the known disease genes, the so-called target disease gene, is removed from the training data. The remaining disease genes are used to identify the omitted gene from a test set of genes that are not known to be associated with the disease of interest. In the best case, the top rank should be assigned to the target disease gene and lower ranks to the other test genes. Since cross-validation is a standard performance test, a number of suitable measures of predictive power exist, for example, sensitivity and specificity, receiver operating characteristic (ROC) curve, precision and recall, enrichment and mean rank ratio (see Box 2). Unfortunately, none of these measures is considered as default, which renders the comparison between different methods of disease gene prioritization difficult. In particular, it would be useful to report the performance for the top-ranked candidate genes, e.g., the first 10 or 20 genes, because only a few candidates can usually be considered for further validation experiments.
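The leave-one-out procedure can be sketched in a few lines. The scoring function below (number of annotation terms shared with the remaining disease genes) is only a placeholder assumption; any prioritization method could be plugged in. All gene names and annotation terms are hypothetical.

```python
def leave_one_out_ranks(disease_genes, test_genes, score_fn):
    """Hide each known disease gene in turn and report the rank it obtains
    among the test genes when scored against the remaining disease genes."""
    ranks = {}
    for target in disease_genes:
        training = [g for g in disease_genes if g != target]
        candidates = [target] + [g for g in test_genes if g not in disease_genes]
        scores = {g: score_fn(g, training) for g in candidates}
        # Higher score is better; ties are broken by gene name.
        ordered = sorted(candidates, key=lambda g: (-scores[g], g))
        ranks[target] = ordered.index(target) + 1
    return ranks

# Hypothetical functional annotations.
annotations = {
    "D1": {"t1", "t2"}, "D2": {"t1", "t2", "t3"}, "D3": {"t3", "t6"},
    "C1": {"t4"}, "C2": {"t2", "t3"}, "C3": {"t5"},
}

def shared_terms(gene, training):
    """Placeholder score: annotation terms shared with the training genes."""
    return sum(len(annotations[gene] & annotations[t]) for t in training)

ranks = leave_one_out_ranks(["D1", "D2", "D3"], ["C1", "C2", "C3"], shared_terms)
print(ranks)
```

In this toy example, only D2 reaches the top rank; D1 and D3 are outranked by the well-annotated candidate C2, which shows why reporting the ranks of all held-out genes is more informative than a single summary number.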

Another important aspect of the benchmarking strategy is the choice of genes in the test set, i.e., the candidate genes that are prioritized together with the target disease gene. One usual input for prioritization methods is a set of susceptibility loci as determined by GWA studies. These loci typically contain up to several hundreds of possible disease genes. Therefore, different strategies have been followed by authors to derive useful test sets, i.e., the definition of artificial gene loci, the random selection of genes, the use of the whole genome, and the small-scale choice of genes.

Box 2

PERFORMANCE MEASURES

Here, we briefly describe frequently used measures for evaluating the performance of disease gene prioritization methods. A simple measure is the mean rank ratio defined as the average of rank ratios for all tested disease genes.110 One speaks of n/m-fold enrichment on average if disease genes are ranked in the top m% of all genes in n% of the linkage intervals.47 Other performance measures are calculated using the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) at a specific rank or score cut-off that separates predicted disease genes from the remaining candidates. Positives are disease genes, while negatives are candidate genes without disease association. For instance, sensitivity is the percentage of correctly identified disease genes among all disease genes (TP/(TP + FN)), while specificity is the percentage of correctly dismissed candidate genes among all candidate genes without disease association (TN/(TN + FP)). Plotting sensitivity versus 1 − specificity while varying the cut-off yields a ROC curve. The area-under-the-ROC curve (AUC) is a standard measure for the overall performance of binary classification methods (here, disease genes vs. others). The AUC is 100% in case of perfect prioritization and 50% if the disease genes were ranked randomly. In some cases, the authors of prioritization methods give the percentage of disease genes ranked in the top 1% and 5% of all genes, which corresponds to reporting the sensitivity at 99% or 95% specificity, respectively. The percentage of disease genes among all genes ranked above the cut-off is defined as precision (TP/(TP + FP)), while recall is equal to sensitivity. Thus, the plot of a precision-recall curve can also be used to evaluate method performance.44
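The formulas in Box 2 can be written down directly. The sketch below assumes that a prioritization result is given as a list of genes ordered from best to worst and that every disease gene appears in that list; the function names and the example ranking are hypothetical.

```python
def confusion_counts(ranking, disease_genes, cutoff):
    """Count TP, FP, FN, TN when the top `cutoff` genes are predicted."""
    positives = set(disease_genes)
    predicted = set(ranking[:cutoff])
    tp = len(predicted & positives)
    fp = len(predicted - positives)
    fn = len(positives - predicted)
    tn = len(ranking) - tp - fp - fn
    return tp, fp, fn, tn

def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn)

def specificity(tp, fp, fn, tn):
    return tn / (tn + fp)

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def mean_rank_ratio(ranking, disease_genes):
    """Average of rank / number of ranked genes over all disease genes."""
    ratios = [(ranking.index(g) + 1) / len(ranking) for g in disease_genes]
    return sum(ratios) / len(ratios)

def auc(ranking, disease_genes):
    """AUC as the fraction of (disease gene, other gene) pairs in which
    the disease gene is ranked above the other gene."""
    positives = set(disease_genes)
    n_pos = sum(1 for g in ranking if g in positives)
    n_neg = len(ranking) - n_pos
    negatives_below = n_neg
    correct_pairs = 0
    for gene in ranking:
        if gene in positives:
            correct_pairs += negatives_below
        else:
            negatives_below -= 1
    return correct_pairs / (n_pos * n_neg)

ranking = ["D1", "C1", "D2", "C2", "C3", "C4", "C5", "C6", "C7", "C8"]
disease_genes = ["D1", "D2"]
counts = confusion_counts(ranking, disease_genes, cutoff=2)
print(counts, sensitivity(*counts), specificity(*counts), precision(*counts))
print(mean_rank_ratio(ranking, disease_genes), auc(ranking, disease_genes))
```

The pairwise formulation of the AUC used here is equivalent to the area under the ROC curve as long as there are no tied scores.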

Endeavour56 and several related methods61,103,111 were evaluated with a test set containing 99 candidate genes chosen at random from the whole genome in addition to the target disease gene. Other methods were also benchmarked with this strategy, primarily, in order to be comparable with Endeavour.26,41,47,58,60,110,112 However, since similar genes tend to cluster in chromosomal neighborhoods,113 another, presumably, more difficult setting for performance benchmarking and especially more relevant for GWA studies is the definition of artificial linkage intervals with genes that surround the disease gene on the chromosome.22,24,26,35,40,45,47,48,50,53,57,60,76,112,114 The size of such intervals, as found in the relevant literature, ranges from the 100 nearest genes to 300 genes on average if a 10 Mb genomic neighborhood is considered.26 The average gene number of linkage intervals associated with diseases according to OMIM is estimated to be 108.35
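The two test-set strategies discussed above can be sketched as follows. The sizes (99 random genes, 100 genes per interval) follow the figures quoted in the text; the function names, the fixed random seed, and the toy gene list are assumptions for illustration.

```python
import random

def random_test_set(target, genome_genes, known_disease_genes, size=99, seed=0):
    """Target disease gene plus `size` genes drawn at random from the genome."""
    rng = random.Random(seed)
    pool = sorted(set(genome_genes) - set(known_disease_genes) - {target})
    return [target] + rng.sample(pool, size)

def artificial_linkage_interval(target, chromosome_order, size=100):
    """Window of `size` consecutive genes on the chromosome around the target.

    `chromosome_order` lists the genes of one chromosome by position.
    Near the chromosome ends the window is shifted so it keeps its size.
    """
    i = chromosome_order.index(target)
    start = max(0, i - size // 2)
    end = min(len(chromosome_order), start + size)
    start = max(0, end - size)
    return chromosome_order[start:end]

# Hypothetical chromosome with 1000 genes in positional order.
chromosome = [f"G{i}" for i in range(1000)]
known = ["G500", "G10"]
test_set = random_test_set("G500", chromosome, known)
interval = artificial_linkage_interval("G500", chromosome)
print(len(test_set), interval[0], interval[-1])
```

Note that this interval sketch does not remove other known disease genes that happen to fall into the window; a real benchmark would have to decide how to treat them.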

The third option for assembling a test set is the use of all genes in the genome except for the known disease genes in the training set.44,47–49,110 This setting is chosen only by the few methods that are capable of performing genome-wide disease gene prioritization. Finally, prioritization methods that consider, for instance, gene expression data are evaluated only on a smaller scale because there is not enough data for a comprehensive benchmarking over many disease phenotypes.38,43,68,77,115–117 Therefore, the authors commonly choose only few diseases that have, for example, the required experimental data available.

CONCLUSION

In this review, we gave an overview of different approaches to the prioritization of candidate disease genes. We described how disease genes can be identified by their molecular characteristics based on sequence properties, functional annotations, and network information. In particular, we presented recent approaches, which make use of phenotypic information and comprehensive knowledge integration. Finally, we discussed common benchmarking strategies of prioritization methods.

Many disease gene prioritization methods exploit discriminative gene and protein properties and successfully rank candidate genes according to their functional and phenotypic similarity or network proximity to known disease genes. Further improvement of the prioritization performance can be achieved by integrating biological information contained in multiple data sources. Many integrative methods first combine heterogeneous datasets and then apply specific analysis techniques. However, in the course of such analysis, the very useful insight which data source provides the most relevant biological information for the prioritization is usually lost. Therefore, it is also beneficial to follow the alternative approach that first analyzes each data source separately using the most suitable techniques and then combines the resulting ranking lists using sophisticated rank aggregation algorithms. This procedure also facilitates backtracking the origin of the most relevant information.

Among the most widely used data sources for disease gene prioritization are functional annotations and protein interactions as well as phenotypic similarity. In particular, performance evaluations of methods such as Endeavour, MedSim, and Prioritizer demonstrated consistently that functional GO term annotations constitute one of the most useful biological evidence sources for candidate prioritization.24,26,56 Further performance gain can be attributed to comprehensive knowledge integration, which reduces the noise in the integrated data and provides additional information that is not (yet) captured by GO term annotations. An even larger performance increase can be expected as the data sources used become more complete and exhibit high quality without significant bias toward intensively studied genes and proteins.

Currently, the multitude of benchmarking strategies pursued by different researchers considerably hampers the performance comparison of disease gene prioritization methods. Moreover, some methods make use of only small test datasets due to the lack of the required training data and the limited number of known disease genes. Nevertheless, established procedures to derive test sets and the application of different standard performance measures should form part of every benchmarking strategy to evaluate new prioritization methods comprehensively with respect to other well-performing methods. To facilitate future performance comparisons, the training and test datasets should always be made publicly available together with the published work. In the end, since follow-up validation experiments tend to be expensive and time-consuming, it is vital that the correct disease genes are found on the few top ranks of the prioritization list.

Acknowledgements

Part of this study was financially supported by the BMBF through the German National Genome Research Network (NGFN) and the Greifswald Approach to Individualized Medicine (GANI_MED). The research was also conducted in the context of the DFG-funded Cluster of Excellence for Multimodal Computing and Interaction.


Promise of personalized omics to precision medicine

Authors

  • Rui Chen,

  • Michael Snyder

Abstract

The rapid development of high-throughput technologies and computational frameworks enables the examination of biological systems in unprecedented detail. The ability to study biological phenomena at omics levels in turn is expected to lead to significant advances in personalized and precision medicine. Patients can be treated according to their own molecular characteristics. Individual omes as well as the integrated profiles of multiple omes, such as the genome, the epigenome, the transcriptome, the proteome, the metabolome, the antibodyome, and other omics information are expected to be valuable for health monitoring, preventative measures, and precision medicine. Moreover, omics technologies have the potential to transform medicine from traditional symptom-oriented diagnosis and treatment of diseases toward disease prevention and early diagnostics. We discuss here the advances and challenges in systems biology-powered personalized medicine at its current stage, as well as a prospective view of future personalized health care at the end of this review. WIREs Syst Biol Med 2013, 5:73–82. doi: 10.1002/wsbm.1198

Conflict of interest: M.S. serves as founder and consultant for Personalis, a member of the scientific advisory board of GenapSys, and a consultant for Illumina.

For further resources related to this article, please visit the WIREs website.

INTRODUCTION

Personalized or precision medicine is expected to become the paradigm of future health care, owing to the substantial improvement of high-throughput technologies and systems approaches in the past two decades.1,2 Conventional symptoms-oriented disease diagnosis and treatment has a number of significant limitations: for example, it focuses on only late/terminal symptoms and generally neglects preclinical pathophenotypes or risk factors; it generally disregards the underlying mechanisms of the symptoms; the disease descriptions are often quite broad so that they may actually include multiple diseases with shared symptoms; the reductionist approach to identify therapeutic targets in traditional medicine may over-simplify the complex nature of most diseases.3 Advances in the ability to perform large-scale genetic and molecular profiling are expected to overcome these limitations by addressing individualized differences in diagnosis and treatment in unprecedented detail.

The rapid development of high-throughput technologies also drives modern biological and medical research from traditional hypothesis-driven designs toward data-driven studies. Modern high-throughput technologies, such as high-throughput DNA sequencing and mass spectrometry, have enabled the facile monitoring of thousands of molecules simultaneously instead of just the few components that have been analyzed in traditional research, thus generating a huge amount of data to document the real-time molecular details of a given biological system. Ultimately, when enough knowledge is gained, these molecular signatures, as well as the biological networks they form, may be associated with the physiological state/phenotype of the biological system at the very moment when the sample is taken.

Future personalized health care is expected to benefit from the combined personal omics data, which should include genomic information as well as longitudinal documentation of all possible molecular components. This combined information not only determines the genetic susceptibility of the person, but also monitors his/her real-time physiological states, as our integrative Personal Omics Profile (iPOP) study exemplified.4 In this review we will cover recent advances in systems biology and personalized medicine. We will also discuss limitations and concerns in applying omics approaches to individualized, precision health care.

GENOMICS IN DISEASE-ORIENTED MEDICINE

The revolution of omics profiling technologies significantly benefited disease-oriented studies and health care, especially in disease mechanism elucidation, molecular diagnosis, and personalized treatment. These new technologies greatly facilitated the development of genomics, transcriptomics, proteomics, and metabolomics, which have become powerful tools for disease studies. Today, molecular disease analyses using large-scale approaches are pursued by an increasing number of physicians and pathologists.5,6

Initially, genome-wide association studies (http://gwas.nih.gov/) were launched in search of association of common genetic variants to certain phenotypes of interest, which typically assayed more than 500,000 single nucleotide polymorphisms (SNPs) and/or copy number variations (CNVs) with DNA microarrays in thousands to hundreds of thousands of participants.7 To date, 1,355 publications are listed in the National Human Genome Research Institute (NHGRI) GWAS Catalog reporting the association of 7,226 SNPs with 710 complex traits.7 The studied complex traits vary vastly, from cancers (e.g., prostate cancer and breast cancer) and complex diseases (e.g., type 1 and type 2 diabetes (T2D), Crohn’s Disease) to common traits (e.g., height and body mass index). These findings greatly broadened our knowledge on disease loci, and can potentially benefit disease risk prediction and drug treatments (as discussed in the section Integrative Omics in Preventative Medicine). Although powerful, GWAS have proven difficult for most complex diseases as typically a large number of loci are identified, each contributing to a small fraction of the genetic risk. These studies have many limitations including the small fraction of the genome that is analyzed, and failure to account for gene-gene interactions, epistasis and environmental factors.8

Whole genome sequencing (WGS) and whole exome sequencing (WES) have become more and more affordable for genomic studies and are rapidly replacing DNA microarrays. Single-base analysis of a genome/exome is achieved, which allows scientists to investigate the genetic basis of health and disease in unprecedented detail. Assigning variants to paternal and maternal chromosomes, i.e., ‘phasing’, can be achieved through the analysis of families9 or other methods.1,10,11 With the generation of massive amounts of whole genome and exome data from diseased and healthy populations, understanding of both human population variation and genetic diseases, especially complex diseases, has been brought to a new level.1,12

One field that significantly benefited from WGS technologies is cancer-related research. A large number of cancer genomes have been sequenced through individual or collaborative efforts, such as the International Cancer Genome Consortium (http://www.icgc.org/) and the Cancer Genome Atlas (http://cancergenome.nih.gov/). The DNA from many types of cancer has been sequenced, including breast cancer,13–15 chronic lymphocytic leukaemia,16 hepatocellular carcinoma,17 pediatric glioblastoma,13 melanoma,18 ovarian cancer,19 small-cell lung cancer,20 and Sonic-Hedgehog medulloblastoma,21 and databases have been established, such as the cancer cell line encyclopedia.22 In addition, cancer genomes have also been investigated at the single-cell level by WES for clear cell renal cell carcinoma23 and JAK2-negative myeloproliferative neoplasm.24 Somatic mutations and subtyping molecular markers were identified from these genomes. These different studies have revealed that nearly every tumor is different with distinct types of potential ‘driver’ mutations. Importantly, cancer genome sequencing often reveals potential targets that may suggest precision cancer treatment for the specific patients. As an example, a novel spontaneous germline mutation in the p53 gene was identified by WGS in a female patient, which accounted for the three types of cancers she developed in merely 5 years.25 An attempt has been made recently to treat a female patient with T Cell Lymphoma based on the target gene, CTLA4, identified by whole genome sequencing.26 The patient’s cancer was suppressed for two months with the anti-CTLA4 drug ipilimumab, although she died of recurrence soon after.

Whole genome and exome sequencing can also facilitate the identification of possible causal genes for hereditary genetic diseases, and is increasingly used in attempts to understand the basis of these ‘mystery diseases’ once obvious candidates are ruled out. In one successful example, whole genome sequencing of a fraternal twin pair with dopa (3,4-dihydroxyphenylalanine)-responsive dystonia helped the identification of one pair of personalized compound heterozygous mutations in the gene SPR, which accounted for the disease in both individuals.27 Importantly, based on the genome information the authors supplemented the l-dopa therapy with 5-hydroxytryptophan (SPR-dependent serotonin precursor) and significantly improved the health of both patients. In another example, Roach et al. sequenced the whole genomes of a family quartet and identified rare mutations in the genes DHODH and DNAH5 responsible for the two recessive disorders in both children—Miller syndrome and primary ciliary dyskinesia.28

Pharmacogenomics is another important application of genomic sequencing. It is known that the same drug may have different effects on different individuals due to their personal genomic background and living habits.8,29 Genetic information can be used to assign drug doses and reduce side effects. For example, genetic variants are known to affect patients’ response to antipsychotic drugs.30 Based on pharmacogenomic trials, genetic tests for four drugs are required by the US Food and Drug Administration (FDA) before the administration of these drugs to patients, including the anti-cancer drugs cetuximab, trastuzumab, and dasatinib, and the anti-HIV drug maraviroc, and more are recommended such as the anticoagulant drug Warfarin and the anti-HIV drug Abacavir.8

OTHER OMICS TECHNOLOGIES AND MEDICINE

Other omics technologies are also likely to impact medicine. High-throughput sequencing technologies have enabled whole transcriptome (cDNA) sequencing, abbreviated as RNA-Seq.31 RNA-Seq has become a powerful tool for disease-related studies, as it has great accuracy and sensitivity relative to microarray technology and it can also detect splicing isoforms.32 As RNA profiles reflect actual gene activity, they are closer to the real phenotype than the genomic sequence. With RNA-Seq, Shah et al. discovered varied clonal preference and allelic abundance in 104 cases of primary triple-negative breast cancers, and observed that ∼36% of the genomic mutations were actually expressed.33 Combining such information with genomic information may be valuable in treatment of cancer and other diseases. Moreover, RNA-Seq also captures more complex aspects of the transcriptome, such as splicing isoforms34 and editing events,35 which are generally overlooked by hybridization-based methods. Splicing variants have now been associated with several distinct types of cancer and cancer prognosis.36–40

Although proteins have long been deemed as the executors of most biological functions, clinical proteomics is still a relatively young field due to technological limitations to profile the complexity of the proteome with high sensitivity and accuracy. Since the development of new soft desorption methods that enabled the analysis of biological macromolecules with mass spectrometry, proteomics advanced significantly in the past decade.41,42 With current mass spectrometry technology, one can now quantify thousands of proteins in a single sample. For example, we were able to reliably detect 6,280 proteins in the human peripheral blood mononuclear cell proteome.4 Mass spectrometry also allows the detection of expressed mutations, allele-specific sequences and editing events in the human proteome,4,43 as well as profiling of the phosphoproteome.44 Also of note is the MALDI-TOF (matrix-assisted laser desorption/ionization-time of flight) mass spectrometry-based imaging technology (MALDI-MSI) developed by Cornett et al., which allows spatial proteome profiling in defined two-dimensional laser-shot areas using tissue sections.45 Using MALDI-MSI, Kang et al. identified immunoglobulin heavy constant α2 as a novel potential marker for breast cancer metastasis.46

The field of metabolomics has also advanced significantly with the improvement of mass spectrometry. Both hydrophilic and hydrophobic metabolites can be profiled in specific samples.4,47 As the metabolome reflects the real-time energy status as well as metabolism of the living organism, it is expected that certain metabolome profiles may be associated with different diseases.48 Therefore, metabolomic profiles become an important aspect for personalized medicine.49,50 Jamshidi et al. profiled the metabolome of a female patient with Hereditary Hemorrhagic Telangiectasia (HHT) along with four healthy controls, and identified differences which highlighted the nitric oxide synthase pathway.51 The authors then treated the patient with bevacizumab and shifted her metabolomic profile toward those of the healthy controls and improved the patient’s health. In addition, branched-chain amino acids such as isoleucine have been associated with T2D and may ultimately prove to be valuable biomarkers.52 Finally, since some metabolites bind and directly regulate the activity of other biomolecules (e.g., kinases),53 there is significant potential to modulate cellular pathways using diet and metabolic analogs that serve as agonist or antagonist of protein function.

INTEGRATIVE OMICS IN PREVENTATIVE MEDICINE

The concept of personalized medicine emphasizes not only personalized diagnosis and treatment, but also personalized disease susceptibility assessment, health monitoring and preventative medicine. Because a disease is easier to manage prior to its onset or at its early stages, risk assessment and early detection will be transformative in personalized medicine. Systems biology has the potential to capture real-time molecular phenotypes of a biological system, which enables the detection of subtle network perturbations preluding the actual development of clinical symptoms.

Disease susceptibility and drug response can be assessed with a person’s genomic information.8 This information may serve as a guideline for monitoring the health of a particular patient to achieve personalized health care, as showcased by Ashley et al.54 Whole genome sequence revealed variants for both high-penetrance Mendelian disorders, such as HTT (Huntington’s disease55) and PAH (Phenylketonuria56), as well as common, complex diseases, such as the disease-associated genetic variants reported in GWAS studies.57 Disease risks can be evaluated for a given person and an increase or decrease in disease risk compared with the population risk (of the same ethnicity, age, and gender) can be estimated (Figure 1). In the study of Ashley et al., the genome of a patient was analyzed and increased post-test probability risks for myocardial infarction and coronary artery disease were estimated.54 Their estimation matched the fact that the patient, although generally healthy, had a family history of vascular disease as well as early sudden death.58 Genetic variants associated with heart-related morbidities as well as drug response were identified in the patient’s genome, the information of which, as the authors stated, may direct the future health care for this particular patient. Similarly, Dewey et al. further extended this work by analysing a family quartet using a major allele reference sequence, and identified high-risk genes for familial thrombophilia, obesity, and psoriasis.59

Figure 1.

Example personalized RiskGraph. Each horizontal line symbolizes genetic risk of one disease tested for a specific individual. The tail of each arrow shows the pretest probability of a disease in a population of certain ethnicity, age and gender. The front end of each arrow displays the posttest probability with consideration of the person’s genomic information. Red arrow, increased risk; green arrow, decreased risk.
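One common way to move from a pretest to a posttest probability is to convert the probability to odds and multiply by a likelihood ratio for each observed genotype. The sketch below illustrates this arithmetic under the assumption that the variants act independently; it is not necessarily the exact procedure behind the RiskGraph, and all numbers are made up for illustration.

```python
def posttest_probability(pretest, likelihood_ratios):
    """Update a pretest disease probability with per-variant likelihood ratios.

    Assumes the variants contribute independently, so that their
    likelihood ratios can simply be multiplied.
    """
    odds = pretest / (1.0 - pretest)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# Hypothetical population risk of 20% and two example genotypes.
increased = posttest_probability(0.20, [2.0])  # risk allele: longer arrow to the right
decreased = posttest_probability(0.20, [0.5])  # protective allele: arrow to the left
print(round(increased, 3), round(decreased, 3))
```

A likelihood ratio above 1 corresponds to a red arrow (increased risk) in the figure, and a ratio below 1 to a green arrow (decreased risk).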

To further explore variation and power of the full human genome, projects and databases (such as the Personal Genome Project60) are being launched to help advance this field. However, genomic information alone usually is not adequate to predict disease onset, and other factors such as environment are expected to play a critical role in this process.61,62 The predictive capability of whole genome sequence was assessed by Roberts et al. through modeling 24 disease risks in monozygotic twins.63 For each disease, the authors modeled the genotype distribution in the twin population according to the observed concordance/discordance, and discovered that for most individuals and most diseases, the relative risk would be tested negative compared to the population, and in the best-case scenario, only one disease or more could be forewarned for any individual. The results of Roberts et al. are not surprising, as disease manifestation is probabilistic and not deterministic. Nonetheless, whole genome information by itself is expected to have partial value in disease prediction for complex diseases. In addition, from a systems point of view, peripheral components of the biological network would be more likely to contribute to complex diseases, as perturbation of the main nodes, which are usually essential genes, would be lethal.64 Therefore it is more difficult to identify the exact contributors of complex diseases. Moreover, as stated above, non-genomic factors may also exist and further complicate the situation. As an example of this, multiple sclerosis is known to have genetic components, however, Baranzini et al. failed to identify genomic, epigenomic or transcriptomic contributors in discordant monozygotic twins, which may indicate the existence of other factors, such as the environment.65

Current technologies, especially high-throughput sequencing and mass spectrometry, enable the monitoring of at least 10^5 molecular components, including DNA, RNA, protein, and metabolites in the human body. Therefore it is now feasible to identify the profiles of these components that correlate with various physiological states of the body, and profile alterations as a result of physiological state changes and diseases. Compared with genomic sequences alone, the profiles of transcriptome, proteome and metabolome are closer indicators to the real-time phenotype, therefore collecting this omics information in a longitudinal manner would allow monitoring of an individual’s physiological states. To test this concept, we implemented a study by following a generally healthy participant for 14 (now 32) months with integrated Personal Omics Profile (iPOP) analysis, incorporating information of the participant’s genome with longitudinal data from the person’s transcriptome, proteome, metabolome, and autoantibodyome.4 As blood constantly circulates through the human body, exchanges biological matter with local tissues, and is presently analyzed in medical tests, we chose to monitor the participant’s physiological states by profiling the blood components (PBMCs, serum and plasma) with iPOP analysis. The genome of this individual was sequenced with two WGS (Illumina and Complete Genomics) and three WES (Agilent, Roche Nimblegen, and Illumina) platforms to achieve high accuracy, which was further analyzed for disease risk and drug efficiency. The identified elevated risks included coronary artery disease, basal-cell carcinoma, hypertriglyceridemia and T2D, and the participant was estimated to have favorable response to rosiglitazone and metformin, both of which are antidiabetic medications.
Although the participant has a known family history for some of the high-risk diseases (but not T2D), he was free from most of them (except for hypertriglyceridemia, for which he used medication) and had a normal Body Mass Index at the start of our study. Nonetheless, these elevated disease risks served as a guideline to monitor his personal health with iPOP analysis. We profiled the transcriptome, proteome and metabolome from 20 time points in the 14 months, and monitored molecular profile changes for physiological state change events during our study, including two viral infections. The subject also acquired T2D during the study, immediately after one of the viral (respiratory syncytial virus) infections. Two types of changes were observed from our iPOP data: the autocorrelated trends that reflect chronic changes, and the spikes which include significantly up/down-regulated genes and pathways especially at the onset of each event. With our iPOP approach, we acquired a comprehensive picture of detailed molecular differences between different physiological states, as well as during disease onset. In particular, interesting changes in glucose and insulin signaling pathways were observed during the onset of T2D. We also obtained other important information from our omics data, such as dynamic changes in allele-specific expression and RNA-editing events, as well as personalized autoantibody profiles. Overall, this study revealed an important application of the use of genomics and other omics profiling for personalized disease risk estimation and precision medicine, as we discovered the increased T2D risk, monitored its early onset, and helped the participant effectively control and eventually reverse the phenotype by proactive interventions (diet change and physical exercise).

Another important feature of our study is that samples are collected longitudinally, so that aberrant or disease states can be compared with healthy states of the same individual. A further advantage of our iPOP approach is its modularity: other omics and quantifiable information can also be included in the iPOP profile, which can be readily tailored to monitor any biological or pathological event of interest (Figure 2). Examples of such information include the epigenome,66 gut microbiome,67 microRNA profiles68 and immune receptor repertoire.69 Moreover, quantifiable behavioral parameters such as nutrition, exercise, stress control and sleep may also be added to the profile.70

Figure 2.

The concept of integrative Personal Omics Profile (iPOP) analysis. Physiological state of the body can be reflected by the integrated information of different omics profiles, as well as the interactions among them.

THE IMPORTANCE OF DATA MINING AND RE-MINING

One important aspect of systems biology is data mining. Data management and access can become a daunting task given the tremendous amount of data generated with current high-throughput technologies, and data volumes keep increasing.71 Computational challenges exist at every step: handling, processing and annotating high-throughput data, integrating data from different sources and platforms, and pursuing clinical interpretation of the data.72 These steps can be computationally intensive and require significant hardware; for example, mapping short reads to achieve 30× coverage of the human genome typically requires 13 CPU days,72 although these times are rapidly decreasing. Moreover, because biological systems act as more than the sum of their individual parts, knowledge from multiple levels (such as epistasis, interaction, localization, and activation status) should be considered in order to capture the underlying highly organized networks for functional annotation.73 Ultimately it will be important to have a comprehensive database that contains electronic health records (including treatment information), genome sequences with variant calls, and as much molecular information as possible. In principle, with appropriate algorithms, physicians could mine such a database to make data-driven medical decisions.
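
To give a sense of scale for the 30× coverage figure, the short Python sketch below applies the standard relation coverage = (number of reads × read length) / genome size. The genome size of about 3.1 × 10^9 bp and the 100 bp read length are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def reads_needed(coverage, genome_size_bp, read_length_bp):
    """Number of reads required to reach a mean coverage depth.

    Rearranges coverage = reads * read_length / genome_size.
    """
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# Assumed values: ~3.1 Gbp human genome, 100 bp reads.
n_reads = reads_needed(30, 3_100_000_000, 100)
print(f"{n_reads:,} reads for 30x coverage")
```

Under these assumptions, close to a billion reads must be mapped for a single genome, which is why read mapping alone consumes CPU days.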

Many high-throughput datasets of similar types (e.g., expression and genome-wide association data collected from different populations with the same disease) have been generated in smaller, separate studies. Combining these publicly available datasets bioinformatically may therefore provide more statistical power and lead to clearer conclusions than could be reached in the individual studies. The work by Roberts et al. mentioned above serves as one example.63 To test the capacity of whole genome information, the authors combined monozygotic twin pair data from a total of five sources in 13 publications to obtain a much larger dataset for their test. Similarly, Butte and colleagues combined the results of 130 functional microarray experiments for T2D and re-mined the data for candidate genes that appeared repeatedly.74 They identified CD44 as the top candidate gene associated with T2D. In a related effort, by analyzing curated data on 2,510 individuals from 74 populations, the group led by Butte also discovered that T2D risk alleles are unevenly distributed across human populations, with the risk higher in African and lower in Asian populations.75
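
Why pooling small studies increases statistical power can be seen in a minimal fixed-effect, inverse-variance meta-analysis, sketched below in Python. This is a generic textbook technique, not the method used by Roberts et al. or by Butte and colleagues, and the effect sizes and standard errors are invented for illustration.

```python
import math


def fixed_effect_pool(effects, std_errors):
    """Inverse-variance weighted pooled effect and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical effect estimates from three small, separate studies.
effects = [0.20, 0.30, 0.25]
std_errors = [0.10, 0.10, 0.10]

pooled, pooled_se = fixed_effect_pool(effects, std_errors)
print(f"pooled effect {pooled:.3f}, standard error {pooled_se:.3f}")
```

The pooled standard error is smaller than that of any single study, so an effect that is inconclusive in each study alone can become clearly distinguishable from zero once the data are combined.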

CONCERNS AND LIMITATIONS

Personalized health monitoring and precision medicine are now advancing at a rapid pace because of developments in systems biology. As noted above, multiple efforts have been made in both technology development and biological application, and an increasing number of researchers and physicians share this vision. Hood et al. termed this approach ‘P4 Medicine’, for predictive, preventive, personalized and participatory medicine.12

Nevertheless, many concerns remain, and the Institute of Medicine has recommended guidelines on translational omics research.76 Khoury et al. suggested that ‘a fifth P’, the population perspective, be added to personalized medicine;77 population validation of systems results with strong evidence should be achieved before their clinical application. Many disease-associated genetic variants discovered in GWAS still need to be functionally validated.78 In addition, Khoury et al. raised the concern that unneeded disease screening or subclassification with systems approaches might waste restricted health care resources rather than lower health care costs. However, with the rapid drop in technology costs and carefully designed pilot studies, the optimal screening frequencies and levels of subclassification necessary for precision medicine could be determined and costs kept at affordable levels. It is worth noting that personalized omics data, appropriately interpreted, can greatly benefit our understanding of the physiological events of health and disease, and can improve precision health care as we gain more knowledge in this field. In addition to personalized diagnosis and treatment, the future of precision medicine with omics approaches should emphasize personalized health monitoring, molecular symptoms, early detection and preventative medicine, a paradigm shift from traditional health care.

As the human body is a highly organized, complex system with multiple organs and tissues, it is important to select the correct sample type for understanding a specific biological problem. However, because many sample types are unavailable (e.g., brain tissue) or not regularly accessible (e.g., biopsy samples from internal organs) from living individuals, the scope for personalized health monitoring is restricted. Systems biology results, especially iPOP results, should therefore not be over-interpreted. Although iPOP data from blood components may indicate changes in other parts of the human body, the actual profiles of the tissue of interest might be underrepresented in blood or delayed in phase.

It is still not clear who will develop and deliver personalized treatments if they are not available as conventional medication. The cost of developing drugs that accurately address personal specificity may become prohibitive, and such drugs may face other difficulties such as Food and Drug Administration approval. However, advances in high-throughput drug discovery will help accelerate this field.

In addition, personalized medicine using omics approaches relies heavily on technology development for biological research. This includes advances in both research instrumentation and computational frameworks. For example, it is still not possible to accurately determine the entire sequence of a genome because of the limitations of current WGS/WES methods,79,80 even after computational improvement of the signal-to-noise ratio.81,82 A low sequencing error rate was claimed for both the Illumina HiSeq (for 2 × 100 bp reads, more than 80% of the bases have a quality score above Q30, or 99.9% accuracy, http://www.illumina.com/documents//products/datasheets/datasheet_hiseq_systems.pdf) and the Complete Genomics platform (1 × 10^-5 at the time of our study80 and 2 × 10^-6 as of October 8th, 2012, www.completegenomics.com). However, the per-variant error rate is still high (15.50% and 9.08% for Illumina and Complete Genomics respectively with no filter, and 1.01% and 1.12% after multiple filters), as reported by Reumers et al.,81 which agrees with our observation that only 88.1% of the SNP calls overlapped when the same genome was sequenced with the two platforms.80 Possible disease-associated variants in these platform-specific regions might thus be overlooked or misinterpreted. Another issue lies in the storage and processing of omics data: petabytes of data can easily be generated for a small iPOP study of 200 participants, and substantial computing resources will be needed for data analysis. Therefore, interdisciplinary efforts from biologists, computer scientists and hardware engineers should be organized to ensure the continued improvement of this field.
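
The two kinds of numbers quoted above, per-base quality scores and between-platform overlap of SNP calls, can be made concrete with a short Python sketch. The Phred relation Q = −10 log10(p) is standard; the overlap measure (shared calls divided by the union of calls) and the example variants are assumptions for illustration, since the text does not state how the 88.1% figure was computed.

```python
def phred_to_error_prob(q):
    """Per-base error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q):
    """Per-base call accuracy implied by a Phred quality score."""
    return 1.0 - phred_to_error_prob(q)


def call_overlap(calls_a, calls_b):
    """Fraction of all distinct calls that both platforms agree on.

    Assumed definition: |A intersect B| / |A union B|.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Q30 corresponds to one expected error in 1,000 bases.
q30_accuracy = phred_to_accuracy(30)

# Hypothetical SNP calls keyed by (chromosome, position, alternate allele).
platform_1 = {("chr1", 1000, "A"), ("chr1", 2000, "T"), ("chr2", 500, "G")}
platform_2 = {("chr1", 1000, "A"), ("chr2", 500, "G"), ("chr3", 42, "C")}
overlap = call_overlap(platform_1, platform_2)
```

The contrast is the point: even when each base call is very accurate, errors accumulate over billions of positions, so the sets of variants reported by two platforms can still differ noticeably.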

CONCLUSION

The era of personalized precision medicine is about to emerge. The steady improvement of high-throughput technologies greatly facilitates this process by enabling the profiling of various omes, such as the whole genome, epigenome, transcriptome, proteome and metabolome, which convey detailed information about the human body. Integrated profiles of these omes should reflect the physiological status of the host at the time the samples are collected. The personalized omics approach catalyzes precision medicine at two levels: for diseases and biological processes whose mechanisms are still unclear, it will facilitate research that greatly advances our understanding; and once the mechanisms are clarified, individualized health care can be provided through health monitoring, preventative medicine, and personalized treatment. This would be especially helpful for complex diseases such as autism83 and Alzheimer’s disease,84 where multiple factors are responsible for the phenotypes. Furthermore, the omics approach also facilitates the development of other, less emphasized but important health-related fields, such as nutritional systems biology, which studies personalized diet and its relationship to health from a systems point of view.85 With the rapid decrease in the cost of omics profiling, we anticipate an increasing number of personalized medicine applications in many aspects of health care beyond our proof-of-principle study. This will significantly improve the health of the general public and cut health care costs. Scientists, governments, pharmaceutical companies and patients should work closely together to ensure the success of this transformation.86

Acknowledgements

This work is supported by funding from the Stanford University Department of Genetics and the National Institutes of Health. We thank Drs. George I. Mias and Hogune Im for their help in proof-reading the article and the insightful discussions.