
Deep Learning Papers on Medical Image Analysis

To the best of our knowledge, this is the first list of deep learning papers on medical applications. There are a couple of lists of deep learning papers in general, or on computer vision, for example Awesome Deep Learning Papers. In this list, I try to classify the papers based on their deep learning techniques and learning methodology. I believe this list could be a good starting point for DL researchers working on medical applications.


  1. A list of top deep learning papers published since 2015.
  2. Papers are collected from peer-reviewed journals and highly reputed conferences. However, the list may also include recent papers from arXiv.
  3. Meta-data is provided along with each paper, i.e., the deep learning technique, imaging modality, area of interest, and clinical database (DB).

List of Journals / Conferences (J/C):


Deep Learning Techniques:

  • NN: Neural Networks
  • MLP: Multilayer Perceptron
  • RBM: Restricted Boltzmann Machine
  • SAE: Stacked Auto-Encoders
  • CAE: Convolutional Auto-Encoders
  • CNN: Convolutional Neural Networks
  • RNN: Recurrent Neural Networks
  • LSTM: Long Short Term Memory
  • M-CNN: Multi-Scale/View/Stream CNN
  • MIL-CNN: Multi-instance Learning CNN
  • FCN: Fully Convolutional Networks

Imaging Modality:

  • US: Ultrasound
  • MR/MRI: Magnetic Resonance Imaging
  • PET: Positron Emission Tomography
  • MG: Mammography
  • CT: Computed Tomography
  • H&E: Hematoxylin & Eosin Histology Images
  • RGB: Optical Images

Table of Contents

Deep Learning Techniques

Medical Applications

Deep Learning Techniques

Auto-Encoders/ Stacked Auto-Encoders

Convolutional Neural Networks

Recurrent Neural Networks

Generative Adversarial Networks

Medical Applications


Annotation

Technique Modality Area Paper Title DB J/C Year
NN H&E N/A Deep learning of feature representation with multiple instance learning for medical image analysis [pdf] ICASSP 2014
M-CNN H&E Breast AggNet: Deep Learning From Crowds for Mitosis Detection in Breast Cancer Histology Images [pdf] AMIDA IEEE-TMI 2016
FCN H&E N/A Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation [pdf] MICCAI 2017


Classification

Technique Modality Area Paper Title DB J/C Year
M-CNN CT Lung Multi-scale Convolutional Neural Networks for Lung Nodule Classification [pdf] LIDC-IDRI IPMI 2015
3D-CNN MRI Brain Predicting Alzheimer’s disease: a neuroimaging study with 3D convolutional neural networks [pdf] ADNI arXiv 2015
CNN+RNN RGB Eye Automatic Feature Learning to Grade Nuclear Cataracts Based on Deep Learning [pdf] IEEE-TBME 2015
CNN X-ray Knee Quantifying Radiographic Knee Osteoarthritis Severity using Deep Convolutional Neural Networks [pdf] O.E.1 arXiv 2016
CNN H&E Thyroid A Deep Semantic Mobile Application for Thyroid Cytopathology [pdf] SPIE 2016
3D-CNN, 3D-CAE MRI Brain Alzheimer’s Disease Diagnostics by a Deeply Supervised Adaptable 3D Convolutional Network [pdf] ADNI arXiv 2016
M-CNN RGB Skin Multi-resolution-tract CNN with hybrid pretrained and skin-lesion trained layers [pdf] Dermofit MLMI 2016
CNN RGB Skin, Eye Towards Automated Melanoma Screening: Exploring Transfer Learning Schemes [pdf] EDRA, DRD arXiv 2016
M-CNN CT Lung Pulmonary Nodule Detection in CT Images: False Positive Reduction Using Multi-View Convolutional Networks [pdf] LIDC-IDRI, ANODE09, DLCST IEEE-TMI 2016
3D-CNN CT Lung DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification [pdf] LIDC-IDRI, LUNA16 IEEE-WACV 2018
3D-CNN MRI Brain 3D Deep Learning for Multi-modal Imaging-Guided Survival Time Prediction of Brain Tumor Patients [pdf] MICCAI 2016
SAE US, CT Breast, Lung Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans [pdf] LIDC-IDRI Nature 2016
CAE MG Breast Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring [pdf] IEEE-TMI 2016
MIL-CNN MG Breast Deep multi-instance networks with sparse label assignment for whole mammogram classification [pdf] INbreast MICCAI 2017
GCN MRI Brain Spectral Graph Convolutions for Population-based Disease Prediction [pdf] ADNI, ABIDE arXiv 2017
CNN RGB Skin Dermatologist-level classification of skin cancer with deep neural networks Nature 2017
FCN + CNN MRI Liver-Liver Tumor SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks [pdf] ISBI 2017

Detection / Localization

Technique Modality Area Paper Title DB J/C Year
MLP CT Head-Neck 3D Deep Learning for Efficient and Robust Landmark Detection in Volumetric Data [pdf] MICCAI 2015
CNN US Fetal Standard Plane Localization in Fetal Ultrasound via Domain Transferred Deep Neural Networks [pdf] IEEE-JBHI 2015
2.5D-CNN MRI Femur Automated anatomical landmark detection on distal femur surface using convolutional neural network [pdf] OAI ISBI 2015
LSTM US Fetal Automatic Fetal Ultrasound Standard Plane Detection Using Knowledge Transferred Recurrent Neural Networks [pdf] MICCAI 2015
CNN X-ray, MRI Hand Regressing Heatmaps for Multiple Landmark Localization using CNNs [pdf] DHADS MICCAI 2016
CNN MRI, US, CT - An artificial agent for anatomical landmark detection in medical images [pdf] SATCOM MICCAI 2016
FCN US Fetal Real-time Standard Scan Plane Detection and Localisation in Fetal Ultrasound using Fully Convolutional Neural Networks [pdf] MICCAI 2016
CNN+LSTM MRI Heart Recognizing end-diastole and end-systole frames via deep temporal regression network [pdf] MICCAI 2016
M-CNN MRI Heart Improving Computer-Aided Detection Using Convolutional Neural Networks and Random View Aggregation [pdf] IEEE-TMI 2016
CNN PET/CT Lung Automated detection of pulmonary nodules in PET/CT images: Ensemble false-positive reduction using a convolutional neural network technique [pdf] MP 2016
3D-CNN MRI Brain Automatic Detection of Cerebral Microbleeds From MR Images via 3D Convolutional Neural Networks [pdf] IEEE-TMI 2016
CNN X-ray, MG - Self-Transfer Learning for Fully Weakly Supervised Lesion Localization [pdf] NIH, China, DDSM, MIAS MICCAI 2016
CNN RGB Eye Fast Convolutional Neural Network Training Using Selective Data Sampling: Application to Hemorrhage Detection in Color Fundus Images [pdf] DRD, MESSIDOR MICCAI 2016
GAN - - Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery IPMI 2017
FCN X-ray Cardiac CathNets: Detection and Single-View Depth Prediction of Catheter Electrodes MIAR 2016
3D-CNN CT Lung DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification [pdf] LIDC-IDRI, LUNA16 IEEE-WACV 2018


Segmentation

Technique Modality Area Paper Title DB J/C Year
U-Net - - U-net: Convolutional networks for biomedical image segmentation MICCAI 2015
FCN MRI Brain Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation [pdf] arXiv 2016
FCN CT Liver-Liver Tumor Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully Convolutional Neural Networks and 3D Conditional Random Fields [pdf] MICCAI 2016
3D-CNN MRI Spine Model-Based Segmentation of Vertebral Bodies from MR Images with 3D CNNs MICCAI 2016
FCN CT Liver-Liver Tumor Automatic Liver and Tumor Segmentation of CT and MRI Volumes using Cascaded Fully Convolutional Neural Networks [pdf] arXiv 2017
FCN MRI Liver-Liver Tumor SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks [pdf] ISBI 2017
3D-CNN Diffusion MRI Brain q-Space Deep Learning: Twelve-Fold Shorter and Model-Free Diffusion MRI [pdf] (Section II.B.2) IEEE-TMI 2016
GAN MG Breast Mass Adversarial Deep Structured Nets for Mass Segmentation from Mammograms [pdf] INbreast, DDSM-BCRP ISBI 2018
3D-CNN CT Liver 3D Deeply Supervised Network for Automatic Liver Segmentation from CT Volumes [pdf] MICCAI 2017
3D-CNN MRI Brain Unsupervised domain adaptation in brain lesion segmentation with adversarial networks [pdf] IPMI 2017


Registration

Technique Modality Area Paper Title DB J/C Year
3D-CNN CT Spine An Artificial Agent for Robust Image Registration [pdf] 2016


Regression

Technique Modality Area Paper Title DB J/C Year
2.5D-CNN MRI Femur Automated anatomical landmark detection on distal femur surface using convolutional neural network [pdf] OAI ISBI 2015
3D-CNN Diffusion MRI Brain q-Space Deep Learning: Twelve-Fold Shorter and Model-Free Diffusion MRI [pdf] (Section II.B.1) HCP and others IEEE-TMI 2016

Image Reconstruction and Post Processing

Technique Modality Area Paper Title DB J/C Year
CNN CS-MRI A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction [pdf] IEEE-TMI 2017
GAN CS-MRI Deep Generative Adversarial Networks for Compressed Sensing Automates MRI [pdf] NIPS 2017

Other tasks


Next-Generation Sequencing (NGS) – illumina

With its unprecedented throughput, scalability, and speed, next-generation sequencing enables researchers to study biological systems at a level never before possible.

Today’s complex genomic research questions demand a depth of information beyond the capacity of traditional DNA sequencing technologies. Next-generation sequencing has filled that gap and become an everyday research tool to address these questions.

See What NGS Can Do For You

Innovative NGS sample preparation and data analysis options enable a broad range of applications. Next-gen sequencing allows you to:

Accessible Whole-Genome Sequencing

Using capillary electrophoresis-based Sanger sequencing, the Human Genome Project took over 10 years and cost nearly $3 billion.

Next-generation sequencing, in contrast, makes large-scale whole-genome sequencing accessible and practical for the average researcher.

Limitless Dynamic Range for Expression Profiling

NGS makes sequence-based gene expression analysis a “digital” alternative to analog techniques. It lets you quantify RNA expression with the breadth of a microarray and the resolution of qPCR.

Microarray gene expression measurement is limited by noise at the low end and signal saturation at the high end. In contrast, next-generation sequencing quantifies discrete, digital sequencing read counts, offering a virtually unlimited dynamic range.
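The "digital" nature of read counts can be made concrete with a toy normalization step. The sketch below converts raw per-gene read counts into counts per million (CPM); the gene names and counts are invented, and production RNA-seq pipelines apply more elaborate library-size normalization than this.

```python
def counts_per_million(read_counts):
    """Scale raw read counts to counts per million (CPM)."""
    total = sum(read_counts.values())
    return {gene: 1e6 * n / total for gene, n in read_counts.items()}

# Hypothetical gene names and raw counts for a single sample.
counts = {"GAPDH": 5000, "TP53": 120, "RARE1": 3}
cpm = counts_per_million(counts)
```

Because the underlying measurement is a discrete count rather than an analog intensity, rare transcripts remain distinguishable from zero simply by sequencing deeper.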

Tunable Resolution for Targeted Next-Gen Sequencing

NGS is highly scalable, allowing you to tune the level of resolution to meet specific experimental needs.

Targeted sequencing allows you to focus your research on particular regions of the genome. Choose whether to do a shallow scan across multiple samples, or sequence at greater depth with fewer samples to find rare variants in a given region.

How Does Illumina NGS Work?

Illumina next-generation sequencing utilizes a fundamentally different approach from the classic Sanger chain-termination method. It leverages sequencing by synthesis (SBS) technology – tracking the addition of labeled nucleotides as the DNA chain is copied – in a massively parallel fashion.

Next-gen sequencing generates masses of DNA sequence data that’s richer and more complete than is imaginable with Sanger sequencing. Illumina sequencing systems can deliver data output ranging from 300 kilobases up to 1 terabase in a single run, depending on instrument type and configuration.


Latest Evolution of Illumina Next-Gen Sequencing

Recent Illumina next-generation sequencing technology breakthroughs include:

  • 2-channel SBS: This technology enables faster sequencing with the same high data accuracy.
  • Patterned flow cell technology: This option offers dramatically increased data output and throughput.
  • $1000 genome sequencing: Discover how the HiSeq X Ten System breaks the $1000 genome barrier for human whole-genome sequencing.

Bring Next-Generation Sequencing to Your Lab

The following resources offer valuable guidance to researchers who are considering purchasing an NGS system:



Epigenomic analysis software tools and databases



Transcriptomic analysis software tools and databases



Metabolomic analysis software tools and databases



Fluxomic analysis software tools and databases



Biological pathway analysis software tools and databases


Tools for next-generation sequencing analysis

DeepSeq Data Analysis Tools

  • NGS Data Analysis
  • SNP Detection and Genomics Work
  • NGS Forums
  • Transcriptome Analysis
  • Functional Annotation


Recent approaches to the prioritization of candidate disease genes




Many efforts are still devoted to the discovery of genes involved with specific phenotypes, in particular, diseases. High-throughput techniques are thus applied frequently to detect dozens or even hundreds of candidate genes. However, the experimental validation of many candidates is often an expensive and time-consuming task. Therefore, a great variety of computational approaches has been developed to support the identification of the most promising candidates for follow-up studies. The biomedical knowledge already available about the disease of interest and related genes is commonly exploited to find new gene–disease associations and to prioritize candidates. In this review, we highlight recent methodological advances in this research field of candidate gene prioritization. We focus on approaches that use network information and integrate heterogeneous data sources. Furthermore, we discuss current benchmarking procedures for evaluating and comparing different prioritization methods. WIREs Syst Biol Med 2012. doi: 10.1002/wsbm.1177

For further resources related to this article, please visit the WIREs website.


Many common diseases are complex and polygenic, involving dozens of human genes that might predispose to, be causative of, or modify the respective disease phenotype.1–3 This intricate interplay of disease genotypes and phenotypes still renders the identification of all relevant disease genes difficult.4–6 Therefore, a number of experimental techniques exist to discover disease genes. In particular, high-throughput methods such as genome-wide association (GWA) studies2,7,8 and large-scale RNA interference screens6,9,10 yield lists of up to hundreds of candidate disease genes. As validating the actual disease relevance of candidate genes in experimental follow-up studies is a time-consuming and expensive task, many methods and web services for the computational prioritization of candidate disease genes have already been developed.11–19

The concrete problem of candidate gene prioritization can be formulated as follows: given a disease (or, generally speaking, a specific phenotype) of interest and some list of candidate genes, identify potential gene–disease associations by ranking the candidate genes in decreasing order of their relevance to the disease phenotype. When abstracting from the methodological details, the vast majority of computational approaches to this prioritization problem work in a similar manner. Most of them rely on the biological information already available for the disease phenotype of interest and the known, already verified, disease genes as well as for the additional candidate genes. In this context, functional information, particularly manually curated or automatically derived functional annotation, often provides strong evidence for establishing links between diseases and relevant genes and proteins.20–27 Many prioritization methods use protein interaction data as a rich information source for finding relationships between gene products of candidate genes and disease genes.11,16,18,25,28–45 In addition, the phenotypic similarity of diseases can help to increase the total number of known disease genes for less studied disease phenotypes.35,46–55 Other sources of biological information frequently used by prioritization approaches are sequence properties, gene expression data, molecular pathways, functional orthology between organisms, and relevant biomedical literature.12,14,19

These data then serve as input for statistical learning methods or are integrated into network representations, which are further analyzed by network scoring algorithms. Although individual data sources such as functional annotations or protein interactions provide quite powerful information for prioritizing candidate genes, the integration of multiple data sources has been reported to increase the performance even more.23–25,35,47–69 However, a generally accepted and consistent benchmarking strategy for all the diverse prioritization methods has not emerged yet, which complicates performance evaluation and comparison.

Therefore, this advanced review not only highlights recent prioritization approaches (published until the end of 2011), but also discusses different benchmarking strategies applied by authors and the need for standardized procedures for performance measurement. Other computational tasks such as the structural and functional interpretation as well as prioritization of disease-associated nucleotide and amino acid changes are not discussed in this article, but are reviewed elsewhere.3,70–74 In the following, the various prioritization methods are categorized according to the biological data and their representation that are primarily considered when scoring and ranking candidate disease genes: gene and protein characteristics, network information on molecular interactions, and integrated biomedical knowledge.


The first computational approaches in the hunt for disease genes focused on molecular characteristics of disease genes, which discriminate them from non-disease genes. As described below, researchers developed methods related to individual gene and protein sequence properties75–77 as well as functional annotations of gene products.20–22,26,27,78 In principle, if a candidate satisfies certain characteristics as derived from known disease genes and proteins, its disease relevance is considered to be higher than otherwise.

Gene and Protein Sequence Properties

López-Bigas and Ouzounis75 derived several important characteristics of disease genes from the amino acid sequence of their gene products. In comparison with other proteins encoded in the human genome, disease proteins tend to be longer, to exhibit a wider phylogenetic extent, that is, to have more homologs in both vertebrates and invertebrates, to possess a low number of close paralogs, and to be more evolutionarily conserved. Using these sequence properties as input to a decision-tree algorithm, the researchers performed a genome-wide identification of genes involved in (hereditary) diseases.

Similarly, Adie et al.76 developed PROSPECTR, a method for candidate disease gene prioritization based on an alternating decision-tree algorithm. However, their approach examined a broader set of sequence features and thus produced a more successful classifier. In particular, Adie et al. found that disease genes tend to have different nucleotide compositions at the end of the sequence, a higher number of CpG islands at the 5′ end, and longer 3′ untranslated regions.

Functional Annotations

By demonstrating the strong correlation between gene and protein function and disease features, such as age of onset, Jimenez-Sanchez et al.78 motivated prioritization approaches that exploit the functional annotation of known disease genes for ranking candidates.20–22,26,27

Perez-Iratxeta et al.20 applied text mining on biomedical literature to relate disease phenotypes with functional annotations using Medical Subject Headings (MeSH)79 and Gene Ontology (GO) terms.80 They ranked the candidate genes according to the characteristic functional annotations shared with the disease of interest. In a similar fashion, Freudenberg and Propping21 identified candidate genes based on their annotated GO terms that are shared with groups of known disease genes associated with similar phenotypes. In contrast, the approach POCUS22 assesses the shared over-representation of functional annotation terms between genes in different loci for the same disease.

Recently, Schlicker et al.26 developed MedSim, a prioritization method that makes use of the similarity between the functional annotations of disease genes and candidates. In contrast to the approaches that consider solely identical functional annotations or compute only GO term enrichments, MedSim automatically derives functional profiles for each disease phenotype from the GO term annotation of known disease genes and, optionally, of their orthologs or interaction partners. Candidate genes are then scored and ranked according to the functional similarity of their annotation profiles to a disease profile. In addition, Ramírez et al.27 introduced the BioSim method for discovering biological relationships between genes or proteins. While MedSim is based only on GO term annotations, BioSim quantifies functional gene and protein similarity according to multiple data sources of functional annotations and can also be applied to rank candidate genes based on their functional similarity to known disease genes.
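As a rough illustration of annotation-profile scoring, the sketch below ranks hypothetical candidate genes by the overlap of their GO-style term sets with a disease profile. Note that MedSim itself uses semantic GO-term similarity rather than the plain Jaccard overlap used here, and all identifiers are made up.

```python
def jaccard(a, b):
    """Jaccard similarity between two annotation-term sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical disease profile and candidate annotations (GO-style IDs).
disease_profile = {"GO:0006915", "GO:0008219", "GO:0042981"}
candidates = {
    "geneA": {"GO:0006915", "GO:0042981"},   # shares two terms
    "geneB": {"GO:0007049"},                 # shares none
}

# Rank candidates by decreasing similarity to the disease profile.
ranking = sorted(candidates,
                 key=lambda g: jaccard(candidates[g], disease_profile),
                 reverse=True)
```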

The success of the presented studies also shows that phenotypically similar diseases often involve common molecular mechanisms and thus functionally related genes. This also explains the frequent use of functional annotations as important biological evidence in integrative prioritization approaches.23–25,56–58,61–66,68,69 Notably, the information value of functional annotations can be further increased by improved scoring of functional similarity, reaching the performance of complex integrative methods based on multiple data sources.26


In the last decade, molecular interaction networks have become an indispensable tool and a valuable information source in the study of human diseases. Regarding methods for prioritization of candidate disease genes, it was repeatedly observed that protein interaction networks are among the most powerful data sources in addition to functional annotations.11,16,18,28,29,81 As in case of sequence properties, disease genes and their products have discriminatory network properties that allow their distinction from non-disease genes. In particular, molecular interactions naturally support the application of the guilt-by-association principle to identify disease genes. In the following, we highlight a representative selection of network-based prioritization approaches.

Local Network Information

Early prioritization methods have focused on local network information such as the close network neighborhood of a node representing a candidate gene or protein (see Box 1). This can be explained by the observation that disease proteins tend to cluster and interact with each other.30,49,82–84 Molecular triangulation is one of the first methods that used protein interaction networks to rank candidate nodes with respect to their shortest path distances to the nodes of known disease proteins.31 An evidence score, such as the MLS score corresponding to linkage peak association,85 is assigned to each disease protein node and transferred to its neighbor nodes. The candidates are then ranked according to the accumulated sum of evidence scores. This means that candidates represented by nodes close to several disease protein nodes with good evidence scores are considered to be the most promising ones.
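A minimal sketch of this neighborhood evidence transfer, on an invented toy network (node names and MLS-style scores are hypothetical):

```python
# Toy interaction network as adjacency sets; evidence scores (MLS-style)
# are attached only to known disease proteins. All names are invented.
network = {
    "cand1": {"dis1", "dis2"},
    "cand2": {"dis1", "other"},
    "dis1": {"cand1", "cand2"},
    "dis2": {"cand1"},
    "other": {"cand2"},
}
evidence = {"dis1": 2.5, "dis2": 1.0}

def triangulation_score(node):
    """Accumulate evidence transferred from adjacent disease proteins."""
    return sum(evidence.get(nb, 0.0) for nb in network[node])

# cand1 neighbors two scored disease proteins, cand2 only one.
ranked = sorted(["cand1", "cand2"], key=triangulation_score, reverse=True)
```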

Box 1


Local network information refers to the topological neighborhood of a node, and corresponding measures are less sensitive to the overall network topology. Examples are the node degree $k_n$ (number of edges linked to node n) and the shortest path length $d_{nm}$ (minimum number of edges between the nodes n and m). In disease gene networks, Xu et al.34 define, for each node n, the 1N index $k_n^{\mathrm{dis}}/k_n$ and the 2N index $\frac{1}{k_n}\sum_{m \in N_n} k_m^{\mathrm{dis}}/k_m$. Here, $k_n^{\mathrm{dis}}$ is the number of edges between node n and disease genes, and $N_n$ is the set of direct neighbors of n. Given the set of disease genes M, the average shortest path distance of a node n to disease genes is $\bar{d}_n = \frac{1}{|M|}\sum_{m \in M} d_{nm}$.
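On a toy network, the local measures described above can be computed directly; the node names below are invented, and the BFS simply realizes the shortest path distances between nodes:

```python
from collections import deque

# Toy undirected network; M is the set of known disease genes.
adj = {
    "c": {"d1", "d2", "x"},
    "d1": {"c"},
    "d2": {"c", "x"},
    "x": {"c", "d2"},
}
M = {"d1", "d2"}

def one_n_index(n):
    """1N index: fraction of node n's edges that lead to disease genes."""
    return len(adj[n] & M) / len(adj[n])

def avg_distance_to_disease_genes(n):
    """BFS shortest-path distances from n, averaged over disease genes."""
    dist = {n: 0}
    queue = deque([n])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist[m] for m in M) / len(M)
```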

Global network information relates to the overall network topology and to measures that characterize the role of a node in the whole network. Common centrality measures are shortest path closeness and betweenness as well as random-walk-related properties such as hitting time, visit frequency, and stationary distribution. For instance, closeness centrality indicates how distant a node is to the other network nodes, and it is calculated as $C_n = |V_n| / \sum_{m \in V_n} d_{nm}$, with $V_n$ denoting the set of nodes reachable from n. The random walk with restart40 is defined as $p_{t+1} = (1 - r)\,W p_t + r\,p_0$. Here, W is the column-normalized adjacency matrix of the network, $p_t$ contains the probability of being at each node at time step t, $p_0$ denotes the initial probability vector, and r is the restart probability. The steady-state probability vector p can be obtained by performing iterations until the change between $p_t$ and $p_{t+1}$ falls below some significance threshold, e.g., $10^{-6}$.40
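The random walk with restart can be sketched in a few lines of NumPy; the adjacency matrix, the seed node, and the restart probability below are illustrative choices, not values from any of the cited studies:

```python
import numpy as np

# Toy undirected network; node 0 represents a known disease protein.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)           # column-normalized adjacency matrix
p0 = np.array([1.0, 0, 0, 0])   # restart vector: walk restarts at node 0
r = 0.7                         # restart probability (illustrative)

# Iterate p_{t+1} = (1 - r) W p_t + r p0 until convergence.
p = p0.copy()
while True:
    p_next = (1 - r) * W @ p + r * p0
    if np.abs(p_next - p).max() < 1e-6:
        break
    p = p_next
# p now approximates the steady-state probabilities used to rank nodes.
```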

In a related approach, Karni et al.32 identified the minimal set of candidates so that there is a path between the products of known disease genes in the protein interaction network. Oti et al.33 proposed an even simpler method for a genome-wide prediction of disease genes. For each known disease protein, they identified its interaction partners and the chromosomal locations of the encoding genes. A gene is then considered to be relevant for a disease of interest if it resides within a known disease locus and its gene product shares an interaction with a protein known to be associated with the same disease.

To exploit the full potential of local network measures, Xu and Li34 computed multiple topological properties for three different molecular networks consisting of literature-curated, experimentally derived, and predicted protein–protein interactions. The considered properties are the node degree, the average distance to known disease genes, the 1N and 2N node indices (see Box 1 and Figure 1), and the positive topological coefficient.86 The authors then trained a k-nearest-neighbor classifier using the aforementioned topological properties of known disease genes and achieved comparable performance for all three networks. They also detected a possible bias in the literature-curated network because disease genes tend to be studied more extensively.
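A hand-rolled k-nearest-neighbor vote on such topological feature vectors might look as follows; the feature values and labels are fabricated for illustration, and Xu and Li's actual feature set and training data differ:

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Plain k-nearest-neighbor majority vote for binary labels,
    applied to topological feature vectors."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return int(round(nearest.mean()))  # majority vote for 0/1 labels

# Hypothetical feature vectors: [degree, 1N index, avg distance to
# disease genes]; label 1 = disease gene, 0 = non-disease gene.
train_X = np.array([[5, 0.6, 1.2],
                    [4, 0.5, 1.5],
                    [2, 0.0, 3.0],
                    [1, 0.1, 2.8]])
train_y = np.array([1, 1, 0, 0])

label = knn_predict(train_X, train_y, np.array([4, 0.4, 1.4]))
```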

Figure 1.

Exemplary molecular network of candidate genes and known disease genes. Red nodes represent known disease genes, and green nodes correspond to candidate genes. For candidate genes C1 and C2, the table lists the node degree, the 1N and 2N indices, and the average network distance to disease genes (see also Box 1).

In addition to local network information, Lage et al.35 incorporated phenotypic data into the disease gene prioritization. Each candidate and its direct interaction partners are considered as a candidate complex. All disease proteins in a candidate complex are assigned phenotypic similarity scores, which are used as input to a Bayesian predictor. Thus, a candidate gene obtains a high score if the other proteins in the complex are involved in phenotypes very similar to the disease of interest. Care et al.36 elaborated on this approach by combining it with deleterious SNP predictions for the candidate gene products and their interaction partners. Using the method by Lage et al., Berchtold et al.37 successfully prioritized proteins associated with type 1 diabetes (T1D). Further studies of protein interaction networks underlying specific diseases such as breast cancer38 and T1D39 also deal with the application of similar network-based prioritization approaches.

Global Network Information

Beyond local network information that ignores potential network-mediated effects from distant nodes, the utilization of global network measures can considerably improve the performance of prioritization methods for candidate disease genes.40,41,44 Especially for the study of polygenic diseases, network topology analysis can provide more insight into multiple paths of long-range protein interactions and their impact on the functionality and interplay of disease genes.

Random-Walk Measures

Köhler et al.40 demonstrated that random-walk analysis of protein–protein interaction networks outperforms local network-based methods such as shortest path distances and direct interactions as well as sequence-based methods like PROSPECTR.76 In their method, the authors ranked the gene products in a given network according to the steady-state probability of a random walk, which starts at known disease proteins and can restart with a predefined probability (see Box 1). Although the ranking criterion is the proximity of candidates to known disease proteins, this approach is more discriminative than local measures because it accounts for the global network structure.

In a similar manner, Chen et al.41 adapted three sophisticated algorithms from social network and web analysis to the problem of disease gene prioritization. To this end, they analyzed a protein–protein interaction network using modified versions of the random-walk-based methods PageRank,87,88 Hyperlink-Induced Topic Search (HITS),88,89 and the K-Step Markov method (KSMM).88 PageRank, HITS, and KSMM consider the global network topology and compute the relevance of all nodes representing candidates with regard to the set of known disease proteins in the network. All three methods achieved comparable performance to each other.

To address the issue of finding causal genes within expression quantitative trait loci (eQTL),90,91 Suthram et al.42 introduced the eQTL electrical diagrams method (eQED). They modeled confidence weights of protein interactions as conductances, while the P-values of associations between genetic loci and the expression of candidate genes served as current sources. The best candidate gene is the one passed by the highest current. The currents in electric circuits can be determined efficiently using random-walk computations.92,93
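The electrical analogy can be sketched by solving the weighted graph Laplacian for node voltages; the three-node circuit and its conductances below are invented and far smaller than a real eQED analysis:

```python
import numpy as np

# Toy three-node circuit. Edge conductances stand in for interaction
# confidences (values invented): node 0 = associated locus (current
# source), node 1 = candidate gene on the path, node 2 = grounded sink.
g = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.1}

n = 3
L = np.zeros((n, n))                  # weighted graph Laplacian
for (i, j), c in g.items():
    L[i, i] += c; L[j, j] += c
    L[i, j] -= c; L[j, i] -= c

# Inject 1 unit of current at node 0, ground node 2 (v[2] = 0), and
# solve the reduced Laplacian system for the remaining node voltages.
v = np.zeros(n)
v[:2] = np.linalg.solve(L[:2, :2], np.array([1.0, 0.0]))

# Current flowing through the candidate node 1 via edge (0, 1): the
# best candidate is the one passed by the highest current.
current_1 = g[(0, 1)] * (v[0] - v[1])
```

By Kirchhoff's law, the current through the candidate plus the current along the direct edge accounts for the full injected unit, so most of the flow here passes through the candidate node.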

Network Centrality Measures

For many years, global centrality measures such as closeness or betweenness have been used in social sciences to assess how important individual nodes are for the overall network connectivity. Recently, such measures have been applied to several problems in bioinformatics including disease gene prioritization. For example, Dezső et al.43 applied an adapted version of shortest path betweenness to prioritize candidates in a protein–protein interaction network. A candidate is scored more relevant to the disease of interest if it lies on significantly more shortest paths connecting nodes of known disease proteins than other nodes in the network.
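A simplified variant of this idea counts, for each candidate, how many disease-protein pairs have a shortest path running through it, using the condition dist(s,c) + dist(c,t) == dist(s,t); Dezső et al.'s method additionally assesses statistical significance, which this toy sketch (with an invented network) omits:

```python
from collections import deque
from itertools import combinations

# Toy protein interaction network (all names hypothetical).
adj = {
    "d1": {"c1", "c2"}, "d2": {"c1"}, "d3": {"c1"},
    "c1": {"d1", "d2", "d3"}, "c2": {"d1"},
}
disease = ["d1", "d2", "d3"]

def dist_from(src):
    """BFS shortest-path distances from src to all reachable nodes."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def disease_betweenness(c):
    """Count disease-protein pairs (s, t) for which candidate c lies
    on a shortest s-t path."""
    dc = dist_from(c)
    return sum(1 for s, t in combinations(disease, 2)
               if dc[s] + dc[t] == dist_from(s)[t])
```

Here the hub candidate lies on a shortest path between every pair of disease proteins, while the peripheral one lies on none.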

In a recent case study on primary immunodeficiencies (PIDs), Ortutay and Vihinen25 integrated functional GO annotations with protein interaction networks to discover novel PID genes. The authors conducted a topological analysis on an immunome network consisting of all essential proteins related to the human immune system and their interactions. In particular, they used the node degree as well as the global centrality measures of vulnerability and closeness to assess the importance of candidate genes in the network (see Box 1). Additionally, they performed functional enrichment analysis to determine genes with PID-related GO terms. With some modifications, the described prioritization method could be generalized to other diseases of interest.

Combining Network Measures

Recently, Navlakha and Kingsford44 compared different network-based prioritization methods. The authors observed that random-walk-based measures40 outperform measures focused on the local network neighborhood33 or clustering.94–96 A consensus method that uses a random-forest classifier to combine all methods yielded the most accurate ranking. Therefore, apart from stressing the potential of protein interaction data, Navlakha and Kingsford also showed that disease gene prioritization can benefit from the integration of multiple information sources.

In summary, as many other studies have also demonstrated, molecular interaction networks, in particular those based on protein interactions, provide valuable biological knowledge for ranking candidate disease genes.11,16,18,28,29 It has also become clear that global network measures achieve better results than local measures.40,41,44 Nevertheless, the performance of such prioritization approaches depends heavily on the quality of the network data. Protein interaction data are well known to be biased toward extensively studied proteins and subject to inherent noise.34,97,98 Therefore, it is often suggested that existing methods will perform better once more accurate data become available. Furthermore, Erten et al.45 pointed out that network-based methods can also be improved by integrating statistical adjustments for the skewed degree distribution of protein interaction networks.


Network information on molecular interactions as well as individual gene and protein characteristics such as sequence properties and functional annotations are major sources of biological evidence for scoring and ranking candidate disease genes. However, a prioritization approach based on a single information source alone usually achieves only limited performance due to noisy and incomplete datasets. To address this problem, the integration of multiple sources of biological knowledge has proven to be a good solution in bioinformatics. Different types of data can complement each other well to increase the amount of available information and its overall quality. While some of the methods presented above already make successful use of relatively simple integration procedures for a few different sources of functional information and annotations, this section will focus on more sophisticated methods for knowledge integration and the prioritization of candidate disease genes.

Complementing Molecular Interactions with Phenotypic Network Information

In the last years, several groups investigated the similarities and differences between disease phenotypes. The main finding was that similar phenotypes often share underlying genes or even pathways.46,99,100 In particular, van Driel et al.46 classified all human phenotypes contained in the Online Mendelian Inheritance in Man database (OMIM)101 by defining a measure of phenotypic similarity based on text mining of the corresponding OMIM records. Such phenotypic knowledge can be very useful to discover new potential disease genes by transferring known gene–phenotype associations to similar diseases and phenotypes.

Therefore, phenotypic similarity has become another major data source exploited by computational methods for prioritization of candidate disease genes.35,47–55 In this context, a two-layered heterogeneous data network is typically constructed so that the phenome layer consists of connections between similar phenotypes, while the interactome layer contains protein–protein interactions. The two network layers are then linked by known gene–phenotype associations.

To demonstrate the importance of the additional phenotype network layer for identifying novel gene–phenotype associations and disease–disease relationships, Li et al.48 extended the random-walk algorithm used by Köhler et al.,40 as described in the previous section, to heterogeneous networks. Both the candidate genes and the disease phenotypes are prioritized simultaneously. In contrast, Yao et al.49 estimated the closeness of a candidate gene to a disease of interest by computing the hitting time of a random walk that starts at the corresponding disease phenotype and ends at the candidate. This approach also allows the genome-wide identification of potential disease genes for phenotypic disease subtypes.
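
A minimal sketch of such a walk on a two-layer network follows: phenotype–phenotype similarity edges, gene–gene interaction edges, and bridging gene–phenotype associations are merged into one graph, and the walk restarts at the phenotype of interest. All node names, the restart probability, and the uniform treatment of intra- and inter-layer edges are simplifying assumptions of this example.

```python
# Sketch: random walk with restart on a toy heterogeneous network.

def rwr(graph, seed, alpha=0.7, iters=200):
    """Random walk with restart: with probability 1 - alpha the walker
    jumps back to the seed node; scores are visiting probabilities."""
    score = {n: 0.0 for n in graph}
    score[seed] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in graph}
        nxt[seed] = 1 - alpha
        for n, out in graph.items():
            if out:
                share = alpha * score[n] / len(out)
                for m in out:
                    nxt[m] += share
        score = nxt
    return score

hetero = {
    # phenome layer (plus bridging associations)
    "pheno_X": {"pheno_Y", "gene_known"},
    "pheno_Y": {"pheno_X", "gene_other"},
    # interactome layer
    "gene_known": {"pheno_X", "gene_cand1"},
    "gene_other": {"pheno_Y", "gene_cand2"},
    "gene_cand1": {"gene_known"},
    "gene_cand2": {"gene_other", "gene_far"},
    "gene_far": {"gene_cand2"},
}
ranks = rwr(hetero, seed="pheno_X")
genes = sorted((n for n in ranks if n.startswith("gene")),
               key=ranks.get, reverse=True)
# gene_known heads the gene ranking; gene_cand1 outranks the remote gene_cand2
```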

Chen et al.50 reformulated the candidate gene prioritization problem as a maximum flow problem on a heterogeneous network. They represented the capacities of connections between phenotypes by their phenotypic similarity. Capacities on edges within the interactome and on edges bridging the phenome and interactome were estimated during the evaluation procedure. By calculating a maximum flow from a phenotype of interest through the interactome, the authors ranked candidate genes with regard to the amount of efflux.
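
A rough sketch of the flow formulation, with made-up capacities: here each candidate is scored by the maximum flow it can receive from the phenotype of interest, a simplification of the efflux-based ranking described above, and the network is treated as directed for brevity.

```python
# Sketch: ranking candidates by maximum flow (Edmonds-Karp) on a toy
# phenome-interactome network with invented capacities.
from collections import deque

def add_edge(cap, u, v, c):
    cap.setdefault(u, {})[v] = c
    cap.setdefault(v, {}).setdefault(u, 0.0)   # residual back-edge

def max_flow(cap, s, t):
    """Edmonds-Karp: augment along shortest residual paths until none remain."""
    flow = {u: {v: 0.0 for v in cap[u]} for u in cap}
    total = 0.0
    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:               # BFS for an augmenting path
            u = q.popleft()
            for v in cap[u]:
                if v not in parent and cap[u][v] - flow[u][v] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return total
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:
            flow[u][v] += push
            flow[v][u] -= push
        total += push

cap = {}
add_edge(cap, "pheno", "pheno_sim", 0.8)     # phenotypic similarity
add_edge(cap, "pheno", "gene_known", 1.0)    # known association
add_edge(cap, "pheno_sim", "gene_other", 0.5)
add_edge(cap, "gene_known", "cand_a", 1.0)   # interactome edges
add_edge(cap, "gene_other", "cand_b", 0.5)

flow_a = max_flow(cap, "pheno", "cand_a")    # well-connected candidate
flow_b = max_flow(cap, "pheno", "cand_b")    # candidate behind weaker edges
```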

A computationally simpler approach based on the same network type was suggested by Guo et al.,51 who computed the association score between a gene and a disease as the weighted sum of all association scores between similar diseases and between neighboring genes in the interaction layer. To this end, the authors formulated an iterative matrix multiplication of disease–gene association matrices and disease-similarity matrices corresponding to the network structure. While the maximum flow problem solved by Chen et al.50 already accounts for the phenotypic overlap between diseases, the approach by Guo et al.51 additionally considers the genetic overlap of diseases. The recent PhenomeNET52 extends these ideas to a cross-species network of phenotypic similarities between genotypes and diseases, based on a uniform representation of different phenotype and anatomy ontologies. In particular, it can be used to perform whole-phenome discovery of genes for diseases with unknown molecular basis.
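
The iterative-multiplication idea can be sketched in a few lines. The update rule below, F ← a·D·F·G + (1 − a)·F0, with a disease-similarity matrix D and a normalized gene-neighborhood matrix G, is my own simplified notation rather than the exact formulation of Guo et al.; all matrices are invented toy data.

```python
# Sketch: propagating disease-gene association scores through disease
# similarity (rows) and gene neighborhoods (columns).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def propagate(D, G, F0, a=0.5, iters=20):
    """Iterate F <- a * D @ F @ G + (1 - a) * F0 until (near) convergence."""
    F = [row[:] for row in F0]
    for _ in range(iters):
        DFG = matmul(matmul(D, F), G)
        F = [[a * DFG[i][j] + (1 - a) * F0[i][j]
              for j in range(len(F0[0]))] for i in range(len(F0))]
    return F

# Two diseases (rows), three genes (columns); disease 1 has one known
# association (gene 0), disease 2 has none.
D = [[0.6, 0.4], [0.4, 0.6]]   # row-normalized disease similarity
G = [[0.5, 0.5, 0.0],          # row-normalized gene adjacency
     [0.5, 0.0, 0.5],          # (with self-loops): gene 1 links 0 and 2
     [0.0, 0.5, 0.5]]
F0 = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
F = propagate(D, G, F0)
# disease 2 inherits a nonzero score for gene 0 from the similar disease 1
```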

Two other studies used iterative network flow propagation on a heterogeneous network to identify protein complexes related to disease. Vanunu et al.53 developed a prioritization method that propagates flow from a phenotype of interest through the whole network and identifies dense subnetworks around high-scored genes as potential phenotype-related protein complexes. In contrast, Yang et al.54 modified the described heterogeneous network and included an additional layer of protein complexes. In the resulting network, phenotypes are connected to protein complexes, and complexes are linked with each other according to the protein interactions in the interactome layer. The method derives novel gene–phenotype associations by propagating the network flow within the protein complex layer.

Disease gene prioritization methods usually rank candidate genes relative to a phenotype of interest. However, the discovery of gene–phenotype associations can also be approached the other way around. Hwang et al.55 devised a method to identify the phenotype that could result from a given set of candidate genes. For that purpose, the authors considered a gene network and a phenotype similarity network. In both networks, the nodes were ranked separately with graph Laplacian scores, and a rank coherence was calculated from the score differences between genes and phenotypes connected by known associations. Hwang et al. showed that their approach is suitable to predict the resulting phenotype for a given set of candidate genes.

Integrating Heterogeneous Data Sources of Biological Knowledge

Two distinct approaches to disease gene prioritization that exploit multiple data sources are exemplarily highlighted in the following (Figure 2). The first approach considers each data source separately when assessing the molecular and phenotypic relationships of candidate genes with the disease of interest, and aggregates the resulting multiple ranking lists into a final ranking of the candidates. The alternative approach combines all biological information into a network representation and subsequently applies network measures to score and rank candidates with regard to their network proximity to nodes representing known disease genes.

Figure 2.

Integrative approaches to disease gene prioritization. The typical workflow of integrative prioritization approaches based on multiple data sources consists of three major steps. The first step involves preparing the input data consisting of two different sets of genes, the known disease genes and the candidate genes. For each gene, further biomedical knowledge is retrieved from various data sources such as functional annotations from the Gene Ontology and molecular pathways from the KEGG database. In the second step, the collected information is integrated using a network representation (top) or evaluated individually for each data source, resulting in different ranking lists (bottom). The third step computes a final ranking list of candidate genes based on network measures or rank aggregation. The candidate genes are thus prioritized by their relevance to the disease of interest.

In detail, the prioritization method Endeavour56,102 utilizes more than 20 data sources such as ontologies and functional annotations, protein–protein interactions, cis-regulatory information, gene expression data, sequence information, and text-mining results. For each data source, candidate genes are first ranked separately based on their similarity to a profile derived from known disease genes. Afterwards, all individual candidate rankings are merged into a final overall ranking using rank order statistics. The authors showed that this approach is quite successful in finding potential disease genes as well as genes involved in specific pathways or biological functions. Recently, Endeavour has also been benchmarked using various disease marker sets and pathway maps103 to confirm that it performs very well if sufficient data is available for the disease or pathway of interest and the candidate genes. Furthermore, Li et al.57 proposed a discounted rating system, an algorithm for integrating multiple rank lists, and compared it with the rank aggregation procedure used by Endeavour.
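
Endeavour's actual fusion step uses rank order statistics; as a loose stand-in, the sketch below aggregates per-source rankings by mean rank ratio, penalizing genes missing from a source with the worst ratio of 1.0. The gene symbols and rankings are arbitrary illustrations, not Endeavour output.

```python
# Sketch: mean-rank-ratio aggregation of per-data-source rankings.

def aggregate(rankings):
    """Combine several rankings (best first) into one consensus list.
    Genes absent from a source get the worst possible ratio, 1.0."""
    genes = set().union(*rankings)
    def ratio(r, g):
        return (r.index(g) + 1) / len(r) if g in r else 1.0
    score = {g: sum(ratio(r, g) for r in rankings) / len(rankings)
             for g in genes}
    return sorted(genes, key=score.get)   # lower mean ratio = better

# Invented per-source rankings (e.g., GO similarity, PPI proximity,
# expression correlation) for a handful of candidate genes.
by_go   = ["BRCA2", "TP53", "EGFR"]
by_ppi  = ["TP53", "BRCA2", "KRAS"]
by_expr = ["TP53", "EGFR", "BRCA2"]
consensus = aggregate([by_go, by_ppi, by_expr])
# TP53, ranked highly by all three sources, tops the consensus list
```

One practical advantage of aggregating ranks rather than raw scores is that the per-source scores need not be on comparable scales.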

Like Endeavour, the method MetaRanker59 also combines many heterogeneous data sources and forms separate evidence layers from SNP-to-phenotype associations, candidate protein interactions, linkage study data, quantitative disease similarity, and gene expression information. For each layer, all genes in the human genome are ranked with regard to their probability of being associated with the phenotype of interest. The overall score of a gene is the product of its rank scores for each layer. The evaluation of MetaRanker indicates that it is particularly suited to uncover associations in complex polygenic diseases and that the integration of multiple data layers improves the identification of weak contributions to the phenotype of interest compared to the use of only a few data sources.

Another combination of network-based methods with score aggregation has been proposed by Chen et al.60 The authors generate an individual network for each data source and quantify potential gene–disease relationships in each network using a global network measure based on diffusion kernels. The final candidate ranking considers only the most informative network score for each candidate gene. Furthermore, an alternative way of integrating information from multiple data sources is the application of machine learning techniques. Here, each data source can be represented as one or more individual features and used as input for the training of supervised learning methods. In particular, support vector machines,61–63 decision-tree-based classifiers,64 and PU learning65 (machine learning from positive and unlabeled examples) have been applied to prioritize candidate disease genes using multiple data sources.

In contrast, one of the first alternative approaches that integrate information from multiple data sources into a network representation was Prioritizer.24 Its authors constructed a comprehensive functional human gene network based on a number of datasets from molecular pathway and interaction databases such as KEGG,104 BIND,105 HPRD,106 and Reactome,107 as well as from GO annotations,80 yeast two-hybrid screens, gene expression experiments, and protein interaction predictions. In this network, positional candidates from different disease loci are ranked according to the length of the shortest paths between them. In functional networks as used by Prioritizer, the main assumption is that relevant genes are involved in specific disease-related pathways and cluster together in the network even if their products are not closely linked by physical protein interactions.

Building upon Prioritizer, several research groups have assembled different types of integrated networks as biological evidence for candidate disease gene prioritization. One example is the two-layered network by Li et al.48 presented in the previous section that combines protein interactions and phenotypic similarity. Another method was presented by Linghu et al.,66 who employed naïve Bayes integration of diverse functional genomics datasets to generate a weighted functional linkage network and to prioritize candidate genes based on their shortest path distance to known disease genes. Similarly, Huttenhower et al.67 incorporated information from several thousand genomic experiments to generate a functional relationship network. From this network, the authors could derive functional maps of different phenotypes and showed in a case study for macroautophagy that these maps can be used successfully to find novel gene associations.

Recently, Lee et al.68 also provided a large-scale human network of functional gene–gene associations and evaluated the performance of six different network-based methods using it. Similar to the findings by Navlakha and Kingsford,44 the authors concluded that the strongest overall performance is achieved with algorithms that account for the global network structure, such as Google’s PageRank. A more general view of the relationships between phenotypes and genes is introduced by BioGraph,69 a heterogeneous network containing diverse biomedical entities and the relations between them, extracted from over 20 publicly available databases. By computing random walks on this network, the authors aim at the automated generation of functional hypotheses linking different concepts, in particular candidate genes and diseases.


To show the biological applicability and scientific value of disease gene prioritization methods, their authors are normally expected to conduct an extensive performance evaluation and, if possible, a thorough comparison with other methods. To this end, many authors benchmark their methods on disease phenotypes from OMIM. Depending on the requirements of their method, only phenotypes with at least two or three known disease genes may be suitable. Hence, the number of evaluated diseases can vary from tens to hundreds, with hundreds to thousands of corresponding genes. The range of disease phenotypes and genes for which a given method is applicable depends on the data used by the method. For instance, only about 10% of all human protein–protein interactions have probably been described so far,108 only about 10% of all human genes have at least one known disease association,101 and only about every second gene or protein is functionally annotated.109

Leave-one-out cross-validation is a widely used and generally accepted test for how a method might perform on previously unseen data. In each run, one of the known disease genes, the so-called target disease gene, is removed from the training data. The remaining disease genes are used to identify the omitted gene from a test set of genes that are not known to be associated with the disease of interest. In the best case, the top rank should be assigned to the target disease gene and lower ranks to the other test genes. Since cross-validation is a standard performance test, a number of suitable measures of predictive power exist, for example, sensitivity and specificity, the receiver operating characteristic (ROC) curve, precision and recall, enrichment, and mean rank ratio (see Box 2). Unfortunately, none of these measures is considered the default, which renders the comparison between different methods of disease gene prioritization difficult. In particular, it would be useful to report the performance for the top-ranked candidate genes, e.g., the first 10 or 20 genes, because only a few candidates can usually be considered for further validation experiments.
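
The procedure can be sketched generically: the scorer below (shared-neighbor counting on an invented interaction map) is only a placeholder for whichever prioritization method is being evaluated, and all gene names are made up.

```python
# Sketch: leave-one-out cross-validation for a prioritization method.

def prioritize(train_genes, test_genes, neighbors):
    """Toy scorer: rank test genes by neighbors shared with training genes."""
    def score(g):
        return sum(len(neighbors[g] & neighbors[t]) for t in train_genes)
    return sorted(test_genes, key=score, reverse=True)

def loocv_ranks(disease_genes, candidate_pool, neighbors):
    """For each disease gene, hold it out, prioritize it against the
    candidate pool, and record the rank it receives (1 = perfect)."""
    ranks = []
    for target in disease_genes:
        train = [g for g in disease_genes if g != target]
        ranking = prioritize(train, [target] + candidate_pool, neighbors)
        ranks.append(ranking.index(target) + 1)
    return ranks

# Invented interaction neighborhoods: disease genes share partners,
# unrelated candidates do not.
neighbors = {
    "dg1": {"a", "b"}, "dg2": {"a", "c"}, "dg3": {"a", "b", "c"},
    "c1": {"a"}, "c2": {"x"}, "c3": {"y"},
}
ranks = loocv_ranks(["dg1", "dg2", "dg3"], ["c1", "c2", "c3"], neighbors)
# each held-out disease gene is recovered at rank 1 in this toy setup
```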

Another important aspect of the benchmarking strategy is the choice of genes in the test set, i.e., the candidate genes that are prioritized together with the target disease gene. A common input for prioritization methods is a set of susceptibility loci as determined by GWA studies. These loci typically contain up to several hundreds of possible disease genes. Therefore, different strategies have been followed by authors to derive useful test sets, namely the definition of artificial gene loci, the random selection of genes, the use of the whole genome, and the small-scale choice of genes.

Box 2


Here, we briefly describe frequently used measures for evaluating the performance of disease gene prioritization methods. A simple measure is the mean rank ratio, defined as the average of the rank ratios of all tested disease genes.110 One speaks of n/m-fold enrichment on average if disease genes are ranked in the top m% of all genes in n% of the linkage intervals.47 Other performance measures are calculated from the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) at a specific rank or score cut-off that separates predicted genes from unpredicted ones. Positives are disease genes, while negatives are candidate genes without disease association. For instance, sensitivity is the percentage of correctly identified disease genes among all disease genes (TP/(TP + FN)), while specificity is the percentage of correctly dismissed candidate genes among all candidates without disease association (TN/(TN + FP)). Plotting sensitivity versus specificity while varying the cut-off yields a ROC curve. The area under the ROC curve (AUC) is a standard measure for the overall performance of binary classification methods (here, disease genes vs. others). The AUC is 100% in case of perfect prioritization and 50% if the disease genes are ranked randomly. In some cases, the authors of prioritization methods give the percentage of disease genes ranked in the top 1% and 5% of all genes, which corresponds to reporting the sensitivity at 99% or 95% specificity, respectively. Precision is the percentage of true disease genes among all genes predicted as disease genes (TP/(TP + FP)), while recall is equal to sensitivity. Thus, a precision-recall curve can also be used to evaluate method performance.44
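
As a concrete check, these measures can be computed directly from a ranked candidate list with binary labels. The labels below are invented (1 = disease gene, 0 = other candidate, best rank first):

```python
# Computing the Box 2 measures for a toy ranked list.
ranked_labels = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

def confusion(labels, cutoff):
    """Treat the top-`cutoff` entries as predicted positives."""
    tp = sum(labels[:cutoff])
    fp = cutoff - tp
    fn = sum(labels[cutoff:])
    tn = len(labels) - cutoff - fn
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion(ranked_labels, cutoff=4)
sensitivity = tp / (tp + fn)          # = recall
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)

def auc(labels):
    """Rank-based AUC: the probability that a randomly chosen disease
    gene is ranked above a randomly chosen non-disease gene."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    wins = sum(1 for p in pos for n in neg if p < n)
    return wins / (len(pos) * len(neg))
```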

Endeavour56 and several related methods61,103,111 were evaluated with a test set containing 99 candidate genes chosen at random from the whole genome in addition to the target disease gene. Other methods were also benchmarked with this strategy, primarily in order to be comparable with Endeavour.26,41,47,58,60,110,112 However, since similar genes tend to cluster in chromosomal neighborhoods,113 another, presumably more difficult, setting for performance benchmarking, and one especially relevant for GWA studies, is the definition of artificial linkage intervals with genes that surround the disease gene on the chromosome.22,24,26,35,40,45,47,48,50,53,57,60,76,112,114 The size of such intervals, as found in the relevant literature, ranges from the 100 nearest genes to 300 genes on average if a 10 Mb genomic neighborhood is considered.26 The average gene number of linkage intervals associated with diseases according to OMIM is estimated to be 108.35

The third option for assembling a test set is the use of all genes in the genome except for the known disease genes in the training set.44,47–49,110 This setting is chosen only by the few methods that are capable of performing genome-wide disease gene prioritization. Finally, prioritization methods that consider, for instance, gene expression data are evaluated only on a smaller scale because there is not enough data for comprehensive benchmarking over many disease phenotypes.38,43,68,77,115–117 Therefore, the authors commonly choose only a few diseases that have, for example, the required experimental data available.


In this review, we gave an overview of different approaches to the prioritization of candidate disease genes. We described how disease genes can be identified by their molecular characteristics based on sequence properties, functional annotations, and network information. In particular, we presented recent approaches, which make use of phenotypic information and comprehensive knowledge integration. Finally, we discussed common benchmarking strategies of prioritization methods.

Many disease gene prioritization methods exploit discriminative gene and protein properties and successfully rank candidate genes according to their functional and phenotypic similarity or network proximity to known disease genes. Further improvement of the prioritization performance can be achieved by integrating biological information contained in multiple data sources. Many integrative methods first combine heterogeneous datasets and then apply specific analysis techniques. However, in the course of such analysis, the very useful insight which data source provides the most relevant biological information for the prioritization is usually lost. Therefore, it is also beneficial to follow the alternative approach that first analyzes each data source separately using the most suitable techniques and then combines the resulting ranking lists using sophisticated rank aggregation algorithms. This procedure also facilitates backtracking the origin of the most relevant information.

Among the most widely used data sources for disease gene prioritization are functional annotations and protein interactions as well as phenotypic similarity. In particular, performance evaluations of methods such as Endeavour, MedSim, and Prioritizer demonstrated consistently that functional GO term annotations are among the most useful sources of biological evidence for candidate prioritization.24,26,56 Further performance gains can be attributed to comprehensive knowledge integration, which reduces the noise in the integrated data and contributes additional information from data sources that is not (yet) captured by GO term annotations. Even greater performance increases can be expected as the underlying data sources become more complete and of higher quality, without significant bias toward intensively studied genes and proteins.

Currently, the multitude of benchmarking strategies pursued by different researchers considerably hampers the performance comparison of disease gene prioritization methods. Moreover, some methods make use of only small test datasets due to the lack of the required training data and the limited number of known disease genes. Nevertheless, established procedures to derive test sets and the application of different standard performance measures should form part of every benchmarking strategy to evaluate new prioritization methods comprehensively with respect to other well-performing methods. To facilitate future performance comparisons, the training and test datasets should always be made publicly available together with the published work. In the end, since follow-up validation experiments tend to be expensive and time-consuming, it is vital that the correct disease genes are found among the top few ranks of the prioritization list.


Part of this study was financially supported by the BMBF through the German National Genome Research Network (NGFN) and the Greifswald Approach to Individualized Medicine (GANI_MED). The research was also conducted in the context of the DFG-funded Cluster of Excellence for Multimodal Computing and Interaction.

Promise of personalized omics to precision medicine



  • Rui Chen,

  • Michael Snyder


The rapid development of high-throughput technologies and computational frameworks enables the examination of biological systems in unprecedented detail. The ability to study biological phenomena at omics levels in turn is expected to lead to significant advances in personalized and precision medicine. Patients can be treated according to their own molecular characteristics. Individual omes as well as the integrated profiles of multiple omes, such as the genome, the epigenome, the transcriptome, the proteome, the metabolome, the antibodyome, and other omics information are expected to be valuable for health monitoring, preventative measures, and precision medicine. Moreover, omics technologies have the potential to transform medicine from traditional symptom-oriented diagnosis and treatment of diseases toward disease prevention and early diagnostics. We discuss here the advances and challenges in systems biology-powered personalized medicine at its current stage, as well as a prospective view of future personalized health care at the end of this review. WIREs Syst Biol Med 2013, 5:73–82. doi: 10.1002/wsbm.1198

Conflict of interest: M.S. serves as founder and consultant for Personalis, a member of the scientific advisory board of GenapSys, and a consultant for Illumina.

For further resources related to this article, please visit the WIREs website.


Personalized or precision medicine is expected to become the paradigm of future health care, owing to the substantial improvement of high-throughput technologies and systems approaches in the past two decades.1,2 Conventional symptom-oriented disease diagnosis and treatment has a number of significant limitations: for example, it focuses only on late/terminal symptoms and generally neglects preclinical pathophenotypes or risk factors; it generally disregards the underlying mechanisms of the symptoms; the disease descriptions are often so broad that they may actually include multiple diseases with shared symptoms; and the reductionist approach to identifying therapeutic targets in traditional medicine may oversimplify the complex nature of most diseases.3 Advances in the ability to perform large-scale genetic and molecular profiling are expected to overcome these limitations by addressing individualized differences in diagnosis and treatment in unprecedented detail.

The rapid development of high-throughput technologies also drives modern biological and medical research from traditional hypothesis-driven designs toward data-driven studies. Modern high-throughput technologies, such as high-throughput DNA sequencing and mass spectrometry, have enabled the facile monitoring of thousands of molecules simultaneously, instead of the few components analyzed in traditional research, generating huge amounts of data that document the real-time molecular details of a given biological system. Ultimately, when enough knowledge is gained, these molecular signatures, as well as the biological networks they form, may be associated with the physiological state/phenotype of the biological system at the very moment the sample is taken.

Future personalized health care is expected to benefit from the combined personal omics data, which should include genomic information as well as longitudinal documentation of all possible molecular components. This combined information not only determines the genetic susceptibility of the person, but also monitors his/her real-time physiological states, as our integrative Personal Omics Profile (iPOP) study exemplified.4 In this review we will cover recent advances in systems biology and personalized medicine. We will also discuss limitations and concerns in applying omics approaches to individualized, precision health care.


The revolution of omics profiling technologies significantly benefited disease-oriented studies and health care, especially in disease mechanism elucidation, molecular diagnosis, and personalized treatment. These new technologies greatly facilitated the development of genomics, transcriptomics, proteomics, and metabolomics, which have become powerful tools for disease studies. Today, molecular disease analyses using large-scale approaches are pursued by an increasing number of physicians and pathologists.5,6

Initially, genome-wide association studies (GWAS) were launched in search of associations of common genetic variants with certain phenotypes of interest, typically assaying more than 500,000 single nucleotide polymorphisms (SNPs) and/or copy number variations (CNVs) with DNA microarrays in thousands to hundreds of thousands of participants.7 To date, 1,355 publications are listed in the National Human Genome Research Institute (NHGRI) GWAS Catalog reporting the association of 7,226 SNPs with 710 complex traits.7 The studied complex traits vary vastly, from cancers (e.g., prostate cancer and breast cancer) and complex diseases (e.g., type 1 and type 2 diabetes (T2D), Crohn’s disease) to common traits (e.g., height and body mass index). These findings greatly broadened our knowledge of disease loci and can potentially benefit disease risk prediction and drug treatments (as discussed in the section Integrative Omics in Preventative Medicine). Although powerful, GWAS have proven difficult for most complex diseases, as typically a large number of loci are identified, each contributing only a small fraction of the genetic risk. These studies have many limitations, including the small fraction of the genome that is analyzed and the failure to account for gene–gene interactions, epistasis, and environmental factors.8
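
The core statistical test behind such association scans can be illustrated with a Pearson chi-square on a 2×2 allele-count table; the counts below are invented for illustration.

```python
# Sketch: Pearson chi-square allele-count test for one SNP.

def chi2_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table
    (rows: cases/controls, columns: risk allele / other allele)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented counts: cases carry the risk allele 60 times vs. 40, while
# controls show the reverse pattern.
stat = chi2_2x2(60, 40, 40, 60)
# stat = 8.0, above the 3.84 critical value for p < 0.05 with 1 df
```

In a real GWAS this test (or a logistic-regression equivalent) is repeated for every assayed SNP, which is why the significance threshold must be corrected for hundreds of thousands of tests.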

Whole genome sequencing (WGS) and whole exome sequencing (WES) have become increasingly affordable for genomic studies and are rapidly replacing DNA microarrays. They achieve single-base analysis of a genome/exome, which allows scientists to investigate the genetic basis of health and disease in unprecedented detail. Assigning variants to paternal and maternal chromosomes, i.e., ‘phasing’, can be achieved through the analysis of families9 or other methods.1,10,11 With the generation of massive amounts of whole genome and exome data from diseased and healthy populations, the understanding of both human population variation and genetic diseases, especially complex diseases, has been brought to a new level.1,12

One field that has significantly benefited from WGS technologies is cancer-related research. A large number of cancer genomes have been sequenced through individual or collaborative efforts such as the International Cancer Genome Consortium and the Cancer Genome Atlas. The DNA from many types of cancer has been sequenced, including breast cancer,13–15 chronic lymphocytic leukaemia,16 hepatocellular carcinoma,17 pediatric glioblastoma,13 melanoma,18 ovarian cancer,19 small-cell lung cancer,20 and Sonic-Hedgehog medulloblastoma,21 and databases have been established, such as the cancer cell line encyclopedia.22 In addition, cancer genomes have also been investigated at the single-cell level by WES for clear cell renal cell carcinoma23 and JAK2-negative myeloproliferative neoplasm.24 Somatic mutations and subtyping molecular markers were identified from these genomes. These different studies have revealed that nearly every tumor is different, with distinct types of potential ‘driver’ mutations. Importantly, cancer genome sequencing often reveals potential targets that may suggest precision cancer treatment for specific patients. As an example, a novel spontaneous germline mutation in the p53 gene was identified by WGS in a female patient, which accounted for the three types of cancer she developed in merely 5 years.25 An attempt has been made recently to treat a female patient with T-cell lymphoma based on the target gene, CTLA4, identified by whole genome sequencing.26 The patient’s cancer was suppressed for two months with the anti-CTLA4 drug ipilimumab, although she died of recurrence soon after.

Whole genome and exome sequencing can also facilitate the identification of possible causal genes for hereditary genetic diseases, and are increasingly used in attempts to understand the basis of ‘mystery diseases’ once obvious candidates have been ruled out. In one successful example, whole genome sequencing of a fraternal twin pair with dopa (3,4-dihydroxyphenylalanine)-responsive dystonia enabled the identification of a pair of compound heterozygous mutations in the gene SPR that accounted for the disease in both individuals.27 Importantly, based on the genome information, the authors supplemented the l-dopa therapy with 5-hydroxytryptophan (an SPR-dependent serotonin precursor) and significantly improved the health of both patients. In another example, Roach et al. sequenced the whole genomes of a family quartet and identified rare mutations in the genes DHODH and DNAH5 responsible for the two recessive disorders in both children: Miller syndrome and primary ciliary dyskinesia.28

Pharmacogenomics is another important application of genomic sequencing. The same drug may have different effects in different individuals owing to their personal genomic background and living habits.8,29 Genetic information can be used to assign drug doses and reduce side effects. For example, genetic variants are known to affect patients’ responses to antipsychotic drugs.30 Based on pharmacogenomic trials, the US Food and Drug Administration (FDA) requires genetic tests before the administration of four drugs, including the anti-cancer drugs cetuximab, trastuzumab, and dasatinib, and the anti-HIV drug maraviroc; tests are recommended for more, such as the anticoagulant warfarin and the anti-HIV drug abacavir.8


Other omics technologies are also likely to impact medicine. High-throughput sequencing technologies have enabled whole transcriptome (cDNA) sequencing, abbreviated RNA-Seq.31 RNA-Seq has become a powerful tool for disease-related studies, as it offers greater accuracy and sensitivity than microarray technology and can also detect splicing isoforms.32 Because RNA profiles reflect actual gene activity, they are closer to the real phenotype than the genomic sequence. Using RNA-Seq, Shah et al. discovered varied clonal preference and allelic abundance in 104 cases of primary triple-negative breast cancer, and observed that ∼36% of the genomic mutations were actually expressed.33 Combining such information with genomic information may be valuable in the treatment of cancer and other diseases. Moreover, RNA-Seq captures more complex aspects of the transcriptome, such as splicing isoforms34 and editing events,35 which are generally overlooked by hybridization-based methods. Splicing variants have now been associated with several distinct types of cancer and with cancer prognosis.36–40

Although proteins have long been regarded as the executors of most biological functions, clinical proteomics is still a relatively young field, owing to the technological difficulty of profiling the complexity of the proteome with high sensitivity and accuracy. Since the development of soft desorption methods that enabled the analysis of biological macromolecules by mass spectrometry, proteomics has advanced significantly over the past decade.41,42 With current mass spectrometry technology, one can now quantify thousands of proteins in a single sample. For example, we were able to reliably detect 6,280 proteins in the human peripheral blood mononuclear cell proteome.4 Mass spectrometry also allows the detection of expressed mutations, allele-specific sequences and editing events in the human proteome,4,43 as well as profiling of the phosphoproteome.44 Also of note is the MALDI-TOF (matrix-assisted laser desorption/ionization-time of flight) mass spectrometry-based imaging technology (MALDI-MSI) developed by Cornett et al., which allows spatial proteome profiling in defined two-dimensional laser-shot areas on tissue sections.45 Using MALDI-MSI, Kang et al. identified immunoglobulin heavy constant α2 as a novel potential marker for breast cancer metastasis.46

The field of metabolomics has also advanced significantly with the improvement of mass spectrometry. Both hydrophilic and hydrophobic metabolites can be profiled in specific samples.4,47 As the metabolome reflects the real-time energy status and metabolism of the living organism, certain metabolome profiles are expected to be associated with different diseases.48 Metabolomic profiles have therefore become an important component of personalized medicine.49,50 Jamshidi et al. profiled the metabolome of a female patient with Hereditary Hemorrhagic Telangiectasia (HHT) along with four healthy controls, and identified differences that highlighted the nitric oxide synthase pathway.51 The authors then treated the patient with bevacizumab, which shifted her metabolomic profile toward those of the healthy controls and improved her health. In addition, branched-chain amino acids such as isoleucine have been associated with T2D and may ultimately prove to be valuable biomarkers.52 Finally, since some metabolites bind and directly regulate the activity of other biomolecules (e.g., kinases),53 there is significant potential to modulate cellular pathways using diet and metabolic analogs that serve as agonists or antagonists of protein function.


The concept of personalized medicine emphasizes not only personalized diagnosis and treatment, but also personalized disease susceptibility assessment, health monitoring and preventative medicine. Because disease is easier to manage prior to its onset or at its early stages, risk assessment and early detection will be transformative for personalized medicine. Systems biology has the potential to capture real-time molecular phenotypes of a biological system, enabling the detection of subtle network perturbations that precede the development of clinical symptoms.

Disease susceptibility and drug response can be assessed from a person’s genomic information.8 This information may serve as a guideline for monitoring the health of a particular patient to achieve personalized health care, as showcased by Ashley et al.54 Whole genome sequencing revealed variants both for high-penetrance Mendelian disorders, such as HTT (Huntington’s disease55) and PAH (phenylketonuria56), and for common, complex diseases, such as the disease-associated genetic variants reported in GWAS studies.57 Disease risks can be evaluated for a given person, and an increase or decrease in disease risk compared with the population risk (of the same ethnicity, age, and gender) can be estimated (Figure 1). In the study of Ashley et al., the genome of a patient was analyzed and increased post-test probabilities for myocardial infarction and coronary artery disease were estimated.54 Their estimation matched the fact that the patient, although generally healthy, had a family history of vascular disease as well as early sudden death.58 Genetic variants associated with heart-related morbidities as well as drug response were identified in the patient’s genome, information which, as the authors stated, may direct the future health care of this particular patient. Similarly, Dewey et al. further extended this work by analyzing a family quartet using a major allele reference sequence, and identified high-risk genes for familial thrombophilia, obesity, and psoriasis.59

Figure 1.

Example personalized RiskGraph. Each horizontal arrow represents the genetic risk of one disease tested for a specific individual. The tail of each arrow shows the pretest probability of the disease in a population of the same ethnicity, age and gender; the head shows the posttest probability after incorporating the person’s genomic information. Red arrow, increased risk; green arrow, decreased risk.
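The pretest-to-posttest update behind such a RiskGraph is a standard Bayesian odds calculation. A minimal sketch, assuming a single aggregate likelihood ratio for the genotype (the cited studies used more elaborate models; the numbers here are hypothetical):

```python
def posttest_probability(pretest_p, likelihood_ratio):
    """Bayesian risk update: probability -> odds, apply the
    likelihood ratio of the genomic finding, odds -> probability."""
    pretest_odds = pretest_p / (1.0 - pretest_p)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1.0 + posttest_odds)

# A 5% population (pretest) risk combined with a hypothetical
# odds-doubling genotype (likelihood ratio 2):
print(round(posttest_probability(0.05, 2.0), 3))  # 0.095
```

Note that the likelihood ratio multiplies the odds, not the probability, which is why the posttest risk here is slightly below 10%.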

To further explore the variation and predictive power of the full human genome, projects and databases (such as the Personal Genome Project60) are being launched to help advance this field. However, genomic information alone is usually not adequate to predict disease onset, and other factors such as the environment are expected to play a critical role.61,62 The predictive capability of whole genome sequence was assessed by Roberts et al. through modeling of 24 disease risks in monozygotic twins.63 For each disease, the authors modeled the genotype distribution in the twin population according to the observed concordance/discordance, and found that for most diseases, most individuals would receive negative test results relative to the population, while in the best-case scenario any given individual could be forewarned of at least one disease. The results of Roberts et al. are not surprising, as disease manifestation is probabilistic, not deterministic. Nonetheless, whole genome information by itself is expected to have partial value in disease prediction for complex diseases. In addition, from a systems point of view, peripheral components of the biological network are more likely to contribute to complex diseases, as perturbation of the main nodes, which usually correspond to essential genes, would be lethal.64 It is therefore more difficult to identify the exact contributors to complex diseases. Moreover, as stated above, non-genomic factors may further complicate the situation. As an example, multiple sclerosis is known to have genetic components; however, Baranzini et al. failed to identify genomic, epigenomic or transcriptomic contributors in discordant monozygotic twins, which may indicate the existence of other factors, such as the environment.65

Current technologies, especially high-throughput sequencing and mass spectrometry, enable the monitoring of at least 10⁵ molecular components, including DNA, RNA, proteins, and metabolites, in the human body. It is therefore now feasible to identify the profiles of these components that correlate with various physiological states of the body, and the profile alterations that result from physiological state changes and disease. Compared with genomic sequences alone, the profiles of the transcriptome, proteome and metabolome are closer indicators of the real-time phenotype; collecting these omics data in a longitudinal manner therefore allows monitoring of an individual’s physiological states. To test this concept, we implemented a study following a generally healthy participant for 14 (now 32) months with integrated Personal Omics Profile (iPOP) analysis, incorporating the participant’s genome with longitudinal data from his transcriptome, proteome, metabolome, and autoantibodyome.4 As blood constantly circulates through the human body, exchanges biological material with local tissues, and is routinely analyzed in medical tests, we chose to monitor the participant’s physiological states by profiling blood components (PBMCs, serum and plasma) with iPOP analysis. The genome of this individual was sequenced on two WGS (Illumina and Complete Genomics) and three WES (Agilent, Roche NimbleGen, and Illumina) platforms to achieve high accuracy, and was further analyzed for disease risk and drug efficacy. The identified elevated risks included coronary artery disease, basal-cell carcinoma, hypertriglyceridemia and T2D, and the participant was estimated to have a favorable response to rosiglitazone and metformin, both antidiabetic medications.
Although the participant had a known family history for some of the high-risk diseases (but not T2D), he was free of most of them (except hypertriglyceridemia, for which he used medication) and had a normal Body Mass Index at the start of our study. Nonetheless, these elevated disease risks served as a guideline for monitoring his personal health with iPOP analysis. We profiled the transcriptome, proteome and metabolome at 20 time points over the 14 months, and monitored molecular profile changes during physiological state change events, including two viral infections. The subject also developed T2D during the study, immediately after one of the viral (respiratory syncytial virus) infections. Two types of changes were observed in our iPOP data: autocorrelated trends that reflect chronic changes, and spikes, which include significantly up- or down-regulated genes and pathways, especially at the onset of each event. With our iPOP approach, we acquired a comprehensive picture of the detailed molecular differences between physiological states, as well as during disease onset. In particular, interesting changes in glucose and insulin signaling pathways were observed during the onset of T2D. We also obtained other important information from our omics data, such as dynamic changes in allele-specific expression and RNA-editing events, as well as personalized autoantibody profiles. Overall, this study demonstrated an important application of genomic and other omics profiling for personalized disease risk estimation and precision medicine, as we discovered the increased T2D risk, monitored its early onset, and helped the participant effectively control and eventually reverse the phenotype through proactive interventions (diet change and physical exercise).
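The ‘spike’ type of change described above can be illustrated with a crude z-score filter over one analyte’s longitudinal values (the published iPOP analysis used more sophisticated statistics; the data below are invented):

```python
import statistics

def find_spikes(series, z_cutoff=2.0):
    """Return indices of time points that deviate from the series
    mean by more than z_cutoff standard deviations."""
    mean = statistics.fmean(series)
    sd = statistics.stdev(series)
    return [i for i, v in enumerate(series)
            if sd > 0 and abs(v - mean) / sd > z_cutoff]

# Hypothetical expression values over 10 time points with a sharp
# up-regulation at index 5 (e.g., onset of a viral infection):
values = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 1.1, 0.8, 1.0, 0.9]
print(find_spikes(values))  # [5]
```

Autocorrelated trends, by contrast, would show up as sustained drift of the mean over consecutive time points rather than a single outlier.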

Another important feature of our study is that samples are collected in a longitudinal fashion so that aberrant/disease states can be compared to healthy states of the same individual. One other advantage of our iPOP approach is its modularity, as other omics and quantifiable information can also be included in the iPOP profile, which can be readily tailored to monitor any biological or pathological event of interest (Figure 2). Examples of other information are: epigenome,66 gut microbiome,67 microRNA profiles68 and immune receptor repertoire.69 Moreover, quantifiable behavioral parameters such as nutrition, exercise, stress control and sleep may also be added to the profile.70

Figure 2.

The concept of integrative Personal Omics Profile (iPOP) analysis. The physiological state of the body is reflected by the integrated information from different omics profiles, as well as the interactions among them.


One important aspect of systems biology is data mining. Data management and access can become a daunting task given the tremendous amount of data generated with current high-throughput technologies, and data sizes are constantly increasing.71 Computational challenges exist at each step: handling, processing and annotating high-throughput data, integrating data from different sources and platforms, and pursuing clinical interpretation.72 These steps can be computationally intensive and require significant hardware; for example, mapping short reads to achieve 30× coverage of the human genome typically requires 13 CPU days,72 although these times are rapidly decreasing. Moreover, as biological systems are more than the sum of their individual parts, knowledge from multiple levels (such as epistasis, interaction, localization, and activation status) should be considered to capture the underlying highly organized networks for functional annotation.73 Ultimately it will be important to have a comprehensive database that contains Electronic Health Records (including treatment information), genome sequences with variant calls, and as much molecular information as possible. In principle, with appropriate algorithms, such a database could be mined by physicians to make data-driven medical decisions.
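As a back-of-the-envelope illustration of the scale involved, mean sequencing coverage is simply total sequenced bases divided by genome size (read length and genome size below are round, illustrative numbers):

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases divided by
    the size of the genome they are mapped against."""
    return num_reads * read_length / genome_size

# Reads needed for ~30x coverage of a ~3.2 Gb human genome with
# 100 bp reads: on the order of a billion reads to be mapped.
GENOME_SIZE = 3.2e9
reads_needed = 30 * GENOME_SIZE / 100
print(f"{reads_needed:.1e}")  # 9.6e+08
```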

Currently, many high-throughput datasets of similar types (e.g., expression and genome-wide association data collected from different populations with the same disease) have been generated as smaller, separate studies. Combining these publicly available datasets bioinformatically may therefore provide more statistical power and lead to clearer conclusions than could be achieved in the individual studies. The work by Roberts et al. mentioned above serves as one example.63 In order to test the capacity of whole genome information, the authors combined monozygotic twin pair data from a total of five sources in 13 publications to obtain a much larger dataset for their test. Similarly, Butte and colleagues combined the results of 130 functional microarray experiments on T2D and re-mined the data for recurring candidate genes.74 They identified CD44 as the top candidate gene associated with T2D. In a related effort, by analyzing curated data on 2,510 individuals from 74 populations, the group led by Butte also discovered that T2D risk alleles are unevenly distributed across human populations, with higher risk in African and lower risk in Asian populations.75
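A minimal sketch of one standard way to combine evidence across such separate studies is Stouffer's Z method (not necessarily the method used in the studies cited above; the p-values below are invented):

```python
from statistics import NormalDist

def stouffer_combined_p(p_values):
    """Combine one-sided p-values from independent studies by
    averaging their z-scores (Stouffer's method)."""
    nd = NormalDist()
    z_scores = [nd.inv_cdf(1.0 - p) for p in p_values]
    z_combined = sum(z_scores) / len(z_scores) ** 0.5
    return 1.0 - nd.cdf(z_combined)

# Three studies, each only weakly significant on its own, yield a
# much stronger combined result (p on the order of 0.001):
print(stouffer_combined_p([0.04, 0.03, 0.05]))
```

This illustrates the statistical-power argument: pooling independent, individually marginal signals can push the combined evidence well past conventional significance thresholds.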


Personalized health monitoring and precision medicine are accelerating rapidly thanks to developments in systems biology. As noted above, multiple efforts in both technology development and biological application are underway, and an increasing number of researchers and physicians share this vision. Hood et al. termed this approach ‘P4 Medicine’: predictive, preventive, personalized and participatory medicine.12

Nevertheless, many concerns exist, and guidelines on translational omics research have been recommended by the Institute of Medicine.76 Khoury et al. suggested that a fifth ‘P’, the population perspective, be added to personalized medicine,77 and that population-level validation of systems results with strong evidence should be achieved before clinical application. Many disease-associated genetic variants discovered in GWAS still need to be functionally validated.78 Khoury et al. also raised the concern that limited health care resources might be wasted if systems approaches led to unneeded disease screening and subclassification rather than lower health care costs. However, with the rapid drop in technology costs and carefully designed pilot studies, the optimal screening frequencies and levels of subclassification necessary for precision medicine could be determined and costs maintained at affordable levels. It is worth noting that generating personalized omics data with appropriate interpretation can greatly benefit our understanding of physiological events in health and disease, and improve precision health care as we gain more knowledge in this field. In addition to personalized diagnosis and treatment, the future of precision medicine with omics approaches should emphasize personalized health monitoring, molecular symptoms, early detection and preventative medicine: a paradigm shift from traditional health care.

As the human body is a highly organized, complex system with multiple organs and tissues, it is important to select the correct sample type for understanding a specific biological problem. However, as many sample types are unavailable (e.g., brain tissue) or not regularly accessible (e.g., biopsy samples from internal organs) from living individuals, our scope for personalized health monitoring is thus restricted. Therefore systems biology results, especially iPOP results, should not be over-interpreted. Although iPOP data from blood components may indicate changes in the other parts of the human body, the actual profiles for the tissue of interest might be underrepresented in blood or delayed in phase.

It is still not clear who will develop and deliver personalized treatments if they are not available as conventional medications. The cost of developing personalized drugs that accurately address personal specificity may become prohibitive, and such drugs may face other difficulties, such as Food and Drug Administration approval. However, advances in high-throughput drug discovery will help accelerate this field.

In addition, personalized medicine using omics approaches relies heavily on technology development for biological research. This includes advances in both research instrumentation and computational frameworks. For example, it is still not possible to accurately determine the entire sequence of a genome due to limitations of current WGS/WES methods,79,80 even after computational improvement of the signal-to-noise ratio.81,82 Low sequencing error rates are claimed both by the Illumina HiSeq platform (for 2 × 100 bp reads, more than 80% of bases have a quality score above Q30, i.e. 99.9% base-call accuracy) and by the Complete Genomics platform (1 × 10−5 at the time of our study80 and 2 × 10−6 as of October 8th, 2012); however, the per-variant error rate is still high (15.50% and 9.08% for Illumina and Complete Genomics, respectively, with no filter, and 1.01% and 1.12% after multiple filters), as reported by Reumers et al.,81 which agrees with our observation that only 88.1% of SNP calls overlapped when the same genome was sequenced on the two platforms.80 Thus possible disease-associated variants in these platform-specific regions might be overlooked or misinterpreted. Another issue lies in the storage and processing of omics data, as petabytes of data can easily be generated by a small iPOP study of 200 participants, and demanding computing resources will be needed for data analysis. Therefore, interdisciplinary efforts from biologists, computer scientists and hardware engineers should be organized to ensure the continued improvement of this field.
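The quality scores quoted above are on the Phred scale, where Q = −10·log10(P) for per-base error probability P; a quick converter shows why Q30 corresponds to 99.9% base-call accuracy:

```python
import math

def phred_to_error_prob(q):
    """Per-base error probability for a Phred quality score Q:
    P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse mapping: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

print(phred_to_error_prob(30))             # 0.001, i.e. 99.9% accuracy
print(round(error_prob_to_phred(1e-5), 6)) # 50.0, the 1e-5 rate quoted above
```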


The era of personalized precision medicine is about to emerge. The steady improvement of high-throughput technologies greatly facilitates this process by enabling the profiling of various omes, such as the whole genome, epigenome, transcriptome, proteome and metabolome, which convey detailed information about the human body. Integrated profiles of these omes should reflect the physiological status of the host at the time the samples are collected. The personalized omics approach catalyzes precision medicine at two levels: for diseases and biological processes whose mechanisms are still unclear, omics approaches will facilitate research that greatly advances our understanding; and once the mechanisms are clarified, individualized health care can be provided through health monitoring, preventative medicine, and personalized treatment. This would be especially helpful for complex diseases such as autism83 and Alzheimer’s disease,84 where multiple factors are responsible for the phenotypes. Furthermore, the omics approach also facilitates the development of other less-emphasized but important health-related fields, such as nutritional systems biology, which studies personalized diet and its relationship to health from a systems point of view.85 With the rapid decrease in the cost of omics profiling, we anticipate an increased number of personalized medicine applications in many aspects of health care beyond our proof-of-principle study. This will significantly improve the health of the general public and cut health care costs. Scientists, governments, pharmaceutical companies and patients should work closely together to ensure the success of this transformation.86


This work is supported by funding from the Stanford University Department of Genetics and the National Institutes of Health. We thank Drs. George I. Mias and Hogune Im for their help in proofreading the article and for insightful discussions.

OmniPath: guidelines and gateway for literature-curated signaling pathway resources


Figure 1: Resources featured in OmniPath and pypath.

From OmniPath: guidelines and gateway for literature-curated signaling pathway resources

Nature Methods

UCSF pathway in Bioinformatics within the Biological and Medical Informatics Graduate Program.


Degree program

Explore these pages for the details of the UCSF pathway in Bioinformatics within the Biological and Medical Informatics Graduate Program. You will find descriptions of the three research areas in our pathway with links to the faculty members in each and a comprehensive look at the program curriculum with a list of courses and materials. You will also learn about Journal Club requirements, how to select an advisor, details of the qualifying examination, and much more.

Research areas

The Bioinformatics pathway focuses on three areas of research:

  1. Bioinformatics and computational biology
  2. Genetics and genomics
  3. Systems biology

1. Bioinformatics and computational biology

The fields of bioinformatics and computational biology at UCSF aim to investigate questions about biological composition, structure, function, and evolution of molecules, cells, tissues, and organisms using mathematics, informatics, statistics, and computer science.

Because these approaches allow large-scale and quantitative analyses of biological phenomena and data obtained from many disciplines, they can ask questions and achieve unique insights not imaginable before the genomic era.

Both bioinformatics and computational biology are frequently integrated in faculty laboratories, often alongside experimental studies, with bioinformatics emphasizing informatics and statistics, and computational biology emphasizing the development of theoretical methods, mathematical modeling, and computational simulation techniques to answer these questions.

Examples of bioinformatics studies include analysis and integration of -omics data, prediction of protein function from sequence and structural information, and cheminformatics comparisons of protein ligands to identify off-target effects of drugs. Examples in computational biology include simulation of protein motion and folding and how proteins interact with each other.

Faculty members working in these areas include:

2. Genetics and genomics

Genetics is the study of DNA-based inheritance and variation of individuals, while genomics is the study of the structure and function of the genome. Both apply bioinformatics and computational techniques using data generated from methods such as DNA and RNA sequencing, microarrays, proteomics, and electron microscopy, or optical methods for nucleic acid structure determination.

Availability of these and many other new technologies, such as those that can conduct deep sequencing or sequencing of entire microbial communities, is generating massive amounts of data faster than informatics and computational methods can be developed to manage and query them. This opens opportunities for genetics and genomics scientists to develop and apply new cutting-edge technologies to analyze these data.

Faculty members working in genomics and genetics include:

3. Systems biology

Systems biology seeks to understand how cells, tissues, and organisms function from the perspective of the system as a whole. Computational systems biologists use mathematical modeling, simulation, and statistical analysis to gain a fundamental understanding of biological processes such as maintenance of homeostasis, minimal requirements for function, system response to environmental perturbation, predicting response to system stressors, and dissecting protein and nucleic acid networks.

Researchers taking a systems approach often combine computation with experimental work to address these questions. These faculty members include:

Next topic: Curriculum

The National Centre for Text Mining (NaCTeM)



The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester.

On our website, you can find pointers to sources of information about text mining such as links to

  • text mining services provided by NaCTeM
  • software tools, both those developed by the NaCTeM team and by other text mining groups
  • seminars, general events, conferences and workshops
  • tutorials and demonstrations
  • text mining publications

NaCTeM Software Tools

The National Centre for Text Mining bases its service systems on a number of text mining software tools.

Pathway (PPI) resources collected June 2016

ConsensusPathDB (human, yeast, mouse): Release 31 (1 Sept. 2015)

ConsensusPathDB-human integrates interaction networks in Homo sapiens, including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways. Data currently originate from 32 public resources for interactions (listed below) and from interactions that we have curated from the literature. The interaction data are integrated in a complementary manner (avoiding redundancies), resulting in a seamless interaction network containing different types of interactions.
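The complementary, redundancy-avoiding integration described above can be sketched as a merge that keeps each undirected interaction once while recording every resource reporting it (the resource names and interactions below are toy examples, not actual ConsensusPathDB content):

```python
def merge_interaction_sets(sources):
    """Merge protein-protein interaction lists from several
    resources: each undirected pair is stored once, annotated
    with the set of resources that report it."""
    merged = {}
    for source, edges in sources.items():
        for a, b in edges:
            key = tuple(sorted((a, b)))   # undirected: (A, B) == (B, A)
            merged.setdefault(key, set()).add(source)
    return merged

merged = merge_interaction_sets({
    "Reactome": [("TP53", "MDM2"), ("EGFR", "GRB2")],
    "BioGRID": [("MDM2", "TP53"), ("BRCA1", "BARD1")],
})
print(len(merged))               # 3 unique interactions from 4 input records
print(merged[("MDM2", "TP53")])  # both sources (set order may vary)
```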

Current statistics:
unique physical entities: 158,523
unique interactions: 458,570
   gene regulations: 17,098
   protein interactions: 261,085
   genetic interactions: 443
   biochemical reactions: 21,070
   drug-target interactions: 158,874
pathways: 4,593

Licensing information:
The use of ConsensusPathDB is free for academic users. Commercial users should contact Dr. Atanas Kamburov (kamburov [at]) or Dr. Ralf Herwig (herwig [at]). Interaction data from ConsensusPathDB are available under the license terms of each of the contributing databases listed above.
Although best efforts are always applied, the developers of ConsensusPathDB do not assume any legal responsibility for correctness or usefulness of the information in ConsensusPathDB.
ConsensusPathDB is being developed by the Bioinformatics group of the Vertebrate Genomics Department at the Max-Planck-Institute for Molecular Genetics in Berlin, Germany. The project was supported by the EMBRACE and CARCINOGENOMICS projects, funded by the European Commission within its 6th Framework Programme under the thematic area “Life Sciences, Genomics and Biotechnology for Health” (LSHG-CT-2004-512092 and LSHB-CT-2006-037712); the 7th Framework Programme project APO-SYS (HEALTH-F4-2007-200767); the German Federal Ministry of Education and Research within the NGFN-2 program (SMP-Protein, FKZ01GR0472); and the Max Planck Society within its International Research School program (IMPRS-CBSC).

Pathway resources

Name Formats
Reactome2 BioPAX, png, pdf
Pathway Commons7 BioPAX, SIF, png
WikiPathways5 BioPAX, svg, png, pdf, GPML
NCI/Nature Pathway Interaction Database63 BioPAX, jpg, svg
BioCyc4 BioPAX, png, SBML
INOH84 BioPAX, INOH (xml)
NetPath85 BioPAX, SBML, PSI-MI
PharmGKB86 BioPAX, pdf, GPML

Abbreviations: BioPAX, Biological Pathway Exchange; KGML, KEGG Markup Language; PSI-MI, Proteomics Standards Initiative Molecular Interaction; SBML, Systems Biology Markup Language; NCI, National Cancer Institute; INOH, Integrating Network Objects with Hierarchies; PharmGKB, Pharmacogenomics Knowledge Base; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Tools for visualization and analysis of molecular networks, pathways, and -omics data

Pathway mining and comparison

Pathway gene sets were generated based on the GeneCards platform (12), implementing a gene symbolization process that allows comparison of pathway gene sets from 12 different manually curated sources, including: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). A binary matrix was generated for all 3125 pathways, where each column represents a gene, indicated by 1 for presence in the pathway and 0 for absence. Additionally, six sources were analyzed for their cumulative gene content: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).
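The binary matrix construction described above is straightforward to sketch (the pathway names and gene symbols below are hypothetical stand-ins for the 3125 real pathway gene sets):

```python
def pathway_membership_matrix(pathways):
    """Build a binary pathway-by-gene matrix: each row is a pathway,
    each column a gene symbol (sorted union over all pathways),
    with 1 marking membership and 0 absence."""
    genes = sorted(set().union(*pathways.values()))
    matrix = {name: [1 if g in members else 0 for g in genes]
              for name, members in pathways.items()}
    return genes, matrix

genes, matrix = pathway_membership_matrix({
    "Glycolysis": {"HK1", "PFKM", "PKM"},
    "Apoptosis": {"TP53", "CASP3", "PKM"},
})
print(genes)                 # ['CASP3', 'HK1', 'PFKM', 'PKM', 'TP53']
print(matrix["Glycolysis"])  # [0, 1, 1, 1, 0]
```

Rows in this form can be compared directly, e.g. by Jaccard similarity, which is what makes the symbolized gene sets comparable across sources.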


PathCards: multi-source consolidation of human biological pathways




Welcome to the Biological General Repository for Interaction Datasets

BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Our current index is version 3.4.137 and searches 56,733 publications for 1,067,443 protein and genetic interactions, 27,501 chemical associations and 38,559 post-translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats.




STRING is a database of known and predicted protein-protein interactions. The database contains information from numerous sources, including experimental repositories, computational prediction methods and public text collections. STRING is regularly updated and provides a comprehensive view of the protein-protein interactions currently available.

STRING currently covers 9.6 million proteins and 184 million interactions.

Oxford University Press

Pathway Commons, a web resource for biological pathway data



Pathway Viewer Web

PCViz is an open-source, web-based network visualization tool that helps users query Pathway Commons and obtain details about genes and their interactions extracted from multiple pathway data resources.

It allows interactive exploration of the gene networks where users can:

  • expand the network by adding new genes of interest
  • reduce the size of the network by filtering genes or interactions based on different criteria
  • load cancer context to see the overall frequency of alteration for each gene in the network
  • download networks in various formats for further analysis or use in publication

PCViz is built and maintained by Memorial Sloan-Kettering Cancer Center and the University of Toronto.


BioPAX Editor Desktop

Ethan G. Cerami, Benjamin E. Gross, […], and Chris Sander



Pathway Commons is a collection of publicly available pathway data from multiple organisms. Pathway Commons provides a web-based interface that enables biologists to browse and search a comprehensive collection of pathways from multiple sources represented in a common language, a download site that provides integrated bulk sets of pathway information in standard or convenient formats and a web service that software developers can use to conveniently query and access all data. Database providers can share their pathway data via a common repository. Pathways include biochemical reactions, complex assembly, transport and catalysis events and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons aims to collect and integrate all public pathway data available in standard formats. Pathway Commons currently contains data from nine databases with over 1400 pathways and 687 000 interactions and will be continually expanded and updated.
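As a sketch of the web-service access described above, the snippet below builds a search-query URL for the Pathway Commons service; the base URL, endpoint, and parameter names are assumptions based on the PC2 REST interface and may differ between service versions:

```python
from urllib.parse import urlencode

# Assumed PC2 base URL; verify against the current service documentation.
BASE = "https://www.pathwaycommons.org/pc2"

def search_url(keyword, datasource=None, fmt="json"):
    """Build a URL that searches Pathway Commons for pathways
    mentioning the given keyword (e.g. a gene symbol)."""
    params = {"q": keyword, "type": "pathway"}
    if datasource:
        params["datasource"] = datasource  # e.g. "reactome"
    return f"{BASE}/search.{fmt}?{urlencode(params)}"

url = search_url("BRCA1", datasource="reactome")
# The URL can then be fetched with any HTTP client, e.g. urllib.request.
```

Building the URL separately from fetching it keeps the sketch network-free and makes the query parameters easy to inspect.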

Pathway Commons currently includes pathway and interaction information from nine sources

Data Source | Format | Size | Updated | Focus (species) | Reference
BioGRID | PSI-MI 2.5 | 347 508 interactions | August 2010 (3.0.67) | Model organisms | (20)
Cancer Cell Map | BioPAX L2 | 10 pathways, 2104 interactions | May 2006 | Human |
HPRD | PSI-MI 2.5 | 40 618 interactions | 13 April 2010 (Version 9) | Human | (21)
HumanCyc | BioPAX L2 | 266 pathways, 4879 interactions | 16 June 2010 (Version 14.1) | Human | (22)
IMID | BioPAX L2 | 1729 interactions | March 2009 | Human |
IntAct | PSI-MI 2.5 | 154 567 interactions | 8 August 2010 (Version 3.1, r14760) | All | (23)
MINT | PSI-MI 2.5 | 117 202 interactions | 28 July 2010 | All | (24)
NCI/Nature PID | BioPAX L2 | 186 pathways, 13 879 interactions | 10 August 2010 | Human | (25)
Reactome | BioPAX L2 | 1015 pathways, 5397 interactions | 18 June 2010 (Version 33) | Human | (5)
All (integrated) | BioPAX L2 | 1477 pathways, 687 883 interactions | Multiple | |

New sources are periodically added and listed on the Pathway Commons website. Note that pathway and interaction statistics represent non-unique counts from source databases, as these records are not currently merged from multiple sources (only molecules are currently merged).

Data Sources

Warehouse data (canonical molecules, ontologies) are converted to BioPAX utility classes, such as EntityReference, ControlledVocabulary, EntityFeature sub-classes, and saved as the initial BioPAX model, which forms the foundation for integrating pathway data and for id-mapping.

Pathway and binary interaction data (interactions, participants) are normalized next and merged into the database. Original reference molecules are replaced with the corresponding BioPAX warehouse objects.
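The merge step described above can be illustrated with a toy id-mapping sketch; the identifiers, mapping table, and record layout are hypothetical stand-ins for the BioPAX warehouse objects, not Pathway Commons internals:

```python
# Canonical "warehouse" records, keyed by a primary identifier
# (illustrative UniProt-style accession).
warehouse = {"P38398": {"id": "P38398", "name": "BRCA1"}}

# Id-mapping table: secondary identifier -> primary identifier.
id_mapping = {"BRCA1_HUMAN": "P38398"}

def normalize(participant_ids):
    """Replace each original participant reference with its canonical
    warehouse record when a mapping exists; otherwise keep a stub."""
    merged = []
    for pid in participant_ids:
        primary = id_mapping.get(pid, pid)
        merged.append(warehouse.get(primary, {"id": primary, "name": None}))
    return merged
```

The same idea scales up: once every participant points at a shared warehouse record, identical molecules from different source databases collapse into one object, which is what makes cross-source integration possible.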


Links to the access summary for Warehouse data sources are not provided below; however, the total number of requests minus errors will be a fair estimate. Access statistics are computed from January 2014, except unique IP addresses, which are computed from November 2014.


The Pathway Commons team greatly appreciates the fundamental contributions of all the data providers and authors, the open biological ontologies, and the open-source projects and standards that made creating this integrated BioPAX web service and database feasible.


Reactome v56 (only ‘Homo sapiens.owl’) 31-Mar-2016 (BIOPAX)


All names (for data filtering): reactome

Contains: 2007 pathways, 14427 interactions, 35835 participants

Access summary

Publication: Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, Caudy M, Garapati P, Gillespie M, Kamdar MR, Jassal B, Jupe S, Matthews L, May B, Palatnik S, Rothfels K, Shamovsky V, Song H, Williams M, Birney E, Hermjakob H, Stein L, D’Eustachio P. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42(database issue):d472-7 (PMID:24243840)

Availability: free

  NCI Pathway Interaction Database: Pathway

NCI Curated Human Pathways from PID (final); 27-Jul-2015 (BIOPAX)


All names (for data filtering): pid,nci pathway interaction database: pathway

Contains: 745 pathways, 14707 interactions, 10531 participants

Access summary

Publication: Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the Pathway Interaction Database. Nucleic Acids Res. 2009;37(database issue):d674-9 (PMID:18832364)

Availability: free


PhosphoSite Kinase-substrate information; 15-Mar-2016 (BIOPAX)


All names (for data filtering): phosphosite,phosphositeplus

Contains: 27692 interactions, 15458 participants

Access summary

Publication: Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012;40(database issue):d261-70 (PMID:22135298)

Availability: free


HumanCyc 19.5; 27-Oct-2015; under license from SRI International (BIOPAX)


All names (for data filtering): humancyc,biocyc

Contains: 302 pathways, 7102 interactions, 5896 participants

Access summary

Publication: Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 2005;6(1):r2 (PMID:15642094)

Availability: free


HPRD PSI-MI Release 9; 13-Apr-2010 (PSI_MI)


All names (for data filtering): hprd

Contains: 40595 interactions, 9844 participants

Access summary

Publication: Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A. Human Protein Reference Database–2009 update. Nucleic Acids Res. 2009;37(database issue):d767-72 (PMID:18988627)

Availability: academic

  PANTHER Pathway

PANTHER Pathways 3.4 on 18-May-2015 (auto-converted to human-only model) (BIOPAX)


All names (for data filtering): panther,panther pathway,pantherdb

Contains: 272 pathways, 4700 interactions, 6703 participants

Access summary

Publication: Mi H, Muruganujan A, Thomas PD. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2013;41(database issue):d377-86 (PMID:23193289)

Availability: free

  Database of Interacting Proteins

DIP (human), 14-01-2016 (PSI_MI)


All names (for data filtering): dip,database of interacting proteins

Contains: 8218 interactions, 4671 participants

Access summary

Publication: Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32(database issue):d449-51 (PMID:14681454)

Availability: free


BioGRID Release 3.4.135 (human and the viruses), 24-Mar-2016 (PSI_MI)


All names (for data filtering): biogrid

Contains: 322538 interactions, 645241 participants

Access summary

Publication: Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(database issue):d535-9 (PMID:16381927)

Availability: free


IntAct (human only; ‘negative’ files removed), 16-Feb-2016 (PSI_MI)


All names (for data filtering): intact

Contains: 150549 interactions, 403729 participants

Access summary

Publication: Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(database issue):d358-63 (PMID:24234451)

Availability: free


IntAct Complex (human), 16-Feb-2016 (PSI_MI)


All names (for data filtering): intact

Contains: 1452 participants

Access summary

Publication: Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(database issue):d358-63 (PMID:24234451)

Availability: free


BIND (human), 15-Dec-2010 (PSI_MI)


All names (for data filtering): bind,biomolecular interaction network database

Contains: 35279 interactions, 74675 participants

Access summary

Publication: Bader GD, Betel D, Hogue CW. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003;31(1):248-250 (PMID:12519993)

Availability: free


CORUM (human), 17-Feb-2012 (PSI_MI)


All names (for data filtering): corum

Contains: 4401 participants

Access summary

Publication: Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2010;38(database issue):d497-501(PMID:19884131)

Availability: academic


Transcription Factor Target data from Collection 3 in MSigDB (originally from: TRANSFAC Public, by BIOBASE, QIAGEN); version 7.4 (BIOPAX)


All names (for data filtering): transfac

Contains: 427 pathways, 261624 interactions, 13276 participants

Access summary

Publication: Wingender E. The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief Bioinform. 2008;9(4):326-332 (PMID:18436575)

Availability: academic


Human miRNA-target gene relationships from MiRTarBase; v4.5, 01-NOV-2013 (converted 13-MAR-2015) (BIOPAX)


All names (for data filtering): mirtarbase

Contains: 5 pathways, 51214 interactions, 12775 participants

Access summary

Publication: Hsu SD, Tseng YT, Shrestha S, Lin YL, Khaleel A, Chou CH, Chu CF, Huang HY, Lin CM, Ho SY, Jian TY, Lin FM, Chang TH, Weng SL, Liao KW, Liao IE, Liu CC, Huang HD. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2014;42(database issue):d78-85 (PMID:24304892)

Availability: academic


DrugBank v4.3 converted to BioPAX from the original XML dump (BIOPAX)


All names (for data filtering): drugbank

Contains: 19297 interactions, 15854 participants

Access summary

Publication: Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(database issue):d1091-7 (PMID:24203711)

Availability: academic

  Recon X

Recon X: reconstruction of human metabolism, converted from SBML; 2.03 (BIOPAX)


All names (for data filtering): recon x

Contains: 1 pathway, 10813 interactions, 8316 participants

Access summary

Publication: Thiele I, Swainston N, Fleming RM, Hoppe A, Sahoo S, Aurich MK, Haraldsdottir H, Mo ML, Rolfsson O, Stobbe MD, Thorleifsson SG, Agren R, Bölling C, Bordel S, Chavali AK, Dobson P, Dunn WB, Endler L, Hala D, Hucka M, Hull D, Jameson D, Jamshidi N, Jonsson JJ, Juty N, Keating S, Nookaew I, Le Novère N, Malys N, Mazein A, Papin JA, Price ND, Selkov E Sr, Sigurdsson MI, Simeonidis E, Sonnenschein N, Smallbone K, Sorokin A, van Beek JH, Weichart D, Goryanin I, Nielsen J, Westerhoff HV, Kell DB, Mendes P, Palsson BØ. A community-driven global reconstruction of human metabolism. Nat Biotechnol. 2013;31(5):419-425(PMID:23455439)

Availability: free

  Comparative Toxicogenomics Database

Comparative Toxicogenomics Database (human), 20150603 (BIOPAX)


All names (for data filtering): ctd,comparative toxicogenomics database,ctdbase

Contains: 32722 pathways, 390428 interactions, 61031 participants

Access summary

Publication: Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database’s 10th year anniversary: update 2015. Nucleic Acids Res. 2015;43(database issue):d914-20(PMID:25326323)

Availability: academic

  KEGG Pathway

KEGG 07/2011 (only human, hsa* files), converted to BioPAX by the BioModels team (BIOPAX)


All names (for data filtering): kegg,kegg pathway

Contains: 122 pathways, 3566 interactions, 3355 participants

Access summary

Publication: Wrzodek C, Büchel F, Ruff M, Dräger A, Zell A. Precise generation of systems biology models from KEGG pathways. BMC Syst Biol. 2013;7(undefined):15 (PMID:23433509)

Availability: academic

  Small Molecule Pathway Database

Small Molecule Pathway Database 2.0, 07-Jul-2015 (BIOPAX)


All names (for data filtering): smpdb,small molecule pathway database

Contains: 1206 pathways, 4701 interactions, 4863 participants

Access summary

Publication: Jewison T, Su Y, Disfany FM, Liang Y, Knox C, Maciejewski A, Poelzer J, Huynh J, Zhou Y, Arndt D, Djoumbou Y, Liu Y, Deng L, Guo AC, Han B, Pon A, Wilson M, Rafatnia S, Liu P, Wishart DS. SMPDB 2.0: big improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 2014;42(database issue):d478-84 (PMID:24203708)

Availability: free

  Integrating Network Objects with Hierarchies

INOH 4.0 (signal transduction and metabolic data), 22-MAR-2011 (BIOPAX)


All names (for data filtering): inoh,integrating network objects with hierarchies

Contains: 774 pathways, 5432 interactions, 17142 participants

Access summary

Publication: Yamamoto S, Sakai N, Nakamura H, Fukagawa H, Fukuda K, Takagi T. INOH: ontology-based highly structured database of signal transduction pathways. Database (Oxford). 2011;2011(undefined):bar052 (PMID:22120663)

Availability: free


NetPath 12/2011 (BIOPAX)


All names (for data filtering): netpath

Contains: 27 pathways, 6347 interactions, 3266 participants

Access summary

Publication: Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HK, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TS, Lin JX, Houtman JC, Desiderio S, Renauld JC, Constantinescu SN, Ohara O, Hirano T, Kubo M, Singh S, Khatri P, Draghici S, Bader GD, Sander C, Leonard WJ, Pandey A. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010;11(1):r3 (PMID:20067622)

Availability: free


WikiPathways – Community Curated Human Pathways; 29/09/2015 (human) (BIOPAX)


All names (for data filtering): wikipathways

Contains: 333 pathways, 9758 interactions, 9584 participants

Access summary

Publication: Pico AR, Kelder T, van Iersel MP, Hanspers K, Conklin BR, Evelo C. WikiPathways: pathway editing for the people. PLoS Biol. 2008;6(7):e184 (PMID:18651794)

Availability: free


ChEBI Ontology v138, 01-Apr-2016 (WAREHOUSE)

All names (for data filtering): chebi

Publication: Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N, Muthukrishnan V, Owen G, Turner S, Williams M, Steinbeck C. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 2013;41(database issue):d456-63 (PMID:23180789)

Availability: free


UniProtKB/Swiss-Prot (human), 16-Mar-2015 (WAREHOUSE)

All names (for data filtering): uniprot,swissprot,uniprotkb

Publication: UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014;42(database issue):d191-8 (PMID:24253303)

Availability: free


Selected whole-source id-mapping files (to ChEBI) from UniChem (manually edited/fixed/sorted), 29-Dec-2015 (MAPPING)

All names (for data filtering): unichem

Publication: Chambers J, Davies M, Gaulton A, Hersey A, Velankar S, Petryszak R, Hastings J, Bellis L, McGlinchey S, Overington JP. UniChem: a unified chemical structure cross-referencing and identifier tracking system. J Cheminform. 2013;5(1):3 (PMID:23317286)

Availability: free

ConsensusPathDB—a database for integrating human functional interaction networks

ConsensusPathDB is a database system for the integration of human functional interactions. Current knowledge of these interactions is dispersed in more than 200 databases, each having a specific focus and data format. ConsensusPathDB currently integrates the content of 12 different interaction databases with heterogeneous foci comprising a total of 26 133 distinct physical entities and 74 289 distinct functional interactions (protein–protein interactions, biochemical reactions, gene regulatory interactions), and covering 1738 pathways. We describe the database schema and the methods used for data integration. Furthermore, we describe the functionality of the ConsensusPathDB web interface, where users can search and visualize interaction networks, upload, modify and expand networks in BioPAX, SBML or PSI-MI format, or carry out over-representation analysis with uploaded identifier lists with respect to substructures derived from the integrated interaction network. The ConsensusPathDB database is freely available online.
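The over-representation analysis mentioned above typically reduces to a hypergeometric tail test: given a user gene list, how surprising is its overlap with a pathway's gene set? The sketch below uses illustrative numbers and is not ConsensusPathDB's actual implementation:

```python
from math import comb

def enrichment_p(universe, pathway, hits, overlap):
    """Hypergeometric tail probability P(X >= overlap) of seeing at
    least `overlap` pathway genes when drawing `hits` genes from a
    `universe` that contains `pathway` pathway genes."""
    return sum(
        comb(pathway, k) * comb(universe - pathway, hits - k)
        for k in range(overlap, min(pathway, hits) + 1)
    ) / comb(universe, hits)

# Illustrative call: 20 000-gene universe, a 100-gene pathway,
# a 50-gene user list, 5 of which fall in the pathway.
p = enrichment_p(universe=20000, pathway=100, hits=50, overlap=5)
```

A small p-value indicates the pathway is over-represented in the uploaded list; in practice the p-values are corrected for testing many pathways at once (e.g. Benjamini-Hochberg).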

The MIPS Mammalian Protein-Protein Interaction Database

The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.

Other PPI resources

There are many interesting databases and other sites on protein-protein interactions. We are currently aware of the following PPI resources:

Resource Comments
APID Agile Protein Interaction DataAnalyzer (Cancer Research Center, Salamanca, Spain)
BIND Biomolecular INteraction Network Database at the University of Toronto, Canada. No species restriction
CYGD PPI section of the Comprehensive Yeast Genome Database. Manually curated comprehensive S. cerevisiae PPI database at MIPS
DIP Database of Interacting Proteins at UCLA. No species restriction.
GRID General Repository for Interaction Datasets. Mount Sinai Hospital, Toronto, Canada
HIV Interaction DB Interactions between HIV and host proteins.
HPRD The Human Protein Reference Database. Institute of Bioinformatics, Bangalore, India and Johns Hopkins University, Baltimore, MD, USA.
HPID Human Protein Interaction Database. Department of Computer Science and Information Engineering, Inha University, Inchon, Korea
iHOP iHOP (Information Hyperlinked over Proteins). Protein association network built by literature mining
IntAct Protein interaction database at EBI. No species restriction.
InterDom Database of putative interacting protein domains. Institute for InfoComm Research, Singapore.
JCB PPI site at the Jena Centre for Bioinformatics, Germany
MetaCore Commercial software suite and database. Manually curated human PPIs (among other things). GeneGo
MINT Molecular INTeraction database at the Centro di Bioinformatica Moleculare, Universita di Roma, Italy.
MRC PPI links Commented list of links to PPI databases and resources maintained at the MRC Rosalind Franklin Centre for Genomics Research, Cambridge, UK
OPHID The Online Predicted Human Interaction Database. Ontario Cancer Institute and University of Toronto, Canada.
Pawson Lab Information on protein-interaction domains.
PDZbase Database of PDZ mediated protein-protein interactions.
Predictome Predicted functional associations and interactions. Boston University.
Protein-Protein Interaction Server Analysis of protein-protein interfaces of protein complexes from PDB. University College of London, UK.
PathCalling Proteomics and PPI tool/database. CuraGen Corporation.
PIM Hybrigenics PPI data and tool, H. pylori. Free academic license available
RIKEN Experimental and literature PPIs in mouse.
STRING Protein networks based on experimental data and predictions at EMBL.
YPD “BioKnowledge Library” at Incyte Corporation. Manually curated PPI data from S. cerevisiae. Proprietary.