The news in our blog



Epigenomic analysis software tools and databases



Transcriptomic analysis software tools and databases



Metabolomic analysis software tools and databases



Fluxomic analysis software tools and databases



Biological pathway analysis software tools and databases


OmniPath: guidelines and gateway for literature-curated signaling pathway resources


Figure 1: Resources featured in OmniPath and pypath.

From OmniPath: guidelines and gateway for literature-curated signaling pathway resources

Nature Methods
Published online

Pathway (PPI) resources collected 2016-JUNE

Release 31 (01. Sept. 2015)

ConsensusPathDB-human integrates interaction networks in Homo sapiens including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways. Data originate from currently 32 public resources for interactions (listed below) and interactions that we have curated from the literature. The interaction data are integrated in a complementary manner (avoiding redundancies), resulting in a seamless interaction network containing different types of interactions.

Current statistics:
unique physical entities: 158,523
unique interactions: 458,570
   gene regulations: 17,098
   protein interactions: 261,085
   genetic interactions: 443
   biochemical reactions: 21,070
   drug-target interactions: 158,874
pathways: 4,593

Licensing information:
The use of ConsensusPathDB is free for academic users. Commercial users should contact Dr. Atanas Kamburov (kamburov [at] ) or Dr. Ralf Herwig (herwig [at] ). Interaction data from ConsensusPathDB is available under the license terms of each of the contributing databases listed above.
Although best efforts are always applied, the developers of ConsensusPathDB do not assume any legal responsibility for correctness or usefulness of the information in ConsensusPathDB.
ConsensusPathDB is being developed by the Bioinformatics group of the Vertebrate Genomics Department at the Max-Planck-Institute for Molecular Genetics in Berlin, Germany. The project was supported by the EMBRACE and CARCINOGENOMICS projects that are funded by the European Commission within its 6th Framework Programme under the thematic area “Life Sciences, Genomics and Biotechnology for Health” (LSHG-CT-2004-512092 and LSHB-CT-2006-037712); 7th Framework Programme project APO-SYS (HEALTH-F4-2007-200767); German Federal Ministry of Education and Research within the 65 NGFN-2 program (SMP-Protein, FKZ01GR0472); Max Planck Society within its International Research School program (IMPRS-CBSC).

Pathway resources

Name (reference) | Formats
Reactome (2) | BioPAX, png, pdf
Pathway Commons (7) | BioPAX, SIF, png
WikiPathways (5) | BioPAX, svg, png, pdf, gpml
Nature/NCI Pathway Interaction Database (63) | BioPAX, jpg, svg
BioCyc (4) | BioPAX, png, SBML
INOH (84) | BioPAX, INOH (xml)
NetPath (85) | BioPAX, SBML, PSI-MI
PharmGKB (86) | BioPAX, pdf, gpml

Abbreviations: BioPAX, Biological Pathway Exchange; KGML, KEGG Markup Language; PSI-MI, Proteomics Standards Initiative Molecular Interaction; SBML, Systems Biology Markup Language; NCI, National Cancer Institute; INOH, Integrating Network Objects with Hierarchies; PharmGKB, Pharmacogenomics Knowledge Base; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Tools for visualization and analysis of molecular networks, pathways, and -omics data



PathCards: multi-source consolidation of human biological pathways




Welcome to the Biological General Repository for Interaction Datasets

BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Our current index is version 3.4.137 and searches 56,733 publications for 1,067,443 protein and genetic interactions, 27,501 chemical associations and 38,559 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats.




STRING is a database of known and predicted protein-protein interactions. The database contains information from numerous sources, including experimental repositories, computational prediction methods and public text collections. STRING is regularly updated and gives a comprehensive view on protein-protein interactions currently available.

Proteins: 9.6 million
Interactions: 184 million

Oxford University Press

Pathway Commons, a web resource for biological pathway data



Pathway Viewer Web

PCViz is an open-source web-based network visualization tool that helps users query Pathway Commons and obtain details about genes and their interactions extracted from multiple pathway data resources.

It allows interactive exploration of the gene networks where users can:

  • expand the network by adding new genes of interest
  • reduce the size of the network by filtering genes or interactions based on different criteria
  • load cancer context to see the overall frequency of alteration for each gene in the network
  • download networks in various formats for further analysis or use in publication

PCViz is built and maintained by Memorial Sloan-Kettering Cancer Center and the University of Toronto.


BioPAX Editor Desktop

Ethan G. Cerami, Benjamin E. Gross, […], and Chris Sander



Pathway Commons is a collection of publicly available pathway data from multiple organisms. Pathway Commons provides a web-based interface that enables biologists to browse and search a comprehensive collection of pathways from multiple sources represented in a common language, a download site that provides integrated bulk sets of pathway information in standard or convenient formats and a web service that software developers can use to conveniently query and access all data. Database providers can share their pathway data via a common repository. Pathways include biochemical reactions, complex assembly, transport and catalysis events and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons aims to collect and integrate all public pathway data available in standard formats. Pathway Commons currently contains data from nine databases with over 1400 pathways and 687 000 interactions and will be continually expanded and updated.

Pathway Commons currently includes pathway and interaction information from nine sources

Data source | Format | Size | Updated | Focus (species) | Reference or URL
BioGRID | PSI-MI 2.5 | 347 508 interactions | August 2010 (3.0.67) | Model organisms | (20)
Cancer Cell Map | BioPAX L2 | 10 pathways; 2104 interactions | May 2006 | Human |
HPRD | PSI-MI 2.5 | 40 618 interactions | 13 April 2010, Version 9 | Human | (21)
HumanCyc | BioPAX L2 | 266 pathways; 4879 interactions | 16 June 2010, Version 14.1 | Human | (22)
IMID | BioPAX L2 | 1729 interactions | March 2009 | Human |
IntAct | PSI-MI 2.5 | 154 567 interactions | 8 August 2010, Version 3.1, r14760 | All | (23)
MINT | PSI-MI 2.5 | 117 202 interactions | 28 July 2010 | All | (24)
NCI/Nature PID | BioPAX L2 | 186 pathways; 13 879 interactions | 10 August 2010 | Human | (25)
Reactome | BioPAX L2 | 1015 pathways; 5397 interactions | 18 June 2010, Version 33 | Human | (5)
All | Integrated BioPAX L2 | 1477 pathways; 687 883 interactions | | Multiple | http:///

New sources are periodically added and listed on the Pathway Commons website. Note that pathway and interaction statistics represent non-unique counts from source databases, as these records are not currently merged from multiple sources (only molecules are currently merged).
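As a quick consistency check on the table above, the per-source counts can be summed to confirm that the "All" row is a plain, non-unique total. The following is a minimal Python sketch; the dictionary layout and the function name are illustrative assumptions, while the numbers are copied from the table.

```python
# Per-source (pathways, interactions) counts copied from the table above.
# Sources that list only interactions are given 0 pathways.
SOURCES = {
    "BioGRID": (0, 347_508),
    "Cancer Cell Map": (10, 2_104),
    "HPRD": (0, 40_618),
    "HumanCyc": (266, 4_879),
    "IMID": (0, 1_729),
    "IntAct": (0, 154_567),
    "MINT": (0, 117_202),
    "NCI/Nature PID": (186, 13_879),
    "Reactome": (1_015, 5_397),
}

def totals(sources):
    """Return (total pathways, total interactions) as plain sums."""
    pathways = sum(p for p, _ in sources.values())
    interactions = sum(i for _, i in sources.values())
    return pathways, interactions

print(totals(SOURCES))  # (1477, 687883), matching the "All" row
```

Because records are not merged across sources, an interaction reported by two databases is counted twice in these totals.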

Data Sources

Warehouse data (canonical molecules, ontologies) are converted to BioPAX utility classes, such as EntityReference, ControlledVocabulary, EntityFeature sub-classes, and saved as the initial BioPAX model, which forms the foundation for integrating pathway data and for id-mapping.

Pathway and binary interaction data (interactions, participants) are normalized next and merged into the database. Original reference molecules are replaced with the corresponding BioPAX warehouse objects.
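The two steps above can be pictured with a toy example: a warehouse maps source-specific identifiers to canonical reference objects, and merging swaps each participant's original reference for its canonical counterpart. The sketch below is a simplified Python illustration and not the Pathway Commons implementation; the class `EntityRef`, the functions `build_id_map` and `merge_interaction`, and the sample identifiers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityRef:
    """Hypothetical stand-in for a canonical warehouse molecule."""
    canonical_id: str
    name: str

def build_id_map(warehouse):
    """Map every known synonym identifier to its canonical EntityRef."""
    id_map = {}
    for ref, synonyms in warehouse.items():
        id_map[ref.canonical_id] = ref
        for syn in synonyms:
            id_map[syn] = ref
    return id_map

def merge_interaction(participants, id_map):
    """Replace original participant ids with canonical refs.

    Participants with no warehouse match are kept as their original id.
    """
    return [id_map.get(p, p) for p in participants]

# Toy warehouse: one protein known under two source-specific ids.
tp53 = EntityRef("uniprot:P04637", "TP53")
warehouse = {tp53: ["hgnc:TP53", "ncbigene:7157"]}
id_map = build_id_map(warehouse)

# Both source ids resolve to the same canonical object; the unknown id passes through.
merged = merge_interaction(["hgnc:TP53", "ncbigene:7157", "unknown:X"], id_map)
```

Keeping unmatched identifiers unchanged is a design choice of this sketch; a real pipeline might instead log or reject them.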


Links to the access summary for Warehouse data sources are not provided below; however, the total number of requests minus errors will be a fair estimate. Access statistics are computed from January 2014, except unique IP addresses, which are computed from November 2014.


The Pathway Commons team greatly appreciates the fundamental contribution of all the data providers and authors, all the open biological ontologies, and the open-source projects and standards, which made the creation of this integrated BioPAX web service and database feasible.


Reactome v56 (only ‘Homo sapiens.owl’) 31-Mar-2016 (BIOPAX)


All names (for data filtering): reactome

Contains: 2007 pathways, 14427 interactions, 35835 participants

Access summary

Publication: Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, Caudy M, Garapati P, Gillespie M, Kamdar MR, Jassal B, Jupe S, Matthews L, May B, Palatnik S, Rothfels K, Shamovsky V, Song H, Williams M, Birney E, Hermjakob H, Stein L, D’Eustachio P. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42(database issue):d472-7 (PMID:24243840)

Availability: free

  NCI Pathway Interaction Database: Pathway

NCI Curated Human Pathways from PID (final); 27-Jul-2015 (BIOPAX)


All names (for data filtering): pid,nci pathway interaction database: pathway

Contains: 745 pathways, 14707 interactions, 10531 participants

Access summary

Publication: Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the Pathway Interaction Database. Nucleic Acids Res. 2009;37(database issue):d674-9 (PMID:18832364)

Availability: free


PhosphoSite Kinase-substrate information; 15-Mar-2016 (BIOPAX)


All names (for data filtering): phosphosite,phosphositeplus

Contains: 27692 interactions, 15458 participants

Access summary

Publication: Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012;40(database issue):d261-70 (PMID:22135298)

Availability: free


HumanCyc 19.5; 27-Oct-2015; under license from SRI International (BIOPAX)


All names (for data filtering): humancyc,biocyc

Contains: 302 pathways, 7102 interactions, 5896 participants

Access summary

Publication: Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 2005;6(1):r2 (PMID:15642094)

Availability: free


HPRD PSI-MI Release 9; 13-Apr-2010 (PSI_MI)


All names (for data filtering): hprd

Contains: 40595 interactions, 9844 participants

Access summary

Publication: Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A. Human Protein Reference Database–2009 update. Nucleic Acids Res. 2009;37(database issue):d767-72 (PMID:18988627)

Availability: academic

  PANTHER Pathway

PANTHER Pathways 3.4 on 18-May-2015 (auto-converted to human-only model) (BIOPAX)


All names (for data filtering): panther,panther pathway,pantherdb

Contains: 272 pathways, 4700 interactions, 6703 participants

Access summary

Publication: Mi H, Muruganujan A, Thomas PD. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2013;41(database issue):d377-86 (PMID:23193289)

Availability: free

  Database of Interacting Proteins

DIP (human), 14-01-2016 (PSI_MI)


All names (for data filtering): dip,database of interacting proteins

Contains: 8218 interactions, 4671 participants

Access summary

Publication: Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32(database issue):d449-51 (PMID:14681454)

Availability: free


BioGRID Release 3.4.135 (human and the viruses), 24-Mar-2016 (PSI_MI)


All names (for data filtering): biogrid

Contains: 322538 interactions, 645241 participants

Access summary

Publication: Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(database issue):d535-9 (PMID:16381927)

Availability: free


IntAct (human only; ‘negative’ files removed), 16-Feb-2016 (PSI_MI)


All names (for data filtering): intact

Contains: 150549 interactions, 403729 participants

Access summary

Publication: Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(database issue):d358-63 (PMID:24234451)

Availability: free


IntAct Complex (human), 16-Feb-2016 (PSI_MI)


All names (for data filtering): intact

Contains: 1452 participants

Access summary

Publication: Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(database issue):d358-63 (PMID:24234451)

Availability: free


BIND (human), 15-Dec-2010 (PSI_MI)


All names (for data filtering): bind,biomolecular interaction network database

Contains: 35279 interactions, 74675 participants

Access summary

Publication: Bader GD, Betel D, Hogue CW. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003;31(1):248-250 (PMID:12519993)

Availability: free


CORUM (human), 17-Feb-2012 (PSI_MI)


All names (for data filtering): corum

Contains: 4401 participants

Access summary

Publication: Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2010;38(database issue):d497-501 (PMID:19884131)

Availability: academic


Transcription Factor Target data from Collection 3 in MSigDB (originally from: TRANSFAC Public, by BIOBASE, QIAGEN); version 7.4 (BIOPAX)


All names (for data filtering): transfac

Contains: 427 pathways, 261624 interactions, 13276 participants

Access summary

Publication: Wingender E. The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief Bioinform. 2008;9(4):326-332 (PMID:18436575)

Availability: academic


Human miRNA-target gene relationships from MiRTarBase; v4.5, 01-NOV-2013 (converted 13-MAR-2015) (BIOPAX)


All names (for data filtering): mirtarbase

Contains: 5 pathways, 51214 interactions, 12775 participants

Access summary

Publication: Hsu SD, Tseng YT, Shrestha S, Lin YL, Khaleel A, Chou CH, Chu CF, Huang HY, Lin CM, Ho SY, Jian TY, Lin FM, Chang TH, Weng SL, Liao KW, Liao IE, Liu CC, Huang HD. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2014;42(database issue):d78-85 (PMID:24304892)

Availability: academic


DrugBank v4.3 converted to BioPAX from the original XML dump (BIOPAX)


All names (for data filtering): drugbank

Contains: 19297 interactions, 15854 participants

Access summary

Publication: Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(database issue):d1091-7 (PMID:24203711)

Availability: academic

  Recon X

Recon X: Reconstruction of the Human Genome, converted from SBML; 2.03  (BIOPAX)


All names (for data filtering): recon x

Contains: 1 pathway, 10813 interactions, 8316 participants

Access summary

Publication: Thiele I, Swainston N, Fleming RM, Hoppe A, Sahoo S, Aurich MK, Haraldsdottir H, Mo ML, Rolfsson O, Stobbe MD, Thorleifsson SG, Agren R, Bölling C, Bordel S, Chavali AK, Dobson P, Dunn WB, Endler L, Hala D, Hucka M, Hull D, Jameson D, Jamshidi N, Jonsson JJ, Juty N, Keating S, Nookaew I, Le Novère N, Malys N, Mazein A, Papin JA, Price ND, Selkov E Sr, Sigurdsson MI, Simeonidis E, Sonnenschein N, Smallbone K, Sorokin A, van Beek JH, Weichart D, Goryanin I, Nielsen J, Westerhoff HV, Kell DB, Mendes P, Palsson BØ. A community-driven global reconstruction of human metabolism. Nat Biotechnol. 2013;31(5):419-425 (PMID:23455439)

Availability: free

  Comparative Toxicogenomics Database

Comparative Toxicogenomics Database (human), 20150603 (BIOPAX)


All names (for data filtering): ctd,comparative toxicogenomics database,ctdbase

Contains: 32722 pathways, 390428 interactions, 61031 participants

Access summary

Publication: Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database’s 10th year anniversary: update 2015. Nucleic Acids Res. 2015;43(database issue):d914-20 (PMID:25326323)

Availability: academic

  KEGG Pathway

KEGG 07/2011 (only human, hsa* files), converted to BioPAX by the BioModels team (BIOPAX)


All names (for data filtering): kegg,kegg pathway

Contains: 122 pathways, 3566 interactions, 3355 participants

Access summary

Publication: Wrzodek C, Büchel F, Ruff M, Dräger A, Zell A. Precise generation of systems biology models from KEGG pathways. BMC Syst Biol. 2013;7:15 (PMID:23433509)

Availability: academic

  Small Molecule Pathway Database

Small Molecule Pathway Database 2.0, 07-Jul-2015 (BIOPAX)


All names (for data filtering): smpdb,small molecule pathway database

Contains: 1206 pathways, 4701 interactions, 4863 participants

Access summary

Publication: Jewison T, Su Y, Disfany FM, Liang Y, Knox C, Maciejewski A, Poelzer J, Huynh J, Zhou Y, Arndt D, Djoumbou Y, Liu Y, Deng L, Guo AC, Han B, Pon A, Wilson M, Rafatnia S, Liu P, Wishart DS. SMPDB 2.0: big improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 2014;42(database issue):d478-84 (PMID:24203708)

Availability: free

  Integrating Network Objects with Hierarchies

INOH 4.0 (signal transduction and metabolic data), 22-MAR-2011 (BIOPAX)


All names (for data filtering): inoh,integrating network objects with hierarchies

Contains: 774 pathways, 5432 interactions, 17142 participants

Access summary

Publication: Yamamoto S, Sakai N, Nakamura H, Fukagawa H, Fukuda K, Takagi T. INOH: ontology-based highly structured database of signal transduction pathways. Database (Oxford). 2011;2011:bar052 (PMID:22120663)

Availability: free


NetPath 12/2011 (BIOPAX)


All names (for data filtering): netpath

Contains: 27 pathways, 6347 interactions, 3266 participants

Access summary

Publication: Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HK, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TS, Lin JX, Houtman JC, Desiderio S, Renauld JC, Constantinescu SN, Ohara O, Hirano T, Kubo M, Singh S, Khatri P, Draghici S, Bader GD, Sander C, Leonard WJ, Pandey A. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010;11(1):r3 (PMID:20067622)

Availability: free


WikiPathways – Community Curated Human Pathways; 29/09/2015 (human) (BIOPAX)


All names (for data filtering): wikipathways

Contains: 333 pathways, 9758 interactions, 9584 participants

Access summary

Publication: Pico AR, Kelder T, van Iersel MP, Hanspers K, Conklin BR, Evelo C. WikiPathways: pathway editing for the people. PLoS Biol. 2008;6(7):e184 (PMID:18651794)

Availability: free


ChEBI Ontology v138, 01-Apr-2016 (WAREHOUSE)

All names (for data filtering): chebi

Publication: Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N, Muthukrishnan V, Owen G, Turner S, Williams M, Steinbeck C. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 2013;41(database issue):d456-63 (PMID:23180789)

Availability: free


UniProtKB/Swiss-Prot (human), 16-Mar-2015 (WAREHOUSE)

All names (for data filtering): uniprot,swissprot,uniprotkb

Publication: UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014;42(database issue):d191-8 (PMID:24253303)

Availability: free


Selected whole-source id-mapping files (to ChEBI) from UniChem (manually edited/fixed/sorted), 29-Dec-2015 (MAPPING)

All names (for data filtering): unichem

Publication: Chambers J, Davies M, Gaulton A, Hersey A, Velankar S, Petryszak R, Hastings J, Bellis L, McGlinchey S, Overington JP. UniChem: a unified chemical structure cross-referencing and identifier tracking system. J Cheminform. 2013;5(1):3 (PMID:23317286)

Availability: free

ConsensusPathDB—a database for integrating human functional interaction networks

ConsensusPathDB is a database system for the integration of human functional interactions. Current knowledge of these interactions is dispersed in more than 200 databases, each having a specific focus and data format. ConsensusPathDB currently integrates the content of 12 different interaction databases with heterogeneous foci comprising a total of 26 133 distinct physical entities and 74 289 distinct functional interactions (protein–protein interactions, biochemical reactions, gene regulatory interactions), and covering 1738 pathways. We describe the database schema and the methods used for data integration. Furthermore, we describe the functionality of the ConsensusPathDB web interface, where users can search and visualize interaction networks, upload, modify and expand networks in BioPAX, SBML or PSI-MI format, or carry out over-representation analysis with uploaded identifier lists with respect to substructures derived from the integrated interaction network. The ConsensusPathDB database is available at:

The MIPS Mammalian Protein-Protein Interaction Database

The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.

Other PPI resources

There are plenty of interesting databases and other sites on protein-protein interactions. Currently we are aware of the following PPI resources:

Resource | Comments
APID | Agile Protein Interaction DataAnalyzer (Cancer Research Center, Salamanca, Spain).
BIND | Biomolecular INteraction Network Database at the University of Toronto, Canada. No species restriction.
CYGD | PPI section of the Comprehensive Yeast Genome Database. Manually curated comprehensive S. cerevisiae PPI database at MIPS.
DIP | Database of Interacting Proteins at UCLA. No species restriction.
GRID | General Repository for Interaction Datasets. Mount Sinai Hospital, Toronto, Canada.
HIV Interaction DB | Interactions between HIV and host proteins.
HPRD | The Human Protein Reference Database. Institute of Bioinformatics, Bangalore, India and Johns Hopkins University, Baltimore, MD, USA.
HPID | Human Protein Interaction Database. Department of Computer Science and Information Engineering, Inha University, Inchon, Korea.
iHOP | Information Hyperlinked over Proteins. Protein association network built by literature mining.
IntAct | Protein interaction database at EBI. No species restriction.
InterDom | Database of putative interacting protein domains. Institute for InfoComm Research, Singapore.
JCB | PPI site at the Jena Centre for Bioinformatics, Germany.
MetaCore | Commercial software suite and database. Manually curated human PPIs (among other things). GeneGo.
MINT | Molecular INTeraction database at the Centro di Bioinformatica Moleculare, Universita di Roma, Italy.
MRC PPI links | Commented list of links to PPI databases and resources maintained at the MRC Rosalind Franklin Centre for Genomics Research, Cambridge, UK.
OPHID | The Online Predicted Human Interaction Database. Ontario Cancer Institute and University of Toronto, Canada.
Pawson Lab | Information on protein-interaction domains.
PDZbase | Database of PDZ mediated protein-protein interactions.
Predictome | Predicted functional associations and interactions. Boston University.
Protein-Protein Interaction Server | Analysis of protein-protein interfaces of protein complexes from PDB. University College of London, UK.
PathCalling | Proteomics and PPI tool/database. CuraGen Corporation.
PIM | Hybrigenics PPI data and tool, H. pylori. Free academic license available.
RIKEN | Experimental and literature PPIs in mouse.
STRING | Protein networks based on experimental data and predictions at EMBL.
YPD | “BioKnowledge Library” at Incyte Corporation. Manually curated PPI data from S. cerevisiae. Proprietary.


Human biological pathway unification


PathCards is an integrated database of human biological pathways and their annotations. Human pathways were clustered into SuperPaths based on gene content similarity. Each PathCard provides information on one SuperPath which represents one or more human pathways. It includes 1,131 SuperPath entries, consolidated from 12 sources.

Publication Details

Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M. and Lancet, D. PathCards: multi-source consolidation of human biological pathways, Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006.


PathCards: multi-source consolidation of human biological pathways

  1. Frida Belinky*,
  2. Noam Nativ,
  3. Gil Stelzer,
  4. Shahar Zimmerman,
  5. Tsippi Iny Stein,
  6. Marilyn Safran and
  7. Doron Lancet

Author Affiliations

  1. Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel
  *Corresponding author: Tel: +972-89343188; Fax: +972-89344487; Email:
  • Received September 22, 2014.
  • Revision received January 13, 2015.
  • Accepted January 14, 2015.


The study of biological pathways is key to a large number of systems analyses. However, many relevant tools consider a limited number of pathway sources, missing out on many genes and gene-to-gene connections. Simply pooling several pathway sources would result in redundancy and the lack of systematic pathway interrelations. To address this, we exercised a combination of hierarchical clustering and nearest neighbor graph representation, with judiciously selected cutoff values, thereby consolidating 3215 human pathways from 12 sources into a set of 1073 SuperPaths. Our unification algorithm finds a balance between reducing redundancy and optimizing the level of pathway-related informativeness for individual genes. We show a substantial enhancement of the SuperPaths’ capacity to infer gene-to-gene relationships when compared with individual pathway sources, separately or taken together. Further, we demonstrate that the chosen 12 sources entail nearly exhaustive gene coverage. The computed SuperPaths are presented in a new online database, PathCards, showing each SuperPath, its constituent network of pathways, and its contained genes. This provides researchers with a rich, searchable systems analysis resource. Database URL:


The systematic analysis of biological pathways has ever-increasing significance in an age of growing systems analyses and omics data. Mapping genes onto pathways may contribute to a better understanding of biological and biomedical mechanisms. The literature provides a large collection of pathway definition sources (1). Pathway knowledge bases represent the careful collection of genes and their interactions, mapped onto biological processes. These repositories, which include both academic and commercial resources (Figure 1A), provide lists of pathways and their cellular components, each with an idiosyncratic view of the pathway universe.

Figure 1. The gene-content network of pathway sources. Eighteen sources are shown, 12 of which (colored) are included in SuperPaths generation. Edge widths are proportional to the pairwise Jaccard similarity coefficient computed for the gene contents of the entire source. The sources, depicted in GeneCards Version 3.12, are: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). White circles correspond to sources not included in the SuperPath generation process: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).


Indeed, the definition of the boundaries of biological pathways differs among sources, as exemplified by the highly studied processes of fatty acid metabolism (2) or the TCA cycle (the tricarboxylic acid cycle) (3). Further, the same pathway name may have widely dissimilar gene content in different sources (4). At present, there is no definitive analysis of pathway similarities, either between or within sources. Thus the multitude of pathway resources can often be confusing when portraying gene-pathway affiliations.

Previous attempts to unify pathways from several sources include NCBI’s Biosystems (5), PathwayCommons (6), PathJam (7), HPD (8), ConsensusPathDB (9), hiPathDB (10) and Pathway Distiller (11). But none of these efforts entail a standardized method to unify numerous sources into a consolidated global repository.

Here, we describe an approach aimed at generating an integrated view across multiple pathway sources. We applied a combination of nearest neighbor graph and hierarchical clustering, utilizing a gene-content metric, to generate a manageable set of 1073 unified pathways (SuperPaths). These optimally encompass all of the information contained in the individual sources, striving to minimize pathway redundancy while maximizing gene-related pathway informativeness. The resultant SuperPaths are integrated into GeneCards (12), enabling clear portrayal of a gene’s set of unified pathways. Finally, these SuperPaths, together with diverse related biological data, are provided in PathCards—a new pathway-centric online database, enabling quick in-depth analysis of each human SuperPath.


Materials and methods

Pathway mining and comparison

Pathway gene sets were generated based on the GeneCards platform (12), implementing the gene symbolization process, which allows comparison of pathway gene sets from 12 different manually curated sources: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). A binary matrix was generated for all 3215 pathways, where each column represents a gene, indicated by 1 for presence in the pathway and 0 for absence. Additionally, six sources were analysed for their cumulative tally of gene content: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).
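The presence/absence matrix described above can be illustrated with a minimal sketch. The function name build_membership_matrix and the toy pathway names and gene lists are hypothetical and are not taken from the GeneCards pipeline; the sketch only assumes that each pathway is available as a set of gene symbols.

```python
def build_membership_matrix(pathways):
    """Return (genes, matrix): matrix[name][j] is 1 if genes[j] is in that pathway.

    `pathways` maps a pathway name to a set of gene symbols (toy input, assumed).
    """
    genes = sorted(set().union(*pathways.values()))
    matrix = {
        name: [1 if gene in members else 0 for gene in genes]
        for name, members in pathways.items()
    }
    return genes, matrix


# Hypothetical toy data: two pathways over four genes.
toy = {
    "pathway_a": {"TP53", "BRCA1"},
    "pathway_b": {"BRCA1", "ATM", "CHEK2"},
}
genes, matrix = build_membership_matrix(toy)
for name in sorted(matrix):
    print(name, matrix[name])
```

Each row is one pathway and each column one gene, so gene-content comparisons reduce to comparing rows.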

Pathway similarity assessment

In the analyses performed, we used gene content overlap to estimate pathway similarity. This was based on the Jaccard coefficient, which measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sets. To examine the legitimacy of this method, we compared it with an alternative methodology, embodied in the MetaPathwayHunter pathway comparison, which incorporates topology in pairwise pathway alignment (25). For this analysis, we used a set of 151 yeast pathways available in MetaPathwayHunter, and computed Jaccard similarity coefficients (J) for all 11 325 pathway pairs. We then selected a sample of 30 pairs containing 28 unique pathways out of a total of 87 pairs with J ≥ 0.3, ensuring maximal representation for larger pathways. Each of the 28 pathways was queried in MetaPathwayHunter against the entire set of 151 with default parameters (a total of 4228 comparisons). We found that 29 out of the 30 sample pathway pairs obtained a significant MetaPathwayHunter alignment (P ≤ 0.01). As only 64 of the 4228 comparisons showed such a P-value, the probability of obtaining this result at random is 1.6 × 10^−53 (Supplementary Table S1). Thus, Jaccard scores appear to be excellent predictors of the results of the more elaborate method. A full account of interpathway pairwise similarity is available upon request.
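As a small illustration of the similarity measure, the sketch below computes the Jaccard coefficient for two gene sets and for all unordered pathway pairs. The helper names jaccard and pairwise_jaccard are hypothetical; only the definition (intersection size over union size) comes from the text.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B| (0.0 if both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(pathways):
    """Jaccard coefficient for every unordered pair of pathway names."""
    return {
        (p, q): jaccard(pathways[p], pathways[q])
        for p, q in combinations(sorted(pathways), 2)
    }


# Two toy gene sets sharing 2 of 4 distinct genes.
print(jaccard({"A", "B", "C"}, {"B", "C", "D"}))

# 151 pathways yield 151 * 150 / 2 unordered pairs, matching the count in the text.
dummy = {f"p{i}": {i} for i in range(151)}
print(len(pairwise_jaccard(dummy)))
```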

Clustering algorithm

For the main pathway clustering algorithm, we applied a method described elsewhere (26), which includes the following steps: i) The generation of cluster cores by joining all pathway pairs with Jaccard coefficient ≥T2, the upper cutoff, equivalent to hierarchical clustering. ii) Performing cluster extension by generating new best edges, i.e. joining every pathway to a pathway showing the highest score, as long as it is ≥T1, the lower cutoff, akin to nearest neighbor joining. If two or more target pathways have the same best score, all are joined. Each resultant connected component is defined to be a pathway cluster (SuperPath). Identical pathway sets were joined without considering each other as nearest neighbors (i.e. the best scoring non-identical pathway gene-set is chosen as the nearest neighbor). This clustering algorithm is order independent.
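The two steps above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the function cluster_pathways, the union-find bookkeeping and the toy gene sets are assumptions, and the default cutoffs are the values reported later in the text (T1 = 0.3, T2 = 0.7).

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pathways(pathways, t1=0.3, t2=0.7):
    """Group pathways into connected components using two Jaccard cutoffs."""
    sets = {name: set(genes) for name, genes in pathways.items()}
    names = sorted(sets)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    def union(a, b):
        parent[find(a)] = find(b)

    score = {}
    for p, q in combinations(names, 2):
        s = jaccard(sets[p], sets[q])
        score[p, q] = score[q, p] = s
        if s >= t2:  # step i: cluster cores
            union(p, q)

    for p in names:  # step ii: best edges
        # Identical gene sets are not considered nearest neighbours of each other.
        cand = {q: score[p, q] for q in names if q != p and sets[q] != sets[p]}
        if not cand:
            continue
        best = max(cand.values())
        if best >= t1:
            for q, s in cand.items():
                if s == best:  # ties: join all best-scoring targets
                    union(p, q)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy gene sets (integers stand in for gene symbols).
toy = {
    "a": set(range(1, 11)),
    "b": set(range(1, 10)) | {11},
    "c": {1, 2, 3, 4, 20, 21},
    "d": {30, 31, 32},
    "e": {1, 40, 41, 42},
}
print(cluster_pathways(toy))
```

In the toy data, a and b form a core (J = 9/11), c attaches to both through tied best edges (J = 4/12), and d and e remain singletons. Because the result is the set of connected components, it does not depend on the order in which pathways are visited.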

Determination of cutoffs

Uniqueness of a SuperPath, Us, is defined as $\log_{10}\left(\frac{\sum 1/N_p}{N_g}\right)$, where $N_p$ is the number of pathways that include a certain gene, averaging for each pathway over all genes in the SuperPath (divided by the number of genes $N_g$). Uniqueness of genes, Is, is symmetrically defined per SuperPath as $\log_{10}\left(\frac{\sum 1/N_g}{N_p}\right)$, where each $N_g$ is the number of genes included in the relevant pathway, averaging for each gene over all SuperPaths including a gene. To find the best tradeoff between the two scores, we summed the average Us and Is for each set of T1 and T2 cutoff parameters. Thus Us + Is was calculated for each set of parameters to find the two parameters for which the tradeoff between pathway and gene uniqueness would be optimal. The best cutoffs, obtained by maximizing Us + Is, were T1 = 0.3 and T2 ≥ 0.5. Further fine tuning of the upper cutoff was performed by resampling of the data, a technique employed by Levin and Domany (27). We used two dilutions (0.75 and 0.9), i.e. randomly sampling 75% and 90% of the pathways (resampling 100 times for each dilution) and performing the clustering algorithm on each sample, each time calculating the percent of the edges present in the original clustering, i.e. the percent of cases in which two pathways belonged to the same cluster as in the full dataset. In both dilutions, the upper cutoff of 0.7 was found to recover a higher percent of the edges in the original clustering algorithm (Figure 4C).

Name similarity calculation and concordance with gene similarity

Name similarity was calculated as the Jaccard coefficients of the shared words in the two pathway names, after omitting trivial words and using stemming to identify words with the same root. The cutoff between similar and non-similar names (as well as gene content in regard to comparison with name similarity) was set to J = 0.5. Name similarity was compared with gene content similarity to find the level of concordance between the two.
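A minimal sketch of such a name comparison is given below. The trivial-word list and the suffix-stripping crude_stem function are hypothetical stand-ins; the text does not specify which stop words or which stemmer were used.

```python
import re

# Hypothetical list of trivial words (compared after stemming).
TRIVIAL_STEMS = {"pathway", "of", "the", "and", "in", "by", "to", "a"}


def crude_stem(word):
    """Toy stemmer: strip a few common English suffixes (assumption, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def name_tokens(name):
    stems = {crude_stem(w) for w in re.findall(r"[a-z0-9]+", name.lower())}
    return stems - TRIVIAL_STEMS


def name_similarity(name_a, name_b):
    """Jaccard coefficient of the stemmed, non-trivial words of two pathway names."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(name_similarity("Oocyte meiosis", "Meiosis"))
print(name_similarity("DNA repair", "DNA damage response pathway"))
```

With the J = 0.5 cutoff from the text, the first pair would count as similarly named and the second would not.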

Shared publications and PPI data

Publication and protein–protein interaction (PPI) data for each gene were obtained from the GeneCards database, which combines several sources. Publication sources in GeneCards include both manually curated publications (e.g. UniProtKB/Swiss-Prot) and text mining approaches that report connections between a gene and a list of publications. A shared publication between two genes is an association of both genes with the same publication and does not indicate a direct interaction between the genes. PPI scores between pairs of genes are also based on several interaction sources in GeneCards. Unlike shared publications, PPIs reflect direct interactions between the two gene products.

Randomization and comparison

A randomized set of pseudo-SuperPaths was generated, such that the pseudo-SuperPaths are the same size and quantity as the SuperPaths, albeit with genes assigned at random (from the list of genes with any pathway annotation). Gene pairs that belong to at least one SuperPath, but do not belong together in any individual pathway (the test set), were analysed for the number of shared publications and PPI scores for each pair. For comparison, gene pairs that belong to at least one pseudo-SuperPath, but do not belong together in any individual pathway (the control set), were analysed for the same attributes. Because the two sets are of different sizes, a random sample of the larger set (the control set), equal in size to the smaller set (the test set), was compared with the smaller set. A one-sided Kolmogorov–Smirnov test was performed to compare the test and control sets.
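The construction of the test and control pair sets can be sketched as below. All function names and the toy gene sets are hypothetical; the sketch assumes SuperPaths and pathways are given as sets of gene symbols and that the annotated gene universe is the union of all SuperPath genes.

```python
import random
from itertools import combinations


def gene_pairs(gene_sets):
    """All unordered gene pairs that co-occur in at least one gene set."""
    pairs = set()
    for genes in gene_sets:
        pairs.update(combinations(sorted(genes), 2))
    return pairs


def make_pseudo_superpaths(superpaths, rng):
    """Same number and sizes as the SuperPaths, with genes drawn at random."""
    universe = sorted(set().union(*superpaths))
    return [set(rng.sample(universe, len(sp))) for sp in superpaths]


def novel_pairs(cluster_gene_sets, pathway_gene_sets):
    """Pairs joined by a cluster but not by any individual pathway."""
    return gene_pairs(cluster_gene_sets) - gene_pairs(pathway_gene_sets)


def downsample(larger, size, rng):
    """Random sample of the larger pair set, matched to the smaller set's size."""
    return set(rng.sample(sorted(larger), size))


# Hypothetical toy data.
pathways = [{"A", "B"}, {"B", "C"}, {"D", "E"}]
superpaths = [{"A", "B", "C"}, {"D", "E"}]
rng = random.Random(0)

test_set = novel_pairs(superpaths, pathways)
pseudo = make_pseudo_superpaths(superpaths, rng)
control_set = novel_pairs(pseudo, pathways)
print(test_set, [len(p) for p in pseudo], len(control_set))
```

The per-pair attributes of the two sets (shared publication counts or PPI scores) could then be compared with a two-sample Kolmogorov–Smirnov test, for example scipy.stats.ks_2samp with a one-sided alternative; that step is omitted here to keep the sketch free of third-party dependencies.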

Gene enrichment analysis comparison

Differentially expressed sets of genes were obtained from the GeneCards database (12) containing 830 different embryonic tissues based on manual curation (28). For the comparison of SuperPaths and their pathway constituents, 89 SuperPaths that contained exactly two pathways with a Jaccard similarity coefficient <0.6 were chosen, a value selected to include pairs of relatively dissimilar pathways in order to enhance comparative power. Two gene set enrichment analyses were run for all 830 gene sets: one with SuperPaths and the other with their constituent pathways. Whenever both the SuperPath and its constituent pathways received a statistical enrichment score, the difference between the negative log P-values was computed.
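The comparison of negative log P-values can be illustrated with a short sketch. The enrichment statistic is not specified in the text; a one-sided hypergeometric test is assumed here purely for illustration, and the function names are hypothetical.

```python
from math import comb, log10


def hypergeom_pvalue(universe_size, set_size, query_size, overlap):
    """P(X >= overlap) when drawing query_size genes from universe_size genes,
    of which set_size belong to the pathway (assumed test, not from the source)."""
    upper = min(set_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def neg_log_p_gain(superpath_p, pathway_p):
    """Difference between the negative log10 P-values of a SuperPath and a pathway."""
    return -log10(superpath_p) - (-log10(pathway_p))


# Toy numbers: a positive gain means the SuperPath scored as more significant.
print(hypergeom_pvalue(10, 5, 5, 5))
print(neg_log_p_gain(1e-6, 1e-4))
```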

GeneCards and PathCards

SuperPaths have been implemented in GeneCards and are now included in the standard procedure of GeneCards generation. PathCards is an online compendium of human pathways, based on the GeneCards database, presenting SuperPath-related data in each page.


Pathway sources

We analysed 12 pathway sources included in GeneCards (12) with a total of 3215 biological pathways (Table 1 and Figure 1A). The total number of genes covered by these sources is 11 478, nearly twice as large as the gene count in the largest source (Figure 1B), suggesting the power of analysing multiple sources. Asymptotic behavior is observed in the change of total gene count with increasing number of sources. When considering the incorporation of six additional sources (Supplementary Figure S1), we found that the gene count increment is ∼2% of the currently analysed total. This is an indication that the chosen 12 sources provide adequate coverage of human gene-pathway mappings. Switching between the six non-included sources and six included sources of similar size gives a very similar graph, with merely 4% increment in gene count (Supplementary Figure S1).

Analysing the gene repertoires of the four largest sources (Figure 2A), we found that among the 10 770 genes contained within these sources, only 1413 genes were jointly covered by all four sources, and that more than 4000 were unique to one of the four sources. This highlights the notion that source unification is essential to obtain maximal gene coverage. In its simplest embodiment, source unification would entail presenting a unified list of the 3215 pathways included in all 12 sources. This however would ignore the extensive gene-content connectivity embodied in the network representation of this pathway collection (Figure 3A). Further, the original pathway collection has considerable inconsistencies of relations between pathway name and pathway gene content, as exemplified in Figure 2B and C. The summary in Table 2A suggests that only ∼9.4% of all pathway pairs with a similar name have similar gene content, and likewise, only 9.8% of all pathway pairs with similar gene content are named similarly (Supplementary Figure S2).

Figure 2.


Discrepancies between pathway sources. (A) Incomplete gene overlap among sources. Venn diagram (created using VENNY) showing the number of shared genes among the four largest pathway sources. For a total of 10 770 genes, only 1413 (13%) are shared by all four sources and 609–1791 genes are unique to each of these sources. (B) Inconsistency of names versus content in meiosis-related pathways. A Venn diagram created using BioVenn (29) exemplifies two pathways, ‘Meiosis’ from Reactome and ‘Oocyte meiosis’ from KEGG, with very small gene sharing (7 genes out of 172, J = 0.04). (C) Redundancy in meiosis-related pathways. This is exemplified by the large number of genes (88 of 119, J = 0.74) shared by the ‘Meiosis’ and ‘Meiotic recombination’ pathways, both from Reactome, and by the large number of genes (52 of 146, J = 0.36) shared by ‘Oocyte meiosis’ and ‘Progesterone-mediated oocyte maturation’, both from KEGG. (D) Pathway size distribution across sources. The pathway size, in gene count, is distributed differently across the different sources.

Figure 3.


Network representations of the 3215 analyzed pathways. Nodes represent pathways and edges represent Jaccard similarity coefficients (J) using different methods. Network visualizations were performed using Gephi (30). Colors correspond to pathway sources. (A) No clustering. All edges with J ≥ 0.05 are shown. All but 20 pathways form one large connected component with an average degree of 134. (B) SuperPaths. Each is a connected component obtained by the main clustering algorithm, with thresholds T1 (best edges) of J ≥ 0.3 and T2 of J ≥ 0.7. There are 544 singletons and 529 multi-pathway clusters; the size of the largest cluster is 70. (C) Pure hierarchical clustering, with a threshold T2 of J ≥ 0.3. There are 544 singletons and 288 multimembered clusters; the size of the largest cluster is 1046 pathways.

Figure 4.


Selection of the T1 and T2 thresholds. (A) Distribution of Jaccard coefficients across all pathway pairs. T1 and T2 respectively represent the lower and upper cutoffs used in the algorithm employed. (B) Us + Is scores across combinations of T1 and T2. The diagonal (T1 = T2) represents pure hierarchical clustering with different thresholds. The best scores are attained when T1 = 0.3 and T2 ≥ 0.5. (C) Determination of T2. T2 (upper cutoff) was determined by resampling of the pathway data at two dilution levels (27), 0.75 and 0.9. In both cases J = 0.7 was found to be the optimum in which a higher fraction of the original clustering is recovered.



Table 2.

Gene content versus name similarity of pathways and SuperPaths


Pathway clustering

We performed global pathway analysis aimed at assigning maximally informative pathway-related annotation to every human gene. For this, we converted the pathway compendium into a set of connected components (SuperPaths), each being a limited-size cluster of pathways. We aimed at controlling the size of the resulting SuperPaths, so as to maintain a high measure of annotation specificity and minimize redundancy.

The following two steps were used in the clustering procedure, in which pathways were connected to each other to form SuperPaths. i) Preprocessing of very small pathways: pathways smaller than 20 genes were connected to larger pathways (<200 genes) with a content similarity metric of ≥0.9 relative to the smaller partner. ii) The main pathway clustering algorithm: this was performed using the Jaccard similarity coefficient (J) metric (31) (see Materials and Methods). We used a combination (cf. 26) of modified nearest neighbor graph generation with a threshold T1 and hierarchical clustering with a threshold T2 (Figure 4A and Materials and Methods).
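The preprocessing step (i) can be sketched as follows. The function names and toy gene sets are hypothetical, and the sketch assumes that "larger" means larger than the small pathway while still below 200 genes, with similarity measured as the fraction of the smaller pathway's genes found in the larger one.

```python
def containment(small, large):
    """Fraction of the smaller pathway's genes that also occur in the larger one."""
    small, large = set(small), set(large)
    return len(small & large) / len(small) if small else 0.0


def preprocess_small_pathways(pathways, small_max=20, large_max=200, cutoff=0.9):
    """Return (small, large) links for small pathways almost contained in larger ones."""
    links = []
    for s_name, s_genes in pathways.items():
        if len(s_genes) >= small_max:
            continue  # only pathways smaller than 20 genes are candidates
        for l_name, l_genes in pathways.items():
            if l_name == s_name:
                continue
            if (len(s_genes) < len(l_genes) < large_max
                    and containment(s_genes, l_genes) >= cutoff):
                links.append((s_name, l_name))
    return sorted(links)


# Hypothetical toy data (integers stand in for gene symbols).
toy = {
    "small": set(range(10)),       # 10 genes, fully inside "big"
    "big": set(range(50)),         # 50 genes
    "huge": set(range(300)),       # too large to be a target
    "other": set(range(100, 130)),
}
print(preprocess_small_pathways(toy))
```

Here "small" and "big" have a Jaccard coefficient of only 10/50 = 0.2, below T1, so without this step the fully contained small pathway would not be joined to its parent.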

To determine the optimal values of the thresholds T1 and T2, we defined two quantitative attributes of the clustering process. The first is US, the overall uniqueness of the set of SuperPaths. US elevation is the result of increasing pathway clustering, and reflects the gradual disappearance of redundancy, i.e. of cases in which certain gene sets are portrayed in multiple SuperPaths. The second parameter is IS, the overall informativeness of the set of SuperPaths. IS is a measure of how revealing a collection of SuperPaths is for annotating individual genes. It decreases with the extent of pathway clustering, reaching an undesirable minimum of one exceedingly large cluster, whereby identical SuperPath annotation is obtained for all genes. We thus sought an optimal degree of clustering whereby US + IS is maximized (Figure 4B and Materials and Methods).

Our procedure pointed to an optimum at T1 = 0.3 and T2 ≥ 0.5. Further fine tuning by data resampling suggested an optimal value of T2 = 0.7 (Figure 4C and Materials and Methods). This procedure resulted in the definition of 1073 SuperPaths, including 529 SuperPaths ranging in size from 2 to 70 pathways, and 544 singletons (one pathway per SuperPath) (Figures 3B and 5A). Each SuperPath had 3 ± 4.3 pathways (Figure 5A) and 82.7 ± 140.6 genes (Supplementary Figure S3A). The resultant set of SuperPaths indeed enhances the uniqueness US as depicted in Figure 5B.

Figure 5.


SuperPaths increase uniqueness while keeping high informativeness. (A) Number of pathways in hierarchical clustering versus SuperPath algorithm. The largest cluster with hierarchical clustering includes 1046 pathways, about 33% of the entire input, causing a great reduction of informativeness. In the SuperPath clustering the maximum cluster size is 70, about 2% of all pathways. (B) Increase in uniqueness (Us) following unification of pathways into SuperPaths.


The unification process resulted in relatively small changes in gene count distribution between the original pathways and the resultant SuperPaths (Supplementary Figure S3), suggesting a substantial preservation of gene groupings. Notably, applying pure hierarchical clustering (T1 = T2 = 0.3) resulted in a single very large cluster with 1046 pathways (Figure 3C) and with the same number of singletons, strongly deviating from the goal of specific pathway annotation for genes (Supplementary Figure S3B). This sub-optimal performance of pure hierarchical clustering is general; each of the examined cases of T1 = T2 (Figure 4B diagonal) shows a Us + Is value lower than that for T1 = 0.3, T2 = 0.7.

Each SuperPath is identified by a textual name derived from one of its constituent pathways, selected as the most connected pathway (hub) in the SuperPath cluster. For simplicity, the option of de novo naming was not exercised. The hub’s name was chosen, as opposed to that of the largest pathway, since this tends to enhance the descriptive value for the entire SuperPath. When more than one pathway has the same maximal number of connections, the larger one is chosen.
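The naming rule can be sketched as a degree count with a size tie-break. The function superpath_name and its inputs (a member list, the cluster's edge list and a size lookup) are hypothetical; the pathway names in the example are toy values.

```python
def superpath_name(members, edges, sizes):
    """Pick the hub: highest number of connections, ties broken by gene count."""
    degree = {m: 0 for m in members}
    for a, b in edges:
        if a in degree and b in degree:
            degree[a] += 1
            degree[b] += 1
    # sorted() only makes the result deterministic if degree and size both tie.
    return max(sorted(members), key=lambda m: (degree[m], sizes[m]))


# Toy cluster: "Apoptosis" is connected to both other members.
members = ["Apoptosis", "Caspase cascade", "FAS signaling"]
edges = [("Apoptosis", "Caspase cascade"), ("Apoptosis", "FAS signaling")]
sizes = {"Apoptosis": 80, "Caspase cascade": 120, "FAS signaling": 40}
print(superpath_name(members, edges, sizes))
```

Note that the hub is chosen even though "Caspase cascade" is the largest pathway in the toy cluster, which mirrors the stated preference for connectivity over size.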

SuperPaths make important gene connections

One of the major implications of the process of SuperPath generation is elucidating new connections among genes. This happens because genes that were not connected via any pre-unification pathway become connected through belonging to the same SuperPath. The unification into SuperPaths is important in two ways: first, it brings, under one roof, pathway information from 12 sources, each individually contributing ∼9000 to ∼5 million instances of gene pairing, for a total of 7.3 million pairs (Supplementary Figure S4). Second, by unifying into SuperPaths, the number of gene pairs is further enhanced, reaching 8.3 million (Supplementary Figure S4).

To test the significance of the million new gene–gene connections resulting from SuperPath generation, we checked their correlation with two independent measures of gene pairing. First, a comparison was made to publications shared among gene pairs (Figure 6A). We found that for gene pairs appearing in a SuperPath but not in any of its constituent pathways, there is a 4- to 75-fold increase in instances of >20 shared publications when compared with random pairs of genes with pathway annotation. Added gene pairs have significantly more shared publications than those randomly paired. Second, we performed a similar analysis based on protein–protein interaction information. We found that for the SuperPath-implicated gene pairs there was a 4- to 25-fold increase of PPIs with score >0.2 (Figure 6B) when compared with controls. SuperPaths thus provide significant gene partnering information not conveyed by any of their 3215 constituent individual pathways. This may be seen when performing gene set enrichment analysis on 830 differential expression sets and comparing the scores of SuperPaths to that of their constituent pathways, demonstrating that SuperPaths tend to receive more significant scores compared with their constituent pathways average score (Figure 7A).

Figure 6.


SuperPath-specific gene pairs are informative. (A) Shared publications. SuperPath-specific gene pairs are genes connected only by SuperPaths and not by any of the contained pathways. Enrichment of 10–100 is seen in the high abscissa values. The two distributions are significantly different (Kolmogorov–Smirnov P < 10^−100). No random gene pairs had 80–90 publications; this point was treated as having one such publication for computing the ratio. (B) Protein–protein interactions. Experimental interaction score from STRING (32) as depicted in GeneCards (12), for SuperPath versus random gene pairs as in panel A. The two distributions are significantly different (Kolmogorov–Smirnov P < 2.8 × 10^−61).

Figure 7.


SuperPath integration attributes. (A) SuperPaths outperform their constituent pathways in significance scores across 830 differentially expressed gene sets. (B) Number of included sources in non-singleton SuperPaths.


SuperPaths in databases

SuperPath information is available both in the GeneCards pathway section (Supplementary Figure S5A) and in PathCards (Supplementary Figure S5B), a GeneCards companion database presenting a web card for each SuperPath. PathCards allows the user a view of the pathway network connectivity within a SuperPath, as well as the gene lists of the SuperPath and of each of its constituent pathways. Links to the original pathways are available from the pathway database symbols, placed to the left of pathway names. PathCards has extensive search capacity, including finding any SuperPath that contains a search term within its included pathway names, gene symbols and gene descriptions. Multiple search terms are supported, allowing fine-tuned results. The search results can be expanded to show exactly where in the SuperPath-related text the terms were found. The list of genes in a PathCard uses graded coloring to designate the fraction of included pathways containing each gene, providing an assessment of the importance of a gene in a SuperPath. Other features, including gene list sorting and a search tutorial, are under construction. PathCards is updated regularly, together with GeneCards updates. A new version is released 2–3 times a year.
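The quantity behind the graded coloring, i.e. the fraction of constituent pathways that contain a gene, can be computed as in the sketch below. The function gene_fractions and the toy gene sets are hypothetical and do not reflect how PathCards is implemented.

```python
from collections import Counter


def gene_fractions(constituent_pathways):
    """Map each gene to the fraction of constituent pathways that contain it."""
    counts = Counter(g for genes in constituent_pathways for g in set(genes))
    n = len(constituent_pathways)
    return {gene: count / n for gene, count in counts.items()}


# Hypothetical SuperPath with three constituent pathways.
toy_superpath = [
    {"CS", "ACO2", "IDH1"},
    {"CS", "ACO2"},
    {"CS", "PDHA1"},
]
fractions = gene_fractions(toy_superpath)
for gene in sorted(fractions, key=fractions.get, reverse=True):
    print(gene, round(fractions[gene], 2))
```

Genes present in every constituent pathway score 1.0, while genes contributed by a single pathway version score lowest, which is the basis for ranking genes within a SuperPath.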


Pathway source heterogeneity

This study highlights substantial mutual discrepancies among different pathway sources, e.g. with regard to pathway sizes, names and gene contents. The world of human biological pathways consists of many idiosyncratic definitions provided by mostly independent sources that curate publication data and interpret it into sets of genes and their connections. The idiosyncratic view of the different pathway sources is exemplified by the variation in pathway size distribution among sources (Table 1, Figure 2D), where some sources have overrepresentation of large pathways (QIAGEN), while others have mainly small pathways (HumanCyc). In some cases, the large standard deviation in pathway size (Table 1) is easily explained, as exemplified in the case of Reactome, which provides hierarchies of pathways and therefore contains a spectrum of pathway sizes. However, large standard deviations of pathway size are also observed in KEGG and QIAGEN—sources that are not hierarchical by definition. On the other hand, some sources (e.g. HumanCyc, PID and PharmGKB) have very little variation in their pathway sizes, revealing their focus on pathways of particular size. The idiosyncratic view provided by different sources is also evident when examining the genes covered by each source (Figure 2A), where some genes in the gene space are covered by only one source. This causes the unfavorable outcome that when unifying pathways, irrespective of the algorithm chosen, there is a relatively high proportion of single source pathway clusters. In order to account for the drawback of the Jaccard index to cope with large size differences between pathways, we added a preprocessing step to unify pathways that are almost completely included within other pathways (≥0.9 gene content similarity of the smaller pathway), thereby diminishing the barrier of variable pathway size between sources. 
Previously published isolated instances of intersource discrepancies include the lack of pathway source consensus for the TCA cycle (3) and fatty acid metabolism (2). The authors of both papers stress that each of their pathway sources has only a partial view of the pathway. For the TCA cycle example (3) there is an attempt to provide an optimal TCA cycle pathway by identifying genes that appear in multiple sources, but such manual curation is not feasible for a collection of >3000 biological pathways. In our procedure, 11 relevant pathways from four sources are unified into a SuperPath entitled ‘Citric acid cycle (TCA cycle)’ (Supplementary Figure S5). PathCards enables one to then view which genes are more highly represented within the constituent pathways. Our algorithm thus mimics human intervention, and greatly simplifies the task of finding concurrence within and among pathway sources.

Pathway unification

Combining several pathway resources has been attempted before, using different approaches. The first method is to simply aggregate all of the pathways in several knowledge bases into one database, without further processing. This approach is taken, for example, by NCBI’s Biosystems with 2496 human pathways from five sources (5) and by PathwayCommons with 1668 pathways from four sources (6). This was also the approach taken by GeneCards prior to the SuperPaths effort described here, where pathways from six sources were shown separately in every GeneCard. While this approach provides centralized portals with easy access to several pathway sets, it does not reveal interpathway relationships and may result in considerable redundancy. The second unification approach, taken by PathJam (7) and HPD (8), provides proteins-versus-pathways tables as search output. This scheme allows useful comparisons as related to specific search terms, but is not leveraged into global analyses of interpathway relations. A third line of action is exemplified by ConsensusPathDB (9), which integrates information from 38 sources, including 26 protein–protein interaction compendia as well as 12 knowledge bases with 4873 pathways. This allows users to observe which interactions are supported by each of the information sources. In turn, hiPathDB (10) integrates protein interactions from four pathway sources (1661 pathways) and creates ad hoc unified superpathways for a query gene, without globally generating consolidated pathway sets. Finally, a fourth methodology is employed by Pathway Distiller (11), which mines 2462 pathways from six pathway databases, and subsequently unifies them into clusters of several predecided sizes between 5 and 500, using hierarchical clustering.
The third method, interaction mapping as taken by ConsensusPathDB and hiPathDB, differs conceptually from the fourth method of clustering: interaction mapping provides information on the specific commonalities and discrepancies in protein interactions among sources with regard to specific keywords or genes, while clustering suggests which of the pathways are similar enough to be considered for the same cluster. Therefore, the third and fourth methods are complementary approaches aimed at utilizing pathway information at different levels of observation, where the resulting consolidation in the fourth (clustering) method is independent of user input or search. In the study described herein, we pursued a clustering method similar to the fourth methodology taken by Pathway Distiller, namely consolidation of pathways into clusters. However, in contrast to Pathway Distiller, our aim was to create a single coherent unification of biological pathways, which is essential for having a universal set of descriptors when looking at gene–gene relations. The resulting SuperPaths simplify the pathway-related descriptive space of a gene and reduce it 3-fold. Furthermore, the cutoffs in our algorithm are chosen to optimally adjust the criteria of uniqueness and informativeness, thereby reducing the subjective effect of choosing cutoffs arbitrarily or by predetermining the number of clusters.

SuperPath generation

A crucial element in our SuperPaths generation method is the definition of interpathway relationships. We have opted for the use of gene content, as described by others (11, 33). One could also consider the use of pathway name similarity (11). However, among the 3215 pathways analysed here, only 79 names were shared by more than one pathway, implying that the efficacy of such an approach would have been rather limited. Further, Table 2 and Supplementary Figure S2 indicate a relatively weak concordance between pathway names and their gene content. Specifically, among 79 name-identical pathway groups, 52 remained incompletely unified, again suggesting a limited usefulness for unifying based on pathway names. Many resources, including ConsensusPathDB (9), facilitate the option of finding pathways based on keywords in the name. Name sharing is thus a relatively trivial task to overcome when trying to find similar pathways. The more challenging goal is finding pathways that are similar in the biological process that they convey.

In this article we treated pathways as sets of genes, using gene content as a comparative measure and omitting topology and small molecule information. This approach was previously advocated as a means of greatly reducing the complexity of pathway comparisons (34). Further, most sources used in this study provide only the gene set information, hence topology information was unavailable. Finally, the high concordance between significance of pathway alignment and Jaccard coefficients ≥0.3 (P < 10^−52) indicates that the Jaccard coefficient is a good approximation of the more elaborate pathway alignment procedure (25).

SuperPath utility

A central aim of pathway source unification is enhancing the inference of gene-to-gene relations needed for pathway enrichment scrutiny (32, 35–40). To this end, we developed an algorithm for pathway clustering so as to optimize this inference and at the same time minimize redundancy.

Extending pathways into SuperPaths affords three major advantages. The first is augmenting the gene grouping used for such inference. Indeed, SuperPaths have slightly larger sizes than the original pathways, as evident from the SuperPath size distribution (Figure 2D). Nevertheless, comparing SuperPaths to pseudo-SuperPaths of the same size and quantity clearly shows that the increase in size alone does not account for the addition of true positive gene connections, as evident from the higher PPIs and larger counts of shared publications for SuperPath gene pairs (Figure 6). Consequently, it is not surprising that SuperPaths outperform the average enrichment analysis scores of their constituent pathways (Figure 7A). SuperPaths are currently used in two GeneCards-related novel tools, VarElect and GeneAnalytics (http://geneana...). A second advantage of SuperPaths is the reduction of redundancy, since they provide a smaller, unified pathway set, and thus diminish the necessary statistical correction for multiple testing. We note that ConsensusPathDB (9) also provides an intersource integrated view of interactions. However, gene set analysis in ConsensusPathDB is only allowed for pathways as defined by the original sources. Finally, a third advantage of SuperPaths is their ability to rank genes within a biological mechanism via the multiplicity of constituent pathways within which a gene appears. This can be used not only to gain better functional insight but also to help eliminate suspected false-positive genes appearing in a minority of the pathway versions. A capacity to view such gene ranking is available within the PathCards database.

Limitations of SuperPaths

The SuperPaths generation procedure appears incomplete, as about half of all SuperPaths are ‘singleton SuperPaths’ (labelled accordingly in PathCards), having only one constituent pathway. This is an outcome of the specific cutoff parameters used. However, this provides a useful indication to the user that a singleton pathway is distinct, differing greatly in its constituent genes from any other pathway.

This SuperPath generation process is intended to reduce redundancies and inconsistencies found when analysing the unified pathways. Although SuperPaths increase uniqueness as compared with the original pathway set (Figure 5B), some redundancy and inconsistency still remain within SuperPaths. There are cases of pathways with similar names that do not get unified into the same SuperPath. This happens because they have not met the unification criteria employed. We also note that similarity in name does not always indicate similarity in gene content (Figure 2B and C, Supplementary Figure S2B), and such events are faithfully conveyed to the user.

A clarifying example is that of the 40 pathways whose names include the string ‘apoptosis’. The final post-unification list has 10 SuperPaths whose names include ‘apoptosis’. This provides the user with a greatly simplified view of the apoptosis world. Yet, at the same time, the outcome is replete with instances of two name-similar pathways being included in different SuperPaths. Employing a more stringent algorithm would result in over-clustering, which would in turn reduce informativeness (see Figure 3C).

In parallel, there are pathways with overlapping functions that are not consolidated into one SuperPath. For example, the pathway ‘integrated breast cancer pathway’ does not unify with the pathways ‘DNA repair’ and ‘DNA damage response pathway’, despite the strong functional relation of breast cancer with DNA damage and repair (41). This is because the relevant gene content similarity in the original pathway sources is small, J = 0.03 and 0.13, respectively. The need to view information on pathways with low pairwise similarity is addressed in Supplementary Figure S6, and this information is available as a text file upon request.
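To make the gene content similarity concrete, here is a minimal sketch that assumes J denotes the Jaccard coefficient (size of the intersection divided by size of the union of two gene sets). The two gene sets are hypothetical toy data and are not the actual pathway contents.

```python
def jaccard(genes_a, genes_b):
    """Jaccard coefficient between two gene collections."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical toy gene sets (not the real pathway memberships)
pathway_x = {"BRCA1", "BRCA2", "TP53", "ATM", "ESR1", "ERBB2", "PIK3CA", "CHEK2"}
pathway_y = {"ATM", "XRCC1", "PARP1", "RAD51", "MSH2", "MLH1", "TP53"}

similarity = jaccard(pathway_x, pathway_y)
print(f"J = {similarity:.2f}")
```

Two pathways can share functionally important genes and still score low, because the coefficient is dominated by the many genes that appear in only one of the two sets.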

Finally, when looking at the number of contributing sources per SuperPath (Figure 7B), it is evident that the majority of SuperPaths are composed of pathways from either one or two sources, and no SuperPath includes more than five. Although this integration limitation is evident, it mainly arises from the inherent biases in gene coverage for the different information sources (Figure 2A).


Biological pathway information has traditionally been a central facet of GeneCards, the database of human genes (12, 42, 43). In previous versions, pathways were presented separately for each of the pathway sources, and it was difficult for users to relate the separate lists to each other. As a result of the consolidation into SuperPaths described herein, this problem has been effectively addressed. Thus, in every GeneCard, a table portrays all of a gene’s SuperPaths, each with its constituent pathways, with links to the original sources (Supplementary Figure S5A).

GeneCards is gene-centric and inherently does not present (Super) pathway-centric annotations. We therefore developed PathCards, a database that encompasses and displays such information in greater detail. PathCards has a page for every SuperPath, showing the connectivity of its included pathways, as well as gene lists for the SuperPath and its pathways. For every SuperPath, we also show a STRING gene interaction network (32) for the entire gamut of constituent genes, providing perspective on topological relationships within the SuperPath.

Supplementary Data

Supplementary data are available at Database Online.


This research is funded by grants from LifeMap Sciences Inc. California (USA) and the SysKid—EU FP7 project (number 241544). Support is also provided by the Crown Human Genome Center at the Weizmann Institute of Science. Funding for open access charge: LifeMap Sciences Inc. California (USA).

Conflict of interest. None declared.


We thank Prof. Eitan Domany and Prof. Ron Pinter for helpful discussions, as well as Dr. Noa Rappaport and Dr. Omer Markovich for assistance with clustering and visualization.


  • Citation details: Belinky, F., Nativ, N., Stelzer, G., et al. PathCards: multi-source consolidation of human biological pathways. Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006

This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.


PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool)

Recently, I found this good research paper called “PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool)”. It may be useful for scientists who are interested in this topic.

Sci Rep. 2015 May 19;5:10021. doi: 10.1038/srep10021.

PALM-IST: Pathway Assembly from Literature Mining–an Information Search Tool.


Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, a newer and more efficient mining tool is required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword-based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological process, cellular component, etc.) information from the extracted text and subsequently mines various databases to provide their comprehensive inter-relations (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction network and pathway information data relevant to the text search using multiple data mining tools and assembles them to create a meta-interaction network. It also analyzes scientific collaboration by extracting and creating a “co-authorship network” for a given search context. Hence, this useful combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interactions (PPIs), to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST is available at


PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool) is a computational platform for exploring the biomedical literature resource PubMed using multiple keywords and for extracting information centered on gene/protein names, drugs and diseases, along with their relations/interactions, from text and databases. PALM-IST provides users a platform where data and literature mining are performed simultaneously. Combined structured data (from data mining) and unstructured data (from text mining) can be used to extract novel associations/interactions between biological entities such as proteins, diseases, or drugs, to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. Further, PALM-IST also enables users to assemble human pathways and a protein-protein interaction network (PPIN) using information extracted from text and databases.


1. Real time search in PubMed.
2. Identification and highlighting of genes, drugs and diseases extracted from searched abstracts.
3. Interactive co-occurrence based network of gene-disease, gene-drug, drug-disease from literature.
4. Functional annotation by mapping expression information onto human pathway proteins and their interactors.
5. Platform to merge protein-protein interactions of multiple human genes/proteins.
6. Platform to find cross-talk genes/proteins from merged pathway results.
7. Interactive display of pathways overlaid with protein-protein interaction information.
8. Interactive display of collaborative network between biomedical experts.
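
The co-occurrence based network in item 3 can be illustrated with a small sketch. This is not PALM-IST's implementation; it assumes that entity recognition has already been done and that each abstract is given as a dictionary of gene and disease names. The function name and the toy abstracts are hypothetical.

```python
from collections import Counter
from itertools import product


def cooccurrence_edges(abstract_entities):
    """Count gene-disease pairs that are mentioned in the same abstract."""
    edges = Counter()
    for entities in abstract_entities:
        genes = set(entities.get("genes", []))
        diseases = set(entities.get("diseases", []))
        # Every gene-disease pair in one abstract adds one unit of edge weight
        for gene, disease in product(genes, diseases):
            edges[(gene, disease)] += 1
    return edges


# Hypothetical, already-annotated abstracts (toy data)
abstracts = [
    {"genes": ["BRCA1", "TP53"], "diseases": ["breast cancer"]},
    {"genes": ["BRCA1"], "diseases": ["breast cancer", "ovarian cancer"]},
    {"genes": ["TP53"], "diseases": []},
]

edges = cooccurrence_edges(abstracts)
for (gene, disease), weight in edges.most_common():
    print(gene, disease, weight)
```

The edge weight is simply the number of abstracts in which both entities appear, so it reflects co-mention frequency rather than a confirmed biological relation.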

David Bartel (Whitehead Institute/MIT/HHMI) Part 1: MicroRNAs: Introduction to MicroRNAs


Lecture Overview:
MicroRNAs are ~22 nucleotide RNAs processed from RNA hairpin structures. MicroRNAs are much too short to code for protein and instead play important roles in regulating gene expression. In humans, they regulate most protein-coding genes, including genes important in cancer and other diseases. In Part 1 of his talk, Bartel explains how microRNAs are made, how they have evolved, how they recognize and bind to target mRNA sequences, how this binding leads to the repression of the target mRNAs, and how this repression can be important for normal development and disease.
In Part 2, Bartel recounts experiments measuring the effect of microRNAs on mRNA levels, protein levels and protein synthesis in mammalian cells. The results showed that almost all of the changes in protein levels and synthesis are due to changes in the amount of mRNA. Interestingly, experiments in zebrafish embryos revealed a somewhat different situation. In the early embryo, initial decreases in protein synthesis are due to shortening of the mRNA poly(A) tail, which is followed later by a decrease in the amount of mRNA.
In the last part of his seminar, Bartel asks how a cell knows which hairpin RNA molecules are pri-microRNAs, and should be processed into microRNAs, and which should be ignored. He leads us through the experiments that identified some of the key conserved features of human pri-microRNAs.

Speaker Bio:
David Bartel studies the many roles of RNA. His lab initially studied the ability of RNA to catalyze reactions and more recently has focused on microRNAs and other regulatory RNAs. Since 2000, his lab has made fundamental discoveries regarding the genomics, biogenesis and regulatory targets of these RNAs, as well as the molecular and biological consequences of their actions in animals, plants and fungi.
Bartel received his BA in Biology from Goshen College. Soon after completion of his PhD at Harvard University in 1993, he joined the Whitehead Institute as a Fellow. Currently, Bartel is Professor of Biology at the Massachusetts Institute of Technology, a Member of the Whitehead Institute and an Investigator of the Howard Hughes Medical Institute. Bartel’s many contributions to our understanding of the roles of RNA have been recognized with numerous awards, including the NAS Molecular Biology Award and election to the National Academy of Sciences.

EnrichNet – Network-based gene set enrichment analysis


EnrichNet is a web-application and web-service to identify and visualize functional associations between a user-defined list of genes/proteins and known cellular pathways. As a complement to classical overlap-based enrichment analysis methods, the EnrichNet approach integrates a novel graph-based statistic with a new interactive visualization of network sub-structures to enable a direct molecular interpretation of how a set of genes/proteins is related to a specific cellular pathway. Available at:
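
For context, the classical overlap-based enrichment analysis that EnrichNet is meant to complement is often computed as a one-sided hypergeometric test. The sketch below shows that classical baseline only, not EnrichNet's graph-based statistic; the function name and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, pathway_size, list_size, overlap):
    """P(X >= overlap) for a gene list drawn at random from the universe.

    universe_size: number of genes in the background
    pathway_size:  number of background genes annotated to the pathway
    list_size:     number of genes in the user-defined list
    overlap:       number of list genes that are in the pathway
    """
    max_overlap = min(pathway_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a 100-gene pathway,
# a 50-gene user list, 5 of which fall in the pathway
p_value = hypergeom_enrichment_p(20000, 100, 50, 5)
print(f"p = {p_value:.3g}")
```

A test of this kind only looks at the size of the overlap, which is why network-based approaches try to add information about how the remaining genes are connected to the pathway.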