
Pathway (PPI) resources collected 2016-JUNE

Release 31 (01. Sept. 2015)

ConsensusPathDB-human integrates interaction networks in Homo sapiens, including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways. Data currently originate from 32 public resources for interactions (listed below) and from interactions that we have curated from the literature. The interaction data are integrated in a complementary manner (avoiding redundancies), resulting in a seamless interaction network containing different types of interactions.
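
To make the idea of complementary, redundancy-avoiding integration concrete, here is a minimal Python sketch. It is not the ConsensusPathDB implementation: the source names, the record layout and the rule that two records are the same interaction when they share a type and a participant set are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_interactions(records):
    """Collapse (source, interaction_type, participants) records into unique interactions.

    Assumption for this sketch: two records describe the same interaction
    when the interaction type and the unordered set of participants match.
    """
    merged = defaultdict(set)
    for source, interaction_type, participants in records:
        key = (interaction_type, frozenset(participants))
        merged[key].add(source)  # remember every source that reported it
    return dict(merged)


# Hypothetical toy records from two made-up sources.
records = [
    ("SourceA", "protein interaction", ["TP53", "MDM2"]),
    ("SourceB", "protein interaction", ["MDM2", "TP53"]),
    ("SourceB", "gene regulation", ["TP53", "CDKN1A"]),
]

unique = merge_interactions(records)
for (interaction_type, participants), sources in unique.items():
    print(interaction_type, sorted(participants), sorted(sources))
```

Keeping the set of contributing sources per interaction preserves provenance while each interaction is counted once. A real integration would also need identifier mapping and, for directed interactions such as gene regulation, an ordered key.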

Current statistics:
unique physical entities: 158,523
unique interactions: 458,570
   gene regulations: 17,098
   protein interactions: 261,085
   genetic interactions: 443
   biochemical reactions: 21,070
   drug-target interactions: 158,874
pathways: 4,593

Licensing information:
The use of ConsensusPathDB is free for academic users. Commercial users should contact Dr. Atanas Kamburov (kamburov [at] molgen.mpg.de) or Dr. Ralf Herwig (herwig [at] molgen.mpg.de). Interaction data from ConsensusPathDB are available under the license terms of each of the contributing databases listed above.
Disclaimer:
Although best efforts are always applied, the developers of ConsensusPathDB do not assume any legal responsibility for correctness or usefulness of the information in ConsensusPathDB.
Acknowledgements:
ConsensusPathDB is being developed by the Bioinformatics group of the Vertebrate Genomics Department at the Max-Planck-Institute for Molecular Genetics in Berlin, Germany. The project was supported by the EMBRACE and CARCINOGENOMICS projects that are funded by the European Commission within its 6th Framework Programme under the thematic area “Life Sciences, Genomics and Biotechnology for Health” (LSHG-CT-2004-512092 and LSHB-CT-2006-037712); 7th Framework Programme project APO-SYS (HEALTH-F4-2007-200767); German Federal Ministry of Education and Research within the NGFN-2 program (SMP-Protein, FKZ01GR0472); Max Planck Society within its International Research School program (IMPRS-CBSC).

Pathway resources

Name | URL | Formats
KEGG (1) | http://www.genome.jp/kegg/ | BioPAX, png, KGML
Reactome (2) | http://www.reactome.org/ | BioPAX, png, pdf
Pathway Commons (7) | http://www.pathwaycommons.org/ | BioPAX, Sif, png
PANTHER pathway (3) | http://www.pantherdb.org/pathway/ | BioPAX, SBML
WikiPathways (5) | http://www.wikipathways.org/ | BioPAX, svg, png, pdf, gpml
Nature/NCI Pathway Interaction Database (63) | http://pid.nci.nih.gov/ | BioPAX, jpg, svg
BioCyc (4) | http://biocyc.org/ | BioPAX, png, SBML
INOH (84) | http://inoh.hgc.jp/ | BioPAX, INOH (xml)
NetPath (85) | http://www.netpath.org/ | BioPAX, SBML, PSI-MI
PharmGKB (86) | http://www.pharmgkb.org/ | BioPAX, pdf, gpml

Abbreviations: BioPAX, Biological Pathway Exchange; KGML, KEGG Markup Language; PSI-MI, Proteomics Standards Initiative Molecular Interaction; SBML, Systems Biology Markup Language; NCI, National Cancer Institute; INOH, Integrating Network Objects with Hierarchies; PharmGKB, Pharmacogenomics Knowledge Base; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4461095/?report=reader
Tools for visualization and analysis of molecular networks, pathways, and -omics data


Pathway mining and comparison

Pathway gene sets were generated based on the GeneCards platform (12), implementing the gene symbolization process that allows for comparison of pathway gene sets from 12 different manually curated sources: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). A binary matrix was generated for all 3125 pathways, where each column represents a gene, indicated by 1 for presence in the pathway and 0 for absence. Additionally, six sources were analysed for their cumulative tallying of gene content: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).
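
The binary pathway-by-gene matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pathway names and gene sets are made-up toy data, and the function name build_binary_matrix is hypothetical rather than anything from the PathCards code.

```python
def build_binary_matrix(pathway_genes):
    """Return (genes, matrix): one row per pathway, one column per gene.

    A cell is 1 if the gene is present in the pathway and 0 otherwise.
    """
    # Columns: every gene seen in any pathway, in a stable sorted order.
    genes = sorted(set().union(*pathway_genes.values()))
    matrix = {
        name: [1 if gene in members else 0 for gene in genes]
        for name, members in pathway_genes.items()
    }
    return genes, matrix


# Hypothetical toy gene sets (not taken from any of the cited sources).
pathway_genes = {
    "Pathway A": {"TP53", "MDM2", "CDKN1A"},
    "Pathway B": {"TP53", "ATM"},
}

genes, matrix = build_binary_matrix(pathway_genes)
print(genes)
for name, row in matrix.items():
    print(name, row)
```

Because all pathways share the same gene columns, rows from different sources become directly comparable, which is the prerequisite for any gene-content similarity measure.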

 

Source: http://database.oxfordjournals.org/content/2015/bav006.full
PathCards: multi-source consolidation of human biological pathways

 

 


 

Welcome to the Biological General Repository for Interaction Datasets

BioGRID is an interaction repository with data compiled through comprehensive curation efforts. Our current index is version 3.4.137 and searches 56,733 publications for 1,067,443 protein and genetic interactions, 27,501 chemical associations and 38,559 post translational modifications from major model organism species. All data are freely provided via our search index and available for download in standardized formats.

 


 

© STRING CONSORTIUM 2016

STRING is a database of known and predicted protein-protein interactions. The database contains information from numerous sources, including experimental repositories, computational prediction methods and public text collections. STRING is regularly updated and gives a comprehensive view on protein-protein interactions currently available.

(http://string-db.org/)
  • Organisms: 2031
  • Proteins: 9.6 million
  • Interactions: 184 million



Pathway Commons, a web resource for biological pathway data

 


Pathway Viewer Web

PCViz is an open-source web-based network visualization tool that helps users query Pathway Commons and obtain details about genes and their interactions extracted from multiple pathway data resources.

It allows interactive exploration of the gene networks where users can:

  • expand the network by adding new genes of interest
  • reduce the size of the network by filtering genes or interactions based on different criteria
  • load cancer context to see the overall frequency of alteration for each gene in the network
  • download networks in various formats for further analysis or use in publication

PCViz is built and maintained by Memorial Sloan-Kettering Cancer Center and the University of Toronto.

 ChiBE

BioPAX Editor Desktop

Ethan G. Cerami, Benjamin E. Gross, […], and Chris Sander


ABSTRACT

Pathway Commons (http://www.pathwaycommons.org) is a collection of publicly available pathway data from multiple organisms. Pathway Commons provides a web-based interface that enables biologists to browse and search a comprehensive collection of pathways from multiple sources represented in a common language, a download site that provides integrated bulk sets of pathway information in standard or convenient formats and a web service that software developers can use to conveniently query and access all data. Database providers can share their pathway data via a common repository. Pathways include biochemical reactions, complex assembly, transport and catalysis events and physical interactions involving proteins, DNA, RNA, small molecules and complexes. Pathway Commons aims to collect and integrate all public pathway data available in standard formats. Pathway Commons currently contains data from nine databases with over 1400 pathways and 687 000 interactions and will be continually expanded and updated.

Pathway Commons currently includes pathway and interaction information from nine sources

Data Source | Format | Size | Updated | Focus (species) | Reference or URL
BioGRID | PSI–MI 2.5 | 347 508 interactions | August 2010 (3.0.67) | Model organisms | (20)
Cancer Cell Map | BioPAX L2 | 10 pathways, 2104 interactions | May 2006 | Human | http://cancer.cellmap.org
HPRD | PSI–MI 2.5 | 40 618 interactions | 13 April 2010, Version 9 | Human | (21)
HumanCyc | BioPAX L2 | 266 pathways, 4879 interactions | 16 June 2010, Version 14.1 | Human | (22)
IMID | BioPAX L2 | 1729 interactions | March 2009 | Human | http://www.sbcny.org/
IntAct | PSI–MI 2.5 | 154 567 interactions | 8 August 2010, Version 3.1, r14760 | All | (23)
MINT | PSI–MI 2.5 | 117 202 interactions | 28 July 2010 | All | (24)
NCI/Nature PID | BioPAX L2 | 186 pathways, 13 879 interactions | 10 August 2010 | Human | (25)
Reactome | BioPAX L2 | 1015 pathways, 5397 interactions | 18 June 2010, Version 33 | Human | (5)
All | Integrated BioPAX L2 | 1477 pathways, 687 883 interactions | | Multiple | http://www.pathwaycommons.org

New sources are periodically added and listed on the Pathway Commons website. Note that pathway and interaction statistics represent non-unique counts from source databases, as these records are not currently merged from multiple sources (only molecules are currently merged).
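
The note about non-unique counts can be checked by plain addition: the "All" row of the table is the sum of the per-source rows. The short Python sketch below reproduces the totals from the figures in the table. Treating sources that list no pathway count as contributing zero pathways is an assumption of this sketch.

```python
# (pathways, interactions) per source, copied from the table above.
# Sources with no pathway count listed are assumed to contribute 0 pathways.
SOURCES = {
    "BioGRID": (0, 347508),
    "Cancer Cell Map": (10, 2104),
    "HPRD": (0, 40618),
    "HumanCyc": (266, 4879),
    "IMID": (0, 1729),
    "IntAct": (0, 154567),
    "MINT": (0, 117202),
    "NCI/Nature PID": (186, 13879),
    "Reactome": (1015, 5397),
}

total_pathways = sum(pathways for pathways, _ in SOURCES.values())
total_interactions = sum(interactions for _, interactions in SOURCES.values())

print(total_pathways, total_interactions)  # 1477 687883
```

Since the totals are simple sums, an interaction reported by two sources is counted twice, which is exactly the caveat stated above.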


Data Sources (http://www.pathwaycommons.org/pc2/datasources)

Warehouse data (canonical molecules, ontologies) are converted to BioPAX utility classes, such as EntityReference, ControlledVocabulary, EntityFeature sub-classes, and saved as the initial BioPAX model, which forms the foundation for integrating pathway data and for id-mapping.

Pathway and binary interaction data (interactions, participants) are normalized next and merged into the database. Original reference molecules are replaced with the corresponding BioPAX warehouse objects.

Note:

Links to the access summary for Warehouse data sources are not provided below; however, the total number of requests minus errors will be a fair estimate. Access statistics are computed from January 2014, except unique IP addresses, which are computed from November 2014.

Acknowledgment

The Pathway Commons team greatly appreciates the fundamental contribution of all the data providers, authors, Identifiers.org, all the open biological ontologies, and the open-source projects and standards, which made the creation of this integrated BioPAX web service and database feasible.

  Reactome

Reactome v56 (only ‘Homo sapiens.owl’) 31-Mar-2016 (BIOPAX)

URI: http://pathwaycommons.org/pc2/reactome

All names (for data filtering): reactome

Contains: 2007 pathways, 14427 interactions, 35835 participants

Access summary

Publication: Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, Caudy M, Garapati P, Gillespie M, Kamdar MR, Jassal B, Jupe S, Matthews L, May B, Palatnik S, Rothfels K, Shamovsky V, Song H, Williams M, Birney E, Hermjakob H, Stein L, D’Eustachio P. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42(database issue):d472-7 (PMID:24243840)

Availability: free

  NCI Pathway Interaction Database: Pathway

NCI Curated Human Pathways from PID (final); 27-Jul-2015 (BIOPAX)

URI: http://pathwaycommons.org/pc2/pid

All names (for data filtering): pid,nci pathway interaction database: pathway

Contains: 745 pathways, 14707 interactions, 10531 participants

Access summary

Publication: Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the Pathway Interaction Database. Nucleic Acids Res. 2009;37(database issue):d674-9 (PMID:18832364)

Availability: free

  PhosphoSitePlus

PhosphoSite Kinase-substrate information; 15-Mar-2016 (BIOPAX)

URI: http://pathwaycommons.org/pc2/psp

All names (for data filtering): phosphosite,phosphositeplus

Contains: 27692 interactions, 15458 participants

Access summary

Publication: Hornbeck PV, Kornhauser JM, Tkachev S, Zhang B, Skrzypek E, Murray B, Latham V, Sullivan M. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012;40(database issue):d261-70 (PMID:22135298)

Availability: free

  HumanCyc

HumanCyc 19.5; 27-Oct-2015; under license from SRI International, www.biocyc.org (BIOPAX)

URI: http://pathwaycommons.org/pc2/humancyc

All names (for data filtering): humancyc,biocyc

Contains: 302 pathways, 7102 interactions, 5896 participants

Access summary

Publication: Romero P, Wagg J, Green ML, Kaiser D, Krummenacker M, Karp PD. Computational prediction of human metabolic pathways from the complete human genome. Genome Biol. 2005;6(1):r2 (PMID:15642094)

Availability: free

  HPRD

HPRD PSI-MI Release 9; 13-Apr-2010 (PSI_MI)

URI: http://pathwaycommons.org/pc2/hprd

All names (for data filtering): hprd

Contains: 40595 interactions, 9844 participants

Access summary

Publication: Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, Ramabadran S, Chaerkady R, Pandey A. Human Protein Reference Database–2009 update. Nucleic Acids Res. 2009;37(database issue):d767-72 (PMID:18988627)

Availability: academic

  PANTHER Pathway

PANTHER Pathways 3.4 on 18-May-2015 (auto-converted to human-only model) (BIOPAX)

URI: http://pathwaycommons.org/pc2/panther

All names (for data filtering): panther,panther pathway,pantherdb

Contains: 272 pathways, 4700 interactions, 6703 participants

Access summary

Publication: Mi H, Muruganujan A, Thomas PD. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res. 2013;41(database issue):d377-86 (PMID:23193289)

Availability: free

  Database of Interacting Proteins

DIP (human), 14-01-2016 (PSI_MI)

URI: http://pathwaycommons.org/pc2/dip

All names (for data filtering): dip,database of interacting proteins

Contains: 8218 interactions, 4671 participants

Access summary

Publication: Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32(database issue):d449-51 (PMID:14681454)

Availability: free

  BioGRID

BioGRID Release 3.4.135 (human and the viruses), 24-Mar-2016 (PSI_MI)

URI: http://pathwaycommons.org/pc2/biogrid

All names (for data filtering): biogrid

Contains: 322538 interactions, 645241 participants

Access summary

Publication: Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006;34(database issue):d535-9 (PMID:16381927)

Availability: free

  IntAct

IntAct (human only; ‘negative’ files removed), 16-Feb-2016 (PSI_MI)

URI: http://pathwaycommons.org/pc2/intact

All names (for data filtering): intact

Contains: 150549 interactions, 403729 participants

Access summary

Publication: Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(database issue):d358-63 (PMID:24234451)

Availability: free

  IntAct

IntAct Complex (human), 16-Feb-2016 (PSI_MI)

URI: http://pathwaycommons.org/pc2/intact_complex

All names (for data filtering): intact

Contains: 1452 participants

Access summary

Publication: Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project–IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014;42(database issue):d358-63 (PMID:24234451)

Availability: free

  BIND

BIND (human), 15-Dec-2010 (PSI_MI)

URI: http://pathwaycommons.org/pc2/bind

All names (for data filtering): bind,biomolecular interaction network database

Contains: 35279 interactions, 74675 participants

Access summary

Publication: Bader GD, Betel D, Hogue CW. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003;31(1):248-250 (PMID:12519993)

Availability: free

  CORUM

CORUM (human), 17-Feb-2012 (PSI_MI)

URI: http://pathwaycommons.org/pc2/corum

All names (for data filtering): corum

Contains: 4401 participants

Access summary

Publication: Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, Frishman G, Montrone C, Mewes HW. CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res. 2010;38(database issue):d497-501 (PMID:19884131)

Availability: academic

  TRANSFAC

Transcription Factor Target data from Collection 3 in MSigDB (originally from: TRANSFAC Public, by BIOBASE, QIAGEN); version 7.4 (BIOPAX)

URI: http://pathwaycommons.org/pc2/transfac

All names (for data filtering): transfac

Contains: 427 pathways, 261624 interactions, 13276 participants

Access summary

Publication: Wingender E. The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief Bioinform. 2008;9(4):326-332 (PMID:18436575)

Availability: academic

  miRTarBase

Human miRNA-target gene relationships from MiRTarBase; v4.5, 01-NOV-2013 (converted 13-MAR-2015) (BIOPAX)

URI: http://pathwaycommons.org/pc2/mirtarbase

All names (for data filtering): mirtarbase

Contains: 5 pathways, 51214 interactions, 12775 participants

Access summary

Publication: Hsu SD, Tseng YT, Shrestha S, Lin YL, Khaleel A, Chou CH, Chu CF, Huang HY, Lin CM, Ho SY, Jian TY, Lin FM, Chang TH, Weng SL, Liao KW, Liao IE, Liu CC, Huang HD. miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2014;42(database issue):d78-85 (PMID:24304892)

Availability: academic

  DrugBank

DrugBank v4.3 converted to BioPAX from the original XML dump (BIOPAX)

URI: http://pathwaycommons.org/pc2/drugbank

All names (for data filtering): drugbank

Contains: 19297 interactions, 15854 participants

Access summary

Publication: Law V, Knox C, Djoumbou Y, Jewison T, Guo AC, Liu Y, Maciejewski A, Arndt D, Wilson M, Neveu V, Tang A, Gabriel G, Ly C, Adamjee S, Dame ZT, Han B, Zhou Y, Wishart DS. DrugBank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 2014;42(database issue):d1091-7 (PMID:24203711)

Availability: academic

  Recon X

Recon X: Reconstruction of the Human Genome, converted from SBML; 2.03  (BIOPAX)

URI: http://pathwaycommons.org/pc2/reconx

All names (for data filtering): recon x

Contains: 1 pathways, 10813 interactions, 8316 participants

Access summary

Publication: Thiele I, Swainston N, Fleming RM, Hoppe A, Sahoo S, Aurich MK, Haraldsdottir H, Mo ML, Rolfsson O, Stobbe MD, Thorleifsson SG, Agren R, Bölling C, Bordel S, Chavali AK, Dobson P, Dunn WB, Endler L, Hala D, Hucka M, Hull D, Jameson D, Jamshidi N, Jonsson JJ, Juty N, Keating S, Nookaew I, Le Novère N, Malys N, Mazein A, Papin JA, Price ND, Selkov E Sr, Sigurdsson MI, Simeonidis E, Sonnenschein N, Smallbone K, Sorokin A, van Beek JH, Weichart D, Goryanin I, Nielsen J, Westerhoff HV, Kell DB, Mendes P, Palsson BØ. A community-driven global reconstruction of human metabolism. Nat Biotechnol. 2013;31(5):419-425 (PMID:23455439)

Availability: free

  Comparative Toxicogenomics Database

Comparative Toxicogenomics Database (human), 20150603 (BIOPAX)

URI: http://pathwaycommons.org/pc2/ctd

All names (for data filtering): ctd,comparative toxicogenomics database,ctdbase

Contains: 32722 pathways, 390428 interactions, 61031 participants

Access summary

Publication: Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database’s 10th year anniversary: update 2015. Nucleic Acids Res. 2015;43(database issue):d914-20 (PMID:25326323)

Availability: academic

  KEGG Pathway

KEGG 07/2011 (only human, hsa* files), converted to BioPAX by BioModels (http://www.ebi.ac.uk/biomodels-main/) team (BIOPAX)

URI: http://pathwaycommons.org/pc2/kegg

All names (for data filtering): kegg,kegg pathway

Contains: 122 pathways, 3566 interactions, 3355 participants

Access summary

Publication: Wrzodek C, Büchel F, Ruff M, Dräger A, Zell A. Precise generation of systems biology models from KEGG pathways. BMC Syst Biol. 2013;7:15 (PMID:23433509)

Availability: academic

  Small Molecule Pathway Database

Small Molecule Pathway Database 2.0, 07-Jul-2015 (BIOPAX)

URI: http://pathwaycommons.org/pc2/smpdb

All names (for data filtering): smpdb,small molecule pathway database

Contains: 1206 pathways, 4701 interactions, 4863 participants

Access summary

Publication: Jewison T, Su Y, Disfany FM, Liang Y, Knox C, Maciejewski A, Poelzer J, Huynh J, Zhou Y, Arndt D, Djoumbou Y, Liu Y, Deng L, Guo AC, Han B, Pon A, Wilson M, Rafatnia S, Liu P, Wishart DS. SMPDB 2.0: big improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 2014;42(database issue):d478-84 (PMID:24203708)

Availability: free

  Integrating Network Objects with Hierarchies

INOH 4.0 (signal transduction and metabolic data), 22-MAR-2011 (BIOPAX)

URI: http://pathwaycommons.org/pc2/inoh

All names (for data filtering): inoh,integrating network objects with hierarchies

Contains: 774 pathways, 5432 interactions, 17142 participants

Access summary

Publication: Yamamoto S, Sakai N, Nakamura H, Fukagawa H, Fukuda K, Takagi T. INOH: ontology-based highly structured database of signal transduction pathways. Database (Oxford). 2011;2011:bar052 (PMID:22120663)

Availability: free

  NetPath

NetPath 12/2011 (BIOPAX)

URI: http://pathwaycommons.org/pc2/netpath

All names (for data filtering): netpath

Contains: 27 pathways, 6347 interactions, 3266 participants

Access summary

Publication: Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HK, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TS, Lin JX, Houtman JC, Desiderio S, Renauld JC, Constantinescu SN, Ohara O, Hirano T, Kubo M, Singh S, Khatri P, Draghici S, Bader GD, Sander C, Leonard WJ, Pandey A. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010;11(1):r3 (PMID:20067622)

Availability: free

  WikiPathways

WikiPathways – Community Curated Human Pathways; 29/09/2015 (human) (BIOPAX)

URI: http://pathwaycommons.org/pc2/wp

All names (for data filtering): wikipathways

Contains: 333 pathways, 9758 interactions, 9584 participants

Access summary

Publication: Pico AR, Kelder T, van Iersel MP, Hanspers K, Conklin BR, Evelo C. WikiPathways: pathway editing for the people. PLoS Biol. 2008;6(7):e184 (PMID:18651794)

Availability: free

  ChEBI

ChEBI Ontology v138, 01-Apr-2016 (WAREHOUSE)

All names (for data filtering): chebi

Publication: Hastings J, de Matos P, Dekker A, Ennis M, Harsha B, Kale N, Muthukrishnan V, Owen G, Turner S, Williams M, Steinbeck C. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 2013;41(database issue):d456-63 (PMID:23180789)

Availability: free

  SwissProt

UniProtKB/Swiss-Prot (human), 16-Mar-2015 (WAREHOUSE)

All names (for data filtering): uniprot,swissprot,uniprotkb

Publication: UniProt Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014;42(database issue):d191-8 (PMID:24253303)

Availability: free

  UniChem

Selected whole-source id-mapping files (to ChEBI) from UniChem (manually edited/fixed/sorted), 29-Dec-2015 (MAPPING)

All names (for data filtering): unichem

Publication: Chambers J, Davies M, Gaulton A, Hersey A, Velankar S, Petryszak R, Hastings J, Bellis L, McGlinchey S, Overington JP. UniChem: a unified chemical structure cross-referencing and identifier tracking system. J Cheminform. 2013;5(1):3 (PMID:23317286)

Availability: free
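
The "All names (for data filtering)" fields listed above act as aliases for each data source. As a rough illustration of how such aliases could be resolved to a source URI on the client side, here is a small Python sketch. It covers only four of the listed sources, the helper names are hypothetical, and the case-insensitive matching is an assumption; how the Pathway Commons service itself matches names is not stated here.

```python
# URIs and aliases copied from four of the entries above.
DATA_SOURCES = {
    "http://pathwaycommons.org/pc2/reactome": ["reactome"],
    "http://pathwaycommons.org/pc2/pid": ["pid", "nci pathway interaction database: pathway"],
    "http://pathwaycommons.org/pc2/humancyc": ["humancyc", "biocyc"],
    "http://pathwaycommons.org/pc2/panther": ["panther", "panther pathway", "pantherdb"],
}


def build_alias_index(sources):
    """Map every alias (lower-cased) to the URI of its data source."""
    index = {}
    for uri, names in sources.items():
        for name in names:
            index[name.lower()] = uri
    return index


def resolve_source(name, index):
    """Return the URI for an alias, or None if the alias is unknown.

    Case-insensitive matching is an assumption of this sketch.
    """
    return index.get(name.strip().lower())


ALIAS_INDEX = build_alias_index(DATA_SOURCES)
print(resolve_source("BioCyc", ALIAS_INDEX))
print(resolve_source("pantherdb", ALIAS_INDEX))
```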


ConsensusPathDB—a database for integrating human functional interaction networks

ConsensusPathDB is a database system for the integration of human functional interactions. Current knowledge of these interactions is dispersed in more than 200 databases, each having a specific focus and data format. ConsensusPathDB currently integrates the content of 12 different interaction databases with heterogeneous foci comprising a total of 26 133 distinct physical entities and 74 289 distinct functional interactions (protein–protein interactions, biochemical reactions, gene regulatory interactions), and covering 1738 pathways. We describe the database schema and the methods used for data integration. Furthermore, we describe the functionality of the ConsensusPathDB web interface, where users can search and visualize interaction networks, upload, modify and expand networks in BioPAX, SBML or PSI-MI format, or carry out over-representation analysis with uploaded identifier lists with respect to substructures derived from the integrated interaction network. The ConsensusPathDB database is available at: http://cpdb.molgen.mpg.de
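
Over-representation analysis asks whether an uploaded identifier list overlaps a predefined set (for example a pathway) more than expected by chance. The abstract above does not say which statistic is used, so the following Python sketch uses the hypergeometric upper-tail probability, a common choice for this kind of analysis, as an assumption. The function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, set_size, background_size):
    """P(X >= overlap) for a hypergeometric draw.

    list_size identifiers are drawn without replacement from a background of
    background_size identifiers, set_size of which belong to the tested set.
    """
    max_overlap = min(list_size, set_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy example: 4 background genes, 2 of them in the pathway,
# and an uploaded list of 2 genes that both fall in the pathway.
print(hypergeom_pvalue(overlap=2, list_size=2, set_size=2, background_size=4))
```

In practice one such p-value is computed per tested set and then corrected for multiple testing. The choice of background (all annotated genes versus all measured genes) strongly affects the result.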

The MIPS Mammalian Protein-Protein Interaction Database

The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.

http://mips.helmholtz-muenchen.de/proj/ppi/

Other PPI resources

There are plenty of interesting databases and other sites on protein-protein interactions. Currently we are aware of the following PPI resources:

Resource | Comments
APID | Agile Protein Interaction DataAnalyzer (Cancer Research Center, Salamanca, Spain)
BIND | Biomolecular INteraction Network Database at the University of Toronto, Canada. No species restriction.
CYGD | PPI section of the Comprehensive Yeast Genome Database. Manually curated comprehensive S. cerevisiae PPI database at MIPS.
DIP | Database of Interacting Proteins at UCLA. No species restriction.
GRID | General Repository for Interaction Datasets. Mount Sinai Hospital, Toronto, Canada.
HIV Interaction DB | Interactions between HIV and host proteins.
HPRD | The Human Protein Reference Database. Institute of Bioinformatics, Bangalore, India and Johns Hopkins University, Baltimore, MD, USA.
HPID | Human Protein Interaction Database. Department of Computer Science and Information Engineering, Inha University, Inchon, Korea.
iHOP | iHOP (Information Hyperlinked over Proteins). Protein association network built by literature mining.
IntAct | Protein interaction database at EBI. No species restriction.
InterDom | Database of putative interacting protein domains. Institute for InfoComm Research, Singapore.
JCB | PPI site at the Jena Centre for Bioinformatics, Germany.
MetaCore | Commercial software suite and database. Manually curated human PPIs (among other things). GeneGo.
MINT | Molecular INTeraction database at the Centro di Bioinformatica Moleculare, Universita di Roma, Italy.
MRC PPI links | Commented list of links to PPI databases and resources maintained at the MRC Rosalind Franklin Centre for Genomics Research, Cambridge, UK.
OPHID | The Online Predicted Human Interaction Database. Ontario Cancer Institute and University of Toronto, Canada.
Pawson Lab | Information on protein-interaction domains.
PDZbase | Database of PDZ mediated protein-protein interactions.
Predictome | Predicted functional associations and interactions. Boston University.
Protein-Protein Interaction Server | Analysis of protein-protein interfaces of protein complexes from PDB. University College of London, UK.
PathCalling | Proteomics and PPI tool/database. CuraGen Corporation.
PIM | Hybrigenics PPI data and tool, H. pylori. Free academic license available.
RIKEN | Experimental and literature PPIs in mouse.
STRING | Protein networks based on experimental data and predictions at EMBL.
YPD | “BioKnowledge Library” at Incyte Corporation. Manually curated PPI data from S. cerevisiae. Proprietary.

 


Human biological pathway unification


PathCards is an integrated database of human biological pathways and their annotations. Human pathways were clustered into SuperPaths based on gene content similarity. Each PathCard provides information on one SuperPath which represents one or more human pathways. It includes 1,131 SuperPath entries, consolidated from 12 sources.

Publication Details

Belinky, F., Nativ, N., Stelzer, G., Zimmerman, S., Iny Stein, T., Safran, M. and Lancet, D. PathCards: multi-source consolidation of human biological pathways, Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006.

http://pathcards.genecards.org/

 

PathCards: multi-source consolidation of human biological pathways

  1. Frida Belinky*,
  2. Noam Nativ,
  3. Gil Stelzer,
  4. Shahar Zimmerman,
  5. Tsippi Iny Stein,
  6. Marilyn Safran and
  7. Doron Lancet



  1. Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7610001, Israel
  *Corresponding author: Tel: +972-89343188; Fax: +972-89344487; Email: Frida.Belinky@weizmann.ac.il
  • Received September 22, 2014.
  • Revision received January 13, 2015.
  • Accepted January 14, 2015.

Abstract

The study of biological pathways is key to a large number of systems analyses. However, many relevant tools consider a limited number of pathway sources, missing out on many genes and gene-to-gene connections. Simply pooling several pathway sources would result in redundancy and the lack of systematic pathway interrelations. To address this, we exercised a combination of hierarchical clustering and nearest neighbor graph representation, with judiciously selected cutoff values, thereby consolidating 3215 human pathways from 12 sources into a set of 1073 SuperPaths. Our unification algorithm finds a balance between reducing redundancy and optimizing the level of pathway-related informativeness for individual genes. We show a substantial enhancement of the SuperPaths’ capacity to infer gene-to-gene relationships when compared with individual pathway sources, separately or taken together. Further, we demonstrate that the chosen 12 sources entail nearly exhaustive gene coverage. The computed SuperPaths are presented in a new online database, PathCards, showing each SuperPath, its constituent network of pathways, and its contained genes. This provides researchers with a rich, searchable systems analysis resource. Database URL: http://pathcards.genecards.org/

Introduction

The systematic analysis of biological pathways has ever-increasing significance in an age of growing systems analyses and omics data. Mapping genes onto pathways may contribute to a better understanding of biological and biomedical mechanisms. The literature provides a large collection of pathway definition sources (1). Pathway knowledge bases represent the careful collection of genes and their interactions, mapped onto biological processes. These repositories, which include both academic and commercial resources (Figure 1A), provide lists of pathways and their cellular components, each with an idiosyncratic view of the pathway universe.

Figure 1.


 


 

The gene-content network of pathway sources. Eighteen sources are shown, 12 of which (colored) are included in SuperPaths generation. Edge widths are proportional to the pairwise Jaccard similarity coefficient computed for the gene contents of the entire source. The sources, depicted in GeneCards Version 3.12, are: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). White circles correspond to sources not included in the SuperPath generation process: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).

 


Indeed, the definition of the boundaries of biological pathways differs among sources, as exemplified by the highly studied processes of fatty acid metabolism (2) or the TCA cycle (the tricarboxylic acid cycle) (3). Further, the same pathway name may have widely dissimilar gene content in different sources (4). At present, there is no definitive analysis of pathway similarities, either between or within sources. Thus the multitude of pathway resources can often be confusing when portraying gene-pathway affiliations.

Previous attempts to unify pathways from several sources include NCBI’s Biosystems (5), PathwayCommons (6), PathJam (7), HPD (8), ConsensusPathDB (9), hiPathDB (10) and Pathway Distiller (11). However, none of these efforts entails a standardized method to unify numerous sources into a consolidated global repository.

Here, we describe an approach aimed at generating an integrated view across multiple pathway sources. We applied a combination of nearest neighbor graph and hierarchical clustering, utilizing a gene-content metric, to generate a manageable set of 1073 unified pathways (SuperPaths). These optimally encompass all of the information contained in the individual sources, striving to minimize pathway redundancy while maximizing gene-related pathway informativeness. The resultant SuperPaths are integrated into GeneCards (12), enabling clear portrayal of a gene’s set of unified pathways. Finally, these SuperPaths, together with diverse related biological data, are provided in PathCards—a new pathway-centric online database, enabling quick in-depth analysis of each human SuperPath.

 


Materials and methods

Pathway mining and comparison

Pathway gene sets were generated based on the GeneCards platform (12), implementing the gene symbolization process allowing for comparison of pathway gene sets, from 12 different manually curated sources, including: Reactome (13), KEGG (14), PharmGKB (15), WikiPathways (16), QIAGEN, HumanCyc (17), Pathway Interaction Database (18), Tocris Bioscience, GeneGO, Cell Signaling Technologies (CST), R&D Systems and Sino Biological (see Table 1). A binary matrix was generated for all 3215 pathways, where each column represents a gene, indicated by 1 for presence in the pathway and 0 for absence. Additionally, six sources were analysed for their cumulative tallying of gene content, including: BioCarta (19), SMPDB (20), INOH (21), NetPath (22), EHMN (23) and SignaLink (24).

Pathway similarity assessment

In the analyses performed, we utilized gene content overlap to estimate pathway similarity. This was done based on the Jaccard coefficient, which measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sets. To examine the legitimacy of this method, we performed a comparison to an alternative methodology, embodied in MetaPathwayHunter pathway comparison, that incorporates topology in pairwise pathway alignment (25). For such analysis, we used a set of 151 yeast pathways available in MetaPathwayHunter, and computed Jaccard similarity coefficients (J) for all 11 325 pathway pairs. We then selected a sample of 30 pairs containing 28 unique pathways out of a total of 87 pairs with J ≥ 0.3, ensuring maximal representation for larger pathways. Each of the 28 pathways was queried in MetaPathwayHunter against the entire gamut of 151 with default parameters (a total of 4228 comparisons). We found that 29 out of the 30 sample pathway pairs obtained a significant MetaPathwayHunter alignment (P ≤ 0.01). As only 64 of the 4228 comparisons showed such a P-value, the probability of obtaining this result at random is 1.6 × 10^−53 (Supplementary Table S1). Thus, Jaccard scores appear as excellent predictors for the results of the more elaborate method. A full account of interpathway pairwise similarity is available upon request.
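As a minimal sketch of the similarity measure described above, the Jaccard coefficient of two gene sets can be computed as follows. The pathway names and gene symbols are hypothetical toy data, and returning 0 for two empty sets is a convention chosen here, not something stated in the text.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| for two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)


# Hypothetical toy gene sets, for illustration only.
pathways = {
    "P1": {"G1", "G2", "G3", "G4"},
    "P2": {"G3", "G4", "G5"},
    "P3": {"G6", "G7"},
}

# Jaccard coefficient for every unordered pathway pair.
pairwise = {
    (p, q): jaccard(pathways[p], pathways[q])
    for p, q in combinations(sorted(pathways), 2)
}

# The number of unordered pairs among n pathways is n * (n - 1) / 2,
# which for the 151 yeast pathways gives the 11 325 pairs quoted above.
n_yeast = 151
n_pairs = n_yeast * (n_yeast - 1) // 2
```

Here P1 and P2 share two of five distinct genes, so J = 0.4, while P1 and P3 share none, so J = 0.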

Clustering algorithm

For the main pathway clustering algorithm, we applied a method described elsewhere (26), which includes the following steps: i) The generation of cluster cores by joining all pathway pairs with Jaccard coefficient ≥T2, the upper cutoff, equivalent to hierarchical clustering. ii) Performing cluster extension by generating new best edges, i.e. joining every pathway to a pathway showing the highest score, as long as it is ≥T1, the lower cutoff, akin to nearest neighbor joining. If two or more target pathways have the same best score, all are joined. Each resultant connected component is defined to be a pathway cluster (SuperPath). Identical pathway sets were joined without considering each other as nearest neighbors (i.e. the best scoring non-identical pathway gene-set is chosen as the nearest neighbor). This clustering algorithm is order independent.
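The two steps above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: the function name `cluster_pathways`, the union-find bookkeeping and the toy gene sets are all assumptions, and the default cutoffs are the T1 = 0.3 and T2 = 0.7 values reported in this paper.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pathways(pathways, t1=0.3, t2=0.7):
    """Cluster a dict {pathway name: gene set} into connected components."""
    names = sorted(pathways)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    scores = {}
    for p, q in combinations(names, 2):
        scores[(p, q)] = scores[(q, p)] = jaccard(pathways[p], pathways[q])

    # Step i: cluster cores, joining every pair with J >= T2.
    # Identical gene sets have J = 1 and are therefore joined here.
    for (p, q), j in scores.items():
        if j >= t2:
            union(p, q)

    # Step ii: best edges. Each pathway is joined to its highest-scoring
    # non-identical neighbour(s) if that score is >= T1; ties are all joined.
    for p in names:
        candidates = [(scores[(p, q)], q) for q in names
                      if q != p and pathways[q] != pathways[p]]
        if not candidates:
            continue
        best = max(j for j, _ in candidates)
        if best >= t1:
            for j, q in candidates:
                if j == best:
                    union(p, q)

    # Each connected component is one cluster (SuperPath).
    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


# Hypothetical toy data: A and B form a core (J = 9/11), C attaches to both
# through tied best edges (J = 4/13), and D remains a singleton.
toy = {
    "A": set("abcdefghij"),
    "B": set("abcdefghik"),
    "C": set("abcdxyz"),
    "D": set("pq"),
}
clusters = cluster_pathways(toy)
```

Because the result is the set of connected components of a graph whose edges depend only on the pairwise scores, the outcome does not depend on the order in which pathways are processed.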

Determination of cutoffs

Uniqueness of a SuperPath, Us, is defined as $\log_{10}\left(\frac{\sum 1/N_p}{N_g}\right)$, where Np is the number of pathways that include a certain gene, averaging for each pathway over all genes in the SuperPath (divided by the number of genes Ng). Uniqueness of genes, Is, is symmetrically defined per SuperPath as $\log_{10}\left(\frac{\sum 1/N_g}{N_p}\right)$, where each Ng is the number of genes included in the relevant pathway, averaging for each gene over all SuperPaths including a gene. In order to then find the best tradeoff between the two scores, we summed up the average Us and Is for each set of T1 and T2 cutoff parameters. Thus Us + Is was calculated for each set of parameters to find the two parameters for which the tradeoff between pathway and gene uniqueness would be optimal. The best cutoffs by maximizing Us + Is were T1 = 0.3 and T2 ≥ 0.5. Further fine tuning of the upper cutoff was performed by resampling of the data, a technique employed by Levin and Domany (27). We used two dilutions (0.75 and 0.9), i.e. randomly sampling 75% and 90% of the pathways (resampling 100 times for each dilution) and performing the clustering algorithm on each sample, each time calculating the percent of the edges present in the original clustering, i.e. the percent of cases in which two pathways belonged to the same cluster as in the full dataset. In both dilutions, the upper cutoff of 0.7 was found to recover a higher percent of the edges in the original clustering algorithm (Figure 4C).
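The sketch below shows one possible reading of the two scores, and it should be taken as an assumption rather than the published computation. For each SuperPath, 1/Np is averaged over its genes (Np being the number of SuperPaths containing the gene) and the base-10 logarithm is taken to give Us. For each gene, 1/Ng is averaged over the SuperPaths containing it (Ng being the gene count of the SuperPath) to give Is. The function name `mean_us_is` and the toy sets are hypothetical.

```python
import math


def mean_us_is(units):
    """Return (mean Us, mean Is) for a dict {unit name: gene set}."""
    # Map each gene to the units (pathways or SuperPaths) that contain it.
    gene_to_units = {}
    for name, genes in units.items():
        for g in genes:
            gene_to_units.setdefault(g, []).append(name)

    # Us per unit: log10 of the mean of 1/Np over the unit's genes.
    us = [
        math.log10(sum(1 / len(gene_to_units[g]) for g in genes) / len(genes))
        for genes in units.values() if genes
    ]
    # Is per gene: log10 of the mean of 1/Ng over the units holding the gene.
    is_ = [
        math.log10(sum(1 / len(units[u]) for u in member_of) / len(member_of))
        for member_of in gene_to_units.values()
    ]
    return sum(us) / len(us), sum(is_) / len(is_)


# Hypothetical toy data: two identical pathways versus their merged form.
redundant = {"P1": {"G1", "G2"}, "P2": {"G1", "G2"}}
merged = {"S1": {"G1", "G2"}}

us_red, is_red = mean_us_is(redundant)
us_mer, is_mer = mean_us_is(merged)
```

Under this reading, merging two identical pathways raises Us from log10(0.5) to 0 while leaving Is unchanged, so Us + Is increases, which matches the intent of rewarding the removal of redundancy.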

Name similarity calculation and concordance with gene similarity

Name similarity was calculated as the Jaccard coefficients of the shared words in the two pathway names, after omitting trivial words and using stemming to identify words with the same root. The cutoff between similar and non-similar names (as well as gene content in regard to comparison with name similarity) was set to J = 0.5. Name similarity was compared with gene content similarity to find the level of concordance between the two.
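A rough sketch of such a name comparison is given below. The list of trivial words and the suffix-stripping function are crude stand-ins chosen for illustration; the text does not specify which stop words or which stemmer were used.

```python
import re

# Hypothetical list of trivial words (compared after stemming).
TRIVIAL = {"pathway", "of", "the", "and", "in", "by", "to", "a"}


def naive_stem(word):
    """Very crude stemmer (an assumption): strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def name_tokens(name):
    """Lower-case, tokenize, stem and drop trivial words."""
    stems = {naive_stem(w) for w in re.findall(r"[a-z0-9]+", name.lower())}
    return stems - TRIVIAL


def name_similarity(name_a, name_b):
    """Jaccard coefficient of the stemmed, non-trivial words of two names."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s_meiosis = name_similarity("Meiosis", "Oocyte meiosis")
s_plural = name_similarity("Apoptosis pathway", "Apoptosis pathways")
s_dna = name_similarity("DNA repair", "DNA damage response pathway")
```

With these stand-ins, 'Meiosis' and 'Oocyte meiosis' share one of two distinct words (J = 0.5), the singular and plural apoptosis names are identical after stemming (J = 1), and the two DNA-related names share one of four words (J = 0.25).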

Shared publications and PPI data

Publication and protein–protein interaction (PPI) data for each gene were obtained from the GeneCards database, including several combined sources. Publication sources of GeneCards include both manually curated publications (e.g. UniProtKB/Swiss-Prot) and text mining approaches that report connections between a gene and a list of publications. A shared publication between two genes is an association of both genes to the same publication and does not indicate a direct interaction between the genes. PPI scores between pairs of genes are also based on several interaction sources in GeneCards. Unlike shared publications, PPIs reflect direct interactions between the two gene products.

Randomization and comparison

A randomized set of pseudo-SuperPaths was generated, such that the pseudo-SuperPaths are the same size and quantity as the SuperPaths, albeit with genes assigned at random (from the list of genes with any pathway annotation). Gene pairs that belong to at least one SuperPath, but do not belong together in any individual pathway (the test set) were analysed for the number of shared publications and PPI scores for each pair. In comparison, gene pairs that belong to at least one pseudo-SuperPath, but do not belong together in any individual pathway (the control set) were analysed for the same attributes. To compare the two sets, which are of different sizes, a random sample of the larger set (the control set) of the same size as the smaller set (the test set) was compared with the smaller set. A one-sided Kolmogorov–Smirnov test was performed to compare between the test and control sets.
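The construction of the test and control sets can be sketched as below. This is a simplified illustration under stated assumptions: the function names, the fixed random seed and the toy gene sets are hypothetical, and the statistical comparison itself (the one-sided Kolmogorov–Smirnov test on shared-publication counts and PPI scores) is not reproduced.

```python
import random
from itertools import combinations


def gene_pairs(gene_sets):
    """All unordered gene pairs co-occurring in at least one gene set."""
    pairs = set()
    for genes in gene_sets:
        pairs.update(combinations(sorted(genes), 2))
    return pairs


def make_pseudo_superpaths(superpaths, annotated_genes, seed=0):
    """Same number and sizes as the SuperPaths, genes drawn at random."""
    rng = random.Random(seed)
    pool = sorted(annotated_genes)
    return [set(rng.sample(pool, len(sp))) for sp in superpaths]


def novel_pairs(grouping, pathways):
    """Pairs joined by the grouping but by no individual pathway."""
    return gene_pairs(grouping) - gene_pairs(pathways)


def subsample(pairs, k, seed=0):
    """Random sample of k pairs, used to match the size of the smaller set."""
    return random.Random(seed).sample(sorted(pairs), k)


# Hypothetical toy data.
pathways = [{"A", "B"}, {"B", "C"}, {"D", "E"}]
superpaths = [{"A", "B", "C"}, {"D", "E"}]
annotated = set().union(*pathways)

test_set = novel_pairs(superpaths, pathways)
pseudo = make_pseudo_superpaths(superpaths, annotated)
control_set = novel_pairs(pseudo, pathways)
```

In the toy data the only SuperPath-specific pair is (A, C), since A and C never share an individual pathway but do share a SuperPath.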

Gene enrichment analysis comparison

Differentially expressed sets of genes were obtained from the GeneCards database (12) containing 830 different embryonic tissues based on manual curation (28). For the comparison of SuperPaths and their pathway constituents, 89 SuperPaths that contained exactly two pathways with Jaccard similarity coefficient <0.6 were chosen, a value selected to include pairs of relatively dissimilar pathways in order to enhance comparative power. Two gene set enrichment analyses were run for all 830 gene sets: one with SuperPaths and the other with their constituent pathways. Whenever both the SuperPath and the constituent pathways received a statistical enrichment score, the difference between negative log P-values was computed.

GeneCards and PathCards

SuperPaths have been implemented in GeneCards and are now included in the standard procedure of GeneCards generation. PathCards is an online compendium of human pathways, based on the GeneCards database, presenting SuperPath-related data in each page.

Results

Pathway sources

We analysed 12 pathway sources included in GeneCards (http://www.genecards.org/) (12) with a total of 3215 biological pathways (Table 1 and Figure 1A). The total number of genes covered by these sources is 11 478, nearly twice as large as the gene count in the largest source (Figure 1B), suggesting the power of analysing multiple sources. Asymptotic behavior is observed in the change of total gene count with increasing number of sources. When considering the incorporation of six additional sources (Supplementary Figure S1), we found that the gene count increment is ∼2% of the currently analysed total. This is an indication that the chosen 12 sources provide adequate coverage of human gene-pathway mappings. Switching between the six non-included sources and six included sources of similar size gives a very similar graph, with merely 4% increment in gene count (Supplementary Figure S1).

Analysing the gene repertoires of the four largest sources (Figure 2A), we found that among the 10 770 genes contained within these sources, only 1413 genes were jointly covered by all four sources, and that more than 4000 were unique to one of the four sources. This highlights the notion that source unification is essential to obtain maximal gene coverage. In its simplest embodiment, source unification would entail presenting a unified list of the 3215 pathways included in all 12 sources. This however would ignore the extensive gene-content connectivity embodied in the network representation of this pathway collection (Figure 3A). Further, the original pathway collection has considerable inconsistencies of relations between pathway name and pathway gene content, as exemplified in Figure 2B and C. The summary in Table 2A suggests that only ∼9.4% of all pathway pairs with a similar name have similar gene content, and likewise, only 9.8% of all pathway pairs with similar gene content are named similarly (Supplementary Figure S2).

Figure 2.


Discrepancies between pathway sources. (A) Incomplete gene overlap among sources. Venn diagram (created using VENNY, http://bioinfogp.cnb.csic.es/tools/venny/) showing the number of shared genes among the four largest pathway sources. For a total of 10 770 genes, only 1413 (13%) are shared by all four sources and 609–1791 genes are unique to each of these sources. (B) Inconsistency of names versus content in meiosis-related pathways. A Venn diagram created using BioVenn (29) exemplifies two pathways, ‘Meiosis’ from Reactome and ‘Oocyte meiosis’ from KEGG, with very small gene sharing (7 genes out of 172, J = 0.04). (C) Redundancy in meiosis-related pathways. This is exemplified by the large number of genes (88 of 119, J = 0.74) shared by ‘Meiosis’ and ‘Meiotic recombination’ pathways both from Reactome, and by the large number of genes (52 of 146, J = 0.36) shared by ‘Oocyte meiosis’ and ‘Progesterone-mediated oocyte maturation’ both from KEGG. (D) Pathway size distribution across sources. The pathway size, in gene count, is distributed differently across the different sources.

 
Figure 3.


Network representations of the 3215 analyzed pathways. Nodes represent pathways and edges represent Jaccard similarity coefficients (J) using different methods. Network visualizations were performed using Gephi (30). Colors correspond to pathway sources. (A) No clustering. All edges with J ≥ 0.05 are shown. All but 20 pathways form one large connected component with an average degree of 134. (B) SuperPaths. Each is a connected component obtained by the main clustering algorithm, with thresholds T1 (best edges) of J ≥ 0.3 and T2 of J ≥ 0.7. There are 544 singletons and 529 multi-pathway clusters; the size of the largest cluster is 70. (C) Pure hierarchical clustering, with threshold T2 of J ≥ 0.3. There are 544 singletons and 288 multimembered clusters; the size of the largest cluster is 1046 pathways.

 
Figure 4.


Selection of the T1 and T2 thresholds. (A) Distribution of Jaccard coefficients across all pathway pairs. T1 and T2 respectively represent the lower and upper cutoffs used in the algorithm employed. (B) Us + Is scores across combinations of T1 and T2. The diagonal (T1 = T2) represents pure hierarchical clustering with different thresholds. The best scores are attained when T1 = 0.3 and T2 ≥ 0.5. (C) Determination of T2. T2 (upper cutoff) was determined by resampling of the pathway data at two dilution levels (27), 0.75 and 0.9. In both cases J = 0.7 was found to be the optimum in which a higher fraction of the original clustering is recovered.

 


Table 2.

Gene content versus name similarity of pathways and SuperPaths

 

Pathway clustering

We performed global pathway analysis aimed at assigning maximally informative pathway-related annotation to every human gene. For this, we converted the pathway compendium into a set of connected components (SuperPaths), each being a limited-size cluster of pathways. We aimed at controlling the size of the resulting SuperPaths, so as to maintain a high measure of annotation specificity and minimize redundancy.

The following two steps were used in the clustering procedure, in which pathways were connected to each other to form SuperPaths. i) Preprocessing of very small pathways: pathways smaller than 20 genes were connected to larger pathways (<200 genes) with a content similarity metric of ≥0.9 relative to the smaller partner. ii) The main pathway clustering algorithm: this was performed using the Jaccard similarity coefficient (J) metric (31) (see Materials and Methods). We used a combination (cf. 26) of modified nearest neighbor graph generation with a threshold T1 and hierarchical clustering with a threshold T2 (Figure 4A and Materials and Methods).
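The preprocessing step (i) can be sketched as follows. The sketch assumes that the 'content similarity metric relative to the smaller partner' is the fraction of the smaller pathway's genes that also occur in the larger pathway; the function names and toy gene sets are hypothetical.

```python
def containment(small, large):
    """Fraction of the smaller pathway's genes also present in the larger."""
    return len(small & large) / len(small) if small else 0.0


def preprocess_links(pathways, small_max=20, large_max=200, cutoff=0.9):
    """Pairs (small, large) to be connected before the main clustering."""
    links = []
    for s_name, s_genes in pathways.items():
        if not 0 < len(s_genes) < small_max:
            continue  # only very small pathways are preprocessed
        for l_name, l_genes in pathways.items():
            if (len(s_genes) < len(l_genes) < large_max
                    and containment(s_genes, l_genes) >= cutoff):
                links.append((s_name, l_name))
    return links


# Hypothetical toy data: a 10-gene pathway with 9 genes inside a 49-gene one.
toy = {
    "small": {f"g{i}" for i in range(10)},
    "big": {f"g{i}" for i in range(9)} | {f"h{i}" for i in range(40)},
}
links = preprocess_links(toy)

# The Jaccard coefficient of the same pair is only 9 / 50 = 0.18.
j_pair = len(toy["small"] & toy["big"]) / len(toy["small"] | toy["big"])
```

The toy pair has a containment of 0.9 but a Jaccard coefficient of 0.18, below the lower cutoff of 0.3, which illustrates why a size-aware step helps when pathway sizes differ greatly.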

To determine the optimal values of the thresholds T1 and T2, we defined two quantitative attributes of the clustering process. The first is US, the overall uniqueness of the set of SuperPaths. US elevation is the result of increasing pathway clustering, and reflects the gradual disappearance of redundancy, i.e. of cases in which certain gene sets are portrayed in multiple SuperPaths. The second parameter is IS, the overall informativeness of the set of SuperPaths. IS is a measure of how revealing a collection of SuperPaths is for annotating individual genes. It decreases with the extent of pathway clustering, reaching an undesirable minimum of one exceedingly large cluster, whereby identical SuperPath annotation is obtained for all genes. We thus sought an optimal degree of clustering whereby US + IS is maximized (Figure 4B and Materials and Methods).

Our procedure pointed to an optimum at T1 = 0.3 and T2 ≥ 0.5. Further fine tuning by data resampling suggested an optimal value of T2 = 0.7 (Figure 4C and Materials and Methods). This procedure resulted in the definition of 1073 SuperPaths, including 529 SuperPaths ranging in size from 2 to 70 pathways, and 544 singletons (one pathway per SuperPath) (Figures 3B and 5A). Each SuperPath had 3 ± 4.3 pathways (Figure 5A) and 82.7 ± 140.6 genes (Supplementary Figure S3A). The resultant set of SuperPaths indeed enhances the uniqueness US as depicted in Figure 5B.

Figure 5.


SuperPaths increase uniqueness while keeping high informativeness. (A) Number of pathways in hierarchical clustering versus SuperPath algorithm. The largest cluster with hierarchical clustering includes 1046 pathways, about 33% of the entire input, causing a great reduction of informativeness. In the SuperPath clustering the maximum cluster size is 70, about 2% of all pathways. (B) Increase in uniqueness (Us) following unification of pathways into SuperPaths.

 

The unification process resulted in relatively small changes in gene count distribution between the original pathways and the resultant SuperPaths (Supplementary Figure S3), suggesting a substantial preservation of gene groupings. Notably, applying pure hierarchical clustering (T1 = T2 = 0.3) resulted in a single very large cluster with 1046 pathways (Figure 3C) and with the same number of singletons, strongly deviating from the goal of specific pathway annotation for genes (Supplementary Figure S3B). This sub-optimal performance of pure hierarchical clustering is general; any of the examined cases of T1 = T2 (Figure 4B diagonal) shows a Us + Is value lower than that for T1 = 0.3, T2 = 0.7.

Each SuperPath is identified by a textual name derived from one of its constituent pathways selected as the most connected pathway (hub) in the SuperPath cluster. For simplicity, the option of de novo naming was not exercised. Selecting the hub’s name, as opposed to that of the largest pathway, was chosen since this tends to enhance the descriptive value for the entire SuperPath. When more than one pathway has the same maximal number of connections, the larger one is chosen.

SuperPaths make important gene connections

One of the major implications of the process of SuperPath generation is elucidating new connections among genes. This happens because genes that were not connected via any pre-unification pathway become connected through belonging to the same SuperPath. The unification into SuperPaths is important in two ways: first, it brings, under one roof, pathway information from 12 sources, each individually contributing ∼9000 to ∼5 million instances of gene pairing, for a total of 7.3 million pairs (Supplementary Figure S4). Second, by unifying into SuperPaths, the number of gene pairs is further enhanced, reaching 8.3 million (Supplementary Figure S4).

To test the significance of the million new gene–gene connections resulting from SuperPath generation, we checked their correlation with two independent measures of gene pairing. First, a comparison was made to publications shared among gene pairs (Figure 6A). We found that for gene pairs appearing in a SuperPath but not in any of its constituent pathways, there is a 4- to 75-fold increase in instances of >20 shared publications when compared with random pairs of genes with pathway annotation. Added gene pairs have significantly more shared publications than those randomly paired. Second, we performed a similar analysis based on protein–protein interaction information. We found that for the SuperPath-implicated gene pairs there was a 4- to 25-fold increase of PPIs with score >0.2 (Figure 6B) when compared with controls. SuperPaths thus provide significant gene partnering information not conveyed by any of their 3215 constituent individual pathways. This may be seen when performing gene set enrichment analysis on 830 differential expression sets and comparing the scores of SuperPaths to that of their constituent pathways, demonstrating that SuperPaths tend to receive more significant scores compared with their constituent pathways average score (Figure 7A).

Figure 6.


SuperPath-specific gene pairs are informative. (A) Shared publications. SuperPath-specific gene pairs are genes connected only by SuperPaths and not by any of the contained pathways. Enrichment of 10–100 is seen in the high abscissa values. The two distributions are significantly different (Kolmogorov–Smirnov P < 10^−100). There were no random gene pairs with 80–90 publications; this point was treated as having one such publication for computing the ratio. (B) Protein–protein interactions. Experimental interaction score from STRING (32) as depicted in GeneCards (12), for SuperPath versus random gene pairs as in panel A. The two distributions are significantly different (Kolmogorov–Smirnov P < 2.8 × 10^−61).

 
Figure 7.


SuperPath integration attributes. (A) SuperPaths outperform their constituent pathways in significance scores across 830 differentially expressed gene sets. (B) Number of included sources in non-singleton SuperPaths.

 

SuperPaths in databases

SuperPath information is available both in the GeneCards pathway section (Supplementary Figure S5A) and in PathCards (http://pathcards.genecards.org/; Supplementary Figure S5B), a GeneCards companion database presenting a web card for each SuperPath. PathCards allows the user a view of the pathway network connectivity within a SuperPath, as well as the gene lists of the SuperPath and of each of its constituent pathways. Links to the original pathways are available from the pathway database symbols, placed to the left of pathway names. PathCards has extensive search capacity including finding any SuperPath that contains a search term within its included pathway names, gene symbols and gene descriptions. Multiple search terms are afforded, allowing fine-tuned results. The search results can be expanded to show exactly where in the SuperPath-related text the terms were found. The list of genes in a PathCard utilizes graded coloring to designate the fraction of included pathways containing this gene, providing an assessment of the importance of a gene in a SuperPath. Other features, including gene list sorting and a search tutorial, are under construction. PathCards is updated regularly, together with GeneCards updates. A new version is released 2–3 times a year.

Discussion

Pathway source heterogeneity

This study highlights substantial mutual discrepancies among different pathway sources, e.g. with regard to pathway sizes, names and gene contents. The world of human biological pathways consists of many idiosyncratic definitions provided by mostly independent sources that curate publication data and interpret it into sets of genes and their connections. The idiosyncratic view of the different pathway sources is exemplified by the variation in pathway size distribution among sources (Table 1, Figure 2D), where some sources have overrepresentation of large pathways (QIAGEN), while others have mainly small pathways (HumanCyc). In some cases, the large standard deviation in pathway size (Table 1) is easily explained, as exemplified in the case of Reactome, which provides hierarchies of pathways and therefore contains a spectrum of pathway sizes. However, large standard deviations of pathway size are also observed in KEGG and QIAGEN—sources that are not hierarchical by definition. On the other hand, some sources (e.g. HumanCyc, PID and PharmGKB) have very little variation in their pathway sizes, revealing their focus on pathways of particular size. The idiosyncratic view provided by different sources is also evident when examining the genes covered by each source (Figure 2A), where some genes in the gene space are covered by only one source. This causes the unfavorable outcome that when unifying pathways, irrespective of the algorithm chosen, there is a relatively high proportion of single source pathway clusters. In order to account for the drawback of the Jaccard index to cope with large size differences between pathways, we added a preprocessing step to unify pathways that are almost completely included within other pathways (≥0.9 gene content similarity of the smaller pathway), thereby diminishing the barrier of variable pathway size between sources. 
Previously published isolated instances of intersource discrepancies include the lack of pathway source consensus for the TCA cycle (3) and fatty acid metabolism (2). The authors of both papers stress that each of their pathway sources has only a partial view of the pathway. For the TCA cycle example (3) there is an attempt to provide an optimal TCA cycle pathway by identifying genes that appear in multiple sources, but such manual curation is not feasible for a collection of >3000 biological pathways. In our procedure, 11 relevant pathways from four sources are unified into a SuperPath entitled ‘Citric acid cycle (TCA cycle)’ (Supplementary Figure S5). PathCards enables one to then view which genes are more highly represented within the constituent pathways. Our algorithm thus mimics human intervention, and greatly simplifies the task of finding concurrence within and among pathway sources.

Pathway unification

Combining several pathway resources has been attempted before, using different approaches. The first method is to simply aggregate all of the pathways in several knowledge bases into one database, without further processing. This approach is taken, for example, by NCBI’s Biosystems with 2496 human pathways from five sources (5) and by PathwayCommons with 1668 pathways from four sources (6). This was also the approach taken by GeneCards prior to the SuperPaths effort described here, where pathways from six sources were shown separately in every GeneCard. While this approach provides centralized portals with easy access to several pathway sets, it does not reveal interpathway relationships and may result in considerable redundancy. The second unification approach, taken by PathJam (7) and HPD (8), provides proteins versus pathways tables as search output. This scheme allows useful comparisons as related to specific search terms, but is not leveraged into global analyses of interpathway relations. A third line of action is exemplified by ConsensusPathDB (9), which integrates information from 38 sources, including 26 protein–protein interaction compendia as well as 12 knowledge bases with 4873 pathways. This allows users to observe which interactions are supported by each of the information sources. In turn, hiPathDB (10) integrates protein interactions from four pathway sources (1661 pathways) and creates ad hoc unified superpathways for a query gene, without globally generating consolidated pathway sets. Finally, a fourth methodology is employed by Pathway Distiller (11), which mines 2462 pathways from six pathway databases, and subsequently unifies them into clusters of several predecided sizes between 5 and 500, using hierarchical clustering.
The third method of interaction mapping, taken by ConsensusPathDB and hiPathDB, differs conceptually from the fourth method of clustering: the interaction mapping method provides information on the specific commonalities and discrepancies in protein interactions among sources with regard to specific keywords or genes, while the clustering method suggests which of the pathways are similar enough to be considered for the same cluster. Therefore, the third and fourth methods are complementary approaches aimed at utilization of pathway information in different observation levels, where the fourth (clustering) method is independent of user input or search in resultant consolidation. In the study described herein, we pursued a clustering method similar to the fourth methodology taken by Pathway Distiller, namely consolidation of pathways into clusters. However, in contrast to Pathway Distiller, our aim was to create a single coherent unification of biological pathways, which is essential for having a universal set of descriptors when looking at gene–gene relations. The resulting SuperPaths simplify the pathway-related descriptive space of a gene and reduce it 3-fold. Furthermore, the cutoffs in our algorithm are chosen to optimally adjust the criteria of uniqueness and informativeness, thereby reducing the subjective effect of choosing cutoffs arbitrarily or by predetermining the number of clusters.

SuperPath generation

A crucial element in our SuperPaths generation method is the definition of interpathway relationships. We have opted for the use of gene content, as described by others (11, 33). One could also consider the use of pathway name similarity (11). However, among the 3215 pathways analysed here, only 79 names were shared by more than one pathway, implying that the efficacy of such an approach would have been rather limited. Further, Table 2 and Supplementary Figure S2 indicate a relatively weak concordance between pathway names and their gene content. Specifically, among 79 name-identical pathway groups, 52 remained incompletely unified, again suggesting a limited usefulness for unifying based on pathway names. Many resources, including ConsensusPathDB (9), facilitate the option of finding pathways based on keywords in the name. Name sharing is thus a relatively trivial task to overcome when trying to find similar pathways. The more challenging goal is finding pathways that are similar in the biological process that they convey.

In this article we treated pathways as sets of genes, using gene content as a comparative measure and omitting topology and small molecule information. This approach was previously advocated as a means of reducing the complexity of pathway comparisons greatly (34). Further, most sources used in this study provide only the gene set information, hence topology information was unavailable. Finally, the high concordance between significance of pathway alignment and Jaccard coefficients ≥0.3 (P < 10^−52) indicates that the Jaccard coefficient is a good approximation of the more elaborate pathway alignment procedure (25).

SuperPath utility

A central aim of pathway source unification is enhancing the inference of gene-to-gene relations needed for pathway enrichment scrutiny (32, 35–40). To this end, we developed an algorithm for pathway clustering so as to optimize this inference and at the same time minimize redundancy.

Extending pathways into SuperPaths affords three major advantages. The first is augmenting the gene grouping used for such inference. Indeed, SuperPaths have slightly larger sizes than the original pathways, as evident by the SuperPath size distribution (Figure 2D). Nevertheless, comparing SuperPaths to pseudo-SuperPaths of the same size and quantity clearly shows that the increase in size does not account for the addition of true positive gene connections, as evident by the higher PPIs and larger counts of shared publications for SuperPath gene pairs (Figure 6). Subsequently, it is not surprising that SuperPaths outperform their average pathway constituent’s enrichment analysis scores (Figure 7A). SuperPaths are currently used in two GeneCards-related novel tools, VarElect (http://varelect.genecards.org/) and GeneAnalytics (http://geneanalytics.genecards.org/). A second advantage of SuperPaths is in the reduction of redundancy, since they provide a smaller, unified pathway set, and thus diminish the necessary statistical correction for multiple testing. We note that ConsensusPathDB (9) also provides an intersource integrated view of interactions. However, gene set analysis in ConsensusPathDB is only allowed for pathways as defined by the original sources. Finally, a third advantage of SuperPaths is their ability to rank genes within a biological mechanism via the multiplicity of constituent pathways within which a gene appears. This can be used not only to gain better functional insight but also to help eliminate suspected false-positive genes appearing in a minority of the pathway versions. A capacity to view such gene ranking is available within the PathCards database.

Limitations of SuperPaths

The SuperPath generation procedure appears incomplete, as about half of all SuperPaths are ‘singleton SuperPaths’ (labelled accordingly in PathCards), having only one constituent pathway. This is an outcome of the specific cutoff parameters used. However, this provides a useful indication to the user that a singleton pathway is distinct, differing greatly in its constituent genes from any other pathway.

This SuperPath generation process is intended to reduce redundancies and inconsistencies found when analysing the unified pathways. Although SuperPaths increase uniqueness as compared with the original pathway set (Figure 5B), some redundancy and inconsistency still remain within SuperPaths. There are cases of pathways with similar names that do not get unified into the same SuperPath. This happens because they have not met the unification criteria employed. We also note that similarity in name does not always indicate similarity in gene content (Figure 2B and C, Supplementary Figure S2B), and such events are faithfully conveyed to the user.

A clarifying example is that of the 40 pathways whose names include the string ‘apoptosis’. The final post-unification list has 10 SuperPaths whose name includes ‘apoptosis’. This obviously provides the user with a greatly simplified view of the apoptosis world. Yet, at the same time the outcome is replete with instances of two name-similar pathways being included in different SuperPaths. Employing a more stringent algorithm would result in over-clustering, which would in turn reduce informativeness (see Figure 3C).

In parallel, there are pathways with overlapping functions that are not consolidated into one SuperPath. For example, the pathway ‘integrated breast cancer pathway’ does not unify with the pathways ‘DNA repair’ and ‘DNA damage response pathway’, despite the strong functional relation of breast cancer with DNA damage and repair (41). This is because the relevant gene content similarity in the original pathway sources is small, respectively, J = 0.03 and 0.13. The need to view information on pathways with low pairwise similarity is addressed in Supplementary Figure S6, and is available as a text file upon request.

Finally, when looking at the number of contributing sources per SuperPath (Figure 7B), it is evident that the majority of SuperPaths comprise either one or two sources, and no SuperPath includes more than five. Although this integration limitation is evident, it mainly arises from the inherent biases in gene coverage for the different information sources (Figure 2A).

PathCards

Biological pathway information has traditionally been a central facet of GeneCards, the database of human genes (12, 42, 43). In previous versions, pathways were presented separately for each of the pathway sources, and it was difficult for users to relate the separate lists to each other. As a result of the consolidation into SuperPaths described herein, this problem has been effectively addressed. Thus, in every GeneCard, a table portrays all of a gene’s SuperPaths, each with its constituent pathways, with links to the original sources (Supplementary Figure S5A).

GeneCards is gene-centric and inherently does not present (Super) pathway-centric annotations. We therefore developed PathCards (http://pathcards.genecards.org/), a database that encompasses and displays such information in greater detail. PathCards has a page for every SuperPath, showing the connectivity of its included pathways, as well as gene lists for the SuperPath and its pathways. For every SuperPath, we also show a STRING gene interaction network (32) for the entire gamut of constituent genes, providing perspective on topological relationships within the SuperPath.

Supplementary Data

Supplementary data are available at Database Online.

Funding

This research is funded by grants from LifeMap Sciences Inc. California (USA) and the SysKid—EU FP7 project (number 241544). Support is also provided by the Crown Human Genome Center at the Weizmann Institute of Science. Funding for open access charge: LifeMap Sciences Inc. California (USA).

Conflict of interest. None declared.

Acknowledgements

We thank Prof. Eitan Domany and Prof. Ron Pinter for helpful discussions, as well as Dr. Noa Rappaport and Dr. Omer Markovich for assistance with clustering and visualization.

Footnotes

  • Citation details: Belinky, F., Nativ, N., Stelzer, G., et al. PathCards: multi-source consolidation of human biological pathways. Database (2015) Vol. 2015: article ID bav006; doi:10.1093/database/bav006

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.


Protein-Protein Interaction Database


The MIPS Mammalian Protein-Protein Interaction Database


The MIPS Mammalian Protein-Protein Interaction Database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. We took great care to include only data from individually performed experiments since they usually provide the most reliable evidence for physical interactions.

Search the database

To suit different users' needs we provide a variety of interfaces to search the database:

Background

Protein-protein interactions (PPI) represent a pivotal aspect of protein function. Almost every cellular process relies on transient or permanent physical binding of two or more proteins in order to accomplish the respective task. Comprehensive databases of PPI in Saccharomyces cerevisiae have proved to be invaluable resources for both bioinformatics and experimental research and are used heavily in the scientific community.

Although yeast is a well established model organism, not all interactions in higher eukaryotes have equivalent counterparts in unicellular systems. Currently, publicly available PPI databases contain comparatively few entries from mammals so we embarked on building a high-quality, manually curated database of protein-protein interactions in mammals.

Conditions of use

You are free to use the database as you please including full download of the dataset for your own analyses as long as you cite the source properly:

Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stümpflen V, Mewes HW, Ruepp A, Frishman D
The MIPS mammalian protein-protein interaction database
Bioinformatics 2005; 21(6):832-834; [Epub 2004 Nov 5]   doi:10.1093/bioinformatics/bti115   PubMed

Other PPI resources

There are plenty of interesting databases and other sites on protein-protein interactions. Currently we are aware of the following PPI resources:

Resource Comments
APID Agile Protein Interaction DataAnalyzer (Cancer Research Center, Salamanca, Spain)
BIND Biomolecular INteraction Network Database at the University of Toronto, Canada. No species restriction
CYGD PPI section of the Comprehensive Yeast Genome Database. Manually curated comprehensive S. cerevisiae PPI database at MIPS
DIP Database of Interacting Proteins at UCLA. No species restriction.
GRID General Repository for Interaction Datasets. Mount Sinai Hospital, Toronto, Canada
HIV Interaction DB Interactions between HIV and host proteins.
HPRD The Human Protein Reference Database. Institute of Bioinformatics, Bangalore, India and Johns Hopkins University, Baltimore, MD, USA.
HPID Human Protein Interaction Database. Department of Computer Science and Information Engineering, Inha University, Inchon, Korea
iHOP iHOP (Information Hyperlinked over Proteins). Protein association network built by literature mining
IntAct Protein interaction database at EBI. No species restriction.
InterDom Database of putative interacting protein domains. Institute for InfoComm Research, Singapore.
JCB PPI site at the Jena Centre for Bioinformatics, Germany
MetaCore Commercial software suite and database. Manually curated human PPIs (among other things). GeneGo
MINT Molecular INTeraction database at the Centro di Bioinformatica Molecolare, Università di Roma, Italy.
MRC PPI links Commented list of links to PPI databases and resources maintained at the MRC Rosalind Franklin Centre for Genomics Research, Cambridge, UK
OPHID The Online Predicted Human Interaction Database. Ontario Cancer Institute and University of Toronto, Canada.
Pawson Lab Information on protein-interaction domains.
PDZbase Database of PDZ mediated protein-protein interactions.
Predictome Predicted functional associations and interactions. Boston University.
Protein-Protein Interaction Server Analysis of protein-protein interfaces of protein complexes from PDB. University College of London, UK.
PathCalling Proteomics and PPI tool/database. CuraGen Corporation.
PIM Hybrigenics PPI data and tool, H. pylori. Free academic license available
RIKEN Experimental and literature PPIs in mouse.
STRING Protein networks based on experimental data and predictions at EMBL.
YPD “BioKnowledge Library” at Incyte Corporation. Manually curated PPI data from S. cerevisiae. Proprietary.

If we forgot to list your favorite PPI resource or you are providing one yourself please let us know – we will be happy to include it.

PPI related software

Resource Comments
aiSee Commercial graph layout software
Cytoscape Open source software for visualization of PPI networks and data integration
graphviz Graph layout software

Download

You can get the full dataset here (PSI-MI format).

Acknowledgements

This work is funded by a grant from the German Federal Ministry of Education and Research. It is part of the initiative “Bioinformatics for the Functional Analysis of Mammalian Genomes” (BFAM).

Contact

For questions, criticism, praise and suggestions please contact Dr. Philipp Pagel or Dr. Dmitrij Frishman.

An evaluation of human protein-protein interaction data in the public domain

  • Suresh Mathivanan,
  • Balamurugan Periaswamy,
  • TKB Gandhi,
  • Kumaran Kandasamy,
  • Shubha Suresh,
  • Riaz Mohmood,
  • YL Ramachandra and
  • Akhilesh Pandey
BMC Bioinformatics 2006, 7(Suppl 5):S19

DOI: 10.1186/1471-2105-7-S5-S19

Published: 18 December 2006

Abstract

Background

Protein-protein interaction (PPI) databases have become a major resource for investigating biological networks and pathways in cells. A number of repositories for human PPIs are currently publicly available. Each of these databases has its own unique features, with a large variation in the type and depth of annotations.

Results

We analyzed the major publicly available primary databases that contain literature curated PPI information for human proteins. This included BIND, DIP, HPRD, IntAct, MINT, MIPS, PDZBase and Reactome databases. The number of binary non-redundant human PPIs ranged from 101 in PDZBase and 346 in MIPS to 11,367 in MINT and 36,617 in HPRD. The number of genes annotated with at least one interactor was 9,427 in HPRD, 4,975 in MINT, 4,614 in IntAct, 3,887 in BIND and <1,000 in the remaining databases. The number of literature citations for the PPIs included in the databases was 43,634 in HPRD, 11,480 in MINT, 10,331 in IntAct, 8,020 in BIND and <2,100 in the remaining databases.

Conclusion

Given the importance of PPIs, we suggest that submission of PPIs to repositories be made mandatory by scientific journals at the time of manuscript submission as this will minimize annotation errors, promote standardization and help keep the information up to date. We hope that our analysis will help guide biomedical scientists in selecting the most appropriate database for their needs especially in light of the dramatic differences in their content.

Background

Protein-protein interactions (PPI) are essential for almost all cellular functions. Proteins seldom carry out their function in isolation; rather, they operate through a number of interactions with other biomolecules. Experimental elucidation and computational analysis of the complex networks formed by individual protein-protein interactions (PPIs) is one of the major challenges in the post-genomic era. PPI databases have thus become valuable resources for the systematic analysis of the molecular networks of a cell [1, 2]. With the accumulation of PPIs from high-throughput experiments, it is increasingly important to store such data for easy retrieval and analysis [3]. Several databases have compiled protein interactions based on manual curation of the scientific literature, automated text mining of articles or computational predictions. In this review, various features of nine different databases are evaluated, including compliance with emerging data standards such as proteomics standards initiative – molecular interaction (PSI-MI) format [4] and BioPAX [5], which define a unified framework for sharing PPI and pathway information, respectively.

Human protein-protein interaction databases

Protein interaction repositories can be broadly classified into 2 types based on their content: i) Those containing interactions supported by experimental evidence, or, ii) Those containing interactions derived from in silico predictions alone, or, mixed together with experimentally derived PPIs. Here, we evaluate only those databases that exclusively contain experimentally derived PPI data in humans.

Curated literature based repositories have two major mechanisms of incorporating PPIs supported by experimental validation: i) curation by biologists from the literature, or, ii) direct deposit of the experimentally derived PPIs prior to publication by an investigator. Currently, the majority of PPIs in most databases are from curation of the literature. If all scientific journals mandated that PPIs be submitted to repositories as a requirement for publication (as is currently the case with nucleotide sequences), the databases would not only become more comprehensive but perhaps also contain fewer annotation errors. Below, we will briefly describe salient features of nine major PPI databases.

Human Protein Reference Database (HPRD)

HPRD contains annotations pertaining to human proteins based on experimental evidence from the literature [6, 7]. This includes PPIs as well as information about post-translational modifications, subcellular localization, protein domain architecture, tissue expression and association with human diseases. In addition to interactions of proteins with other proteins, HPRD also reports interactions of proteins with nucleic acids and small molecules. The PPI data are subclassified as binary or complex interactions based on topology and the number of participants. Binary PPIs are direct interactions between two proteins, while complexes represent interactions with more than 2 participants in which the topology of interaction is unknown. Relevant publications are cited for each interaction. The type of experiment is also indicated as in vivo (e.g. coimmunoprecipitation), in vitro (e.g. GST pull-down assays) or yeast two-hybrid. Information about post-translational modifications includes the residue of modification, type of experiment and the upstream enzyme. These modifications can be viewed alongside the protein domain architecture. Each protein is linked to a genome browser, GenProt Viewer [8], which allows protein and transcript information to be visualized in the context of the relevant gene. HPRD is also linked to a compendium of signal transduction pathways, NetPath [9], which is freely available in several different formats. This database includes a tool called PhosphoMotif Finder, which reports the presence of any of over 320 phosphorylation-based motifs curated from the literature in a protein of interest. HPRD also incorporates a new feature, Protein Distributed Annotation System (PDAS), which allows researchers to contribute and share their data with the rest of the community. All interaction information can be downloaded from the website either in PSI-MI format or as tab delimited files.

IntAct

The PPI information in the IntAct database includes a brief description of the interaction, experimental method and the literature citation of human proteins as well as proteins derived from several other species [10, 11]. Whenever possible, PPI information is isoform specific. The database can be accessed by either a basic or advanced search. The latter provides the user with additional querying options such as experimental method or controlled vocabulary terms listed in PSI-MI. IntAct also has a tool which predicts best baits for pull-down experiments in humans by prioritizing the proteins which have the highest likelihood of being highly connected, or hubs, based on the available data within IntAct for various species – this is termed Pay-As-You-Go algorithm. Additional software developed as part of the IntAct project includes HierarchView, which depicts interaction networks as 2-dimensional graphs and highlights nodes based on a GO category specified by the user (e.g. cellular component).

Molecular INTeraction database (MINT)

MINT is a repository of experimentally verified protein interactions with special emphasis on mammalian interactions [12, 13]. It also features interactions involving non-protein entities such as promoter regions and mRNA transcripts. PPI information includes binary and complex interactions and is isoform specific. Each interaction is given a confidence score based on the number of interactions and type of experiment and the number of citations provided for each interaction. The interactors can be viewed graphically using the ‘MINT Viewer,’ which permits users to view interactors as a network, and to manipulate it such that only the proteins of interest are shown. Users can expand the network by dragging individual interactors, select and visualize PPIs based on confidence scores, and they can also export the data in flat files, PSI-MI format or to Osprey, a system developed for visualizing and manipulating network data [14]. The interaction data are displayed along with the corresponding Swiss-Prot annotation. Proteins with a role in genetic diseases (according to OMIM (Online Mendelian Inheritance in Man)) are further highlighted. MINT features a separate annotation of human PPIs called HomoMINT, which includes in addition to literature derived data information from other organisms mapped to their human orthologs.

Database of Interacting Proteins (DIP)

PPI data stored in DIP were obtained through manual curation of the scientific literature and include direct and complex interactions [15, 16]. The JDIP is a Java application based visualization tool; it provides a graphical representation of interactions. New high-throughput experimental and predicted PPI data can be evaluated through other services provided by DIP such as Paralogous Verification Method (PVM), Expression Profile Reliability (EPR) [17] and Domain Pair Verification (DPV) [18]. PVM validates interacting pairs by showing the existence of paralogous interactions; EPR validates comparison based on common expression profiles of interactors and DPV validates through domain-domain interaction preferences. Other satellite projects, Live-DIP and DLRP, use the DIP database for accessing the interactions. Live-DIP annotates proteins under different physiological conditions [19] whereas DLRP annotates protein-ligand and protein-receptor pairs known to interact with each other [20].

MIPS Database

MIPS database consists of mammalian interaction data manually curated from the literature [21, 22], and includes experiment type, description of the interaction and binding regions of interacting partners (where available). Data from mass spectrometry and yeast two-hybrid studies are not included. PPIs can be queried based on interaction partners, experimental method, and functional aspects of the PPIs. The results can be retrieved in 2 formats – long and short. The long format details the interaction, including reference, experimental details, binding sites for each protein and a short comment on each interaction, its functional significance or the immediate outcome of the interaction. The short format is restricted to listing the interacting proteins. Both formats are also linked to visualization tools. Each protein is further linked to the corresponding annotation in the mouse PEDANT genome database developed by the same group; which contains pre-computed bioinformatics analyses of publicly available genomes [23].

Alliance For Cellular Signaling (AfCS)

The AfCS is a multidisciplinary, multi-institutional consortium that studies cellular signaling [24, 25]. “Molecule Pages” in the AfCS database provide qualitative and quantitative information on signaling molecules (mostly murine) and their interactions; – these include results of experiments carried out by the Alliance in addition to literature-derived data. The molecule pages contain automated as well as author-entered data. The former integrate DNA/protein sequence information and structural details along with basic biophysical and biochemical properties from external databases, whereas the latter consist of data manually curated from the literature. This is further assessed by AfCS-appointed editorial board members and anonymously peer-reviewed in a process established by the Nature Publishing Group. The curated data includes a textual description of protein function, regulation of activity, subcellular localization, major sites of expression, splice variants and phenotype of knockout animals. The interaction data are derived from murine proteins, or, if they are from other species, the interaction is mapped to the corresponding mouse orthologs. For some proteins, the annotations include descriptions of signaling molecules under different physiological conditions termed ‘states’ (e.g. binding of a phosphorylated protein with another protein). A number of signaling pathway maps are also available in this database. We have not considered this database in our comparison mainly because of its focus on murine, and not human, proteins.

Biomolecular Interaction Network Database (BIND)

BIND is a database of biomolecular associations that are classified into 3 categories: binary molecular interactions, molecular complexes and pathways [26, 27]. In BIND, a molecular complex is a collection of two or more molecules that associate to form a functional unit in a cell. These records are supplemented with additional information such as complex topology and the number of subunits involved in the interaction. Pathways are a collection of two or more interactions that occur in a defined sequence within a living system; currently 8 pathways have been annotated. Data pertaining to 1473 organisms are available in BIND. Information on molecular associations is obtained from the literature. The majority of the interactions in BIND are PPIs, although it includes some interactions with nucleic acids and small molecules as well. The function of proteins is depicted using ontoglyphs, a series of symbolic characters representing a high-level summary of Gene Ontology (GO) information, and proteoglyphs, symbols used to represent the structural and binding properties of proteins at the level of conserved domains. Data in BIND can be queried using various database identifiers or by a BLAST search. BIND also stores biomolecular interactions for several other species. For yeast high-throughput PPI datasets, BIND provides a confidence measure based on text mining of publications, existence of homologous interactions, common and related GO annotations, domain composition and phenotypic profiling for the evaluation. The data can be downloaded in flat file and PSI-MI formats and the pathways can be exported to ‘sif’ format, which allows visualization by Cytoscape, a software tool developed for visualization and manipulation of pathway data [28]. BIND offers a Simple Object Access Protocol (SOAP) interface for those who wish to access the data from third-party software. BIND also has data imports from FlyBase, MIPS, MGI etc., and entries can be queried through various sources (e.g. Wormbase and KEGG).

Reactome

Reactome is a curated knowledgebase of biological pathways [29, 30]. The goal of Reactome is to develop a curated resource of pathways and biochemical reactions in humans; however, many of the reactions are also obtained via transfer from other species. The basic unit of this database is a reaction. Information on reactions is either derived from experiments in the literature or is an electronic inference based on sequence similarity. Reactions are also inferred in humans based on the putative human orthologs for the proteins that participate in the same reaction in other species. In such cases, the model organism reaction is annotated in Reactome, the inferred human reaction is annotated as a separate event, and the inferential link between the two reactions is explicitly noted. Each reaction is detailed with input, output, preceding and following events of the reaction, cellular component of the reaction and species of its occurrence. Each reaction is linked to pathways according to the order of reactions in the corresponding pathway. The available pathways are integrated and represented graphically as a series of constellations in a ‘starry sky.’ This can be used to navigate through the reactions in biological pathways and visualize connections between them. It must be cautioned that the definition of PPIs in Reactome is quite broad: the interactions can be represented as ‘direct complex,’ ‘indirect complex,’ ‘reaction’ or ‘neighboring reaction.’ In a ‘direct complex,’ interactions occur between proteins present in the same complex and are not true pairwise interactions. ‘Indirect complexes’ contain interactions between interactors in different subcomplexes of a complex. ‘Reactions’ are interactions between proteins that participate in a reaction and the interactors are not reported to be in a complex. ‘Neighboring reactions’ represent the interactors that participate in 2 consecutive reactions, i.e. when one reaction produces a product, which is either an input or a catalyst for another reaction. The information is edited by the Reactome staff at Cold Spring Harbor Laboratory and the European Bioinformatics Institute and is then reviewed by other biological researchers for consistency and accuracy. Each reaction or pathway can be exported to Systems Biology Markup Language (SBML) and BioPAX formats. Reactome also provides tools such as Pathfinder and Skypainter. Pathfinder can identify pathways that connect input with output molecules while Skypainter allows the coloring of reaction maps based on user-specified identifiers that have been linked to each pathway. For our analysis, we have considered only the ‘direct complexes’ as they are the category most likely to correspond to true PPIs.

PDZBase

PDZBase is a database that focuses only on PPIs involving proteins with PDZ domains [31, 32]. Only those interactions involving the PDZ domain that have been confirmed by individual in vitro or in vivo biochemical experiments are considered. Thus, interactions discovered solely through high-throughput methods (e.g. yeast two-hybrid or mass spectrometry) are not included in PDZBase. PDZ domains and their ligands can be queried using sequence motifs. Each interaction in PDZBase consists of the residues of the interacting proteins on a 2D-diagram generated by a residue-based-diagram-editor (RBDG). The interacting residues between the PDZ domain and their peptide ligands are predicted based on similarity with the available structures of PDZ-peptide complexes.

Strategy used for comparison of datasets

The datasets were downloaded from the download sites of PPI databases on October 2, 2006, and scripts were used for parsing out the protein pairs involved in PPIs along with the experiment type and literature references, if provided. The PPIs were further parsed to extract binary interactions for those protein pairs where both proteins were human. Most databases had Swiss-Prot as one of their accession identifiers except BIND, which provided RefSeq, GenBank and PDB identifiers. To determine the overlap among databases, the Swiss-Prot or RefSeq identifiers were mapped to the corresponding Entrez Gene identifiers as of October 2, 2006. Scripts were used to convert these PPIs into a non-redundant list of PPIs (if proteins A and B interact, the dataset may have two PPIs, A-B and B-A; only one of the two was retained for our analyses). All datasets were compared with each other to obtain the overlap at PPI and protein levels. Experiment types extracted for PPIs were mapped to the PSI-MI vocabulary list. Disease annotations for genes were obtained from OMIM and mapped to gene symbols to obtain the number of proteins in PPIs corresponding to disease-associated genes.
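The non-redundancy step and the pairwise overlap described above can be sketched as follows. This is an illustration, not the authors' scripts: the function names are hypothetical and the identifier strings are made-up stand-ins for gene-level identifiers after mapping.

```python
def non_redundant(pairs):
    """Collapse A-B and B-A into a single unordered pair.

    Each pair is stored as a sorted tuple, so orientation no longer matters.
    Homodimers (A-A) are kept as (A, A).
    """
    return {tuple(sorted(pair)) for pair in pairs}


def overlap(db_a, db_b):
    """PPIs present in both datasets after removing redundancy."""
    return non_redundant(db_a) & non_redundant(db_b)


# Hypothetical gene-level identifiers, for illustration only.
db_one = [("1017", "595"), ("595", "1017"), ("7157", "4193"), ("7157", "7157")]
db_two = [("4193", "7157"), ("672", "580")]

unique_one = non_redundant(db_one)   # 3 unique pairs remain out of 4 records
shared = overlap(db_one, db_two)     # the single pair found in both datasets
```

Applying the same normalization to every dataset before comparison is what makes the overlap counts independent of the order in which each database happens to list the two interactors.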

Caveats of comparing PPI data

Assessment of the accuracy of annotation of all PPIs in various publicly available databases is beyond the scope of this article. In this study, we have tried to evaluate parameters that could be measured objectively. Nevertheless, there are still a number of caveats of any analysis comparing PPIs. Below is a list of some of the potential pitfalls and our strategies to tackle them.

  1. Binary interactions including homodimers were considered for this analysis while complex interactions were not. It is not easy to look at complex interactions across databases, especially for comparison purposes, although ‘spoke’ and ‘matrix’ models have been described previously for comparing protein complexes [33]. In this study, we have chosen not to compare the complex interactions because of the predictive nature of these models. However, cases where a protein complex was already converted into binary PPIs by using one of these models (e.g. use of the ‘matrix’ model to computationally predict PPIs in Reactome) were treated as binary interactions.

  2. Some of the binary interactions involved proteins that were non-human. Mapping of orthologs is not an easy task and is not standardized. Thus, we did not attempt to map the human orthologs for proteins from any other species that were listed as interacting proteins.

  3. We mapped all protein isoforms to a unique gene and then examined the overlaps. This was done because often a given isoform is annotated as an interacting protein although the interaction is not specific to that isoform. For example, this strategy allowed us to correctly capture PPIs as overlapping where a given protein was annotated as interacting with one isoform of another protein in one database and with another isoform of that protein in another database.
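To make the ‘spoke’ and ‘matrix’ models named in the first caveat concrete, the sketch below expands a protein complex into binary pairs under each model, as the models are commonly described: the spoke model pairs a designated bait with every other member, while the matrix model pairs every member with every other member. The protein labels and function names are hypothetical.

```python
from itertools import combinations


def spoke(bait, preys):
    """Spoke model: the bait is paired with each co-purified protein."""
    return {tuple(sorted((bait, prey))) for prey in preys if prey != bait}


def matrix(members):
    """Matrix model: every member is paired with every other member."""
    return {tuple(sorted(pair)) for pair in combinations(set(members), 2)}


# Hypothetical four-member complex with "A" as the bait.
complex_members = ["A", "B", "C", "D"]

spoke_pairs = spoke("A", ["B", "C", "D"])  # n - 1 = 3 predicted pairs
matrix_pairs = matrix(complex_members)     # n * (n - 1) / 2 = 6 predicted pairs
```

Both expansions predict pairs that were never individually tested, and the matrix model predicts more of them, which illustrates why the analysis above declined to compare complexes directly.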

Results and Discussion

Comparison of PPI data

Table 1 summarizes the salient features of each database including total number of PPIs, total number of proteins, method of detection of PPIs, curation methodology, download options and URL links. The availability of data as a downloadable file is also indicated. Fig. 1A shows the distribution of the number of PPIs in each of the literature-based curated databases considered in our analysis. For each database, the total number of human PPIs present in the statistics page or in the downloaded files is shown along with the number of unique (non-redundant) binary human PPIs calculated by us. For this calculation, we only considered binary PPIs in which both members of an interacting pair were human proteins. As explained above, protein complexes were excluded from this analysis because it is difficult to ascertain the topology (i.e. which protein interacts with which protein in a complex) for determining overlap between datasets. The difference between the total and non-redundant PPIs in HPRD is due to protein complexes, whereas in all other databases it is mainly due to the redundancy of PPIs. The distribution of PPI data in Fig. 1A shows a dramatic variation across these databases.

Table 1

Unique features of human PPI databases

Database | Number of unique human PPIs | Number of proteins | PPI data | Unique features | Download options | PSI-MI compatibility | Download version number
--- | --- | --- | --- | --- | --- | --- | ---
HPRD | 36,617 | 9,427 | Experimental | Protein annotations are included (e.g. PTMs, substrate information, tissue expression, disease association, protein complexes, subcellular localization). Signal transduction pathways | Yes | Yes | Release 6
BIND | 6,621 | 3,887 | Experimental | Protein complexes, biological pathways, non-protein interactions, data for >1473 organisms | Yes | Yes | 20060525
DIP | 1,067 | 804 | Experimental | PPIs for other organisms, protein complexes | Yes | Yes | Hsapi20060402
MINT | 11,367 | 4,975 | Experimental | PPIs for other organisms, non-protein interactions | Yes | Yes | Version 18
PDZBase | 101 | 115 | Experimental | PPIs involving PDZ domains. Prediction of residues that interact | No | No | October 2, 2006
MIPS | 346 | 405 | Experimental | PPIs for other organisms | Yes | Yes | October 2, 2006
IntAct | 10,244 | 4,614 | Experimental | Protein complexes, PPIs for other organisms, non-protein interactions, provides web-based applications, ProViz and Hierarch View, for visualization of interactions | Yes | Yes | 2006-09-22
AfCS | Mostly mouse interactions | Mostly mouse proteins | Experimental | Protein annotations are included (e.g. function, subcellular localization, orthologs, tissue expression, mouse knockout phenotype information, PTMs) | No | No | -
Reactome | 5,960 | 970 | Experimental, automated and predicted | Biological pathways for several organisms. Navigation through reactions in biological pathways and visualizing connections between them | Yes | No | Version 18

https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-7-S5-S19/MediaObjects/12859_2006_Article_1369_Fig1_HTML.jpg

 

Figure 1

Protein-protein interactions (PPIs) deposited in publicly available literature derived human PPI databases. (A) Human PPIs present in interaction databases. The distribution of the number of PPIs annotated in each database is shown. For each bar, ‘total’ refers to the number of PPIs listed (i.e. claimed) at the database websites or number of PPIs in the downloaded datasets while the orange portion represents the number of human non-redundant direct PPIs calculated by us. (B) Distribution of the number of interacting proteins. Different scales are used to depict the number of proteins annotated with 1–10, 11–20, or 21–30 or higher number of PPIs per protein. All datasets were downloaded on October 2, 2006.

 

It is difficult to assess the depth of PPI coverage from the total number of interactions alone; thus, we analyzed the distribution of the number of proteins in each database according to the number of binary (i.e. direct) interactions per protein. The majority of proteins in all databases have <10 interaction partners (Fig. 1B). The number of proteins that fall into the 31–40 and 41–50 PPI bins is high in HPRD and Reactome. Although in HPRD these proteins are of many different types, those in Reactome belong mainly to two classes: proteasomal or ribosomal protein complexes. The number of interactions for these two classes of proteins in Reactome is high because a ‘matrix’ model of interpreting protein complexes is used, in which every protein in a complex is considered connected to every other protein in that complex. All other databases show the same trend, with a greater number of proteins in bins with a lower number of PPIs per protein. This does not automatically imply that most proteins truly interact with a small number of interactors. Rather, it is likely because not all proteins have been studied thoroughly and because not all published interactions have yet been included in these databases. Additionally, experimental methods are biased in the interactions they can capture (e.g. the yeast two-hybrid system does not generally detect interactions involving integral membrane proteins). Overall, most databases contain a very small number of proteins with >30 PPIs.
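To make the effect of the ‘matrix’ model concrete, the following minimal Python sketch expands a complex into all-against-all pairs and then bins proteins by their number of interaction partners, in the spirit of Fig. 1B. The complex, its subunit names and the function names are invented for illustration; the bin width of 10 follows the bins mentioned in the text.

```python
from collections import Counter
from itertools import combinations


def matrix_model(complex_members):
    """Interpret a complex as all-against-all binary pairs ('matrix' model)."""
    return {tuple(sorted(pair)) for pair in combinations(set(complex_members), 2)}


def degree_per_protein(pairs):
    """Count the number of binary interactions annotated for each protein."""
    degree = Counter()
    for left, right in pairs:
        degree[left] += 1
        if right != left:  # a self-interaction is counted once
            degree[right] += 1
    return degree


def bin_label(degree, width=10):
    """Map a degree to a bin label such as '1-10', '11-20', ..."""
    low = ((degree - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


def degree_histogram(pairs, width=10):
    """Number of proteins per degree bin."""
    return Counter(bin_label(k, width) for k in degree_per_protein(pairs).values())


# Hypothetical 12-subunit complex plus two ordinary binary interactions.
subunits = [f"S{i}" for i in range(1, 13)]
complex_pairs = matrix_model(subunits)
binary_pairs = {("A", "B"), ("A", "C")}

histogram = degree_histogram(complex_pairs | binary_pairs)
print(len(complex_pairs))  # 66 pairs from a single 12-member complex
print(dict(histogram))
```

A single 12-member complex already yields 66 pairs and pushes every subunit into the 11–20 bin, which shows how a few large complexes can dominate the upper bins of such a distribution.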

Comparison of proteins annotated with PPIs

We determined the total number of unique genes represented in the PPI databases (Fig. 2A). In HPRD, proteins encoded by 9,427 genes have at least one direct PPI annotated (out of ~20,000 proteins annotated in this database), while BIND, IntAct and MINT contain 3,887, 4,614 and 4,975 proteins, respectively. Other databases such as DIP, Reactome, MIPS and PDZBase contain PPIs for <1,000 proteins.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-7-S5-S19/MediaObjects/12859_2006_Article_1369_Fig2_HTML.jpg

 

Figure 2

Protein coverage across human PPI databases. (A) The total number of non-redundant genes whose protein products are annotated in the databases with at least one PPI. (B) The number of proteins encoded by human disease-associated genes listed in OMIM database with at least one PPI.

 

Proteins encoded by disease-associated genes in PPIs

PPIs are attractive as potential targets for small-molecule drugs for the treatment of diseases. We checked for proteins encoded by genes listed in the OMIM database that are mutated in inherited genetic disorders (Fig. 2B). HPRD has all human disease-associated genes listed in OMIM, of which 1,463 have at least one protein interactor, while most of the other databases contain significantly fewer proteins encoded by these genes.

Overlap of PPIs and proteins between databases

As discussed above, there is a significant difference in the total number of PPIs in the various databases. However, this statistic does not provide an idea of the extent to which the PPIs actually overlap across databases. As shown in Fig. 3A, HPRD contains a high proportion of human PPIs that are present in other literature-derived curated databases. The overlap between IntAct (10,244 PPIs) and MINT (11,367 PPIs) is 7,362, which is the highest overlap among the remaining literature-derived databases; the overlap between BIND (6,621 PPIs) and MINT (11,367 PPIs) is only 1,463 and there is no overlap between PDZBase and DIP.
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-7-S5-S19/MediaObjects/12859_2006_Article_1369_Fig3_HTML.jpg

 

Figure 3

Overlap of PPIs and proteins in human PPI databases. (A) Pairwise overlap of protein interactions across databases is shown in the cells. The number of non-redundant direct PPIs present in each database is shown in parentheses for each database. (B) Pairwise overlap of proteins across databases is shown in the cells. The number of non-redundant proteins present in each database is shown in parentheses for each database.

 

To determine whether the overlap is small because proteins are not annotated in the different databases, we looked at the overlap between databases at the protein level. As shown in Fig. 3B, the overlap of proteins between BIND (3,887 proteins) and IntAct (4,614 proteins) is 1,969, but the overlap at the PPI level is only 1,167. HPRD contains 76% and MINT contains 51% of the proteins in Reactome, although there is a very low overlap at the level of PPIs across these databases. Overall, although there is a good overlap between the databases at the protein level, the PPIs do not overlap as much. The average degree (K) of a protein, i.e. the number of interactions that a protein has with other proteins, is 7.6 for HPRD, while that for MIPS, PDZBase, DIP, BIND, MINT and IntAct ranges from 1.7 to 4.5. Strikingly, the average degree of a protein in Reactome is 12.2, which is a result of interpreting protein complexes through the ‘matrix’ model as explained above.
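The two levels of comparison used here (overlap of interactions versus overlap of proteins) and the average degree can be illustrated with a small Python sketch. The data are toy values, not the published datasets, and the function names are hypothetical. Each database is assumed to be a set of gene pairs already stored in sorted order, and a self-interaction is assumed to add one to a protein's degree; the original analysis may have handled these details differently.

```python
from itertools import combinations


def proteins_in(pairs):
    """All proteins that take part in at least one interaction."""
    return {protein for pair in pairs for protein in pair}


def average_degree(pairs):
    """Mean number of annotated interactions per protein."""
    proteins = proteins_in(pairs)
    if not proteins:
        return 0.0
    endpoints = sum(1 if left == right else 2 for left, right in pairs)
    return endpoints / len(proteins)


def pairwise_overlap(databases):
    """Overlap of PPIs and of proteins for every pair of databases."""
    result = {}
    for name_a, name_b in combinations(sorted(databases), 2):
        pairs_a, pairs_b = databases[name_a], databases[name_b]
        result[(name_a, name_b)] = {
            "ppis": len(pairs_a & pairs_b),
            "proteins": len(proteins_in(pairs_a) & proteins_in(pairs_b)),
        }
    return result


# Toy databases: pairs are tuples of gene symbols in sorted order.
databases = {
    "DB_X": {("A", "B"), ("A", "C"), ("B", "C")},
    "DB_Y": {("A", "B"), ("A", "D"), ("C", "D")},
}

overlaps = pairwise_overlap(databases)
print(overlaps[("DB_X", "DB_Y")])  # {'ppis': 1, 'proteins': 3}
print(average_degree(databases["DB_X"]))  # 2.0
```

In the toy example the two databases share three proteins but only one interaction, the same pattern of good protein-level overlap and poor PPI-level overlap that the text reports.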

We also compared a test set of proteins to check the distribution of interaction partners across different databases (Table 2). Test proteins were required to be present in four or more databases, because not a single protein was common to all databases. The proteins were further selected to cover several different types of biological processes, to avoid any potential bias in the event that a particular database is especially ‘strong’ in certain types of annotations. As shown in Table 2, Caspase 3 (CASP3) has 126 protein interaction partners annotated in HPRD, while BIND, MINT, IntAct and Reactome contain 15, 6, 3 and 1 interactions, respectively. S-phase kinase-associated protein 1A (SKP1A) has 35 PPIs in HPRD, 11 in BIND, 5 in DIP and 13 in MINT. MIPS and PDZBase do not contain any PPIs for this protein. Nuclear factor kappa-B subunit 3 (RELA) has 98 protein interaction partners in HPRD, while BIND, MINT, DIP and IntAct contain 13, 103, 13 and 90 PPIs, respectively. Overall, for most proteins, there is at least one database, and often several, that does not contain any PPI annotations (Table 2). This again reflects the fact that the databases are still at an early stage of curation and annotation of published PPIs.

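Table 2

Comparison of protein-protein interactions for a test set of proteins.

Protein | HPRD | BIND | DIP | MINT | IntAct | MIPS | PDZBase | Reactome
--- | --- | --- | --- | --- | --- | --- | --- | ---
CASP3 | 126 | 15 | 0 | 6 | 3 | 0 | 0 | 1
CDK2 | 71 | 16 | 9 | 11 | 12 | 2 | 0 | 2
TBP | 81 | 17 | 14 | 12 | 15 | 2 | 0 | 14
TNFRSF1A | 43 | 11 | 8 | 77 | 74 | 1 | 0 | 1
YWHAB | 116 | 12 | 4 | 83 | 6 | 1 | 0 | 2
GAPDH | 37 | 6 | 0 | 20 | 19 | 0 | 0 | 0
RELA | 98 | 13 | 13 | 103 | 90 | 1 | 0 | 2
HDAC1 | 114 | 13 | 5 | 14 | 12 | 1 | 0 | 0
RPS27 | 2 | 1 | 0 | 9 | 10 | 0 | 0 | 32
SKP1A | 35 | 11 | 5 | 13 | 15 | 0 | 0 | 2
ACTC | 32 | 2 | 0 | 1 | 2 | 0 | 0 | 0
PABPC1 | 23 | 3 | 0 | 11 | 6 | 0 | 0 | 2
VDAC1 | 16 | 4 | 0 | 2 | 0 | 2 | 0 | 0
THRB | 35 | 11 | 0 | 0 | 2 | 2 | 0 | 0
HSPA8 | 42 | 5 | 0 | 42 | 40 | 0 | 0 | 0
PDZK1 | 17 | 0 | 0 | 3 | 4 | 0 | 1 | 0
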

Literature citations in literature-derived databases

Literature citations are generally linked to interactions in literature-derived datasets. We checked the total number of PubMed citations linked to PPIs in the literature-derived databases (Fig. 4A). HPRD has >43,634 published articles to support the PPI data, while BIND and MINT contain ~8,020 and ~11,480 citations, respectively. Reactome contains a total of ~2,000 citations. Another way to assess the extent of curation is to determine the number of citations per interaction. More than one citation for a given PPI indicates that the interaction has been verified by more than one group or method. Conversely, however, the presence of a single citation does not automatically imply that only one study describes the interaction, because it is quite likely that only one published paper was linked although several studies might have been carried out (i.e. incomplete curation). This is illustrated in the section below, where the same PPI is compared across multiple databases. As shown in Fig. 4B, 100% of PPIs in PDZBase and >95% of PPIs in MINT, IntAct and MIPS have one PubMed citation. In contrast, 87% of PPIs in BIND and DIP and 84% of PPIs in HPRD have only one citation. Notably, ~11% and 7% of PPIs in HPRD and BIND, respectively, have 2 citations, and ~2% of PPIs in HPRD, BIND and IntAct have more than 5 citations each. The majority of PPIs in Reactome (~96%) are linked to the same 2 published articles because these PPIs are predicted computationally using a matrix approach (i.e. all against all) to link proteins that were identified in two mass spectrometry-based protein complex pulldown studies on spliceosomes [34, 35].
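The citations-per-interaction measure can be sketched as follows. This is a minimal Python illustration with invented placeholder data; the function name `citation_distribution` and the toy PubMed IDs are assumptions, not part of the original analysis. It reports the percentage of PPIs with 1, 2, 3, 4 or at least 5 distinct citations, matching the categories of Fig. 4B.

```python
from collections import Counter


def citation_distribution(ppi_to_pmids):
    """Percentage of PPIs with 1, 2, 3, 4 or >=5 distinct PubMed citations."""
    labels = ["1", "2", "3", "4", ">=5"]
    counts = Counter()
    for pmids in ppi_to_pmids.values():
        distinct = len(set(pmids))  # the same article listed twice counts once
        if distinct == 0:
            continue  # interactions without any citation are ignored here
        counts[str(distinct) if distinct < 5 else ">=5"] += 1
    total = sum(counts.values())
    return {
        label: (100.0 * counts[label] / total if total else 0.0)
        for label in labels
    }


# Toy data: gene pairs mapped to placeholder PubMed IDs (not real citations).
toy_database = {
    ("A", "B"): ["pmid1"],
    ("A", "C"): ["pmid2", "pmid2"],
    ("B", "C"): ["pmid3", "pmid4"],
    ("C", "D"): ["pmid5", "pmid6", "pmid7", "pmid8", "pmid9", "pmid10"],
}

distribution = citation_distribution(toy_database)
print(distribution)
```

Counting distinct IDs matters because a database may list the same article once per detection method, which would otherwise inflate the apparent level of independent support.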
https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-7-S5-S19/MediaObjects/12859_2006_Article_1369_Fig4_HTML.jpg

 

Figure 4

Literature citations for protein-protein interactions. (A) The total number of literature citations linked to PPIs. (B) The percentage of PPIs in databases corresponding to 1, 2, 3, 4, or ≥5 literature citations per interaction is shown. The scale is modified as shown to provide a better view of the distribution of proteins with two or more citations per interaction.

 

Comparison of PPI annotations common to multiple databases

Overall statistics of databases might not reflect the breadth and depth of protein annotations from a biologist’s perspective. To provide certain ‘case studies,’ we prepared a list of protein interactions that are common to 4 or more literature-derived databases and then tabulated the number of PPIs in each database. We left out PDZBase because of its small size. Table 3 lists 6 representative PPIs that were common to 4 or more databases along with the article(s) cited for each interaction and the annotation of the experimental methods used to detect the corresponding PPI. As an example, the experimental method annotated for the interaction between transcription factors NFKB1 and NFKB3 reported recently [36] is in vivo (MI:0492) in HPRD, tandem affinity purification (TAP) (MI:0045) in DIP, anti tag coimmunoprecipitation (MI:0109) in MINT and tap tag coip (MI:0007) in IntAct. This example illustrates how databases can describe the same experiment using alternative vocabulary terms. The interaction of TNFRSF1A with TRADD is annotated as in vivo, in vitro and yeast 2-hybrid with 3 PubMed citations in HPRD, simply ‘experimental’ with 1 PubMed citation in DIP, immunoprecipitation and affinity chromatography with 3 PubMed citations in BIND, co-immunoprecipitation with 1 PubMed citation in MIPS, ‘co-immunoprecipitation, pulldown and two hybrid’ with 2 citations in MINT and ‘anti-bait coip, pulldown and two hybrid’ with 1 citation in IntAct. Together, the 6 databases refer to 8 PubMed citations to describe this interaction, while each individual database uses only between 1 and 3 citations. For the interaction of FADD with FAS, the HPRD annotation is ‘in vivo, in vitro and yeast 2-hybrid,’ DIP mentions ‘two hybrid test,’ BIND describes it as ‘immunoprecipitation,’ MIPS mentions ‘coip,’ MINT describes it as ‘coimmunoprecipitation and two hybrid’ and IntAct annotates it as ‘coip, pull down, anti tag coip and two hybrid.’ Table 3 highlights how different databases use different published articles for annotating the same PPI. Thus, the mere presence of a PPI in different literature-derived databases does not guarantee that the annotations will be identical. It also illustrates that merging annotations from multiple databases will increase the depth of individual annotations.
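The gain from merging annotations can be checked directly for the TNFRSF1A–TRADD interaction. The short Python sketch below uses the PubMed IDs listed for this interaction in Table 3; the variable names are illustrative only.

```python
# PubMed IDs annotated for the TNFRSF1A-TRADD interaction, as listed in Table 3.
tnfrsf1a_tradd_pmids = {
    "HPRD": ["7758105", "8565075", "8612133"],
    "DIP": ["9129204"],
    "BIND": ["11684708", "15247912", "9916731"],
    "MIPS": ["9916731"],
    "MINT": ["8565075", "8621670"],
    "IntAct": ["7758105"],
}

# Distinct citations used by each individual database.
per_database = {name: len(set(pmids)) for name, pmids in tnfrsf1a_tradd_pmids.items()}

# Union of citations across all six databases.
merged = set().union(*tnfrsf1a_tradd_pmids.values())

print(per_database)
print(len(merged))  # 8
```

Each database contributes between 1 and 3 citations, while the merged set contains 8 distinct articles, in line with the figures quoted in the text.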

Table 3

Comparison of annotations of PPIs common to literature-derived curated PPI databases

No. | Interacting proteins | Database | Detection method | PubMed ID
--- | --- | --- | --- | ---
1 | NFKB1, NFKB3 | HPRD | in vivo | 9101089
1 | NFKB1, NFKB3 | DIP | Tandem Affinity Purification (TAP) | 14743216
1 | NFKB1, NFKB3 | BIND | Gel retardation assays, three dimensional structure | 15735750, 9738011, 9865693
1 | NFKB1, NFKB3 | MIPS | - | -
1 | NFKB1, NFKB3 | MINT | anti tag coimmunoprecipitation | 14743216
1 | NFKB1, NFKB3 | IntAct | Comigration in gel, anti bait coip, tap | 8246997, 8246997, 14743216
2 | TNFRSF1A, TRADD | HPRD | in vivo, in vitro, Yeast 2-hybrid | 7758105, 8565075, 8612133
2 | TNFRSF1A, TRADD | DIP | Experimental | 9129204
2 | TNFRSF1A, TRADD | BIND | Immunoprecipitation | 11684708, 15247912, 9916731
2 | TNFRSF1A, TRADD | MIPS | coip: coimmunoprecipitation | 9916731
2 | TNFRSF1A, TRADD | MINT | Coimmunoprecipitation, pull down, two hybrid | 8565075, 8621670
2 | TNFRSF1A, TRADD | IntAct | anti bait coip, pull down, two hybrid | 7758105
3 | FADD, FAS | HPRD | in vivo, in vitro, Yeast 2-hybrid | 8967952, 7538907, 7536190
3 | FADD, FAS | DIP | Two hybrid test | 7538907
3 | FADD, FAS | BIND | Immunoprecipitation | 15665818, 15383280
3 | FADD, FAS | MIPS | coip: coimmunoprecipitation | 10196099
3 | FADD, FAS | MINT | Coimmunoprecipitation, two hybrid | 7536190, 7538907
3 | FADD, FAS | IntAct | anti tag coip, coip, pull down, two hybrid | 7538907, 7536190, 7538907, 7538907
4 | PEX19, PEX3 | HPRD | in vivo, in vitro, Yeast 2-hybrid | 10704444, 12096124
4 | PEX19, PEX3 | DIP | - | -
4 | PEX19, PEX3 | BIND | two-hybrid-test | 10430017, 12096124
4 | PEX19, PEX3 | MIPS | coip: coimmunoprecipitation, two hybrid | 10430017
4 | PEX19, PEX3 | MINT | two hybrid, ubiquitin reconstruction | 12096124, 16189514
4 | PEX19, PEX3 | IntAct | far western blotting, two hybrid pooling | 10704444, 16189514
5 | CDK2, CDKN1A | HPRD | in vitro | 12839982
5 | CDK2, CDKN1A | DIP | Two hybrid test | 8242751
5 | CDK2, CDKN1A | BIND | other | 15232106
5 | CDK2, CDKN1A | MIPS | coip: coimmunoprecipitation | 8641969
5 | CDK2, CDKN1A | MINT | protein array, pull down | 15232106, 9284049
5 | CDK2, CDKN1A | IntAct | protein array, pull down | 15232106, 8756624
6 | PEX12, PEX5 | HPRD | in vivo, in vitro, Yeast 2-hybrid | 10562279, 10837480, 12096124
6 | PEX12, PEX5 | DIP | - | -
6 | PEX12, PEX5 | BIND | two-hybrid-test | 12096124
6 | PEX12, PEX5 | MIPS | coip: coimmunoprecipitation, two hybrid | 10646847
6 | PEX12, PEX5 | MINT | two hybrid | 12096124
6 | PEX12, PEX5 | IntAct | anti tag coip, filter binding, two hybrid | 10562279, 10562279, 12620231


Download options and use of identifiers in PPI databases

The Proteomics Standards Initiative (PSI) is a collaborative initiative for standardization of protein-related data, including protein-protein interaction and mass spectrometry data. The PSI-molecular interaction (PSI-MI) [37] format is an exchange format that has already become the standard for PPI data [4]. Table 1 shows that although many databases, such as HPRD, BIND, DIP, MINT, MIPS and IntAct, provide PPI data in this format, some databases such as AfCS and Reactome do not currently have this option. Reactome also provides data in two pathway-related formats, BioPAX and SBML. The data contained in AfCS are not currently available as a downloadable file.

Although a consensus on the use of standardized vocabulary for denoting PPIs is evolving and is being increasingly adopted, there is no requirement for use of any particular type of identifiers or database accession numbers for proteins in PPI databases. Different sets of protein database identifiers are used, with many of them being frequently retired, merged or otherwise updated. This creates great difficulties for those who want to combine datasets from different databases. Mapping identifiers to a single set of proteins is not a trivial task and creates a bioinformatics pitfall of its own. If this mapping is done by purely automated methods, there is a risk of wrongly assigning a protein entry from one database to another. To minimize this, we recommend the use of gene symbols in addition to any ‘favorite’ protein identifier. This allows for a relatively more error-free interpretation of PPI data at the gene level.
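One way to reduce the risk of silent misassignment during identifier mapping is to make the outcome of each lookup explicit. The Python sketch below is a hedged illustration only: the accession numbers, gene symbols, lookup tables and the function `resolve_gene_symbol` are all invented, and real mappings would have to come from the identifier history of the source databases. Retired identifiers that map to more than one gene are flagged as ambiguous instead of being assigned automatically.

```python
def resolve_gene_symbol(identifier, current_to_gene, retired_to_current):
    """Map a protein identifier to a gene symbol and report how it was resolved.

    Returns a (gene_symbol, status) tuple, where status is one of
    'current', 'remapped', 'ambiguous' or 'unmapped'.
    """
    if identifier in current_to_gene:
        return current_to_gene[identifier], "current"
    replacements = retired_to_current.get(identifier, [])
    symbols = {
        current_to_gene[accession]
        for accession in replacements
        if accession in current_to_gene
    }
    if len(symbols) == 1:
        return symbols.pop(), "remapped"
    if len(symbols) > 1:
        return None, "ambiguous"  # leave for manual curation
    return None, "unmapped"


# Hypothetical lookup tables (identifiers and symbols are placeholders).
current_to_gene = {"ACC001": "GENE1", "ACC002": "GENE2"}
retired_to_current = {
    "OLD001": ["ACC001"],            # retired and replaced by one entry
    "OLD002": ["ACC001", "ACC002"],  # split into two entries for different genes
}

for accession in ["ACC001", "OLD001", "OLD002", "OLD999"]:
    print(accession, resolve_gene_symbol(accession, current_to_gene, retired_to_current))
```

Keeping the status alongside the gene symbol makes it possible to report how many interactions relied on remapped identifiers and to route ambiguous cases to manual review.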

Conclusion

There is great interest in protein-protein interactions as a means of understanding the complexities of a cell. Large-scale PPI data derived from high-throughput experiments or literature-derived curated databases have been used to analyze the molecular networks of human cells [38, 39, 40, 41]. Here, our assessment shows that the number of PPIs in databases varies widely, from as low as 100 to over 36,600 interactions. Overlap of PPIs within the same category of databases (e.g. within literature-derived databases) is low despite the presence of overlapping proteins. A comparison of the number of PPIs for a test set of proteins confirms that there is indeed a large variation in the number of interactors across the interaction databases. Also, a comparison of annotations for the PPIs that do overlap between the databases reveals differences in annotations through the use of alternative vocabulary terms. This is partly because of differences in how the biologists annotating the experimental results interpret them and partly because of the overlapping meaning of the terms themselves.

A particularly important issue is that of protein isoforms. Often, only one isoform is annotated as an interactor although there is no evidence that the interaction is specific to that isoform. In other experiments such as coimmunoprecipitation experiments, it is almost impossible to discern which isoform binds unless an isoform-specific antibody is used. Because of this difficulty in mapping isoforms, we suggest that groups carrying out interaction studies, especially large-scale studies, map the identity of the proteins to genes and include this in their data submission. We have also previously done this for protein identification studies using mass spectrometry where a similar difficulty exists with regard to identification of particular isoforms [42]. If this is done, then a binary interaction can be interpreted thus: at least one of the gene products of Gene A interacts with at least one of the gene products of Gene B.

The dissemination of PPI datasets is an important aspect for optimal use of the data. Through decades of research, molecular biologists have discovered a large number of PPIs. Collecting this information, storing it and maintaining a database is a valuable task, which is perhaps not adequately appreciated by the scientific community. Our evaluation of human PPI databases highlights the diverse nature of annotation and representation of PPIs in databases. We hope that this review will assist biomedical scientists in making informed decisions about the most appropriate database to suit their needs and to actively participate with the databases to maintain error-free and updated annotations.

List of Abbreviations

PSI-MI: Proteomics Standards Initiative – Molecular Interaction

HPRD: Human Protein Reference Database

BIND: Biomolecular Interaction Network Database

DIP: Database of Interacting Proteins

MINT: Molecular INTeraction database

AfCS: Alliance for Cellular Signaling

Declarations

Acknowledgements

Akhilesh Pandey is supported by a grant from the National Institutes of Health (U54 RR020839). The Human Protein Reference Database was developed with funding from the National Institutes of Health and the Institute of Bioinformatics. Dr. Pandey serves as Chief Scientific Advisor to the Institute of Bioinformatics. Dr. Pandey is entitled to a share of licensing fees paid to the Johns Hopkins University by commercial entities for use of the database. The terms of these arrangements are being managed by the Johns Hopkins University in accordance with its conflict of interest policies.

This article has been published as part of BMC Bioinformatics Volume 7, Supplement 5, 2006: APBioNet – Fifth International Conference on Bioinformatics (InCoB2006). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S5.

References

  1. Kemmer D, Huang Y, Shah SP, Lim J, Brumm J, Yuen MM, Ling J, Xu T, Wasserman WW, Ouellette BF: Ulysses – an application for the projection of molecular interactions across species. Genome Biol 2005, 6: R106. 10.1186/gb-2005-6-12-r106
  2. Riley R, Lee C, Sabatti C, Eisenberg D: Inferring protein domain interactions from databases of interacting proteins. Genome Biol 2005, 6: R89. 10.1186/gb-2005-6-10-r89
  3. Suresh S, Sujatha Mohan S, Mishra G, Hanumanthu GR, Suresh M, Reddy R, Pandey A: Proteomic resources: Integrating biomedical information in humans. Gene 2005, 364: 13–18. 10.1016/j.gene.2005.07.021
  4. Hermjakob H, Montecchi-Palazzi L, Bader G, Wojcik J, Salwinski L, Ceol A, Moore S, Orchard S, Sarkans U, von Mering C, et al.: The HUPO PSI’s molecular interaction format – a community standard for the representation of protein interaction data. Nat Biotechnol 2004, 22: 177–183. 10.1038/nbt926
  5. BioPAX [http://www.biopax.org]
  6. HPRD Human Proteins Reference Database [http://www.hprd.org]
  7. Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, et al.: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13: 2363–2371. 10.1101/gr.1680803
  8. GenProt [http://www.genprot.org]
  9. NetPath [http://www.netpath.org]
  10. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, et al.: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32: D452–455. 10.1093/nar/gkh052
  11. IntAct [http://www.ebi.ac.uk/intact]
  12. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Lett 2002, 513: 135–140. 10.1016/S0014-5793(01)03293-8
  13. MINT Molecular INTeraction database [http://mint.bio.uniroma2.it/mint]
  14. Breitkreutz BJ, Stark C, Tyers M: Osprey: a network visualization system. Genome Biol 2003, 4: R22. 10.1186/gb-2003-4-3-r22
  15. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32: D449–451. 10.1093/nar/gkh086
  16. DIP Database of Interacting Proteins [http://dip.doe-mbi.ucla.edu]
  17. Deane CM, Salwinski L, Xenarios I, Eisenberg D: Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 2002, 1: 349–356. 10.1074/mcp.M100037-MCP200
  18. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Res 2002, 12: 1540–1548. 10.1101/gr.153002
  19. Duan XJ, Xenarios I, Eisenberg D: Describing biological protein interactions in terms of protein states and state transitions: the LiveDIP database. Mol Cell Proteomics 2002, 1: 104–116. 10.1074/mcp.M100026-MCP200
  20. Graeber TG, Eisenberg D: Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nat Genet 2001, 29: 295–300. 10.1038/ng755
  21. Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, et al.: The MIPS mammalian protein-protein interaction database. Bioinformatics 2005, 21: 832–834. 10.1093/bioinformatics/bti115
  22. MIPS Mammalian Protein-Protein Interaction Database [http://mips.gsf.de/proj/ppi]
  23. Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D: The PEDANT genome database in 2005. Nucleic Acids Res 2005, 33: D308–310. 10.1093/nar/gki019
  24. Gilman AG, Simon MI, Bourne HR, Harris BA, Long R, Ross EM, Stull JT, Taussig R, Bourne HR, Arkin AP, et al.: Overview of the Alliance for Cellular Signaling. Nature 2002, 420: 703–706. 10.1038/nature01304
  25. AfCS Alliance for Cellular Signaling [http://www.signaling-gateway.org]
  26. Alfarano C, Andrade CE, Anthony K, Bahroos N, Bajec M, Bantoft K, Betel D, Bobechko B, Boutilier K, Burgess E, et al.: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res 2005, 33: D418–424. 10.1093/nar/gki051
  27. BIND Biomolecular Interaction Network Database [http://www.bind.ca]
  28. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13: 2498–2504. 10.1101/gr.1239303
  29. Reactome [http://www.reactome.org]
  30. Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, et al.: Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 2005, 33: D428–432. 10.1093/nar/gki072
  31. PDZBase [http://icb.med.cornell.edu/services/pdz]
  32. Beuming T, Skrabanek L, Niv MY, Mukherjee P, Weinstein H: PDZBase: a protein-protein interaction database for PDZ-domains. Bioinformatics 2005, 21: 827–828. 10.1093/bioinformatics/bti098
  33. Bader GD, Hogue CW: Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 2002, 20: 991–997. 10.1038/nbt1002-991
  34. Hartmuth K, Urlaub H, Vornlocher HP, Will CL, Gentzel M, Wilm M, Luhrmann R: Protein composition of human prespliceosomes isolated by a tobramycin affinity-selection method. Proc Natl Acad Sci U S A 2002, 99: 16719–16724. 10.1073/pnas.262483899
  35. Rappsilber J, Ryder U, Lamond AI, Mann M: Large-scale proteomic analysis of the human spliceosome. Genome Res 2002, 12: 1231–1245. 10.1101/gr.473902
  36. Bouwmeester T, Bauch A, Ruffner H, Angrand PO, Bergamini G, Croughton K, Cruciat C, Eberhard D, Gagneur J, Ghidelli S, et al.: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 2004, 6: 97–105. 10.1038/ncb1086
  37. PSI-MI Proteomics Standards Initiative – Molecular Interaction [http://psidev.sourceforge.net/mi/xml/doc/user]
  38. Neduva V, Linding R, Su-Angrand I, Stark A, de Masi F, Gibson TJ, Lewis J, Serrano L, Russell RB: Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol 2005, 3: e405. 10.1371/journal.pbio.0030405
  39. Rual JF, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, Gibbons FD, Dreze M, Ayivi-Guedehoussou N, et al.: Towards a proteome-scale map of the human protein-protein interaction network. Nature 2005, 437: 1173–1178. 10.1038/nature04209
  40. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck FH, Goehler H, Stroedicke M, Zenkner M, Schoenherr A, Koeppen S, et al.: A human protein-protein interaction network: a resource for annotating the proteome. Cell 2005, 122: 957–968. 10.1016/j.cell.2005.08.029
  41. Gandhi TK, Zhong J, Mathivanan S, Karthick L, Chandrika KN, Mohan SS, Sharma S, Pinkert S, Nagaraju S, Periaswamy B, et al.: Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet 2006, 38: 285–293. 10.1038/ng1747
  42. Muthusamy B, Hanumanthu G, Suresh S, Rekha B, Srinivas D, Karthick L, Vrushabendra BM, Sharma S, Mishra G, Chatterjee P, et al.: Plasma Proteome Database as a resource for proteomics research. Proteomics 2005, 5: 3531–3536. 10.1002/pmic.200401335

 

 

 

http://www.ebi.ac.uk/intact/

IntAct Molecular Interaction Database

IntAct provides a freely available, open source database system and analysis tools for molecular interaction data. All interactions are derived from literature curation or direct user submissions and are freely available. The IntAct Team also produce the Complex Portal.

 

 

http://thebiogrid.org/download.php

BioGRID interaction data are 100% freely available to both commercial and academic users and are provided WITHOUT ANY WARRANTY. Publications that make use of this data are requested to cite the contributing authors and, where applicable: Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. Biogrid: A General Repository for Interaction Datasets. Nucleic Acids Res. Jan1; 34:D535-9.