The National Centre for Text Mining (NaCTeM) is the first publicly-funded text mining centre in the world. We provide text mining services in response to the requirements of the UK academic community. NaCTeM is operated by the University of Manchester.
On our website, you can find pointers to sources of information about text mining, such as links to:
- text mining services provided by NaCTeM
- software tools, both those developed by the NaCTeM team and by other text mining groups
- seminars, general events, conferences and workshops
- tutorials and demonstrations
- text mining publications
NaCTeM Software Tools
The National Centre for Text Mining bases its service systems on a number of text mining software tools.
- Part-of-speech (POS) taggers
- Named entities/terms
- AnatomyTagger — an open-source entity mention tagger for anatomical entities
- Named-entity Recognizer — Part of the GENIA Tagger
- NEMine — Recognizes gene/protein names in text.
- Yeast MetaboliNER — Recognizes yeast metabolite names in text.
- ACELA — Tool for efficient annotation of named entities
- Smart dictionary lookup — machine learning-based gene/protein name lookup
- Smart Dictionary Lookup Tool Web Service — Looks up term variations of a given gene/protein name based on an automatically trained similarity measure
- Term Normalization Tool — Normalizes terms with string rewriting rules automatically generated based on a dictionary.
- DECA — A species disambiguation system for biological named entities
- RF-TermAlign — a bilingual dictionary extraction tool that uses a Random Forest method to learn string similarity of terms between a source and target language.
- Other tools
- EventMine — A machine learning-based event extraction system.
- brat — A free, open-source, web-based tool for text annotation visualisation and editing.
- Cafetiere — An easy-to-use text mining system for carrying out text mining on your own document collection
- Sentence and paragraph breaker — An accurate sentence and paragraph detector based on heuristic rules
- Clinical Document Classification — automatic document classification demo
- Sentiment Analysis Tool — Analyses sentiment of input text.
Analysis of biological processes and diseases using text mining approaches.
A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. 
Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.
- [PubMed – indexed for MEDLINE]
Recently, I found this good research paper called “PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool)”. It may be useful for scientists who are interested in this topic.
PALM-IST: Pathway Assembly from Literature Mining–an Information Search Tool.
Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword-based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological process, cellular component, etc.) information from the extracted text and subsequently mines various databases to provide their comprehensive inter-relations (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction network and pathway information data relevant to the text search using multiple data mining tools and assembles them to create a meta-interaction network. It also analyzes scientific collaboration by extracting and creating a “co-authorship network” for a given search context. Hence, this useful combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interactions (PPIs), to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST is available at www.hpppi.iicb.res.in/ctm.
PALM-IST (Pathway Assembly from Literature Mining – an Information Search Tool) is a computational platform that lets users explore the biomedical literature (PubMed) using multiple keywords and extract gene/protein-, drug- and disease-centered information, along with the relations and interactions among these entities, from text and databases. PALM-IST provides users with a platform where data mining and literature mining are performed simultaneously. Combined structured data (from data mining) and unstructured data (from text mining) can be used to extract novel associations and interactions between biological entities such as proteins, diseases, or drugs, to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST also enables users to assemble human pathways and protein-protein interaction networks (PPINs) using information extracted from text and databases.
1. Real time search in PubMed.
2. Identification and highlighting of genes, drugs and diseases extracted from searched abstracts.
3. Interactive co-occurrence based network of gene-disease, gene-drug, drug-disease from literature.
4. Functional annotation by mapping expression information on to human pathway proteins and their interactors.
5. Platform to merge protein-protein interaction of multiple human genes/proteins.
6. Platform to find cross-talk genes/proteins from merged pathways result.
7. Interactive display of pathways overlaid with protein-protein interaction information.
8. Interactive display of collaborative network between biomedical experts.
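The co-occurrence based networks in item 3 can be illustrated with a minimal sketch. It assumes that entity recognition has already been done and that each abstract is reduced to sets of entity names per type; the input data, the function name `cooccurrence_edges` and the weighting by document count are illustrative assumptions, not a description of how PALM-IST is implemented.

```python
from collections import Counter
from itertools import product

# Hypothetical input: one dict per abstract, mapping an entity type to the
# set of entity names recognised in that abstract.
abstracts = [
    {"gene": {"BRCA1", "TP53"}, "disease": {"breast cancer"}},
    {"gene": {"TP53"}, "disease": {"breast cancer", "lung cancer"}},
    {"gene": {"EGFR"}, "disease": {"lung cancer"}},
]

def cooccurrence_edges(docs, type_a, type_b):
    """Count how many documents mention each (type_a, type_b) pair together."""
    edges = Counter()
    for doc in docs:
        for a, b in product(doc.get(type_a, ()), doc.get(type_b, ())):
            edges[(a, b)] += 1
    return edges

# Weighted gene-disease edges; the weight is the number of shared abstracts.
gene_disease = cooccurrence_edges(abstracts, "gene", "disease")
for (gene, disease), weight in gene_disease.most_common():
    print(f"{gene} -- {disease} (weight {weight})")
```

The same function yields gene-drug or drug-disease edges by changing the two type arguments.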
KH Coder is free software for quantitative content analysis or text data mining. It is also used for computational linguistics. You can analyze Japanese, English, French, German, Italian, Portuguese and Spanish text with KH Coder. Chinese (simplified, UTF-8), Korean and Russian (UTF-8) language data can also be analyzed with the latest alpha version.
KH Coder provides various kinds of search and statistical analysis functions using back-end tools such as Stanford POS Tagger, FreeLing, Snowball stemmer, MySQL and R.
The OntoGene literature mining web service
Received 1 August 2013; Accepted 10 September 2013; Published 14 October 2013
Competing interests: the authors have declared that no competing interests exist.
Motivation and Objectives
Text mining technologies are increasingly providing an effective response to the growing demand for faster access to the vast amounts of information hidden in the literature. Several tools are becoming available which offer the capability to mine the literature for specific information, such as protein-protein interactions or drug-disease relationships. The biomedical text mining community regularly verifies the progress of such systems through competitive evaluations, including BioCreative, BioNLP, i2b2, CALBC, CLEF-ER and BioASQ.
The OntoGene system is a text mining system which specializes in the detection of entities and relationships from selected categories, such as proteins, genes, drugs, diseases, chemicals. The quality of the system has been tested several times through participation in some of the community-organized evaluation campaigns.
In order to make the advanced text mining capabilities of the OntoGene system more widely accessible without the burden of installation of complex software, we are setting up a web service which will allow any remote user to submit arbitrary documents. The results of the mining service (entities and relationships) are then delivered back to the user as XML data, or optionally can be inspected via a flexible web interface.
The text mining pipeline which constitutes the core of the OntoGene system has been described previously in a number of publications (Rinaldi, 2008; Rinaldi, 2010; Rinaldi, 2012). We will only briefly describe the core text mining technologies, and instead focus mainly on the novel web service which allows remote access to the OntoGene text mining capabilities.
The first step in processing a collection of biomedical literature consists in the annotation of names of relevant domain entities (currently the system considers proteins, genes, species, experimental methods, cell lines, chemicals, drugs and diseases). These names are sourced from reference databases and are associated with their unique identifiers in those databases, thus allowing resolution of synonyms and cross-linking among different resources. A term normalization step is used to match the terms with their actual representation in the text, taking into account a number of possible surface variations. Finally, a disambiguation step resolves the ambiguity of the matched terms.
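A minimal sketch of the normalization and lookup idea is given below. It assumes a tiny hand-made terminology with placeholder identifiers (the `EX:` identifiers are invented for illustration and do not come from any reference database), and it uses a single hand-written normalization rule; the actual OntoGene normalization and disambiguation steps are not reproduced here.

```python
import re

# Hypothetical mini-terminology: term -> placeholder identifier.
TERMINOLOGY = {
    "5-HT2B receptor": "EX:0001",
    "fenfluramine": "EX:0002",
    "valvular heart disease": "EX:0003",
}

def normalize(term):
    """Collapse surface variations: case, brackets, hyphens, slashes, spaces."""
    return re.sub(r"[()\[\]\-_/\s]+", "", term.lower())

# Index the terminology by normalized form so that variants share one key.
INDEX = {normalize(term): ident for term, ident in TERMINOLOGY.items()}

def lookup(mention):
    """Return the identifier for a text mention, or None if it is unknown."""
    return INDEX.get(normalize(mention))

print(lookup("5-HT(2B) receptor"))
print(lookup("Valvular  heart-disease"))
print(lookup("serotonin"))
```

Because several database entries can collapse to the same normalized key, a real system needs the subsequent disambiguation step that the text describes; this sketch simply keeps one identifier per key.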
Candidate interactions are generated by simple co-occurrence of terms within the same syntactic units. However, in order to increase precision, we parse the sentences with our state-of-the-art dependency parser, which generates a syntactic representation of the sentence. This is in turn used to score and filter candidate interactions based on the syntactic fragment which connects the two participating entities.
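The co-occurrence stage can be sketched as follows, taking the sentence as the syntactic unit. The sentence splitter and the whole-word matching are deliberately naive assumptions made for illustration, and the dependency-based scoring and filtering described above is not implemented.

```python
import re
from itertools import combinations

def split_sentences(text):
    # Naive splitter used only for illustration; a real system would use
    # a proper sentence detector.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def mentions(term, sentence):
    # Whole-term match, so that "fenfluramine" does not fire inside
    # "norfenfluramine".
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, sentence, re.IGNORECASE) is not None

def candidate_pairs(text, terms):
    """Return (term_a, term_b, sentence) for terms sharing a sentence."""
    candidates = []
    for sentence in split_sentences(text):
        found = sorted(t for t in terms if mentions(t, sentence))
        for a, b in combinations(found, 2):
            candidates.append((a, b, sentence))
    return candidates

text = ("Fenfluramine binds weakly to 5-HT(2A) receptors. "
        "Norfenfluramine exhibited high affinity for 5-HT(2B) receptors.")
terms = {"fenfluramine", "norfenfluramine", "5-HT(2A)", "5-HT(2B)"}

pairs = candidate_pairs(text, terms)
for a, b, sentence in pairs:
    print(a, "<->", b)
```

Each candidate keeps its sentence so that a later stage can inspect the syntactic path between the two terms.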
The ranking of relation candidates is further optimized by a supervised machine learning method. Since the term recognizer aims at high recall, it introduces several noisy concepts, which we want to identify automatically in order to penalize them. Additionally, we need to adapt to highly ranked false positive relations which are generated by our frequency-based approach. The goal is to identify global preferences or biases which can be found in the reference database. One technique is to weight individual concepts according to their likelihood of appearing as an entity in a correct relation, as seen in the target database.
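The concept-weighting idea can be sketched with a simple smoothed frequency estimate. This is an assumption made for illustration: the paper does not give the formula, and the function names, the add-one smoothing and the multiplicative rescoring below are not taken from the OntoGene system.

```python
from collections import Counter

def concept_weights(candidates, reference, smoothing=1.0):
    """Estimate how often each concept takes part in a correct relation.

    candidates: iterable of (concept_a, concept_b) pairs proposed by the system
    reference:  set of frozenset({a, b}) relations found in the target database
    """
    seen, correct = Counter(), Counter()
    for a, b in candidates:
        hit = frozenset((a, b)) in reference
        for concept in (a, b):
            seen[concept] += 1
            correct[concept] += int(hit)
    # Smoothing keeps rarely seen concepts away from the extremes 0 and 1.
    return {c: (correct[c] + smoothing) / (seen[c] + 2 * smoothing) for c in seen}

def rescore(pair, base_score, weights):
    """Scale a candidate's score by the weights of its two concepts."""
    a, b = pair
    return base_score * weights.get(a, 0.5) * weights.get(b, 0.5)

# Toy data: "protein" is a noisy concept that never appears in a correct relation.
candidates = [("geneA", "drugX"), ("geneA", "drugY"),
              ("protein", "drugX"), ("protein", "drugY"), ("protein", "geneA")]
reference = {frozenset(("geneA", "drugX")), frozenset(("geneA", "drugY"))}

weights = concept_weights(candidates, reference)
print(weights)
print(rescore(("geneA", "drugX"), 1.0, weights))
print(rescore(("protein", "drugX"), 1.0, weights))
```

With these toy inputs the noisy concept receives a lower weight than the others, so relations that involve it drop in the ranking.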
The OntoGene web service has been implemented as a RESTful service (Richardson and Ruby, 2007). It accepts simple XML files as input, based on the BioC specification. The output of the system is generated in the same format. For example, a query aiming at retrieving the diseases from PubMed abstract 10617681 would generate the output presented in Box 1.
Possible role of valvular serotonin 5-HT(2B) receptors in the cardiopathy associated with fenfluramine.
Dexfenfluramine was approved in the United States for long-term use as an appetite suppressant until it was reported to be associated with valvular heart disease. The valvular changes (myofibroblast proliferation) are histopathologically indistinguishable from those observed in carcinoid disease or after long-term exposure to 5-hydroxytryptamine (5-HT)(2)-preferring ergot drugs (ergotamine, methysergide). 5-HT(2) receptor stimulation is known to cause fibroblast mitogenesis, which could contribute to this lesion. To elucidate the mechanism of “fen-phen”-associated valvular lesions, we examined the interaction of fenfluramine and its metabolite norfenfluramine with 5-HT(2) receptor subtypes and examined the expression of these receptors in human and porcine heart valves. Fenfluramine binds weakly to 5-HT(2A), 5-HT(2B), and 5-HT(2C) receptors. In contrast, norfenfluramine exhibited high affinity for 5-HT(2B) and 5-HT(2C) receptors and more moderate affinity for 5-HT(2A) receptors. In cells expressing recombinant 5-HT(2B) receptors, norfenfluramine potently stimulated the hydrolysis of inositol phosphates, increased intracellular Ca(2+), and activated the mitogen-activated protein kinase cascade, the latter of which has been linked to mitogenic actions of the 5-HT(2B) receptor. The level of 5-HT(2B) and 5-HT(2A) receptor transcripts in heart valves was at least 300-fold higher than the levels of 5-HT(2C) receptor transcript, which were barely detectable. We propose that preferential stimulation of valvular 5-HT(2B) receptors by norfenfluramine, ergot drugs, or 5-HT released from carcinoid tumors (with or without accompanying 5-HT(2A) receptor activation) may contribute to valvular fibroplasia in humans.
<text>HEART VALVE DISEASES</text>
Box 1. The output of the system is generated in the BioC specification format. This output was generated by a query aiming at retrieving the diseases from PubMed abstract 10617681.
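Since only a fragment of Box 1 survives here, the following sketch shows roughly what a BioC-style document with one stand-off annotation might look like and how a client could read it. The element names follow the general BioC layout (collection, document, passage, annotation, infon), but the exact structure and infon keys returned by the OntoGene service are assumptions, and no service URL is used.

```python
import xml.etree.ElementTree as ET

def build_bioc(pmid, passage_text, annotations):
    """Build a simplified BioC-style collection with stand-off annotations."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date")   # left empty in this sketch
    ET.SubElement(collection, "key")    # left empty in this sketch
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = passage_text
    for i, (entity_type, term) in enumerate(annotations, start=1):
        annotation = ET.SubElement(passage, "annotation", id=str(i))
        ET.SubElement(annotation, "infon", key="type").text = entity_type
        ET.SubElement(annotation, "text").text = term
    return collection

title = ("Possible role of valvular serotonin 5-HT(2B) receptors in the "
         "cardiopathy associated with fenfluramine.")
collection = build_bioc("10617681", title, [("disease", "HEART VALVE DISEASES")])
xml_text = ET.tostring(collection, encoding="unicode")

# A client would parse the returned XML and collect the annotated terms.
root = ET.fromstring(xml_text)
diseases = [a.findtext("text") for a in root.iter("annotation")
            if a.findtext("infon") == "disease"]
print(diseases)
```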
Options can be used in the input query to select whether the result should contain in-line annotations (showing where exactly in the text the term was mentioned) or stand-off annotations (as in the example above). Currently the system uses pre-defined terminology, and only allows users to decide whether or not to use each of the pre-loaded vocabularies. However, we foresee the possibility of letting users upload their own terminologies in the future.
Since the OntoGene system not only delivers the specific terms found in the submitted articles, but also their unique identifiers in the source database(s), it is relatively easy to turn its results into a semantic representation, as long as the original databases are based on a standardized ontology. Any term annotation can be turned into a monadic ground fact (possibly using a suitable URI), and interactions can be turned into RDF statements, which could then potentially be integrated across a large collection of documents.
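As a rough illustration of this conversion, the sketch below serializes a term annotation and an interaction as N-Triples lines. The `http://example.org/` namespace, the local identifiers and the `associated_with` predicate are placeholders chosen for the example; a real deployment would use the URIs of the source databases and ontologies.

```python
BASE = "http://example.org/"  # hypothetical namespace, for illustration only
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def uri(local_id):
    """Wrap a local identifier as a URI reference in N-Triples syntax."""
    return f"<{BASE}{local_id}>"

def annotation_to_triple(term_id, type_id):
    # A monadic ground fact such as disease(term) becomes an rdf:type statement.
    return f"{uri(term_id)} {RDF_TYPE} {uri(type_id)} ."

def interaction_to_triple(subject_id, predicate, object_id):
    # A binary interaction becomes a subject-predicate-object statement.
    return f"{uri(subject_id)} {uri(predicate)} {uri(object_id)} ."

triples = [
    annotation_to_triple("disease_1", "Disease"),
    annotation_to_triple("drug_1", "Drug"),
    interaction_to_triple("drug_1", "associated_with", "disease_1"),
]
print("\n".join(triples))
```

Triples produced from many documents can be concatenated into one file, which is what makes integration across a large collection straightforward.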
Results and Discussion
Users can submit arbitrary documents to the OntoGene mining service by embedding the text to be mined within a simple XML wrapper. Both input and output of the system are defined according to the BioC standard (Comeau et al., 2013). However, typical usage will involve processing of PubMed abstracts or PubMed Central full papers. In this case the user can simply provide the PubMed identifier of the article as input. Optionally, users can specify which type of output they would like to obtain: if entities, which entity types, and if relationships, which combination of types.
Figure 1. Example of visualization of text mining results using the ODIN interface.
The OntoGene pipeline identifies all relevant entities mentioned in the paper, and their interactions, and reports them back to the user as a ranked list, where the ranking criterion is the system's own confidence in the specific result. The confidence value is computed taking into account several factors, including the relative frequency of the term in the article, its general frequency in PubMed, the context in which the term is mentioned, and the syntactic configuration between two interacting entities (for relationships). A detailed description of the factors that contribute to the computation of the confidence score can be found in (Rinaldi et al., 2010).
The user can choose either to inspect the results using the ODIN web interface (see Figure 1), or to have them delivered back via the RESTful web service in BioC XML format for further local processing. The usage of ODIN as a curation tool has been tested within the scope of collaborations with curation groups, including PharmGKB, CTD and RegulonDB (Rinaldi, 2012).
The effectiveness of the web service has recently been evaluated within the scope of one of the BioCreative 2013 shared tasks. The official results will be made available at the BioCreative workshop (to be held at the NIH, Bethesda, Maryland, 7-9 October 2013) and will be discussed at the NETTAB workshop when this paper is presented. Only two groups have been invited to present their results at the BioCreative workshop, which suggests that the OntoGene/ODIN system is among the top achievers. The system can currently be tested via the.
As a future development we envisage the possibility that ODIN could be turned into a tool for collaborative curation of the biomedical literature, with input from the text mining system aimed only at facilitating the curation process, not at fully replacing the knowledge of the human experts. It is already possible in ODIN for any user to easily add, remove or modify annotations provided by the system. Such a social application could help address the widening gap between the amount of published literature and the capabilities of curation teams to keep abreast of it.
The OntoGene group is partially supported by the Swiss National Science Foundation (grants 100014-118396/1 and 105315-130558/1). A continuation of this work is planned within the scope of a collaboration with Roche Pharmaceuticals, Basel, Switzerland.
Comeau DC, Islamaj Doğan R, et al. (2013) BioC: A Minimalist Approach to Interoperability for Biomedical Text Processing. Database (Oxford) 2013, bat064. doi:10.1093/database/bat064
Richardson L and Ruby S (2007) RESTful Web Services. O’Reilly, ISBN 978-0-596-52926-0.
Rinaldi F, Kappeler T, et al. (2008) OntoGene in BioCreative II. Genome Biol 9:S13. doi:10.1186/gb-2008-9-s2-s13
Rinaldi F, Schneider G, et al. (2010) OntoGene in BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform 7(3), 472-480. doi:10.1109/TCBB.2010.50
Rinaldi F, Clematide S, et al. (2012) Using ODIN for a PharmGKB revalidation experiment. Database (Oxford), bas021. doi:10.1093/database/bas021
Rinaldi F, Schneider G, and Clematide S (2012) Relation Mining Experiments in the Pharmacogenomics Domain. J Biomed Inform 45(5), 851-861. doi:10.1016/j.jbi.2012.04.014
Nature Biotechnology 22, 1253 – 1259 (2004)
Published online: 6 October 2004 | doi:10.1038/nbt1017
Systems biology in drug discovery
The hope of the rapid translation of ‘genes to drugs’ has foundered on the reality that disease biology is complex, and that drug development must be driven by insights into biological responses. Systems biology aims to describe and to understand the operation of complex biological systems and ultimately to develop predictive models of human disease. Although meaningful molecular level models of human cell and tissue function are a distant goal, systems biology efforts are already influencing drug discovery. Large-scale gene, protein and metabolite measurements (‘omics’) dramatically accelerate hypothesis generation and testing in disease models. Computer simulations integrating knowledge of organ and system-level responses help prioritize targets and design clinical trials. Automation of complex primary human cell–based assay systems designed to capture emergent properties can now integrate a broad range of disease-relevant human biology into the drug discovery process, informing target and compound validation, lead optimization, and clinical indication selection. These systems biology approaches promise to improve decision making in pharmaceutical development.
Drug discovery and systems biology began together: in traditional or ‘folk’ medicine, herbal drugs were discovered through direct if anecdotal observations in people with diseases, the most relevant complex biological systems there are. With the advent of chemistry in the late 1800s and early 1900s, derivatives of natural products and subsequently novel synthetic chemicals made their way into drug discovery pipelines; but screening was still in the setting of complex disease biology, with animals replacing patients as the primary ‘guinea pigs.’ Most of today’s pharmaceuticals (at least on a ‘doses per patient-year’ basis) derive directly or indirectly from such early ‘systems biology’-based drug discovery. In the interest of speed and the perceived advantages of mechanistic insight, however, animal models were successively replaced with tissue-level screens (e.g., vascular or tracheal muscle tone), simple cell-based pathway screens (proliferation, cytokine production) and finally with today’s ultra-high-throughput screens capable of interrogating individual molecular targets with hundreds of thousands of compounds a day.
Today’s ‘win-by-numbers’ approach is very powerful when applied to known, validated targets (which often means targets of historical drugs), but has led to disappointingly few new drugs when applied to less well biologically understood (e.g., genome-derived) targets. The desire to mine the wealth of the genome has come face to face with the realization that knowing a target is not the same as knowing what the target does, let alone knowing the effects of a chemical inhibitor in diverse disease settings. In fact, despite the enormous investment in genomics and screening technologies over the past 20 years, the cost of new drug discovery continues to rise while approval rates fall [1]. The primary selection of drug targets and candidates has become divorced from the complexity of disease physiology. Reenter systems biology, in modern guise.
The goal of modern systems biology is to understand physiology and disease from the level of molecular pathways, regulatory networks, cells, tissues, organs and ultimately the whole organism. As currently employed, the term ‘systems biology’ encompasses many different approaches and models for probing and understanding biological complexity, and studies of many organisms from bacteria to man. Much of the academic focus is on developing fundamental computational and informatics tools required to integrate large amounts of reductionist data (global gene expression, proteomic and metabolomic data) into models of regulatory networks and cell behavior. Because biological complexity is an exponential function of the number of system components and the interactions between them, and escalates at each additional level of organization (Fig. 1), such efforts are currently limited to simple organisms or to specific minimal pathways (and generally in very specific cell and environmental contexts) in higher organisms [2, 3, 4]. Even if our ability to measure molecules and their functional states and interactions were adequate to the task, computational limitations alone would prohibit our understanding of cell and tissue behavior from the molecular level. Thus, methodologies that filter information for relevance, such as biological context and experimental knowledge of cellular and higher level system responses, will be critical for successful understanding of different levels of organization in systems biology research.
Omics (the bottom-up approach) focuses on the identification and global measurement of molecular components. Modeling (the top-down approach) attempts to form integrative (across scales) models of human physiology and disease, although with current technologies, such modeling focuses on relatively specific questions at particular scales, e.g., at the pathway or organ levels. An intermediate approach, with the potential to bridge the two, is to generate profiling data (e.g., biologically multiplexed activity profiling or BioMAP data) from high-throughput assays designed to incorporate biological complexity at multiple levels: multiple interacting active pathways, multiple intercommunicating cell types and multiple different environments. Such a complex cell systems approach addresses the need for data on cell responses to physiological stimuli and to pharmaceutical agents as an aid to modelers, and also as a practical approach to systems biology at the cell signaling network and cell-cell interaction scales.
This review focuses on recent advances in the practical applications of systems biology to drug discovery. Three principal approaches are discussed (Fig. 1): informatic integration of ‘omics’ data sets (a bottom-up approach); computer modeling of disease or organ system physiology from cell and organ response level information available in the literature (a top-down approach to target selection, clinical indication and clinical trial design); and the use of complex human cell systems themselves to interpret and predict the biological activities of drugs and gene targets (a direct experimental approach to cataloguing complex disease-relevant biological responses). These complementary approaches, which must ultimately be integrated in the quest for a hierarchical, molecule-to-systems level understanding of human disease, are already having an impact on the drug discovery process.
Omics: large-scale data generation and mining
It could be argued that a full understanding of the responses of a system requires knowledge of all of its component parts. Omics approaches to systems biology focus on the building blocks of complex systems (genes, proteins and metabolites). These approaches have been adopted wholeheartedly by the drug industry to complement traditional approaches to target identification and validation, for generating hypotheses and for experimental analysis in traditional hypothesis-based methods. For example, omics can be used to ask what genes, proteins or phosphorylation states of proteins are expressed or upregulated in a disease process, leading to the testable hypothesis that the regulated species are important to disease induction or progression (Table 1). Integration of genomics, proteomics and metabolite measurements within the context of controlled gene or drug perturbations of complex cell and animal models (and in the context of clinical data) is the basis of systems biology efforts at a number of drug companies, including Eli Lilly (Indianapolis, IN, USA), where they are accelerating the study of complex physiological processes such as bone metabolism [5].
Omics classification of disease states can lead to more efficient targeting or even personalization of therapies by identifying the specific molecular pathways active in particular disease states and in individual patients [6]. Another valuable application of the technology is the identification of surrogate markers for disease detection, or for monitoring of therapies [7, 8]. Although omics approaches thus accelerate development of mechanistic hypotheses and clinical insights, a systems-level understanding does not automatically emerge.
Significant efforts are underway to understand key pathway and organism-level responses by relying on the emergent properties of global gene and protein expression data (that is, the properties of the system as a whole that cannot be predicted from the parts). In relatively simple organisms, studies incorporating analysis of time-series genome-wide mRNA expression data, large-scale perturbation analyses and identification of coregulated components, and protein-protein interaction studies have led to new insights into pathway functions and signaling network organization in specific biological processes, such as cell proliferation or the response to metabolic perturbation [9, 10, 11, 12]. Although the added levels of complexity in human disease, as well as economic and computational limitations, severely limit the utility of omics as a stand-alone approach for systems-level understanding, omics technologies will be important for constructing the ‘scaffolds’ that help define and limit the possible pathways and connectivities in top-down models of cell-signaling networks [3].
Computer models: from pathways to disease physiology
The goal of modeling in systems biology is to provide a framework for hypothesis generation and prediction based on in silico simulation of human disease biology across the multiple distance and time scales of an organism (from molecular reactions to organism homeostasis and disease responses) [2, 4]. We are certainly a long way from achieving any general, integrated model of human cell behavior, let alone human organismal biology, but real progress is being made in developing and testing computational and experimental methods for in silico systems biology at different scales (Table 2). Moreover, we do not need a global synthesis for modeling and simulation to be useful for basic biological insights and drug development; highly focused, problem-directed models are already having an impact on target validation and clinical development decisions (Table 1).
Mathematical and more recently computational models have a rich history in human physiology [4, 13, 14, 15]. Modeling efforts useful for drug discovery and development must simulate responses at the scale of cell and tissue or organ complexity (that is, the scale at which disease manifests itself). At the same time, a sufficient level of detail must be included such that intervention points accessible to drug discovery are available and can be modulated in silico to predict an organ level readout. Thus, a model simulation of heart contractility must incorporate the connection between Na+/Ca2+ exchangers and contractility to be useful to predict the effect of drugs targeting these channels [14]. Difficulty arises in developing models that can effectively integrate the molecular, cellular and organ levels. In addition to pure computational issues, limitations in bottom-up knowledge and in our understanding of pathway and network architecture and interactions, as well as a general lack of standardized knowledge of cell- and tissue-level responses to bioactive stimuli that could be used to validate models (see below), are fundamental, long-term problems that have to be addressed before models integrating complexity at multiple scales can be considered.
A practical approach to address the computational issues is to put in place an organ-level framework and add increasing complexity in a modular format. For example, one can begin with models of inflammation that examine cell-cell communication through cytokine networks and then start replacing the ‘black box’ cells with simulations of cell behavior (Table 2) modeled from network modules (e.g., models of cytoskeleton motility, proliferative or cytokine responses), ultimately replacing ‘black box’ pathway modules with bottom-up approaches [4].
Entelos (Foster City, CA, USA) has developed complex simulations of disease physiology using a framework of deterministic differential equations based on empirical data in humans [16] (Table 2). In these models, internal signaling pathways are not modeled explicitly; cells or even tissues are represented as black boxes that respond to inputs by giving specified outputs that vary with time. Using such an organ level ‘disease physiology’ framework, Stokes et al. [17] have developed a computational model of chronic asthma that incorporates interactions among cells and some of the complexity of their responses to each other and their environment. Model parameters can be modified to reach a particular steady state reference point, for example, the state of chronic asthma (including chronic eosinophilic inflammation, chronic airway obstruction, airway hyperresponsiveness and elevated IgE levels) or the state of exercise-induced airway obstruction. Simulated ‘asthmatics’ respond as expected to various drugs, including β2 agonists, glucocorticoids and leukotriene antagonists [17]. Moreover, by simulating an antibody-dependent reduction in interleukin (IL)-5 protein (a driver of eosinophilia during asthma), this model predicts a decrease in airway eosinophilia but little therapeutic improvement in airway conductance [18], predictions that are consistent with the results of a clinical trial testing a humanized anti-IL-5 antibody in asthmatics [19].
Similar cell- and organ-scale models of glucose metabolism and homeostasis have a long history, evolving from simple relationships between glucose and insulin levels in circulation20 to more complex models involving integrated multiple tissue responses and their involvement in glucose metabolism21. A presentation of Entelos’ diabetes ‘PhysioLab’ at a recent conference (In Silico Biology Conference, San Diego, California, USA, June 2–3, 2002; C. Wallwork, personal communication) described how such a computational model has been used in the design of phase 1 trials for an unspecified drug treatment for type 2 diabetes. The results suggested that computational modeling enabled the experimental dosing arms and the number of patients required for the trial to be decreased, thus potentially reducing costs and increasing the probability of clinical success.
More detailed understanding of the systems behavior of intercellular signaling pathways, such as the identification of key nodes or regulatory points in networks or better understanding of crosstalk between pathways, can also help predict drug target effects and their translation to organ and organism level physiology. To this end, a very large number (more than can be fairly cited) of efforts have been focused at the scale of signaling pathways within cells (e.g., see Table 2). These models benefit from the large amount of literature data and the promise that omics efforts can provide constraints on the pathways (see previous ‘Omics: large-scale data generation and mining’ section). As for cell- and organ-level models, simulations of mammalian signaling networks usually rely on time-dependent differential equations and model the pathway in isolation and under very specific (and simple) conditions3, 22. A next level of detail that enhances the utility of such pathway models is the crosstalk between pathways. Bhalla et al.23 modeled signaling modules and found that combinations of simple modules lead to nonlinear responses or ‘emergent properties’ of the system. These nonobvious results based on pathway nonlinearity hold promise for identification and prioritization of intervention points within signaling networks.
Interestingly, the architecture of signaling pathways displays significant conservation during evolution, an insight that is being used to help define and understand mammalian cell signaling pathways based on homology with well-defined pathways in lower organisms, and between evolutionarily duplicated pathways in man (e.g., the PathBLAST tool24). However, although pathway homologies may suggest conservation of key points for chemical intervention in signaling, divergence of pathway functions and regulatory interactions is the norm so that ultimately there can be no substitute for studies in complex human systems.
No matter how successful current attempts at predictive modeling turn out to be, such models raise the challenge of experimental validation (theoretically, only possible with human data) and the cycles of improvement inherent to the modeling effort3 (Fig. 2). From a drug discovery point of view, any of the successes to date could be considered anecdotal and until a given model shows a track record of successful prediction in humans, it will be risky to rely on it for development decisions. For the foreseeable future, modeling predictions will likely be one of many inputs into the decision making process in the pharmaceutical industry.
Figure 2: Development cycle of integrated in silico models using component level and system response data.
Integrated models of disease can be generated using data from the literature as well as protein expression and interaction data sets, potentially informed by predictions of functional network organization and cell responses based ideally on complex human cell-based assays (e.g., see Fig. 3). Models are iteratively tested and improved by comparison of predictions with systems (cell, tissue or organism) level responses measured experimentally through traditional assays or from profiles generated from complex, activated human cell mixtures under a set of different environmental conditions. Component level ‘omics’ data can provide a scaffold, limiting the range of possible models at the molecular level.
Using complex cell systems to assay and model biology
Pathway modeling as yet remains too disconnected from systemic disease biology to have a significant impact on drug discovery. Top-down modeling at the cell-to-organ and organism scale shows promise, but is extremely dependent on contextual cell response data. Moreover, to bridge the gap between omics and modeling, we need to collect a different type of cell biology data—data that incorporate the complexity and emergent properties of cell regulatory systems and yet ideally are reproducible and amenable to storing in databases, sharing and quantitative analysis.
At one extreme, responses of human tissues themselves can be probed ex vivo, an approach that, even with limitations in terms of availability and reproducibility of human tissues, has proven useful for validating selected compounds and targets25. Highly reproducible or even automated approaches to cell biology, however, seem more likely to contribute to the large-scale compound and gene function analyses desired by industry and required as a basis for modeling efforts. Indeed, high-throughput cell-based screening systems, often relying on reporter assays and cell lines, are being used effectively by many companies to identify components of pathways26, screen for active compounds27 and even to profile drugs based on their effects on pathway or simple stimulus-response readouts28, 29. However, these assays are generally designed to isolate individual pathways and to minimize biological complexity and thus neither take advantage of, nor provide insight into, emergent properties of cell systems. This ‘systematic biology’ focus on simplified pathways is thus to be distinguished from the ‘systems biology’ focus on complexity and emergent properties.
At the same time, some groups are beginning to appreciate the importance of emergent properties in drug development. For instance, researchers at CombinatoRx (Boston, MA, USA) search for novel combination therapies by taking advantage of two stimuli (phorbol myristate acetate, an activator of the protein kinase C cascade, and ionomycin, a stimulator of Ca2+-dependent signaling) that turn on multiple pathways in primary cells to search for pairs of compounds that exhibit antagonism (e.g., to tumor necrosis factor (TNF)-α secretion from activated T cells) when combined, but not when used singly28. Elsewhere, Rosetta Inpharmatics (Seattle, WA, USA) has measured thousands of output genes in yeast, using the gene response profiles resulting from genetic or chemical (drug) perturbations to determine how genes that affect growth fit into pathways12 and to reveal the mechanism(s) of action of compounds29. These experimental approaches have begun to harness the power of systems biology, but the systems studied remain intentionally simple, focusing on only a few inputs or outputs (CombinatoRx) or a single physiologic state in a model organism (Rosetta). Complexity is a byproduct, not a product of design of these approaches.
Complexity and emergent properties in biology derive from several features: first, complex inputs that stimulate multiple pathways; second, multiple outputs that are integrated network responses to the inputs; third, interactions between multiple cell types; and fourth, multiple contexts and environments for each cell type or combination of cell types. The drug discovery industry has invested billions of dollars in technologies to evaluate outputs, but to incorporate disease-relevant complexity into drug discovery, intentional efforts must also be made to study cells in combination to mimic cell-cell interactions critical to in vivo regulatory networks and to assay cells in different complex environmental contexts (in which different combinations of pathways are activated). Parallel context or ‘multisystem’ analysis is important because proteins and pathways have evolved to integrate inputs and outputs from multiple contexts, so that to understand the effects of a drug (or target), data must be derived from cell responses in multiple environments.
Our group at BioSeek (Burlingame, CA, USA) has developed human cell–based assays that intentionally incorporate complexity at multiple levels, using parallel interrogation of standardized cell ‘systems’ (cells plus environments) designed to mimic physiological complexity by including one or more primary cell types as well as combinations of cells and active pathways (Fig. 3a). Cell systems are engineered to embody disease-relevant responses for biological function analyses, modeling and drug discovery. For example, a panel of just four cell systems (combinations of endothelial cells and blood mononuclear cells in four different complex inflammatory environments) was found to embody complex biology reflecting distinctive contributions of many pharmacologic targets relevant to inflammation30, 31. Profiles made up of as few as 24–40 protein readouts (including cytokines, chemokines, adhesion receptors and other inflammatory mediators) used to assess the responses of these complex systems are able to discriminate and classify most of the pathways and mechanisms affected by known modulators of inflammation, as well as a surprising array of other drugs and pathways tested30, 31 (Fig. 3b). Importantly, the profiles generated from these complex, activated cell mixtures are reproducible, allowing archiving in databases and automated searching and analyses by profile similarity or other characteristics (e.g., effects on key disease-relevant parameters).
Figure 3: Leveraging complexity in cell systems biology for drug discovery: biologically multiplexed activity profiling (BioMAP) applied to gene function, network architecture and drug activity relationships.
(a) Primary cells (e.g., endothelial cells and/or blood lymphocytes) are combined and exposed to stimuli (e.g., cytokines, growth factors or chemical mediators) in combinations relevant to the disease biology of interest (e.g., inflammation). Readouts used to measure system responses can be proteins, activated states of proteins, genes or other cellular constituents or properties selected for disease relevance (e.g., cytokines, growth factors, adhesion receptors, which are the ultimate mediators of cellular communication and function in disease) and for responsiveness to environmental and pharmacologic inputs (information content). Perturbations to the parallel systems define the biological activity profiles of interrogating drugs or genes. The combination of multiple cell types and multiple pathways activated elicits complex network regulation and emergent properties that enhance the sensitivity and ability of the systems to discriminate unique drug and gene effects. (b) Several complex human cell ‘systems’ (cells or cell combinations in disease-relevant environments) are interrogated with genes (via overexpression or siRNA) or drugs of interest and the effects on the levels of selected protein readouts are determined, generating a profile that serves as a multisystem signature of the function of the test agent. Statistical measures of profile similarity (i.e., do particular agents induce the same multisystem response?) can be used to cluster genes or drugs by function, and to generate graphical representations of their functional relationships with each other28, 29. As examples, clustering of profiles induced by gene overexpression (bottom left) reveals key pathway relationships (e.g., Ras/MAPK, phosphatidyl inositol 3-kinase (PI3K), interferon-γ (IFN-γ), and NF-κB-associated clusters) as well as pathway–pathway interactions in signaling networks controlling endothelial cell responses in the context of different inflammatory cytokines32.
Clustering of drug-induced profiles from inflammatory model systems (comprising activated combinations of endothelial cells and peripheral blood mononuclear cells) detects and discriminates the activities of most known modulators of inflammation as well as a surprising array of other drug targets and pathways, including for example glucocorticoids, cytokine antagonists, and inhibitors of HMG-CoA reductase, calcineurin, inosine monophosphate dehydrogenase, phosphodiesterases, nuclear hormone receptors, phosphatidyl inositol 3 kinases, heat shock protein 90, casein kinase 2, Janus-activated kinases, and p38 MAPK among others (illustrated in upper right; drugs are colored by mechanistic class)28, 30. Drugs specific for a common target (circled in black) or for targets in a common pathway (circled in red) cluster together, but compounds having different off-target activities are readily detected (e.g., the profiles of three JAK inhibitors with known secondary activities; asterisks). Clustering of activity profiles from lead chemical series can define compound-specific structure-activity relationships for lead optimization (lower right; different analogs are color coded; circle size reflects concentration). In the example shown, BioMAP clustering defines two functional activity classes among structurally related p38 MAPK inhibitors.
This approach, termed biologically multiplexed activity profiling (BioMAP), has been successfully employed in model studies suggesting its applicability to several stages of the drug discovery process (Table 1). For target identification and validation, informatics approaches based on the similarity of database-stored multisystem profiles have been shown to rapidly associate gene or drug activities with known (or novel) pathways, and to predict functional pathways and network interactions32 (Fig. 3b). Multisystem profiles induced by gene overexpression in endothelial cells in four different cytokine environments (in essence, multisystem signatures of gene function) automatically clustered into groups that reflected known pathway relationships with surprising fidelity32. Moreover, graphical representation of function similarity relationships (Fig. 3b, lower left panel) points to unique roles for two gene products, MyD88 and IRAK, in mediating interactions between the nuclear factor (NF)-κB and Ras/mitogen-activated protein kinase (MAPK) pathways. MyD88, previously known to signal via NF-κB, was subsequently confirmed in biochemical studies to trigger the MAPK pathway as well, which in turn inhibited NF-κB activation in a negative feedback loop activated by IL-1 but not TNF-α32. Clustering multisystem response profiles, in which the systems are designed to capture emergent properties, can thus help define the functional architecture of signaling networks, information important (in conjunction with conventional data sets) for designing and testing computational models.
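As an illustration of grouping agents by profile similarity, here is a minimal sketch using only the Python standard library. The text does not specify which similarity statistic or clustering method was used, so Pearson correlation and a single-linkage threshold rule are assumptions; the agent names and readout values are invented toy numbers, not experimental data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length readout profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (norm_x * norm_y)


def cluster_by_similarity(profiles, threshold=0.9):
    """Merge agents whose profiles correlate at or above the threshold.

    Single-linkage rule: two agents share a cluster if they are linked
    directly or through a chain of sufficiently similar agents.
    """
    names = list(profiles)
    cluster_of = {name: {name} for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                merged = cluster_of[a] | cluster_of[b]
                for member in merged:
                    cluster_of[member] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)


# Toy profiles: each list holds hypothetical readout changes, concatenated
# across several cell systems, for one test agent.
toy_profiles = {
    "agent_A": [1.0, -0.5, 0.2, -1.0],
    "agent_B": [0.9, -0.4, 0.3, -1.1],
    "agent_C": [-1.0, 0.8, -0.1, 0.5],
}
clusters = cluster_by_similarity(toy_profiles)
```

Here agent_A and agent_B produce nearly the same response pattern and fall into one group, while agent_C, whose pattern is roughly opposite, stays on its own.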
For compound characterization, the limited data sets, automation and broad functional coverage may make profiles generated from complex, activated cell mixtures an efficient way to screen focused libraries for effects on complex, disease-relevant biology and, more importantly, to prioritize hits from conventional high-throughput screening. In model studies, we have used profiles in four systems to classify hits and leads by their biological activities, to identify compounds with off-target activities (which may be desirable or undesirable), to distinguish ‘well-behaved’ lead series displaying consistent biological responses and to monitor structure-function relationships as a guide to lead optimization31 (Fig. 3b, lower right panel).
An additional strength of the multisystem approach is that parallel systems can be designed to capture a wide range of elicited (disease-relevant) biological and pathway activities; thus, the effects of drugs or genes can be assessed simultaneously for complex biological responses relevant to many different diseases and can be used to screen for novel therapeutic indications. (This contrasts with most modeling efforts and even animal or clinical trials, which are typically designed to address a single disease target.) Complex cell systems models of inflammation (Fig. 3), for example, readily detect the activities of 3-hydroxy-3-methyl-glutaryl-CoA (HMG-CoA) reductase inhibitors (e.g., statins) on inflammatory signaling30. This prompts the interesting question of whether inclusion of complex biological systems analyses in the development of statins could have accelerated the discovery of their potent role in autoimmune and inflammatory disorders33?
Omics could and certainly should be applied to cell systems designed to incorporate meaningful biological complexity. However, as indicated by studies by our group, highly informative functional signatures for gene and drug effects can be generated using very small numbers (tens) of biologically significant parameters, when these are assayed within several different complex cell and environment combinations. This appears to bear out the prediction that biological complexity encodes useful information about drug and protein function, and suggests that it can be leveraged for ‘smarter, faster, cheaper’ industrial-scale functional profiling.
From the practical near-term perspective, these approaches present an opportunity to integrate systems biology more efficiently and cost effectively throughout the drug discovery process. From a fundamental perspective, databases of such quantitative human cell biological responses to drugs and gene alterations, under standardized and reproducible conditions designed to embody disease-relevant complexity and capture emergent properties, are likely to be useful in predicting the functional architecture of complex regulatory networks and will provide an essential bridge for integration of omics data into in silico models of cell systems behavior, as well as a testing ground for these models as they develop (Fig. 2).
During drug development, million-dollar decisions are (and must be) routinely made using flawed criteria based on incomplete biological knowledge: for example, targets are prioritized because they are upregulated at the gene level in disease (even though many of our best historical targets are not); compounds are selected to be biochemically specific (though many of our most effective drugs are not); animal models are considered essential (although these are known to be poor predictors of clinical success). Better biology, preferably more relevant to human disease and capable of being integrated into the drug discovery process, is sorely needed to inform decision-making. Although the systems biology approaches outlined here are in their infancy, they are already contributing to meaningful drug development decisions by accelerating hypothesis-driven biology, by modeling specific physiologic problems in target validation or clinical physiology and by providing rapid characterization and interpretation of disease-relevant cell and cell system level responses.
Although these approaches are currently being pursued by separate laboratories and companies, it is clear that they are complementary and that ultimately they must be integrated for systems biology to achieve its potential. An analogy can be drawn to the genome project, in which multiple individual efforts contributed technology and informatics approaches that eventually enabled a concerted ‘big science’ push to sequence the genome. However, whereas the linear output of the genome project was easily standardized and archived, the multidimensional and multivariate nature of biological function and cell biology studies presents an extraordinary informatics and even social challenge, since standardization of experimental design and data is essential before a ‘big science’ approach to systems biology can be envisioned. Markup languages for gene expression data, emerging ontologies for sharing and integrating different kinds of omic and conventional biological data4 and the introduction of standardized high-throughput systems biology and associated informatics approaches represent important first steps on this path.
Writing of this review was supported in part by SBIR grants (R44 AI048255 and R43 AI049048) to BioSeek, Inc., and by NIH grants to E.C.B. The authors thank Evangelos Hytopoulos and Ivan Plavec for thoughtful criticism and input.
Competing interests statement:
The authors declare competing financial interests.
- DiMasi, J.A., Hansen, R.W. & Grabowski, H.G. The price of innovation: new estimates of drug development costs. J. Health Econ. 22, 151–185 (2003).
- Ideker, T., Galitski, T. & Hood, L. A new approach to decoding life: systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343–372 (2001).
- Ideker, T. & Lauffenburger, D. Building with a scaffold: emerging strategies for high- to low-level cellular modeling. Trends Biotechnol. 21, 255–262 (2003).
- Hunter, P.J. & Borg, T.K. Integration from proteins to organs: the Physiome Project. Nat. Rev. Mol. Cell Biol. 4, 237–243 (2003).
- Kulkarni, N.H. et al. Gene expression profiles classify different classes of bone therapies: PTH, Alendronate and SERMs, Poster 307, 31st European Symposium on Calcified Tissue, June 5, 2004, Nice, France; http://www.ectsoc.org/nice2004/abstracts.htm#onl
- Weston, A.D. & Hood, L. Systems biology, proteomics, and the future of health care: toward predictive, preventative, and personalized medicine. J. Proteome. Res. 3, 179–196 (2004).
- Clish, C.B. et al. Integrative biological analysis of the APOE*3-leiden transgenic mouse. Omics 8, 3–13 (2004).
- Kantor, A.B. et al. Biomarker discovery by comprehensive phenotyping for autoimmune diseases. Clin. Immunol. 111, 186–195 (2004).
- Davidson, E.H. et al. A genomic regulatory network for development. Science 295, 1669–1678 (2002).
- Ideker, T. et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929–934 (2001).
- Covert, M.W., Knight, E.M., Reed, J.L., Herrgard, M.J. & Palsson, B.O. Integrating high-throughput and computational data elucidates bacterial networks. Nature 429, 92–96 (2004).
- Hughes, T.R. et al. Functional discovery via a compendium of expression profiles. Cell 102, 109–126 (2000).
- Crampin, E.J. et al. Computational physiology and the Physiome Project. Exp. Physiol 89, 1–26 (2004).
- Noble, D. Modeling the heart—from genes to cells to the whole organ. Science 295, 1678–1682 (2002).
- Bassingthwaighte, J.B. & Vinnakota, K.C. The computational integrated myocyte: a view into the virtual heart. Ann. NY Acad. Sci. 1015, 391–404 (2004).
- Musante, C.J., Lewis, A.K. & Hall, K. Small- and large-scale biosimulation applied to drug discovery and development. Drug Discov. Today 7, S192–S196 (2002).
- Stokes, C.L. et al. A computer model of chronic asthma with application to clinical studies: example of treatment of exercise-induced asthma. J. Allergy. Clin. Immunol. 107, 933 (2001).
- Lewis, A.K. et al. The roles of cells and mediators in a computer model of chronic asthma. Inter. Arch. Allergy Immunol. 124, 282–286 (2001).
- Leckie, M.J. et al. Effects of an interleukin-5 blocking monoclonal antibody on eosinophils, airway hyper-responsiveness, and the late asthmatic response. Lancet 356, 2144–2148 (2000).
- Bergman, R.N., Ider, Y.Z., Bowden, C.R. & Cobelli, C. Quantitative estimation of insulin sensitivity. Am. J. Physiol. 236, E667–E677 (1979).
- Kansal, A.R. Modeling approaches to type 2 diabetes. Diabetes Technol. Ther. 6, 39–47 (2004).
- Eungdamrong, N.J. & Iyengar, R. Modeling cell signaling networks. Biol. Cell 96, 355–362 (2004).
- Bhalla, U.S. & Iyengar, R. Emergent properties of networks of biological signaling pathways. Science 283, 381–387 (1999).
- Kelley, B.P. et al. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res. 32, W83–W88 (2004).
- Coleman, R.A., Bowen, W.P., Baines, I.A., Woodrooffe, A.J. & Brown, A.M. Use of human tissue in ADME and safety profiling of development candidates. Drug Discov. Today 6, 1116–1126 (2001).
- Chanda, S.K. et al. Genome-scale functional profiling of the mammalian AP-1 signaling pathway. Proc. Natl. Acad. Sci. USA 100, 12153–12158 (2003).
- Haggarty, S.J., Koeller, K.M., Wong, J.C., Butcher, R.A. & Schreiber, S.L. Multidimensional chemical genetic analysis of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays. Chem. Biol. 10, 383–396 (2003).
- Borisy, A.A. et al. Systematic discovery of multicomponent therapeutics. Proc. Natl. Acad. Sci. USA 100, 7977–7982 (2003).
- Marton, M.J. et al. Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat. Med. 4, 1293–1301 (1998).
- Kunkel, E.J. et al. An integrative biology approach for analysis of drug action in models of human vascular inflammation. FASEB J. 18, 1279–1281 (2004).
- Kunkel, E.J. et al. Rapid structure-activity and selectivity analysis of kinase inhibitors by BioMAP analysis in complex human primary cell-based models. Assay Drug Dev. Technol. 2, 431–441 (2004).
- Plavec, I. et al. Method for analyzing signaling networks in complex cellular systems. Proc. Natl. Acad. Sci. USA 101, 1223–1228 (2004).
- Mach, F. Statins as novel immunomodulators: from cell to potential clinical benefit. Thromb. Haemost. 90, 607–610 (2003).
- Christopher, R. et al. Data-driven computer simulation of human cancer cell. Ann. NY Acad. Sci. 1020, 132–153 (2004).
- Wiley, H.S., Shvartsman, S.Y. & Lauffenburger, D.A. Computational modeling of the EGF-receptor system: a paradigm for systems biology. Trends Cell Biol. 13, 43–50 (2003).
- Schoeberl, B., Eichler-Jonsson, C., Gilles, E.D. & Muller, G. Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nat. Biotechnol. 20, 370–375 (2002).
- Eker, S. et al. Pathway logic: symbolic analysis of biological signaling. Pac. Symp. Biocomput. 7, 400–412 (2002).
- Cho, K.H., Shin, S.Y., Lee, H.W. & Wolkenhauer, O. Investigations into the analysis and modeling of the TNF alpha-mediated NF-kappa B-signaling pathway. Genome Res. 13, 2413–2422 (2003).
- Hoffmann, A., Levchenko, A., Scott, M.L. & Baltimore, D. The IkappaB-NF-kappaB signaling module: temporal control and selective gene activation. Science 298, 1241–1245 (2002).
- Laboratory of Immunology and Vascular Biology, Department of Pathology (5324), Stanford University School of Medicine, Stanford, California 94305-5324, USA.
- The Veterans Affairs Palo Alto Health Care System, Palo Alto, California 94304, USA.
- BioSeek Inc., 863-C Mitten Rd., Burlingame, California 94010, USA.
Vol.03 No.04(2014), Article ID:51154,5 pages
Extraordinary Potential of High Technologies Applications: A Literature Review and a Model of Assessment of Head and Neck Squamous Cell Carcinoma (HNSCC) Prognosis
Claudio Camuto1, Nerina Denaro2,3
1High Technology Department, ASO Santa Croce e Carle, Cuneo, Italy
2Oncology Department, ASO Santa Croce e Carle, Cuneo, Italy
3Human Pathology Department, Messina University, Messina, Italy
Copyright © 2014 by authors and Scientific Research Publishing Inc.
This work is licensed under the Creative Commons Attribution International License (CC BY).
Received 25 August 2014; revised 22 September 2014; accepted 20 October 2014
Head and neck squamous cell carcinoma (HNSCC) is the sixth most common cause of cancer mortality in the world and the fifth most commonly occurring cancer (Siegel, R. 2014). In the last few decades there has been growing interest in the data emerging from both tumor biology and multimodality treatment in HNSCC. A large number of new markers need to be managed with bioinformatics systems in order to process and correlate clinical and molecular data. Data mining algorithms are a promising tool for medical applications. We used this technology to correlate blood sample results with clinical outcome in 120 patients treated with chemoradiation for locally advanced HNSCC. We did not find a significant correlation, because of the small sample size, but our results show the potential of this tool.
Data Mining, Mining Software, Algorithm, Biomarker, Head and Neck Squamous Cell Carcinoma (HNSCC)
1.1. Data Mining
The term “data mining” usually refers to a set of algorithms for discovering hidden knowledge in very large amounts of heterogeneous data and for grouping data into categories. Data mining was originally developed in the field of economics to support managers in their decisions, but within a few years it was progressively introduced into other fields; in the last ten years its application in medicine has increased considerably, in particular for the processing of signals such as EEG and ECG. The main objective of this paper is to explain the data mining tool and to provide an example of its application in clinical practice. We also provide a brief review of data mining applications in clinical practice.
In this study we looked for a correlation between clinical outcome (tumor progression) and blood tests (white blood cell count, WBC; C-reactive protein, abbreviated here as PCR). We therefore applied data mining algorithms to blood test results. This tool should not be considered a definitive method for detecting progression, but it may play a prognostic role by jointly analyzing several variables. Normally blood tests, imaging and biomarkers are used to evaluate a patient's disease state. However, recent data suggest that high levels of inflammatory markers indicate a high probability of progression.
1.2. Medical Data
Head and neck carcinoma (HNC) is the sixth most common cancer worldwide.
Despite recent advances in the diagnosis and treatment of head and neck squamous cell carcinoma (HNSCC), there has been little evidence of improvement in 5-year survival rates over the last few decades. The most important risk factors are heavy exposure to alcohol and smoking and human papilloma virus (HPV) infections. These last two are also prognostic factors. Other common prognostic factors include T and N stage, synchronous multiple primary cancers, patient performance status and age.
Correlations with blood sample values have not been reported, although PCR and inflammation are now well known to contribute to both pathogenesis and toxic deaths.
The goal of this paper is to provide an example of a data mining application in HNSCC treated with chemo-radiation (CRT) or bio-radiation (bio-RT) at the S. Croce General Hospital in the years 2010 and 2011 in daily clinical practice.
We analyzed the blood sample results of 120 patients, all of whom were treated with chemo-radiation or bio-radiation at the S. Croce General Hospital in the years 2010 and 2011 in daily clinical practice. For each patient we analyzed the results for white blood cells (WBC), hemoglobin (HB), PCR and lactate before, during and after treatment.
The first steps of this work were loading and cleaning the original data. The original data was an Excel file with 57 columns storing information about patients, treatments, progressions and exams, not all of which was useful for this analysis. We started by saving the Excel file as CSV (comma-separated values), a format that is simpler to manage and to use with databases. The second step was to create a “tablespace” and a user in an Oracle database (Oracle Express Edition 11g) and to load all the data, using the PL/SQL Developer text data import tool, into a table called “tmpdata” with the same structure as the original data. Two new tables were then created: one to store patient information such as name, surname and birth date, and a second one for exams. In the patient table a unique code called “id” was generated for each patient; this code is used in the second table to link exams to a specific patient without using his or her personal data. The exam table contains the following columns:
Patient_code → id of the patient, linked with the patient table;
Age → the age of the patient when his or her information was first recorded;
Exam_age → the age of the patient on the day of the exam;
Exam_type → the type of exam (for example S-LDH);
Exam_result → the value of the exam;
Target_value → a value indicating whether the patient was in progression at that date and exam (initially empty).
This table was populated from tmpdata after cleaning the exam data: for example, “S-LDH”, “Sldh”, “SLDH” and similar spellings were treated as the same exam type. Exam age was calculated as the exam date minus the patient’s birth date.
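The two cleaning rules just described can be sketched in a few lines. This is a minimal Python illustration, not the authors’ Oracle/PL-SQL procedure; the function names and the `CANONICAL_EXAMS` mapping are hypothetical.

```python
import re
from datetime import date

# Hypothetical canonical spellings; only S-LDH is named in the text.
CANONICAL_EXAMS = {"SLDH": "S-LDH"}

def normalize_exam_type(raw: str) -> str:
    """Map spelling variants ('Sldh', 'SLDH', 's-ldh') to one exam name."""
    key = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    return CANONICAL_EXAMS.get(key, key)

def age_in_years(birth: date, exam: date) -> int:
    """Whole years elapsed between the birth date and the exam date."""
    years = exam.year - birth.year
    if (exam.month, exam.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in the exam year
    return years

print(normalize_exam_type("Sldh"))
print(age_in_years(date(1950, 6, 15), date(2011, 3, 1)))
```

Stripping punctuation and case before the lookup means that any new spelling variant of a known exam is handled without adding a rule.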
Mining Software Used and Algorithms
In this analysis we used a software package called WEKA. WEKA is free, open-source software developed in Java with a modular structure. It contains several mining algorithms, and new ones can be developed and added. WEKA can work with files in various formats or with databases; we chose a database because it let us manipulate the data easily. We analyzed the data using two classification algorithms, “J48-Decision Tree” and “Decision Table”. The first was chosen because it is among the best-performing algorithms in many cases and is commonly used; the second because the data are better analyzed as a table than as single records, so it seemed the best choice in this case.
The J48 algorithm is the WEKA team’s implementation of the C4.5 decision tree algorithm created by Quinlan.
It works as follows:
· It chooses from the attribute set the attribute that best discriminates the target attribute;
· For every value of the chosen attribute it creates a branch;
· It moves the data into the correct branch;
· For every branch it repeats the process until the branch contains only one element, or all elements of the branch have the same values (or range of values) and no discriminating attribute can be determined.
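The recursive procedure above can be sketched as follows. This is a simplified illustration and not WEKA’s J48: it handles only categorical attributes and selects by plain information gain, whereas C4.5 uses the gain ratio and also handles numeric ranges and pruning. The attribute names and the four toy records are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Return the attribute with the highest information gain."""
    base = entropy(labels)

    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Return a leaf (class label) or a node (attribute, {value: subtree})."""
    # Stop when the branch is pure or no attribute is left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in set(row[attr] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        branches[value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return (attr, branches)

def classify(tree, row):
    """Follow the branches until a leaf is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Invented toy records, for illustration only.
rows = [
    {"ldh": "high", "wbc": "normal"},
    {"ldh": "high", "wbc": "high"},
    {"ldh": "normal", "wbc": "normal"},
    {"ldh": "normal", "wbc": "high"},
]
labels = ["progression", "progression", "stable", "stable"]
tree = build_tree(rows, labels, ["ldh", "wbc"])
print(tree)
print(classify(tree, {"ldh": "high", "wbc": "normal"}))
```

In this sketch a record carrying an attribute value never seen during training raises a `KeyError`; a fuller implementation would fall back to the majority class of the node.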
The Decision Table algorithm works as follows:
· It chooses from the attribute set the attribute that best discriminates the target attribute;
· It creates a table with the attribute discrimination ranges in rows and the conditions in columns;
· It then creates a second table with the conditions and the corresponding categories;
· If a record’s attributes satisfy all the conditions of a category, the record is placed in that category;
· For every record it repeats the process, adding conditions to a category (if it already exists) or creating a new category.
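A much-reduced sketch of the idea is a lookup table from combinations of attribute values (the conditions) to the most frequent category seen for that combination. This is an illustration under stated assumptions, not WEKA’s Decision Table implementation: the attributes are taken as already selected and already discretized, and the records are invented.

```python
from collections import Counter, defaultdict

def build_decision_table(rows, labels, attributes):
    """Map each observed combination of attribute values to its majority class."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[a] for a in attributes)
        votes[key][label] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    # Fallback for combinations that never occurred in the training data.
    default = Counter(labels).most_common(1)[0][0]
    return table, default

def lookup(table, default, row, attributes):
    """Return the category whose conditions the record satisfies."""
    return table.get(tuple(row[a] for a in attributes), default)

# Invented toy records, for illustration only.
attributes = ["ldh", "pcr"]
rows = [
    {"ldh": "high", "pcr": "high"},
    {"ldh": "high", "pcr": "high"},
    {"ldh": "high", "pcr": "normal"},
    {"ldh": "normal", "pcr": "normal"},
    {"ldh": "normal", "pcr": "normal"},
]
labels = ["progression", "progression", "progression", "stable", "stable"]
table, default = build_decision_table(rows, labels, attributes)
print(lookup(table, default, {"ldh": "normal", "pcr": "normal"}, attributes))
```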
The analyzed population came from a single-institution experience with HNSCC patients treated with CRT or bio-RT. In this population the male/female ratio was 91/22, and heavy smokers (more than 10 pack/year) were 100/120. Primary sites were hypopharynx (28), larynx (24), oral cavity (21), oropharynx (26), rhinopharynx (4) and maxillary sinus (2), with 15 unknown primary HNSCC. We considered evaluable those patients with at least 10 blood test records during the treatment period.
PCR results were available for 52 patients. We found a correlation between PCR levels and worse outcome in 27/52 patients with PCR higher than 155 (overall survival shorter than 3 months).
LDH levels correlated with tumor progression. Of the 95 evaluable patients, 58 had higher LDH levels, which correlated with disease progression (35 local recurrences and 23 distant metastases).
A data mining analysis typically has five steps: collecting data, preprocessing data, creating a model of the data, testing the generated model, and applying the generated model to new or complete data.
4.1. Collecting Data
The main goal of the first step is to load all the heterogeneous data for the analysis and process them into a homogeneous structure. For example, some data may be stored as integers in Excel, other data as strings in a text file, and still other data as floating-point numbers in a database. At the end of the process we will probably have a table in a database where that field is a floating-point number, so the process converts the first kind of data to floating point by appending “.0”, simply parses the second into a number, and leaves the third unchanged. All of them are then loaded into a database (or another system such as a CSV file).
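The conversion to a single floating-point representation could look like the following sketch. The function name is hypothetical, and accepting a decimal comma (as in “7,5”) is an assumption about the source files, not something stated in the paper.

```python
def to_float(value):
    """Convert an int, float or numeric string to float; return None otherwise."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int in Python; treat it as non-numeric
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        text = value.strip().replace(",", ".")  # assumed: decimal comma may occur
        try:
            return float(text)
        except ValueError:
            return None
    return None

# One value from each kind of source: spreadsheet integer, text string, database float.
print([to_float(v) for v in (12, "7,5", 3.2, "n/a")])
```

Returning `None` instead of raising keeps unparseable cells visible, so they can be reviewed before the data are loaded into the database.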
4.2. Preprocessing Data
Data mining is very powerful for understanding data and their relationships, but these analyses are computationally heavy and slow. The data are therefore usually pre-analyzed before the actual processing, to remove non-significant data and to add pre-computed information that helps the mining process. For example, when analyzing a set of patients, the mining process does not need names and surnames, so that information can be safely removed without harming the analysis, and the algorithm runs faster because it has fewer fields to process. It is also important to add pre-calculated data: for example, if the number of days after an event is relevant to the analysis, we should calculate it before processing so that the algorithm has an additional field to help it make decisions.
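Both preprocessing operations, removing irrelevant fields and adding a pre-calculated one, fit in a short sketch. The field names (`name`, `surname`, `exam_date`, `days_since_event`) and the sample record are hypothetical.

```python
from datetime import date

def preprocess(record, event_date, drop=("name", "surname")):
    """Remove identifying fields and add the number of days since an event."""
    cleaned = {k: v for k, v in record.items() if k not in drop}
    cleaned["days_since_event"] = (record["exam_date"] - event_date).days
    return cleaned

# Invented record, for illustration only.
record = {"name": "A", "surname": "B", "exam_date": date(2011, 3, 11), "wbc": 6.1}
cleaned = preprocess(record, event_date=date(2011, 3, 1))
print(cleaned)
```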
4.3. Creating a Model of Data
Data mining algorithms group data into a limited set of groups called “classes”. The basic rules are that an element belongs to exactly one class, and that elements in the same class are similar to each other and different from elements of other classes. A classification algorithm analyzes the attributes of an object (every data element is an object, for example a patient with his or her exams) and decides the class of the element. The core of data mining is the creation of a model of the data: a decision model used to choose the class in which to put new elements. There are several algorithms for generating models; one of the most popular is the “decision tree” model. It has three kinds of elements: “decision node”, “leaf” and “branch”. The resulting model resembles a tree with an initial node (often called the root) that has two or more branches; a branch leads either to a decision node with further branches or to a leaf, which is an end point. The most important parts are the decision nodes; every decision node has a set of binary rules such as “greater than…”, “equal to” and so on. To generate this model the algorithm needs some data whose class is known (at least one element for each class); this set of data is called the “training set”. An example of data and the generated model follows:
In the generated model the type of object is irrelevant. The tree has two levels.
4.4. Testing Generated Model
To generate a model, people usually submit at least 1/3 of the total data and use the remainder to test the model. The training data (1/3) include the associated class; the other data, often called the “test set”, also include this information, but it is not submitted to the algorithm. The algorithm takes the test set and the previously generated model and returns a class for each data record. The automatically assigned classes are then compared with the real ones: if the confidence is better than 90% the model is usable; otherwise we try to generate the model again using another training set (training data are chosen by selecting random records from the full data; the remaining data become the test set).
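The split-and-check loop described here can be sketched as below. The function names are hypothetical, and “confidence” is interpreted as the fraction of test records whose predicted class matches the real one, which is an assumption about the authors’ meaning.

```python
import random

def split_train_test(records, train_fraction=1 / 3, seed=0):
    """Randomly pick the training records; the remainder becomes the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = max(1, round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Fraction of records whose predicted class equals the real class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def is_usable(predicted, actual, threshold=0.90):
    """Apply the 90% acceptance rule from the text."""
    return accuracy(predicted, actual) >= threshold

train, test = split_train_test(list(range(30)))
print(len(train), len(test))
print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
```

If `is_usable` returns `False`, calling `split_train_test` again with a different `seed` gives a new random training set, which corresponds to the retry step in the text.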
4.5. Applying Generated Model to New/Complete Data
Once the model has been created and tested and is considered stable (confidence factor equal to or better than 90%), it can be applied to the full data set and to new data to decide the corresponding class. For example, if we have a model that can distinguish between healthy and non-healthy patients, we can use it to determine the health status of a newly submitted patient. A model is never perfect, so it is good practice to update it periodically with new data.
So far we have discussed classification, but there is another important group of data mining algorithms called “clustering algorithms”. Their main aim is to discover classes automatically and to assign data to them. They analyze a data set without class labels and put similar data in the same class; a new class is generated when a data item is very different from the items in the existing classes, and the process is repeated until the classification is stable. The core of the algorithm is the “distance function”, a function that takes two data items and returns a value representing the distance between them; in other words, it represents how different the two objects are. Clustering is used when the classes are not known in advance; for example, we could analyze the purchases of credit card users to classify the users into categories.
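A distance function and the “new class when an item is very different” rule can be illustrated with a single-pass sketch. The choice of Euclidean distance, the `threshold` parameter and the sample points are assumptions made for the example; the iterative refinement until the classification is stable, mentioned above, is not implemented here.

```python
import math

def euclidean(a, b):
    """Distance function: how different two numeric records are."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leader_clustering(points, threshold):
    """Assign each point to the nearest cluster leader within the threshold,
    or start a new cluster when every existing leader is too far away."""
    clusters = []  # each cluster is a list of points; its first point is the leader
    for p in points:
        best = None
        for cluster in clusters:
            d = euclidean(p, cluster[0])
            if d <= threshold and (best is None or d < best[0]):
                best = (d, cluster)
        if best is None:
            clusters.append([p])
        else:
            best[1].append(p)
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0)]
clusters = leader_clustering(points, threshold=2)
print(clusters)
```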
Several data mining approaches are routinely used in research work; these include dose-volume metrics, equivalent uniform dose, the mechanistic Poisson model, and model-building methods using statistical regression and machine learning techniques. Their application in daily clinical practice could shorten the time needed to obtain information from biomarkers or from physical or genetic variables.
From a brief review of the English-language literature on cancer patients, we concluded that automated software analysis will significantly reduce the overall time required to complete daily biological-radiological or physics studies (such as dose-volume studies in radiotherapy, microarray analyses and genetic processing). Many tools are available for automated digital acquisition of images of the spots on a microarray slide.
This study provides an example of future applications of high technology in oncology. In the era of microarrays and personalized medicine these instruments are fundamental. Furthermore, since the clinical approach to HNC patients is well recognized to require a multidisciplinary team (including ENT surgeons, radiation oncologists, medical oncologists and speech and language specialists), the future global approach cannot work without close cooperation between HT engineers and biologists.
No correlation between elevated and reduced blood test values was found. The data set is too small to be interpreted, but our analyses show the potential of this tool for evaluating correlations among a huge number of records.
- Banville, D.L. (2009) Mining Chemical and Biological Information from the Drug Literature. Current Opinion in Drug Discovery & Development, 12, 376-387.
- Zhu, F., Patumcharoenpol, P., Zhang, C., Yang, Y., Chan, J., Meechai, A., Vongsangnak, W. and Shen, B. (2013) Biomedical Text Mining and Its Applications in Cancer Research. Journal of Biomedical Informatics, 46, 200-211. http://dx.doi.org/10.1016/j.jbi.2012.10.007
- Siegel, R., Naishadham, D. and Jemal, A. (2013) Cancer statistics, 2013. CA: A Cancer Journal for Clinicians, 63, 11-30. http://dx.doi.org/10.3322/caac.21166
- Denaro, N., Russi, E.G., Adamo, V. and Merlano, M.C. (2014) State-of-the-Art and Emerging Treatment Options in the Management of Head and Neck Cancer: News from 2013. Oncology, 86, 212-229.
- Ang, K.K., Harris, J., Wheeler, R., Weber, R., Rosenthal, D.I., Nguyen-Tân, P.F., Westra, W.H., Chung, C.H., Jordan, R.C., Lu, C., Kim, H., Axelrod, R., Silverman, C.C., Redmond, K.P. and Gillison, M.L. (2010) Human Papillomavirus and Survival of Patients with Oropharyngeal Cancer. New England Journal of Medicine, 363, 24-35. http://dx.doi.org/10.1056/NEJMoa0912217
- Wu, J., Ho, C., Laskin, J., Gavin, D., Mak, P., Duncan, K., French, J., McGahan, C., Reid, S., Chia, S. and Cheung, H. (2013) The Development of a Standardized Software Platform to Support Provincial Population-Based Cancer Outcomes Units for Multiple Tumour Sites: OaSIS (Outcomes and Surveillance Integration System). Studies in Health Technology and Informatics, 183, 98-103.
- Naqa, I.E, Deasy, J.O., Mu, Y., Huang, E., Hope, A.J., Lindsay, P.E., Apte, A., Alaly, J. and Bradley, J.D. (2010) Datamining Approaches for Modeling Tumor Control Probability. Acta Oncologica, 49, 1363-73. http://dx.doi.org/10.3109/02841861003649224
- Spencer, S.J., Bonnin, D.A., Deasy, J.O., Bradley, J.D. and El Naqa, I. (2009) Bioinformatics Methods for Learning Radiation-Induced Lung Inflammation from Heterogeneous Retrospective and Prospective Data. Journal of Biomedicine and Biotechnology, 2009, 1-14. http://dx.doi.org/10.1155/2009/892863
- Published: December 24, 2009
- DOI: 10.1371/journal.pcbi.1000597
Citation: Rodriguez-Esteban R (2009) Biomedical Text Mining and Its Applications. PLoS Comput Biol 5(12): e1000597. doi:10.1371/journal.pcbi.1000597
Editor: Fran Lewitter, Whitehead Institute, United States of America
Copyright: © 2009 Raul Rodriguez-Esteban. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author received no specific funding for this work.
Competing interests: The author has declared that no competing interests exist.
This tutorial is intended for biologists and computational biologists interested in adding text mining tools to their bioinformatics toolbox. As an illustrative example, the tutorial examines the relationship between progressive multifocal leukoencephalopathy (PML) and antibodies. Recent cases of PML have been associated with the administration of some monoclonal antibodies such as efalizumab. Those interested in a further introduction to text mining may also want to read other reviews.
Understanding large amounts of text with the aid of a computer is harder than simply equipping a computer with a grammar and a dictionary. A computer, like a human, needs certain specialized knowledge in order to understand text. The scientific field dedicated to training computers with the right knowledge for this task (among other tasks) is called natural language processing (NLP). Biomedical text mining (henceforth, text mining) is the subfield that deals with text that comes from biology, medicine, and chemistry (henceforth, biomedical text). Another popular name is BioNLP, which some practitioners use as synonymous with text mining.
Biomedical text is not a homogeneous realm. Medical records are written differently from scientific articles, sequence annotations, or public health guidelines. Moreover, local dialects are not uncommon. For example, medical centers develop their own jargons and laboratories create their idiosyncratic protein nomenclatures. This variability means, in practice, that text mining applications are tailored to specific types of text. In particular, for reasons of availability and cost, many are designed for scientific abstracts in English from Medline.
A term is a name used in a specific domain, and a terminology is a collection of terms. Terms abound in biomedical text, where they constitute important building blocks. Some examples of terms are the names of cell types, proteins, medical devices, diseases, gene mutations, chemical names, and protein domains. Due to their importance, text miners have worked to design algorithms that recognize terms (see examples in Figure 1). The task of recognizing terms is also called named entity recognition in the text mining literature, although this NLP task is broader and goes beyond recognition of terms. Although the concept of term is intuitive (or, perhaps, because it is intuitive), terms are hard to define precisely. For example, the text “early progressive multifocal leukoencephalopathy” could possibly refer to any, or all, of these disease terms: “early progressive multifocal leukoencephalopathy,” “progressive multifocal leukoencephalopathy,” “multifocal leukoencephalopathy,” and “leukoencephalopathy.” To overcome such dilemmas, text miners ask experts to identify terms within collections of text such as sets of selected Medline abstracts. These annotations are then used to train a computer by example, so that the computer can emulate the knowledge experts deploy when they read biomedical text. This pedagogical method, “teaching by example,” is a common approach used in many text mining tasks and it is more generally called supervised training. (Alternatively, text miners create rules using expert knowledge.) Thus, text miners rely heavily on collections of text (corpora) that have been annotated by experts (see compilations of corpora: http://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html; http://compbio.uchsc.edu/ccp/corpora/obtaining.shtml). Before beginning a text mining task, it is advisable to limit the scope of the task to a corpus made of a set of documents around the topic of interest.
In our case, a PML corpus could comprise all the Medline abstracts that mention the term “progressive multifocal leukoencephalopathy,” because this is an unambiguous term. Another relevant corpus to consider could be the ImmunoTome, which is focused on immunology.
Figure 1. (A) Text marked with protein (blue), disease (crimson), Gene Ontology (bright red), chemical (dark red), and species (red) terms by Whatizit with the whatizitEBIMedDiseaseChemicals pipeline. (B) Text marked with protein and cell line terms by ABNER. (C) Protein terms identified by the prototype BIOCreAtIvE metaserver. In the example shown, the metaserver combines the output of systems hosted in three servers.
Text miners are interested in terminologies that have been built manually. These controlled terminologies have notable roles in biomedicine, for example, the HUGO gene nomenclature, the ICD disease classification, or the Gene Ontology. Many of these terminologies are more than just a flat list of terms. Some include term synonyms (thesauri) or relations between terms (taxonomies, ontologies). For text miners, their usefulness comes from their ability to link to information. Once a text is mapped to one of these terminologies, a bridge is opened between the text and other resources. This usefulness justifies efforts such as the National Library of Medicine’s manual mapping of Medline abstracts to the Medical Subject Headings (MeSH) terminology. In our example, MeSH can be used to make the PML corpus more focused by restricting it only to abstracts with the MeSH term “leukoencephalopathy, progressive multifocal.” Controlled terminologies can be used to annotate results from experiments and databases . Text miners attempt to make such mappings automatically. For example, a task called gene normalization consists in recognizing names of genes in text and mapping them to their corresponding gene identifiers (e.g., Entrez Gene ID). Thus, using gene normalization it is possible to identify all the abstracts in Medline that mention a given gene from Entrez Gene.
Because there are many controlled terminologies, some terminologies have been created to map between them. For example, the BioThesaurus is a compilation of protein synonyms from several terminologies. The Unified Medical Language System (UMLS) is a grand compilation of more than 120 terminologies and close to 4 million terms. Despite UMLS’s size, all controlled terminologies are incomplete, because new terms are created too quickly to keep them up to date. Furthermore, all have gaps and areas of emphasis that conflict with the needs of users.
Tools for Terms
Whatizit is a tool that recognizes several types of terms. It can be accessed through a Web interface, Web services, or a streamed servlet. ABNER is a standalone application that recognizes five types of terms: protein, DNA, RNA, cell line, and cell type. More specialized term recognition has been used, for example, for databases such as LSAT for alternative transcripts and PepBank for peptides. Text miners have also used terminologies to enrich PubMed’s search capabilities. Some recent search engines are semedico, novo|seek, and GoPubMed/GoGene.
After recognizing terms, the natural next step is to look for relationships between terms. The simplest method to identify relationships is using the co-occurrence assumption: terms that appear in the same texts tend to be related. For example, if a protein is mentioned often in the same abstracts as a disease, it is reasonable to hypothesize that the protein is involved in some aspect of the disease. The degree of co-occurrence can be quantified statistically to rank and eliminate statistically weak co-occurrences (see Box 1). An example using GoGene can illustrate the use of simple co-occurrence, MeSH terms, and gene normalization. The query “leukoencephalopathy, progressive multifocal”[mh] in GoGene returns all the genes mentioned in Medline abstracts annotated with the MeSH term for PML. The genes that appear most often are likely to be related to PML. Those that appear disproportionately more often for PML than for other diseases are likely to be more specific to PML.
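As a rough illustration of quantifying co-occurrence, the sketch below counts how often two terms share a document and scores each pair with pointwise mutual information (PMI). PMI is one common choice and is an assumption here, not necessarily the statistic the tutorial’s Box 1 describes; the term names and documents are placeholders.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_pmi(documents):
    """documents: list of sets of terms. Returns {(a, b): (joint_count, pmi)}."""
    n = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in documents:
        term_counts.update(terms)
        pair_counts.update(combinations(sorted(terms), 2))
    scores = {}
    for (a, b), joint in pair_counts.items():
        # PMI compares the observed joint frequency with the frequency
        # expected if the two terms were independent.
        pmi = math.log2(joint * n / (term_counts[a] * term_counts[b]))
        scores[(a, b)] = (joint, pmi)
    return scores

# Placeholder abstracts, each reduced to the set of terms recognized in it.
documents = [
    {"disease_X", "gene_A"},
    {"disease_X", "gene_A"},
    {"disease_X", "gene_B"},
    {"disease_Y", "gene_B"},
]
scores = cooccurrence_pmi(documents)
for pair, (joint, pmi) in sorted(scores.items(), key=lambda kv: -kv[1][1]):
    print(pair, joint, round(pmi, 3))
```

A gene that co-occurs with a disease more often than its overall frequency predicts gets a positive score, which mirrors the point about genes that appear disproportionately often for one disease.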
Better evidence than co-occurrence comes from relationships that are described explicitly. For example, the sentence “We describe a PML in a 67-year-old woman with a destructive polyarthritis associated with anti-JO1 antibodies treated with corticosteroids” describes an explicit link between PML and anti-JO1 antibodies. We can simplify this relationship into a triplet of two terms and a verb: PML is associated with anti-JO1 antibodies. To create the triplet, the verb can be identified with the aid of a part-of-speech (POS) tagger. An example of a POS tagger for biomedical text is MedPost. This triplet representation is powerful due to its simplicity, but it omits crucial details from the original article, such as the fact that the evidence comes from a clinical case study.
A heavily studied area in text mining concerns the relationships known as protein-protein interactions (PPI). Using the triplet representation, PPI can be depicted as network graphs with the proteins as nodes and the verbs as edges (see Figure 2). When analyzing text-mined interaction networks, it is important to understand the information that underpins them. For example, interactions can be direct (physical) or indirect, depending on the verb (examples of direct verbs are to bind, to stabilize, to phosphorylate; examples of indirect verbs are to induce, to trigger, to block). The different nature of the protein interactions described in the literature reflects in part the experimental methodology employed and the nature of the interaction itself. A common way to capture the textual variations is by exhaustively identifying all the patterns that appear and writing a set of rules that capture them. For example, a simple pattern to capture phosphorylations might involve, sequentially, a kinase name, a form of the verb to phosphorylate, and a substrate name.
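The phosphorylation pattern described above (kinase name, verb form, substrate name, in that order) can be written as a regular expression. This is a minimal sketch: the name lists are hypothetical placeholders standing in for real dictionaries, and allowing up to three intervening words is an arbitrary choice.

```python
import re

KINASES = ["KinaseA", "KinaseB"]        # hypothetical dictionary of kinase names
SUBSTRATES = ["ProteinX", "ProteinY"]   # hypothetical dictionary of substrate names

def alternation(names):
    """Build a regex alternation, longest names first so they match greedily."""
    return "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))

PATTERN = re.compile(
    rf"\b(?P<kinase>{alternation(KINASES)})\b"
    r"\s+(?:\w+\s+){0,3}?"                        # up to three intervening words
    r"(?P<verb>phosphorylat(?:es|ed|e|ing))\b"
    r"\s+(?:\w+\s+){0,3}?"
    rf"(?P<substrate>{alternation(SUBSTRATES)})\b"
)

def extract_phosphorylations(text):
    """Return (kinase, verb, substrate) triplets found in the text."""
    return [(m["kinase"], m["verb"], m["substrate"]) for m in PATTERN.finditer(text)]

print(extract_phosphorylations("KinaseA directly phosphorylates ProteinX in vitro."))
```

A passive sentence such as “ProteinX is phosphorylated by KinaseA” is not matched by this single pattern, which illustrates why rule-based systems need a set of rules covering each textual variation.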
Figure 2. The nodes are proteins identified using the query: “leukoencephalopathy, progressive multifocal”[mh] antibody[pubmed] in GoGene. The query retrieves gene symbols mapped to PubMed abstracts that include the keyword antibody and the MeSH term leukoencephalopathy, progressive multifocal (PML). The gene list was exported to SIF format and the gene symbols extracted and used to query PPI using iHOP Web services. Only those iHOP interactions with at least two co-occurrences and confidence above zero were considered. The network was plotted using Cytoscape. The node color is based on the number of interactions (node degree).
Tools for Relationships
To see co-occurrence in action, try FACTA. MedGene and BioGene use co-occurrence for gene prioritization. Gene prioritization tools such as Endeavour and G2D use text as well as other data sources. PolySearch uses heuristic weighting of different co-occurrence measures and includes a detailed guide to implementation and vocabularies. Anni uses textual profiles instead of co-occurrence to measure relationship between terms. For PPI, iHOP is the most popular tool. RLIMS-P uses linguistic patterns to detect the kinase, substrate, and phosphosite in a phosphorylation. E3Miner detects ubiquitinations, including contextual information.
Besides finding relationships, text miners are also interested in discovering relationships. Due to the size of the literature, scientists miss links between their work and other, related work. Swanson called these links “undiscovered public knowledge.” In a classic example he found by careful reading 11 links between magnesium and migraine that had been neglected. One method to discover relationships is based on transitive inference. Simply stated, if A is linked to B, and B is linked to C, then there is a chance that A is linked to C. PPI networks are, at the core, an example of transitive inference. Arrowsmith is a basic discovery tool that compares two literature sets to find links between them. Applying Arrowsmith to the literature for PML and antibodies yields the immunomodulator tacrolimus, a calcineurin inhibitor, among the top hits. Tacrolimus affects the production of several proteins depicted in Figure 2, such as IL-2.
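Transitive inference over a set of known links can be sketched as follows. This is a toy illustration of the “A to B, B to C, hence possibly A to C” rule, not the method used by Arrowsmith; the node names are placeholders.

```python
from collections import defaultdict

def infer_hidden_links(links):
    """links: iterable of undirected (a, b) pairs.
    Returns {(a, c): set of intermediates b} for pairs not directly linked."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    neighbours = dict(neighbours)  # freeze: no keys are added while iterating
    candidates = defaultdict(set)
    for b, linked in neighbours.items():
        for a in linked:
            for c in linked:
                # a < c avoids reporting each pair twice and pairing a with itself.
                if a < c and c not in neighbours[a]:
                    candidates[(a, c)].add(b)
    return dict(candidates)

links = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
hidden = infer_hidden_links(links)
for pair, via in sorted(hidden.items()):
    print(pair, sorted(via))
```

Candidate pairs supported by several intermediates, such as A and C in this example, would normally be ranked above those supported by only one.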
The most common measure of output quality in text mining is the F-measure, which is the harmonic mean of two other measures, precision and recall. These three measures can be described with the analogy of searching for needles in a haystack. After a manual search of a haystack, our hands end up full with valuable needles but also with some useless straws. Recall is based on the number of needles found. High recall means that we have found most of the needles for which we were looking. Precision, however, is based on the number of both needles and straws. High precision means that we have retrieved far more needles than straws. Both high precision and high recall are desirable, and a high F-measure reflects both because it is the harmonic mean. Optimizing the F-measure of a text mining application is often different from optimizing the accuracy, because there are usually few needles and large amounts of hay in the haystack. An application that identifies the whole haystack as being only hay is quite accurate but misses all the needles.
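The needle-and-haystack argument can be checked numerically. In the sketch below, needles found are true positives, straws picked up are false positives, needles missed are false negatives, and hay correctly left behind is true negatives; the haystack of 10 needles and 990 straws is an invented example.

```python
def evaluate(tp, fp, fn, tn):
    """Return precision, recall, F-measure (harmonic mean) and accuracy."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy

# Calling everything "hay": very accurate, but every needle is missed.
print(evaluate(tp=0, fp=0, fn=10, tn=990))
# Finding 8 of the 10 needles while picking up 2 straws.
print(evaluate(tp=8, fp=2, fn=2, tn=988))
```

The first call gives an accuracy of 0.99 with an F-measure of zero, which is the situation the paragraph warns about.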
It is important to ponder over the way an application has been evaluated before assessing its F-measure, and especially to consider how realistic the evaluation was. The F-measure is not an absolute value. The larger a haystack is, the more difficult it is to find needles. In other words, a low F-measure might reflect a harder task, not a worse application. Moreover, text mined applications may perform differently in different types of text and this may be reflected in lower F-measures than advertised. When the F-measure attainable is not high enough, one solution is to use text mining as a filter. A filter needs high recall, but only moderate precision, to reduce the amount of hay without affecting the needles. Filtering with text mining is used as a preliminary step in databases such as MINT, DIP, and BIND. Filtering is followed by human curation, which involves the review and assessment of results to reduce hay and, hopefully, provide feedback to improve the filtering. The feedback loop between text mining and curation can have an incremental positive impact in output results.
Doing comprehensive text mining means considering all sources of information: Medline and beyond. The abstract conveys an article’s main findings, but many other pieces of information are elsewhere in the full text, figures, tables, supplementary information, references, databases, Web sites, and multimedia files. In particular, the full text is critical for information that rarely appears in abstracts, such as experimental measurements. A more comprehensive PML corpus would include full text articles; however, despite the surge in open access articles (see the Directory of Open Access Journals, www.doaj.org), the majority of published articles have access and processing restrictions. PubMed Central is the main source of open access articles, and the specialized search engines BioText, Yale Image Finder, and Figurome search PubMed Central figures and tables. A search for “progressive multifocal leukoencephalopathy” in the Yale Image Finder yields only one figure, while a search for “PML” yields a large number of hits, most of them not relevant because PML is an ambiguous acronym.
Text and DNA
Considering text as a sequence of symbols as informative as a protein’s DNA sequence is the underlying premise of many text mining tools for bioinformatics. For example, the linguistic similarity between protein corpora (sets of texts built around proteins) correlates with the BLAST score between those same proteins. Text that is used in articles or database annotations to describe a protein can be used for protein clustering and to predict structure, subcellular localization, and function. For example, a protein corpus of a protein located in the nucleus uses a vocabulary that is somewhat different from a corpus built around a secreted protein. These vocabulary differences can be used to predict the subcellular localization of a protein of unknown location. One way to measure vocabulary differences is to represent the texts as vectors of word counts. The word counts can be normalized by the size of the text they come from and the vectors compared using, for example, Euclidean distance. To reduce vector dimensionality, some words can be grouped using a method called stemming. A simple example of stemming is converting plural nouns into singular form and verbs into infinitive form (a widely used stemming algorithm is the Porter stemmer). Additional simplification can be achieved via tokenization, because some words can be separated into constitutive elements called tokens. In English, however, most words are a single token. An example of a word of two tokens is don’t.
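The vector comparison described above can be sketched with the standard library alone. The stemming rule here is a deliberately naive plural-stripping toy, not the Porter stemmer, and the two sample texts are invented.

```python
import math
import re
from collections import Counter

def naive_stem(word):
    """Toy stemming: strip a final 's' from longer words (not the Porter stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def word_vector(text):
    """Word counts normalized by the number of words in the text."""
    tokens = [naive_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def euclidean_distance(u, v):
    """Distance between two sparse vectors stored as dictionaries."""
    return math.sqrt(sum((u.get(w, 0.0) - v.get(w, 0.0)) ** 2 for w in set(u) | set(v)))

nuclear = word_vector("nuclear proteins bind DNA in the nucleus")
secreted = word_vector("secreted proteins leave the cell")
print(round(euclidean_distance(nuclear, secreted), 3))
```

Normalizing by text length keeps a long corpus from looking different from a short one merely because it contains more words.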
Text mining applications for bioinformatics include subcellular localization prediction such as Sherloc and Epiloc, and protein clustering such as TXTGate. Thus, text mining tools can be used for annotating biological databases in the same fashion other bioinformatics tools are used.
An extensive list of text mining applications is maintained at http://zope.bioinfo.cnio.es/bionlp_tools/. A growing number of tools are being developed under a standard framework called UIMA, which comprises NLP as well as BioNLP tools.
Text mining tools are increasingly accessible to biologists and computational biologists, and they can often be applied to answer scientific questions in combination with other bioinformatics tools. Getting acquainted with them is a first step towards grasping the possibilities of text mining and towards venturing into the algorithms described in the literature. One way to get started on this path is by looking at published examples.
- 1.Sobell JM, Weinberg JM (2009) Patient fatalities potentially associated with efalizumab use. J Drugs Dermatol 8: 215.
- 2.Cohen KB, Hunter L (2008) Getting started in text mining. PLoS Comput Biol 4: e20. doi:10.1371/journal.pcbi.0040020.
- 3.Rzhetsky A, Seringhaus M, Gerstein MB (2009) Getting started in text mining: part two. PLoS Comput Biol 5: e1000411. doi:10.1371/journal.pcbi.1000411.
- 4.Rzhetsky A, Seringhaus M, Gerstein M (2008) Seeking a new biology through text mining. Cell 134: 9–13.
- 5.Friedman C, Kra P, Rzhetsky A (2002) Two biomedical sublanguages: a description based on the theories of Zellig Harris. J Biomed Inform 35: 222–235.
- 6.Netzel R, Perez-Iratxeta C, Bork P, Andrade MA (2003) The way we write. EMBO Rep 4: 446–451.
- 7.Krauthammer M, Nenadic G (2004) Term identification in the biomedical literature. J Biomed Inform 37: 512–526.
- 8.Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6(Suppl 1): S3.
- 9.Kabiljo R, Shepherd AJ (2008) Protein name tagging in the immunological domain. Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008) 141–144.
- 10.Lu X, Zhai C, Gopalakrishnan V, Buchanan BG (2004) Automatic annotation of protein motif function with Gene Ontology terms. BMC Bioinformatics 5: 122.
- 11.Morgan AA, Lu Z, Wang X, Cohen AM, Fluck J, et al. (2008) Overview of BioCreative II gene normalization. Genome Biol 9(Suppl 2): S3.
- 12.Liu H, Hu ZZ, Zhang J, Wu C (2006) BioThesaurus: a web-based thesaurus of protein and gene names. Bioinformatics 22: 103–105. Available: http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml.
- 13.Bangalore A, Thorn KE, Tilley C, Peters L (2003) The UMLS knowledge source server: an object model for delivering UMLS data. AMIA Annu Symp Proc 51–55. Available: http://www.nlm.nih.gov/research/umls/.
- 14.Aronson AR (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 17–21. Available: http://mmtx.nlm.nih.gov/.
- 15.Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A (2008) Text processing through web services: calling Whatizit. Bioinformatics 24: 296–298. Available: http://www.ebi.ac.uk/webservices/whatizit/info.jsf.
- 16.Settles B (2005) ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics 21: 3191–3192. Available: http://pages.cs.wisc.edu/~bsettles/abner/.
- 17.Shah PK, Bork P (2006) LSAT: learning about alternative transcripts in MEDLINE. Bioinformatics 22: 857–865. Available: http://www.bork.embl.de/LSAT.
- 18.Shtatland T, Guettler D, Kossodo M, Pivovarov M, Weissleder R (2007) PepBank–a database of peptides based on sequence text mining and public peptide data sources. BMC Bioinformatics 8: 280. Available: http://pepbank.mgh.harvard.edu/.
- 19.Wermter J, Tomanek K, Hahn U (2009) High-performance gene name normalization with GeNo. Bioinformatics 25: 815–821. Available: http://www.semedico.org/.
- 20.Alonso-Allende R (2009) Accelerating searches of research grants and scientific literature with novo|seek. Nat Methods 6. Advertising feature. Available: http://www.novoseek.com/.
- 21.Doms A, Schroeder M (2005) GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res 33: W783–W786. Available: http://www.gopubmed.com.
- 22.Plake C, Royer L, Winnenburg R, Hakenberg J, Schroeder M (2009) GoGene: gene annotation in the fast lane. Nucleic Acids Res 37(Web Server issue) W300–W304. Available: http://www.gopubmed.org/gogene/.
- 23.Shatkay H, Pan F, Rzhetsky A, Wilbur WJ (2008) Multi-dimensional classification of biomedical text: toward automated, practical provision of high-utility text to diverse users. Bioinformatics 24: 2086–2093.
- 24.Viallard JF, Lazaro E, Ellie E, Eimer S, Camou F, et al. (2007) Improvement of progressive multifocal leukoencephalopathy after cidofovir therapy in a patient with a destructive polyarthritis. Infection 35: 33–36.
- 25.Smith L, Rindflesch T, Wilbur WJ (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20: 2320–2321. Available: http://www.ncbi.nlm.nih.gov/staff/lsmith/MedPost.html.
- 26.Santos C, Eggle D, States DJ (2005) Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics 21: 1653–1658.
- 27.Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 17(Suppl 1): S74–S82.
- 28.Blaschke C, Valencia A (2001) The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform 12: 123–134.
- 29.Hu ZZ, Narayanaswamy M, Ravikumar KE, Vijay-Shanker K, Wu CH (2005) Literature mining and database annotation of protein phosphorylation using a rule-based system. Bioinformatics 21: 2759–2765.
- 30.Yuan X, Hu ZZ, Wu HT, Torii M, Narayanaswamy M, et al. (2006) An online literature mining tool for protein phosphorylation. Bioinformatics 22: 1668–1669. Available: http://pir.georgetown.edu/pirwww/iprolink/rlimsp.shtml.
- 31.Tsuruoka Y, Tsujii J, Ananiadou S (2008) FACTA: a text search engine for finding associated biomedical concepts. Bioinformatics 24: 2559–2560. Available: http://text0.mib.man.ac.uk/software/facta/.
- 32.Hu Y, Hines LM, Weng H, Zuo D, Rivera M, et al. (2003) Analysis of genomic and proteomic data using advanced literature mining. J Proteome Res 2: 405–412. Available: http://medgene.med.harvard.edu/MEDGENE/.
- 33.Rolfs A, Hu Y, Ebert L, Hoffmann D, Zuo D, et al. (2008) A biomedically enriched collection of 7000 human ORF clones. PLoS ONE 3: e1528. Available: http://biogene.med.harvard.edu/BIOGENE/.
- 34.Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, et al. (2006) Gene prioritization through genomic data fusion. Nat Biotechnol 24: 537–544. Available: http://homes.esat.kuleuven.be/~bioiuser/endeavour/endeavour.php.
- 35.Perez-Iratxeta C, Wjst M, Bork P, Andrade MA (2005) G2D: a tool for mining genes associated with disease. BMC Genet 6: 45.
- 36.Cheng D, Knox C, Young N, Stothard P, Damaraju S, et al. (2008) PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res 36: W399–W405. Available: http://wishart.biology.ualberta.ca/polysearch/index.htm.
- 37.Jelier R, Schuemie MJ, Veldhoven A, Dorssers LC, Jenster G, et al. (2008) Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biol 9: R96. Available: http://www.biosemantics.org/index.php?page=anni-2-0.
- 38.Hoffmann R, Valencia A (2004) A gene network for navigating the literature. Nat Genet 36: 664. Available: http://www.ihop-net.org/.
- 39.Lee H, Yi GS, Park JC (2008) E3Miner: a text mining tool for ubiquitin-protein ligases. Nucleic Acids Res 36: W416–W422. Available: http://e3miner.biopathway.org.
- 40.Swanson DR (1988) Migraine and magnesium: eleven neglected connections. Perspect Biol Med 31: 526–557.
- 41.Weeber M, Kors JA, Mons B (2005) Online tools to support literature-based discovery in the life sciences. Brief Bioinform 6: 277–286.
- 42.Smalheiser NR, Torvik VI, Zhou W (2009) Arrowsmith two-node search interface: a tutorial on finding meaningful links between two disparate sets of articles in MEDLINE. Comput Meth Program Biomed 94: 190–197. Available: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/start.cgi.
- 43.Caporaso JG, Deshpande N, Fink JL, Bourne PE, Cohen KB, et al. (2008) Intrinsic evaluation of text mining tools may not predict performance on realistic tasks. Pac Symp Biocomput 640–651.
- 44.Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, et al. (2002) MINT: a Molecular INTeraction database. FEBS Lett 513: 135–140.
- 45.Marcotte EM, Xenarios I, Eisenberg D (2001) Mining literature for protein-protein interactions. Bioinformatics 17: 359–363.
- 46.Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, et al. (2003) PreBIND and Textomy–mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4: 11.
- 47.Rodriguez-Esteban R, Iossifov I, Rzhetsky A (2006) Imitating manual curation of text-mined facts in biomedicine. PLoS Comput Biol 2: e118. doi:10.1371/journal.pcbi.0020118.
- 48.Wadman M (2009) Open-access policy flourishes at NIH. Nature 458: 690–691.
- 49.Vastag B (2000) NIH launches PubMed Central. J Natl Cancer Inst 92: 374. Available: http://www.ncbi.nlm.nih.gov/pmc/.
- 50.Hearst MA, Divoli A, Guturu H, Ksikes A, Nakov P, et al. (2007) BioText Search Engine: beyond abstract search. Bioinformatics 23: 2196–2197. Available: http://biosearch.berkeley.edu/.
- 51.Xu S, McCusker J, Krauthammer M (2008) Yale Image Finder (YIF): a new search engine for retrieving biomedical images. Bioinformatics 24: 1968–1970. Available: http://krauthammerlab.med.yale.edu/imagefinder/.
- 52.Rodriguez-Esteban R, Iossifov I (2009) Figure mining for biomedical research. Bioinformatics 25: 2082–2084.
- 53.Yandell MD, Majoros WH (2002) Genomics and natural language processing. Nat Rev Genet 3: 601–610.
- 54.Koussounadis A, Redfern OC, Jones DT (2009) Improving classification in protein structure databases using text mining. BMC Bioinformatics 10: 129.
- 55.Pandey G, Kumar V, Steinbach M (2006) Computational approaches for protein function prediction: a survey. Technical Report 06-028, Department of Computer Science and Engineering, University of Minnesota, Twin Cities.
- 56.Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. MIT Press.
- 57.Van Rijsbergen CJ, Robertson SE, Porter MF (1980) New models in probabilistic information retrieval. Tech. Rep. 5587. British Library. Available: http://tartarus.org/~martin/PorterStemmer/.
- 58.Krallinger M, Valencia A (2005) Text-mining and information-retrieval services for molecular biology. Genome Biol 6: 224.
- 59.Shatkay H, Höglund A, Brady S, Blum T, Dönnes P, et al. (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics 23: 1410–1417. Available: http://www-bs.informatik.uni-tuebingen.de/Services/SherLoc2/.
- 60.Brady S, Shatkay H (2008) EpiLoc: a (working) text-based system for predicting protein subcellular location. Pac Symp Biocomput 604–615. Available: http://epiloc.cs.queensu.ca/.
- 61.Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, et al. (2004) TXTGate: profiling gene groups with text-based information. Genome Biol 5: R43. Available: http://tomcat.esat.kuleuven.be/txtgate/.
- 62.Krallinger M, Hirschman L, Valencia A (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol 9: S8. Available: http://zope.bioinfo.cnio.es/bionlp_tools/.
- 63.Kano Y, Baumgartner WA Jr, McCrohon L, Ananiadou S, Cohen KB, et al. (2009) U-Compare: share and compare text mining tools with UIMA. Bioinformatics 25: 1997–1998. Available: http://u-compare.org/.
- 64.Ramialison M, Bajoghli B, Aghaallaei N, Ettwiller L, Gaudan S, et al. (2008) Rapid identification of PAX2/5/8 direct downstream targets in the otic vesicle by combinatorial use of bioinformatics tools. Genome Biol 9: R145.
- 65.Natarajan J, Berrar D, Dubitzky W, Hack C, Zhang Y, et al. (2006) Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC Bioinformatics 7: 373.
- 66.Leach SM, Tipney H, Feng W, Baumgartner WA, Kasliwal P, et al. (2009) Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol 5: e1000215. doi:10.1371/journal.pcbi.1000215.
- 67.Campillos M, Kuhn M, Gavin AC, Jensen LJ, Bork P (2008) Drug target identification using side-effect similarity. Science 321: 263–266.
- 68.Leitner F, Krallinger M, Rodriguez-Penagos C, Hakenberg J, Plake C, et al. (2008) Introducing meta-services for biomedical information extraction. Genome Biol 9(Suppl 2): S6. Available: http://bcms.bioinfo.cnio.es/.
- 69.Fernández JM, Hoffmann R, Valencia A (2007) iHOP web services. Nucleic Acids Res 35(Web Server issue) W21–W26.
- 70.Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 13: 2498–2504. Available: http://www.cytoscape.org/.
- 71.Wilbur WJ, Rzhetsky A, Shatkay H (2006) New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 7: 356.
- 72.Rzhetsky A, Zheng T, Weinreb C (2006) Self-correcting maps of molecular pathways. PLoS One 1: e61. doi:10.1371/journal.pone.0000061.
- 73.Jenssen TK, Laegreid A, Komorowski J, Hovig E (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 28: 21–28.