KH Coder is a free software for quantitative content analysis or text data mining


030_kwic 040_coloc 060_cluster1 120_net3 180_cp4 som3



KH Coder is a free software for quantitative content analysis or text data mining. It is also utilized for computational linguistics. You can analyze Japanese, English, French, German, Italian, Portuguese and Spanish text with KH Coder. Chinese (simplified, UTF-8), Korean and Russian (UTF-8) language data can also be analyzed with the latest alpha version.

KH Coder provides various kinds of search and statistical analysis functions using back-end tools such as Stanford POS Tagger, FreeLing, Snowball stemmer, MySQL and R.

KH Coder Web Site

Translational Bioinformatics in contextThe Y axis depicts the “central dogma” of ...

Figure 1.

Translational Bioinformatics in context

The Y axis depicts the “central dogma” of informatics, converting data to information and information to knowledge. Along the X axis is the translational spectrum from bench to bedside. Translational bioinformatics spans the data to knowledge spectrum, and bridges the gap between bench research and application to human health. The figure was reproduced from [1] with permission from Springer.


In the general phase of text mining of cancer systems biology, we initially obtained related biomedical text from many available sources, such as PubMed. A number of literature databases provide packed data download service. However, although it is convenient, the included text is not timely updated, and text quantity is also limited. Many literature database systems offers application programming interface, by which we can use scripts to download the text automatically by computers. For examples, through E-utility of PubMed [64] and [101], users can easily get up-to-date text.

Named entity recognition tools can then be used to extract biomedical mentions from the text obtained. The mentions usually include terms such as gene names, protein names, mRNA (message RNA) names, miRNA (micro-RNA) names, metabolism related terms, and cell terms. After finding the biomedical terms, we can build a gene–gene interaction network, metabolism pathways, and other networks. Resources such as Gene Ontology can be used for network building. MicroRNAs are considered to be connected with cancer, so we can investigate how miRNAs work in gene–gene interaction. In the next phase, we can study how components and structures change in dynamic contexts. Certain networks and their variations, such as protein–protein interaction networks [102]and variations in metabolism network, can be built from text. Due to the high false negative rate in text mining-based networks, we can employ some validation and inference algorithms to correct and optimize the network. In each phase, we can use many resources to validate the network, such as homology, co-expression data, rich domain data, and co-biological process data, as well as other information. Through validation, some nodes and interactions with strong evidence will be strengthened, whereas a false one will be removed or updated. Consequently, we can develop a protein–protein interactome based on multiple sources of interaction evidence [47]. Finally, all the networks and components can be used for further studies.

Signaling pathway reconstruction plays a significant role in understand the molecular mechanisms in cancer. Signaling pathway maps are usually obtained from manual literature search, automated text mining, or canonical pathway databases [103]. Pena-Hernandez et al. implemented an extraction tool to find gene relationship and up-to-date pathways from literature [104].

5.2. Examples of integrated biomedical text mining tools

An integrated biomedical text mining systems is supposed to provide the stated functionalities. There are many tools dominated in cancer research. However blindly using the results from text mining tools is not a wise idea because the information and knowledge derived from uncurated text are error prone. Many tools choose to manually curate text by experts. In the following we will briefly introduce the three most popular commercial tools, i.e., Pathway Studio [105], GeneGO [106] and Ingenuity [107].

By Pathway Studio [105], we can analyze pathway, gene regulation networks, protein interaction maps and navigate molecular networks. Its background knowledge database contains more than 100,000 events of regulation, interaction and modification between proteins, cell processes and small molecules. It has a natural language processing module, MedScan, which enables Pathway Studio for entity identification and then applied handcrafted context free grammar (CFG) rules to extract relationships. Pathway Studio can access the entire PubMed database and online resource, full-text journal, literature, experimental and electronic notebooks. Pathways and networks from the extracted facts and interactions extracted from retrieved text. Many algorithms such as Find direct interactions, Find shortest paths, Find common targets or Find common regulators are available.

MetaCore, one of key products of GeneGO [106] is an integrated knowledge database and software suite for pathway analysis of experimental data and gene lists. The knowledge base of MetaCore is manually curated database derived from extensive full-text literature annotation. MetaMiner of GeneGo, mainly including MetaMiner Disease Platforms, MetaMiner Stem Cells, MetaMiner Prostate Cancer, MetaMiner Cystic Fibrosis, offers a knowledge mining and data analysis platforms for oncology. The most important disease reconstruction function is based on three fundamentals, manual annotation of all gene–disease associations, reconstruction of disease pathways and functional data and knowledge mining of OMICs experimental studies published in a disease area. GeneGo also provides API for third party software development.

Ingenuity [107] helps researchers model, analyze, and understand the complex biomedical, biological and chemical systems by integrating data from a variety of experimental platforms. One application example of Ingenuity Systems is analysis of CD44hi breast cancer stem cell-like subpopulations using Ingenuity iReport. The base knowledge of Ingenuity is also extracted by experts from the full text of the scientific literature, including findings about genes, drugs, biomarkers, chemicals, cellular and disease processes, and signaling and metabolic pathways. Researchers can search the scientific literature and find insights most relevant to the desired experimental model or question, build dynamic pathway models, and get confidence in hypotheses and conclusions.

6. Future work and challenges

With the development of the next-generation sequencing technologies, high throughput experimental methods are revolutionizing the life sciences rapidly. The widespread of the cloud computing application is also accelerating the application of text mining technology in the frontier research in life science. We here discuss the work and challenges in the future application of text mining in cancer researches as follows.

The first challenge is to apply biomedical text mining technologies in the personalized medicine development. It is well-known that cancer is a complex disease. Many factors such as race, gender, age and environments may correlate with risk of cancer [108],[109], [110], [111], [112], [113] and [114]. The personalized medicine is becoming a trend and the therapies will be tailored to individual patients with their biomedical information collected and analyzed. Ando et al. have applied the text mining technique to qualitatively identify the differences in the focus of life review interviews by patient’s age, gender, disease age and stage [115]. Ahmed et al. integrated compound–target relationships related with cancer by text mining and presented the spectrum of research on personalized medicine and compound–target interactions [116]. The personalized medicine in cancer will take in all these important aspects into consideration during text mining [117]. One solution is to categorize data before text mining rather than treat them together without any pre-processing. It is a really tough task to categorize data at individual level features. On the other hand, one of the negative consequence of categorization is making it harder for text mining to find a good biomarker for all cases.

The second challenge is the complex of cancer molecular mechanisms. The same cancer phenotype could be caused by different gene or gene sets from the same pathway or network. To study the complex mechanisms of cancer, we need to mine text from a hierarchical network view rather than from a single level. Systems biomedicine carries on analysis and study from different levels, including motif [118] and [119], pathway [120], [121] and [122], module [123], [124] and [125] and network[126] and [127]. The resulting hierarchical data provide us valuable materials to conduct text mining on different levels. However, how to correctly categorize text to hierarchical network, and how to integrate text mining results from different levels and discover new knowledge with a systems biomedicine view are really a hard work.

The third challenge is to apply the text mining techniques in translational medicine research. Translational medicine, an emerging field of biomedicine, involves the transformation of laboratory findings into novel diagnosis and treatment of patients [128]. The knowledge of pre-clinical can be used in clinic to improve treatment. Translational medicine facilitates the course of diseases predicting, preventing, diagnosing, and treating. Bioinformatics will be a driver rather than a passenger for translational biomedical research [128], such as the data integration and data mining platform presented by Liekens et al. [129] could retrospectively confirm recently discovered disease genes and identify potential susceptibility genes. It will add tough tasks for text mining, since translation biomedical text mining should consider various stages of information and various sources of evidence, and integrate the Omics and clinical data sets to find out novel knowledge for both biology and medicine domains. There are many this kind of applications, such as the data integration and data mining platform presented by Liekens et al. [129] could retrospectively confirm recently discovered disease genes and identify potential susceptibility genes.

The fourth challenge for text mining will be the integration of the text information at molecule, cell, tissue, organ, individual and even population levels to understand the complex biological systems. Nevertheless, most of the current text mining studies focus on molecular level, and very little text mining work reported at high levels, which in fact has a close relationship with cancer phenotypes. Text mining at high levels and integrate the text information at all these levels will be a big challenge for cancer study and provide also opportunities for successful cancer diagnosis and treatments.

The last challenge will be the de-noising and testing of the text mining results. Text mining results are often obtained with noising information and false positives since natural language text are often inconsistent. It contains ambiguities caused by semantics, slang and syntax. It can be also suffered from noise and error in text. As a result, the mined information cannot be used blindly. Many methods have been developed to solve the problem. The first is to manually read and understand the contexts, analyze them, and then add semantic tags. This pre-processing in fact turns the unstructured text into structured text with semantic tags. Thereby, the developed tools can easily achieve the goal with high precision rate. However, the approach is very restricted as it needs vast human efforts and turns out to be very time consuming. As a result, the data source for mining could be modest in size, only limiting mining ability. The second method is to carry on text mining on vast biomedical text, and then analyze the results and screen out the final results with prior domain experience. During the mining process, domain knowledge is usually employed to improve mining efficiency as well as the quality of the mined knowledge. This approach although the mined results may still contain more errors, is more powerful on knowledge discovering compared with the first approach. These two approaches are distinct on treating the text to be mined. The first one ensures correctness by carefully manual pre-processing, while the second one is to select correct ones by post-processing by experts. The third approach is to take a compromise between pre-processing and post-processing, where some advanced statistical analysis will be used to roughly clean data at first stage and then conduct mining on them.

7. Conclusions

Currently, there is a huge body of biomedical text and their rapid growth makes it impossible for researchers to address the information manually. Researchers can use biomedical text mining to discover new knowledge. We have reviewed the important research issues related to text mining in the biomedical field. We also provided a review of the state-of-the-art applications and datasets used for text mining in cancer research, thereby providing researchers with the necessary resources to apply or develop text mining tools in their research. We introduced the general workflow of text mining to support cancer systems biology and we illustrated each phase in detail. We can see that text mining has been used widely in cancer research. However, to fully utilize text mining, it is still necessary to develop new methods for full text mining and for highly complex text, as well as platforms for integrating other biomedical knowledge bases.

In spite of the huge potential of applying text mining on biomedicine, it still needs further development. Biomedical text mining systems are not as golden standard tools of biomedical researchers as retrieval systems and sequencing tools. The next important mission of text mining for us is to develop applications that are really helpful to biomedical research, so that researchers can get more productive and make more progress in the information rapid growing ear. To achieve the goal, more concerns should be put on helping biological biomedical scientists to remove the obstacles that block the development rather than discussions that are not related with actual demands. One of the hottest topics of text mining is to coordinate and cooperate with multiple subjects. That is, biomedical text mining, coupled with other data and means, should yield consistent, measurable, and testable results.