The earliest deep-learning-like algorithms that had multiple layers of non-linear features can be traced back to Ivakhnenko and Lapa in 1965 (Figure 1), who used thin but deep models with polynomial activation functions which they analyzed with statistical methods. In each layer, they selected the best features through statistical methods and forwarded them to the next layer. They did not use backpropagation to train their network end-to-end but used layer-by-layer least squares fitting where previous layers were independently fitted from later layers.
The earliest convolutional networks were used by Fukushima in 1979. Fukushima’s networks had multiple convolutional and pooling layers similar to modern networks, but the network was trained by using a reinforcement scheme where a trail of strong activation in multiple layers was increased over time. Additionally, one would assign important features of each image by hand by increasing the weight on certain connections.
Backpropagation of errors to train deep models was lacking at this point. Backpropagation was derived already in the early 1960s but in an inefficient and incomplete form. The modern form was derived first by Linnainmaa in his 1970 masters thesis that included FORTRAN code for backpropagation but did not mention its application to neural networks. Even at this point, backpropagation was relatively unknown and very few documented applications of backpropagation existed the early 1980s (e.g. Werbos in 1982). Rumelhart, Hinton, and Williams showed in 1985 that backpropagation in neural networks could yield interesting distributed representations. At this time, this was an important result in cognitive psychology where the question was whether human cognition can be thought of as relying on distributed representations (connectionism) or symbolic logic (computationalism).
The first true, practical application of backpropagation came about through the work of LeCun in 1989 at Bell Labs. He used convolutional networks in combination with backpropagation to classify handwritten digits (MNIST) and this system was later used to read large numbers of handwritten checks in the United States. The video above shows Yann LeCun demonstrating digit classification using the “LeNet” network in 1993.
Despite these successes, funding for research into neural networks was scarce. The term artificial intelligence dropped to near pseudoscience status during the AI winter and the field still needed some time to recover. Some important advances were made in this time, for example, the long short-term memory (LSTM) for recurrent neural networks by Hochreiter and Schmidhuber in 1997, but these advances went mostly unnoticed until later as they were overshadowed by the support vector machine developed by Cortes and Vapnik in 1995.
The next big shift occurred just by waiting for computers to get faster, and then later by the introduction of graphics processing units (GPUs). Waiting for faster computers and GPUs alone increased the computational speed by a factor of 1000 over a span of 10 years. In this period, neural networks slowly began to rival support vector machines. Neural networks can be slow when compared to support vector machines, but they reach much better results with the same amount of data. Unlike simpler algorithms, neural networks continue to improve with more training data.
The main hurdle at this point was to train big, deep networks, which suffered from the vanishing gradient problem, where features in early layers could not be learned because no learning signal reached these layers.
The first solution to this problem was layer-by-layer pretraining, where the model is built in a layer-by-layer fashion by using unsupervised learning so that the features in early layers are already initialized or “pretrained” with some suitable features (weights). Pretrained features in early layers only need to be adjusted slightly during supervised learning to achieve good results. The first pretraining approaches where developed for recurrent neural networks by Schmidhuber in 1992, and for feed-forward networks by Hinton and Salakhutdinov in 2006. Another solution for the vanishing gradient problem in recurrent neural networks was long short-term memory in 1997.
As the speed of GPUs increased rapidly, it was soon possible to train deep networks such as convolutional networks without the help of pretraining as demonstrated by Ciresan and colleagues in 2011 and 2012 who won character recognition, traffic sign, and medical imaging competitions with their convolutional network architecture. Krizhevsky, Sutskever, and Hinton used a similar architecture in 2012 that also features rectified linear activation functions and dropout for regularization. They received outstanding results in the ILSVRC-2012 ImageNet competition, which marked the abandonment of feature engineering and the adoption of feature learning in the form of deep learning. Google, Facebook, and Microsoft noticed this trend and made major acquisitions of deep learning startups and research teams between 2012 and 2014. From here, research in deep learning accelerated rapidly.
Additional material: Deep Learning in Neural Networks: An Overview
A perceptron contains only a single linear or nonlinear unit. Geometrically, a perceptron with a nonlinear unit trained with the delta rule can find the nonlinear plane separating data points of two different classes (if the separation plane exists). If no such separation plane exists, the perceptron will often still produce separation planes that provide good classification accuracy. The good performance of the perceptron led to a hype of artificial intelligence. In 1969 however, it was shown that a perceptron may fail to separate seemingly simple patterns such as the points provided by the XOR function. The fall from grace of the perceptron was one of the main reasons for the occurrence of the first AI winter. While neural networks with hidden layers do not suffer from the typical problems of the perceptron, neural networks were still associated with the perceptron and therefore also suffered an image problem during the AI winter.
Despite this, and despite the success of deep learning, perceptrons still find widespread use in the realm of big data, where the simplicity of the perceptron allows for successful application to very large data sets.
Rapid advances in machine learning and other approaches of inference led to a hype of artificial intelligence (similar to the buzz around deep learning today). Researchers made promises that these advances would continue and would lead to strong AI and in turn, AI research received lots of funding.
In the 1970s it became clear that those promises could not be kept, funding was cut dramatically and the field of artificial intelligence dropped to near pseudo-science status. Research became very difficult (little funding; publications almost never made it through peer review), but nevertheless, a few researchers continued further down this path and their research soon lead to the reinvigoration of the field and the creation of the field of deep learning.
This is why excessive deep learning hype is dangerous and researchers typically avoid making predictions about the future: AI researchers want to avoid another AI winter.
AlexNet is a convolutional network architecture named after Alex Krizhevsky, who along with Ilya Sutskever under the supervision of Geoffrey Hinton applied this architecture to the ILSVRC-2012 competition that featured the ImageNet dataset. They improved the convolutional network architecture developed by Ciresan and colleagues, which won multiple international competitions in 2011 and 2012 by using rectified linear units for enhanced speed and dropout for improved generalization. Their results stood in stark contrast to feature engineering methods, which immediately created a great rift between deep learning and feature engineering methods for computer vision. From here it was apparent that deep learning would take over computer vision and that other methods would not be able to catch up. AlexNet heralded the mainstream usage and the hype of deep learning.
Training Deep Learning Architectures
The process of training a deep learning architecture is similar to how toddlers start to make sense of the world around them. When a toddler encounters a new animal, say a monkey, he or she will not know what it is. But then an adult points with a finger at the monkey and says: “That is a monkey!” The toddler will then be able to associate the image he or she sees with the label “monkey”.
A single image, however, might not be sufficient to label an animal correctly when it is encountered the next time. For example, the toddler might mistake a sloth for a monkey or a monkey for a sloth, or might simply forget the name of a certain animal. For reliable recall and labeling, a toddler needs to see many different monkeys and similar animals and needs to know each time whether or not it is really a monkey—feedback is essential for learning. After some time, if the toddler encounters enough animals paired with their names, the toddler will have learned to distinguish between different animals.
The deep learning process is similar. We present the neural network with images or other data, such as the image of a monkey. The deep neural network predicts a certain outcome, for example, the label of the object in an image (“monkey”). We then supply the network with feedback. For example, if the network predicted that the image showed a monkey with 30% probability and a sloth with 70% probability, then all the outputs in favor of the sloth class made an error! We use this error to adjust the parameters of the neural network using the backpropagation of errors algorithm.
Usually, we randomly initialize the parameters of a deep network so the network initially outputs random predictions. This means for ImageNet, which consists of 1000 classes, we will achieve an average classification accuracy of just 0.1% for any image after initializing the neural network. To improve the performance we need to adjust the parameters so that the classification performance increases over time. But this is inherently difficult: If we adjust one parameter to improve performance on one class, this change might decrease the classification performance for another class. Only if we find parameter changes that work for all classes can we achieve good classification performance.
If you imagine a neural network with only 2 parameters (e.g. -0.37 and 1.14), then you can imagine a mountain landscape, where the height of the landscape represents the classification error and the two directions—north-south (x-axis) and east-west (y-axis)—represent the directions in which we can change the two parameters (negative-positive direction). The task is to find the lowest altitude point in the mountain landscape: we want to find the minimum.
The problem with this is that the entire mountain landscape is unknown to us at the beginning. It is as if the whole mountain range is covered in fog. We only know our current position (the initial random parameters) and our height (the current classification error). How can we find the minimum quickly when we have so little information about the landscape?
Imagine you stand on top of a mountain with skis strapped to your feet. You want to get down to the valley as quickly as possible, but there is fog and you can only see your immediate surroundings. How can you get down the mountain as quickly as possible? You look around and identify the steepest path down, go down that path for a bit, again look around and find the new steepest path, go down that path, and repeat—this is exactly what gradient descent does.
While gradient descent is equivalent to stopping every 10 meters and measuring the steepness of your surroundings with a measuring tape (you measure your gradient according to the whole data set), stochastic gradient descent is the equivalent of quickly estimating the steepness with a short glance (just a few hundred data points are used to estimate the steepness).
In terms of stochastic gradient descent, we go down the steepest path (the negative gradient or first derivative) on the landscape of the error function to find a local minimum, that is, the point that yields a low error for our task. We do this in tiny steps so that we do not get trapped in half-pipe-like obstacles (if we are too fast, we never get out of these half-pipes and we may even be “catapulted” up the mountain).
While our ski-landscape is 3D, typical error landscapes may have millions of dimensions. In such a space we have many valleys so it is easy to find a good solution, but we also have many saddle points, which makes matters very difficult.
Saddle points are points at which the surroundings are almost entirely flat, yet which may have dramatic descents at one end or the other (saddle points are like plateaus that slightly bend and may lead to a cliff). Most difficulties to find good solutions on an error landscape with many dimensions stems from navigating saddle points (because these plateaus have almost no steepness, progress is very slow near saddle points) rather than finding the minimum itself (there are many minima, which are almost all of the same quality).
Backpropagation of errors, or often simply backpropagation, is a method for finding the gradient of the error with respect to weights over a neural network. The gradient signifies how the error of the network changes with changes to the network’s weights. The gradient is used to perform gradient descent and thus find a set of weights that minimize the error of the network.
There are three good ways to teach backpropagation: (1) Using a visual representation, (2) using a mathematical representation, (3) using a rule-based representation. The bonus material at the end of this section uses a mathematical representation. Here I’ll use a rule-based representation as it requires little math and is easy to understand.
Imagine a neural network with 100 layers. We can imagine a forward pass in which a matrix (dimensions: number of examples x number of input nodes) is input to the network and propagated t through it, where we always have the order (1) input nodes, (2) weight matrix (dimensions: input nodes x output nodes), and (3) output nodes, which usually also have a non-linear activation function (dimensions: examples x output nodes). How can we imagine these matrices?
The input matrix represents the following: For every input node we have one input value, for example, pixels (three input values = three pixels in Figure 1), and we take this times our number of examples, such as the number of images. So for 128 3-pixel images, we have a 128×3 input matrix.
The weight matrix represents the connections between input and output nodes. The value passed to an input node (a pixel) is weighted by the weight matrix values and it “flows” to each output node through these connections. This flow is a result of multipying the input value by the value of each weight between the input node and output nodes. The output matrix is the accumulated “flow” of all input nodes at all output nodes.
So for each input, we multiply by all weights, and add up all those contributions at the output nodes, or more easily we take the matrix product of the input matrix times the weight matrix. In our example, this would be our 128×3 input matrix multiplied by the 3×5 weight matrix (see Figure 1). We thus receive our output matrix as a result which in this example is of size 128×5. We then use this output matrix, apply the non-linear activation function and treat our resulting output matrix as the input matrix to the next layer. We repeat these steps until we reach the error function. We then apply the error function to see how far the predictions are different from the correct values. We can formulate this whole process of the forward pass, and equivalently the backward pass, by defining simple rules (see Figure 1).
For the forward pass with given input data we go from the first to the last layer according to these rules:
- When we encounter a weight matrix, we matrix multiply by this weight and propagate the result.
- If we encounter a function, we put our current result into the function and propagate the function output as our result.
- We treat outputs of the previous layer as inputs into the next layer
- When we encounter the error function we apply it and thus generate the error for our backward pass
The backward pass for a given error is similar but proceeds from the last to the first layer where the error generated in rule 4 in the forward pass represents the “inputs” to the last layer. We then go backward through the network and follow these rules:
- When we encounter a weight matrix, we matrix multiply by the transpose of the matrix and propagate the result.
- If we encounter a function, we multiply (element-wise) by the derivative of that function with respect to the inputs that this function received from the forward pass. (see Figure 1)
- We treat errors of the previous layer as inputs (errors) into the next layer
To calculate the gradients, we use each intermediate result obtained after executing rule 2 in the backward pass and matrix multiply this intermediate result by the value of rule 2 from the forward pass from the previous layer (see Figure 1).
The rectified linear function is a simple non-linearity: It evaluates to 0 for negative inputs, and positive values remain untouched (f(x) = max(0,x)). The gradient of the rectified linear function is 1 for all positive values and 0 for negative values. This means that during backpropagation, negative gradients will not be used to update the weights of the outgoing rectified linear unit.
However, because we have a gradient of 1 for any positive value we have much better training speed when compared to other non-linear functions due to the good gradient flow. For example, the logistic sigmoid function has very tiny gradients for large positive and negative values so that learning nearly stops in these regions (this behavior is similar to a saddle point).
Despite the fact that negative gradients do not propagate with rectified linear functions (the gradient is zero here), large gradients for positive values are very powerful and ensure fast training regardless of the size of the gradient. Once these benefits were discovered, rectified linear functions and similar activation functions with large gradients became the activation functions of choice for deep networks.
Momentum uses the idea that the gradient zigzags every now and then but generally follows a rather straight line towards a local minimum. As such, if we move faster in this general direction and disregard the zigzag directions we will arrive faster at the local minimum, in general.
To realize this behavior we keep track of a running momentum matrix, which is the weighted running sum of the gradient, and we add that momentum matrix value to the gradient. The size of this momentum matrix is kept in check by attenuating it on every update (multiply by a momentum value between 0.7-0.99). Over time, the zigzag dimensions will be smoothed out in our running momentum matrix: A zig in one direction and a zag in the exact opposite direction cancel out and yield a straight line towards the general direction of the local minimum. In the beginning, the general direction towards the local minimum is not strongly established (a sequence of zags with no zigs, or vice versa), and the momentum matrix needs to be attenuated more strongly or the values for the momentum increasingly emphasize zigzagging directions, which in turn can lead to unstable learning. Thus, the momentum value should be kept small (0.5-0.7) in the beginning when no general direction towards a local minimum has been established. Later the momentum value can be increased rapidly (0.9-0.999).
Usually, the gradient update is applied first, and then the jump into the momentum direction follows. However, Nesterov showed that it is better to first jump into the momentum direction and then correct this direction with a gradient update; this procedure is known as “Nesterov’s accelerated gradient” (sometimes “Nesterov momentum”) and yields faster convergence to a local minimum.
Additional material: Coursera: Neural Networks for Machine Learning: 3. The Momentum Method
RMSprop keeps track of the weighted running mean of the squared gradient and then divides each calculated gradient by the square root of this weighted running mean (it essentially normalizes the gradient by dividing by the magnitude of recent gradients). The consequence is that when a plateau in the error surface is encountered and the gradient is very small, the updates take greater steps, ensuring faster learning (a small update: 0.00001, the square root of the weighted average: 0.00005, update size: 0.2). On the other hand, RMSprop protects against exploding gradients (a large update: 100, the square root of the weighted average: 25, update size: 4) and is thus used frequently in recurrent neural networks and LSTMs to protect both against vanishing and exploding gradients.
Imagine you (a unit in a convolutional network) are preparing for an exam (a classification task) and you know that during the exam you are permitted to copy answers from your peers (other units). Will you study for the exam? The answer to this question is probably yes or no depending on whether at least some students in your class have studied for the exam.
Let’s say you know that there are two students (units) in your class (convolutional net) who have the reputation of studying for every exam they take (every image that is presented). So you do not study for the exam and just copy from these students (you weigh the input from a single “elite” unit in the previous layer highly).
Now we introduce an infectious flu (dropout) that affects 50% of all students. Now there is a high chance that these two students who actually studied for the exam will not be present, so relying on copying their answers is no longer a good strategy. So this time you have to learn by yourself (make choices which take into account all units in a layer and not just the elite units).
In other words, dropout decouples the information processing of units so that they cannot rely on some unit “superstars” which always seem to have the right answer (these superstars detect features which are more important than the features that other units detect).
This in turn democratizes the classification process so that every unit makes computations that are largely independent of strong influencers, and thus reduces bias by ensuring less extreme opinions (there are no mainstream opinions). This decoupling of units in turn leads to strong regularization and better generalization (wisdom of the crowd).
L1 and L2 regularization penalizes the size of the weights of a network so that large output values that signify strong confidence can no longer be achieved from a single large weight, but instead require several medium-sized weights. Since many units have to agree to achieve a large value, it is less likely that the output will be biased by the opinion of a single unit. Conceptually, it penalizes strong opinions from single units and encourages taking into account the opinion of multiple units, thus reducing bias.
The L1 regularization penalizes the absolute size of the weight, while the L2 penalizes the squared size of the weight. This penalty is added to the error function value thus increasing the error if larger weights are used. As a result, the network is driven to solve the problem with small weights.
Since even small weights produce a sizeable L1 penalty, the L1 penalty has the effect that most weights will be set to zero while a few medium-to-large weights remain. Because fewer non-zero weights exist, the network must be highly confident about its results to achieve good predictive performance.
The L2 penalty encourages very small non-zero weights (large weight = very large error). Here the prediction is made by almost all weights thus reducing the bias (there are no influencers that can turn around outcomes by themselves).
Additional material: Coursera: Neural Networks for Machine Learning: 2. Limiting the Size of the Weights
Conclusion to Part 2
This concludes part 2 of this crash course on deep learning. Please check back soon for the next part of the series. In part 3, I’ll provide some details on learning algorithms, unsupervised learning, sequence learning, and natural language processing, and in part 4 I’ll go into reinforcement learning. In case you missed it, be sure to check out part 1 of the series.
Meanwhile, you might be interested in learning about cuDNN, DIGITS, Computer Vision with Caffe, Natural Language Processing with Torch, Neural Machine Translation, the Mocha.jl deep learning framework for Julia, or other Parallel Forall posts on deep learning.
“This is a really, really, really big deal,” said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. “… It’s going to enable whole new classes of products that have never existed before.” Think of Siri on steroids, for starters, or perhaps emulators that could mimic your writing style down to the tone.
When deep learning works, it works great
To understand Howard’s excitement, let’s go back a few days. It was Monday and I was watching him give a presentation in Chicago about how deep learning was dominating the competition in Kaggle, the online platform where organization present vexing predictive problems and data scientists compete to create the best models. Whenever someone has used a deep learning model to tackle one of the challenges, he told the room, it has performed better than any model ever previously devised to tackle that specific problem.
But there’s a catch: deep learning is really hard. So far, only a handful of teams in hundreds of Kaggle competitions have used it. Most of them have included Geoffrey Hinton or have been associated with him.
Hinton is a University of Toronto professor who pioneered the use of deep learning for image recognition and is now a distinguished engineer at Google, as well. What got Google really interested in Hinton — at least to the point where it hired him — was his work in an image-recognition competition called ImageNet. For years the contest’s winners had been improving only incrementally on previous results, until Hinton and his team used deep learning to improve by an order of magnitude.
Neural networks: A way-simplified overview
Deep learning, Howard explained, is essentially a bigger, badder take on the neural network models that have been around for some time. It’s particularly useful for analyzing image, audio, text, genomic and other multidimensional data that doesn’t lend itself well to traditional machine learning techniques.
Neural networks work by analyzing inputs (e.g., words or images) and recognizing the features that comprise them as well as how all those features relate to each other. With images, for example, a neural network model might recognize various formations of pixels or intensities of pixels as features.
Trained against a set of labeled data, the output of a neural network might be the classification of an input as a dog or cat, for example. In cases where there is no labeled training data — a process called self-taught learning — neural networks can be used to identify the common features of their inputs and group similar inputs even though the models can’t predict what they actually are. Like when Google researchers constructed neural networks that were able to recognize cats and human faces without having been trained to do so.
Stacking neural networks to do deep learning
In deep learning, multiple neural networks are “stacked” on top of each other, or layered, in order to create models that are even better at prediction because each new layer learns from the ones before it. In Hinton’s approach, each layer randomly omits features — a process called “dropout” — to minimize the chances the model will overfit itself to just the data upon which it was trained. That’s a technical way of saying the model won’t work as well when trying to analyze new data.
So dropout or similar techniques are critical to helping deep learning models understand the real causality between the inputs and the outputs, Howard explained during a call on Thursday. It’s like looking at the same thing under the same lighting all the time versus looking at it in different lighting and from different angles. You’ll see new aspects and won’t see others, he said, “But the underlying structure is going to be the same each time.”
Still, it’s difficult to create accurate models and to program them to run on the number of computing cores necessary to process them in a reasonable timeframe. It’s also can be difficult to train them on enough data to guarantee accuracy in an unsupervised environment. That’s why so much of the cutting-edge work in the field is still done by experts such as Hinton, Jeff Dean and Andrew Ng, all of whom had or still have strong ties to Google.
There are open source tools such as Theano and PyLearn2 that try to minimize the complexity, Howard told the audience on Monday, but a user-friendly, commercialized software package could be revolutionary. If data scientists in places outside Google could simply (a relative term if ever there was one) input their multidimensional data and train models to learn it, that could make other approaches to predictive modeling all but obsolete. It wouldn’t be inconceivable, Howard noted, that a software package like this could emerge within the next year.
Which brings us back to word2vec. Google calls it “an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words.” Those “architectures” are two new natural-language processing techniques developed by Google researchers Tomas Mikolov, Ilya Sutskever, and Quoc Le (Google Fellow Jeff Dean was also involved, although modestly, he told me.) They’re like neural networks, only simpler so they can be trained on larger data sets.
Kaggle’s Howard calls word2vec the “crown jewel” of natural language processing. “It’s the English language compressed down to a list of numbers,” he said.
Word2vec is designed to run on a system as small as a single multicore machine (Google tested its underlying techniques over days across more than 100 cores on its data center servers). Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they’re related to other words (e.g., countries and capitals). It’s able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and “linguistic regularities” (e.g., “vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’)).
Right now, the word2vec Google Code page notes, “The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences.”
This is accomplished by turning words into numbers that correlate with their characteristics, Howard said. Words that express positive sentiment, adjectives, nouns associated with sporting events — they’ll all have certain numbers in common based on how they’re used in the training data (so bigger data is better).
Smarter models means smarter apps
If this is all too esoteric, think about these methods applied to auto-correct or word suggestions in text-messaging apps. Current methods for doing this might be as simple as suggesting words that are usually paired together, Howard explained, meaning a suggestion is could be based solely on the word immediately before it. Using deep-learning-based approaches, a texting app could take into account the entire sentence, for example, because the app would have a better understanding of what the all words really mean in context.
Maybe you could average out all the numbers in a tweet, Howard suggested, and get a vector output that would accurately infer the sentiment, subject and level of formality of the tweet. Really, the possibilities are limited only to the types of applications people can think up to take advantage of word2vec’s deep understanding of natural language.
The big caveat, however, is researchers and industry data scientists still need to learn how to use word2vec. There hasn’t been a lot of research done on how to best use these types of models, Howard said, and the thousands of researchers working on other methods of natural language processing aren’t going to jump ship to Google’s tools overnight. Still, he believes the community will come around and word2vec and its underlying techniques could make all other approaches to natural language processing obsolete.
And this is just the start. A year from now, Howard predicts, deep learning will have surpassed a whole class of algorithms in other fields (i.e., things other than speech recognition, image recognition and natural language processing), and a year after that it will be integrated into all sorts of software packages. The only questions — and they’re admittedly big ones — is how smart deep learning models can get (and whether they’ll run into another era of hardware constraints that graphical processing units helped resolve earlier this millennium) and how accessible software packages like word2vec can make deep learning even for relatively unsophisticated users.
“Maybe in 10 years’ time,” Howard proposed, “we’ll get to that next level.”
In this post, we take a tour of the most popular machine learning algorithms.
It is useful to tour the main algorithms in the field to get a feeling of what methods are available.
There are so many algorithms available that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit.
I want to give you two ways to think about and categorize the algorithms you may come across in the field.
- The first is a grouping of algorithms by the learning style.
- The second is a grouping of algorithms by similarity in form or function (like grouping similar animals together).
Both approaches are useful, but we will focus in on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types.
After reading this post, you will have a much better understanding of the most popular machine learning algorithms for supervised learning and how they are related.
Algorithms Grouped by Learning Style
There are different ways an algorithm can model a problem based on its interaction with the experience or environment or whatever we want to call the input data.
It is popular in machine learning and artificial intelligence textbooks to first consider the learning styles that an algorithm can adopt.
There are only a few main learning styles or learning models that an algorithm can have and we’ll go through them here with a few examples of algorithms and problem types that they suit.
This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think about the roles of the input data and the model preparation process and select one that is the most appropriate for your problem in order to get the best result.
Let’s take a look at four different learning styles in machine learning algorithms:
Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include Logistic Regression and the Back Propagation Neural Network.
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and k-Means.
There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.
Example problems are classification and regression.
Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.
When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods.
A hot topic at the moment is semi-supervised learning methods in areas such as image classification where there are large datasets with very few labeled examples.
Get your FREE Algorithms Mind Map
I’ve created a handy mind map of 60+ algorithms organized by type.
Download it, print it and use it.
Also get exclusive access to the machine learning algorithms email mini-course.
Algorithms Grouped By Similarity
Algorithms are often grouped by similarity in terms of their function (how they work). For example, tree-based methods, and neural network inspired methods.
I think this is the most useful way to group algorithms and it is the approach we will use here.
This is a useful grouping method, but it is not perfect. There are still algorithms that could just as easily fit into multiple categories like Learning Vector Quantization that is both a neural network inspired method and an instance-based method. There are also categories that have the same name that describe the problem and the class of algorithm such as Regression and Clustering.
We could handle these cases by listing algorithms twice or by selecting the group that subjectively is the “best” fit. I like this latter approach of not duplicating algorithms to keep things simple.
In this section, I list many of the popular machine learning algorithms grouped the way I think is the most intuitive. The list is not exhaustive in either the groups or the algorithms, but I think it is representative and will be useful to you to get an idea of the lay of the land.
Please Note: There is a strong bias towards algorithms used for classification and regression, the two most prevalent supervised machine learning problems you will encounter.
If you know of an algorithm or a group of algorithms not listed, put it in the comments and share it with us. Let’s dive in.
Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.
Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning. This may be confusing because we can use regression to refer to the class of problem and the class of algorithm. Really, regression is a process.
The most popular regression algorithms are:
- Ordinary Least Squares Regression (OLSR)
- Linear Regression
- Logistic Regression
- Stepwise Regression
- Multivariate Adaptive Regression Splines (MARS)
- Locally Estimated Scatterplot Smoothing (LOESS)
Instance-based learning model is a decision problem with instances or examples of training data that are deemed important or required to the model.
Such methods typically build up a database of example data and compare new data to the database using a similarity measure in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods and memory-based learning. Focus is put on the representation of the stored instances and similarity measures used between instances.
The most popular instance-based algorithms are:
- k-Nearest Neighbor (kNN)
- Learning Vector Quantization (LVQ)
- Self-Organizing Map (SOM)
- Locally Weighted Learning (LWL)
An extension made to another method (typically regression methods) that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.
I have listed regularization algorithms separately here because they are popular, powerful and generally simple modifications made to other methods.
The most popular regularization algorithms are:
- Ridge Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Elastic Net
- Least-Angle Regression (LARS)
Decision Tree Algorithms
Decision tree methods construct a model of decisions made based on actual values of attributes in the data.
Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning.
The most popular decision tree algorithms are:
- Classification and Regression Tree (CART)
- Iterative Dichotomiser 3 (ID3)
- C4.5 and C5.0 (different versions of a powerful approach)
- Chi-squared Automatic Interaction Detection (CHAID)
- Decision Stump
- Conditional Decision Trees
Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as classification and regression.
The most popular Bayesian algorithms are:
- Naive Bayes
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Averaged One-Dependence Estimators (AODE)
- Bayesian Belief Network (BBN)
- Bayesian Network (BN)
Clustering, like regression, describes the class of problem and the class of methods.
Clustering methods are typically organized by the modeling approaches such as centroid-based and hierarchal. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.
The most popular clustering algorithms are:
- Expectation Maximisation (EM)
- Hierarchical Clustering
Association Rule Learning Algorithms
Association rule learning methods extract rules that best explain observed relationships between variables in data.
These rules can discover important and commercially useful associations in large multidimensional datasets that can be exploited by an organization.
The most popular association rule learning algorithms are:
- Apriori algorithm
- Eclat algorithm
Artificial Neural Network Algorithms
Artificial Neural Networks are models that are inspired by the structure and/or function of biological neural networks.
They are a class of pattern matching that are commonly used for regression and classification problems but are really an enormous subfield comprised of hundreds of algorithms and variations for all manner of problem types.
Note that I have separated out Deep Learning from neural networks because of the massive growth and popularity in the field. Here we are concerned with the more classical methods.
The most popular artificial neural network algorithms are:
- Hopfield Network
- Radial Basis Function Network (RBFN)
Deep Learning Algorithms
Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant cheap computation.
They are concerned with building much larger and more complex neural networks and, as commented on above, many methods are concerned with semi-supervised learning problems where large datasets contain very little labeled data.
The most popular deep learning algorithms are:
- Deep Boltzmann Machine (DBM)
- Deep Belief Networks (DBN)
- Convolutional Neural Network (CNN)
- Stacked Auto-Encoders
Dimensionality Reduction Algorithms
Like clustering methods, dimensionality reduction seek and exploit the inherent structure in the data, but in this case in an unsupervised manner or order to summarize or describe data using less information.
This can be useful to visualize dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.
- Principal Component Analysis (PCA)
- Principal Component Regression (PCR)
- Partial Least Squares Regression (PLSR)
- Sammon Mapping
- Multidimensional Scaling (MDS)
- Projection Pursuit
- Linear Discriminant Analysis (LDA)
- Mixture Discriminant Analysis (MDA)
- Quadratic Discriminant Analysis (QDA)
- Flexible Discriminant Analysis (FDA)
Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.
Much effort is put into what types of weak learners to combine and the ways in which to combine them. This is a very powerful class of techniques and as such is very popular.
- Bootstrapped Aggregation (Bagging)
- Stacked Generalization (blending)
- Gradient Boosting Machines (GBM)
- Gradient Boosted Regression Trees (GBRT)
- Random Forest
Many algorithms were not covered.
For example, what group would Support Vector Machines go into? Its own?
I did not cover algorithms from specialty tasks in the process of machine learning, such as:
- Feature selection algorithms
- Algorithm accuracy evaluation
- Performance measures
I also did not cover algorithms from specialty subfields of machine learning, such as:
- Computational intelligence (evolutionary algorithms, etc.)
- Computer Vision (CV)
- Natural Language Processing (NLP)
- Recommender Systems
- Reinforcement Learning
- Graphical Models
- And more…
These may feature in future posts.
This tour of machine learning algorithms was intended to give you an overview of what is out there and some ideas on how to relate algorithms to each other.
I’ve collected together some resources for you to continue your reading on algorithms. If you have a specific question, please leave a comment.
Other Lists of Algorithms
There are other great lists of algorithms out there if you’re interested. Below are few hand selected examples.
- List of Machine Learning Algorithms: On Wikipedia. Although extensive, I do not find this list or the organization of the algorithms particularly useful.
- Machine Learning Algorithms Category: Also on Wikipedia, slightly more useful than Wikipedias great list above. It organizes algorithms alphabetically.
- CRAN Task View: Machine Learning & Statistical Learning: A list of all the packages and all the algorithms supported by each machine learning package in R. Gives you a grounded feeling of what’s out there and what people are using for analysis day-to-day.
- Top 10 Algorithms in Data Mining: Published article and now a book (Affiliate Link) on the most popular algorithms for data mining. Another grounded and less overwhelming take on methods that you could go off and learn deeply.
How to Study Machine Learning Algorithms
Algorithms are a big part of machine learning. It’s a topic I am passionate about and write about a lot on this blog. Below are few hand selected posts that might interest you for further reading.
- How to Learn Any Machine Learning Algorithm: A systematic approach that you can use to study and understand any machine learning algorithm using “algorithm description templates” (I used this approach to write my first book).
- How to Create Targeted Lists of Machine Learning Algorithms: How you can create your own systematic lists of machine learning algorithms to jump start work on your next machine learning problem.
- How to Research a Machine Learning Algorithm: A systematic approach that you can use to research machine learning algorithms (works great in collaboration with the template approach listed above).
- How to Investigate Machine Learning Algorithm Behavior: A methodology you can use to understand how machine learning algorithms work by creating and executing very small studies into their behavior. Research is not just for academics!
- How to Implement a Machine Learning Algorithm: A process and tips and tricks for implementing machine learning algorithms from scratch.
How to Run Machine Learning Algorithms
Sometimes you just want to dive into code. Below are some links you can use to run machine learning algorithms, code them up using standard libraries or implement them from scratch.
- How To Get Started With Machine Learning Algorithms in R: Links to a large number of code examples on this site demonstrating machine learning algorithms in R.
- Machine Learning Algorithm Recipes in scikit-learn: A collection of Python code examples demonstrating how to create predictive models using scikit-learn.
- How to Run Your First Classifier in Weka: A tutorial for running your very first classifier in Weka (no code required!).
I hope you have found this tour useful.
Please, leave a comment if you have any questions or ideas on how to improve the algorithm tour.
Update #2: I’ve added a bunch more resources and more algorithms. I’ve also added a handy mind map that you can download (see above).
Frustrated With Machine Learning Math?
See How Algorithms Work in Minutes
…with just arithmetic and simple examples
Discover how in my new Ebook: Master Machine Learning Algorithms
It covers explanations and examples of 10 top algorithms, including:
Linear Regression, k-Nearest Neighbors, Support Vector Machines and much more…
Finally, Pull Back the Curtain on
Machine Learning Algorithms
Skip the Academics. Just Results.
With its unprecedented throughput, scalability, and speed, next-generation sequencing enables researchers to study biological systems at a level never before possible.
Today’s complex genomic research questions demand a depth of information beyond the capacity of traditional DNA sequencing technologies. Next-generation sequencing has filled that gap and become an everyday research tool to address these questions.
See What NGS Can Do For You
Innovative NGS sample preparation and data analysis options enable a broad range of applications. Next-gen sequencing allows you to:
- Rapidly sequence whole genomes
- Zoom in to deeply sequence target regions
- Utilize RNA sequencing to discover novel RNA variants and splice sites, or precisely quantify mRNAs for gene expression analysis
- Analyze genome-wide methylation or DNA-protein interactions
- Study microbial diversity in humans or in the environment
Accessible Whole-Genome Sequencing
Using capillary electrophoresis-based Sanger sequencing, the Human Genome Project took over 10 years and cost nearly $3 billion.
Next-generation sequencing, in contrast, makes large-scale whole-genome sequencing accessible and practical for the average researcher.
Limitless Dynamic Range for Expression Profiling
NGS makes sequence-based gene expression analysis a “digital” alternative to analog techniques. It lets you quantify RNA expression with the breadth of a microarray and the resolution of qPCR.
Microarray gene expression measurement is limited by noise at the low end and signal saturation at the high end. In contrast, next-generation sequencing quantifies discrete, digital sequencing read counts, offering a virtually unlimited dynamic range.
Tunable Resolution for Targeted Next-Gen Sequencing
NGS is highly scalable, allowing you to tune the level of resolution to meet specific experimental needs.
Targeted sequencing allows you to focus your research on particular regions of the genome. Choose whether to do a shallow scan across multiple samples, or sequence at greater depth with fewer samples to find rare variants in a given region.
How Does Illumina NGS Work?
Illumina next-generation sequencing utilizes a fundamentally different approach from the classic Sanger chain-termination method. It leverages sequencing by synthesis (SBS) technology – tracking the addition of labeled nucleotides as the DNA chain is copied – in a massively parallel fashion.
Next-gen sequencing generates masses of DNA sequence data that’s richer and more complete than is imaginable with Sanger sequencing. Illumina sequencing systems can deliver data output ranging from 300 kilobases up to 1 terabase in a single run, depending on instrument type and configuration.
Learn more about sequencing by synthesis (SBS) technology »
Latest Evolution of Illumina Next-Gen Sequencing
Recent Illumina next-generation sequencing technology breakthroughs include:
- 2-channel SBS: This technology enables faster sequencing with the same high data accuracy.
- Patterned flow cell technology: This option offers dramatically increased data output and throughput.
- $1000 genome sequencing: Discover how the HiSeq X Ten System breaks the $1000 genome barrier for human whole-genome sequencing.
Bring Next-Generation Sequencing to Your Lab
The following resources offer valuable guidance to researchers who are considering purchasing an NGS system:
Browse by OMIC TECHNOLOGIES
Browse by OMIC INTERPRETATION
Epigenomic analysis software tools and databases
Transcriptomic analysis software tools and databases
Metabolomic analysis software tools and databases
Fluxomic analysis software tools and databases
Biological pathway analysis software tools and databases
Many efforts are still devoted to the discovery of genes involved with specific phenotypes, in particular, diseases. High-throughput techniques are thus applied frequently to detect dozens or even hundreds of candidate genes. However, the experimental validation of many candidates is often an expensive and time-consuming task. Therefore, a great variety of computational approaches has been developed to support the identification of the most promising candidates for follow-up studies. The biomedical knowledge already available about the disease of interest and related genes is commonly exploited to find new gene–disease associations and to prioritize candidates. In this review, we highlight recent methodological advances in this research field of candidate gene prioritization. We focus on approaches that use network information and integrate heterogeneous data sources. Furthermore, we discuss current benchmarking procedures for evaluating and comparing different prioritization methods. WIREs Syst Biol Med 2012. doi: 10.1002/wsbm.1177
For further resources related to this article, please visit the WIREs website.
COMPLEX DISEASES AND THE IDENTIFICATION OF RELEVANT GENES
Many common diseases are complex and polygenic, involving dozens of human genes that might predispose to, be causative of, or modify the respective disease phenotype.1–3 This intricate interplay of disease genotypes and phenotypes still renders the identification of all relevant disease genes difficult.4–6 Therefore, a number of experimental techniques exist to discover disease genes. In particular, high-throughput methods such as genome-wide association (GWA) studies2,7,8 and large-scale RNA interference screens6,9,10 yield lists of up to hundreds of candidate disease genes. As validating the actual disease relevance of candidate genes in experimental follow-up studies is a time-consuming and expensive task, many methods and web services for the computational prioritization of candidate disease genes have already been developed.11–19
The concrete problem of candidate gene prioritization can be formulated as follows: Given a disease (or, generally spoken, a specific phenotype) of interest and some list of candidate genes, identify potential gene–disease associations by ranking the candidate genes in decreasing order of their relevance to the disease phenotype. When abstracting from the methodological details, the vast majority of computational approaches to this prioritization problem work in a similar manner. Most of them rely on the biological information already available for the disease phenotype of interest and the known, already verified, disease genes as well as for the additional candidate genes. In this context, functional information, particularly, manually curated or automatically derived functional annotation, often provides strong evidence for establishing links between diseases and relevant genes and proteins.20–27 Many prioritization methods use protein interaction data as rich information source for finding relationships between gene products of candidate genes and disease genes.11,16,18,25,28–45 In addition, the phenotypic similarity of diseases can help to increase the total number of known disease genes for less studied disease phenotypes.35,46–55 Other sources of biological information frequently used by prioritization approaches are sequence properties, gene expression data, molecular pathways, functional orthology between organisms, and relevant biomedical literature.12,14,19
These data then serve as input for statistical learning methods or are integrated into network representations, which are further analyzed by network scoring algorithms. Although individual data sources such as functional annotations or protein interactions provide quite powerful information for prioritizing candidate genes, the integration of multiple data sources has been reported to increase the performance even more.23–25,35,47–69 However, a generally accepted and consistent benchmarking strategy for all the diverse prioritization methods has not emerged yet, which complicates performance evaluation and comparison.
Therefore, this advanced review does not only highlight recent prioritization approaches (published until the end of 2011), but also discusses different benchmarking strategies applied by authors and the need of standardized procedures for performance measurement. Other computational tasks such as the structural and functional interpretation as well as prioritization of disease-associated nucleotide and amino acid changes are not discussed in this article, but are reviewed elsewhere.3,70–74 In the following, the various prioritization methods are categorized according to the biological data and their representation that are primarily considered when scoring and ranking candidate disease genes: gene and protein characteristics, network information on molecular interactions, and integrated biomedical knowledge.
PRIORITIZATION METHODS USING GENE AND PROTEIN CHARACTERISTICS
The first computational approaches in the hunt for disease genes focused on molecular characteristics of disease genes, which discriminate them from non-disease genes. As described below, researchers developed methods related to individual gene and protein sequence properties75–77 as well as functional annotations of gene products.20–22,26,27,78 In principle, if a candidate satisfies certain characteristics as derived from known disease genes and proteins, its disease relevance is considered to be higher than otherwise.
Gene and Protein Sequence Properties
López-Bigas and Ouzounis75 derived several important characteristics of disease genes from the amino acid sequence of their gene products. In comparison with other proteins encoded in the human genome, disease proteins tend to be longer, to exhibit a wider phylogenetic extent, that is, to have more homologs in both vertebrates and invertebrates, to possess a low number of close paralogs, and to be more evolutionarily conserved. Using these sequence properties as input to a decision-tree algorithm, the researchers performed a genome-wide identification of genes involved in (hereditary) diseases.
Similarly, Adie et al.76 developed PROSPECTR, a method for candidate disease gene prioritization based on an alternating decision-tree algorithm. However, their approach examined a broader set of sequence features and thus produced a more successful classifier. In particular, Adie et al. found that disease genes tend to have different nucleotide compositions at the end of the sequence, a higher number of CpG islands at the 5′ end, and longer 3′ untranslated regions.
By demonstrating the strong correlation between gene and protein function and disease features, such as age of onset, Jimenez-Sanchez et al.78motivated prioritization approaches that exploit the functional annotation of known disease genes for ranking candidates.20–22,26,27
Perez-Iratxeta et al.20 applied text mining on biomedical literature to relate disease phenotypes with functional annotations using Medical Subject Headings (MeSH)79 and Gene Ontology (GO) terms.80 They ranked the candidate genes according to the characteristic functional annotations shared with the disease of interest. In a similar fashion, Freudenberg and Propping21identified candidate genes based on their annotated GO terms that are shared with groups of known disease genes associated with similar phenotypes. In contrast, the approach POCUS22 assesses the shared over-representation of functional annotation terms between genes in different loci for the same disease.
Recently, Schlicker et al.26 developed a prioritization method that makes use of the similarity between the functional annotations of disease genes and candidates. In contrast to the approaches that consider solely identical functional annotations or compute only GO term enrichments, MedSim automatically derives functional profiles for each disease phenotype from the GO term annotation of known disease genes and, optionally, of their orthologs or interaction partners. Candidate genes are then scored and ranked according to the functional similarity of their annotation profiles to a disease profile. In addition, Ramírez et al.27 introduced the BioSim method for discovering biological relationships between genes or proteins. While MedSim is based only on GO term annotations, BioSim quantifies functional gene and protein similarity according to multiple data sources of functional annotations and can also be applied to rank candidate genes based on their functional similarity to known disease genes.
The success of the presented studies also shows that phenotypically similar diseases often involve common molecular mechanisms and thus functionally related genes. This also explains the frequent use of functional annotations as important biological evidence in integrative prioritization approaches.23–25,56–58,61–66,68,69 Notably, the information value of functional annotations can be further increased by improved scoring of functional similarity, reaching the performance of complex integrative methods based on multiple data sources.26
PRIORITIZATION METHODS USING NETWORK INFORMATION
In the last decade, molecular interaction networks have become an indispensable tool and a valuable information source in the study of human diseases. Regarding methods for prioritization of candidate disease genes, it was repeatedly observed that protein interaction networks are among the most powerful data sources in addition to functional annotations.11,16,18,28,29,81 As in case of sequence properties, disease genes and their products have discriminatory network properties that allow their distinction from non-disease genes. In particular, molecular interactions naturally support the application of the guilt-by-association principle to identify disease genes. In the following, we highlight a representative selection of network-based prioritization approaches.
Local Network Information
Early prioritization methods have focused on local network information such as close network neighborhood of a node representing a candidate gene or protein (see Box 1). This can be explained by the observation that disease proteins tend to cluster and interact with each other.30,49,82–84 Molecular triangulation is one of the first methods that used protein interaction networks to rank candidates and their network nodes with respect to their shortest path distances to nodes of known disease proteins.31 An evidence score such as the MLS score corresponding to linkage peak association85 is assigned to each disease protein node and transferred to its neighbor nodes. The candidates are then ranked according to the accumulated sum of evidence scores. This means that candidates represented by nodes close to several disease protein nodes with good evidence scores are considered to be the most promising ones.
Local network information refers to the topological neighborhood of a node, and corresponding measures are less sensitive to the overall network topology. Examples are the node degree kn (number of edges linked to node n) and the shortest path length dnm (minimum number of edges between the nodes n and m). In disease gene networks, Xu et al.34 define, for each node n, the 1N index and the 2N index . Here, is the number of edges between node nand disease genes, and Nn is the set of direct neighbors of n. Given the set of disease genes M, the average shortest path distance of a node n to disease genes is .
Global network information relates to the overall network topology and measures that characterize the role of a node in the whole network. Common centrality measures are shortest path closeness and betweenness as well as random-walk-related properties such as hitting time, visit frequency and stationary distribution. For instance, closeness centrality indicates how distant a node is to the other network nodes, and it is calculated as with Vn denoting the set of nodes reachable from n. The random walk with restart40 is defined as pt+1 = (1 − r)Wpt + rp0. Here, W is the column-normalized adjacency matrix of the network, pt contains the probability of being at each node at time step t, p0 denotes the initial probability vector, and r is restart probability. The steady-state probability vector p∞ can be obtained by performing iterations until the change between pt and pt+1 falls below some significance threshold, e.g., 10−6.40
In a related approach, Karni et al.32 identified the minimal set of candidates so that there is a path between the products of known disease genes in the protein interaction network. Oti et al.33 proposed an even simpler method for a genome-wide prediction of disease genes. For each known disease protein, they identified its interaction partners and the chromosomal locations of the encoding genes. A gene is then considered to be relevant for a disease of interest if it resides within a known disease locus and its gene product shares an interaction with a protein known to be associated with the same disease.
To make the most out of the potential of local network measures, Xu and Li34computed multiple topological properties for three different molecular networks consisting of literature-curated, experimentally derived, and predicted protein–protein interactions. The considered properties are the node degree, the average distance to known disease genes, the 1N and 2N node indices (see Box 1 and Figure 1), and the positive topological coefficient.86 The authors then trained a k-nearest-neighbor classifier using the aforementioned topological properties of known disease genes and achieved comparable performance for all three networks. They also detected a possible bias in the literature-curated network because disease genes tend to be studied more extensively.
In addition to local network information, Lage et al.35 incorporated phenotypic data into the disease gene prioritization. Each candidate and its direct interaction partners are considered as a candidate complex. All disease proteins in a candidate complex are assigned phenotypic similarity scores, which are used as input to a Bayesian predictor. Thus, a candidate gene obtains a high score if the other proteins in the complex are involved in phenotypes very similar to the disease of interest. Care et al.36 elaborated on this approach by combining it with deleterious SNP predictions for the candidate gene products and their interaction partners. Using the method by Lage et al., Berchtold et al.37 successfully prioritized proteins associated with type 1 diabetes (T1D). Further studies of protein interaction networks underlying specific diseases such as breast cancer38 and T1D39 also deal with the application of similar network-based prioritization approaches.
Global Network Information
Beyond local network information that ignores potential network-mediated effects from distant nodes, the utilization of global network measures can considerably improve the performance of prioritization methods for candidate disease genes.40,41,44 Especially for the study of polygenic diseases, network topology analysis can provide more insight into multiple paths of long-range protein interactions and their impact on the functionality and interplay of disease genes.
Köhler et al.40 demonstrated that random-walk analysis of protein–protein interaction networks outperforms local network-based methods such as shortest path distances and direct interactions as well as sequence-based methods like PROSPECTR.76 In their method, the authors ranked the gene products in a given network according to the steady-state probability of a random walk, which starts at known disease proteins and can restart with a predefined probability (see Box 1). Although the ranking criterion is the proximity of candidates to known disease proteins, this approach is more discriminative than local measures because it accounts for the global network structure.
In a similar manner, Chen et al.41 adapted three sophisticated algorithms from social network and web analysis for the problem of disease gene prioritization. To this end, they analyzed a protein–protein interaction network using modified versions of the random-walk-based methods PageRank,87,88Hyperlink-Induced Topic Search (HITS),88,89 and K-Step Markov method (KSMM).88 PageRank, HITS, and KSMM consider the global network topology and compute the relevance of all nodes representing candidates with regard to the set of known disease proteins in the network. All three methods achieved comparable performance to each other.
To address the issue of finding causal genes within expression quantitative trait loci (eQTL),90,91 Suthram et al.42 introduced the eQTL electrical diagrams method (eQED). They modeled confidence weights of protein interactions as conductances, while the P-values of associations between genetic loci and the expression of candidate genes served as current sources. The best candidate gene is the one passed by the highest current. The currents in electric circuits can be determined efficiently using random-walk computations.92,93
Network Centrality Measures
For many years, global centrality measures such as closeness or betweenness have been used in social sciences to assess how important individual nodes are for the overall network connectivity. Recently, such measures have been applied to several problems in bioinformatics including disease gene prioritization. For example, Dezső et al.43 applied an adapted version of shortest path betweenness to prioritize candidates in a protein–protein interaction network. A candidate is scored more relevant to the disease of interest if it lies on significantly more shortest paths connecting nodes of known disease proteins than other nodes in the network.
In a recent case study on primary immunodeficiencies (PIDs), Ortutay and Vihinen25 integrated functional GO annotations with protein interaction networks to discover novel PID genes. The authors conducted a topological analysis on an immunome network consisting of all essential proteins related to the human immune system and their interactions. In particular, they used the node degree as well as the global centrality measures of vulnerability and closeness to assess the importance of candidate genes in the network (see Box 1). Additionally, they performed functional enrichment analysis to determine genes with PID-related GO terms. With some modifications, the described prioritization method could be generalized to other diseases of interest.
Combining Network Measures
Recently, Navlakha and Kingsford44 compared different network-based prioritization methods. The authors observed that random-walk-based measures40 outperform measures focused on the local network neighborhood33 or clustering.94–96 A consensus method that uses a random-forest classifier to combine all methods yielded the most accurate ranking. Therefore, apart from stressing the potential of protein interaction data, Navlakha and Kingsford also showed that disease gene prioritization can benefit from the integration of multiple information sources.
In summary, as many other studies have also demonstrated, molecular interaction networks, in particular, based on protein interactions, provide valuable biological knowledge for ranking candidate disease genes.11,16,18,28,29 It has also become clear that global network measures achieve better results in comparison to local measures.40,41,44 Nevertheless, the performance of such prioritization approaches depends heavily on the quality of the network data. Protein interaction data are well known to be biased toward extensively studied proteins and subject to inherent noise.34,97,98 Therefore, it is often suggested that existing methods will perform better when more accurate data become available. Furthermore, Erten et al.45 pointed out that network-based methods can also be improved by integrating statistical adjustments for the skewed degree distribution of protein interaction networks.
PRIORITIZATION METHODS USING INTEGRATED KNOWLEDGE
Network information on molecular interactions as well as individual gene and protein characteristics such as sequence properties and functional annotations are major sources of biological evidence for scoring and ranking candidate disease genes. However, a prioritization approach based on a single information source alone usually achieves only limited performance due to noisy and incomplete datasets. To address this problem, the integration of multiple sources of biological knowledge has proven to be a good solution in bioinformatics. Different types of data can complement each other well to increase the amount of available information and its overall quality. While some of the methods presented above already make successful use of relatively simple integration procedures for a few different sources of functional information and annotations, this section will focus on more sophisticated methods for knowledge integration and the prioritization of candidate disease genes.
Complementing Molecular Interactions with Phenotypic Network Information
In the last years, several groups investigated the similarities and differences between disease phenotypes. The main finding was that similar phenotypes often share underlying genes or even pathways.46,99,100 In particular, van Driel et al.46 classified all human phenotypes contained in the Online Mendelian Inheritance in Man database (OMIM)101 by defining a measure of phenotypic similarity based on text mining of the corresponding OMIM records. Such phenotypic knowledge can be very useful to discover new potential disease genes by transferring known gene–phenotype associations to similar diseases and phenotypes.
Therefore, phenotypic similarity has become another major data source exploited by computational methods for prioritization of candidate disease genes.35,47–55 In this context, a two-layered heterogeneous data network is typically constructed so that the phenome layer consists of connections between similar phenotypes, while the interactome layer contains protein–protein interactions. The two network layers are then linked by known gene–phenotype associations.
To demonstrate the importance of the additional phenotype network layer for identifying novel gene–phenotype associations and disease–disease relationships, Li et al.48 extended the random-walk algorithm used by Köhler et al.,40 as described in the previous section, to heterogeneous networks. Both the candidate genes and the disease phenotypes are prioritized simultaneously. In contrast, Yao et al.49 estimated the closeness of a candidate gene to a disease of interest by computing the hitting time of a random walk that starts at the corresponding disease phenotype and ends at the candidate. This approach also allows the genome-wide identification of potential disease genes for phenotypic disease subtypes.
Chen et al.50 reformulated the candidate gene prioritization problem as a maximum flow problem on a heterogeneous network. They represented the capacities of connections between phenotypes by their phenotypic similarity. Capacities on edges within the interactome and on edges bridging the phenome and interactome were estimated during the evaluation procedure. By calculating a maximum flow from a phenotype of interest through the interactome, the authors ranked candidate genes with regard to the amount of efflux.
A computationally simpler approach based on the same network type was suggested by Guo et al.,51 who computed the association score between a gene and a disease as the weighted sum of all association scores between similar diseases and between neighboring genes in the interaction layer. To this end, the authors formulated an iterative matrix multiplication of disease–gene–association matrices and disease-similarity matrices corresponding to the network structure. While the maximum flow problem solved by Chen et al.50 already accounts for the phenotypic overlap between diseases, the approach by Guo et al.51 additionally considers the genetic overlap of diseases. The recent PhenomeNET52 is even a cross-species network of phenotypic similarities between genotypes and diseases based on a uniform representation of different phenotype and anatomy ontologies. In particular, it can be used to perform whole-phenome discovery of genes for diseases with unknown molecular basis.
Two other studies used iterative network flow propagation on a heterogeneous network to identify protein complexes related to disease. Vanunu et al.53 developed a prioritization method that propagates flow from a phenotype of interest through the whole network and identifies dense subnetworks around high-scored genes as potential phenotype-related protein complexes. In contrast, Yang et al.54 modified the described heterogeneous network and included an additional layer of protein complexes. In the resulting network, phenotypes are connected to protein complexes, and complexes are linked with each other according to the protein interactions in the interactome layer. The method derives novel gene–phenotype associations by propagating the network flow within the protein complex layer.
Disease gene prioritization methods usually rank candidate genes relative to a phenotype of interest. However, the discovery of gene–phenotype associations can also be approached the other way around. Hwang et al.55devised a method to identify the phenotype that could result from a given set of candidate genes. For that purpose, the authors considered a gene network and a phenotype similarity network. In both networks, the nodes were ranked separately with graph Laplacian scores, and a rank coherence was calculated from the score differences between genes and phenotypes connected by known associations. Hwang et al. showed that their approach is suitable to predict the resulting phenotype for a given set of candidate genes.
Integrating Heterogeneous Data Sources of Biological Knowledge
Two distinct approaches to disease gene prioritization that exploit multiple data sources are exemplarily highlighted in the following (Figure 2). The first approach considers each data source separately when assessing the molecular and phenotypic relationships of candidate genes with the disease of interest, and aggregates the resulting multiple ranking lists into a final ranking of the candidates. The alternative approach combines all biological information into a network representation and subsequently applies network measures to score and rank candidates with regard to their network proximity to nodes representing known disease genes.
In detail, the prioritization method Endeavour56,102 utilizes more than 20 data sources such as ontologies and functional annotations, protein–protein interactions, cis-regulatory information, gene expression data, sequence information, and text-mining results. For each data source, candidate genes are first ranked separately based on their similarity to a profile derived from known disease genes. Afterwards, all individual candidate rankings are merged into a final overall ranking using rank order statistics. The authors showed that this approach is quite successful in finding potential disease genes as well as genes involved in specific pathways or biological functions. Recently, Endeavour has also been benchmarked using various disease marker sets and pathway maps103 to confirm that it performs very well if sufficient data is available for the disease or pathway of interest and the candidate genes. Furthermore, Li et al.57 proposed a discounted rating system, an algorithm for integrating multiple rank lists, and compared it with the rank aggregation procedure used by Endeavour.
Like Endeavour, the method MetaRanker59 also combines many heterogeneous data sources and forms separate evidence layers from SNP-to-phenotype associations, candidate protein interactions, linkage study data, quantitative disease similarity, and gene expression information. For each layer, all genes in the human genome are ranked with regard to their probability to be associated with the phenotype of interest. The overall score of a gene is the product of its rank scores for each layer. The evaluation of MetaRanker indicates that it is particularly suited to uncover associations in complex polygenic diseases and that the integration of multiple data layers improves the identification of weak contributions to the phenotype of interest in comparison to the use of only few data sources.
Another combination of network-based methods with score aggregation has been proposed by Chen et al.60 The authors generate an individual network for each data source and quantify potential gene–disease relationships in each network using a global network measure based on diffusion kernels. The final candidate ranking considers only the most informative network score for each candidate gene. Furthermore, an alternative way of integrating information from multiple data sources is the application of machine learning techniques. Here, each data source can be represented as one or more individual features and used as input for the training of supervised learning methods. In particular, support vector machines,61–63 decision-tree-based classifiers,64 and PU learning65 (machine learning from positive and unlabeled examples) have been applied to prioritize candidate disease genes using multiple data sources.
In contrast, one of the first alternative approaches that integrate information from multiple data sources into a network representation has been Prioritizer.24 Its authors constructed a comprehensive functional human gene network based on a number of datasets from molecular pathway and interaction databases such as KEGG,104 BIND,105 HPRD,106 Reactome107as well as from GO annotations,80 yeast-2-hybrid screens, gene expression experiments, and protein interaction predictions. In this network, positional candidates from different disease loci are ranked according to the length of the shortest paths between them. In functional networks as used by Prioritizer, the main assumption is that relevant genes are involved in specific disease-related pathways and cluster together in the network even if their products are not closely linked by physical protein interactions.
Building upon Prioritizer, several research groups have assembled different types of integrated networks as biological evidence for candidate disease gene prioritization. One example is the two-layered network by Li et al.48presented in the previous section that combines protein interactions and phenotypic similarity. Another method was presented by Linghu et al.66 who employed naïve Bayes integration of diverse functional genomics datasets to generate a weighted functional linkage network and to prioritize candidate genes based on their shortest path distance to known disease genes. Similarly, Huttenhower et al.67 incorporated information from several thousands genomic experiments to generate a functional relationship network. From this network, the authors could derive functional maps of different phenotypes and showed in a case study for macroautophagy that these maps can be used successfully to find novel gene associations.
Recently, Lee et al.68 also provided a large-scale human network of functional gene–gene associations and evaluated the performance of six different network-based methods using it. Similar to the findings by Navlakha and Kingsford,44 the authors concluded that the strongest overall performance is achieved with algorithms that account for the global network structure such as Google’s PageRank. A more general view of the relationships between phenotypes and genes is introduced by BioGraph,69 a heterogeneous network containing diverse biomedical entities and relations between them, which are extracted from over 20 publicly available databases. By computing random walks on this network, the authors aim at the automated generation of functional hypotheses between different concepts, in particular, of candidate genes and diseases.
EVALUATION AND BENCHMARKING OF PRIORITIZATION METHODS
To show the biological applicability and scientific value of disease gene prioritization methods, their authors are normally expected to conduct an extensive performance evaluation and, if possible, a thorough comparison with other methods. To this end, many authors usually benchmark disease phenotypes from OMIM. Depending on the requirements of their method, only phenotypes with at least two or three known disease genes may be suitable. Hence, the number of evaluated diseases can vary from tens to hundreds with hundreds to thousands corresponding genes. The range of disease phenotypes and genes, for which a given method is applicable, depends on the data used by the method. For instance, only about 10% of all human protein–protein interactions have probably been described so far,108only about 10% of all human genes have at least one known disease association,101 and only about every second gene or protein is functionally annotated.109
Leave-one-out cross-validation is a widely used and generally accepted test for how a method might perform on previously unseen data. In each run, one of the known disease genes, the so-called target disease gene, is removed from the training data. The remaining disease genes are used to identify the omitted gene from a test set of genes that are not known to be associated with the disease of interest. In the best case, the top rank should be assigned to the target disease gene and lower ranks to the other test genes. Since cross-validation is a standard performance test, a number of suitable measures of predictive power exist, for example, sensitivity and specificity, receiver operating characteristic (ROC) curve, precision and recall, enrichment and mean rank ratio (see Box 2). Unfortunately, none of these measures is considered as default, which renders the comparison between different methods of disease gene prioritization difficult. In particular, it would be useful to report the performance for the top-ranked candidate genes, e.g., the first 10 or 20 genes, because only a few candidates can usually be considered for further validation experiments.
Another important aspect of the benchmarking strategy is the choice of genes in the test set, i.e., the candidate genes that are prioritized together with the target disease gene. One usual input for prioritization methods is a set of susceptibility loci as determined by GWA studies. These loci typically contain up to several hundreds of possible disease genes. Therefore, different strategies have been followed by authors to derive useful test sets, i.e., the definition of artificial gene loci, the random selection of genes, the use of the whole genome, and the small-scale choice of genes.
Here, we briefly describe frequently used measures for evaluating the performance of disease gene prioritization methods. A simple measure is the mean rank ratio defined as the average of rank ratios for all tested disease genes.110 One speaks of n/m-fold enrichment on average if disease genes are ranked in the top m% of all genes in n% of the linkage intervals.47 Other performance measures are calculated using the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) at a specific rank or score cut-off that discriminates predicted from not predicted ones. Positives are disease genes, while negatives are candidate genes without disease association. For instance, sensitivity is the percentage of correctly identified disease genes among all genes above the cut-off (TP/(TP + FN)), while specificity is the percentage of correctly dismissed candidate genes among all genes below the cut-off (TN/(TN + FP)). Plotting sensitivity versus specificity while varying the cut-off yields a ROC curve. The area-under-the-ROC curve (AUC) is a standard measure for the overall performance of binary classification methods (here, disease genes vs. others). The AUC is 100% in case of perfect prioritization and 50% if the disease genes were ranked randomly. In some cases, the authors of prioritization methods give the percentage of disease genes ranked in the top 1% and 5% of all genes, which corresponds to reporting the sensitivity at 99% or 95% specificity, respectively. The percentage of correctly prioritized disease genes among all disease genes is defined as precision (TP/(TP + FP)), while recall is equal to sensitivity. Thus, the plot of a precision-recall curvecan also be used to evaluate method performance.44
Endeavour56 and several related methods61,103,111 were evaluated with a test set containing 99 candidate genes chosen at random from the whole genome in addition to the target disease gene. Other methods were also benchmarked with this strategy, primarily, in order to be comparable with Endeavour.26,41,47,58,60,110,112 However, since similar genes tend to cluster in chromosomal neighborhoods,113 another, presumably, more difficult setting for performance benchmarking and especially more relevant for GWA studies is the definition of artificial linkage intervals with genes that surround the disease gene on the chromosome.22,24,26,35,40,45,47,48,50,53,57,60,76,112,114 The size of such intervals, as found in the relevant literature, ranges from the 100 nearest genes to 300 genes on average if a 10 Mb genomic neighborhood is considered.26 The average gene number of linkage intervals associated with diseases according to OMIM is estimated to be 108.35
The third option for assembling a test set is the use of all genes in the genome except for the known disease genes in the training set.44,47–49,110This setting is chosen only by the few methods that are capable of performing genome-wide disease gene prioritization. Finally, prioritization methods that consider, for instance, gene expression data are evaluated only on a smaller scale because there is not enough data for a comprehensive benchmarking over many disease phenotypes.38,43,68,77,115–117 Therefore, the authors commonly choose only few diseases that have, for example, the required experimental data available.
In this review, we gave an overview of different approaches to the prioritization of candidate disease genes. We described how disease genes can be identified by their molecular characteristics based on sequence properties, functional annotations, and network information. In particular, we presented recent approaches, which make use of phenotypic information and comprehensive knowledge integration. Finally, we discussed common benchmarking strategies of prioritization methods.
Many disease gene prioritization methods exploit discriminative gene and protein properties and successfully rank candidate genes according to their functional and phenotypic similarity or network proximity to known disease genes. Further improvement of the prioritization performance can be achieved by integrating biological information contained in multiple data sources. Many integrative methods first combine heterogeneous datasets and then apply specific analysis techniques. However, in the course of such analysis, the very useful insight which data source provides the most relevant biological information for the prioritization is usually lost. Therefore, it is also beneficial to follow the alternative approach that first analyzes each data source separately using the most suitable techniques and then combines the resulting ranking lists using sophisticated rank aggregation algorithms. This procedure also facilitates backtracking the origin of the most relevant information.
Among the most widely used data sources for disease gene prioritization are functional annotations and protein interactions as well as phenotypic similarity. In particular, performance evaluations of methods such as Endeavour, MedSim, and Prioritizer demonstrated consistently that functional GO term annotations constitute by far one of the most useful biological evidence sources for candidate prioritization.24,26,56 Further performance gain can be attributed to comprehensive knowledge integration, which reduces the noise in the integrated data and provides additional information from data sources that is not captured (yet) by GO term annotations. Even more performance increase can be expected when the used data sources become more and more complete and exhibit high quality without significant bias toward intensively studied genes and proteins.
Currently, the multitude of benchmarking strategies pursued by different researchers considerably hampers the performance comparison of disease gene prioritization methods. Moreover, some methods make use of only small test datasets due to the lack of the required training data and the limited amount of known disease genes. Nevertheless, established procedures to derive test sets and the application of different standard performance measures should form part of every benchmarking strategy to evaluate new prioritization methods comprehensively with respect to other well-performing methods. To facilitate future performance comparisons, the training and test datasets should always be made publicly available together with the published work. In the end, since follow-up validation experiments tend to be expensive and time-consuming, it is vital that the correct disease genes are found on the few top ranks of the prioritization list.
Part of this study was financially supported by the BMBF through the German National Genome Research Network (NGFN) and the Greifswald Approach to Individualized Medicine (GANI_MED). The research was also conducted in the context of the DFG-funded Cluster of Excellence for Multimodal Computing and Interaction.
The rapid development of high-throughput technologies and computational frameworks enables the examination of biological systems in unprecedented detail. The ability to study biological phenomena at omics levels in turn is expected to lead to significant advances in personalized and precision medicine. Patients can be treated according to their own molecular characteristics. Individual omes as well as the integrated profiles of multiple omes, such as the genome, the epigenome, the transcriptome, the proteome, the metabolome, the antibodyome, and other omics information are expected to be valuable for health monitoring, preventative measures, and precision medicine. Moreover, omics technologies have the potential to transform medicine from traditional symptom-oriented diagnosis and treatment of diseases toward disease prevention and early diagnostics. We discuss here the advances and challenges in systems biology-powered personalized medicine at its current stage, as well as a prospective view of future personalized health care at the end of this review. WIREs Syst Biol Med 2013, 5:73–82. doi: 10.1002/wsbm.1198
Conflict of interest: M.S. serves as founder and consultant for Personalis, a member of the scientific advisory board of GenapSys, and a consultant for Illumina.
For further resources related to this article, please visit the WIREs website.
Personalized or precision medicine is expected to become the paradigm of future health care, owing to the substantial improvement of high-throughput technologies and systems approaches in the past two decades.1,2Conventional symptoms-oriented disease diagnosis and treatment has a number of significant limitations: for example, it focuses on only late/terminal symptoms and generally neglects preclinical pathophenotypes or risk factors; it generally disregards the underlying mechanisms of the symptoms; the disease descriptions are often quite broad so that they may actually include multiple diseases with shared symptoms; the reductionist approach to identify therapeutic targets in traditional medicine may over-simplify the complex nature of most diseases.3 Advances in the ability to perform large-scale genetic and molecular profiling are expected to overcome these limitations by addressing individualized differences in diagnosis and treatment in unprecedented detail.
The rapid development of high-throughput technologies also drives modern biological and medical researches from traditional hypothesis-driven designs toward data-driven studies. Modern high-throughput technologies, such as high-throughout DNA sequencing and mass spectrometry, have enabled the facile monitoring of thousands of molecules simultaneously instead of just a few components that have been analyzed in traditional research, thus generating a huge amount of data to document the real-time molecular details of a given biological system. Ultimately, when enough knowledge is gained, these molecular signatures, as well as the biological networks they form, may be associated with the physiological state/phenotype of the biological system at the very moment when the sample is taken.
Future personalized health care is expected to benefit from the combined personal omics data, which should include genomic information as well as longitudinal documentation of all possible molecular components. This combined information not only determines the genetic susceptibility of the person, but also monitors his/her real-time physiological states, as our integrative Personal Omics Profile (iPOP) study exemplified.4 In this review we will cover recent advances in systems biology and personalized medicine. We will also discuss limitations and concerns in applying omics approaches to individualized, precision health care.
GENOMICS IN DISEASE-ORIENTED MEDICINE
The revolution of omics profiling technologies significantly benefited disease-oriented studies and health care, especially in disease mechanism elucidation, molecular diagnosis, and personalized treatment. These new technologies greatly facilitated the development of genomics, transcriptomics, proteomics, and metabolomics, which have become powerful tools for disease studies. Today, molecular disease analyses using large-scale approaches are pursued by an increasing number of physicians and pathologists.5,6
Initially, genome-wide association studies (http://gwas.nih.gov/) were launched in search of association of common genetic variants to certain phenotypes of interest, which typically assayed more than 500,000 single nucleotide polymorphisms (SNPs) and/or copy number variations (CNVs) with DNA microarrays in thousands to hundred thousands of participants.7 To date, 1,355 publications are listed in the National Human Genome Research Institute (NHGRI) GWAS Catalog reporting the association of 7,226 SNPs with 710 complex traits.7 The studied complex traits vary vastly, from cancers (e.g., prostate cancer and breast cancer) and complex diseases (e.g., type 1 and type 2 diabetes (T2D), Crohn’s Disease) to common traits (e.g., height and body mass index). These findings greatly broadened our knowledge on disease loci, and can potentially benefit disease risk prediction and drug treatments (as discussed in the section Integrative Omics in Preventative Medicine). Although powerful, GWAS studies have proven difficult for most complex diseases as typically a large number of loci are identified, each contributing to a small fraction of the genetic risk. These studies have many limitations including the small fraction of the genome that is analyzed, and failure to account for gene-gene interactions, epistasis and environmental factors.8
Whole genome sequencing (WGS) and whole exome sequencing (WES) have become more and more affordable for genomic studies and are rapidly replacing DNA microarrays. Single-base analysis of a genome/exome is achieved, which allows scientists to investigate the genetic basis of health and disease in unprecedented detail. Assigning variants to paternal and maternal chromosomes i.e. ‘phasing’ can be obtained through the analysis of families9or other methods.1,10,11 With the generation of massive amount of whole genome and exome data from diseased and healthy populations, understanding of both human population variation and genetic diseases, especially complex diseases, has been brought to a new level.1,12
One field that significantly benefited from WGS technologies is cancer-related research. A large number of cancer genomes have been sequenced through individual or collaborative efforts, such as the International Cancer Genome Consortium (http://www.icgc.org/) and the Cancer Genome Atlas (http://cancergenome.nih.gov/). The DNA from many types of cancer have been sequenced, including breast cancer,13–15 chronic lymphocytic leukaemia,16 hepatocellular carcinoma,17 pediatric glioblastoma,13melanoma,18 ovarian cancer,19 small-cell lung cancer,20 and Sonic-Hedgehog medulloblastoma,21 and databases are established, such as the cancer cell line encyclopedia.22 In addition, single-cell level cancer genome has also been investigated by WES for clear cell renal cell carcinoma23 and JAK2-negative myeloproliferative neoplasm.24 Somatic mutations and subtyping molecular markers were identified from these genomes. These different studies have revealed that nearly every tumor is different with distinct types of potential ‘driver’ mutations. Importantly, cancer genome sequencing often reveals potential targets that may suggest precision cancer treatment for the specific patients. As an example, a novel spontaneous germline mutation in the p53 gene was identified by WGS in a female patient, which accounted for the three types of cancers she developed in merely 5 years.25 An attempt has been made recently to treat a female patient with T Cell Lymphoma based on the target gene, CTLA4, identified by whole genome sequencing.26 The patient’s cancer was suppressed for two months with the anti-CTLA4 drug ipilimumab, although she died of recurrence soon after.
Whole genome and exome sequencing can also facilitate the identification of possible causal genes for hereditary genetic diseases, and is increasingly used in attempts to understand the basis of these ‘mystery diseases’ once obvious candidates are ruled out. In one successful example, whole genome sequencing of a fraternal twin pair with dopa (3,4-dihydroxyphenylalanine)-responsive dystonia helped the identification of one pair of personalized compound heterozygous mutations in the gene SPR, which accounted for the disease in both individuals.27 Importantly, based on the genome information the authors supplemented the l-dopa therapy with 5-hydroxytryptophan (SPR-dependent serotonin precursor) and significantly improved the health of both patients. In another example, Roach et al. sequenced the whole genomes of a family quartet and identified rare mutations in the genes DHODH and DNAH5 responsible for the two recessive disorders in both children—Miller syndrome and primary ciliary dyskinesia.28
Pharmacogenomics is another important application of genomic sequencing. It is known that the same drug may have different effect on different individuals due to their personal genomic background and living habits.8,29Genetic information can be used to assign drug doses and reduce side effects. For example, genetic variants are known to affect patients’ response to antipsychotic drugs.30 Based on pharmacogenomic trials, genetic tests for four drugs are required by the US Food and Drug Administration (FDA) before the administration of these drugs to patients, including the anti-cancer drugs cetuximab, trastuzumab, and dasatinib, and the anti-HIV drug maraviroc, and more are recommended such as the anticoagulant drug Warfarin and the anti-HIV drug Abacavir.8
OTHER OMICS TECHNOLOGIES AND MEDICINE
Other omics technologies are also likely to impact medicine. High throughput sequencing technologies have enabled whole transcriptome (cDNA) sequencing, or abbreviated as RNA-Seq.31 RNA-Seq has become a powerful tool for disease-related studies, as it has great accuracy and sensitivity relative to microarray technology and it can also detect splicing isoforms.32 As RNA profiles reflect actual gene activity, it is closer to the real phenotype compared to genomic sequence. With RNA-Seq, Shah et al. discovered varied clonal preference and allelic abundance in 104 cases of primary triple-negative breast cancers, and observed that ∼36% of the genomic mutations were actually expressed.33 Combining such information with genomic information may be valuable in treatment of cancer and other diseases. Moreover, RNA-Seq also captures more complex aspects of the transcriptome, such as splicing isoforms34 and editing events,35 which are generally overlooked by hybridization-based methods. Splicing variants have now been associated with several distinct types of cancer and cancer prognosis.36–40
Although proteins have long been deemed as the executors of most biological functions, clinical proteomics is still a relatively young field due to technological limitations to profile the complexity of the proteome with high sensitivity and accuracy. Since the development of new soft desorption methods that enabled the analysis of biological macromolecules with mass spectrometry, proteomics advanced significantly in the past decade.41,42With current mass spectrometry technology, one can now quantify thousands of proteins in a single sample. For example, we were able to reliably detect 6,280 proteins in the human peripheral blood mononuclear cell proteome.4Mass spectrometry also allows the detection of expressed mutations, allele-specific sequences and editing events in the human proteome,4,43 as well as profiling of the phosphoproteome.44 Also of note is the MALDI-TOF (matrix-assisted laser desorption/ionization-time of flight) mass spectrometry-based imaging technology (MALDI-MSI) developed by Cornett et al., which allows spatial proteome profiling in defined two-dimentional laser-shot areas using tissue sections.45 Using MALDI-MSI, Kang et al. identified immunoglobulin heavy constant α2 as a novel potential marker for breast cancer metastasis.46
The field of metabolomics has also advanced significantly with the improvement of mass spectrometry. Both hydrophilic and hydrophobic metabolites can be profiled in specific samples.4,47 As the metabolome reflects the real-time energy status as well as metabolism of the living organism, it is expected that certain metabolome profiles may be associated with different diseases.48 Therefore, metabolomic profiles become an important aspect for personalized medicine.49,50 Jamshidi et al. profiled the metabolome of a female patient with Hereditary Hemorrhagic Telangiectasia (HHT) along with four healthy controls, and identified differences which highlighted the nitric oxide synthase pathway.51 The authors then treated the patient with bevacizumab and shifted her metabolomic profile toward those of the healthy controls and improved the patient’s health. In addition, branched-chain amino acids such as isoleucine have been associated with T2D and may ultimately prove to be valuable biomarkers.52 Finally, since some metabolites bind and directly regulate the activity of other biomolecules (e.g., kinases),53 there is significant potential to modulate cellular pathways using diet and metabolic analogs that serve as agonist or antagonist of protein function.
INTEGRATIVE OMICS IN PREVENTATIVE MEDICINE
The concept of personalized medicine emphasizes not only personalized diagnosis and treatment, but also personalized disease susceptibility assessment, health monitoring and preventative medicine. Because disease is easier to manage prior to it onset or when a disease is at its early stages, risk assessment and early detection will be transformative in personalized medicine. Systems biology has the potential to capture real-time molecular phenotypes of a biological system, which enables the detection of subtle network perturbations preluding the actual development of clinical symptoms.
Disease susceptibility and drug response can be assessed with a person’s genomic information.8 This information may serve as a guideline for monitoring the health of a particular patient to achieve personalized health care, as showcased by Ashley et al.54 Whole genome sequence revealed variants for both high-penetrance Mendelian disorders, such as HTT(Huntington’s disease55) and PAH (Phenylketonuria56), as well as common, complex diseases, such as the disease-associated genetic variants reported in GWAS studies.57 Disease risks can be evaluated for a given person and an increase or decrease in disease risk compared with the population risk (of the same ethnicity, age, and gender) can be estimated (Figure 1). In the study of Ashley et al., the genome of a patient was analyzed and increased post-test probability risks for myocardial infarction and coronary artery disease were estimated.54 Their estimation matched the fact that the patient, although generally healthy, had a family history of vascular disease as well as early sudden death.58 Genetic variants associated with heart-related morbidities as well as drug response were identified in the patient’s genome, the information of which, as the authors stated, may direct the future health care for this particular patient. Similarly, Dewey et al. further extended this work by analysing a family quartet using a major allele reference sequence, and identified high-risk genes for familial thrombophilia, obesity, and psoriasis.59
To further explore variation and power of the full human genome, projects and databases (such as the Personal Genome Project60) are being launched to help advance this field. However, genomic information alone usually is not adequate to predict disease onset, and other factors such as environment are expected to play a critical role in this process.61,62 The predictive capability of whole genome sequence was assessed by Roberts et al. through modeling 24 disease risks in monozygotic twins.63 For each disease, the authors modeled the genotype distribution in the twin population according to the observed concordance/discordance, and discovered that for most individuals and most diseases, the relative risk would be tested negative compared to the population, and in the best-case scenario, only one disease or more could be forewarned for any individual. The results of Roberts et al. are not surprising, as disease manifestation is probabilistic and not deterministic. Nonetheless, whole genome information by itself is expected to have partial value in disease prediction for complex diseases. In addition, from a systems point of view, peripheral components of the biological network would be more likely to contribute to complex diseases, as perturbation of the main nodes, which are usually essential genes, would be lethal.64 Therefore it is more difficult to identify the exact contributors of complex diseases. Moreover, as stated above, non-genomic factors may also exist and further complicate the situation. As an example of this, multiple sclerosis is known to have genetic components, however, Baranzini et al. failed to identify genomic, epigenomic or transcriptomic contributors in discordant monozygotic twins, which may indicate the existence of other factors, such as the environment.65
Current technologies, especially high-throughput sequencing and mass spectrometry, enable the monitoring of at least 105 molecular components, including DNA, RNA, protein, and metabolites in the human body. Therefore it is now feasible to identify the profiles of these components that correlate with various physiological states of the body, and profile alterations as a result of physiological state changes and diseases. Compared with genomic sequences alone, the profiles of transcriptome, proteome and metabolome are closer indicators to the real-time phenotype, therefore collecting these omics information in a longitudinal manner would allow monitoring of an individual’s physiological states. To test this concept, we implemented a study by following a generally healthy participant for 14 (now 32) months with integrated Personal Omics Profile (iPOP) analysis, incorporating information of the participant’s genome with longitudinal data from the person’s transcriptome, proteome, metabolome, and autoantibodyome.4 As blood constantly circulates the human body and exchanges biological matters with local tissues and is presently analyzed in medical tests, we chose to monitor the participant’s physiological states by profiling the blood components (PBMCs, serum and plasma) with iPOP analysis. The genome of this individual was sequenced with two WGS (Illumina and Complete Genomics) and three WES (Agilent, Roche Nimblegen, and Illumina) platforms to achieve high accuracy, which was further analyzed for disease risk and drug efficiency. The identified elevated risks included coronary artery disease, basal-cell carcinoma, hypertriglyceridemia and T2D, and the participant was estimated to have favorable response to rosiglitazone and metformin, both are antidiabetic medications. Although the participant has a known family history for some of the high-risk diseases (but not T2D), he was free from most of them (except for hypertriglyceridemia, for which he used medication) and had a normal Body Mass Index at the start of our study. Nonetheless, these elevated disease risks served as a guideline to monitor his personal health with iPOP analysis. We profiled the transcriptome, proteome and metabolome from 20 time points in the 14 months, and monitored molecular profile changes for physiological state change events during our study, including two viral infections. The subject also acquired T2D during the study, immediately after one of the viral (respiratory syncytial virus) infections. Two types of changes were observed from our iPOP data: the autocorrelated trends that reflect chronic changes, and the spikes which include significantly up/down-regulated genes and pathways especially at the onset of each event. With our iPOP approach, we acquired a comprehensive picture of detailed molecular differences between different physiological states, as well as during disease onset. In particular, interesting changes in glucose and insulin signaling pathways were observed during the onset of T2D. We also obtained other important information from our omics data, such as dynamic changes in allele-specific expression and RNA-editing events, as well as personalized autoantibody profiles. Overall, this study revealed an important application of the use of genomics and other omics profiling for personalized disease risk estimation and precision medicine, as we discovered the increased T2D risk, monitored its early onset, and helped the participant effectively control and eventually reverse the phenotype by proactive interventions (diet change and physical exercise).
Another important feature of our study is that samples are collected in a longitudinal fashion so that aberrant/disease states can be compared to healthy states of the same individual. One other advantage of our iPOP approach is its modularity, as other omics and quantifiable information can also be included in the iPOP profile, which can be readily tailored to monitor any biological or pathological event of interest (Figure 2). Examples of other information are: epigenome,66 gut microbiome,67 microRNA profiles68 and immune receptor repertoire.69 Moreover, quantifiable behavioral parameters such as nutrition, exercise, stress control and sleep may also be added to the profile.70
THE IMPORTANCE OF DATA MINING AND RE-MINING
One important aspect of systems biology is data mining. Data management and access can become a daunting task given the tremendous amount of data generated with current high-throughput technologies, and the data size is constantly increasing with time.71 Challenges exist computationally in each step to handle, process and annotate high-throughput data, integrate data from different sources and platforms, and pursue clinical interpretation of the data.72 These steps can be quite computationally intensive and require significant computational hardware; for example, to map short reads to achieve 30× coverage of the human genome, 13 CPU days is typically required72 although these times are rapidly decreasing. Moreover, as biological systems act more than just the sum of its individual parts, knowledge from multiple levels (such as epistasis, interaction, localization, and activation status) should be considered to capture the underlying highly organized networks for functional annotations.73 Ultimately it will be important to have a comprehensive database that contains Electronic Health records (including treatment information), genome sequences with variant calls and as much molecular information as possible. In principle with appropriate algorithms such a database could be mined by physicians to make data-driven medical decisions.
Currently many high-throughput datasets of similar types (e.g., expression and genome-wide association data collected from different populations of the same disease) were created as smaller, separate studies. Thus combining these publicly available datasets bioinformatically may provide more statistical power and lead to a clearer conclusion that could not be achieved in the individual studies. The work by Roberts et al. mentioned above serves as one example.63 In order to test the capacity of whole genome information, the authors combined monozygotic twin pair data from a total of five sources in 13 publications to obtain a much large dataset for their test. Similarly, Butte and colleagues combined the results of 130 functional microarray experiments for T2D and re-mined the data for repeatedly appeared candidate genes.74 They identified CD44 as the top candidate gene associated with T2D. In a related effort, by analyzing curated data of 2,510 individuals from 74 populations, the group led by Butte also discovered that T2D risk alleles were unevenly distributed across different human populations, with the risk higher in African and lower in Asian populations.75
CONCERNS AND LIMITATIONS
Personalized health monitoring and precision medicine is just accelerating at a rapid pace because of the development of systems biology. As noted above, multiple efforts in both technology development and biological application have occurred, and an increasing number of researchers and physicians alike are sharing this vision. Hood et al. termed this approach as ‘P4 Medicine’ for predictive, preventive, personalized and participatory medicine.12
Nevertheless, many concerns also exist, and guidelines on translational omics research have been recommended by the Institute of Medicine.76 Khoury et al. suggested ‘a fifth P’, that is, the population perspective be added to personalized medicine77 and population validation of systems results with strong evidence should be achieved before its clinical application. Many disease-associated genetic variants discovered in GWAS still need to be functionally validated.78 In addition, Khoury et al. raised concerns that restricted health care resources might be wasted if unneeded disease screening/subclassification with systems approaches were conducted rather than lowering health care costs. However, with the rapid drop in technology costs and carefully designed pilot studies, the optimal screening frequencies/levels of subclassification necessary for precision medicine could be determined and costs maintained at affordable levels. It is worth noting that generating personalized omics data with appropriate interpretation can greatly benefit our understanding of physiological events for health and disease, and precision health care as we gain more knowledge in this field. In addition to personalized diagnosis and treatment, the future of precision medicine with omics approaches should emphasize personalized health monitoring, molecular symptom, early detection and preventative medicine, a paradigm shift from traditional health care.
As the human body is a highly organized, complex system with multiple organs and tissues, it is important to select the correct sample type for understanding a specific biological problem. However, as many sample types are unavailable (e.g., brain tissue) or not regularly accessible (e.g., biopsy samples from internal organs) from living individuals, our scope for personalized health monitoring is thus restricted. Therefore systems biology results, especially iPOP results, should not be over-interpreted. Although iPOP data from blood components may indicate changes in the other parts of the human body, the actual profiles for the tissue of interest might be underrepresented in blood or delayed in phase.
It is still not clear who is to develop and deliver personalized treatments for personalized medicine if they are not available as conventional medication. The cost for developing personalized drugs may become prohibitive to accurately address personal specificity, and may face other difficulties such as Food and Drug Administration approval. However, advances in high-throughput drug discovery will help accelerate this field.
In addition, personalized medicine using omics approaches relies heavily on technology development for biological research. This includes advances in both research instrumentation and computational framework. For example, it is still not possible to accurately determine the entire sequence of a genome due to limitations of current WGS/WES methods,79,80 even after computational improvement of signal-to-noise ratio.81,82 A low sequencing error rate was claimed by both the Illumina HiSeq (for 2 × 100 bp reads, more than 80% of the bases have a quality score above Q30, or 99.9% accuracy, http://www.illumina.com/documents//products/datasheets/datasheet_hiseq_systems.pdf) and the Complete Genomics platform (1 × 10−5 at the time of our study80 and 2 × 10−6 as of October 8th, 2012, www.completegenomics.com); however, per variant error rate is still high (15.50% and 9.08% for Illumina and Complete Genomics respectively with no filter, and 1.01% and 1.12% post multiple filters) as reported by Reumers et al.,81 which agreed with our observation that only 88.1% of the SNP calls overlapped when the same genome was sequenced with the two platforms.80 Thus possible disease-associated variants in these platform-specific regions might be overlooked or misinterpreted. Another issue lies in storage and processing of the omics data, as petabytes of data can easily be generated for a small iPOP study of 200 participants and demanding computing resources will be needed for data analysis. Therefore, interdisciplinary efforts from biologists, computer scientists and hardware engineers should be organized to ensure the continued improvement of this field.
The era of personalized precision medicine is about to emerge. The steady improvement of high-throughput technologies greatly facilitates this process by enabling profiling of various omes such as whole genome, epigenome, transcriptome, proteome and metabolome, which convey detailed information of the human body. Integrated profiles of these omes should reflect the physiological status of the host at the time the samples are collected. Personalized omics approach catalyzes precision medicine at two levels: for diseases and biological processes whose mechanisms are still unclear, omics approach will facilitate researches that would greatly advance our understanding; and when the mechanisms are clarified, individualized health care can be provided through health monitoring, preventative medicine, and personalized treatment. This would be especially helpful for complex diseases such as autism83 and Alzheimer’s disease,84 where multiple factors are responsible for the phenotypes. Furthermore, omics approach also facilitates the development of other less-stressed but important health-related fields, such as nutritional systems biology, which studies personalized diet and its relationship to health in systems point of view.85 With the rapid decrease in the cost of omics profiling, we anticipate an increased number of personalized medicine applications in many aspects of health care besides our proof-of-principle study. This will significantly improve the health of the general public and cut down health care costs. Scientists, governments, pharmaceutical companies and patients should work closely together to ensure the success of this transformation.86
This work is supported by funding from the Stanford University Department of Genetics and the National Institutes of Health. We thank Drs. George I. Mias and Hogune Im for their help in proof-reading the article and the insightful discussions.
Preparing scientists to harness and apply the power of data to understand the complexities of human biology
Bioinformatics (BI) is a curriculm pathway within the Biological and Medical Informatics Graduate Program at the University of California, San Francisco (UCSF). The program offers two optional designated emphases in:
- Computational Biology and Bioinformatics (CBBI).
- Complex Biological Systems (CBS).
We prepare scientists to use tools from mathematics to physics and from chemistry to biology to gather, store, analyze, predict, and disseminate information about biology. The field is essential, for without quantitative analysis of the massive and growing amounts of biological data generated by various systems, biology and -omics data cannot be interpreted or exploited.