Deep Learning Papers on Medical Image Analysis
To the best of our knowledge, this is the first list of deep learning papers on medical applications. There are a couple of lists of deep learning papers in general, or on computer vision, for example Awesome Deep Learning Papers. In this list, I try to classify the papers based on their deep learning techniques and learning methodology. I believe this list could be a good starting point for DL researchers working on medical applications.
- A list of top deep learning papers published since 2015.
- Papers are collected from peer-reviewed journals and highly reputed conferences. However, the list may also include recent papers from arXiv.
- Metadata is required along with each paper, i.e. Deep Learning technique, Imaging Modality, Area of Interest, Clinical Database (DB).
List of Journals / Conferences (J/C):
- Medical Image Analysis (MedIA)
- IEEE Transactions on Medical Imaging (IEEE-TMI)
- IEEE Transactions on Biomedical Engineering (IEEE-TBME)
- IEEE Journal of Biomedical and Health Informatics (IEEE-JBHI)
- International Journal of Computer Assisted Radiology and Surgery (IJCARS)
- International Conference on Information Processing in Medical Imaging (IPMI)
- International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)
- International Conference on Information Processing in Computer-Assisted Interventions (IPCAI)
- IEEE International Symposium on Biomedical Imaging (ISBI)
Deep Learning Techniques:
- NN: Neural Networks
- MLP: Multilayer Perceptron
- RBM: Restricted Boltzmann Machine
- SAE: Stacked Auto-Encoders
- CAE: Convolutional Auto-Encoders
- CNN: Convolutional Neural Networks
- RNN: Recurrent Neural Networks
- LSTM: Long Short Term Memory
- M-CNN: Multi-Scale/View/Stream CNN
- MIL-CNN: Multi-instance Learning CNN
- FCN: Fully Convolutional Networks
Imaging Modality:
- US: Ultrasound
- MR/MRI: Magnetic Resonance Imaging
- PET: Positron Emission Tomography
- MG: Mammography
- CT: Computed Tomography
- H&E: Hematoxylin & Eosin Histology Images
- RGB: Optical Images
- AutoEncoders/ Stacked AutoEncoders
- Convolutional Neural Networks
- Recurrent Neural Networks
- Generative Adversarial Networks
- Detection/ Localization
- Image Reconstruction and Post-Processing
- Other tasks
- AggNet: Deep Learning From Crowds for Mitosis Detection in Breast Cancer Histology Images
- Fast Convolutional Neural Network Training Using Selective Data Sampling: Application to Hemorrhage Detection in Color Fundus Images
|NN||H&E||N/A||Deep learning of feature representation with multiple instance learning for medical image analysis [pdf]||ICASSP||2014|
|M-CNN||H&E||Breast||AggNet: Deep Learning From Crowds for Mitosis Detection in Breast Cancer Histology Images [pdf]||AMIDA||IEEE-TMI||2016|
|FCN||H&E||N/A||Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation [pdf]||MICCAI||2017|
|M-CNN||CT||Lung||Multi-scale Convolutional Neural Networks for Lung Nodule Classification [pdf]||LIDC-IDRI||IPMI||2015|
|3D-CNN||MRI||Brain||Predicting Alzheimer’s disease: a neuroimaging study with 3D convolutional neural networks [pdf]||ADNI||arXiv||2015|
|CNN+RNN||RGB||Eye||Automatic Feature Learning to Grade Nuclear Cataracts Based on Deep Learning [pdf]||IEEE-TBME||2015|
|CNN||X-ray||Knee||Quantifying Radiographic Knee Osteoarthritis Severity using Deep Convolutional Neural Networks [pdf]||O.E.1||arXiv||2016|
|CNN||H&E||Thyroid||A Deep Semantic Mobile Application for Thyroid Cytopathology [pdf]||SPIE||2016|
|3D-CNN, 3D-CAE||MRI||Brain||Alzheimer’s Disease Diagnostics by a Deeply Supervised Adaptable 3D Convolutional Network [pdf]||ADNI||arXiv||2016|
|M-CNN||RGB||Skin||Multi-resolution-tract CNN with hybrid pretrained and skin-lesion trained layers [pdf]||Dermofit||MLMI||2016|
|CNN||RGB||Skin, Eye||Towards Automated Melanoma Screening: Exploring Transfer Learning Schemes [pdf]||EDRA, DRD||arXiv||2016|
|M-CNN||CT||Lung||Pulmonary Nodule Detection in CT Images: False Positive Reduction Using Multi-View Convolutional Networks [pdf]||LIDC-IDRI, ANODE09, DLCST||IEEE-TMI||2016|
|3D-CNN||CT||Lung||DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification [pdf]||LIDC-IDRI, LUNA16||IEEE-WACV||2018|
|3D-CNN||MRI||Brain||3D Deep Learning for Multi-modal Imaging-Guided Survival Time Prediction of Brain Tumor Patients [pdf]||MICCAI||2016|
|SAE||US, CT||Breast, Lung||Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans [pdf]||LIDC-IDRI||Nature||2016|
|CAE||MG||Breast||Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring [pdf]||IEEE-TMI||2016|
|MIL-CNN||MG||Breast||Deep multi-instance networks with sparse label assignment for whole mammogram classification [pdf]||INbreast||MICCAI||2017|
|GCN||MRI||Brain||Spectral Graph Convolutions for Population-based Disease Prediction [pdf]||ADNI, ABIDE||arXiv||2017|
|CNN||RGB||Skin||Dermatologist-level classification of skin cancer with deep neural networks||Nature||2017|
|FCN + CNN||MRI||Liver-Liver Tumor||SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks [pdf]||ISBI||2017|
|MLP||CT||Head-Neck||3D Deep Learning for Efficient and Robust Landmark Detection in Volumetric Data [pdf]||MICCAI||2015|
|CNN||US||Fetal||Standard Plane Localization in Fetal Ultrasound via Domain Transferred Deep Neural Networks [pdf]||IEEE-JBHI||2015|
|2.5D-CNN||MRI||Femur||Automated anatomical landmark detection on distal femur surface using convolutional neural network [pdf]||OAI||ISBI||2015|
|LSTM||US||Fetal||Automatic Fetal Ultrasound Standard Plane Detection Using Knowledge Transferred Recurrent Neural Networks [pdf]||MICCAI||2015|
|CNN||X-ray, MRI||Hand||Regressing Heatmaps for Multiple Landmark Localization using CNNs [pdf]||DHADS||MICCAI||2016|
|CNN||MRI, US, CT||–||An artificial agent for anatomical landmark detection in medical images [pdf]||SATCOM||MICCAI||2016|
|FCN||US||Fetal||Real-time Standard Scan Plane Detection and Localisation in Fetal Ultrasound using Fully Convolutional Neural Networks [pdf]||MICCAI||2016|
|CNN+LSTM||MRI||Heart||Recognizing end-diastole and end-systole frames via deep temporal regression network [pdf]||MICCAI||2016|
|M-CNN||MRI||Heart||Improving Computer-Aided Detection Using Convolutional Neural Networks and Random View Aggregation [pdf]||IEEE-TMI||2016|
|CNN||PET/CT||Lung||Automated detection of pulmonary nodules in PET/CT images: Ensemble false-positive reduction using a convolutional neural network technique [pdf]||MP||2016|
|3D-CNN||MRI||Brain||Automatic Detection of Cerebral Microbleeds From MR Images via 3D Convolutional Neural Networks [pdf]||IEEE-TMI||2016|
|CNN||X-ray, MG||–||Self-Transfer Learning for Fully Weakly Supervised Lesion Localization [pdf]||NIH,China, DDSM,MIAS||MICCAI||2016|
|CNN||RGB||Eye||Fast Convolutional Neural Network Training Using Selective Data Sampling: Application to Hemorrhage Detection in Color Fundus Images [pdf]||DRD, MESSIDOR||MICCAI||2016|
|GAN||–||–||Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery||IPMI||2017|
|FCN||X-ray||Cardiac||CathNets: Detection and Single-View Depth Prediction of Catheter Electrodes||MIAR||2016|
|3D-CNN||CT||Lung||DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification [pdf]||LIDC-IDRI, LUNA16||IEEE-WACV||2018|
|U-Net||–||–||U-net: Convolutional networks for biomedical image segmentation||MICCAI||2015|
|FCN||MRI||Head-Neck||Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation [pdf]||arXiv||2016|
|FCN||CT||Liver-Liver Tumor||Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully Convolutional Neural Networks and 3D Conditional Random Fields [pdf]||MICCAI||2016|
|3D-CNN||MRI||Spine||Model-Based Segmentation of Vertebral Bodies from MR Images with 3D CNNs||MICCAI||2016|
|FCN||CT||Liver-Liver Tumor||Automatic Liver and Tumor Segmentation of CT and MRI Volumes using Cascaded Fully Convolutional Neural Networks [pdf]||arXiv||2017|
|FCN||MRI||Liver-Liver Tumor||SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks [pdf]||ISBI||2017|
|3D-CNN||Diffusion MRI||Brain||q-Space Deep Learning: Twelve-Fold Shorter and Model-Free Diffusion MRI [pdf] (Section II.B.2)||IEEE-TMI||2016|
|GAN||MG||Breast Mass||Adversarial Deep Structured Nets for Mass Segmentation from Mammograms [pdf]||INbreast, DDSM-BCRP||ISBI||2018|
|3D-CNN||CT||Liver||3D Deeply Supervised Network for Automatic Liver Segmentation from CT Volumes [pdf]||MICCAI||2017|
|3D-CNN||MRI||Brain||Unsupervised domain adaptation in brain lesion segmentation with adversarial networks [pdf]||IPMI||2017|
|3D-CNN||CT||Spine||An Artificial Agent for Robust Image Registration [pdf]||2016|
|2.5D-CNN||MRI||Femur||Automated anatomical landmark detection on distal femur surface using convolutional neural network [pdf]||OAI||ISBI||2015|
|3D-CNN||Diffusion MRI||Brain||q-Space Deep Learning: Twelve-Fold Shorter and Model-Free Diffusion MRI [pdf] (Section II.B.1)||[HCP]and other||IEEE-TMI||2016|
|CNN||CS-MRI||A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction [pdf]||IEEE-TMI||2017|
|GAN||CS-MRI||Deep Generative Adversarial Networks for Compressed Sensing Automates MRI [pdf]||NIPS||2017|
Deep Learning for Medical Image Processing: Overview, Challenges and Future
Muhammad Imran Razzak, Saeeda Naz and Ahmad Zaib
Medical Image Analysis with Deep Learning
In this article, I start with basics of image processing, basics of medical image format data and visualize some medical data.
By Taposh Roy, Kaiser Permanente.
Analyzing images and videos and using them in applications such as self-driving cars and drones, with deep learning techniques underneath, has become the new research frontier. Recent research papers such as “A Neural Algorithm of Artistic Style” show how a style can be transferred from an artist and applied to an image to create a new image. Other papers such as “Generative Adversarial Networks” (GAN) and “Wasserstein GAN” have paved the way for models that learn to create data similar to the data we give them, opening up the world to semi-supervised learning and paving the path to a future of unsupervised learning.
While this research still focuses on generic images, our goal is to apply it to medical images to help healthcare. We need to start with some basics. In this article, I start with the basics of image processing and of medical image data formats, and visualize some medical data. In the next article I will dive deep into some convolutional neural nets and use them with Keras for predicting lung cancer.
Basic Image Processing (using python)
There are a variety of image processing libraries; however, OpenCV (Open Source Computer Vision) has become mainstream due to its large community support and its availability in C++, Java and Python. I prefer using OpenCV from a Jupyter notebook.
Install OpenCV using: pip install opencv-python, or install it directly from source from opencv.org.
Now, let's check whether you can open an image and view it in your notebook using the code below.
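The code cell for this step is not present in this copy of the article. What follows is a minimal sketch of what such a cell could look like, assuming opencv-python and NumPy are installed. The helper names `load_image` and `bgr_to_rgb` and the file name are hypothetical, not taken from the original post.

```python
import numpy as np


def bgr_to_rgb(image):
    """Reverse the channel order of an H x W x 3 array (OpenCV loads BGR)."""
    return image[:, :, ::-1]


def load_image(path):
    """Read an image file with OpenCV and return it as an RGB array."""
    import cv2  # imported lazily so bgr_to_rgb works even without OpenCV

    image = cv2.imread(path)  # returns None when the file cannot be read
    if image is None:
        raise FileNotFoundError(f"could not read image: {path}")
    return bgr_to_rgb(image)


# In a notebook, display the result with matplotlib (hypothetical file name):
# import matplotlib.pyplot as plt
# plt.imshow(load_image("example.jpg")); plt.axis("off"); plt.show()
```

The channel swap matters because matplotlib expects RGB; without it, reds and blues appear exchanged.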
Example image load through OpenCV.
Let's do something fun, such as detecting a face. To detect faces we will use an open-source XML stump-based 20×20 gentle AdaBoost frontal face detector originally created by Rainer Lienhart. A good post with details on Haar-cascade detection is here.
Face detection using OpenCV.
Medical Image Data Format
Medical images follow Digital Imaging and Communications in Medicine (DICOM) as a standard solution for storing and exchanging medical image data. The first version of this standard was released in 1985, and it has been revised several times since. The standard defines both a file format and a communications protocol.
- File Format: All patient medical images are saved in the DICOM file format. This format carries PHI (protected health information) about the patient, such as name, sex and age, in addition to other image-related data such as the equipment used to capture the image and some context of the medical treatment. Medical imaging equipment creates DICOM files. Doctors use DICOM viewers, software applications that can display DICOM images, to read the images and diagnose the findings in them.
- Communications Protocol: The DICOM communication protocol is used to search for imaging studies in the archive and to restore imaging studies to the workstation in order to display them. All medical imaging applications that are connected to the hospital network use the DICOM protocol to exchange information, mainly DICOM images but also patient and procedure information. There are also more advanced network commands that are used to control and follow the treatment, schedule procedures, report statuses and share the workload between doctors and imaging devices.
A very good blog that goes into details of the DICOM standard is here
Analyze DICOM Images
A very good Python package for analyzing DICOM images is pydicom. In this section, we will see how to render a DICOM image in a Jupyter notebook.
Install pydicom using: pip install pydicom
After you install the pydicom package, go back to the Jupyter notebook. In the notebook, import the dicom package and other packages as shown below.
We also use other packages such as pandas, scipy, skimage and mpl_toolkits for data processing and analysis.
There’s a wealth of freely available DICOM datasets online, but here are a few that should help you get started:
- Kaggle Competitions and Datasets: This is my personal favorite. Check out the data for the lung cancer competition and diabetic retinopathy.
- Dicom Library : DICOM Library is a free online medical DICOM image or video file sharing service for educational and scientific purposes.
- Osirix Datasets: Provides a large range of human datasets acquired through a variety of imaging modalities.
- Visible Human Datasets: Parts of the Visible Human project are somehow freely distributed here, which is weird because getting that data is neither free nor hassle-free.
- The Zubal Phantom: This website offers multiple datasets of two human males in CT and MRI which are freely distributed.
Download the dicom files and load them on your jupyter notebook.
Now, load the DICOM images into a list.
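One way to build that list, sketched with the standard library only. The function name `list_dicom_files` and the directory path are hypothetical, and the `.dcm` extension check is an assumption about how the downloaded files are named.

```python
import os


def list_dicom_files(root_dir):
    """Walk root_dir and return a sorted list of paths ending in .dcm."""
    dicom_paths = []
    for dir_name, _subdirs, file_names in os.walk(root_dir):
        for file_name in file_names:
            if file_name.lower().endswith(".dcm"):
                dicom_paths.append(os.path.join(dir_name, file_name))
    return sorted(dicom_paths)


# Each path can then be read with pydicom, for example:
# import pydicom
# slices = [pydicom.dcmread(p) for p in list_dicom_files("./dicom_data")]
```

Sorting by file name is only a convenience here; if the slice order matters, it is safer to sort on position metadata stored in the files themselves.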
Step 1 : Basic Viewing of DICOM Image in Jupyter
In the first line we load the first DICOM file, which we’re going to use as a reference named RefDs to extract metadata, and whose filename is first in the list of DICOM files. We then calculate the total dimensions of the 3D NumPy array, which are equal to (number of pixel rows in a slice) x (number of pixel columns in a slice) x (number of slices) along the x, y, and z Cartesian axes. Lastly, we use the pixel-spacing and SliceThickness attributes to calculate the spacing between pixels in the three axes. We store the array dimensions in ConstPixelDims and the spacing in a separate variable.
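A sketch of the computation just described. It relies on the standard DICOM attributes `Rows`, `Columns`, `PixelSpacing` and `SliceThickness`. The names `RefDs` and `ConstPixelDims` follow the text, while the function names and `pixel_spacing` are placeholders of mine. The reference object only needs those four attributes, so a dataset returned by `pydicom.dcmread` should work in its place.

```python
import numpy as np


def volume_geometry(ref_ds, num_slices):
    """Return (dims, spacing) for a stack of equally sized DICOM slices."""
    # Array dimensions: rows x columns x number of slices.
    const_pixel_dims = (int(ref_ds.Rows), int(ref_ds.Columns), num_slices)
    # PixelSpacing holds the in-plane spacing in mm (between rows, between
    # columns); SliceThickness is used for the third axis, as in the text.
    pixel_spacing = (
        float(ref_ds.PixelSpacing[0]),
        float(ref_ds.PixelSpacing[1]),
        float(ref_ds.SliceThickness),
    )
    return const_pixel_dims, pixel_spacing


def axis_coordinates(dims, spacing):
    """Physical coordinate (mm) of every pixel index along each axis."""
    return [np.arange(n) * s for n, s in zip(dims, spacing)]
```

The coordinate arrays are handy as axes when plotting a slice, so that the image is drawn in millimetres rather than pixel indices.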
Step 2: Looking into details of DICOM format
The unit of measurement in CT scans is the Hounsfield Unit (HU), which is a measure of radiodensity. CT scanners are carefully calibrated to accurately measure this. A detailed understanding on this can be found here.
Each pixel is assigned a numerical value (CT number), which is the average of all the attenuation values contained within the corresponding voxel. This number is compared to the attenuation value of water and displayed on a scale of arbitrary units named Hounsfield units (HU) after Sir Godfrey Hounsfield.
This scale assigns water an attenuation value of zero HU. The range of CT numbers is 2000 HU wide, although some modern scanners have a greater range of up to 4000 HU. Each number represents a shade of grey, with +1000 (white) and –1000 (black) at either end of the spectrum.
Hounsfield Scale [credits: “Introduction to CT physics” (PDF). elsevierhealth.com.]
CT Scanner Image [credits : “Introduction to CT physics” (PDF). elsevierhealth.com.]
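To make the scale concrete, here is a small sketch of how stored CT pixel values are usually mapped to HU and then to grey levels. The linear mapping uses the standard DICOM attributes `RescaleSlope` and `RescaleIntercept`; the function names and the example values are illustrative assumptions, not part of the original article.

```python
import numpy as np


def to_hounsfield(raw_pixels, rescale_slope, rescale_intercept):
    """Map stored pixel values to Hounsfield units: HU = raw * slope + intercept."""
    return raw_pixels.astype(np.float64) * rescale_slope + rescale_intercept


def hu_to_grey(hu, low=-1000.0, high=1000.0):
    """Clip HU to [low, high] and scale linearly to 0 (black) .. 255 (white)."""
    clipped = np.clip(hu, low, high)
    return np.round((clipped - low) / (high - low) * 255.0).astype(np.uint8)


# Hypothetical stored values with slope 1 and intercept -1024.
raw = np.array([0, 1024, 2024])
hu = to_hounsfield(raw, 1.0, -1024.0)   # roughly air, water, dense bone
grey = hu_to_grey(hu)
```

Narrowing `low` and `high` to the range of a tissue of interest spends all 256 grey levels on that range, which is the idea behind window settings in DICOM viewers.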
In the next part, we will use Kaggle’s lung cancer dataset and convolutional neural nets with Keras, building on the information provided in this article.
- Thanks to the Kaggle community for all the different scripts and support
Bio: Taposh Roy leads the innovation team in Kaiser Permanente’s Decision Support group. He works with research, technology and business leaders to derive insights from data.
Original. Reposted with permission.
The earliest deep-learning-like algorithms that had multiple layers of non-linear features can be traced back to Ivakhnenko and Lapa in 1965 (Figure 1), who used thin but deep models with polynomial activation functions which they analyzed with statistical methods. In each layer, they selected the best features through statistical methods and forwarded them to the next layer. They did not use backpropagation to train their network end-to-end but used layer-by-layer least squares fitting where previous layers were independently fitted from later layers.
The earliest convolutional networks were used by Fukushima in 1979. Fukushima’s networks had multiple convolutional and pooling layers similar to modern networks, but the network was trained by using a reinforcement scheme where a trail of strong activation in multiple layers was increased over time. Additionally, one would assign important features of each image by hand by increasing the weight on certain connections.
Backpropagation of errors to train deep models was lacking at this point. Backpropagation had already been derived in the early 1960s, but in an inefficient and incomplete form. The modern form was first derived by Linnainmaa in his 1970 master's thesis, which included FORTRAN code for backpropagation but did not mention its application to neural networks. Even at this point, backpropagation was relatively unknown, and very few documented applications of backpropagation existed in the early 1980s (e.g. Werbos in 1982). Rumelhart, Hinton, and Williams showed in 1985 that backpropagation in neural networks could yield interesting distributed representations. At the time, this was an important result in cognitive psychology, where the question was whether human cognition can be thought of as relying on distributed representations (connectionism) or symbolic logic (computationalism).
The first true, practical application of backpropagation came about through the work of LeCun in 1989 at Bell Labs. He used convolutional networks in combination with backpropagation to classify handwritten digits (MNIST) and this system was later used to read large numbers of handwritten checks in the United States. The video above shows Yann LeCun demonstrating digit classification using the “LeNet” network in 1993.
Despite these successes, funding for research into neural networks was scarce. The term artificial intelligence dropped to near pseudoscience status during the AI winter and the field still needed some time to recover. Some important advances were made in this time, for example, the long short-term memory (LSTM) for recurrent neural networks by Hochreiter and Schmidhuber in 1997, but these advances went mostly unnoticed until later as they were overshadowed by the support vector machine developed by Cortes and Vapnik in 1995.
The next big shift occurred just by waiting for computers to get faster, and then later by the introduction of graphics processing units (GPUs). Waiting for faster computers and GPUs alone increased the computational speed by a factor of 1000 over a span of 10 years. In this period, neural networks slowly began to rival support vector machines. Neural networks can be slow when compared to support vector machines, but they reach much better results with the same amount of data. Unlike simpler algorithms, neural networks continue to improve with more training data.
The main hurdle at this point was to train big, deep networks, which suffered from the vanishing gradient problem, where features in early layers could not be learned because no learning signal reached these layers.
The first solution to this problem was layer-by-layer pretraining, where the model is built in a layer-by-layer fashion by using unsupervised learning so that the features in early layers are already initialized or “pretrained” with some suitable features (weights). Pretrained features in early layers only need to be adjusted slightly during supervised learning to achieve good results. The first pretraining approaches were developed for recurrent neural networks by Schmidhuber in 1992, and for feed-forward networks by Hinton and Salakhutdinov in 2006. Another solution for the vanishing gradient problem in recurrent neural networks was long short-term memory in 1997.
As the speed of GPUs increased rapidly, it was soon possible to train deep networks such as convolutional networks without the help of pretraining as demonstrated by Ciresan and colleagues in 2011 and 2012 who won character recognition, traffic sign, and medical imaging competitions with their convolutional network architecture. Krizhevsky, Sutskever, and Hinton used a similar architecture in 2012 that also features rectified linear activation functions and dropout for regularization. They received outstanding results in the ILSVRC-2012 ImageNet competition, which marked the abandonment of feature engineering and the adoption of feature learning in the form of deep learning. Google, Facebook, and Microsoft noticed this trend and made major acquisitions of deep learning startups and research teams between 2012 and 2014. From here, research in deep learning accelerated rapidly.
Additional material: Deep Learning in Neural Networks: An Overview
A perceptron contains only a single linear or nonlinear unit. Geometrically, a perceptron with a nonlinear unit trained with the delta rule can find the nonlinear plane separating data points of two different classes (if the separation plane exists). If no such separation plane exists, the perceptron will often still produce separation planes that provide good classification accuracy. The good performance of the perceptron led to a hype of artificial intelligence. In 1969 however, it was shown that a perceptron may fail to separate seemingly simple patterns such as the points provided by the XOR function. The fall from grace of the perceptron was one of the main reasons for the occurrence of the first AI winter. While neural networks with hidden layers do not suffer from the typical problems of the perceptron, neural networks were still associated with the perceptron and therefore also suffered an image problem during the AI winter.
Despite this, and despite the success of deep learning, perceptrons still find widespread use in the realm of big data, where the simplicity of the perceptron allows for successful application to very large data sets.
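The XOR limitation mentioned above can be reproduced in a few lines. This is a minimal sketch, assuming a single threshold unit trained with the error-driven update w += lr * (target - output) * x; the function names, learning rate and epoch count are my own choices for illustration.

```python
import numpy as np


def train_perceptron(inputs, targets, epochs=50, lr=0.1):
    """Train a single threshold unit; returns (weights, bias)."""
    weights = np.zeros(inputs.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            output = 1.0 if x @ weights + bias > 0 else 0.0
            # Error-driven update: nothing changes when the prediction is right.
            weights += lr * (t - output) * x
            bias += lr * (t - output)
    return weights, bias


def accuracy(inputs, targets, weights, bias):
    """Fraction of points the trained unit classifies correctly."""
    predictions = (inputs @ weights + bias > 0).astype(float)
    return float(np.mean(predictions == targets))


points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
and_targets = np.array([0.0, 0.0, 0.0, 1.0])   # linearly separable
xor_targets = np.array([0.0, 1.0, 1.0, 0.0])   # not linearly separable

and_accuracy = accuracy(points, and_targets, *train_perceptron(points, and_targets))
xor_accuracy = accuracy(points, xor_targets, *train_perceptron(points, xor_targets))
```

The unit learns AND perfectly, but no choice of two weights and a bias can get all four XOR points right, so `xor_accuracy` stays below 1.0 however long training runs.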
Rapid advances in machine learning and other approaches of inference led to a hype of artificial intelligence (similar to the buzz around deep learning today). Researchers made promises that these advances would continue and would lead to strong AI and in turn, AI research received lots of funding.
In the 1970s it became clear that those promises could not be kept; funding was cut dramatically and the field of artificial intelligence dropped to near pseudo-science status. Research became very difficult (little funding; publications almost never made it through peer review), but nevertheless a few researchers continued further down this path, and their research soon led to the reinvigoration of the field and the creation of the field of deep learning.
This is why excessive deep learning hype is dangerous and researchers typically avoid making predictions about the future: AI researchers want to avoid another AI winter.
AlexNet is a convolutional network architecture named after Alex Krizhevsky, who along with Ilya Sutskever under the supervision of Geoffrey Hinton applied this architecture to the ILSVRC-2012 competition that featured the ImageNet dataset. They improved on the convolutional network architecture developed by Ciresan and colleagues, which had won multiple international competitions in 2011 and 2012, by using rectified linear units for enhanced speed and dropout for improved generalization. Their results stood in stark contrast to feature engineering methods, which immediately created a great rift between deep learning and feature engineering methods for computer vision. From here it was apparent that deep learning would take over computer vision and that other methods would not be able to catch up. AlexNet heralded the mainstream usage and the hype of deep learning.
Training Deep Learning Architectures
The process of training a deep learning architecture is similar to how toddlers start to make sense of the world around them. When a toddler encounters a new animal, say a monkey, he or she will not know what it is. But then an adult points with a finger at the monkey and says: “That is a monkey!” The toddler will then be able to associate the image he or she sees with the label “monkey”.
A single image, however, might not be sufficient to label an animal correctly when it is encountered the next time. For example, the toddler might mistake a sloth for a monkey or a monkey for a sloth, or might simply forget the name of a certain animal. For reliable recall and labeling, a toddler needs to see many different monkeys and similar animals and needs to know each time whether or not it is really a monkey—feedback is essential for learning. After some time, if the toddler encounters enough animals paired with their names, the toddler will have learned to distinguish between different animals.
The deep learning process is similar. We present the neural network with images or other data, such as the image of a monkey. The deep neural network predicts a certain outcome, for example, the label of the object in an image (“monkey”). We then supply the network with feedback. For example, if the network predicted that the image showed a monkey with 30% probability and a sloth with 70% probability, then all the outputs in favor of the sloth class made an error! We use this error to adjust the parameters of the neural network using the backpropagation of errors algorithm.
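The feedback step in the monkey/sloth example can be written out numerically. This is a sketch under the assumption of a softmax output layer with a cross-entropy error, which the text does not name explicitly; the variable names are illustrative.

```python
import numpy as np

classes = ["monkey", "sloth"]
predicted = np.array([0.30, 0.70])   # network output (probabilities)
target = np.array([1.0, 0.0])        # the image really shows a monkey

# Cross-entropy loss: large when the true class gets a low probability.
loss = -np.sum(target * np.log(predicted))

# For a softmax output with cross-entropy, the error sent back into the
# network is simply prediction minus target.
error_signal = predicted - target
```

The error signal is negative for the monkey output (it should have been higher) and positive for the sloth output (it should have been lower); backpropagation then distributes this signal over the parameters.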
Usually, we randomly initialize the parameters of a deep network so the network initially outputs random predictions. This means for ImageNet, which consists of 1000 classes, we will achieve an average classification accuracy of just 0.1% for any image after initializing the neural network. To improve the performance we need to adjust the parameters so that the classification performance increases over time. But this is inherently difficult: If we adjust one parameter to improve performance on one class, this change might decrease the classification performance for another class. Only if we find parameter changes that work for all classes can we achieve good classification performance.
If you imagine a neural network with only 2 parameters (e.g. -0.37 and 1.14), then you can imagine a mountain landscape, where the height of the landscape represents the classification error and the two directions—north-south (x-axis) and east-west (y-axis)—represent the directions in which we can change the two parameters (negative-positive direction). The task is to find the lowest altitude point in the mountain landscape: we want to find the minimum.
The problem with this is that the entire mountain landscape is unknown to us at the beginning. It is as if the whole mountain range is covered in fog. We only know our current position (the initial random parameters) and our height (the current classification error). How can we find the minimum quickly when we have so little information about the landscape?
Imagine you stand on top of a mountain with skis strapped to your feet. You want to get down to the valley as quickly as possible, but there is fog and you can only see your immediate surroundings. How can you get down the mountain as quickly as possible? You look around and identify the steepest path down, go down that path for a bit, again look around and find the new steepest path, go down that path, and repeat—this is exactly what gradient descent does.
While gradient descent is equivalent to stopping every 10 meters and measuring the steepness of your surroundings with a measuring tape (you measure your gradient according to the whole data set), stochastic gradient descent is the equivalent of quickly estimating the steepness with a short glance (just a few hundred data points are used to estimate the steepness).
In terms of stochastic gradient descent, we go down the steepest path (the negative gradient or first derivative) on the landscape of the error function to find a local minimum, that is, the point that yields a low error for our task. We do this in tiny steps so that we do not get trapped in half-pipe-like obstacles (if we are too fast, we never get out of these half-pipes and we may even be “catapulted” up the mountain).
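The difference between measuring the slope on the whole data set and glancing at a small sample can be sketched on a toy problem. The example below assumes a linear model with a mean squared error and noise-free synthetic data; all names, the learning rate and the batch sizes are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weights = np.array([2.0, -3.0])
inputs = rng.normal(size=(1000, 2))
targets = inputs @ true_weights          # noise-free, so the minimum is exact


def gradient(weights, x, y):
    """Gradient of the error 0.5 * mean((x @ weights - y) ** 2)."""
    return x.T @ (x @ weights - y) / len(y)


def descend(batch_size, steps=500, learning_rate=0.1):
    """Take small steps along the negative gradient estimated on a batch."""
    weights = np.zeros(2)                # starting position on the landscape
    for _ in range(steps):
        batch = rng.choice(len(targets), size=batch_size, replace=False)
        weights -= learning_rate * gradient(weights, inputs[batch], targets[batch])
    return weights


full_batch = descend(batch_size=len(targets))   # gradient descent: whole data set
mini_batch = descend(batch_size=100)            # SGD: a quick glance at 100 points
```

Both runs end up at the same weights, but each mini-batch step touches a tenth of the data. The `learning_rate` plays the role of the step length: too large a value overshoots the valley instead of settling into it.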
While our ski-landscape is 3D, typical error landscapes may have millions of dimensions. In such a space we have many valleys so it is easy to find a good solution, but we also have many saddle points, which makes matters very difficult.
Saddle points are points at which the surroundings are almost entirely flat, yet which may have dramatic descents at one end or the other (saddle points are like plateaus that slightly bend and may lead to a cliff). Most of the difficulty in finding good solutions on an error landscape with many dimensions stems from navigating saddle points (because these plateaus have almost no steepness, progress is very slow near saddle points) rather than from finding the minimum itself (there are many minima, which are almost all of the same quality).
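A toy illustration of this behavior, assuming the textbook saddle f(x, y) = x² - y², which has its saddle point at the origin. The function is unbounded below and is only meant to show the plateau; the starting point and step size are arbitrary choices.

```python
import numpy as np


def saddle(point):
    """Height of the toy landscape f(x, y) = x**2 - y**2."""
    x, y = point
    return x**2 - y**2


def saddle_gradient(point):
    """Gradient of the toy landscape: (2x, -2y)."""
    x, y = point
    return np.array([2.0 * x, -2.0 * y])


point = np.array([1.0, 1e-6])   # almost exactly on the ridge leading to the saddle
heights = []
for step in range(100):
    heights.append(saddle(point))
    point = point - 0.1 * saddle_gradient(point)

# Early on the height falls towards 0 and then barely moves (the plateau);
# only much later has the tiny y-component grown enough to find the descent.
```

Plotting `heights` against the step number shows a fast initial drop, a long flat stretch near zero, and then a sudden fall once the run escapes the saddle.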
Backpropagation of errors, or often simply backpropagation, is a method for finding the gradient of the error with respect to weights over a neural network. The gradient signifies how the error of the network changes with changes to the network’s weights. The gradient is used to perform gradient descent and thus find a set of weights that minimize the error of the network.
There are three good ways to teach backpropagation: (1) Using a visual representation, (2) using a mathematical representation, (3) using a rule-based representation. The bonus material at the end of this section uses a mathematical representation. Here I’ll use a rule-based representation as it requires little math and is easy to understand.
Imagine a neural network with 100 layers. We can imagine a forward pass in which a matrix (dimensions: number of examples x number of input nodes) is input to the network and propagated through it, where we always have the order (1) input nodes, (2) weight matrix (dimensions: input nodes x output nodes), and (3) output nodes, which usually also have a non-linear activation function (dimensions: examples x output nodes). How can we imagine these matrices?
The input matrix represents the following: For every input node we have one input value, for example, pixels (three input values = three pixels in Figure 1), and we take this times our number of examples, such as the number of images. So for 128 3-pixel images, we have a 128×3 input matrix.
The weight matrix represents the connections between input and output nodes. The value passed to an input node (a pixel) is weighted by the weight matrix values and it “flows” to each output node through these connections. This flow is a result of multiplying the input value by the value of each weight between the input node and output nodes. The output matrix is the accumulated “flow” of all input nodes at all output nodes.
So for each input, we multiply by all weights, and add up all those contributions at the output nodes, or more easily we take the matrix product of the input matrix times the weight matrix. In our example, this would be our 128×3 input matrix multiplied by the 3×5 weight matrix (see Figure 1). We thus receive our output matrix as a result which in this example is of size 128×5. We then use this output matrix, apply the non-linear activation function and treat our resulting output matrix as the input matrix to the next layer. We repeat these steps until we reach the error function. We then apply the error function to see how far the predictions are different from the correct values. We can formulate this whole process of the forward pass, and equivalently the backward pass, by defining simple rules (see Figure 1).
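The shapes in this example can be checked directly. A minimal sketch with random values, assuming a rectified linear activation for the non-linearity (the text leaves the choice of activation open):

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(128, 3))     # 128 examples, 3 input nodes (pixels)
weights = rng.normal(size=(3, 5))      # 3 input nodes x 5 output nodes

pre_activation = inputs @ weights      # matrix product, shape (128, 5)
outputs = np.maximum(0.0, pre_activation)   # non-linear activation (here: ReLU)

# One entry written out by hand: the "flow" from all three input nodes of
# example 0 into output node 0.
by_hand = sum(inputs[0, k] * weights[k, 0] for k in range(3))
```

`outputs` would then serve as the 128×5 input matrix of the next layer.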
For the forward pass with given input data we go from the first to the last layer according to these rules:
1. When we encounter a weight matrix, we matrix multiply by this weight and propagate the result.
2. If we encounter a function, we put our current result into the function and propagate the function output as our result.
3. We treat outputs of the previous layer as inputs into the next layer.
4. When we encounter the error function, we apply it and thus generate the error for our backward pass.
The backward pass for a given error is similar but proceeds from the last to the first layer where the error generated in rule 4 in the forward pass represents the “inputs” to the last layer. We then go backward through the network and follow these rules:
1. When we encounter a weight matrix, we matrix multiply by the transpose of the matrix and propagate the result.
2. If we encounter a function, we multiply (element-wise) by the derivative of that function with respect to the inputs that this function received from the forward pass (see Figure 1).
3. We treat errors of the previous layer as inputs (errors) into the next layer.
To calculate the gradients, we take each intermediate result obtained after executing rule 2 in the backward pass and matrix multiply it with the result of rule 2 from the forward pass of the previous layer, that is, with the activations that were fed into the weight matrix (see Figure 1).
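The backward rules and the gradient calculation can be sketched for a single layer as follows. The squared-error function, the sigmoid activation, the tiny hand-written matrices and all function names are assumptions chosen to keep the example small; they are not taken from Figure 1.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w):
    # Forward rules 1 and 2: multiply by the weights, then apply the function.
    return [[sigmoid(v) for v in row] for row in matmul(x, w)]

def loss(x, w, y):
    # Assumed error function: half the sum of squared differences.
    a = forward(x, w)
    return 0.5 * sum((ai - yi) ** 2 for ra, ry in zip(a, y) for ai, yi in zip(ra, ry))

def backward(x, w, y):
    a = forward(x, w)
    # The error that enters the backward pass (derivative of the loss above).
    error = [[ai - yi for ai, yi in zip(ra, ry)] for ra, ry in zip(a, y)]
    # Backward rule 2: element-wise multiply by the derivative of the
    # activation function; for the sigmoid this is a * (1 - a).
    delta = [[e * ai * (1.0 - ai) for e, ai in zip(re, ra)] for re, ra in zip(error, a)]
    # Gradient: previous layer's activations (here the inputs), transposed,
    # matrix multiplied by the intermediate result of backward rule 2.
    grad_w = matmul(transpose(x), delta)
    # Backward rule 1: multiply by the transposed weight matrix to obtain
    # the error that would be passed on to the layer before this one.
    error_prev = matmul(delta, transpose(w))
    return grad_w, error_prev

x = [[0.2, 0.7, 0.1], [0.9, 0.4, 0.5]]      # 2 examples x 3 input nodes
w = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.6]]  # 3 input nodes x 2 output nodes
y = [[1.0, 0.0], [0.0, 1.0]]                # 2 examples x 2 target values

grad_w, error_prev = backward(x, w, y)
print(len(grad_w), len(grad_w[0]))          # 3 2, same shape as w
print(len(error_prev), len(error_prev[0]))  # 2 3, same shape as x
```

Note that the gradient has the same shape as the weight matrix, which is what allows it to be subtracted from the weights during an update.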
The rectified linear function is a simple non-linearity: It evaluates to 0 for negative inputs, and positive values remain untouched (f(x) = max(0,x)). The gradient of the rectified linear function is 1 for all positive inputs and 0 for negative inputs. This means that during backpropagation, no error passes through a rectified linear unit whose input was negative, so that unit does not contribute to the weight update for that example.
However, because we have a gradient of 1 for any positive value we have much better training speed when compared to other non-linear functions due to the good gradient flow. For example, the logistic sigmoid function has very tiny gradients for large positive and negative values so that learning nearly stops in these regions (this behavior is similar to a saddle point).
Although errors do not propagate through rectified linear units whose inputs were negative (the gradient is zero there), the constant gradient of 1 for positive inputs is very powerful and ensures fast training no matter how large the input is. Once these benefits were discovered, rectified linear functions and similar activation functions with large gradients became the activation functions of choice for deep networks.
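To make the difference in gradient flow concrete, here is a small sketch that compares the two gradients at a few input values. The function names and the sample inputs are chosen for illustration only.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 for negative inputs.
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -1.0, 0.5, 10.0):
    print(x, relu(x), relu_grad(x), sigmoid_grad(x))
```

At an input of 10 the sigmoid gradient is smaller than 0.0001, while the rectified linear gradient is still exactly 1.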
Momentum uses the idea that the gradient zigzags every now and then but generally follows a rather straight line towards a local minimum. As such, if we move faster in this general direction and disregard the zigzag directions we will arrive faster at the local minimum, in general.
To realize this behavior we keep track of a running momentum matrix, which is the weighted running sum of the gradient, and we add that momentum matrix value to the gradient. The size of this momentum matrix is kept in check by attenuating it on every update (multiply by a momentum value between 0.7-0.99). Over time, the zigzag dimensions will be smoothed out in our running momentum matrix: A zig in one direction and a zag in the exact opposite direction cancel out and yield a straight line towards the general direction of the local minimum. In the beginning, the general direction towards the local minimum is not strongly established (a sequence of zags with no zigs, or vice versa), and the momentum matrix needs to be attenuated more strongly or the values for the momentum increasingly emphasize zigzagging directions, which in turn can lead to unstable learning. Thus, the momentum value should be kept small (0.5-0.7) in the beginning when no general direction towards a local minimum has been established. Later the momentum value can be increased rapidly (0.9-0.999).
Usually, the gradient update is applied first, and then the jump into the momentum direction follows. However, Nesterov showed that it is better to first jump into the momentum direction and then correct this direction with a gradient update; this procedure is known as “Nesterov’s accelerated gradient” (sometimes “Nesterov momentum”) and yields faster convergence to a local minimum.
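The two update orders can be sketched on a toy error surface. Everything specific here is an assumption made for illustration: the error function 0.5 * (w0² + 25 * w1²), which is much steeper in one direction and therefore produces zigzagging, the learning rate of 0.02, the fixed momentum value of 0.9 and the function names.

```python
def gradient(w):
    """Gradient of the toy error 0.5 * (w0**2 + 25 * w1**2)."""
    return [w[0], 25.0 * w[1]]

def error(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def train(nesterov, steps=500, learning_rate=0.02, momentum=0.9):
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]  # the running momentum "matrix" (here a vector)
    for _ in range(steps):
        if nesterov:
            # Nesterov: first jump in the momentum direction, then
            # evaluate the gradient at that look-ahead position.
            lookahead = [wi + momentum * vi for wi, vi in zip(w, velocity)]
            g = gradient(lookahead)
        else:
            # Classical momentum: evaluate the gradient where we stand.
            g = gradient(w)
        # Attenuate the old momentum and add the new gradient step.
        velocity = [momentum * vi - learning_rate * gi for vi, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w

for name, flag in (("classical momentum", False), ("Nesterov momentum", True)):
    print(name, error(train(flag)))
```

Both variants drive the error on this toy surface close to zero; how much Nesterov's variant helps in practice depends on the problem and the hyperparameters.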
Additional material: Coursera: Neural Networks for Machine Learning: 3. The Momentum Method
RMSprop keeps track of the weighted running mean of the squared gradient and then divides each calculated gradient by the square root of this weighted running mean (it essentially normalizes the gradient by dividing by the magnitude of recent gradients). The consequence is that when a plateau in the error surface is encountered and the gradient is very small, the updates take greater steps, ensuring faster learning (a small update: 0.00001, the square root of the weighted average: 0.00005, update size: 0.2). On the other hand, RMSprop protects against exploding gradients (a large update: 100, the square root of the weighted average: 25, update size: 4) and is thus used frequently in recurrent neural networks and LSTMs to protect both against vanishing and exploding gradients.
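The arithmetic in the two examples above, and the running mean itself, can be sketched as follows. The decay of 0.9 and the small `epsilon` that guards against division by zero are common choices assumed here, not values taken from the text, and the learning rate is left out to keep the focus on the normalization.

```python
import math

def rmsprop_step(gradient, mean_square, decay=0.9, epsilon=1e-8):
    """Update the running mean of the squared gradient and return the
    normalized gradient together with the new running mean."""
    mean_square = decay * mean_square + (1.0 - decay) * gradient ** 2
    scaled = gradient / (math.sqrt(mean_square) + epsilon)
    return scaled, mean_square

# The two worked examples from the text: gradient divided by the square
# root of the weighted running mean.
print(0.00001 / 0.00005)  # about 0.2
print(100 / 25)           # 4.0

# With a constant gradient, tiny or huge, the normalized step approaches 1.
for g in (0.00001, 100.0):
    mean_square = 0.0
    for _ in range(200):
        scaled, mean_square = rmsprop_step(g, mean_square)
    print(g, scaled)
```

The loop shows why the method helps in both regimes: the step size ends up depending on the recent history of the gradient rather than on its raw magnitude.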
Imagine you (a unit in a convolutional network) are preparing for an exam (a classification task) and you know that during the exam you are permitted to copy answers from your peers (other units). Will you study for the exam? The answer to this question is probably yes or no depending on whether at least some students in your class have studied for the exam.
Let’s say you know that there are two students (units) in your class (convolutional net) who have the reputation of studying for every exam they take (every image that is presented). So you do not study for the exam and just copy from these students (you weigh the input from a single “elite” unit in the previous layer highly).
Now we introduce an infectious flu (dropout) that affects 50% of all students. Now there is a high chance that these two students who actually studied for the exam will not be present, so relying on copying their answers is no longer a good strategy. So this time you have to learn by yourself (make choices which take into account all units in a layer and not just the elite units).
In other words, dropout decouples the information processing of units so that they cannot rely on some unit “superstars” which always seem to have the right answer (these superstars detect features which are more important than the features that other units detect).
This in turn democratizes the classification process so that every unit makes computations that are largely independent of strong influencers, and thus reduces bias by ensuring less extreme opinions (there are no mainstream opinions). This decoupling of units in turn leads to strong regularization and better generalization (wisdom of the crowd).
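In code, the “flu” is just a random mask applied to the activations of a layer during training. The sketch below uses the common “inverted dropout” variant, which rescales the surviving units so that the expected activation stays the same; that variant, the function name and the parameter names are assumptions, since the text does not describe an implementation.

```python
import random

def dropout(activations, drop_prob=0.5, rng=random):
    """Zero each activation with probability drop_prob and rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
layer = [1.0] * 10
print(dropout(layer, drop_prob=0.5, rng=rng))
```

At test time no units are dropped, so the function would simply not be applied.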
L1 and L2 regularization penalizes the size of the weights of a network so that large output values that signify strong confidence can no longer be achieved from a single large weight, but instead require several medium-sized weights. Since many units have to agree to achieve a large value, it is less likely that the output will be biased by the opinion of a single unit. Conceptually, it penalizes strong opinions from single units and encourages taking into account the opinion of multiple units, thus reducing bias.
The L1 regularization penalizes the absolute size of the weight, while the L2 penalizes the squared size of the weight. This penalty is added to the error function value thus increasing the error if larger weights are used. As a result, the network is driven to solve the problem with small weights.
Since even small weights produce a sizeable L1 penalty, the L1 penalty has the effect that most weights will be set to zero while a few medium-to-large weights remain. Because fewer non-zero weights exist, the network must be highly confident about its results to achieve good predictive performance.
The L2 penalty encourages very small non-zero weights (large weight = very large error). Here the prediction is made by almost all weights thus reducing the bias (there are no influencers that can turn around outcomes by themselves).
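A small numeric sketch shows how differently the two penalties treat the same total weight. The weight vectors, the penalty strength of 1 and the function names are made up for illustration.

```python
def l1_penalty(weights, strength=1.0):
    # Penalizes the absolute size of each weight.
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength=1.0):
    # Penalizes the squared size of each weight.
    return strength * sum(w * w for w in weights)

single = [4.0, 0.0, 0.0, 0.0]  # one "influencer" weight
spread = [1.0, 1.0, 1.0, 1.0]  # the same total, shared by four weights

print(l1_penalty(single), l1_penalty(spread))  # 4.0 4.0
print(l2_penalty(single), l2_penalty(spread))  # 16.0 4.0

# A tiny weight still costs something under L1, but almost nothing under L2.
print(l1_penalty([0.01]), l2_penalty([0.01]))
```

The L2 penalty is four times lower when the weight is spread out, whereas the L1 penalty does not care how the total is distributed but keeps charging for every small non-zero weight.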
Additional material: Coursera: Neural Networks for Machine Learning: 2. Limiting the Size of the Weights
Conclusion to Part 2
This concludes part 2 of this crash course on deep learning. Please check back soon for the next part of the series. In part 3, I’ll provide some details on learning algorithms, unsupervised learning, sequence learning, and natural language processing, and in part 4 I’ll go into reinforcement learning. In case you missed it, be sure to check out part 1 of the series.
“This is a really, really, really big deal,” said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. “… It’s going to enable whole new classes of products that have never existed before.” Think of Siri on steroids, for starters, or perhaps emulators that could mimic your writing style down to the tone.
When deep learning works, it works great
To understand Howard’s excitement, let’s go back a few days. It was Monday and I was watching him give a presentation in Chicago about how deep learning was dominating the competition in Kaggle, the online platform where organizations present vexing predictive problems and data scientists compete to create the best models. Whenever someone has used a deep learning model to tackle one of the challenges, he told the room, it has performed better than any model ever previously devised to tackle that specific problem.
But there’s a catch: deep learning is really hard. So far, only a handful of teams in hundreds of Kaggle competitions have used it. Most of them have included Geoffrey Hinton or have been associated with him.
Hinton is a University of Toronto professor who pioneered the use of deep learning for image recognition and is now a distinguished engineer at Google, as well. What got Google really interested in Hinton — at least to the point where it hired him — was his work in an image-recognition competition called ImageNet. For years the contest’s winners had been improving only incrementally on previous results, until Hinton and his team used deep learning to improve by an order of magnitude.
Neural networks: A way-simplified overview
Deep learning, Howard explained, is essentially a bigger, badder take on the neural network models that have been around for some time. It’s particularly useful for analyzing image, audio, text, genomic and other multidimensional data that doesn’t lend itself well to traditional machine learning techniques.
Neural networks work by analyzing inputs (e.g., words or images) and recognizing the features that comprise them as well as how all those features relate to each other. With images, for example, a neural network model might recognize various formations of pixels or intensities of pixels as features.
When a neural network is trained against a set of labeled data, its output might be the classification of an input as a dog or cat, for example. In cases where there is no labeled training data (a process called self-taught learning), neural networks can be used to identify the common features of their inputs and group similar inputs even though the models can’t predict what they actually are. Google researchers, for example, constructed neural networks that were able to recognize cats and human faces without having been trained to do so.
Stacking neural networks to do deep learning
In deep learning, multiple neural networks are “stacked” on top of each other, or layered, in order to create models that are even better at prediction because each new layer learns from the ones before it. In Hinton’s approach, each layer randomly omits features — a process called “dropout” — to minimize the chances the model will overfit itself to just the data upon which it was trained. That’s a technical way of saying the model won’t work as well when trying to analyze new data.
So dropout or similar techniques are critical to helping deep learning models understand the real causality between the inputs and the outputs, Howard explained during a call on Thursday. It’s like looking at the same thing under the same lighting all the time versus looking at it in different lighting and from different angles. You’ll see new aspects and won’t see others, he said, “But the underlying structure is going to be the same each time.”
Still, it’s difficult to create accurate models and to program them to run on the number of computing cores necessary to process them in a reasonable timeframe. It can also be difficult to train them on enough data to guarantee accuracy in an unsupervised environment. That’s why so much of the cutting-edge work in the field is still done by experts such as Hinton, Jeff Dean and Andrew Ng, all of whom had or still have strong ties to Google.
There are open source tools such as Theano and PyLearn2 that try to minimize the complexity, Howard told the audience on Monday, but a user-friendly, commercialized software package could be revolutionary. If data scientists in places outside Google could simply (a relative term if ever there was one) input their multidimensional data and train models to learn it, that could make other approaches to predictive modeling all but obsolete. It wouldn’t be inconceivable, Howard noted, that a software package like this could emerge within the next year.
Which brings us back to word2vec. Google calls it “an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words.” Those “architectures” are two new natural-language processing techniques developed by Google researchers Tomas Mikolov, Ilya Sutskever, and Quoc Le. (Google Fellow Jeff Dean was also involved, although modestly, he told me.) They’re like neural networks, only simpler so they can be trained on larger data sets.
Kaggle’s Howard calls word2vec the “crown jewel” of natural language processing. “It’s the English language compressed down to a list of numbers,” he said.
Word2vec is designed to run on a system as small as a single multicore machine (Google tested its underlying techniques over days across more than 100 cores on its data center servers). Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they’re related to other words (e.g., countries and capitals). It’s able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and “linguistic regularities” (e.g., “vector(‘king’) - vector(‘man’) + vector(‘woman’) is close to vector(‘queen’)”).
Right now, the word2vec Google Code page notes, “The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences.”
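The king/queen regularity can be imitated with a handful of hand-made vectors. To be clear, the three-dimensional vectors below are invented for illustration and are not output from word2vec, whose vectors are learned from text and have far more dimensions; the function names are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (invented): the components loosely stand for
# "royalty", "maleness" and "femaleness".
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.0, 0.1, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

print(analogy("king", "man", "woman"))  # queen
```

The three query words are excluded from the candidates, which is a common convention when evaluating such analogies.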
This is accomplished by turning words into numbers that correlate with their characteristics, Howard said. Words that express positive sentiment, adjectives, nouns associated with sporting events — they’ll all have certain numbers in common based on how they’re used in the training data (so bigger data is better).
Smarter models means smarter apps
If this is all too esoteric, think about these methods applied to auto-correct or word suggestions in text-messaging apps. Current methods for doing this might be as simple as suggesting words that are usually paired together, Howard explained, meaning a suggestion could be based solely on the word immediately before it. Using deep-learning-based approaches, a texting app could take into account the entire sentence, for example, because the app would have a better understanding of what all the words really mean in context.
Maybe you could average out all the numbers in a tweet, Howard suggested, and get a vector output that would accurately infer the sentiment, subject and level of formality of the tweet. Really, the possibilities are limited only by the types of applications people can think up to take advantage of word2vec’s deep understanding of natural language.
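Averaging the word vectors of a tweet is a one-liner once the vectors exist. The two-dimensional vectors and the sample tweet below are invented for illustration; real word2vec vectors would come from a trained model.

```python
def average_vector(word_vectors):
    """Component-wise mean of a list of equally long vectors."""
    count = len(word_vectors)
    return [sum(component) / count for component in zip(*word_vectors)]

# Toy vectors (invented), one per word.
toy_vectors = {
    "great":   [0.9, 0.1],
    "game":    [0.2, 0.8],
    "tonight": [0.1, 0.3],
}

tweet = "great game tonight".split()
tweet_vector = average_vector([toy_vectors[word] for word in tweet])
print(tweet_vector)
```

The resulting vector could then be fed to any ordinary classifier, for example one that predicts sentiment.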
The big caveat, however, is researchers and industry data scientists still need to learn how to use word2vec. There hasn’t been a lot of research done on how to best use these types of models, Howard said, and the thousands of researchers working on other methods of natural language processing aren’t going to jump ship to Google’s tools overnight. Still, he believes the community will come around and word2vec and its underlying techniques could make all other approaches to natural language processing obsolete.
And this is just the start. A year from now, Howard predicts, deep learning will have surpassed a whole class of algorithms in other fields (i.e., things other than speech recognition, image recognition and natural language processing), and a year after that it will be integrated into all sorts of software packages. The only questions, and they’re admittedly big ones, are how smart deep learning models can get (and whether they’ll run into another era of hardware constraints like the one that graphical processing units helped resolve earlier this millennium) and how accessible software packages like word2vec can make deep learning even for relatively unsophisticated users.
“Maybe in 10 years’ time,” Howard proposed, “we’ll get to that next level.”
In this post, we take a tour of the most popular machine learning algorithms.
It is useful to tour the main algorithms in the field to get a feeling of what methods are available.
There are so many algorithms available that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit.
I want to give you two ways to think about and categorize the algorithms you may come across in the field.
- The first is a grouping of algorithms by the learning style.
- The second is a grouping of algorithms by similarity in form or function (like grouping similar animals together).
Both approaches are useful, but we will focus in on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types.
After reading this post, you will have a much better understanding of the most popular machine learning algorithms for supervised learning and how they are related.
Algorithms Grouped by Learning Style
There are different ways an algorithm can model a problem based on its interaction with the experience or environment or whatever we want to call the input data.
It is popular in machine learning and artificial intelligence textbooks to first consider the learning styles that an algorithm can adopt.
There are only a few main learning styles or learning models that an algorithm can have and we’ll go through them here with a few examples of algorithms and problem types that they suit.
This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think about the roles of the input data and the model preparation process and select one that is the most appropriate for your problem in order to get the best result.
Let’s take a look at three different learning styles in machine learning algorithms:
Supervised Learning
Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include Logistic Regression and the Back Propagation Neural Network.
Unsupervised Learning
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and k-Means.
Semi-Supervised Learning
Input data is a mixture of labeled and unlabeled examples.
There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.
Example problems are classification and regression.
Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.
When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods.
A hot topic at the moment is semi-supervised learning methods in areas such as image classification where there are large datasets with very few labeled examples.
Algorithms Grouped By Similarity
Algorithms are often grouped by similarity in terms of their function (how they work). For example, tree-based methods, and neural network inspired methods.
I think this is the most useful way to group algorithms and it is the approach we will use here.
This is a useful grouping method, but it is not perfect. There are still algorithms that could just as easily fit into multiple categories, like Learning Vector Quantization, which is both a neural network inspired method and an instance-based method. There are also categories whose name describes both the problem and the class of algorithm, such as Regression and Clustering.
We could handle these cases by listing algorithms twice or by selecting the group that subjectively is the “best” fit. I like this latter approach of not duplicating algorithms to keep things simple.
In this section, I list many of the popular machine learning algorithms grouped the way I think is the most intuitive. The list is not exhaustive in either the groups or the algorithms, but I think it is representative and will be useful to you to get an idea of the lay of the land.
Please Note: There is a strong bias towards algorithms used for classification and regression, the two most prevalent supervised machine learning problems you will encounter.
If you know of an algorithm or a group of algorithms not listed, put it in the comments and share it with us. Let’s dive in.
Regression Algorithms
Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.
Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning. This may be confusing because we can use regression to refer to the class of problem and the class of algorithm. Really, regression is a process.
The most popular regression algorithms are:
- Ordinary Least Squares Regression (OLSR)
- Linear Regression
- Logistic Regression
- Stepwise Regression
- Multivariate Adaptive Regression Splines (MARS)
- Locally Estimated Scatterplot Smoothing (LOESS)
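As a sketch of what “iteratively refined using a measure of error” can look like, here is a straight-line fit by gradient descent on the mean squared error. The data, the learning rate, the number of steps and the function name are assumptions chosen so the example converges quickly; none of the listed algorithms is being reproduced exactly.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by repeatedly reducing the squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Measure of error: prediction minus the known value, per example.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to each parameter.
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        # Refine the model a little in the direction that lowers the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(round(slope, 4), round(intercept, 4))  # 2.0 1.0
```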
Instance-based Algorithms
An instance-based learning model is a decision problem posed in terms of the instances or examples of training data that are deemed important or required by the model.
Such methods typically build up a database of example data and compare new data to the database using a similarity measure in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods and memory-based learning. Focus is put on the representation of the stored instances and similarity measures used between instances.
The most popular instance-based algorithms are:
- k-Nearest Neighbor (kNN)
- Learning Vector Quantization (LVQ)
- Self-Organizing Map (SOM)
- Locally Weighted Learning (LWL)
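A k-Nearest Neighbor classifier is a compact example of the “database plus similarity measure” idea. The sketch below assumes Euclidean distance as the similarity measure and a simple majority vote; the training points, labels and function name are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Label a query point by majority vote among its k nearest stored examples."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# The "database" of stored instances: (point, label) pairs.
training = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((8.0, 8.0), "b"), ((9.0, 8.5), "b"), ((8.5, 9.0), "b"),
]

print(knn_predict(training, (1.2, 1.4)))  # a
print(knn_predict(training, (8.4, 8.6)))  # b
```

There is no training step at all: the stored examples are the model, which is why such methods are also called memory-based.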
Regularization Algorithms
Regularization is an extension made to another method (typically a regression method) that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.
I have listed regularization algorithms separately here because they are popular, powerful and generally simple modifications made to other methods.
The most popular regularization algorithms are:
- Ridge Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Elastic Net
- Least-Angle Regression (LARS)
Decision Tree Algorithms
Decision tree methods construct a model of decisions made based on actual values of attributes in the data.
Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning.
The most popular decision tree algorithms are:
- Classification and Regression Tree (CART)
- Iterative Dichotomiser 3 (ID3)
- C4.5 and C5.0 (different versions of a powerful approach)
- Chi-squared Automatic Interaction Detection (CHAID)
- Decision Stump
- Conditional Decision Trees
Bayesian Algorithms
Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as classification and regression.
The most popular Bayesian algorithms are:
- Naive Bayes
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Averaged One-Dependence Estimators (AODE)
- Bayesian Belief Network (BBN)
- Bayesian Network (BN)
Clustering Algorithms
Clustering, like regression, describes the class of problem and the class of methods.
Clustering methods are typically organized by modeling approach, such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.
The most popular clustering algorithms are:
- Expectation Maximisation (EM)
- Hierarchical Clustering
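As an example of the centroid-based approach, here is a sketch of k-Means, which was named earlier as an example of an unsupervised algorithm. The points, the starting centroids, the fixed number of iterations and the function name are assumptions made for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

On this data the two centroids settle at the means of the lower-left and upper-right groups of points after the first iteration.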
Association Rule Learning Algorithms
Association rule learning methods extract rules that best explain observed relationships between variables in data.
These rules can discover important and commercially useful associations in large multidimensional datasets that can be exploited by an organization.
The most popular association rule learning algorithms are:
- Apriori algorithm
- Eclat algorithm
Artificial Neural Network Algorithms
Artificial Neural Networks are models that are inspired by the structure and/or function of biological neural networks.
They are a class of pattern-matching methods that are commonly used for regression and classification problems but are really an enormous subfield comprised of hundreds of algorithms and variations for all manner of problem types.
Note that I have separated out Deep Learning from neural networks because of the massive growth and popularity in the field. Here we are concerned with the more classical methods.
The most popular artificial neural network algorithms are:
- Hopfield Network
- Radial Basis Function Network (RBFN)
Deep Learning Algorithms
Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant cheap computation.
They are concerned with building much larger and more complex neural networks and, as commented on above, many methods are concerned with semi-supervised learning problems where large datasets contain very little labeled data.
The most popular deep learning algorithms are:
- Deep Boltzmann Machine (DBM)
- Deep Belief Networks (DBN)
- Convolutional Neural Network (CNN)
- Stacked Auto-Encoders
Dimensionality Reduction Algorithms
Like clustering methods, dimensionality reduction methods seek and exploit the inherent structure in the data, but in this case in an unsupervised manner in order to summarize or describe data using less information.
This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.
- Principal Component Analysis (PCA)
- Principal Component Regression (PCR)
- Partial Least Squares Regression (PLSR)
- Sammon Mapping
- Multidimensional Scaling (MDS)
- Projection Pursuit
- Linear Discriminant Analysis (LDA)
- Mixture Discriminant Analysis (MDA)
- Quadratic Discriminant Analysis (QDA)
- Flexible Discriminant Analysis (FDA)
Ensemble Algorithms
Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.
Much effort is put into what types of weak learners to combine and the ways in which to combine them. This is a very powerful class of techniques and as such is very popular.
- Bootstrapped Aggregation (Bagging)
- Stacked Generalization (blending)
- Gradient Boosting Machines (GBM)
- Gradient Boosted Regression Trees (GBRT)
- Random Forest
Many algorithms were not covered.
For example, what group would Support Vector Machines go into? Its own?
I did not cover algorithms from specialty tasks in the process of machine learning, such as:
- Feature selection algorithms
- Algorithm accuracy evaluation
- Performance measures
I also did not cover algorithms from specialty subfields of machine learning, such as:
- Computational intelligence (evolutionary algorithms, etc.)
- Computer Vision (CV)
- Natural Language Processing (NLP)
- Recommender Systems
- Reinforcement Learning
- Graphical Models
- And more…
These may feature in future posts.
This tour of machine learning algorithms was intended to give you an overview of what is out there and some ideas on how to relate algorithms to each other.
I’ve collected together some resources for you to continue your reading on algorithms. If you have a specific question, please leave a comment.
Other Lists of Algorithms
There are other great lists of algorithms out there if you’re interested. Below are a few hand-selected examples.
- List of Machine Learning Algorithms: On Wikipedia. Although extensive, I do not find this list or the organization of the algorithms particularly useful.
- Machine Learning Algorithms Category: Also on Wikipedia, slightly more useful than Wikipedia’s great list above. It organizes algorithms alphabetically.
- CRAN Task View: Machine Learning & Statistical Learning: A list of all the packages and all the algorithms supported by each machine learning package in R. Gives you a grounded feeling of what’s out there and what people are using for analysis day-to-day.
- Top 10 Algorithms in Data Mining: Published article and now a book on the most popular algorithms for data mining. Another grounded and less overwhelming take on methods that you could go off and learn deeply.
How to Study Machine Learning Algorithms
Algorithms are a big part of machine learning. It’s a topic I am passionate about and write about a lot on this blog. Below are a few hand-selected posts that might interest you for further reading.
- How to Learn Any Machine Learning Algorithm: A systematic approach that you can use to study and understand any machine learning algorithm using “algorithm description templates” (I used this approach to write my first book).
- How to Create Targeted Lists of Machine Learning Algorithms: How you can create your own systematic lists of machine learning algorithms to jump start work on your next machine learning problem.
- How to Research a Machine Learning Algorithm: A systematic approach that you can use to research machine learning algorithms (works great in collaboration with the template approach listed above).
- How to Investigate Machine Learning Algorithm Behavior: A methodology you can use to understand how machine learning algorithms work by creating and executing very small studies into their behavior. Research is not just for academics!
- How to Implement a Machine Learning Algorithm: A process and tips and tricks for implementing machine learning algorithms from scratch.
How to Run Machine Learning Algorithms
Sometimes you just want to dive into code. Below are some links you can use to run machine learning algorithms, code them up using standard libraries or implement them from scratch.
- How To Get Started With Machine Learning Algorithms in R: Links to a large number of code examples on this site demonstrating machine learning algorithms in R.
- Machine Learning Algorithm Recipes in scikit-learn: A collection of Python code examples demonstrating how to create predictive models using scikit-learn.
- How to Run Your First Classifier in Weka: A tutorial for running your very first classifier in Weka (no code required!).
I hope you have found this tour useful.
Please, leave a comment if you have any questions or ideas on how to improve the algorithm tour.