
Deep Learning Papers on Medical Image Analysis



To the best of our knowledge, this is the first list of deep learning papers on medical applications. There are a couple of lists of deep learning papers in general, or on computer vision, for example Awesome Deep Learning Papers. In this list, I try to classify the papers based on their deep learning techniques and learning methodology. I believe this list could be a good starting point for DL researchers working on medical applications.


  1. A list of top deep learning papers published since 2015.
  2. Papers are collected from peer-reviewed journals and highly reputed conferences. However, the list may also include recent papers from arXiv.
  3. Metadata is required along with each paper, i.e. Deep Learning technique, Imaging Modality, Area of Interest, Clinical Database (DB).

List of Journals / Conferences (J/C):


Deep Learning Techniques:

  • NN: Neural Networks
  • MLP: Multilayer Perceptron
  • RBM: Restricted Boltzmann Machine
  • SAE: Stacked Auto-Encoders
  • CAE: Convolutional Auto-Encoders
  • CNN: Convolutional Neural Networks
  • RNN: Recurrent Neural Networks
  • LSTM: Long Short-Term Memory
  • M-CNN: Multi-Scale/View/Stream CNN
  • MIL-CNN: Multi-instance Learning CNN
  • FCN: Fully Convolutional Networks

Imaging Modality:

  • US: Ultrasound
  • MR/MRI: Magnetic Resonance Imaging
  • PET: Positron Emission Tomography
  • MG: Mammography
  • CT: Computed Tomography
  • H&E: Hematoxylin & Eosin Histology Images
  • RGB: Optical Images

Table of Contents

Deep Learning Techniques

Medical Applications

Deep Learning Techniques

Auto-Encoders/ Stacked Auto-Encoders

Convolutional Neural Networks

Recurrent Neural Networks

Generative Adversarial Networks

Medical Applications


Technique | Modality | Area | Paper Title | DB | J/C | Year
NN | H&E | N/A | Deep learning of feature representation with multiple instance learning for medical image analysis [pdf] |  | ICASSP | 2014
M-CNN | H&E | Breast | AggNet: Deep Learning From Crowds for Mitosis Detection in Breast Cancer Histology Images [pdf] | AMIDA | IEEE-TMI | 2016
FCN | H&E | N/A | Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation [pdf] |  | MICCAI | 2017


Technique | Modality | Area | Paper Title | DB | J/C | Year
M-CNN | CT | Lung | Multi-scale Convolutional Neural Networks for Lung Nodule Classification [pdf] | LIDC-IDRI | IPMI | 2015
3D-CNN | MRI | Brain | Predicting Alzheimer’s disease: a neuroimaging study with 3D convolutional neural networks [pdf] | ADNI | arXiv | 2015
CNN+RNN | RGB | Eye | Automatic Feature Learning to Grade Nuclear Cataracts Based on Deep Learning [pdf] |  | IEEE-TBME | 2015
CNN | X-ray | Knee | Quantifying Radiographic Knee Osteoarthritis Severity using Deep Convolutional Neural Networks [pdf] | O.E.1 | arXiv | 2016
CNN | H&E | Thyroid | A Deep Semantic Mobile Application for Thyroid Cytopathology [pdf] |  | SPIE | 2016
3D-CNN, 3D-CAE | MRI | Brain | Alzheimer’s Disease Diagnostics by a Deeply Supervised Adaptable 3D Convolutional Network [pdf] | ADNI | arXiv | 2016
M-CNN | RGB | Skin | Multi-resolution-tract CNN with hybrid pretrained and skin-lesion trained layers [pdf] | Dermofit | MLMI | 2016
CNN | RGB | Skin, Eye | Towards Automated Melanoma Screening: Exploring Transfer Learning Schemes [pdf] | EDRA, DRD | arXiv | 2016
M-CNN | CT | Lung | Pulmonary Nodule Detection in CT Images: False Positive Reduction Using Multi-View Convolutional Networks [pdf] | LIDC-IDRI, ANODE09, DLCST | IEEE-TMI | 2016
3D-CNN | CT | Lung | DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification [pdf] | LIDC-IDRI, LUNA16 | IEEE-WACV | 2018
3D-CNN | MRI | Brain | 3D Deep Learning for Multi-modal Imaging-Guided Survival Time Prediction of Brain Tumor Patients [pdf] |  | MICCAI | 2016
SAE | US, CT | Breast, Lung | Computer-Aided Diagnosis with Deep Learning Architecture: Applications to Breast Lesions in US Images and Pulmonary Nodules in CT Scans [pdf] | LIDC-IDRI | Nature | 2016
CAE | MG | Breast | Unsupervised deep learning applied to breast density segmentation and mammographic risk scoring [pdf] |  | IEEE-TMI | 2016
MIL-CNN | MG | Breast | Deep multi-instance networks with sparse label assignment for whole mammogram classification [pdf] | INbreast | MICCAI | 2017
GCN | MRI | Brain | Spectral Graph Convolutions for Population-based Disease Prediction [pdf] | ADNI, ABIDE | arXiv | 2017
CNN | RGB | Skin | Dermatologist-level classification of skin cancer with deep neural networks |  | Nature | 2017
FCN + CNN | MRI | Liver-Liver Tumor | SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks [pdf] |  | ISBI | 2017

Detection / Localization

Technique | Modality | Area | Paper Title | DB | J/C | Year
MLP | CT | Head-Neck | 3D Deep Learning for Efficient and Robust Landmark Detection in Volumetric Data [pdf] |  | MICCAI | 2015
CNN | US | Fetal | Standard Plane Localization in Fetal Ultrasound via Domain Transferred Deep Neural Networks [pdf] |  | IEEE-JBHI | 2015
2.5D-CNN | MRI | Femur | Automated anatomical landmark detection on distal femur surface using convolutional neural network [pdf] | OAI | ISBI | 2015
LSTM | US | Fetal | Automatic Fetal Ultrasound Standard Plane Detection Using Knowledge Transferred Recurrent Neural Networks [pdf] |  | MICCAI | 2015
CNN | X-ray, MRI | Hand | Regressing Heatmaps for Multiple Landmark Localization using CNNs [pdf] | DHADS | MICCAI | 2016
CNN | MRI, US, CT |  | An artificial agent for anatomical landmark detection in medical images [pdf] | SATCOM | MICCAI | 2016
FCN | US | Fetal | Real-time Standard Scan Plane Detection and Localisation in Fetal Ultrasound using Fully Convolutional Neural Networks [pdf] |  | MICCAI | 2016
CNN+LSTM | MRI | Heart | Recognizing end-diastole and end-systole frames via deep temporal regression network [pdf] |  | MICCAI | 2016
M-CNN | MRI | Heart | Improving Computer-Aided Detection Using Convolutional Neural Networks and Random View Aggregation [pdf] |  | IEEE-TMI | 2016
CNN | PET/CT | Heart | Automated detection of pulmonary nodules in PET/CT images: Ensemble false-positive reduction using a convolutional neural network technique [pdf] |  | MP | 2016
3D-CNN | MRI | Brain | Automatic Detection of Cerebral Microbleeds From MR Images via 3D Convolutional Neural Networks [pdf] |  | IEEE-TMI | 2016
CNN | X-ray, MG |  | Self-Transfer Learning for Fully Weakly Supervised Lesion Localization [pdf] | NIH, China, DDSM, MIAS | MICCAI | 2016
CNN | RGB | Eye | Fast Convolutional Neural Network Training Using Selective Data Sampling: Application to Hemorrhage Detection in Color Fundus Images [pdf] | DRD, MESSIDOR | MICCAI | 2016
GAN |  |  | Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery |  | IPMI | 2017
FCN | X-ray | Cardiac | CathNets: Detection and Single-View Depth Prediction of Catheter Electrodes |  | MIAR | 2016
3D-CNN | CT | Lung | DeepLung: Deep 3D Dual Path Nets for Automated Pulmonary Nodule Detection and Classification [pdf] | LIDC-IDRI, LUNA16 | IEEE-WACV | 2018


Technique | Modality | Area | Paper Title | DB | J/C | Year
U-Net |  |  | U-net: Convolutional networks for biomedical image segmentation |  | MICCAI | 2015
FCN | MRI | Head-Neck | Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation [pdf] |  | arXiv | 2016
FCN | CT | Liver-Liver Tumor | Automatic Liver and Lesion Segmentation in CT Using Cascaded Fully Convolutional Neural Networks and 3D Conditional Random Fields [pdf] |  | MICCAI | 2016
3D-CNN | MRI | Spine | Model-Based Segmentation of Vertebral Bodies from MR Images with 3D CNNs |  | MICCAI | 2016
FCN | CT | Liver-Liver Tumor | Automatic Liver and Tumor Segmentation of CT and MRI Volumes using Cascaded Fully Convolutional Neural Networks [pdf] |  | arXiv | 2017
FCN | MRI | Liver-Liver Tumor | SurvivalNet: Predicting patient survival from diffusion weighted magnetic resonance images using cascaded fully convolutional and 3D convolutional neural networks [pdf] |  | ISBI | 2017
3D-CNN | Diffusion MRI | Brain | q-Space Deep Learning: Twelve-Fold Shorter and Model-Free Diffusion MRI [pdf] (Section II.B.2) |  | IEEE-TMI | 2016
GAN | MG | Breast Mass | Adversarial Deep Structured Nets for Mass Segmentation from Mammograms [pdf] | INbreast, DDSM-BCRP | ISBI | 2018
3D-CNN | CT | Liver | 3D Deeply Supervised Network for Automatic Liver Segmentation from CT Volumes [pdf] |  | MICCAI | 2017
3D-CNN | MRI | Brain | Unsupervised domain adaptation in brain lesion segmentation with adversarial networks [pdf] |  | IPMI | 2017


Technique | Modality | Area | Paper Title | DB | J/C | Year
3D-CNN | CT | Spine | An Artificial Agent for Robust Image Registration [pdf] |  |  | 2016


Technique | Modality | Area | Paper Title | DB | J/C | Year
2.5D-CNN | MRI |  | Automated anatomical landmark detection on distal femur surface using convolutional neural network [pdf] | OAI | ISBI | 2015
3D-CNN | Diffusion MRI | Brain | q-Space Deep Learning: Twelve-Fold Shorter and Model-Free Diffusion MRI [pdf] (Section II.B.1) | [HCP] and other | IEEE-TMI | 2016

Image Reconstruction and Post Processing

Technique | Modality | Area | Paper Title | DB | J/C | Year
CNN | CS-MRI |  | A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction [pdf] |  | IEEE-TMI | 2017
GAN | CS-MRI |  | Deep Generative Adversarial Networks for Compressed Sensing Automates MRI [pdf] |  | NIPS | 2017

Other tasks


A survey on deep learning in medical image analysis

Authors: Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen A.W.M. van der Laak, Bram van Ginneken, Clara I. Sánchez

Medical Image Analysis with Deep Learning 



In this article, I start with basics of image processing, basics of medical image format data and visualize some medical data.

By Taposh Roy, Kaiser Permanente.

Analyzing images and videos, and using them in applications such as self-driving cars and drones with underlying deep learning techniques, has been the new research frontier. Recent research papers such as “A Neural Algorithm of Artistic Style” show how a style can be transferred from an artist and applied to an image to create a new image. Other papers such as “Generative Adversarial Networks” (GAN) and “Wasserstein GAN” have paved the way for models that can learn to create data similar to the data we give them, opening up the world to semi-supervised learning and paving the path to a future of unsupervised learning.

While these research areas still focus on generic images, our goal is to apply this research to medical images to help healthcare. We need to start with some basics. In this article, I start with the basics of image processing and of medical image data formats, and visualize some medical data. In the next article I will dive deep into convolutional neural nets and use them with Keras to predict lung cancer.

Basic Image Processing (using python)

There are a variety of image processing libraries; however, OpenCV (Open Source Computer Vision Library) has become mainstream due to its large community support and availability in C++, Java and Python. I prefer using OpenCV in a Jupyter notebook.

Install OpenCV using: pip install opencv-python, or install it directly from source.

Installing opencv.
Now open your Jupyter notebook and confirm you can import cv2. You will also need numpy and matplotlib to view your plots inside the notebook.

Now, let's check if you can open an image and view it in your notebook using the code below.

Example image load through OpenCV.
Basic Face Detection

Let's do something fun, such as detecting a face. To detect faces we will use an open source XML file containing a stump-based 20×20 gentle AdaBoost frontal face detector originally created by Rainer Lienhart. A good post with details on Haar-cascade detection is here.

Face detection using OpenCV.
There are a lot of examples of image processing using OpenCV in the docs section. I leave it up to the reader to play with more examples. Now that we know the basics of image processing, let's move to the next level: understanding medical image formats.

Medical Image Data Format

Medical images follow Digital Imaging and Communications in Medicine (DICOM) as a standard solution for storing and exchanging medical image data. The first version of this standard was released in 1985, and it has been revised several times since. The standard defines a file format and a communications protocol.

  • File Format: All patient medical images are saved in the DICOM file format. This format carries PHI (protected health information) about the patient, such as name, sex and age, in addition to other image-related data such as the equipment used to capture the image and some context of the medical treatment. Medical imaging equipment creates DICOM files. Doctors use DICOM viewers, software applications that can display DICOM images, to read the images and diagnose the findings in them.
  • Communications Protocol: The DICOM communication protocol is used to search for imaging studies in the archive and restore imaging studies to the workstation in order to display them. All medical imaging applications that are connected to the hospital network use the DICOM protocol to exchange information, mainly DICOM images but also patient and procedure information. There are also more advanced network commands that are used to control and follow the treatment, schedule procedures, report statuses and share the workload between doctors and imaging devices.

A very good blog that goes into details of the DICOM standard is here

Analyze DICOM Images

A very good python package used for analyzing DICOM images is pydicom. In this section, we will see how to render a DICOM image on a Jupyter notebook.

Install pydicom using: pip install pydicom

After you install pydicom package, go back to the jupyter notebook. In the notebook, import the dicom package and other packages as shown below.

We also use other packages such as pandas, scipy, skimage, mpl_toolkit for data processing and analysis.

There’s a wealth of freely available DICOM datasets online but here’s a few that should help you get started:

Download the dicom files and load them on your jupyter notebook.

Now, load the DICOM images into a list.
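One possible way to build that list, sketched with the standard library only. The helper name `list_dicom_files` and the throwaway demo directory are assumptions for illustration; `lstFilesDCM` is the list name used in the text.

```python
import os
import tempfile


def list_dicom_files(root_dir):
    """Walk root_dir and return the sorted paths of all files ending in .dcm."""
    found = []
    for dir_name, _subdirs, file_names in os.walk(root_dir):
        for file_name in file_names:
            if file_name.lower().endswith(".dcm"):
                found.append(os.path.join(dir_name, file_name))
    return sorted(found)


# Demonstration on a throwaway directory with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("slice_002.dcm", "slice_001.dcm", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

lstFilesDCM = list_dicom_files(demo_dir)
```

With real data, pass the folder that holds the downloaded scan instead of `demo_dir`.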

Step 1 : Basic Viewing of DICOM Image in Jupyter

In the first line we load the first DICOM file, whose filename is first in the lstFilesDCM list, and use it as a reference named RefDs to extract metadata.

We then calculate the total dimensions of the 3D NumPy array which are equal to (Number of pixel rows in a slice) x (Number of pixel columns in a slice) x (Number of slices) along the x, y, and z cartesian axes. Lastly, we use the PixelSpacing and SliceThickness attributes to calculate the spacing between pixels in the three axes. We store the array dimensions in ConstPixelDims and the spacing in ConstPixelSpacing [1].
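The calculation described above can be sketched as follows. To keep the example runnable without a DICOM file, a `SimpleNamespace` stands in for the reference header; with pydicom installed, `RefDs` would presumably come from `pydicom.dcmread(lstFilesDCM[0])`. The helper name `volume_geometry` and the sample values are illustrative assumptions.

```python
from types import SimpleNamespace


def volume_geometry(ref_ds, num_slices):
    """Return (dims, spacing) along x, y, z from one reference slice's metadata."""
    dims = (int(ref_ds.Rows), int(ref_ds.Columns), num_slices)
    spacing = (
        float(ref_ds.PixelSpacing[0]),
        float(ref_ds.PixelSpacing[1]),
        float(ref_ds.SliceThickness),
    )
    return dims, spacing


# Stand-in for a real header, using the standard DICOM attribute names.
RefDs = SimpleNamespace(Rows=512, Columns=512, PixelSpacing=[0.7, 0.7], SliceThickness=2.5)
ConstPixelDims, ConstPixelSpacing = volume_geometry(RefDs, num_slices=120)
```

A NumPy array of shape `ConstPixelDims` can then be allocated and filled slice by slice from each file's pixel data.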

Step 2: Looking into details of DICOM format

The unit of measurement in CT scans is the Hounsfield Unit (HU), which is a measure of radiodensity. CT scanners are carefully calibrated to accurately measure this. A detailed understanding on this can be found here.

Each pixel is assigned a numerical value (CT number), which is the average of all the attenuation values contained within the corresponding voxel. This number is compared to the attenuation value of water and displayed on a scale of arbitrary units named Hounsfield units (HU) after Sir Godfrey Hounsfield.

This scale assigns water an attenuation value (HU) of zero. The range of CT numbers is 2000 HU wide, although some modern scanners have a greater range of up to 4000 HU. Each number represents a shade of grey, with +1000 (white) and –1000 (black) at either end of the spectrum.
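To make the mapping from HU to grey levels concrete, here is a minimal sketch that linearly maps the range from -1000 to +1000 HU onto 8-bit grey values. The function name `hu_to_gray` and the choice of 8-bit output are assumptions; real viewers typically let the user pick a narrower window.

```python
import numpy as np


def hu_to_gray(hu, lo=-1000.0, hi=1000.0):
    """Linearly map Hounsfield units in [lo, hi] to grey levels (0 = black, 255 = white)."""
    hu = np.asarray(hu, dtype=np.float64)
    scaled = (np.clip(hu, lo, hi) - lo) / (hi - lo)   # values outside the range are clipped
    return np.round(scaled * 255).astype(np.uint8)


gray = hu_to_gray([-1000, 0, 1000, 3000])
```

Here air (-1000 HU) maps to black, water (0 HU) lands in the middle of the grey scale, and anything at or above +1000 HU is shown as white.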

Hounsfield Scale [credits: “Introduction to CT physics” (PDF).]
Some scanners have cylindrical scanning bounds, but the output image is square. The pixels that fall outside of these bounds get the fixed value -2000.

CT Scanner Image [credits : “Introduction to CT physics” (PDF).]
The first step usually is setting these values to 0. Next, let’s convert back to Hounsfield units by multiplying by the rescale slope and adding the intercept (which are conveniently stored in the metadata of the scans!).

In the next part, we will use Kaggle’s lung cancer data-set and Convolutional Neural Nets using Keras. We will build upon the information provided by this article to go to the next one.
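The two steps above can be sketched in NumPy as follows. The function name `to_hounsfield` and the sample values are illustrative; in a DICOM header the two calibration values are the RescaleSlope and RescaleIntercept attributes.

```python
import numpy as np


def to_hounsfield(raw, slope, intercept, padding_value=-2000):
    """Convert stored CT pixel values to Hounsfield units."""
    pixels = np.array(raw, dtype=np.float64)
    pixels[pixels == padding_value] = 0              # pixels outside the scanning bounds
    return (pixels * slope + intercept).astype(np.int16)


# Tiny made-up slice: one padded pixel and three ordinary stored values.
raw_slice = np.array([[-2000, 0], [1024, 2024]], dtype=np.int16)
hu_slice = to_hounsfield(raw_slice, slope=1.0, intercept=-1024.0)
```

With an intercept of -1024, the padded pixel ends up at -1024 HU, which is close to the value of air.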


  8. Kaggle community for all the different scripts and support

Bio: Taposh Roy leads the innovation team in Kaiser Permanente’s Decision Support group. He works with research, technology and business leaders to derive insights from data.

Original. Reposted with permission.


Deep Learning Applications in Medical Imaging







The earliest deep-learning-like algorithms that had multiple layers of non-linear features can be traced back to Ivakhnenko and Lapa in 1965 (Figure 1), who used thin but deep models with polynomial activation functions which they analyzed with statistical methods. In each layer, they selected the best features through statistical methods and forwarded them to the next layer. They did not use backpropagation to train their network end-to-end but used layer-by-layer least squares fitting where previous layers were independently fitted from later layers.

Figure 1: The architecture of the first known deep network which was trained by Alexey Grigorevich Ivakhnenko in 1965. The feature selection steps after every layer lead to an ever-narrowing architecture which terminates when no further improvement can be achieved by the addition of another layer. Image of Prof. Alexey Ivakhnenko courtesy of Wikipedia.

The earliest convolutional networks were used by Fukushima in 1979. Fukushima’s networks had multiple convolutional and pooling layers similar to modern networks, but the network was trained by using a reinforcement scheme where a trail of strong activation in multiple layers was increased over time. Additionally, one would assign important features of each image by hand by increasing the weight on certain connections.

Backpropagation of errors to train deep models was lacking at this point. Backpropagation was derived already in the early 1960s but in an inefficient and incomplete form. The modern form was derived first by Linnainmaa in his 1970 master's thesis that included FORTRAN code for backpropagation but did not mention its application to neural networks. Even at this point, backpropagation was relatively unknown and very few documented applications of backpropagation existed in the early 1980s (e.g. Werbos in 1982). Rumelhart, Hinton, and Williams showed in 1985 that backpropagation in neural networks could yield interesting distributed representations. At this time, this was an important result in cognitive psychology where the question was whether human cognition can be thought of as relying on distributed representations (connectionism) or symbolic logic (computationalism).

The first true, practical application of backpropagation came about through the work of LeCun in 1989 at Bell Labs. He used convolutional networks in combination with backpropagation to classify handwritten digits (MNIST) and this system was later used to read large numbers of handwritten checks in the United States. The video above shows Yann LeCun demonstrating digit classification using the “LeNet” network in 1993.

Despite these successes, funding for research into neural networks was scarce. The term artificial intelligence dropped to near pseudoscience status during the AI winter and the field still needed some time to recover. Some important advances were made in this time, for example, the long short-term memory (LSTM) for recurrent neural networks by Hochreiter and Schmidhuber in 1997, but these advances went mostly unnoticed until later as they were overshadowed by the support vector machine developed by Cortes and Vapnik in 1995.

The next big shift occurred just by waiting for computers to get faster, and then later by the introduction of graphics processing units (GPUs). Waiting for faster computers and GPUs alone increased the computational speed by a factor of 1000 over a span of 10 years. In this period, neural networks slowly began to rival support vector machines. Neural networks can be slow when compared to support vector machines, but they reach much better results with the same amount of data. Unlike simpler algorithms, neural networks continue to improve with more training data.

The main hurdle at this point was to train big, deep networks, which suffered from the vanishing gradient problem, where features in early layers could not be learned because no learning signal reached these layers.

The first solution to this problem was layer-by-layer pretraining, where the model is built in a layer-by-layer fashion by using unsupervised learning so that the features in early layers are already initialized or “pretrained” with some suitable features (weights). Pretrained features in early layers only need to be adjusted slightly during supervised learning to achieve good results. The first pretraining approaches were developed for recurrent neural networks by Schmidhuber in 1992, and for feed-forward networks by Hinton and Salakhutdinov in 2006. Another solution for the vanishing gradient problem in recurrent neural networks was long short-term memory in 1997.

As the speed of GPUs increased rapidly, it was soon possible to train deep networks such as convolutional networks without the help of pretraining as demonstrated by Ciresan and colleagues in 2011 and 2012 who won character recognition, traffic sign, and medical imaging competitions with their convolutional network architecture. Krizhevsky, Sutskever, and Hinton used a similar architecture in 2012 that also features rectified linear activation functions and dropout for regularization. They received outstanding results in the ILSVRC-2012 ImageNet competition, which marked the abandonment of feature engineering and the adoption of feature learning in the form of deep learning. Google, Facebook, and Microsoft noticed this trend and made major acquisitions of deep learning startups and research teams between 2012 and 2014. From here, research in deep learning accelerated rapidly.

Additional material: Deep Learning in Neural Networks: An Overview


A perceptron contains only a single linear or nonlinear unit. Geometrically, a perceptron with a nonlinear unit trained with the delta rule can find the nonlinear plane separating data points of two different classes (if the separation plane exists). If no such separation plane exists, the perceptron will often still produce separation planes that provide good classification accuracy. The good performance of the perceptron led to a hype of artificial intelligence. In 1969 however, it was shown that a perceptron may fail to separate seemingly simple patterns such as the points provided by the XOR function. The fall from grace of the perceptron was one of the main reasons for the occurrence of the first AI winter. While neural networks with hidden layers do not suffer from the typical problems of the perceptron, neural networks were still associated with the perceptron and therefore also suffered an image problem during the AI winter.

Despite this, and despite the success of deep learning, perceptrons still find widespread use in the realm of big data, where the simplicity of the perceptron allows for successful application to very large data sets.
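The XOR limitation mentioned above is easy to reproduce. The sketch below trains a single threshold unit with the classic perceptron update rule (a simplification of the delta-rule unit described in the text, chosen here for brevity) on the AND and XOR truth tables; the function names and hyperparameters are illustrative.

```python
import numpy as np


def train_perceptron(X, y, epochs=50, lr=1.0):
    """Train a single threshold unit with the classic perceptron update rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            prediction = 1 if xi @ w + b > 0 else 0
            update = lr * (target - prediction)   # zero when the prediction is correct
            w += update * xi
            b += update
    return w, b


def accuracy(X, y, w, b):
    """Fraction of rows in X that the unit classifies correctly."""
    predictions = (X @ w + b > 0).astype(int)
    return float(np.mean(predictions == y))


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

acc_and = accuracy(X, y_and, *train_perceptron(X, y_and))
acc_xor = accuracy(X, y_xor, *train_perceptron(X, y_xor))
```

The unit learns AND perfectly, while for XOR no single separating line exists, so at least one of the four points always stays misclassified no matter how long training runs.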


Rapid advances in machine learning and other approaches of inference led to a hype of artificial intelligence (similar to the buzz around deep learning today). Researchers made promises that these advances would continue and would lead to strong AI and in turn, AI research received lots of funding.

In the 1970s it became clear that those promises could not be kept. Funding was cut dramatically and the field of artificial intelligence dropped to near pseudo-science status. Research became very difficult (little funding; publications almost never made it through peer review), but nevertheless, a few researchers continued further down this path and their research soon led to the reinvigoration of the field and the creation of the field of deep learning.

This is why excessive deep learning hype is dangerous and researchers typically avoid making predictions about the future: AI researchers want to avoid another AI winter.


AlexNet is a convolutional network architecture named after Alex Krizhevsky, who along with Ilya Sutskever under the supervision of Geoffrey Hinton applied this architecture to the ILSVRC-2012 competition that featured the ImageNet dataset. They improved the convolutional network architecture developed by Ciresan and colleagues, which won multiple international competitions in 2011 and 2012 by using rectified linear units for enhanced speed and dropout for improved generalization. Their results stood in stark contrast to feature engineering methods, which immediately created a great rift between deep learning and feature engineering methods for computer vision. From here it was apparent that deep learning would take over computer vision and that other methods would not be able to catch up. AlexNet heralded the mainstream usage and the hype of deep learning.

ImageNet Classification with Deep Convolutional Neural Networks.

Training Deep Learning Architectures


The process of training a deep learning architecture is similar to how toddlers start to make sense of the world around them. When a toddler encounters a new animal, say a monkey, he or she will not know what it is. But then an adult points with a finger at the monkey and says: “That is a monkey!” The toddler will then be able to associate the image he or she sees with the label “monkey”.

A single image, however, might not be sufficient to label an animal correctly when it is encountered the next time. For example, the toddler might mistake a sloth for a monkey or a monkey for a sloth, or might simply forget the name of a certain animal. For reliable recall and labeling, a toddler needs to see many different monkeys and similar animals and needs to know each time whether or not it is really a monkey—feedback is essential for learning. After some time, if the toddler encounters enough animals paired with their names, the toddler will have learned to distinguish between different animals.

The deep learning process is similar. We present the neural network with images or other data, such as the image of a monkey. The deep neural network predicts a certain outcome, for example, the label of the object in an image (“monkey”). We then supply the network with feedback. For example, if the network predicted that the image showed a monkey with 30% probability and a sloth with 70% probability, then all the outputs in favor of the sloth class made an error! We use this error to adjust the parameters of the neural network using the backpropagation of errors algorithm.
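The monkey/sloth example can be written out numerically. The text does not name a specific error function, so the cross-entropy loss used below is an assumption picked because it is a common choice for classification.

```python
import numpy as np

classes = ["monkey", "sloth"]
predicted = np.array([0.30, 0.70])   # network output from the example above
target = np.array([1.0, 0.0])        # the image really shows a monkey

# Per-class error signal: negative where the network was too timid,
# positive where it was too confident.
error = predicted - target

# Cross-entropy loss for the true class; it shrinks toward 0 as P(monkey) approaches 1.
loss = -np.log(predicted[classes.index("monkey")])
```

The error vector shows that the monkey output should go up by 0.7 and the sloth output should come down by the same amount; backpropagation turns this signal into parameter updates.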

Usually, we randomly initialize the parameters of a deep network so the network initially outputs random predictions. This means for ImageNet, which consists of 1000 classes, we will achieve an average classification accuracy of just 0.1% for any image after initializing the neural network. To improve the performance we need to adjust the parameters so that the classification performance increases over time. But this is inherently difficult: If we adjust one parameter to improve performance on one class, this change might decrease the classification performance for another class. Only if we find parameter changes that work for all classes can we achieve good classification performance.

If you imagine a neural network with only 2 parameters (e.g. -0.37 and 1.14), then you can imagine a mountain landscape, where the height of the landscape represents the classification error and the two directions—north-south (x-axis) and east-west (y-axis)—represent the directions in which we can change the two parameters (negative-positive direction). The task is to find the lowest altitude point in the mountain landscape: we want to find the minimum.

The problem with this is that the entire mountain landscape is unknown to us at the beginning. It is as if the whole mountain range is covered in fog. We only know our current position (the initial random parameters) and our height (the current classification error). How can we find the minimum quickly when we have so little information about the landscape?


Imagine you stand on top of a mountain with skis strapped to your feet. You want to get down to the valley as quickly as possible, but there is fog and you can only see your immediate surroundings. How can you get down the mountain as quickly as possible? You look around and identify the steepest path down, go down that path for a bit, again look around and find the new steepest path, go down that path, and repeat—this is exactly what gradient descent does.
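A minimal sketch of this procedure on a two-parameter landscape. The bowl-shaped error function, its minimum at (1, -2), the step size and the number of steps are all made up for illustration; the starting point reuses the example parameters -0.37 and 1.14.

```python
import numpy as np


def error(p):
    """A bowl-shaped toy error landscape with its minimum at (1, -2)."""
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 2.0) ** 2


def gradient(p):
    """Analytic gradient of error(): it points in the direction of steepest ascent."""
    return np.array([2.0 * (p[0] - 1.0), 4.0 * (p[1] + 2.0)])


params = np.array([-0.37, 1.14])     # starting position on the mountain
start_error = error(params)
step_size = 0.1
for _ in range(200):
    params = params - step_size * gradient(params)   # take a small step downhill
final_error = error(params)
```

Each iteration only looks at the local slope, which is the "fog" constraint from the analogy, yet the parameters still end up at the bottom of the bowl.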

While gradient descent is equivalent to stopping every 10 meters and measuring the steepness of your surroundings with a measuring tape (you measure your gradient according to the whole data set), stochastic gradient descent is the equivalent of quickly estimating the steepness with a short glance (just a few hundred data points are used to estimate the steepness).
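The difference between the two estimates can be illustrated on a toy regression problem. The data set, the one-parameter model y = w * x and the batch size of 256 are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set: y = 3x plus noise, 10,000 points, fitted by a one-parameter model y = w * x.
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.1, size=x.size)


def gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return float(np.mean(2.0 * (w * xs - ys) * xs))


w = 0.0
full_grad = gradient(w, x, y)                         # measuring tape: every data point

batch = rng.choice(x.size, size=256, replace=False)   # short glance: 256 random points
mini_grad = gradient(w, x[batch], y[batch])
```

The mini-batch estimate is noisier than the full gradient but points in the same direction, and it costs a small fraction of the computation.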

In terms of stochastic gradient descent, we go down the steepest path (the negative gradient or first derivative) on the landscape of the error function to find a local minimum, that is, the point that yields a low error for our task. We do this in tiny steps so that we do not get trapped in half-pipe-like obstacles (if we are too fast, we never get out of these half-pipes and we may even be “catapulted” up the mountain).
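Why step size matters can be seen on a one-dimensional toy error function, error(w) = w**2. The function name `descend` and the two step sizes are illustrative assumptions.

```python
def descend(step_size, steps=50, start=5.0):
    """Run gradient descent on error(w) = w**2 and return the final w."""
    w = start
    for _ in range(steps):
        w = w - step_size * 2.0 * w    # the gradient of w**2 is 2w
    return w


small_step = descend(0.1)    # shrinks steadily toward the minimum at 0
large_step = descend(1.1)    # overshoots the minimum by more on every step
```

With the small step the parameter settles near the minimum; with the large step each update jumps across the valley and lands higher up on the other side, the "catapulted up the mountain" case.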

While our ski-landscape is 3D, typical error landscapes may have millions of dimensions. In such a space we have many valleys so it is easy to find a good solution, but we also have many saddle points, which makes matters very difficult.

Saddle points are points at which the surroundings are almost entirely flat, yet which may have dramatic descents at one end or the other (saddle points are like plateaus that slightly bend and may lead to a cliff). Most of the difficulty in finding good solutions on an error landscape with many dimensions stems from navigating saddle points (because these plateaus have almost no steepness, progress is very slow near saddle points) rather than finding the minimum itself (there are many minima, which are almost all of the same quality).

Additional material: Coursera: Neural Networks for Machine Learning: Optimization – How to Make the Learning Go Faster


Backpropagation of errors, or often simply backpropagation, is a method for finding the gradient of the error with respect to weights over a neural network. The gradient signifies how the error of the network changes with changes to the network’s weights. The gradient is used to perform gradient descent and thus find a set of weights that minimize the error of the network.

Figure 1: Backpropagation for an arbitrary layer in a deep neural network.

There are three good ways to teach backpropagation: (1) Using a visual representation, (2) using a mathematical representation, (3) using a rule-based representation. The bonus material at the end of this section uses a mathematical representation. Here I’ll use a rule-based representation as it requires little math and is easy to understand.

Imagine a neural network with 100 layers. We can imagine a forward pass in which a matrix (dimensions: number of examples x number of input nodes) is input to the network and propagated through it, where we always have the order (1) input nodes, (2) weight matrix (dimensions: input nodes x output nodes), and (3) output nodes, which usually also have a non-linear activation function (dimensions: examples x output nodes). How can we imagine these matrices?

The input matrix represents the following: For every input node we have one input value, for example, pixels (three input values = three pixels in Figure 1), and we take this times our number of examples, such as the number of images. So for 128 3-pixel images, we have a 128×3 input matrix.

The weight matrix represents the connections between input and output nodes. The value passed to an input node (a pixel) is weighted by the weight matrix values and it “flows” to each output node through these connections. This flow is the result of multiplying the input value by the value of each weight between the input node and the output nodes. The output matrix is the accumulated “flow” of all input nodes at all output nodes.

So for each input, we multiply by all weights, and add up all those contributions at the output nodes, or more simply, we take the matrix product of the input matrix and the weight matrix. In our example, this would be our 128×3 input matrix multiplied by the 3×5 weight matrix (see Figure 1). The result is our output matrix, which in this example is of size 128×5. We then apply the non-linear activation function to this output matrix and treat the result as the input matrix to the next layer. We repeat these steps until we reach the error function. We then apply the error function to see how far the predictions are from the correct values. We can formulate this whole process of the forward pass, and equivalently the backward pass, by defining simple rules (see Figure 1).

For the forward pass with given input data we go from the first to the last layer according to these rules:

  1. When we encounter a weight matrix, we matrix multiply by this weight and propagate the result.
  2. If we encounter a function, we put our current result into the function and propagate the function output as our result.
  3. We treat outputs of the previous layer as inputs into the next layer.
  4. When we encounter the error function, we apply it and thus generate the error for our backward pass.
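As a rough illustration of forward rules 1 to 3 for a single layer, here is a sketch in plain Python using the 128×3 input and 3×5 weight matrix from the example. The random values are placeholders, and the logistic sigmoid is my assumption for the activation function (the text does not prescribe one).

```python
import math
import random


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid(matrix):
    """Element-wise logistic sigmoid (assumed activation function)."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matrix]


random.seed(0)
inputs = [[random.random() for _ in range(3)] for _ in range(128)]        # 128 x 3
weights = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(3)]   # 3 x 5

pre_activation = matmul(inputs, weights)   # rule 1: multiply by the weight matrix
outputs = sigmoid(pre_activation)          # rule 2: apply the function
shape = (len(outputs), len(outputs[0]))    # rule 3: this becomes the next layer's input
print(shape)
```

The resulting matrix has one row per example and one column per output node, so it can be fed into the next weight matrix in exactly the same way.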

The backward pass for a given error is similar but proceeds from the last to the first layer where the error generated in rule 4 in the forward pass represents the “inputs” to the last layer. We then go backward through the network and follow these rules:

  1. When we encounter a weight matrix, we matrix multiply by the transpose of the matrix and propagate the result.
  2. If we encounter a function, we multiply (element-wise) by the derivative of that function with respect to the inputs that this function received from the forward pass (see Figure 1).
  3. We treat errors of the previous layer as inputs (errors) into the next layer.

To calculate the gradients, we use each intermediate result obtained after executing rule 2 in the backward pass and matrix multiply this intermediate result by the value of rule 2 from the forward pass from the previous layer (see Figure 1).
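The backward rules and the gradient calculation can be sketched for a tiny two-layer network as follows. Everything specific here is an assumption for illustration: the layer sizes, the sigmoid hidden layer, the linear output layer, and the squared error function. The last lines compare one backpropagated gradient entry against a finite-difference estimate as a sanity check.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def sigmoid(m):
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in m]


def loss(x, w1, w2, target):
    """Forward pass followed by the (assumed) squared error function."""
    out = matmul(sigmoid(matmul(x, w1)), w2)
    return 0.5 * sum((o - t) ** 2 for ro, rt in zip(out, target) for o, t in zip(ro, rt))


def gradients(x, w1, w2, target):
    hidden = sigmoid(matmul(x, w1))          # forward pass, layer 1
    out = matmul(hidden, w2)                 # forward pass, layer 2 (linear)
    # Derivative of the squared error: the "input" to the backward pass.
    delta_out = [[o - t for o, t in zip(ro, rt)] for ro, rt in zip(out, target)]
    grad_w2 = matmul(transpose(hidden), delta_out)
    # Backward rule 1: multiply by the transpose of the weight matrix.
    back = matmul(delta_out, transpose(w2))
    # Backward rule 2: multiply element-wise by the derivative, sigmoid' = h * (1 - h).
    delta_hidden = [[b * h * (1.0 - h) for b, h in zip(rb, rh)]
                    for rb, rh in zip(back, hidden)]
    # Gradient: previous layer's forward output (here x) times this layer's error.
    grad_w1 = matmul(transpose(x), delta_hidden)
    return grad_w1, grad_w2


random.seed(1)
x = [[random.random() for _ in range(3)] for _ in range(4)]
w1 = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
w2 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(5)]
target = [[random.random() for _ in range(2)] for _ in range(4)]
grad_w1, grad_w2 = gradients(x, w1, w2, target)

# Finite-difference estimate of d(loss)/d(w1[0][0]) for comparison.
eps = 1e-6
w1[0][0] += eps
loss_plus = loss(x, w1, w2, target)
w1[0][0] -= 2 * eps
loss_minus = loss(x, w1, w2, target)
w1[0][0] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)
print(grad_w1[0][0], numeric)
```

Each gradient matrix has the same shape as the weight matrix it belongs to, which is what makes the subsequent gradient descent update a simple element-wise subtraction.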

Additional material: Coursera: Neural Networks for Machine Learning: The Backpropagation Learning Procedure


The rectified linear function is a simple non-linearity: It evaluates to 0 for negative inputs, and positive values remain untouched (f(x) = max(0,x)). The gradient of the rectified linear function is 1 for all positive values and 0 for negative values. This means that during backpropagation, no gradient passes through a rectified linear unit whose input was negative, so that unit contributes nothing to the weight updates for that example.

However, because we have a gradient of 1 for any positive value we have much better training speed when compared to other non-linear functions due to the good gradient flow. For example, the logistic sigmoid function has very tiny gradients for large positive and negative values so that learning nearly stops in these regions (this behavior is similar to a saddle point).

Despite the fact that gradients do not propagate through rectified linear units with negative inputs (the gradient is zero here), the constant gradient of 1 for positive inputs is very powerful and ensures fast training regardless of how large the input is. Once these benefits were discovered, rectified linear functions and similar activation functions with large gradients became the activation functions of choice for deep networks.
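A small sketch comparing the two gradients makes the difference visible. The sample inputs are arbitrary, and treating the gradient at exactly zero as 0 is a convention I am assuming here.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative of max(0, x); the value at exactly 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in (-10.0, -1.0, 0.5, 10.0):
    print(f"x={x:6.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.6f}")
```

For large positive or negative inputs the sigmoid gradient shrinks towards zero, while the rectified linear gradient stays at 1 for every positive input.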


Momentum uses the idea that the gradient zigzags every now and then but generally follows a rather straight line towards a local minimum. As such, if we move faster in this general direction and disregard the zigzag directions we will arrive faster at the local minimum, in general.

To realize this behavior we keep track of a running momentum matrix, which is the weighted running sum of the gradient, and we add that momentum matrix value to the gradient. The size of this momentum matrix is kept in check by attenuating it on every update (multiply by a momentum value between 0.7-0.99). Over time, the zigzag dimensions will be smoothed out in our running momentum matrix: A zig in one direction and a zag in the exact opposite direction cancel out and yield a straight line towards the general direction of the local minimum. In the beginning, the general direction towards the local minimum is not strongly established (a sequence of zags with no zigs, or vice versa), and the momentum matrix needs to be attenuated more strongly or the values for the momentum increasingly emphasize zigzagging directions, which in turn can lead to unstable learning. Thus, the momentum value should be kept small (0.5-0.7) in the beginning when no general direction towards a local minimum has been established. Later the momentum value can be increased rapidly (0.9-0.999).

Usually, the gradient update is applied first, and then the jump into the momentum direction follows. However, Nesterov showed that it is better to first jump into the momentum direction and then correct this direction with a gradient update; this procedure is known as “Nesterov’s accelerated gradient” (sometimes “Nesterov momentum”) and yields faster convergence to a local minimum.
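Here is a minimal sketch of plain gradient descent, momentum, and Nesterov's variant on a toy problem. The elongated quadratic error surface, the learning rate of 0.03, the fixed momentum value of 0.9, and the function names are all assumptions of mine for illustration; a real network would also use the momentum schedule described above rather than a constant.

```python
def grad(w):
    """Gradient of the toy error f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]


def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)


def run(method, learning_rate=0.03, momentum=0.9, steps=100):
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]   # the running "momentum matrix"
    for _ in range(steps):
        if method == "plain":
            g = grad(w)
            w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
            continue
        if method == "nesterov":
            # Jump in the momentum direction first, then measure the gradient there.
            look_ahead = [wi - learning_rate * momentum * vi for wi, vi in zip(w, velocity)]
            g = grad(look_ahead)
        else:
            g = grad(w)
        # Attenuate the running sum and add the new gradient.
        velocity = [momentum * vi + gi for vi, gi in zip(velocity, g)]
        w = [wi - learning_rate * vi for wi, vi in zip(w, velocity)]
    return loss(w)


plain_loss = run("plain")
momentum_loss = run("momentum")
nesterov_loss = run("nesterov")
print(plain_loss, momentum_loss, nesterov_loss)
```

On this particular surface both momentum variants end up with a lower error than plain gradient descent after the same number of steps; how large the gap is depends heavily on the surface and the chosen values.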

Additional material: Coursera: Neural Networks for Machine Learning: 3. The Momentum Method


RMSprop keeps track of the weighted running mean of the squared gradient and then divides each calculated gradient by the square root of this weighted running mean (it essentially normalizes the gradient by dividing by the magnitude of recent gradients). The consequence is that when a plateau in the error surface is encountered and the gradient is very small, the updates take greater steps, ensuring faster learning (a small update: 0.00001, the square root of the weighted average: 0.00005, update size: 0.2). On the other hand, RMSprop protects against exploding gradients (a large update: 100, the square root of the weighted average: 25, update size: 4) and is thus used frequently in recurrent neural networks and LSTMs to protect both against vanishing and exploding gradients.
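The normalization step can be sketched in a few lines. The decay of 0.9 and the small epsilon that guards against division by zero are common choices but are assumptions here, and the learning rate that would normally scale the step is left out to keep the focus on the normalization.

```python
import math


def rmsprop_update(gradient, mean_square, decay=0.9, epsilon=1e-8):
    """Return the normalized step and the updated running mean of squared gradients."""
    mean_square = decay * mean_square + (1.0 - decay) * gradient ** 2
    step = gradient / (math.sqrt(mean_square) + epsilon)
    return step, mean_square


def settled_step(gradient, updates=200):
    """Normalized step after seeing the same gradient many times in a row."""
    mean_square, step = 0.0, 0.0
    for _ in range(updates):
        step, mean_square = rmsprop_update(gradient, mean_square)
    return step


# The two worked examples from the text: gradient / sqrt(running mean).
plateau_step = 0.00001 / 0.00005   # roughly 0.2
spike_step = 100 / 25              # 4.0

print(plateau_step, spike_step)
print(settled_step(0.00001), settled_step(100.0))
```

When the gradient stays constant, the normalized step settles near 1 regardless of whether the raw gradient is tiny or huge, which is the behavior that helps on plateaus and with exploding gradients.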

Figure 2: Behavior of different methods to accelerate gradient descent on a saddle point. Saddle points are thought to be the main difficulty in optimizing deep networks. Image by Alec Radford.

Additional material:

Coursera: Neural Networks for Machine Learning: RMSProp

Additional animations comparing different optimization problems.


Imagine you (a unit in a convolutional network) are preparing for an exam (a classification task) and you know that during the exam you are permitted to copy answers from your peers (other units). Will you study for the exam? The answer to this question is probably yes or no depending on whether at least some students in your class have studied for the exam.

Let’s say you know that there are two students (units) in your class (convolutional net) who have the reputation of studying for every exam they take (every image that is presented). So you do not study for the exam and just copy from these students (you weigh the input from a single “elite” unit in the previous layer highly).

Now we introduce an infectious flu (dropout) that affects 50% of all students. Now there is a high chance that these two students who actually studied for the exam will not be present, so relying on copying their answers is no longer a good strategy. So this time you have to learn by yourself (make choices which take into account all units in a layer and not just the elite units).

In other words, dropout decouples the information processing of units so that they cannot rely on some unit “superstars” which always seem to have the right answer (these superstars detect features which are more important than the features that other units detect).

This in turn democratizes the classification process so that every unit makes computations that are largely independent of strong influencers, and thus reduces bias by ensuring less extreme opinions (there are no mainstream opinions). This decoupling of units in turn leads to strong regularization and better generalization (wisdom of the crowd).
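In code, the "flu" is just a random mask applied to a layer's activations during training. The sketch below assumes the common "inverted dropout" variant, in which the surviving activations are scaled up so that their expected sum stays the same; that scaling, the function name, and the all-ones layer are my assumptions, not something the analogy above specifies.

```python
import random


def dropout(activations, drop_probability=0.5, rng=random):
    """Zero each activation with the given probability and rescale the survivors."""
    keep = 1.0 - drop_probability
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


random.seed(0)
layer = [1.0] * 10000            # pretend every unit outputs 1.0
dropped = dropout(layer)
kept_fraction = sum(1 for a in dropped if a != 0.0) / len(dropped)
print(kept_fraction)
```

Because a different random half of the units is silenced for every training example, no unit can count on any particular "superstar" being present.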


L1 and L2 regularization penalize the size of the weights of a network so that large output values that signify strong confidence can no longer be achieved from a single large weight, but instead require several medium-sized weights. Since many units have to agree to achieve a large value, it is less likely that the output will be biased by the opinion of a single unit. Conceptually, regularization penalizes strong opinions from single units and encourages taking into account the opinion of multiple units, thus reducing bias.

The L1 regularization penalizes the absolute size of the weight, while the L2 penalizes the squared size of the weight. This penalty is added to the error function value thus increasing the error if larger weights are used. As a result, the network is driven to solve the problem with small weights.

Since even small weights produce a sizeable L1 penalty, the L1 penalty has the effect that most weights will be set to zero while a few medium-to-large weights remain. Because fewer non-zero weights exist, the network must be highly confident about its results to achieve good predictive performance.

The L2 penalty encourages very small non-zero weights (large weight = very large error). Here the prediction is made by almost all weights thus reducing the bias (there are no influencers that can turn around outcomes by themselves).
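The two penalties are easy to compare on a toy example. The weight vectors and the regularization strength of 1.0 below are made up for illustration.

```python
def l1_penalty(weights, strength=1.0):
    """Sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength=1.0):
    """Sum of squared weight values."""
    return strength * sum(w * w for w in weights)


def l1_gradient(w):
    """Per-weight slope of |w|: constant size, 0 at w == 0 by convention."""
    return (w > 0) - (w < 0)


def l2_gradient(w):
    """Per-weight slope of w**2: shrinks as the weight shrinks."""
    return 2.0 * w


one_large = [4.0, 0.0, 0.0, 0.0]
several_medium = [1.0, 1.0, 1.0, 1.0]

print(l1_penalty(one_large), l1_penalty(several_medium))   # same L1 cost
print(l2_penalty(one_large), l2_penalty(several_medium))   # L2 punishes the large weight
print(l1_gradient(0.01), l2_gradient(0.01))                # pull on a small weight
```

The L1 pull on a small weight is as strong as on a large one, which is why it drives weights all the way to zero, whereas the L2 pull fades as the weight gets small and so leaves many small non-zero weights.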

Additional material: Coursera: Neural Networks for Machine Learning: 2. Limiting the Size of the Weights

Conclusion to Part 2

This concludes part 2 of this crash course on deep learning. Please check back soon for the next part of the series. In part 3, I’ll provide some details on learning algorithms, unsupervised learning, sequence learning, and natural language processing, and in part 4 I’ll go into reinforcement learning. In case you missed it, be sure to check out part 1 of the series.

Meanwhile, you might be interested in learning about cuDNN, DIGITS, Computer Vision with Caffe, Natural Language Processing with Torch, Neural Machine Translation, the Mocha.jl deep learning framework for Julia, or other Parallel Forall posts on deep learning.

When deep learning works, it works great



Google silently did something revolutionary on Thursday. It open sourced a tool called word2vec, prepackaged deep-learning software designed to understand the relationships between words with no human guidance. Just input a textual data set and let underlying predictive models get to work learning.

“This is a really, really, really big deal,” said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. “… It’s going to enable whole new classes of products that have never existed before.” Think of Siri on steroids, for starters, or perhaps emulators that could mimic your writing style down to the tone.

When deep learning works, it works great

To understand Howard’s excitement, let’s go back a few days. It was Monday and I was watching him give a presentation in Chicago about how deep learning was dominating the competition in Kaggle, the online platform where organizations present vexing predictive problems and data scientists compete to create the best models. Whenever someone has used a deep learning model to tackle one of the challenges, he told the room, it has performed better than any model ever previously devised to tackle that specific problem.

Structure Data 2012: Ryan Kim – Staff Writer, GigaOM, Eric Huls – VP, Allstate Insurance Company, Jeremy Howard – President and Chief Scientist, Kaggle

But there’s a catch: deep learning is really hard. So far, only a handful of teams in hundreds of Kaggle competitions have used it. Most of them have included Geoffrey Hinton or have been associated with him.

Hinton is a University of Toronto professor who pioneered the use of deep learning for image recognition and is now a distinguished engineer at Google, as well. What got Google really interested in Hinton — at least to the point where it hired him — was his work in an image-recognition competition called ImageNet. For years the contest’s winners had been improving only incrementally on previous results, until Hinton and his team used deep learning to improve by an order of magnitude.

Neural networks: A way-simplified overview

Deep learning, Howard explained, is essentially a bigger, badder take on the neural network models that have been around for some time. It’s particularly useful for analyzing image, audio, text, genomic and other multidimensional data that doesn’t lend itself well to traditional machine learning techniques.

Neural networks work by analyzing inputs (e.g., words or images) and recognizing the features that comprise them as well as how all those features relate to each other. With images, for example, a neural network model might recognize various formations of pixels or intensities of pixels as features.

A very simple neural network. Source: Wikipedia Commons

Trained against a set of labeled data, the output of a neural network might be the classification of an input as a dog or cat, for example. In cases where there is no labeled training data (a process called self-taught learning), neural networks can be used to identify the common features of their inputs and group similar inputs even though the models can’t predict what they actually are. This is what happened when Google researchers constructed neural networks that were able to recognize cats and human faces without having been trained to do so.

Stacking neural networks to do deep learning

In deep learning, multiple neural networks are “stacked” on top of each other, or layered, in order to create models that are even better at prediction because each new layer learns from the ones before it. In Hinton’s approach, each layer randomly omits features — a process called “dropout” — to minimize the chances the model will overfit itself to just the data upon which it was trained. That’s a technical way of saying the model won’t work as well when trying to analyze new data.

So dropout or similar techniques are critical to helping deep learning models understand the real causality between the inputs and the outputs, Howard explained during a call on Thursday. It’s like looking at the same thing under the same lighting all the time versus looking at it in different lighting and from different angles. You’ll see new aspects and won’t see others, he said, “But the underlying structure is going to be the same each time.”

An example of what features a neural network might learn from images. Source: Hinton et al

Still, it’s difficult to create accurate models and to program them to run on the number of computing cores necessary to process them in a reasonable timeframe. It can also be difficult to train them on enough data to guarantee accuracy in an unsupervised environment. That’s why so much of the cutting-edge work in the field is still done by experts such as Hinton, Jeff Dean and Andrew Ng, all of whom had or still have strong ties to Google.

There are open source tools such as Theano and PyLearn2 that try to minimize the complexity, Howard told the audience on Monday, but a user-friendly, commercialized software package could be revolutionary. If data scientists in places outside Google could simply (a relative term if ever there was one) input their multidimensional data and train models to learn it, that could make other approaches to predictive modeling all but obsolete. It wouldn’t be inconceivable, Howard noted, that a software package like this could emerge within the next year.

Enter word2vec

Which brings us back to word2vec. Google calls it “an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words.” Those “architectures” are two new natural-language processing techniques developed by Google researchers Tomas Mikolov, Ilya Sutskever, and Quoc Le. (Google Fellow Jeff Dean was also involved, although modestly, he told me.) They’re like neural networks, only simpler so they can be trained on larger data sets.

Kaggle’s Howard calls word2vec the “crown jewel” of natural language processing. “It’s the English language compressed down to a list of numbers,” he said.

Word2vec is designed to run on a system as small as a single multicore machine (Google tested its underlying techniques over days across more than 100 cores on its data center servers). Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they’re related to other words (e.g., countries and capitals). It’s able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and “linguistic regularities” (e.g., vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’)).
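The vector arithmetic behind that last example can be sketched with a handful of hand-made vectors. To be clear, the three-dimensional vectors below are invented for illustration and are not word2vec output (real word vectors are learned from text and have far more dimensions), and leaving the query words out of the candidate list is a convention I am assuming.

```python
import math

# Hypothetical toy vectors: (royalty, gender, fruitiness).
vectors = {
    "king": [0.9, 0.9, 0.0],
    "queen": [0.9, -0.9, 0.0],
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


result = analogy("king", "man", "woman")
print(result)
```

With these toy numbers the nearest remaining word is "queen"; with real word vectors the match is only approximate, which is why the result is described as "close to" rather than equal.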

Source: Google

Right now, the word2vec Google Code page notes, “The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences.”

This is accomplished by turning words into numbers that correlate with their characteristics, Howard said. Words that express positive sentiment, adjectives, nouns associated with sporting events — they’ll all have certain numbers in common based on how they’re used in the training data (so bigger data is better).

Smarter models means smarter apps

If this is all too esoteric, think about these methods applied to auto-correct or word suggestions in text-messaging apps. Current methods for doing this might be as simple as suggesting words that are usually paired together, Howard explained, meaning a suggestion could be based solely on the word immediately before it. Using deep-learning-based approaches, a texting app could take into account the entire sentence, for example, because the app would have a better understanding of what all the words really mean in context.

Maybe you could average out all the numbers in a tweet, Howard suggested, and get a vector output that would accurately infer the sentiment, subject and level of formality of the tweet. Really, the possibilities are limited only by the types of applications people can think up to take advantage of word2vec’s deep understanding of natural language.

An example output file from word2vec that has grouped similar words

The big caveat, however, is researchers and industry data scientists still need to learn how to use word2vec. There hasn’t been a lot of research done on how to best use these types of models, Howard said, and the thousands of researchers working on other methods of natural language processing aren’t going to jump ship to Google’s tools overnight. Still, he believes the community will come around and word2vec and its underlying techniques could make all other approaches to natural language processing obsolete.

And this is just the start. A year from now, Howard predicts, deep learning will have surpassed a whole class of algorithms in other fields (i.e., things other than speech recognition, image recognition and natural language processing), and a year after that it will be integrated into all sorts of software packages. The only questions, and they’re admittedly big ones, are how smart deep learning models can get (and whether they’ll run into another era of hardware constraints that graphical processing units helped resolve earlier this millennium) and how accessible software packages like word2vec can make deep learning even for relatively unsophisticated users.

“Maybe in 10 years’ time,” Howard proposed, “we’ll get to that next level.”

A Tour of Machine Learning Algorithms

In this post, we take a tour of the most popular machine learning algorithms.

It is useful to tour the main algorithms in the field to get a feeling of what methods are available.

There are so many algorithms available that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit.

I want to give you two ways to think about and categorize the algorithms you may come across in the field.

  • The first is a grouping of algorithms by the learning style.
  • The second is a grouping of algorithms by similarity in form or function (like grouping similar animals together).

Both approaches are useful, but we will focus in on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types.

After reading this post, you will have a much better understanding of the most popular machine learning algorithms for supervised learning and how they are related.

Algorithms Grouped by Learning Style

There are different ways an algorithm can model a problem based on its interaction with the experience or environment or whatever we want to call the input data.

It is popular in machine learning and artificial intelligence textbooks to first consider the learning styles that an algorithm can adopt.

There are only a few main learning styles or learning models that an algorithm can have and we’ll go through them here with a few examples of algorithms and problem types that they suit.

This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think about the roles of the input data and the model preparation process and select one that is the most appropriate for your problem in order to get the best result.

Let’s take a look at three different learning styles in machine learning algorithms:

Supervised Learning

Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.

A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.

Example problems are classification and regression.

Example algorithms include Logistic Regression and the Back Propagation Neural Network.
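The "predict, then correct when wrong" training process can be sketched with a perceptron, one of the simplest supervised learners. I picked the perceptron and the logical AND data set purely for illustration; the learning rule shown is the classic perceptron update, and the epoch limit is an arbitrary assumption.

```python
def predict(weights, bias, x):
    """Predict 1 if the weighted sum is positive, otherwise 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train(examples, epochs=20):
    weights, bias = [0.0] * len(examples[0][0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in examples:
            error = label - predict(weights, bias, x)
            if error != 0:
                # The model was wrong, so nudge it towards the right answer.
                mistakes += 1
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
        if mistakes == 0:   # desired accuracy on the training data reached
            break
    return weights, bias


# Labeled training data: inputs and the known result (logical AND).
training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(training_data)
predictions = [predict(weights, bias, x) for x, _ in training_data]
print(predictions)
```

Training stops as soon as a full pass over the data produces no mistakes, which mirrors the "continues until the model achieves a desired level of accuracy" description above.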

Unsupervised Learning

Input data is not labeled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

Example algorithms include: the Apriori algorithm and k-Means.

Semi-Supervised Learning

Input data is a mixture of labeled and unlabelled examples.

There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.

Example problems are classification and regression.

Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.


When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods.

A hot topic at the moment is semi-supervised learning methods in areas such as image classification where there are large datasets with very few labeled examples.

Algorithms Grouped By Similarity

Algorithms are often grouped by similarity in terms of their function (how they work), for example, tree-based methods and neural network inspired methods.

I think this is the most useful way to group algorithms and it is the approach we will use here.

This is a useful grouping method, but it is not perfect. There are still algorithms that could just as easily fit into multiple categories, like Learning Vector Quantization, which is both a neural network inspired method and an instance-based method. There are also categories whose name describes both the problem and the class of algorithm, such as Regression and Clustering.

We could handle these cases by listing algorithms twice or by selecting the group that subjectively is the “best” fit. I like this latter approach of not duplicating algorithms to keep things simple.

In this section, I list many of the popular machine learning algorithms grouped the way I think is the most intuitive. The list is not exhaustive in either the groups or the algorithms, but I think it is representative and will be useful to you to get an idea of the lay of the land.

Please Note: There is a strong bias towards algorithms used for classification and regression, the two most prevalent supervised machine learning problems you will encounter.

If you know of an algorithm or a group of algorithms not listed, put it in the comments and share it with us. Let’s dive in.

Regression Algorithms

Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.

Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning. This may be confusing because we can use regression to refer to the class of problem and the class of algorithm. Really, regression is a process.

The most popular regression algorithms are:

  • Ordinary Least Squares Regression (OLSR)
  • Linear Regression
  • Logistic Regression
  • Stepwise Regression
  • Multivariate Adaptive Regression Splines (MARS)
  • Locally Estimated Scatterplot Smoothing (LOESS)

Instance-based Algorithms

Instance-based learning models a decision problem with instances or examples of training data that are deemed important or required to the model.

Such methods typically build up a database of example data and compare new data to the database using a similarity measure in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods and memory-based learning. Focus is put on the representation of the stored instances and similarity measures used between instances.

The most popular instance-based algorithms are:

  • k-Nearest Neighbor (kNN)
  • Learning Vector Quantization (LVQ)
  • Self-Organizing Map (SOM)
  • Locally Weighted Learning (LWL)
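As an example of the "database plus similarity measure" idea, here is a minimal k-Nearest Neighbor sketch. The toy points, the labels, the choice of Euclidean distance, and k = 3 are all assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(database, query, k=3):
    """Label a query point by majority vote among its k nearest stored examples."""
    nearest = sorted(database, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# The "database" of stored instances: (point, label).
database = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]

print(knn_predict(database, (2, 2)))
print(knn_predict(database, (7, 7)))
```

There is no training step at all; the stored examples are the model, and all the work happens when a new point is compared against them.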

Regularization Algorithms

An extension made to another method (typically regression methods) that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.

I have listed regularization algorithms separately here because they are popular, powerful and generally simple modifications made to other methods.

The most popular regularization algorithms are:

  • Ridge Regression
  • Least Absolute Shrinkage and Selection Operator (LASSO)
  • Elastic Net
  • Least-Angle Regression (LARS)
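To show how small the modification can be, here is a sketch of ridge regression for the simplest possible case: one input variable and no intercept (both simplifying assumptions of mine). The penalty only adds a term to the denominator of the ordinary least squares solution. The data points are made up.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Slope w minimizing sum((y - w * x)**2) + penalty * w**2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

ordinary_slope = fit_slope(xs, ys)             # plain least squares
ridge_slope = fit_slope(xs, ys, penalty=10.0)  # shrunk towards zero
print(ordinary_slope, ridge_slope)
```

A larger penalty pulls the slope further towards zero, trading some fit on the training data for a simpler model.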

Decision Tree Algorithms

Decision tree methods construct a model of decisions made based on actual values of attributes in the data.

Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning.

The most popular decision tree algorithms are:

  • Classification and Regression Tree (CART)
  • Iterative Dichotomiser 3 (ID3)
  • C4.5 and C5.0 (different versions of a powerful approach)
  • Chi-squared Automatic Interaction Detection (CHAID)
  • Decision Stump
  • M5
  • Conditional Decision Trees

Bayesian Algorithms

Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as classification and regression.

The most popular Bayesian algorithms are:

  • Naive Bayes
  • Gaussian Naive Bayes
  • Multinomial Naive Bayes
  • Averaged One-Dependence Estimators (AODE)
  • Bayesian Belief Network (BBN)
  • Bayesian Network (BN)

Clustering Algorithms

Clustering, like regression, describes the class of problem and the class of methods.

Clustering methods are typically organized by the modeling approaches such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.

The most popular clustering algorithms are:

  • k-Means
  • k-Medians
  • Expectation Maximisation (EM)
  • Hierarchical Clustering
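A centroid-based method such as k-Means can be sketched in a few lines. This is a bare-bones version under my own assumptions: the toy points are invented, the starting centroids are simply the first and last point, and the loop runs a fixed number of iterations instead of checking for convergence.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [points[0], points[-1]])
print(centroids)
print(clusters)
```

No labels are involved anywhere; the grouping emerges purely from which points lie close to each other.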

Association Rule Learning Algorithms

Association rule learning methods extract rules that best explain observed relationships between variables in data.

These rules can discover important and commercially useful associations in large multidimensional datasets that can be exploited by an organization.

The most popular association rule learning algorithms are:

  • Apriori algorithm
  • Eclat algorithm
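Algorithms of this kind typically rank candidate rules by measures such as support and confidence. The sketch below only computes those two measures for a made-up set of shopping baskets; it is not an implementation of Apriori or Eclat, which add an efficient search over candidate itemsets on top.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in transactions that contain the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, transactions))
print(confidence({"bread"}, {"milk"}, transactions))
```

A rule such as "bread implies milk" is interesting when both numbers are high: the combination is common, and the consequent usually follows the antecedent.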

Artificial Neural Network Algorithms

Artificial Neural Networks are models that are inspired by the structure and/or function of biological neural networks.

They are a class of pattern-matching methods that are commonly used for regression and classification problems but are really an enormous subfield comprised of hundreds of algorithms and variations for all manner of problem types.

Note that I have separated out Deep Learning from neural networks because of the massive growth and popularity in the field. Here we are concerned with the more classical methods.

The most popular artificial neural network algorithms are:

  • Perceptron
  • Back-Propagation
  • Hopfield Network
  • Radial Basis Function Network (RBFN)

Deep Learning Algorithms

Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant cheap computation.

They are concerned with building much larger and more complex neural networks and, as commented on above, many methods are concerned with semi-supervised learning problems where large datasets contain very little labeled data.

The most popular deep learning algorithms are:

  • Deep Boltzmann Machine (DBM)
  • Deep Belief Networks (DBN)
  • Convolutional Neural Network (CNN)
  • Stacked Auto-Encoders

Dimensionality Reduction Algorithms

Like clustering methods, dimensionality reduction seeks and exploits the inherent structure in the data, but in this case in an unsupervised manner in order to summarize or describe data using less information.

This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.

  • Principal Component Analysis (PCA)
  • Principal Component Regression (PCR)
  • Partial Least Squares Regression (PLSR)
  • Sammon Mapping
  • Multidimensional Scaling (MDS)
  • Projection Pursuit
  • Linear Discriminant Analysis (LDA)
  • Mixture Discriminant Analysis (MDA)
  • Quadratic Discriminant Analysis (QDA)
  • Flexible Discriminant Analysis (FDA)
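To show what "describing data using less information" can look like, here is a rough sketch of the first step of PCA in plain Python. It finds only the first principal component, using power iteration on the covariance matrix. The 2-D points are invented for the example, and in practice you would normally reach for a library routine instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, 1-D scores) for the leading principal component."""
    n = len(rows)
    dims = len(rows[0])
    # Center each column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize,
    # which converges towards the eigenvector with the largest eigenvalue.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # Project every centered point onto that direction (2-D -> 1-D).
    scores = [sum(c[j] * vec[j] for j in range(dims)) for c in centered]
    return vec, scores


# Made-up points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
component, scores = first_principal_component(points)
```

Each point is now summarized by a single score along `component` instead of two coordinates, and for this data very little information is lost.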

Ensemble Algorithms

Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.

Much effort is put into what types of weak learners to combine and the ways in which to combine them. This is a very powerful class of techniques and as such is very popular.

  • Boosting
  • Bootstrapped Aggregation (Bagging)
  • AdaBoost
  • Stacked Generalization (blending)
  • Gradient Boosting Machines (GBM)
  • Gradient Boosted Regression Trees (GBRT)
  • Random Forest
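Bootstrapped aggregation is perhaps the easiest of these to sketch. The example below is a toy version in plain Python: the weak learner is a one-feature decision stump, each stump is trained on a bootstrap sample, and the predictions are combined by majority vote. The dataset, the function names and the number of models are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(points, labels):
    """Weak learner: the single threshold split with the fewest training errors."""
    best = None
    for threshold in sorted(set(points)):
        # Try both orientations of the split.
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= threshold else right) != y
                for x, y in zip(points, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right


def bagging_fit(points, labels, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(points)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            fit_stump([points[i] for i in idx], [labels[i] for i in idx])
        )
    return models


def bagging_predict(models, x):
    """Combine the independent predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy 1-D dataset: small values are class 0, large values are class 1.
points = [1, 2, 3, 4, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(points, labels)
low_prediction = bagging_predict(models, 0)
high_prediction = bagging_predict(models, 10)
```

Each stump sees a slightly different sample and so picks a slightly different threshold. The vote smooths over those differences, which is the "combined in some way" part of the description above.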

Other Algorithms

Many algorithms were not covered.

For example, what group would Support Vector Machines go into? Its own?

I did not cover algorithms from specialty tasks in the process of machine learning, such as:

  • Feature selection algorithms
  • Algorithm accuracy evaluation
  • Performance measures

I also did not cover algorithms from specialty subfields of machine learning, such as:

  • Computational intelligence (evolutionary algorithms, etc.)
  • Computer Vision (CV)
  • Natural Language Processing (NLP)
  • Recommender Systems
  • Reinforcement Learning
  • Graphical Models
  • And more…

These may feature in future posts.

Further Reading

This tour of machine learning algorithms was intended to give you an overview of what is out there and some ideas on how to relate algorithms to each other.

I’ve collected together some resources for you to continue your reading on algorithms. If you have a specific question, please leave a comment.

Other Lists of Algorithms

There are other great lists of algorithms out there if you’re interested. Below are a few hand-selected examples.

How to Study Machine Learning Algorithms

Algorithms are a big part of machine learning. It’s a topic I am passionate about and write about a lot on this blog. Below are a few hand-selected posts that might interest you for further reading.

How to Run Machine Learning Algorithms

Sometimes you just want to dive into code. Below are some links you can use to run machine learning algorithms, code them up using standard libraries or implement them from scratch.

Final Word

I hope you have found this tour useful.

Please, leave a comment if you have any questions or ideas on how to improve the algorithm tour.

Update #1: Continue the discussion on HackerNews and reddit.

Update #2: I’ve added a bunch more resources and more algorithms. I’ve also added a handy mind map that you can download (see above).
