Despite the success of supervised machine learning and deep learning, there’s a school of thought that says that unsupervised learning has even greater potential. The learning of a supervised learning system is limited by its training; i.e., a supervised learning system can learn only those tasks that it’s trained for. By contrast, an unsupervised system could theoretically achieve “artificial general intelligence,” meaning the ability to learn any task a human can learn. However, the technology isn’t there yet.
If the biggest problem with supervised learning is the expense of labeling the training data, the biggest problem with unsupervised learning (where the data is not labeled) is that it often doesn’t work very well. Nevertheless, unsupervised learning does have its uses: It can sometimes be good for reducing the dimensionality of a data set, exploring the pattern and structure of the data, finding groups of similar objects, and detecting outliers and other noise in the data.
In general, it’s worth trying unsupervised learning methods as part of your exploratory data analysis to discover patterns and clusters, to reduce the dimensionality of your data, to discover latent features, and to remove outliers. Whether you then need to move on to supervised learning or to using pre-trained models to do predictions depends on your goals and your data.
What is unsupervised learning?
Think about how human children learn. As a parent or teacher you don’t need to show young children every breed of dog and cat there is to teach them to recognize dogs and cats. They can learn from a few examples, without a lot of explanation, and generalize on their own. Oh, they might mistakenly call a Chihuahua “Kitty” the first time they see one, but you can correct that relatively quickly.
Children intuitively lump groups of things they see into classes. One goal of unsupervised learning is essentially to allow computers to develop the same ability. As Alex Graves and Kelly Clancy of DeepMind put it in their blog post, “Unsupervised learning: the curious pupil,”
Unsupervised learning is a paradigm designed to create autonomous intelligence by rewarding agents (that is, computer programs) for learning about the data they observe without a particular task in mind. In other words, the agent learns for the sake of learning.
The potential of an agent that learns for the sake of learning is far greater than a system that reduces complex pictures to a binary decision (e.g. dog or cat). Uncovering patterns rather than carrying out a pre-defined task can yield surprising and useful results, as demonstrated when researchers at Lawrence Berkeley Lab ran a text processing algorithm (Word2vec) on several million material science abstracts to predict discoveries of new thermoelectric materials.
Clustering methods
A clustering problem is an unsupervised learning problem that asks the model to find groups of similar data points. There are a number of clustering algorithms currently in use, which tend to have slightly different characteristics. In general, clustering algorithms look at the metrics or distance functions between the feature vectors of the data points, and then group the ones that are “near” each other. Clustering algorithms work best if the classes do not overlap.
Hierarchical clustering
Hierarchical cluster analysis (HCA) can be agglomerative (you build the clusters bottom-up starting with individual points and ending with a single cluster) or divisive (you start with a single cluster and break it up until you wind up with individual points). If you’re lucky you can find an intermediate stage of the clustering process that reflects a meaningful classification.
The clustering process is usually displayed as a dendrogram (tree diagram). HCA algorithms tend to take a lot of compute time [O(n3)] and memory [O(n2)] resources; these limit the applicability of the algorithms to relatively small data sets.
HCA algorithms can use various metrics and linkage criteria. Euclidian distance and squared Euclidian distance are both common for numeric data; Hamming distance and Levenshtein distance are common for non-numeric data. Single-linkage and complete linkage are common; both of these can simplify the clustering algorithms (SLINK and CLINK respectively). SLINK is one of the few clustering algorithms guaranteed to find an optimum solution.
K-means clustering
The k-means clustering problem attempts to divide n observations into k clusters using the Euclidean distance metric, with the objective of minimizing the variance (sum of squares) within each cluster. It is a method of vector quantization, and is useful for feature learning.
Lloyd’s algorithm (iterative cluster agglomeration with centroid updates) is the most common heuristic used to solve the problem, and is relatively efficient, but doesn’t guarantee global convergence. To improve that, people often run the algorithm multiple times using random initial cluster centroids generated by the Forgy or Random Partition methods.
K-means assumes spherical clusters that are separable so that the mean converges towards the cluster center, and also assumes that the ordering of the data points does not matter. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment.
The heuristics for solving k-means clusters are usually similar to the expectation-maximization (EM) algorithm for Gaussian mixture models.
Mixture models
Mixture models assume that the sub-populations of the observations correspond to some probability distribution, commonly Gaussian distributions for numeric observations or categorical distributions for non-numeric data. Each sub-population may have its own distribution parameters, for example mean and variance for Gaussian distributions.
Expectation maximization (EM) is one of the most popular techniques used to determine the parameters of a mixture with a given number of components. In addition to EM, mixture models can be solved with Markov chain Monte Carlo, moment matching, spectral methods with singular value decomposition (SVD), and graphical methods.
The original mixture model application was to separate two populations of shore crabs by forehead to body length ratios. Karl Pearson solved this problem in 1894 using moment matching.
A common extension of mixture models is to connect the latent variables defining the mixture component identities into a Markov chain instead of assuming that they are independent identically distributed random variables. The resulting model is called a hidden Markov model and is one of the most common sequential hierarchical models.
DBSCAN algorithm
Density-based spatial clustering of applications with noise (DBSCAN) is a non-parametric data-clustering algorithm that dates from 1996. It is optimized for use with databases that can accelerate geometric region queries using an R* tree or some other geometric index structure.
Essentially, DBSCAN clusters core points that have more than some minimum number of neighbors within some distance Epsilon, discards as outliers points that have no neighbors within Epsilon, and adds points that are within Epsilon of a core point to that cluster. DBSCAN is one of the most common clustering algorithms, and can find arbitrarily shaped clusters.
OPTICS algorithm
Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data. OPTICS is similar to DBSCAN, but handles the case of varying point density.
Variations of the ideas in DBSCAN and OPTICS can also be used for simple outlier and noise detection and removal.
Latent variable models
A latent variable model is a statistical model that relates a set of observable variables to a set of latent (hidden) variables. Latent variable models are useful for revealing hidden structures in complex and high-dimensional data.
Principal component analysis
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated numeric variables into a set of values of linearly uncorrelated variables called principal components. Karl Pearson invented PCA in 1901. PCA can be accomplished by eigenvalue decomposition of a data covariance (or correlation) matrix, or singular value decomposition (SVD) of a data matrix, usually after a normalization step of the initial data.
Singular value decomposition
The singular value decomposition (SVD) is a factorization of a real or complex matrix. It’s a common technique in linear algebra, and is often computed using Householder transformations. SVD is one way to solve for principal components. While it’s perfectly possible to code SVD from scratch, there are good implementations in all of the linear algebra libraries.
Method of moments
The method of moments uses the moments of the observed data sample (mean, variance, skewness, and kurtosis) to estimate population parameters. The method is quite simple, can often be calculated by hand, and usually achieves global convergence. In the case of low statistics, however, the method of moments can sometimes produce estimates that are outside the parameter space. The method of moments is an easy way to solve mixture models (above).
Expectation-maximization algorithms
An expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood estimates of parameters in models that depend on unobserved latent variables. The EM iteration alternates between performing an expectation step (E), which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization step (M), which computes parameters maximizing the expected log-likelihood found on the E step.
EM converges to a maximum or saddle point, but not necessarily to the global maximum. You can increase the chance of finding the global maximum by repeating the EM procedure from many random initial estimates for the parameters, or by using the method of moments to determine the initial estimates.
EM applied to a Gaussian mixture model (above) can be used for cluster analysis.
Unsupervised neural networks
Neural networks are usually trained on labeled data for classification or regression, which is by definition supervised machine learning. They can also be trained on unlabeled data, using various unsupervised schemes.
Autoencoders
Autoencoders are neural networks that are trained on their inputs. Essentially, the autoencoder is a feed-forward network that acts as a codec, encoding its input from the input layer to one or more hidden layers with a lower neuron count, and then decoding the encoded representation to an output layer with the topology as the input.
During training the autoencoder uses back propagation to minimize the difference between the input and the output. Autoencoders have been used for dimensionality reduction, feature learning, de-noising, anomaly detection, image processing, and for learning generative models.
Deep belief networks
Deep belief networks (DBNs) are stacks of autoencoders or restricted Boltzmann machines (RBNs) that can learn to reconstruct their inputs. The layers then act as feature detectors. RBNs are usually trained using contrastive divergence.
DBNs have been used for generating and recognizing images, video sequences, and motion-capture data.
Generative adversarial networks
Generative adversarial networks (GANs) simultaneously train two networks, a generative model that captures the data distribution and a discriminative model that estimates the probability that a sample came from the training data. The training attempts to maximize the probability that the generator can fool the discriminator.
GANs can be used to create photos of imaginary people and improve astronomical images. GANs have also been used to up-scale textures from old video games for use in high-resolution versions of the games. Outside of unsupervised learning, GANs have been successfully applied to reinforcement learning of game playing.
Self-organizing map
The self-organizing map (SOM) defines an ordered mapping from a set of given data items onto a regular, usually two-dimensional grid. A model is associated with each grid node. A data item will be mapped into the node whose model is most similar to the data item, i.e., has the smallest distance from the data item in some metric.
There are a number of precautions that you need to take to ensure that the mappings are stable and well ordered. Not all commercial implementations follow all of the precautions.