Machine learning defined
Machine learning is a branch of artificial intelligence that includes methods, or algorithms, for automatically creating models from data. Unlike a system that performs a task by following explicit rules, a machine learning system learns from experience. Whereas a rule-based system will perform a task the same way every time (for better or worse), the performance of a machine learning system can be improved through training, by exposing the algorithm to more data.
Machine learning algorithms are often divided into supervised (the training data are tagged with the answers) and unsupervised (any labels that may exist are not shown to the training algorithm). Supervised machine learning problems are further divided into classification (predicting non-numeric answers, such as whether a mortgage payment will be missed, often along with a probability for each class) and regression (predicting numeric answers, such as the number of widgets that will sell next month in your Manhattan store).
Unsupervised learning is further divided into clustering (finding groups of similar objects, such as running shoes, walking shoes, and dress shoes), association (finding items that commonly occur together, such as coffee and cream), and dimensionality reduction (projection, feature selection, and feature extraction).
Applications of machine learning
We hear about applications of machine learning on a daily basis, although not all of them are unalloyed successes. Self-driving cars are a good example, where tasks range from simple and successful (parking assist and highway lane following) to complex and iffy (full vehicle control in urban settings, which has led to several deaths).
Game-playing machine learning has been strikingly successful for checkers, chess, shogi, and Go, having beaten human world champions. Automatic language translation has been largely successful, although some language pairs work better than others, and many automatic translations can still be improved by human translators.
Automatic speech to text works fairly well for people with mainstream accents, but not so well for people with some strong regional or national accents; performance depends on the training sets used by the vendors. Automatic sentiment analysis of social media has a reasonably good success rate, probably because the training sets (e.g. Amazon product ratings, which couple a comment with a numerical score) are large and easy to access.
Automatic screening of résumés is a controversial area. Amazon had to withdraw its internal system because of training sample biases that caused it to downgrade job applications from women.
Other résumé screening systems currently in use may have training biases that cause them to upgrade candidates who are “like” current employees in ways that legally aren’t supposed to matter (e.g. young, white, male candidates from upscale English-speaking neighborhoods who played team sports are more likely to pass the screening). Research efforts by Microsoft and others focus on eliminating implicit biases in machine learning.
Automatic classification of pathology and radiology images has advanced to the point where it can assist (but not replace) pathologists and radiologists for the detection of certain kinds of abnormalities. Meanwhile, facial identification systems are controversial when they work well (because of privacy considerations), and they also tend not to be as accurate for women and people of color as they are for white males (because of biases in the training population).
Machine learning algorithms
Machine learning depends on a number of algorithms for turning a data set into a model. Which algorithm works best depends on the kind of problem you’re solving, the computing resources available, and the nature of the data. No matter what algorithm or algorithms you use, you’ll first need to clean and condition the data.
Let’s discuss the most common algorithms for each kind of problem.
Classification algorithms
A classification problem is a supervised learning problem that asks for a choice between two or more classes, usually providing probabilities for each class. Leaving out neural networks and deep learning, which require a much higher level of computing resources, the most common algorithms are Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbors, and Support Vector Machine (SVM). You can also use ensemble methods (combinations of models), such as Random Forest, other Bagging methods, and boosting methods such as AdaBoost and XGBoost.
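As one illustration of the algorithms listed above, here is a minimal sketch of K-Nearest Neighbors in plain Python. The function names and the toy data are assumptions made up for this example; a real project would normally use a library implementation and properly scaled features.

```python
import math
from collections import Counter

def knn_predict_proba(train_points, train_labels, query, k=3):
    """Return {label: fraction of the k nearest neighbors carrying that label}."""
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    counts = Counter(nearest_labels)
    return {label: count / len(nearest_labels) for label, count in counts.items()}

def knn_predict(train_points, train_labels, query, k=3):
    """Pick the class with the highest neighbor fraction."""
    proba = knn_predict_proba(train_points, train_labels, query, k)
    return max(proba, key=proba.get)

# Hypothetical toy data: two features per example, two classes.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction = knn_predict(points, labels, (1.1, 1.0))
```

The neighbor fractions act as rough per-class probabilities, which matches the idea of a classifier that reports a probability for each class rather than only a hard choice.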
Regression algorithms
A regression problem is a supervised learning problem that asks the model to predict a number. The simplest and fastest algorithm is linear (least squares) regression, but you shouldn’t stop there, because it often gives you a mediocre result. Other common machine learning regression algorithms (short of neural networks) include Naive Bayes, Decision Tree, K-Nearest Neighbors, LVQ (Learning Vector Quantization), LARS Lasso, Elastic Net, Random Forest, AdaBoost, and XGBoost. You’ll notice that there is some overlap between machine learning algorithms for regression and classification.
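To make the baseline concrete, the following sketch fits a one-variable least squares line using the closed-form solution. The monthly sales figures are invented for illustration, and `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    return slope * x + intercept

# Hypothetical data: month number versus widgets sold.
months = [1, 2, 3, 4, 5]
widgets_sold = [12, 15, 15, 19, 20]

slope, intercept = fit_line(months, widgets_sold)
forecast = predict(slope, intercept, 6)  # estimate for month 6
```

A straight line like this is quick to fit, which is why it makes a useful baseline before trying the other algorithms.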
Clustering algorithms
A clustering problem is an unsupervised learning problem that asks the model to find groups of similar data points. The most popular algorithm is K-Means Clustering; others include Mean-Shift Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), GMM (Gaussian Mixture Models), and HAC (Hierarchical Agglomerative Clustering).
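The sketch below shows the two alternating steps of K-Means Clustering (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is a simplified version under stated assumptions: a fixed iteration count, initial centroids sampled from the data, and made-up two-dimensional points.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, assignments) after a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments

# Hypothetical data: two well-separated groups of points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(data, k=2)
```

Note that no labels are supplied: the cluster numbers are arbitrary, and only the grouping itself carries meaning.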
Dimensionality reduction algorithms
Dimensionality reduction is an unsupervised learning problem that asks the model to drop or combine variables that have little or no effect on the result. This is often used in combination with classification or regression. Dimensionality reduction algorithms include removing variables with many missing values, removing variables with low variance, Decision Tree, Random Forest, removing or combining variables with high correlation, Backward Feature Elimination, Forward Feature Selection, Factor Analysis, and PCA (Principal Component Analysis).
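One of the simplest techniques in that list, removing variables with low variance, can be sketched as follows. The threshold value, column names, and data are assumptions chosen for illustration.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=0.01):
    """Keep only the columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))  # transpose rows into columns
    keep = [i for i, column in enumerate(columns) if pvariance(column) > threshold]
    kept_names = [names[i] for i in keep]
    reduced = [[row[i] for i in keep] for row in rows]
    return kept_names, reduced

# Hypothetical data: the first column never changes, so it carries no signal.
names = ["constant", "height", "score"]
rows = [
    [1.0, 5.0, 0.2],
    [1.0, 7.0, 0.9],
    [1.0, 6.0, 0.4],
]

kept_names, reduced = drop_low_variance(rows, names)
```

In practice the features should be on comparable scales before a single variance threshold is applied, since variance depends on the units of each column.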
Optimization methods
Training and evaluation turn supervised learning algorithms into models by optimizing their parameter weights to find the set of values that best matches the ground truth of your data. The algorithms often rely on variants of steepest descent for their optimizers, for example stochastic gradient descent (SGD), which is essentially steepest descent in which each step estimates the gradient from a randomly chosen example or small batch of examples rather than from the full data set.
Common refinements on SGD add factors that correct the direction of the gradient based on momentum, or adjust the learning rate based on progress from one pass through the data (called an epoch) to the next.
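A minimal sketch of mini-batch SGD with a momentum term, fitting a line y = w * x + b, might look like the following. The hyperparameter values, the function name, and the noise-free toy data are assumptions for illustration; learning-rate schedules are omitted to keep the example short.

```python
import random

def sgd_momentum(xs, ys, learning_rate=0.02, momentum=0.9,
                 epochs=500, batch_size=2, seed=0):
    """Fit y = w * x + b by minimizing mean squared error with SGD plus momentum."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    velocity_w, velocity_b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):              # one epoch = one pass through the data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, estimated on this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Momentum: blend the previous update direction with the new gradient.
            velocity_w = momentum * velocity_w - learning_rate * grad_w
            velocity_b = momentum * velocity_b - learning_rate * grad_b
            w += velocity_w
            b += velocity_b
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.75, 2.5, 3.25, 4.0]
w, b = sgd_momentum(xs, ys)
```

The velocity terms carry information from earlier steps, which smooths out the noise that comes from estimating the gradient on only a few examples at a time.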
Neural networks and deep learning
Neural networks were inspired by the architecture of the biological visual cortex. Deep learning is a set of techniques for learning in neural networks that involves a large number of “hidden” layers to identify features. Hidden layers come between the input and output layers. Each layer is made up of artificial neurons, often with sigmoid or ReLU (Rectified Linear Unit) activation functions.
In a feed-forward network, the neurons are organized into distinct layers: one input layer, any number of hidden processing layers, and one output layer, and the outputs from each layer go only to the next layer.
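The layer-by-layer flow can be sketched in a few lines. The weights below are hand-picked, hypothetical values rather than trained ones, and the helper names are made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation):
    """Each neuron takes a weighted sum of all inputs, adds a bias, then activates."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def feed_forward(inputs, layers):
    """Pass activations through the layers in order; outputs go only to the next layer."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden ReLU neurons -> 1 sigmoid output.
network = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [-0.5], sigmoid),
]

output = feed_forward([2.0, 0.5], network)
```

For the input shown, the hidden layer produces 1.5 and 0.0, so the single output neuron applies the sigmoid to 1.0.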
In a feed-forward network with shortcut connections, some connections can jump over one or more intermediate layers. In recurrent neural networks, neurons can influence themselves, either directly, or indirectly through the next layer.
Supervised learning of a neural network is done just like any other machine learning: You present the network with groups of training data, compare the network output with the desired output, generate an error vector, and apply corrections to the network based on the error vector, usually using a backpropagation algorithm. Batches of training data that are run together before applying corrections are called mini-batches; one complete pass through all of the training data is called an epoch.
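To show what "apply corrections based on the error" means in practice, here is a sketch of a single backpropagation step for a deliberately tiny network: one input, one sigmoid hidden neuron, and one linear output, with a squared-error loss. The starting weights, the learning rate, and all names are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the hidden activation and the network output."""
    h = sigmoid(params["w1"] * x + params["b1"])
    y = params["w2"] * h + params["b2"]
    return h, y

def loss(params, x, target):
    _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backprop(params, x, target):
    """Propagate the output error backward with the chain rule."""
    h, y = forward(params, x)
    error = y - target                    # derivative of the loss w.r.t. the output
    grad_h = error * params["w2"]         # error pushed back to the hidden neuron
    grad_z = grad_h * h * (1.0 - h)       # through the sigmoid derivative
    return {"w1": grad_z * x, "b1": grad_z, "w2": error * h, "b2": error}

def apply_corrections(params, grads, learning_rate=0.1):
    """Move every parameter a small step against its gradient."""
    return {name: value - learning_rate * grads[name] for name, value in params.items()}

# Hypothetical starting weights and a single training example.
params = {"w1": 0.5, "b1": 0.0, "w2": -0.3, "b2": 0.1}
x, target = 1.0, 1.0

grads = backprop(params, x, target)
updated = apply_corrections(params, grads)
```

Real networks repeat this step over many examples and many layers, but the pattern of forward pass, error, backward pass, and update is the same.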
As with all machine learning, you need to check the predictions of the neural network against a separate test data set. Without doing that you risk creating neural networks that only memorize their inputs instead of learning to be generalized predictors.
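The danger of memorization can be shown with a deliberately extreme sketch. Here a lookup table stands in for a model that has only memorized its training inputs, and a hand-written rule stands in for a model that has learned the underlying pattern; neither is a real trained network, and the data and names are made up.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)

# Hypothetical data: (input, label) pairs with unique inputs.
examples = [(n, "even" if n % 2 == 0 else "odd") for n in range(20)]
train, test = train_test_split(examples)

# Stand-in for an overfit model: it can only repeat what it saw in training.
memory = dict(train)

def memorizer(x):
    return memory.get(x, "unknown")

# Stand-in for a model that captured the actual rule.
def generalizer(x):
    return "even" if x % 2 == 0 else "odd"

train_score = accuracy(memorizer, train)
test_score = accuracy(memorizer, test)
```

The memorizer looks perfect on the training set and fails on every held-out example, which is exactly the gap that a separate test set is meant to expose.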
The breakthrough in the neural network field for vision was Yann LeCun’s 1998 LeNet-5, a seven-level convolutional neural network (CNN) for recognition of handwritten digits digitized in 32x32 pixel images. To analyze higher-resolution images, the network would need more neurons and more layers.
Convolutional neural networks typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex. The convolutional layer basically computes weighted sums over many small overlapping regions of its input. The pooling layer performs a form of non-linear down-sampling. ReLU layers, which I mentioned earlier, apply the non-saturating activation function f(x) = max(0, x).
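A minimal sketch of the convolution, ReLU, and pooling steps on a tiny grayscale image follows. The image and the kernel are hypothetical values chosen so the result is easy to check by hand; stride, padding, and multiple channels are left out.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each output is a weighted sum of one patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu_map(feature_map):
    """Apply f(x) = max(0, x) to every value."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Down-sample by keeping the largest value in each size x size block."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical 5x5 image with a dark band in the middle columns.
image = [[1, 1, 0, 0, 1] for _ in range(5)]
# Hypothetical kernel that responds to dark-to-light edges (left to right).
kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, kernel)
pooled = max_pool(relu_map(features))
```

The convolution yields a negative response at the light-to-dark edge and a positive one at the dark-to-light edge; ReLU discards the negative response, and pooling halves the resolution of what remains.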
In a fully connected layer, the neurons have full connections to all activations in the previous layer. A loss layer computes how the network training penalizes the deviation between the predicted and true labels, using a Softmax or cross-entropy loss for classification or a Euclidean loss for regression.
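The loss functions named above can be sketched as follows. The scaling of the Euclidean loss varies between frameworks; the factor of one half used here is one common convention and is an assumption of this example, as are the function names.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Penalty for classification: negative log probability of the true class."""
    return -math.log(probabilities[true_index])

def euclidean_loss(predicted, target):
    """Penalty for regression: half the sum of squared differences."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predicted, target))

# Hypothetical network outputs.
probs = softmax([2.0, 1.0, 0.1])
classification_loss = cross_entropy(probs, true_index=0)
regression_loss = euclidean_loss([1.0, 2.0], [1.0, 4.0])
```

The cross-entropy loss shrinks toward zero as the probability assigned to the correct class approaches 1, and grows without bound as that probability approaches 0.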
Natural language processing (NLP) is another major application area for deep learning. In addition to the machine translation problem addressed by Google Translate, major NLP tasks include automatic summarization, co-reference resolution, discourse analysis, morphological segmentation, named entity recognition, natural language generation, natural language understanding, part-of-speech tagging, sentiment analysis, and speech recognition.
In addition to CNNs, NLP tasks are often addressed with recurrent neural networks (RNNs), which include the Long Short-Term Memory (LSTM) model.
The more layers there are in a deep neural network, the more computation it takes to train the model on a CPU. Hardware accelerators for neural networks include GPUs, TPUs, and FPGAs.
Reinforcement learning
Reinforcement learning trains an actor or agent to respond to an environment in a way that maximizes some value, usually by trial and error. That’s different from supervised and unsupervised learning, but is often combined with them.
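As a small illustration of learning by trial and error, the sketch below uses tabular Q-learning, one common reinforcement learning algorithm (the article does not single out any particular one). The corridor environment, the reward, and every hyperparameter are assumptions invented for this example.

```python
import random

N_STATES = 5                  # cells 0..4 in a corridor; cell 4 is the goal
ACTIONS = ("left", "right")

def step(state, action):
    """Environment: move one cell; the only reward is 1.0 for reaching the goal."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: usually exploit the best known action, sometimes explore."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    best = max(q_row)
    return rng.choice([a for a, value in enumerate(q_row) if value == best])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2,
               max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # estimated value of each action
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # Target: immediate reward plus discounted value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = q_learning()
policy = [
    ACTIONS[max(range(len(ACTIONS)), key=lambda a: q_table[s][a])]
    for s in range(N_STATES - 1)
]
```

The agent is never told that moving right is correct; it discovers that from the rewards it collects, which is what separates this setup from supervised learning.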
For example, DeepMind’s AlphaGo, in order to learn to play (the action) the game of Go (the environment), first learned to mimic human Go players from a large data set of historical games (apprenticeship learning). It then improved its play by trial and error (reinforcement learning), by playing large numbers of Go games against independent instances of itself.
Robotic control is another problem that has been attacked with deep reinforcement learning methods, meaning reinforcement learning plus deep neural networks, the deep neural networks often being CNNs trained to extract features from video frames.
How to use machine learning
How does one go about creating a machine learning model? You start by cleaning and conditioning the data, continue with feature engineering, and then try every machine-learning algorithm that makes sense. For certain classes of problem, such as vision and natural language processing, the algorithms that are likely to work involve deep learning.
Data cleaning for machine learning
There is no such thing as clean data in the wild. To be useful for machine learning, data must be aggressively filtered. For example, you’ll want to:
- Look at the data and exclude any columns that have a lot of missing data.
- Look at the data again and pick the columns you want to use (feature selection) for your prediction. This is something you may want to vary when you iterate.
- Exclude any rows that still have missing data in the remaining columns.
- Correct obvious typos and merge equivalent answers. For example, U.S., US, USA, and America should be merged into a single category.
- Exclude rows that have data that is out of range. For example, if you’re analyzing taxi trips within New York City, you’ll want to filter out rows with pickup or drop-off latitudes and longitudes that are outside the bounding box of the metropolitan area.
There is a lot more you can do, but it will depend on the data collected. This can be tedious, but if you set up a data-cleaning step in your machine learning pipeline you can modify and repeat it at will.
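The steps listed above can be sketched as a single repeatable function. The column names, the missing-data threshold, the alias table, and the bounding box coordinates are all rough assumptions for illustration, not authoritative values; the feature selection step is left out because it depends on the problem.

```python
# Hypothetical lookup for merging equivalent answers.
COUNTRY_ALIASES = {"u.s.": "US", "us": "US", "usa": "US", "america": "US"}
# Rough, illustrative bounding box for the New York metropolitan area.
LAT_RANGE = (40.4, 41.0)
LON_RANGE = (-74.3, -73.6)

def clean(rows, max_missing_fraction=0.5):
    """Apply the cleaning steps in order and return the surviving rows."""
    columns = list(rows[0].keys())
    # Exclude columns that have a lot of missing data.
    kept = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing_fraction
    ]
    cleaned = []
    for original in rows:
        row = {c: original[c] for c in kept}
        # Exclude rows that still have missing data in the remaining columns.
        if any(value is None for value in row.values()):
            continue
        # Merge equivalent answers into a single category.
        country = row["country"].strip()
        row["country"] = COUNTRY_ALIASES.get(country.lower(), country)
        # Exclude rows whose coordinates fall outside the bounding box.
        in_lat = LAT_RANGE[0] <= row["pickup_lat"] <= LAT_RANGE[1]
        in_lon = LON_RANGE[0] <= row["pickup_lon"] <= LON_RANGE[1]
        if not (in_lat and in_lon):
            continue
        cleaned.append(row)
    return cleaned

# Hypothetical raw taxi-trip records.
raw = [
    {"country": "U.S.", "pickup_lat": 40.75, "pickup_lon": -73.99, "tip_note": None},
    {"country": "usa", "pickup_lat": 40.70, "pickup_lon": -74.01, "tip_note": None},
    {"country": "America", "pickup_lat": None, "pickup_lon": -73.95, "tip_note": "cash"},
    {"country": "US", "pickup_lat": 34.05, "pickup_lon": -118.24, "tip_note": None},
]

cleaned = clean(raw)
```

Keeping the rules in one function makes it easy to adjust a threshold or add a rule and then rerun the whole cleaning step on fresh data.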
Data encoding and normalization for machine learning
To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.
One is label encoding, which means that each text label value is replaced with a number. The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is ordered.
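Both encodings are simple enough to sketch by hand. The function names and the color values are made up for this example; in practice you would normally use the conversion functions that your framework provides.

```python
def label_encode(values):
    """Replace each text label with an integer code."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Turn each distinct label into its own 0/1 column."""
    categories = sorted(set(values))
    encoded = [[1 if v == category else 0 for category in categories] for v in values]
    return encoded, categories

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]

codes, code_categories = label_encode(colors)
one_hot, one_hot_categories = one_hot_encode(colors)
```

With label encoding, "red" becomes 2 and "blue" becomes 0, which a model may read as red being greater than blue. The one-hot columns carry no such implied order.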