To use numeric data for machine learning regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges might dominate the Euclidean distance between feature vectors, their effects could be magnified at the expense of the other fields, and the steepest descent optimization might have difficulty converging. There are a number of ways to normalize and standardize data for machine learning, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling.
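As an illustration, here is a minimal sketch of min-max normalization and standardization with scikit-learn; the feature matrix and its column ranges are made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: two columns with very different ranges
X = np.array([[1.0,  50_000.0],
              [2.0,  80_000.0],
              [3.0, 120_000.0]])

# Min-max normalization: rescale each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: rescale each column to zero mean and unit variance
X_standard = StandardScaler().fit_transform(X)
```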
Feature engineering for machine learning
A feature is an individual measurable property or characteristic of a phenomenon being observed. The concept of a “feature” is related to that of an explanatory variable, which is used in statistical techniques such as linear regression. Feature vectors combine all the features for a single row into a numerical vector.
Part of the art of choosing features is to pick a minimum set of independent variables that explain the problem. If two variables are highly correlated, either they need to be combined into a single feature, or one should be dropped. Sometimes people perform principal component analysis to convert correlated variables into a set of linearly uncorrelated variables.
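For example, here is a minimal sketch of principal component analysis with scikit-learn, using synthetic correlated columns as stand-ins for real features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic example: two highly correlated columns (the same length
# measured in different units, plus a little noise)
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=(200, 1))
height_in = height_cm / 2.54 + rng.normal(0, 0.2, size=(200, 1))
X = np.hstack([height_cm, height_in])

# Standardize, then keep enough components to explain 95% of the variance;
# the correlated pair collapses to essentially a single component
X_scaled = StandardScaler().fit_transform(X)
components = PCA(n_components=0.95).fit_transform(X_scaled)
print(components.shape)
```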
Some of the transformations that people use to construct new features or reduce the dimensionality of feature vectors are simple. For example, subtract Year of Birth from Year of Death and you construct Age at Death, which is a prime independent variable for lifetime and mortality analysis. In other cases, feature construction may not be so obvious.
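In pandas, for instance, that kind of derived feature is a one-line operation (the column names here are just placeholders):

```python
import pandas as pd

# Hypothetical records with birth and death years
df = pd.DataFrame({"year_of_birth": [1900, 1925, 1950],
                   "year_of_death": [1970, 2001, 2020]})

# Construct the derived feature: Age at Death = Year of Death - Year of Birth
df["age_at_death"] = df["year_of_death"] - df["year_of_birth"]
```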
Splitting data for machine learning
The usual practice for supervised machine learning is to split the data set into subsets for training, validation, and test. One way of working is to assign 80% of the data to the training data set, and 10% each to the validation and test data sets. (The exact split is a matter of preference.) The bulk of the training is done against the training data set, and prediction is done against the validation data set at the end of every epoch.
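Here is a minimal sketch of such an 80/10/10 split with scikit-learn; the synthetic feature matrix and target are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows, 5 features, one numeric target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# Hold out 20%, then split that holdout in half,
# giving roughly 80% training, 10% validation, and 10% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```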
The errors in the validation data set can be used to identify stopping criteria, or to drive hyperparameter tuning. Most importantly, the errors in the validation data set can help you find out whether the model has overfit the training data.
Prediction against the test data set is typically done on the final model. If the test data set was never used for training, it is sometimes called the holdout data set.
There are several other schemes for splitting the data. One common technique, cross-validation, involves repeatedly splitting the full data set into a training data set and a validation data set. At the end of each epoch, the data is shuffled and split again.
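As one concrete variant, here is a minimal sketch of 5-fold cross-validation with scikit-learn; the ridge-regression model and synthetic data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data: 200 rows, 4 features, numeric target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation: each fold takes a turn as the validation set
# while the model trains on the remaining folds
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```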
AutoML and hyperparameter optimization
AutoML and hyperparameter optimization are ways of getting the computer to try many models and identify the best one. With AutoML (as usually defined) the computer tries all of the appropriate machine learning models, and may also try all of the appropriate feature engineering and feature scaling techniques. With hyperparameter optimization, you typically define which hyperparameters you would like to sweep for a specific model—such as the number of hidden layers, the learning rate, and the dropout rate—and the range you would like to sweep for each.
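For example, here is a minimal sketch of a hyperparameter sweep with scikit-learn's GridSearchCV over a small multilayer perceptron; the parameter grid, and the use of an L2 penalty in place of dropout (which MLPRegressor does not expose), are assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic data: 300 rows, 5 features, numeric target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=300)

# Define which hyperparameters to sweep and the range for each
param_grid = {
    "hidden_layer_sizes": [(16,), (32,), (32, 16)],  # number and size of hidden layers
    "learning_rate_init": [1e-3, 1e-2],              # learning rate
    "alpha": [1e-4, 1e-2],                           # L2 penalty (stand-in for dropout)
}

search = GridSearchCV(MLPRegressor(max_iter=2000, random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```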
Google has a different definition for Google Cloud AutoML. Instead of trying every appropriate machine learning model, it attempts to customize a relevant deep learning model (vision, translation, or natural language) using deep transfer learning. Azure Machine Learning Service offers similar transfer learning services by different names: custom vision, customizable speech and translation, and custom search.
Machine learning in the cloud
You can run machine learning and deep learning on your own machines or in the cloud. AWS, Azure, and Google Cloud all offer machine learning services that you can use on demand, and they offer hardware accelerators on demand as well.
While there are free tiers on all three services, you may eventually run up monthly bills, especially if you use large instances with GPUs, TPUs, or FPGAs. You need to balance this operating cost against the capital cost of buying your own workstation-class computers and GPUs. If you need to train a lot of models on a consistent basis, then buying at least one GPU for your own use makes sense.
The big advantage of using the cloud for machine learning and deep learning is that you can spin up significant resources in a matter of minutes, run your training quickly, and then release the cloud resources. Also, all three major clouds offer machine learning and deep learning services that don’t require a Ph.D. in data science to run. You have the option of using their pre-trained models, customizing those models for your own use, or creating your own models with any of the major machine learning and deep learning frameworks, such as Scikit-learn, PyTorch, and TensorFlow.
There are also free options for running machine learning and deep learning Jupyter notebooks: Google Colab and Kaggle (recently acquired by Google). Colab offers a choice of CPU, GPU, and TPU instances. Kaggle offers CPU and GPU instances, along with competitions, data sets, and shared kernels.
Machine learning in more depth
You can learn a lot about machine learning and deep learning simply by installing one of the deep learning packages, trying out its samples, and reading its tutorials. For more depth, consider one or more of the following resources.
- Neural Networks and Deep Learning by Michael Nielsen
- A Brief Introduction to Neural Networks by David Kriesel
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- A Course in Machine Learning by Hal Daumé III
- TensorFlow Playground by Daniel Smilkov and Shan Carter
- Stanford Computer Science CS231n: Convolutional Neural Networks for Visual Recognition