Neural architecture search is the task of automatically finding one or more architectures for a neural network that will yield models with good results (low losses), relatively quickly, for a given dataset. Neural architecture search is currently an emergent area. There is a lot of research going on, there are many different approaches to the task, and there isn’t a single best method generally — or even a single best method for a specialized kind of problem such as object identification in images.
Neural architecture search is an aspect of AutoML, along with feature engineering, transfer learning, and hyperparameter optimization. It’s probably the hardest machine learning problem currently under active research; even the evaluation of neural architecture search methods is hard. Neural architecture search research can also be expensive and time-consuming. The metric for the search and training time is often given in GPU-days, sometimes thousands of GPU-days.
The motivation for improving neural architecture search is fairly obvious. Most of the advances in neural network models, for example in image classification and language translation, have required considerable hand-tuning of the neural network architecture, which is time-consuming and error-prone. Even compared to the cost of high-end GPUs on public clouds, the cost of data scientists is very high, and their availability tends to be low.
Evaluating neural architecture search
As multiple authors (for example Lindauer and Hutter, Yang et al., and Li and Talwalkar) have observed, many neural architecture search (NAS) studies are irreproducible, for any of several reasons. Additionally, many neural architecture search algorithms either fail to outperform random search (with early termination criteria applied) or were never compared to a useful baseline.
Yang et al. showed that many neural architecture search techniques struggle to significantly beat a randomly sampled average architecture baseline. (They called their paper “NAS evaluation is frustratingly hard.”) They also provided a repository that includes the code used to evaluate neural architecture search methods on several different datasets as well as the code used to augment architectures with different protocols.
Lindauer and Hutter have proposed a NAS best practices checklist based on their article (also referenced above):
Best practices for releasing code
For all experiments you report, check if you released:
_ Code for the training pipeline used to evaluate the final architectures
_ Code for the search space
_ The hyperparameters used for the final evaluation pipeline, as well as random seeds
_ Code for your NAS method
_ Hyperparameters for your NAS method, as well as random seeds

Note that the easiest way to satisfy the first three of these is to use existing NAS benchmarks, rather than changing them or introducing new ones.
Best practices for comparing NAS methods
_ For all NAS methods you compare, did you use exactly the same NAS benchmark, including the same dataset (with the same training-test split), search space and code for training the architectures and hyperparameters for that code?
_ Did you control for confounding factors (different hardware, versions of DL libraries, different runtimes for the different methods)?
_ Did you run ablation studies?
_ Did you use the same evaluation protocol for the methods being compared?
_ Did you compare performance over time?
_ Did you compare to random search?
_ Did you perform multiple runs of your experiments and report seeds?
_ Did you use tabular or surrogate benchmarks for in-depth evaluations?

Best practices for reporting important details
_ Did you report how you tuned hyperparameters, and what time and resources this required?
_ Did you report the time for the entire end-to-end NAS method (rather than, e.g., only for the search phase)?
_ Did you report all the details of your experimental setup?
It’s worth discussing the term “ablation studies” mentioned in the second group of criteria. Ablation studies originally referred to the surgical removal of body tissue. When applied to the brain, ablation studies (generally prompted by a serious medical condition, with the research done after the surgery) help to determine the function of parts of the brain.
In neural network research, ablation means removing features from neural networks to determine their importance. In NAS research, it refers to removing features from the search pipeline and training techniques, including hidden components, again to determine their importance.
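To make the idea concrete, here is a minimal sketch of what an ablation study over a NAS pipeline might look like. The function and the component flags are hypothetical stand-ins for whatever search-and-train code and components you actually use.

```python
# A minimal sketch of an ablation study for a NAS pipeline. run_pipeline and
# its flags are hypothetical placeholders, not a real API.
def run_pipeline(weight_sharing=True, cutout=True, auxiliary_head=True):
    """Stand-in for the full search + training pipeline; returns validation accuracy."""
    return 0.0  # replace with the real pipeline

baseline = run_pipeline()
for component in ("weight_sharing", "cutout", "auxiliary_head"):
    ablated = run_pipeline(**{component: False})   # remove one component at a time
    print(f"without {component}: accuracy change {ablated - baseline:+.3f}")
```

Each run measures how much a single component contributes to the final result, which is exactly the information the checklist above asks authors to report.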
Neural architecture search methods
Elsken et al. (2018) surveyed neural architecture search methods and categorized them in terms of search space, search strategy, and performance estimation strategy. Search spaces can cover whole architectures, layer by layer (macro search), or can be restricted to assembling pre-defined cells (cell search). Architectures built from cells use a drastically reduced search space; Zoph et al. (2018) estimate a 7x speedup.
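The following sketch illustrates the difference between the two kinds of search space. It is made up for illustration, not any paper’s exact space: cell search only has to choose the operations inside one small cell, which is then stacked to build the full network, while macro search chooses an operation for every layer.

```python
# Illustrative (made-up) search space: cell search vs. macro search.
import random

CANDIDATE_OPS = ["identity", "sep_conv_3x3", "sep_conv_5x5",
                 "max_pool_3x3", "avg_pool_3x3"]

def sample_cell(num_nodes=4):
    """A cell is a tiny DAG: each node picks an op and one earlier node as input."""
    return [(random.choice(CANDIDATE_OPS), random.randrange(node))
            for node in range(1, num_nodes + 1)]

def sample_macro_architecture(num_layers=20):
    """Macro search instead chooses an operation for every layer of the network."""
    return [random.choice(CANDIDATE_OPS) for _ in range(num_layers)]

cell = sample_cell()  # searched once, then stacked (e.g., 20 times) to build the model
```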
Search strategies for neural architectures include random search, Bayesian optimization, evolutionary methods, reinforcement learning, and gradient-based methods. There have been indications of success for all of these approaches, but none have really stood out.
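Random search, the baseline every other strategy is measured against, is as simple as the loop below. It reuses sample_cell from the sketch above; evaluate() is a stand-in for training and validating a candidate architecture.

```python
# A bare-bones random-search baseline over the toy search space sketched above.
import random

def evaluate(arch):
    """Stand-in: train the candidate briefly and return validation accuracy."""
    return random.random()

best_arch, best_score = None, float("-inf")
for _ in range(100):                      # fixed search budget
    arch = sample_cell()                  # sample a candidate from the search space
    score = evaluate(arch)
    if score > best_score:
        best_arch, best_score = arch, score
print("best score:", best_score)
```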
The simplest way of estimating performance for neural networks is to train and validate the networks on data. Unfortunately, this can lead to computational demands on the order of thousands of GPU-days for neural architecture search. Ways of reducing the computation include lower-fidelity estimates (fewer epochs of training, less data, and downscaled models); learning curve extrapolation (based on just a few epochs); warm-started training (initializing weights by copying them from a parent model); and one-shot models with weight sharing (the subgraphs use the weights from the one-shot model). All of these methods can reduce the training time to a few GPU-days rather than a few thousand GPU-days. The biases introduced by these approximations aren’t yet well understood, however.
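A lower-fidelity estimate might look like the sketch below: rank candidates by training a downscaled model on a fraction of the data for a few epochs. Every helper here (build_model, train, validate) is a placeholder for your own code, not a real API.

```python
# Sketch of a lower-fidelity performance estimate: downscaled model, less data,
# fewer epochs. All helpers are placeholders.
def build_model(arch, width_multiplier=1.0):
    return {"arch": arch, "width": width_multiplier}   # stand-in for a real network

def train(model, data, epochs):
    pass                                               # stand-in for a short training run

def validate(model, data):
    return 0.0                                         # stand-in for validation accuracy

def low_fidelity_score(arch, data, epochs=5, subset_fraction=0.1):
    model = build_model(arch, width_multiplier=0.5)    # downscaled model
    subset = data[: int(len(data) * subset_fraction)]  # less data
    train(model, subset, epochs=epochs)                # fewer epochs of training
    return validate(model, subset)
```

The cheap score is only used to rank candidates; the best architectures still get a full training run at the end.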
Microsoft’s Project Petridish
Microsoft Research claims to have developed a new approach to neural architecture search that adds shortcut connections to existing network layers and uses weight-sharing. The added shortcut connections effectively perform gradient boosting on the augmented layers. They call this Project Petridish.
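This is not Project Petridish’s code, but for readers unfamiliar with the term, the sketch below shows in PyTorch what adding a shortcut (skip) connection around an existing layer looks like in general: the layer’s input is added to its output.

```python
# Generic illustration of a shortcut (skip) connection, not Petridish's implementation.
import torch
import torch.nn as nn

class WithShortcut(nn.Module):
    """Wraps an existing block so its input is added to its output."""
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return self.block(x) + x        # shortcut: output = block(x) + identity

layer = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
augmented = WithShortcut(layer)         # the original layer plus a shortcut connection
out = augmented(torch.randn(1, 16, 32, 32))
```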
This method supposedly reduces the training time to a few GPU-days rather than a few thousand GPU-days, and supports warm-started training. According to the researchers, the method works well on both cell search and macro search.
The experimental results quoted were pretty good for the CIFAR-10 image dataset, but nothing special for the Penn Treebank language dataset. Project Petridish sounds interesting in isolation, but without a detailed comparison to the other speedup methods we’ve discussed, it’s not clear whether it represents a major improvement for neural architecture search or just another way to get to the same place.