PaddlePaddle, Baidu's open source framework for deep learning, is now compatible with the Kubernetes cluster management system to allow large models to be trained anywhere Kubernetes can run.
This doesn't simply expand the range of systems that can be used for PaddlePaddle training; it also opens the door to end-to-end deep learning workflows powered by both projects.
Training with the Big K
Deep learning models must be trained on a data set before they can produce useful results. Training can be processor-intensive and time-consuming, so spreading the work across a cluster of machines speeds it up. Baidu created Paddle (short for "PArallel Distributed Deep LEarning") to run across a cluster of machines to train models for tasks like machine translation and search-result ranking. The company then turned to Kubernetes to manage those clusters, with help from Kubernetes contributor CoreOS.
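To make the training process concrete, here is a minimal sketch of a PaddlePaddle training loop using the paddle 2.x Python API; the toy linear model and random data are illustrative only, not Baidu's actual workload:

```python
import paddle

# Illustrative toy model and data; a real PaddlePaddle job would read
# from a dataset and run many epochs, sharded across the cluster.
model = paddle.nn.Linear(in_features=13, out_features=1)
optimizer = paddle.optimizer.SGD(learning_rate=0.01,
                                 parameters=model.parameters())

x = paddle.randn([8, 13])   # a batch of 8 feature vectors
y = paddle.randn([8, 1])    # matching labels

for step in range(100):
    pred = model(x)
    loss = paddle.nn.functional.mse_loss(pred, y)
    loss.backward()          # compute gradients
    optimizer.step()         # apply the parameter update
    optimizer.clear_grad()   # reset gradients for the next step
```

In a multi-machine run, the same kind of script would be launched on every worker (for instance via python -m paddle.distributed.launch) so that gradients are synchronized across the cluster.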
Kubernetes can scale workloads on demand and schedule them according to their declared resource requirements, so it can make PaddlePaddle more efficient by packing together or spreading apart jobs based on their resource needs. "PaddlePaddle jobs that require GPUs [can be combined] with other jobs requiring different resources, like large memory or disk I/O throughput, on the same set of physical computers," wrote Baidu in its blog post describing the changes.
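As an illustration of how such resource-aware scheduling is expressed, here is a sketch using the official Kubernetes Python client; the job name, image, and resource figures are hypothetical, not taken from Baidu's setup:

```python
from kubernetes import client, config

config.load_kube_config()  # authenticate with the local kubeconfig

# A GPU-heavy training job: the scheduler can co-locate it with jobs
# requesting different resources (memory, disk I/O) on the same nodes.
container = client.V1Container(
    name="paddle-train",
    image="paddlepaddle/paddle",           # image name is illustrative
    command=["python", "train.py"],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "8Gi"},
        limits={"nvidia.com/gpu": "1"},    # ask for one GPU per pod
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="paddle-train"),
    spec=client.V1JobSpec(
        parallelism=4,                     # run four worker pods
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container],
                                  restart_policy="Never"),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Because the requests and limits are declared up front, the Kubernetes scheduler is free to place this GPU-bound job alongside, say, a memory-bound job on the same physical machine.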
Yi Wang, tech lead of PaddlePaddle at Baidu, noted in an email that Kubernetes provides fault tolerance as well. "During scaling, containers/processes of some jobs get killed, but we don't want these jobs to suspend or crash," said Wang. "Instead, we want them to continue running, even if it's at a slower speed."
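Wang's description doesn't spell out the mechanism, but a common pattern that fits it is periodic checkpointing to shared storage, so a rescheduled worker resumes where a killed one left off instead of restarting from scratch. A minimal sketch, assuming the paddle 2.x API and a hypothetical shared-volume path:

```python
import os
import paddle

CKPT = "/mnt/shared/model.pdparams"  # hypothetical shared-volume path

model = paddle.nn.Linear(13, 1)
optimizer = paddle.optimizer.SGD(learning_rate=0.01,
                                 parameters=model.parameters())

# If this container replaced one that was killed, pick up where it left off.
if os.path.exists(CKPT):
    model.set_state_dict(paddle.load(CKPT))

for step in range(1000):
    loss = paddle.nn.functional.mse_loss(model(paddle.randn([8, 13])),
                                         paddle.randn([8, 1]))
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
    if step % 100 == 0:
        paddle.save(model.state_dict(), CKPT)  # persist progress periodically
```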
Forward-thinking, forward-looking
Wang also noted that PaddlePaddle users were already inclined to use Kubernetes. That, in turn, has spurred Baidu to plan how PaddlePaddle could be built on top of it in the future.
"Many potential clients, especially those in traditional industries, are interested in running deep learning on their own on-premise clusters," wrote Wang. "We would like to work with the Kubernetes community to provide a complete solution -- including cluster setup, data pipeline building, and AI."
With deep learning frameworks, there's growing emphasis on separating the training of a model from using that model in an actual application. Training demands heavy CPU and memory, but it's becoming possible to deploy the resulting model on relatively low-end hardware; TensorFlow is already pursuing such optimizations. That makes the training phase all the more vital, as more of the critical computation in deep learning gets pushed into it, and it bolsters the need for robust, easily managed training mechanisms.
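One way to picture that train/serve split, sketched with the paddle 2.x API (the file name and model are illustrative): the heavy training cluster writes out the learned parameters, and a far smaller machine loads them for inference alone.

```python
import paddle

# On the training cluster: persist the trained weights to a file.
model = paddle.nn.Linear(13, 1)
paddle.save(model.state_dict(), "model.pdparams")

# On a low-end serving machine: rebuild the model and load the weights.
# Inference carries no optimizer state and needs far less memory than training.
served = paddle.nn.Linear(13, 1)
served.set_state_dict(paddle.load("model.pdparams"))
served.eval()                        # switch the layer to inference mode
pred = served(paddle.randn([1, 13]))
```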
You can also expect Kubernetes-powered PaddlePaddle to be offered as a service by one or more of the big-name cloud providers. Aside from bringing the processing closer to where the data likely resides, cloud providers can offer PaddlePaddle-powered deep learning at a scale far beyond any option most companies could put together on their own.