PyTorch vs. TensorFlow: How to choose

If you actually need a deep learning model, PyTorch and TensorFlow are the two leading options

Contributor, InfoWorld


Not every regression or classification problem needs to be solved with deep learning. For that matter, not every regression or classification problem needs to be solved with machine learning. After all, many data sets can be modeled analytically or with simple statistical procedures.

On the other hand, there are cases where deep learning or deep transfer learning can help you train a model that is more accurate than you could create any other way. For these cases, PyTorch and TensorFlow can be quite effective, especially if there is already a trained model similar to what you need in the framework’s model library.

PyTorch

PyTorch builds on the older Torch and Caffe2 frameworks. As you might guess from the name, PyTorch uses Python as its scripting language, and uses an evolved Torch C/CUDA back-end. The production features of Caffe2 are being incorporated into the PyTorch project.

PyTorch is billed as “Tensors and dynamic neural networks in Python with strong GPU acceleration.” What does that mean?

Tensors are a mathematical construct that is used heavily in physics and engineering. A tensor of rank 2 is a special kind of matrix; taking the inner product of a vector with the tensor yields another vector with a new magnitude and a new direction. TensorFlow takes its name from the way tensors (of synapse weights) flow around its network model. NumPy also works with tensors, which it calls ndarrays (N-dimensional arrays).
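As a small illustration of the rank-2 case, here is a minimal sketch using a NumPy ndarray. The values in T and v are arbitrary and chosen only for illustration; applying the tensor to the vector produces a new vector whose magnitude and direction both differ from the original.

```python
import numpy as np

# A rank-2 tensor stored as a 2x2 ndarray (values are arbitrary).
T = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# A unit vector pointing along the y axis.
v = np.array([0.0, 1.0])

# Contract the tensor with the vector to get a new vector.
w = T @ v

print(T.ndim)                                   # number of array axes
print(w)                                        # the transformed vector
print(np.linalg.norm(v), np.linalg.norm(w))     # magnitudes before and after
```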

GPU acceleration is a given for most modern deep neural network frameworks. A dynamic neural network is one that can change from iteration to iteration, for example allowing a PyTorch model to add and remove hidden layers during training to improve its accuracy and generality. PyTorch recreates the graph on the fly at each iteration step. In contrast, TensorFlow by default creates a single data flow graph, optimizes the graph code for performance, and then trains the model.
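To make the idea of a graph that is rebuilt at each step concrete, here is a minimal sketch. The DynamicDepthNet class, its layer sizes, and the depth argument are hypothetical names invented for this illustration; rather than literally adding and removing layers, the sketch varies how many times a shared hidden layer is applied on each forward pass, using ordinary Python control flow.

```python
import torch
import torch.nn as nn


class DynamicDepthNet(nn.Module):
    """Hypothetical model whose effective depth is chosen on every forward pass."""

    def __init__(self, width: int = 8):
        super().__init__()
        self.input_layer = nn.Linear(4, width)
        self.hidden = nn.Linear(width, width)    # applied a variable number of times
        self.output_layer = nn.Linear(width, 1)

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        h = torch.relu(self.input_layer(x))
        # A plain Python loop decides the structure of this pass's graph.
        for _ in range(depth):
            h = torch.relu(self.hidden(h))
        return self.output_layer(h)


torch.manual_seed(0)
model = DynamicDepthNet()
batch = torch.randn(3, 4)    # random input, for illustration only

for depth in (0, 1, 3):
    out = model(batch, depth)
    loss = out.pow(2).mean()
    model.zero_grad()
    loss.backward()          # autograd follows whatever graph this pass built
    print(depth, tuple(out.shape))
```

Because the graph is recorded as the code runs, nothing has to be declared in advance about how many hidden steps a given iteration will use.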

While eager execution mode is a fairly new option in TensorFlow, it’s the only way PyTorch runs: API calls execute when invoked, rather than being added to a graph to be run later. That might seem like it would be less computationally efficient, but PyTorch was designed to work that way, and it is no slouch when it comes to training or prediction speed.

PyTorch integrates acceleration libraries such as Intel MKL and Nvidia cuDNN and NCCL to maximize speed. Its core CPU and GPU Tensor and neural network back-ends—TH (Torch), THC (Torch CUDA), THNN (Torch neural network), and THCUNN (Torch CUDA neural network)—are written as independent libraries with a C99 API. At the same time, PyTorch is not a Python binding into a monolithic C++ framework. The intention is for it to be deeply integrated with Python and to allow the use of other Python libraries.

Fast.ai and the fastai library

Fast.ai is a small company making deep learning easier to use and getting more people from all backgrounds involved through its free courses for coders, software library, cutting-edge research, and community.

The fastai library, which is based on PyTorch, simplifies training fast and accurate neural networks using modern best practices. It is based on research into deep learning best practices undertaken at Fast.ai, and it includes “out of the box” support for vision, text, tabular, and collab (collaborative filtering) models.

To a first approximation, the fastai library is to PyTorch as Keras is to TensorFlow. One significant difference is that PyTorch doesn’t officially support fastai, whereas TensorFlow ships its own Keras implementation as tf.keras.

TensorFlow

Of all the excellent machine learning and deep learning frameworks available, TensorFlow is the most mature, has the most citations in research papers (even excluding citations from Google employees), and has the best story about use in production. It may not be the easiest framework to learn, but with the arrival of TensorFlow 2, TensorFlow is much less intimidating than it was in 2016. TensorFlow underlies many Google services.

The TensorFlow 2.0 site describes the project as an “end-to-end open source machine learning platform.” By “platform,” Google means a comprehensive ecosystem of tools, libraries, and community resources that lets researchers “push the state-of-the-art” in machine learning and developers easily build and deploy AI-powered applications.

There are four major parts to TensorFlow 2.0:

TensorFlow core, an open source library for developing and training machine learning models;
TensorFlow.js, a JavaScript library for training and deploying models in the web browser and on Node.js;
TensorFlow Lite, a lightweight library for deploying models on mobile and embedded devices; and
TensorFlow Extended, an end-to-end platform for preparing data, training, validating, and deploying models in large production environments.

TensorFlow 2.0 focuses on simplicity and ease of use, with updates like eager execution, intuitive higher-level APIs, and flexible model building on any platform. Eager execution means that TensorFlow code runs when it is defined, as opposed to adding nodes and edges to a graph to be run in a session later, which was TensorFlow’s original mode.
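The difference between the two modes can be sketched in a few lines. This is a minimal illustration assuming TensorFlow 2.x is installed; the function name graph_matmul and the matrix values are invented for the example. The first product runs eagerly, while the tf.function decorator asks TensorFlow to trace the same computation into a graph on first call.

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [0.0, 1.0]])    # identity matrix

# Eager execution: the product is computed as soon as this line runs,
# and the concrete values can be inspected immediately.
eager_result = tf.matmul(a, b)
print(eager_result.numpy())


# tf.function traces the Python function into a graph on its first call
# and reuses that graph for later calls with compatible inputs.
@tf.function
def graph_matmul(x, y):
    return tf.matmul(x, y)


graph_result = graph_matmul(a, b)
print(tf.executing_eagerly())    # eager mode is the default in TensorFlow 2
```

Both paths return the same values; the graph version simply gives TensorFlow a chance to optimize the computation before running it.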

The guidance for effective TensorFlow 2.0 is to use the high-level tf.keras APIs rather than the old low-level APIs; that will greatly reduce the amount of code you need to write. You can build Keras neural networks using one line of code per layer, or fewer if you take advantage of looping constructs.
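Here is a minimal sketch of what “one line of code per layer, or fewer” can look like with tf.keras. The input width of 20 features and the hidden layer sizes are assumptions made up for the example, not recommendations.

```python
import tensorflow as tf

# Layer sizes are arbitrary, chosen only for illustration.
hidden_units = [64, 32, 16]

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(20,)))                  # 20 input features (assumed)
for units in hidden_units:                              # one loop instead of three lines
    model.add(tf.keras.layers.Dense(units, activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.summary()
```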

TensorFlow.js is a library for developing and training machine learning models in JavaScript and deploying them in a browser or on Node.js. There is also a high-level library built on top of TensorFlow.js, ml5.js, which hides the complexities of tensors and optimizers.

In the browser, TensorFlow.js supports mobile devices as well as desktop devices. If your browser supports WebGL shader APIs, TensorFlow.js can use them and take advantage of the GPU. That can give you up to 100x speed-up compared to the CPU back-end. The TensorFlow.js demos run surprisingly quickly in the browser on a machine with a GPU.

TensorFlow Lite is an open source deep learning framework for on-device inference. It currently targets Android, iOS, ARM64, and Raspberry Pi. The two main components of TensorFlow Lite are an interpreter and a converter. The interpreter runs specially optimized models on many different hardware types. The converter converts TensorFlow models into an efficient form for use by the interpreter, and can introduce optimizations to improve binary size and performance.

TensorFlow Extended (TFX) is an end-to-end platform for deploying production machine learning pipelines. It is something to consider once you have trained a model. Pipelines include data validation, feature engineering, modeling, model evaluation, serving inference, and managing deployments to online, native mobile, and JavaScript targets.


Keras

Keras is a high-level front-end specification and implementation for building neural network models. Keras ships with support for three back-end deep learning frameworks: TensorFlow, CNTK, and Theano. Amazon is developing an MXNet back-end for Keras. It’s also possible to use PlaidML (an independent project) as a back-end for Keras to take advantage of PlaidML’s OpenCL support for all GPUs.

TensorFlow is the default back-end for Keras, and the one recommended for many use cases involving GPU acceleration on Nvidia hardware via CUDA and cuDNN, as well as for Tensor Processing Unit (TPU) acceleration in the Google Cloud. TensorFlow also contains an internal tf.keras module, separate from an external Keras installation, that is the preferred high-level front-end to TensorFlow, as discussed above.

Keras has a high-level environment that reduces adding a layer to a neural network to one line of code in its Sequential model, and needs one function call each for compiling and training a model. Keras lets you work at a lower level if you want, with its Model or functional API.

Keras allows you to drop down even further, to the Python coding level, by subclassing keras.Model, although the functional API is preferred when possible. It also has a Scikit-learn API, so that you can use the Scikit-learn grid search to perform hyperparameter optimization in Keras models.
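The following is a minimal sketch of the functional API, followed by the single compile call and single training call described above. It uses the tf.keras implementation; the synthetic data, layer sizes, and epoch count are assumptions chosen only so the example runs quickly.

```python
import numpy as np
import tensorflow as tf

# Synthetic data, for illustration only: 200 samples, 8 features, binary labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8)).astype("float32")
y = (x.sum(axis=1) > 0).astype("float32")

# Functional API: layers are called on tensors, so the data flow is explicit.
inputs = tf.keras.Input(shape=(8,))
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# One call to compile the model, one call to train it.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(x, y, epochs=3, batch_size=32, verbose=0)

print(sorted(history.history.keys()))
```

The same network could be written with the Sequential model; the functional form becomes more useful once a model has multiple inputs, multiple outputs, or shared layers.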


Deep learning vs. transfer learning

Both PyTorch and TensorFlow support deep learning and transfer learning. Transfer learning, which is sometimes called custom machine learning, starts with a pre-trained neural network model and customizes the final layers for your data.

Training a deep neural network from scratch is time-consuming and requires a lot of tagged data. Transfer learning takes less time and requires fewer new labeled exemplars, but it is useful only if a pre-trained model exists. Fortunately, all of the major deep learning frameworks offer a model zoo of some sort.

Convolutional neural networks (aka ConvNets or CNNs) for image classification are a prime example of the utility of transfer learning. Both PyTorch and TensorFlow offer tutorials on how to use transfer learning for training convolutional neural networks. The TensorFlow transfer learning tutorial demonstrates using transfer learning for feature extraction and for fine-tuning. The PyTorch transfer learning tutorial demonstrates the same two methodologies.
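The feature-extraction flavor of transfer learning can be sketched in PyTorch as follows. This is not the code from either tutorial. The backbone here is a small, randomly initialized stand-in for a pre-trained network so that the example runs offline, and the layer sizes, class count, and learning rate are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained network. In real use this would be loaded from a
# model zoo; here it is randomly initialized so the sketch runs offline.
backbone = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)

# Feature extraction: freeze every backbone parameter.
for param in backbone.parameters():
    param.requires_grad = False

# New final layer sized for the new task (5 classes is an arbitrary choice).
head = nn.Linear(64, 5)
model = nn.Sequential(backbone, head)

# Only parameters that still require gradients go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

# One training step on random data (illustrative only).
x = torch.randn(16, 32)
y = torch.randint(0, 5, (16,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(len(trainable))    # number of trainable parameter tensors
```

Fine-tuning differs mainly in that some or all of the backbone parameters are left trainable, usually with a smaller learning rate, so the pre-trained weights are adjusted rather than held fixed.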

If you’d like to learn more about transfer learning for convolutional neural networks, you might want to read the Stanford CS231n notes on the subject and also follow the references. A key quote from these notes:

“In practice, very few people train an entire Convolutional Network from scratch (with random initialization) because it is relatively rare to have a data set of sufficient size. Instead, it is common to pretrain a ConvNet on a very large data set (e.g. ImageNet, which contains 1.2 million images with 1,000 categories) and then use the ConvNet either as an initialization or a fixed feature extractor for the task of interest.”

How to choose a deep learning framework

In the early days of PCs and Macs, people would often ask me which to buy. The real answer was that it was the wrong question (or “It depends”), but I generally helped people to find their own answer by asking them some follow-up questions, starting with “What do you want to do with your computer?” and “Do you have applications you can’t live without?”

Similarly, “What deep learning framework should I use?” is not really the right question. The follow-up questions to help dig out a better answer than “It depends” start with “What do you want to accomplish with your models?” and continue with drill-downs into the kind of data you have available for training.

If you’re new to deep learning, I suggest that you start by going through the tutorials for Keras in TensorFlow 2 and fastai in PyTorch. There is plenty to learn in each of these without even dipping into the lower-level APIs of TensorFlow and PyTorch, and you’ll get a feel for both approaches. You will probably also realize how similar the two frameworks really are, and how much they depend on the same concepts and techniques.

For many use cases, which framework you choose won’t matter: You’ll find essentially the same models available for each framework. For some specific use cases, you may find one better than the other, at least in their current incarnations. You may also find that one is easier for you to learn than the other, whether because of some essential features in the frameworks or because of the quality of the tutorials.


Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi.