Machine learning is a complex discipline, but implementing machine learning models is far less daunting than it used to be, thanks to machine learning frameworks such as Google’s TensorFlow that ease the process of acquiring data, training models, serving predictions, and refining future results.
Created by the Google Brain team and initially released to the public in 2015, TensorFlow is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural network) models and algorithms and makes them useful by way of common programmatic metaphors. It uses Python or JavaScript to provide a convenient front-end API for building applications, while executing those applications in high-performance C++.
TensorFlow, which competes with frameworks such as PyTorch and Apache MXNet, can train and run deep neural networks for handwritten digit classification, image recognition, word embeddings, recurrent neural networks, sequence-to-sequence models for machine translation, natural language processing, and PDE (partial differential equation)-based simulations. Best of all, TensorFlow supports production prediction at scale, with the same models used for training.
TensorFlow also has a broad library of pre-trained models that can be used in your own projects. You can also use code from the TensorFlow Model Garden as examples of best practices for training your own models.
How TensorFlow works
TensorFlow allows developers to create dataflow graphs—structures that describe how data moves through a graph, or a series of processing nodes. Each node in the graph represents a mathematical operation, and each connection or edge between nodes is a multidimensional data array, or tensor.
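As a minimal sketch of the idea, the snippet below traces a small Python function into a dataflow graph with `tf.function` and lists the operations (nodes) TensorFlow recorded. The function name `affine` and the sample values are illustrative assumptions, not part of any TensorFlow API.

```python
import tensorflow as tf


@tf.function  # asks TensorFlow to trace this function into a dataflow graph
def affine(x, w, b):
    # Two nodes: a matrix multiplication followed by an addition.
    return tf.matmul(x, w) + b


# Tensors are the multidimensional arrays that flow along the graph's edges.
x = tf.constant([[1.0, 2.0]])
w = tf.constant([[3.0], [4.0]])
b = tf.constant([0.5])

result = affine(x, w, b)

# Inspect the traced graph: each operation is one node.
graph = affine.get_concrete_function(x, w, b).graph
op_types = [op.type for op in graph.get_operations()]
print(op_types)
print(result.numpy())
```

The printed list should include a `MatMul` node alongside placeholder nodes for the three inputs, which is the graph structure the paragraph above describes.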
TensorFlow applications can be run on almost any convenient target: a local machine, a cluster in the cloud, iOS and Android devices, CPUs or GPUs. If you use Google’s own cloud, you can run TensorFlow on Google’s custom Tensor Processing Unit (TPU) silicon for further acceleration. The models TensorFlow produces, though, can be deployed on almost any device, where they will be used to serve predictions.
TensorFlow 2.0, released in October 2019, revamped the framework in many ways based on user feedback, to make it easier to work with (as an example, by using the relatively simple Keras API for model training) and more performant. Distributed training is easier to run thanks to a new API, and support for TensorFlow Lite makes it possible to deploy models on a greater variety of platforms. However, code written for earlier versions of TensorFlow must be rewritten—sometimes only slightly, sometimes significantly—to take maximum advantage of new TensorFlow 2.0 features.
A trained model can be used to deliver predictions as a service via a Docker container using REST or gRPC APIs. For more advanced serving scenarios, you can use Kubernetes.
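To make the REST option concrete, here is a hedged sketch of how a client might build a prediction request for a model hosted by TensorFlow Serving, which exposes a `:predict` endpoint under `/v1/models/` (port 8501 is its conventional REST port). The host, the model name `my_model`, and the helper `build_predict_request` are assumptions for illustration. The request is only constructed, not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, port=8501):
    """Build (but do not send) a TensorFlow Serving REST predict request.

    `host`, `model_name`, and `port` are hypothetical deployment details.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    # The "instances" key holds one entry per input example (row format).
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("localhost", "my_model", [[1.0, 2.0, 3.0]])
print(request.full_url)
print(request.data.decode("utf-8"))
```

With a serving container actually running, passing `request` to `urllib.request.urlopen` would return a JSON document whose `predictions` field holds the model's outputs.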
Using TensorFlow with Python
TensorFlow provides all of this for the programmer by way of the Python language. Python is easy to learn and work with, and it provides convenient ways to express how high-level abstractions can be coupled together. TensorFlow is supported on Python versions 3.7 through 3.10; it may work on earlier versions of Python, but it's not guaranteed to do so.
Nodes and tensors in TensorFlow are Python objects, and TensorFlow applications are themselves Python applications. The actual math operations, however, are not performed in Python. The libraries of transformations that are available through TensorFlow are written as high-performance C++ binaries. Python just directs traffic between the pieces and provides high-level programming abstractions to hook them together.
High-level work in TensorFlow—creating nodes and layers and linking them together—uses the Keras library. The Keras API is outwardly simple; a basic model with three layers can be defined in less than 10 lines of code, and the training code for the same takes just a few more lines of code. But if you want to "lift the hood" and do more fine-grained work, such as writing your own training loop, you can do that.
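A rough sketch of what such a three-layer Keras model and its training code can look like follows. The layer sizes, the random stand-in data, and the choice of optimizer and loss are all assumptions made for illustration. A real project would substitute its own dataset and settings.

```python
import numpy as np
import tensorflow as tf

# A three-layer classifier: 4 input features, 3 output classes (assumed sizes).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Random stand-in data, used only so the example runs end to end.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 4)).astype("float32")
labels = rng.integers(0, 3, size=(64,))

# Training takes only a couple more lines.
history = model.fit(features, labels, epochs=2, batch_size=16, verbose=0)
predictions = model.predict(features[:5], verbose=0)
print(predictions.shape)
```

Each row of `predictions` is a probability distribution over the three classes, because the final layer uses a softmax activation.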
Using TensorFlow with JavaScript
Python is the most popular language for working with TensorFlow and machine learning generally. But JavaScript is now also a first-class language for TensorFlow, and one of JavaScript's massive advantages is that it runs anywhere there's a web browser.
TensorFlow.js, as the JavaScript TensorFlow library is called, uses the WebGL API to accelerate computations by way of whatever GPUs are available in the system. It's also possible to use a WebAssembly back end for execution, and it's faster than the regular JavaScript back end if you're only running on a CPU, though it's best to use GPUs whenever possible. Pre-built models let you get up and running with simple projects to give you an idea of how things work.
TensorFlow Lite
Trained TensorFlow models can also be deployed on edge computing or mobile devices, such as iOS or Android systems. The TensorFlow Lite toolset optimizes TensorFlow models to run well on such devices by allowing you to make tradeoffs between model size and accuracy. A smaller model (say, 12MB versus 25MB, or even 100+MB) is less accurate, but the loss in accuracy is generally small, and more than offset by the model's speed and energy efficiency.
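The size side of that tradeoff can be sketched with the `tf.lite.TFLiteConverter` API. The example converts a small, untrained Keras model twice, once as-is and once with the converter's default optimizations (which quantize weights), then compares the resulting byte counts. The model architecture is an arbitrary assumption. Accuracy is not measured here, and the exact sizes depend on the TensorFlow version.

```python
import tensorflow as tf

# A small stand-in model; the layer sizes are arbitrary assumptions.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Baseline conversion: weights stay in 32-bit floating point.
float_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Conversion with default optimizations, which quantize the weights.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_bytes = converter.convert()

print(f"float model:     {len(float_bytes)} bytes")
print(f"quantized model: {len(quantized_bytes)} bytes")
```

Either byte string can be written to a `.tflite` file and loaded by the TensorFlow Lite runtime on the target device.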
Why use TensorFlow
The single biggest benefit TensorFlow provides for machine learning development is abstraction. Instead of dealing with the nitty-gritty details of implementing algorithms, or figuring out proper ways to hitch the output of one function to the input of another, the developer can focus on the overall application logic. TensorFlow takes care of the details behind the scenes.
TensorFlow offers additional conveniences for developers who need to debug and gain introspection into TensorFlow apps. Each graph operation can be evaluated and modified separately and transparently, so you don't have to construct the entire graph as a single opaque object and evaluate it all at once. This so-called "eager execution mode," provided as an option in older versions of TensorFlow, is now standard.
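As a small illustration, assuming a TensorFlow 2.x installation where eager execution is on by default, each operation below runs immediately and its result can be inspected like any other Python value. The tensor values are arbitrary.

```python
import tensorflow as tf

# In TensorFlow 2.x this reports True unless eager mode has been disabled.
print(tf.executing_eagerly())

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# The multiplication runs right away; no session or explicit graph is needed.
b = tf.matmul(a, a)

# Intermediate results are ordinary values that can be printed or debugged.
print(b.numpy())
```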
The TensorBoard visualization suite lets you inspect and profile the way graphs run by way of an interactive, web-based dashboard. A service, TensorBoard.dev (hosted by Google), lets you host and share machine learning experiments written in TensorFlow. It's free to use with storage for up to 100M scalars, 1GB of tensor data, and 1GB of binary object data. (Note that any data hosted in TensorBoard.dev is public, so don't use it for sensitive projects.)
TensorFlow also gains many advantages from the backing of an A-list commercial outfit in Google. Google has fueled the rapid pace of development behind the project and created many significant offerings that make TensorFlow easier to deploy and use. The above-mentioned TPU silicon for accelerated performance in Google’s cloud is just one example.
Deterministic model training with TensorFlow
A few details of TensorFlow’s implementation make it hard to obtain totally deterministic model-training results for some training jobs. Sometimes, a model trained on one system will vary slightly from a model trained on another, even when they are fed the exact same data. The reasons for this variance are slippery—one reason is how random numbers are seeded and where; another is related to certain non-deterministic behaviors when using GPUs. TensorFlow's 2.0 branch has an option to enable determinism across an entire workflow with a couple of lines of code. This feature comes at a performance cost, however, and should only be used when debugging a workflow.
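The "couple of lines" might look like the sketch below. It assumes a recent TensorFlow 2.x release that provides `tf.keras.utils.set_random_seed` and `tf.config.experimental.enable_op_determinism`. The helper `seeded_sample` and the seed value are illustrative only.

```python
import tensorflow as tf

# Ask TensorFlow to use deterministic op implementations (may be slower).
tf.config.experimental.enable_op_determinism()


def seeded_sample(seed):
    # Seeds Python's `random`, NumPy, and TensorFlow in one call.
    tf.keras.utils.set_random_seed(seed)
    return tf.random.normal((3,)).numpy()


first = seeded_sample(42)
second = seeded_sample(42)

# With the same seed, both draws should match exactly.
print((first == second).all())
```

The same two calls placed at the top of a training script are the usual starting point for reproducing a run while debugging.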
TensorFlow vs. PyTorch, CNTK, and MXNet
TensorFlow competes with a slew of other machine learning frameworks. PyTorch, CNTK, and MXNet are three major frameworks that address many of the same needs. Let's close with a quick look at where they stand out and come up short against TensorFlow:
- PyTorch is built with Python and has many other similarities to TensorFlow: hardware-accelerated components under the hood, a highly interactive development model that allows for design-as-you-go work, and many useful components already included. PyTorch is generally a better choice for fast development of projects that need to be up and running in a short time, but TensorFlow wins out for larger projects and more complex workflows.
- CNTK, the Microsoft Cognitive Toolkit, is like TensorFlow in using a graph structure to describe dataflow, but it focuses mostly on creating deep learning neural networks. CNTK handles many neural network jobs faster, and has a broader set of APIs (Python, C++, C#, Java). But it isn’t currently as easy to learn or deploy as TensorFlow. Like TensorFlow, which uses the Apache license, CNTK is available under a permissive open source license (MIT). But CNTK isn't as aggressively developed; the last major release was in 2019.
- Apache MXNet, adopted by Amazon as the premier deep learning framework on AWS, can scale almost linearly across multiple GPUs and multiple machines. MXNet also supports a broad range of language APIs—Python, C++, Scala, R, JavaScript, Julia, Perl, Go—although its native APIs aren’t as pleasant to work with as TensorFlow’s. It also has a far smaller community of users and developers.