Microsoft’s AI push goes beyond the cloud, as the company is clearly getting ready for desktop hardware with built-in AI accelerators. You only have to look at Microsoft’s collaboration with Qualcomm, which produced the SQ series of Arm processors, all of which come with AI accelerators that deliver new computer vision features on Windows.
AI accelerators aren’t new. Essentially they’re an extension of the familiar GPU, only now they’re designed to accelerate neural networks. That explains the name Microsoft has adopted for them: NPUs, neural processing units.
NPUs fill an important need. End users want to be able to run AI workloads locally, without relying on cloud compute, keeping their data inside their own hardware, often for security and regulatory reasons. While NPU-enabled hardware is still rare, there are signs from major silicon vendors that these accelerators will be a key feature of upcoming processor generations.
Supporting AI applications across hardware architectures
While technologies like ONNX (Open Neural Network Exchange) help make trained models portable, with ONNX runtimes for Windows and ONNX support in most Windows development platforms, including .NET, there’s still a significant roadblock to wider support for local AI applications: different tool chains for different hardware implementations.
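To make the portability point concrete, here's a minimal sketch, assuming PyTorch and torchvision are installed, of exporting a trained model to ONNX. The resulting file can be loaded by any ONNX runtime on Windows, including from .NET; the model choice and file names are purely illustrative.

```python
# Export a trained PyTorch model to ONNX so it can be consumed by any
# ONNX runtime, whether that's Python, C++, or .NET on Windows.
# Model choice and file names here are illustrative.
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.eval()

# Example input so the exporter can trace the model's graph.
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
)
```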
If you want to write machine learning applications that run inference on the SQ-series Arm NPUs, you need to sign up for Qualcomm’s developer program to get access to the necessary SDKs and libraries. They’re not part of the standard .NET distribution or the Windows C++ SDK, nor are they available on GitHub.
That makes it hard to write general-purpose AI applications. It also limits features like Microsoft’s real-time camera image processing to Windows on Arm devices with an NPU, even if you have an Intel ML accelerator card or a high-end Nvidia GPU. Code needs to be hardware-specific, which makes it hard to distribute through mechanisms like the Microsoft Store, or even via enterprise application management tooling like Microsoft Intune.
Optimizing ONNX models with Olive
Build 2023 saw Microsoft start to cross the hardware divide, detailing what it describes as a “hybrid loop” based on both ONNX and a new Python tool called Olive, which is intended to give you the same level of access to AI tooling as Microsoft’s own Windows team. Using Olive, you can compress, optimize, and compile models to run on local devices (aka the edge) or in the cloud, allowing on-prem operation when necessary and bursting to Azure when data governance considerations and bandwidth allow.
So, what exactly is Olive? It’s a way of simplifying the packaging process to optimize inferencing for specific hardware, allowing you to build code that can switch inferencing engines as needed. You still build different inferencing packages for different hardware combinations, but your code can load the appropriate package at run time. In the case of Windows on Arm, for example, your code can be compiled with a Qualcomm NPU package that’s built at the same time as your x86 equivalents.
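The run-time switch itself is handled by ONNX Runtime’s execution providers. The sketch below assumes an optimized model.onnx already exists and shows the general pattern: query the providers available on the machine and fall back gracefully. Which providers are actually present depends on the onnxruntime package you’ve installed.

```python
# Pick an ONNX Runtime execution provider based on what's available
# on the current machine, falling back to CPU if nothing better exists.
import onnxruntime as ort

available = ort.get_available_providers()

# Prefer the Qualcomm NPU (QNN), then DirectML, then plain CPU.
preferred = ["QNNExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in available]

session = ort.InferenceSession("model.onnx", providers=providers)
print("Running on:", session.get_providers())
```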
Like much of Microsoft’s recent developer tooling, Olive is open source and available on GitHub. Once Olive is installed in your development environment, you can use it to automate the process of tuning and optimizing models for target hardware. Olive provides a range of tuning options targeting different model types. If you’re using a transformer, for example, Olive can apply the appropriate optimizations, as well as help you balance the constraints on your model to manage both latency and accuracy.
Optimization in Olive is a multi-pass process, starting with either a PyTorch model or an ONNX export from any other training platform. You define your requirements for the model and for each pass, which performs a specific optimization. You can run passes (optimizations) using Azure VMs, your local development hardware, or a container that can be run anywhere you have sufficient compute resources. Olive runs a search across various possible tunings, looking for the best implementation of your model before packaging it for testing in your application.
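If you’d rather drive that workflow from Python than from the command line, recent olive-ai releases expose a workflow entry point. The import path and config file name below are assumptions to verify against the Olive documentation.

```python
# Kick off an Olive workflow from Python rather than the CLI.
# The olive.workflows.run entry point reflects the olive-ai package's
# documented usage at the time of writing; check the project's docs,
# as the exact import path may change between releases.
from olive.workflows import run as olive_run

# The configuration file defines the input model, the passes to run,
# and the metrics Olive uses when searching for the best variant.
olive_run("resnet_config.json")
```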
Making Olive part of your AI development process
Because much of Olive’s operation is automated, it should be relatively easy to weave it into existing tool chains and build processes. Olive is triggered by a simple CLI, working against parameters set by a configuration file, so it could be included in your CI/CD workflow either as a GitHub Action or as part of an Azure Pipeline. As the output is prepackaged models and runtimes, along with sample code, you could use Olive to generate build artifacts to include in a deployment package, in a container for distributed applications, or in an installer for desktop apps.
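A pipeline step might look something like the sketch below: invoke Olive’s CLI against a checked-in configuration file and fail the build if optimization fails. The module path and --config flag reflect Olive’s documented command line at the time of writing; verify them against the version you install.

```python
# Run Olive as a build step and surface failures to the CI system.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-m", "olive.workflows.run", "--config", "config.json"],
    check=False,
)
if result.returncode != 0:
    raise SystemExit("Olive optimization failed; see log output above")

# The optimized model and packaging output land in Olive's output directory,
# ready to be collected as a build artifact by the rest of the pipeline.
```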
Getting started with Olive is simple enough. It’s a Python package, installed using pip, with some additional dependencies for specific target environments.
You need to write an Olive JSON configuration file before running an optimization. This isn’t a job for beginners, although there are sample configurations in the Olive documentation to help you get started. Start by choosing the model type and its inputs and outputs, then define your desired performance and accuracy. Finally, your configuration determines how Olive will optimize your model, for example converting a PyTorch model to ONNX and applying dynamic quantization.
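As a rough illustration of that PyTorch-to-ONNX-plus-quantization case, here’s a sketch of a configuration written out from Python for convenience. The overall shape (an input model, a set of passes, and an engine section) follows Olive’s documented JSON format, but treat the specific keys, pass names, and paths as assumptions; the samples in the Olive repository are the reference.

```python
# Write an illustrative Olive configuration file: convert a PyTorch model
# to ONNX, then apply dynamic quantization. Keys and pass names are
# assumptions to check against the Olive docs and sample configs.
import json

config = {
    "input_model": {
        "type": "PyTorchModel",
        "config": {
            "model_path": "models/resnet18.pt",
            "io_config": {
                "input_names": ["input"],
                "input_shapes": [[1, 3, 224, 224]],
                "output_names": ["output"],
            },
        },
    },
    "passes": {
        # First convert the PyTorch model to ONNX...
        "conversion": {"type": "OnnxConversion", "config": {"target_opset": 17}},
        # ...then shrink it with dynamic quantization.
        "quantization": {"type": "OnnxDynamicQuantization"},
    },
    "engine": {"output_dir": "olive_output"},
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```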
The results can be impressive, with the team demonstrating significant reductions in both latency and model size. That makes Olive a useful tool for local inferencing, as it ensures that you can make the most of restricted environments with limited compute and limited storage, for example when deploying safety-critical computer vision applications on edge hardware.
Preparing for the next generation of AI silicon
There’s a significant level of future-proofing in Olive. The tool is built around an optimization plugin model that allows silicon vendors to define their own sets of optimizations and to deliver them to Olive users. Both Intel and AMD have already delivered tooling that works with their own hardware and software, which should make it easier to improve model performance while reducing the compute needed to perform the necessary optimizations. This approach will allow Olive to quickly pick up support for new AI hardware, both integrated chipsets and external accelerators.
Olive is coupled with a new Windows ONNX runtime that lets you switch between local inferencing and a cloud endpoint, based on logic in your code. Sensitive operations could be forced to run locally, while less restricted operations could run wherever is most economical.
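The new runtime exposes its own APIs for this, but the decision logic is easy to picture. Below is a hedged sketch of application-level code choosing between a local ONNX Runtime session and a hypothetical cloud scoring endpoint; the endpoint URL and the sensitivity check are placeholders, not part of the hybrid runtime itself.

```python
# Application-level routing: run sensitive requests through a local
# ONNX Runtime session, send the rest to a cloud endpoint.
# The endpoint URL and sensitivity flag are hypothetical placeholders.
import numpy as np
import onnxruntime as ort
import requests

LOCAL_SESSION = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
CLOUD_ENDPOINT = "https://example-inference.azurewebsites.net/score"  # placeholder URL

def run_inference(features: np.ndarray, sensitive: bool):
    if sensitive:
        # Keep regulated data on the device.
        return LOCAL_SESSION.run(None, {"input": features})[0]
    # Otherwise use whichever endpoint is most economical.
    response = requests.post(CLOUD_ENDPOINT, json={"data": features.tolist()})
    response.raise_for_status()
    return response.json()
```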
One more useful feature in Olive is the ability to connect it directly to an Azure Machine Learning account, so you can go from your own custom models to ONNX packages in a single workflow. If you’re planning on hybrid or cloud-only inferencing, Olive will optimize your models for running in Azure.
Optimizing ONNX-format models for specific hardware has many benefits, and having a tool like Olive that supports multiple target environments should help deliver applications with the performance users expect and need on the hardware they use. But that’s only part of the story. For developers charged with building optimized machine learning applications for multiple hardware platforms, Olive provides a way to get over the first few hurdles.