7 Python libraries for parallel processing

Do you need to distribute a heavy Python workload across multiple CPUs or a compute cluster? Here are seven frameworks up to the task.

blackpixel/Shutterstock

Python is long on convenience and programmer-friendly, but it isn’t the fastest programming language around. Some of Python's speed limitations are due to its default implementation, CPython, being single-threaded. That is, CPython doesn’t use more than one hardware thread at a time.

And while you can use Python's built-in threading module to speed things up, threading only gives you concurrency, not parallelism. It’s good for running multiple tasks that aren’t CPU-dependent, but does nothing to speed up multiple tasks that each require a full CPU. This may change in the future, but for now, it's best to assume threading in Python won't give you parallelism.

Python does include a native way to run a workload across multiple CPUs. The multiprocessing module spins up multiple copies of the Python interpreter, each on a separate core, and provides primitives for splitting tasks across cores. But sometimes even multiprocessing isn’t enough.

In some cases, the job calls for distributing work not only across multiple cores, but also across multiple machines. That’s where the Python libraries and frameworks introduced in this article come in. Here are seven frameworks you can use to spread an existing Python application and its workload across multiple cores, multiple machines, or both.

Ray

Developed by a team of researchers at the University of California, Berkeley, Ray underpins a number of distributed machine learning libraries. But Ray isn’t limited to machine learning tasks alone, even if that was its original use case. You can break up and distribute any type of Python task across multiple systems with Ray.

Ray’s syntax is minimal, so you don’t need to rework existing applications extensively to parallelize them. The @ray.remote decorator distributes that function across any available nodes in a Ray cluster, with optionally specified parameters for how many CPUs or GPUs to use. The results of each distributed function are returned as Python objects, so they’re easy to manage and store, and the amount of copying across or within nodes is minimal. This last feature comes in handy when dealing with NumPy arrays, for instance.

Ray even includes its own built-in cluster manager, which can automatically spin up nodes as needed on local hardware or popular cloud computing platforms. Other Ray libraries let you scale common machine learning and data science workloads, so you don't have to manually scaffold them. For instance, Ray Tune lets you perform hyperparameter turning at scale for most common machine learning systems (PyTorch and TensorFlow, among others).

Dask

From the outside, Dask looks a lot like Ray. It, too, is a library for distributed parallel computing in Python, with its own task scheduling system, awareness of Python data frameworks like NumPy, and the ability to scale from one machine to many.

One key difference between Dask and Ray is the scheduling mechanism. Dask uses a centralized scheduler that handles all tasks for a cluster. Ray is decentralized, meaning each machine runs its own scheduler, so any issues with a scheduled task are handled at the level of the individual machine, not the whole cluster. Dask's task framework works hand-in-hand with Python's native concurrent.futures interfaces, so for those who've used that library, most of the metaphors for how jobs work should be familiar.

Dask works in two basic ways. The first is by way of parallelized data structures—essentially, Dask’s own versions of NumPy arrays, lists, or Pandas DataFrames. Swap in the Dask versions of those constructions for their defaults, and Dask will automatically spread their execution across your cluster. This typically involves little more than changing the name of an import, but may sometimes require rewriting to work completely.

The second way is through Dask’s low-level parallelization mechanisms, including function decorators, that parcel out jobs across nodes and return results synchronously (in “immediate” mode) or asynchronously (“lazy” mode). Both modes can be mixed as needed.

Dask also offers a feature called actors. An actor is an object that points to a job on another Dask node. This way, a job that requires a lot of local state can run in-place and be called remotely by other nodes, so the state for the job doesn’t have to be replicated. Ray lacks anything like Dask’s actor model to support more sophisticated job distribution. However, Desk's scheduler isn't aware of what actors do, so if an actor runs wild or hangs, the scheduler can't intercede. "High-performing but not resilient" is how the documentation puts it, so actors should be used with care.

Dispy

Dispy lets you distribute whole Python programs or just individual functions across a cluster of machines for parallel execution. It uses platform-native mechanisms for network communication to keep things fast and efficient, so Linux, macOS, and Windows machines work equally well. That makes it a more generic solution than others discussed here, so it's worth a look if you need something that isn't specifically about accelerating machine-learning tasks or a particular data-processing framework.

Dispy syntax somewhat resembles multiprocessing in that you explicitly create a cluster (where multiprocessing would have you create a process pool), submit work to the cluster, then retrieve the results. A little more work may be required to modify jobs to work with Dispy, but you also gain precise control over how those jobs are dispatched and returned. For instance, you can return provisional or partially completed results, transfer files as part of the job distribution process, and use SSL encryption when transferring data.

Pandaral·lel

Pandaral·lel, as the name implies, is a way to parallelize Pandas jobs across multiple nodes. The downside is that Pandaral·lel works only with Pandas. But if Pandas is what you’re using, and all you need is a way to accelerate Pandas jobs across multiple cores on a single computer, Pandaral·lel is laser-focused on the task.

Note that while Pandaral·lel does run on Windows, it will run only from Python sessions launched in the Windows Subsystem for Linux. Linux and macOS users can run Pandaral·lel as-is. 

Ipyparallel

Ipyparallel is another tightly focused multiprocessing and task-distribution system, specifically for parallelizing the execution of Jupyter notebook code across a cluster. Projects and teams already working in Jupyter can start using Ipyparallel immediately.

Ipyparallel supports many approaches to parallelizing code. On the simple end, there’s map, which applies any function to a sequence and splits the work evenly across available nodes. For more complex work, you can decorate specific functions to always run remotely or in parallel.

Jupyter notebooks support “magic commands” for actions that are only possible in a notebook environment. Ipyparallel adds a few magic commands of its own. For example, you can prefix any Python statement with %px to automatically parallelize it.

Joblib

Joblib has two major goals: run jobs in parallel and don’t recompute results if nothing has changed. These efficiencies make Joblib well-suited for scientific computing, where reproducible results are sacrosanct. Joblib’s documentation provides plenty of examples for how to use all its features.

Joblib syntax for parallelizing work is simple enough—it amounts to a decorator that can be used to split jobs across processors, or to cache results. Parallel jobs can use threads or processes.

Joblib includes a transparent disk cache for Python objects created by compute jobs. This cache not only helps Joblib avoid repeating work, as noted above, but can also be used to suspend and resume long-running jobs, or pick up where a job left off after a crash. The cache is also intelligently optimized for large objects like NumPy arrays. Regions of data can be shared in-memory between processes on the same system by using numpy.memmap. This all makes Joblib highly useful for work that may take a long time to complete, since you can avoid redoing existing work and pause/resume as needed.

One thing Joblib does not offer is a way to distribute jobs across multiple separate computers. In theory it’s possible to use Joblib’s pipeline to do this, but it’s probably easier to use another framework that supports it natively. 

Parsl

Short for "Parallel Scripting Library," Parsl lets you take computing jobs and split them across multiple systems using roughly the same syntax as Python's existing Pool objects. It also lets you stitch together different computing tasks into multi-step workflows, which can run in parallel, in sequence, or via map/reduce operations.

Parsl lets you execute native Python applications, but also run any other external application by way of commands to the shell. Your Python code is just written like normal Python code, save for a special function decorator that marks the entry point to your work. The job-submission system also gives you fine-grained control over how things run on the targets—for example, the number of cores per worker, how much memory per worker, CPU affinity controls, how often to poll for timeouts, and so on.

One excellent feature Parsl offers is a set of prebuilt templates to dispatch work to a variety of high-end computing resources. This not only includes staples like AWS or Kubernetes clusters, but supercomputing resources (assuming you have access) like Blue Waters, ASPIRE 1, Frontera, and so on. (Parsl was co-developed with the aid of many of the institutions that built such hardware.)

Conclusion

Python's limitations with threads will continue to evolve, with major changes slated to allow threads to run side-by-side for CPU-bound work. But those updates are years away from being usable. Libraries designed for parallelism can help fill the gap while we wait.