These days, massively scalable pub/sub messaging is virtually synonymous with Apache Kafka. Apache Kafka continues to be the rock-solid, open-source, go-to choice for distributed streaming applications, whether you’re adding something like Apache Storm or Apache Spark for processing or using the processing tools provided by Apache Kafka itself. But Kafka isn’t the only game in town.
Developed by Yahoo and now an Apache Software Foundation project, Apache Pulsar is going for the crown of messaging that Apache Kafka has worn for many years. Apache Pulsar offers the potential of faster throughput and lower latency than Apache Kafka in many situations, along with a compatible API that allows developers to switch from Kafka to Pulsar with relative ease.
How should one choose between the venerable stalwart Apache Kafka and the upstart Apache Pulsar? Let’s look at their core open source offerings and what the core maintainers’ enterprise editions bring to the table.
Apache Kafka
Developed by LinkedIn and released as open source back in 2011, Apache Kafka has spread far and wide, pretty much becoming the default choice for many when thinking about adding a service bus or pub/sub system to an architecture. Since Apache Kafka’s debut, the Kafka ecosystem has grown considerably, adding the Schema Registry to enforce schemas in Apache Kafka messaging, Kafka Connect for easy streaming from other data sources such as databases to Kafka, Kafka Streams for distributed stream processing, and most recently KSQL for performing SQL-like querying over Kafka topics. (A topic in Kafka is the name for a particular channel.)
The standard use case for many real-time pipelines built over the past few years has been to push data into Apache Kafka and then use a stream processor such as Apache Storm or Apache Spark to pull in data, perform any processing, and then publish output to another topic for downstream consumption. With Kafka Streams and KSQL, all of your data pipeline needs can be handled without having to leave the Apache Kafka project at any time, though of course, you can still use an external service to process your data if required.
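To make the consume, process, publish pattern concrete, here is a minimal sketch in plain Python. It does not use a real Kafka client: `InMemoryBroker`, `run_processor`, and the topic names are hypothetical stand-ins invented for illustration, and each "topic" is just an in-memory list.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a broker (an assumption, not Kafka's API).

    Each topic is modeled as an append-only list of messages.
    """

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def read_from(self, topic, offset):
        """Return a copy of all messages at or after the given offset."""
        return self.topics[topic][offset:]


def run_processor(broker, source, sink, transform, offset=0):
    """Consume from `source`, transform each message, publish to `sink`.

    Returns the next offset to read so a caller can resume later.
    """
    batch = broker.read_from(source, offset)
    for message in batch:
        broker.publish(sink, transform(message))
    return offset + len(batch)


broker = InMemoryBroker()
for event in ("click:home", "click:pricing", "click:docs"):
    broker.publish("raw-events", event)

# One pass of the processor: read raw events, normalize, write downstream.
next_offset = run_processor(
    broker, "raw-events", "normalized-events", transform=str.upper
)
print(broker.topics["normalized-events"])
```

A real stream processor adds what this sketch leaves out, such as partitioning, consumer groups, and durable offset tracking, but the shape of the loop is the same.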
While Apache Kafka has always been very friendly from a developer’s point of view, it has been something of a mixed bag operationally. Getting a small cluster up and running is relatively easy, but maintaining a large cluster is often fraught with issues (e.g. leader partition swapping after a Kafka broker failure).
Further, the approach taken for replicating data between clusters, via a utility called MirrorMaker, has been a surefire way of getting SREs to pull out their hair. Indeed, MirrorMaker is considered such a problem that companies like Uber have created their own system for replicating across data centers (uReplicator). Confluent includes Confluent Replicator as part of its enterprise offering of Apache Kafka. Speaking as someone who has had to maintain a MirrorMaker setup, it’s a shame that Replicator isn’t part of the open source version.
However, it’s definitely not all bad news on the operational front. Much work has been done in the current Apache Kafka 1.x series to reduce some of the headaches of running a cluster. Recently there have been some changes that allow the system to run large clusters of more than 200,000 partitions in a more streamlined manner, and improvements like adding “dead letter” queues to Kafka Connect make identifying and recovering from issues in data sources and sinks so much easier. We’re also likely to see production-level support of running Apache Kafka on Kubernetes in 2019 (via Helm charts and a Kubernetes operator).
Back in 2014, three of the original developers of Kafka (Jun Rao, Jay Kreps, and Neha Narkhede) formed Confluent, which provides additional enterprise features in its Confluent Platform such as the aforementioned Replicator, Control Center, additional security plug-ins, and the usual support and professional services offerings. Confluent also has a cloud offering, Confluent Cloud, which is a fully managed Confluent Platform service that runs on Amazon Web Services or Google Cloud Platform, if you’d prefer not to deal with some of the operational overhead of running clusters yourself.
If you’re locked into AWS and using Amazon services, note that Amazon has introduced a public preview of Amazon Managed Streaming for Kafka (MSK), which is a fully managed Kafka service within the AWS ecosystem. (Note also that Amazon MSK isn’t provided in partnership with Confluent, so running MSK won’t get you all of the features of Confluent Platform, but only what’s provided in the open source Apache Kafka.)
Apache Pulsar
Given the Apache Software Foundation’s predilection for picking up projects that seem to duplicate functionality (would you like Apache Apex, Apache Flink, Apache Heron, Apache Samza, Apache Spark, or Apache Storm for your directed acyclic graph data processing needs?), you would be forgiven for looking right past the announcements about Apache Pulsar becoming a top-level Apache project before selecting Apache Kafka as a trusted option for your messaging needs. But Apache Pulsar deserves a look.
Apache Pulsar was born at Yahoo, where it was created to address the needs of the organization that other open-source offerings could not provide at the time. As a result, Pulsar was built from the ground up to handle millions of topics and partitions with full support for geo-replication and multi-tenancy.
Under the covers, Apache Pulsar uses Apache BookKeeper for maintaining its storage needs, but there’s a twist: Apache Pulsar has a feature called Tiered Storage that is quite something. One of the problems of distributed log systems is that, while you want the data to remain in the log platform for as long as possible, disk drives are not infinite in size. At some point, you make the decision to either delete those messages or store them elsewhere, where they can potentially be replayed through the data pipeline if needed in the future. That works, but it can be operationally complicated. Apache Pulsar, through Tiered Storage, can automatically move older data to Amazon S3, Google Cloud Storage, or Azure Blob Storage, and still present a transparent view back to the client; the client can read from the start of time just as if all of the messages were present in the log.
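The idea behind Tiered Storage can be sketched as a toy model in plain Python. This is not Pulsar's implementation (Pulsar offloads whole log segments rather than individual messages); `TieredLog` and `hot_capacity` are hypothetical names, and two in-memory lists stand in for local disk and object storage. What it shows is the transparent view: the reader addresses messages by offset and never learns which tier served them.

```python
class TieredLog:
    """Toy append-only log that moves its oldest entries to a cold tier."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.cold = []  # stands in for object storage such as Amazon S3
        self.hot = []   # stands in for the brokers' local disks

    def append(self, message):
        """Append a message and return its logical offset."""
        self.hot.append(message)
        # Once the hot tier is over capacity, offload the oldest entries.
        overflow = len(self.hot) - self.hot_capacity
        if overflow > 0:
            self.cold.extend(self.hot[:overflow])
            self.hot = self.hot[overflow:]
        return len(self.cold) + len(self.hot) - 1

    def read(self, offset):
        """Read by logical offset, regardless of which tier holds the data."""
        if offset < 0:
            raise IndexError("offset must be non-negative")
        if offset < len(self.cold):
            return self.cold[offset]
        return self.hot[offset - len(self.cold)]


log = TieredLog(hot_capacity=2)
for i in range(5):
    log.append(f"m{i}")

# The three oldest messages now live in the cold tier, yet a replay from
# offset 0 looks exactly like reading a single contiguous log.
print([log.read(i) for i in range(5)])
```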
Just like Apache Kafka, Apache Pulsar has grown an ecosystem for data processing (although it also provides adaptors for Apache Spark and Apache Storm). Pulsar IO is the equivalent of Kafka Connect for connecting to other data systems as either sources or sinks, and Pulsar Functions provides data processing functionality. SQL querying is provided by using an adaptor for Facebook’s open-sourced Presto engine.
An interesting wrinkle is that Pulsar Functions and Pulsar IO run within a standard Pulsar cluster rather than being separate processes that could potentially run anywhere. While this is a reduction in flexibility, it does make things much simpler from an operational point of view. (There’s a local run mode that could be abused to run functions outside of the cluster, but the documentation goes out of its way to say “Don’t do this!”)
Apache Pulsar also offers different methods of running functions inside the cluster: They can be run as separate processes, as Docker containers, or as threads running in a broker’s JVM process. This ties in with the deployment model for Apache Pulsar, which already supports Kubernetes or Mesosphere DC/OS in production. One thing to be aware of is that Pulsar Functions, Pulsar IO, and SQL are relatively new additions to Apache Pulsar, so expect a few sharp edges if you use them.
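To give a sense of how lightweight a Pulsar Function can be, here is a minimal sketch that, as I understand Pulsar's language-native Python interface, would be a complete function: a module exposing a `process` function that receives each message and returns the value to publish. Treat the exact deployment details as an assumption and check the documentation for your Pulsar version.

```python
def process(input):
    """Append an exclamation mark to each incoming message.

    Assumption: the Pulsar Functions runtime calls `process` once per
    message and publishes the return value to the output topic.
    """
    return "{}!".format(input)


# Local sanity check, run outside of any Pulsar cluster.
print(process("hello"))
```

Because the function has no dependency on a Pulsar client library, it can be unit tested like any other Python function before it is deployed into the cluster.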
There is also a limited, Java-only, Kafka-compatible API wrapper, so you can potentially integrate existing Apache Kafka applications into an Apache Pulsar infrastructure. This is probably better suited to exploratory testing and an interim migration plan than a production solution, but it’s nice to have!
In a similar manner to Confluent, the developers of Apache Pulsar at Yahoo (Matteo Merli and Sijie Guo) have formed a spinoff company, Streamlio, where they’re co-founders along with the creators of Apache Heron (Karthik Ramasamy and Sanjeev Kulkarni). Streamlio’s enterprise offering includes the usual commercial support and professional services solutions, along with a closed-source management console, but things like efficient and durable multi-tenancy support are part of the core open source product.
Apache Kafka or Apache Pulsar?
Apache Kafka is a mature, resilient, and battle-tested product. It has clients written in almost every popular language, as well as a host of supported connectors for different data sources in Kafka Connect. With managed services being offered by Amazon and Confluent, it’s easy to get a large Kafka cluster up, running, and maintained, and much easier than in previous years. I continue to use Apache Kafka in new projects, and I will likely do so for many years to come.
However, if you’re going to be building a messaging system that has to be multi-tenant or geo-replicated from the start, or that has large data storage needs, plus the need to easily query and process all that data no matter how old it is, then I suggest kicking the tires of Apache Pulsar. It definitely fits some use cases that Apache Kafka can struggle with, while also working well in terms of the core features you need from a distributed log platform. And if you don’t mind being on the cutting edge in terms of documentation and Stack Overflow questions answered, so much the better!