Saving all of your data in a data warehouse and analyzing it using a nightly batch process is no longer sufficient to monitor and manage a business or process in a timely fashion. Instead, you should perform simple real-time analysis of data streams in addition to saving the data for later in-depth analysis.
Apache Kafka, originally developed at LinkedIn, is one of the most mature platforms for event streaming. Adjuncts to Kafka include Apache Flink, Apache Samza, Apache Spark, Apache Storm, Databricks, and Ververica. Alternatives to Kafka include Amazon Kinesis, Apache Pulsar, Azure Stream Analytics, Confluent, and Google Cloud Dataflow.
One downside of Kafka is that setting up large Kafka clusters can be tricky. Commercial cloud implementations of Kafka, such as Confluent Cloud and Amazon Managed Streaming for Apache Kafka, fix that and other issues, for a price.
Apache Kafka defined
Apache Kafka is an open source, Java/Scala, distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Kafka events are organized and durably stored in topics.
Kafka has five core APIs:
- The Admin API to manage and inspect topics, brokers, and other Kafka objects.
- The Producer API to publish (write) a stream of events to one or more Kafka topics.
- The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
- The Kafka Streams API to implement stream processing applications and microservices. It provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams.
- The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka. For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you typically don’t need to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.
To implement stream processing that is more complicated than you can easily handle with the Streams API, you can integrate Kafka with Apache Samza (discussed below) or Apache Flink.
For a commercially supported version of Apache Kafka, consider Confluent.
How does Kafka work?
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers on-premises as well as in cloud environments.
Servers: Kafka is run as a cluster of one or more servers that can span multiple data centers or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant. If any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.
Clients: Kafka clients allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some clients included, which are augmented by dozens of clients provided by the Kafka community. Kafka clients are available for Java and Scala including the higher-level Kafka Streams library, and for Go, Python, C/C++, and many other programming languages as well as REST APIs.
What is Apache Samza?
Apache Samza is an open source, Scala/Java, distributed stream processing framework that was originally developed at LinkedIn, in conjunction with (Apache) Kafka. Samza allows you to build stateful applications that process data in real time from multiple sources, including Apache Kafka. Samza features include:
- Unified API: A simple API to describe application logic in a manner independent of the data source. The same API can process both batch and streaming data.
- Pluggability at every level: Process and transform data from any source. Samza offers built-in integrations with Apache Kafka, AWS Kinesis, Azure Event Hubs (Azure-native Kafka as a service), Elasticsearch, and Apache Hadoop. Also, it’s quite easy to integrate with your own sources.
- Samza as an embedded library: Integrate with your existing applications and eliminate the need to spin up and operate a separate cluster for stream processing. Samza can be used as a lightweight client library embedded in Java/Scala applications.
- Write once, run anywhere: Flexible deployment options to run applications anywhere—from public clouds to containerized environments to bare-metal hardware.
- Samza as a managed service: Run stream processing as a managed service by integrating with popular cluster managers including Apache YARN.
- Fault-tolerance: Transparently migrates tasks along with their associated state in the event of failures. Samza supports host-affinity and incremental checkpointing to enable fast recovery from failures.
- Massive scale: Battle-tested on applications that use several terabytes of state and run on thousands of cores. Samza powers multiple large companies including LinkedIn, Uber, TripAdvisor, and Slack.
Kafka and Confluent
Confluent Platform is a commercial adaptation of Apache Kafka by the original creators of Kafka, offered on-premises and in the cloud. Confluent Cloud was rebuilt from the ground up as a serverless, elastic, cost-effective, and fully managed cloud-native service, and runs on Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
Kafka on major cloud service providers
Amazon Managed Streaming for Apache Kafka (MSK) coexists with Confluent Cloud and Amazon Kinesis on AWS. All three perform essentially the same service. On Microsoft Azure, Apache Kafka on HDInsight and Confluent Cloud coexist with Azure Event Hubs and Azure Stream Analytics. On Google Cloud, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Pub/Sub, and Google Cloud BigQuery coexist with Confluent Cloud.
Kafka usage examples
Tencent (a Confluent customer) used Kafka to build data pipelines for cross-region log ingestion, machine learning platforms, and asynchronous communication among microservices. Tencent needed more throughput and lower latency than it could get from a single Kafka cluster, so it wrapped its Kafka clusters in a proxy layer to create a federated Kafka design that handles more than 10 trillion messages per day with maximum cluster bandwidth of 240 Gb/s.
Microsoft Azure built a prototype end-to-end IoT data processing solution with Confluent Cloud, MQTT brokers and connectors, Azure Cosmos DB’s analytical store, Azure Synapse Analytics, and Azure Spring Cloud. The referenced article includes all setup steps.
ACERTUS built an end-to-end vehicle fleet management system with Confluent Cloud, ksqlDB (a SQL database specialized for streaming data), AWS Lambda, and a Snowflake data warehouse. ACERTUS reports generating more than $10 million in revenue in the first year from this system, which replaced a largely manual system.
As we’ve seen, Kafka can solve real, large-scale problems that require streaming data. At the same time, there are many ways to design Kafka-based solutions and interconnect Kafka with analysis and storage.