Streaming data is generated continuously, often by thousands of data sources, such as sensors or server logs. Streaming data records are often small, perhaps a few kilobytes each, but there are many of them, and in many cases the stream goes on and on without ever stopping. In this article, we will provide some background and discuss how to choose a streaming data platform.
How do streaming data platforms work?
Ingestion and data export. In general, both data ingestion and data export are handled by data connectors specialized for the external systems involved. In some cases there is an ETL (extract, transform, and load) or ELT (extract, load, and transform) process to reorder, clean, and condition the data for its destination.
Ingestion for streaming data often reads data generated by many sources, sometimes thousands of them, as in the case of IoT (internet of things) devices. Data is often exported to a data warehouse or data lake for deep analysis and machine learning.
Pub/sub and topics. Many streaming data platforms, including Apache Kafka and Apache Pulsar, implement a publish and subscribe model, with data organized into topics. Ingested data may be tagged with one or more topics, so that clients subscribed to any of those topics can receive the data. For example, in an online news publishing use case, an article about a politician’s speech might be tagged as Breaking News, US News, and Politics, so that it could be included in each of those sections by the page layout software under the supervision of the (human) section editor.
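To make the model concrete, here is a minimal sketch of the news-tagging example using Kafka's Java Producer API. The broker address, topic names, key, and payload are all hypothetical; in Kafka, "tagging" an article with several topics amounts to publishing the record to each of them, so that subscribers to any of the topics receive it.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class NewsPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; point this at your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        String articleId = "speech-2023-04-01";          // hypothetical record key
        String articleJson = "{\"headline\": \"...\"}";  // hypothetical payload

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // "Tagging" with several topics means publishing the record to each one;
            // clients subscribed to any of these topics will receive the article.
            for (String topic : List.of("breaking-news", "us-news", "politics")) {
                producer.send(new ProducerRecord<>(topic, articleId, articleJson));
            }
        }
    }
}
```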
Data analytics. Data streaming platforms usually offer opportunities to perform analytics at two points in the pipeline: first as part of the real-time stream, and second at a persistent endpoint. For example, Apache Kafka has simple real-time analytical capabilities in its Streams API, and can also call out to Apache Samza or another analysis processor for more complex real-time calculations. There are additional opportunities for analytics and machine learning once the data has been posted to a persistent data store; this processing can be near-real-time or a periodic batch process.
Serverless functions such as AWS Lambda may be used with data streams to analyze streaming data records using custom programs. Serverless functions are an alternative to stream analytics processors such as Apache Flink.
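As a sketch of what that looks like, the following hypothetical AWS Lambda handler (written against the aws-lambda-java-events library) receives batches of Kinesis records and applies a placeholder analysis to each payload; the anomaly check is illustrative only.

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import java.nio.charset.StandardCharsets;

// A minimal Lambda handler invoked with batches of Kinesis stream records.
public class StreamAnalyzer implements RequestHandler<KinesisEvent, Void> {
    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            // Decode the raw record bytes into text.
            String payload = StandardCharsets.UTF_8
                    .decode(record.getKinesis().getData())
                    .toString();
            // Placeholder analysis: flag any record containing "ERROR".
            if (payload.contains("ERROR")) {
                context.getLogger().log("Anomaly detected: " + payload);
            }
        }
        return null;
    }
}
```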
Clustering. Data streaming platforms are rarely single instances except for development and test installations. Production streaming platforms need to scale, so they are usually run as clusters of nodes. The newest way to implement this for cloud event streaming is on a serverless, elastic platform such as Confluent Cloud.
Streaming use cases
The following list of use cases comes from the open source Kafka documentation:
- To process payments and financial transactions in real time, such as in stock exchanges, banks, and insurance companies.
- To track and monitor cars, trucks, fleets, and shipments in real time, such as in logistics and the automotive industry.
- To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind farms.
- To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.
- To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.
- To connect, store, and make available data produced by different divisions of a company.
- To serve as the foundation for data platforms, event-driven architectures, and microservices.
Kafka also lists named customer use cases, for example: The New York Times uses Apache Kafka and the Kafka Streams API to store and distribute, in real time, published content to the various applications and systems that make it available to the readers.
Criteria for choosing a streaming data platform
The key performance indicators (KPIs) for streaming data platforms are event rate, throughput (event rate times event size), latency, reliability, and the number of topics (for pub-sub architectures). Scaling can be accomplished by adding nodes in a clustered geometry, which also increases reliability. Scaling is automatic in serverless platforms. Not all streaming data platforms will necessarily meet all your KPIs.
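For a rough sense of scale, a stream of 50,000 events per second at an average size of 2 KB per event implies a throughput of about 100 MB per second, so a platform that comfortably meets your event-rate KPI may still fall short on throughput if your events are large.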
Client programming language support can be a differentiator among streaming data platforms, since your own developers or consultants will be writing client applications. For example, open source Apache Kafka officially gives you a choice of calling the Streams API from Java or Scala (both JVM languages), but the community-developed librdkafka project provides a C/C++ client, and clients for other languages, including Go, .NET, and Python, are built on top of it. Confluent maintains its own official, signed set of binaries for librdkafka.
In contrast, Apache Storm was designed from the ground up to be usable with any programming language, via the Apache Thrift cross-language compiler. Apache Pulsar supports clients in Java, Go, Python, C++, Node.js, WebSocket, and C#. The Amazon Kinesis Streams API supports all languages that have an AWS SDK: Java, JavaScript, .NET, PHP, Python, Ruby, Go, C++, and Swift.
Connection support can be another differentiator. Ideally, connectors for all your data sources should already be available and tested. For example, Confluent lists over 120 connectors available for Kafka, some source-only (e.g. Splunk), some sink-only (e.g. Snowflake), and some both sink and source (e.g. Microsoft SQL Server). Confluent’s list includes community-developed connectors. If you need to write your own Kafka Connectors in Java, you can use the Kafka Connect API.
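As an illustration, a connector is typically configured with a small properties file rather than code. This hypothetical example uses FileStreamSource, a demo connector that ships with open source Kafka, to stream lines of a file into a topic; the file path and topic name are placeholders.

```properties
# Minimal standalone connector configuration (hypothetical path and topic).
name=article-file-source
connector.class=FileStreamSource
tasks.max=1
file=/var/log/articles.txt
topic=incoming-articles
```

Passing this file (along with a worker configuration) to Kafka's bin/connect-standalone.sh script starts the connector without writing any Java.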
You need to consider the locations of your data sources and sinks when deciding where to host your streaming data platform. In general, you want to minimize the latency of the stream, which implies keeping the components near each other. On the other hand, some of the streaming data platforms support geographically distributed clusters, which can reduce the latency for far-flung sources and sinks.
You also need to consider the manageability of your candidate streaming data platforms. Some platforms have a reputation for being hard to configure and maintain unless you have expertise in running them. Others, particularly the commercially supported cloud services, have a reputation for being very easy to manage.
Key streaming data platforms and services
Amazon Kinesis
Amazon Kinesis lets you collect, process, and analyze real-time, streaming data at scale. It has three services for data (Data Streams, Data Firehose, and Data Analytics) and one for media (Video Streams). Kinesis Data Streams is an ingestion service that can continuously capture gigabytes of data per second from thousands of sources. Kinesis Data Analytics can process data streams in real time with SQL or Apache Flink. Kinesis Data Firehose can capture, transform, and load data streams into AWS data stores for near-real-time analytics with existing business intelligence tools. You can use AWS Lambda serverless functions instead of Kinesis Data Analytics if you wish to process the stream with a program instead of using SQL or Flink.
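As a sketch, writing a record to a Kinesis data stream takes only a few lines of Java with the AWS SDK (version 2 here). The stream name, partition key, and payload below are hypothetical, and the stream must already exist.

```java
import java.nio.charset.StandardCharsets;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class KinesisWriter {
    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.create()) {
            PutRecordRequest request = PutRecordRequest.builder()
                    .streamName("sensor-readings")  // hypothetical stream
                    .partitionKey("sensor-42")      // records with the same key land on the same shard
                    .data(SdkBytes.fromString("{\"temperature\": 21.5}", StandardCharsets.UTF_8))
                    .build();
            kinesis.putRecord(request);
        }
    }
}
```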
Apache Flink
Apache Flink is an open source, Java/Scala/Python framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and perform computations at in-memory speed and at any scale. Flink integrates with common cluster resource managers such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also run as a stand-alone cluster.
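Here is a minimal sketch of a Flink DataStream job in Java: it reads lines from a hypothetical socket source, splits them into words, and maintains a stateful running count per word. The source address and job name are illustrative only.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical unbounded source: a socket emitting lines of text.
        env.socketTextStream("localhost", 9999)
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split("\\s+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(t -> t.f0)  // partition the stream by word
           .sum(1)            // stateful running count per word
           .print();

        env.execute("Streaming word count");
    }
}
```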
Apache Kafka
Apache Kafka is an open source, Java/Scala distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Kafka events are organized and durably stored in topics. Kafka was originally developed at LinkedIn, and, together with the commercial Confluent distribution, currently holds the majority of the event streaming market share.
Kafka has five core APIs:
- The Admin API to manage and inspect topics, brokers, and other Kafka objects.
- The Producer API to publish (write) a stream of events to one or more Kafka topics.
- The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them.
- The Kafka Streams API to implement stream processing applications and microservices. It provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams. (A minimal Streams application is sketched after this list.)
- The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka. For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you typically don’t need to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.
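As a sketch of the Streams API mentioned above, the following hypothetical application reads one topic, filters it, and writes the result to another topic. The application ID, broker address, and topic names are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PageViewFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-filter"); // hypothetical app ID
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read an input topic, transform the stream, and write to an output topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views"); // hypothetical input topic
        views.filter((key, value) -> value.contains("\"country\":\"US\""))
             .to("us-page-views");                                    // hypothetical output topic

        new KafkaStreams(builder.build(), props).start();
    }
}
```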
Apache Pulsar
Apache Pulsar is an open source, cloud-native, Java/C++/Python distributed pub-sub messaging and streaming platform. Pulsar was originally developed at Yahoo. Pulsar features include:
- Native support for multiple clusters in a Pulsar instance, with seamless geo-replication of messages across clusters.
- Very low publish and end-to-end latency.
- Seamless scalability to over a million topics.
- A simple client API with bindings for Java, Go, Python, and C++ (a minimal Java example follows this list).
- Multiple subscription modes (exclusive, shared, and failover) for topics.
- Guaranteed message delivery with persistent message storage provided by Apache BookKeeper.
- A serverless, lightweight computing framework, Pulsar Functions, for stream-native data processing.
- A serverless connector framework, Pulsar IO, which is built on Pulsar Functions, that makes it easier to move data in and out of Apache Pulsar.
- Tiered storage that offloads data from hot/warm storage to cold/long-term storage (such as Amazon S3 and Google Cloud Storage) when the data is aging out.
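For a concrete feel for the client API, here is a minimal, hypothetical Java example that publishes one message to a topic and consumes it with a shared subscription (one of the modes listed above). The service URL, topic, and subscription name are placeholders.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class PulsarExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical service URL; point this at your Pulsar cluster.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("sensor-readings")       // hypothetical topic
                .create();
        producer.send("temperature=21.5");

        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("sensor-readings")
                .subscriptionName("analytics")  // hypothetical subscription
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();
        Message<String> msg = consumer.receive();
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}
```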
Apache Samza
Apache Samza is a distributed open source, Scala/Java stream processing framework that was originally developed at LinkedIn, in conjunction with (Apache) Kafka. Samza allows you to build stateful applications that process data in real time from multiple sources, including Apache Kafka. Samza features include:
- Unified API: A simple API to describe application logic in a manner independent of the data source. The same API can process both batch and streaming data. (A minimal task sketch follows this list.)
- Pluggability at every level: Process and transform data from any source. Samza offers built-in integrations with Apache Kafka, AWS Kinesis, Azure Event Hubs (Azure-native Kafka as a service), Elasticsearch, and Apache Hadoop. Also, it’s quite easy to integrate with your own sources.
- Samza as an embedded library: Integrate with your existing applications and eliminate the need to spin up and operate a separate cluster for stream processing. Samza can be used as a lightweight client library embedded in Java/Scala applications.
- Write once, run anywhere: Flexible deployment options to run applications anywhere — from public clouds to containerized environments to bare-metal hardware.
- Samza as a managed service: Run stream processing as a managed service by integrating with popular cluster managers including Apache YARN.
- Fault-tolerance: Transparently migrates tasks along with their associated state in the event of failures. Samza supports host-affinity and incremental checkpointing to enable fast recovery from failures.
- Massive scale: Battle-tested on applications that use several terabytes of state and run on thousands of cores. Samza is used at multiple large companies, including LinkedIn, Uber, TripAdvisor, and Slack.
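As an illustration of Samza's low-level API (the unified high-level API sits above it), here is a minimal, hypothetical StreamTask that filters events from an input stream and forwards matches to an output stream. The system and stream names, and the filter logic, are placeholders.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class FilterTask implements StreamTask {
    // Output stream on the "kafka" system; both names are hypothetical.
    private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-events");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String event = (String) envelope.getMessage();
        // Placeholder logic: forward only events that mention "purchase".
        if (event.contains("purchase")) {
            collector.send(new OutgoingMessageEnvelope(OUTPUT, event));
        }
    }
}
```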
Apache Spark
Apache Spark is a multi-language engine, written primarily in Scala, for executing data engineering, data science, and machine learning on single-node machines or clusters. It handles both batch data and real-time streaming data. Spark originated at U.C. Berkeley, and the authors of Spark founded Databricks.
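As a sketch of how the same API spans batch and streaming, this minimal, hypothetical Structured Streaming job in Java reads lines from a socket source, filters them with ordinary DataFrame operations, and prints matches continuously. The source address and filter are illustrative only.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class SparkStreamingExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("StructuredStreamingSketch")
                .getOrCreate();

        // Hypothetical source: lines of text arriving on a local socket.
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // The same DataFrame API used for batch data applies to the stream.
        Dataset<Row> errors = lines.filter(lines.col("value").contains("ERROR"));

        // Continuously print matching lines to the console.
        StreamingQuery query = errors.writeStream()
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```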
Apache Storm
Apache Storm is a distributed stream processing computation framework written predominantly in Clojure. In Storm, a stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. A topology is a graph of spouts and bolts that are connected with stream groupings; topologies define the logic that processes streams. A spout is a source of streams in a topology. All processing in topologies is done in bolts. Storm integrates with many other systems and libraries, including Kafka, Cassandra, Redis, and Kinesis.
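To make spouts, bolts, and stream groupings concrete, here is a minimal Java topology modeled on Storm's classic exclamation example. TestWordSpout ships with Storm as a demo source; the component names are placeholders, and the LocalCluster usage assumes Storm 2.x, where it is AutoCloseable.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ExclamationTopology {
    // All processing happens in bolts; this one appends "!!!" to each word.
    public static class ExclamationBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0) + "!!!"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout is the source of the stream; TestWordSpout emits demo words.
        builder.setSpout("words", new TestWordSpout());
        // Connect the bolt to the spout with a shuffle stream grouping.
        builder.setBolt("exclaim", new ExclamationBolt()).shuffleGrouping("words");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("exclamation", new Config(), builder.createTopology());
            Thread.sleep(10_000); // run the in-process topology briefly
        }
    }
}
```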