Streaming data is usually discussed in the context of big data and event stream processing. It is data that is generated continuously, often by thousands of data sources, such as sensors or server logs. Streaming data records are often small, perhaps a few kilobytes each, but there are many of them, and in many cases the stream goes on and on without ever stopping.
Historical data, on the other hand, normally goes through a batch ETL (extract, transform, and load) process before going into an analysis database, such as a data warehouse, data lake, or data lakehouse. That’s fine if you’re not in a hurry. Streaming data, by contrast, often needs to be processed quickly so that you can act on the results in as close to real time as possible.
Streaming data processing software typically analyzes the data incrementally, performing real-time aggregation, correlation, filtering, or sampling. The stream is often stored as well, so that it can contribute to the historical record. Incremental processing can be performed on a record-by-record basis or over sliding time windows.
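As a rough illustration of windowed incremental processing, here is a minimal plain-Python sketch that keeps a sliding window of recent readings and flags values far above the window’s running average; the window size, threshold, and record format are hypothetical.

```python
from collections import deque

WINDOW_SIZE = 60   # keep the 60 most recent readings (hypothetical)
THRESHOLD = 3.0    # flag readings more than 3x the window average (hypothetical)

window = deque(maxlen=WINDOW_SIZE)

def process_record(value: float) -> None:
    """Incrementally process a single streaming record."""
    if window and value > THRESHOLD * (sum(window) / len(window)):
        print(f"Anomaly detected: {value}")
    window.append(value)

# In a real system the values would arrive from a stream; here we simulate one.
for reading in [1.0, 1.1, 0.9, 1.2, 9.8, 1.0]:
    process_record(reading)
```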
By analyzing stream data in real time, you can detect unusual events, significant deviations from normal values, and developing trends. That can then inform real-time responses, such as turning on the irrigation when a field is drying out, or buying a stock when its price dips below a target value. Sources of streaming data include the following:
- Sensors, such as those in industrial machines, vehicles, and farm machinery
- Stock transaction pricing data from stock exchanges
- Mobile device location data
- Clicks on web properties
- Game interactions
- Server logs
- Database transactions
Approaches to processing streaming data
There are three ways to deal with streaming data: batch process it at intervals ranging from hours to days, process the stream in real time, or do both in a hybrid process.
Batch processing has the advantage of being able to perform deep analysis, including machine learning, and the disadvantage of having high latency. Stream processing has the advantage of low latency, and the disadvantage of only being able to perform simple analysis, such as calculating average values over a time window and flagging deviations from the expected values.
Hybrid processing combines both methods and reaps the benefits of both. In general, the data is processed as a stream and simultaneously branched off to storage for later batch processing. To give an example, consider an acoustic monitor attached to an industrial machine. The stream processor can detect an abnormal squeak and issue an alert; the batch processor can invoke a model to predict the time to failure based on the squeak as it progresses, and schedule maintenance for the machine long before it fails.
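A minimal sketch of that hybrid pattern, in plain Python with hypothetical record fields, a made-up squeak threshold, and a local file standing in for the batch store:

```python
import json
import time

def handle_stream(records, archive_path="raw_events.jsonl", squeak_threshold=0.8):
    """Hybrid handling: alert in real time, and archive every record for later batch analysis."""
    with open(archive_path, "a") as archive:
        for record in records:
            # Real-time branch: cheap check, immediate alert.
            if record.get("squeak_level", 0.0) > squeak_threshold:
                print(f"ALERT: abnormal squeak on machine {record.get('machine_id')}")
            # Batch branch: durably store the raw record for later model training and scoring.
            archive.write(json.dumps(record) + "\n")

# Simulated records; a real deployment would read these from a message broker.
handle_stream([
    {"machine_id": "press-7", "squeak_level": 0.2, "ts": time.time()},
    {"machine_id": "press-7", "squeak_level": 0.9, "ts": time.time()},
])
```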
Software for processing streaming data
Amazon Kinesis lets you collect, process, and analyze real-time streaming data at scale. Kinesis has three services for data (Data Streams, Data Firehose, and Data Analytics) and one for media (Video Streams).
Kinesis Data Streams is an ingestion service that can continuously capture gigabytes of data per second from hundreds of thousands of sources. Kinesis Data Analytics can process data streams in real time with SQL or Apache Flink. Kinesis Data Firehose can capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools. You can use AWS Lambda serverless functions instead of Kinesis Data Analytics if you wish to process the stream with a program instead of using SQL or Flink.
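As a rough sketch of writing to and reading from a Kinesis data stream with the boto3 Python SDK (it assumes AWS credentials are configured; the stream name, region, and record fields are hypothetical):

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Write one record to a Kinesis data stream.
kinesis.put_record(
    StreamName="sensor-readings",
    Data=json.dumps({"sensor_id": "s-42", "temperature": 21.7}).encode("utf-8"),
    PartitionKey="s-42",
)

# Poll one shard for records (production consumers typically use the KCL,
# Lambda, or Kinesis Data Analytics instead of raw polling).
shard_id = kinesis.describe_stream(StreamName="sensor-readings")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="sensor-readings", ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]
```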
Apache Flink is an open-source, Java/Scala/Python framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and perform computations at in-memory speed and at any scale. It integrates with all common cluster resource managers, such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also run as a stand-alone cluster.
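Here is a minimal sketch of Flink’s DataStream API using PyFlink, Flink’s Python API; in practice the source would be Kafka, Kinesis, or files rather than an in-memory collection, and the threshold is arbitrary.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A toy bounded source; real jobs would use a Kafka or file source connector.
readings = env.from_collection([("sensor-1", 21.7), ("sensor-1", 35.2), ("sensor-2", 19.4)])

# Keep only unusually high readings and print them.
readings.filter(lambda r: r[1] > 30.0).print()

env.execute("temperature_alerts")
```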
Apache Kafka is an open-source, distributed, Java/Scala event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Kafka events are organized and durably stored in topics. Kafka was originally developed at LinkedIn. It has five core APIs:
- The Admin API to manage and inspect topics, brokers, and other Kafka objects.
- The Producer API to publish (write) a stream of events to one or more Kafka topics.
- The Consumer API to subscribe to (read) one or more topics and to process the stream of events produced to them. (A minimal producer/consumer sketch follows this list.)
- The Kafka Streams API to implement stream processing applications and microservices. The Streams API provides higher-level functions to process event streams, including transformations, stateful operations like aggregations and joins, windowing, processing based on event time, and more. Input is read from one or more topics in order to generate output to one or more topics, effectively transforming the input streams to output streams.
- The Kafka Connect API to build and run reusable data import/export connectors that consume (read) or produce (write) streams of events from and to external systems and applications so they can integrate with Kafka. For example, a connector to a relational database like PostgreSQL might capture every change to a set of tables. However, in practice, you typically don’t need to implement your own connectors because the Kafka community already provides hundreds of ready-to-use connectors.
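As a minimal sketch of the Producer and Consumer APIs, here is a Python example using the third-party kafka-python client (the core Kafka clients are Java/Scala); the broker address, topic, and consumer group are hypothetical.

```python
from kafka import KafkaConsumer, KafkaProducer

# Publish one event to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("machine-events", b'{"machine_id": "press-7", "squeak_level": 0.9}')
producer.flush()

# Subscribe to the topic and process events as they arrive.
consumer = KafkaConsumer(
    "machine-events",
    bootstrap_servers="localhost:9092",
    group_id="maintenance-alerts",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating after 10 idle seconds (for this demo)
)
for message in consumer:
    print(message.value)
```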
Apache Pulsar is an open-source, distributed, Java/C++/Python publish-and-subscribe messaging and streaming platform. Pulsar was originally developed at Yahoo. Its features include:
- Native support for multiple clusters in a Pulsar instance, with seamless geo-replication of messages across clusters.
- Very low publish and end-to-end latency.
- Seamless scalability to over a million topics.
- A simple client API with bindings for Java, Go, Python, and C++ (see the sketch after this list).
- Multiple subscription modes (exclusive, shared, and fail-over) for topics.
- Guaranteed message delivery with persistent message storage provided by Apache BookKeeper.
- Pulsar Functions, a serverless computing framework for stream-native data processing.
- Pulsar IO, a serverless connector framework built on Pulsar Functions that makes it easier to move data in and out of Apache Pulsar.
- Tiered Storage, which offloads data from hot/warm storage to long-term storage (such as Amazon S3 and Google Cloud Storage) as the data ages out.
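A minimal sketch using the pulsar-client Python package; the service URL, topic, and subscription names are hypothetical, and the shared subscription mode is just one of the options mentioned above.

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Publish one message to a topic.
producer = client.create_producer("machine-events")
producer.send(b'{"machine_id": "press-7", "squeak_level": 0.9}')

# Subscribe with a shared subscription (Exclusive and Failover are the other modes).
consumer = client.subscribe(
    "machine-events", "maintenance-alerts", consumer_type=pulsar.ConsumerType.Shared
)
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```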
Apache Samza is an open-source, distributed, Scala/Java stream processing framework that was originally developed at LinkedIn, in conjunction with Apache Kafka. Samza allows you to build stateful applications that process data in real time from multiple sources, including Apache Kafka. Samza features include:
- A unified API that allows you to describe your application logic in a manner independent of your data source. The same API can process both batch and streaming data.
- The ability to process and transform data from cloud and on-prem sources. Samza offers built-in integrations with Apache Kafka, AWS Kinesis, Azure Event Hubs (Azure-native Kafka as a service), Elasticsearch, and Apache Hadoop.
- Effortless integration with existing applications that eliminates the need to spin up and operate a separate cluster for stream processing. Samza can be used as a lightweight client library embedded in Java/Scala applications.
- Flexible deployment options that allow you to run applications anywhere from public clouds to containerized environments to bare-metal hardware.
- The ability to run stream processing as a managed service by integrating with popular cluster managers such as Apache YARN.
- Fault-tolerance that transparently migrates tasks and their associated state in the event of failures. Samza supports host-affinity and incremental checkpoints that enable fast recovery from failures.
- Massive scalability. Samza is battle-tested on applications that use several terabytes of state and run on thousands of cores. It powers multiple large companies including LinkedIn, Uber, TripAdvisor, and Slack.
Apache Spark is an open-source, multi-language engine, written primarily in Scala, for executing data engineering, data science, and machine learning on single-node machines or clusters. It handles both batch data and real-time streaming data. Spark originated at UC Berkeley, and the authors of Spark founded Databricks.
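A minimal PySpark Structured Streaming sketch that reads events from Kafka and maintains a running count per key, written to the console; the broker and topic names are hypothetical, and the Spark-Kafka connector package must be available to the job.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-counts").getOrCreate()

# Read a stream of events from a Kafka topic.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "machine-events")
    .load()
)

# Maintain a running count of events per key and print the full result table.
counts = events.groupBy(col("key")).count()
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```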
Apache Storm is an open-source, distributed stream processing computation framework written predominantly in Clojure. In Storm, a stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. A topology is a graph of spouts and bolts connected by stream groupings; topologies define the logic that processes streams. A spout is a source of streams in a topology, and all processing is done in bolts. Storm integrates with many other systems and libraries, including Kafka, Cassandra, Redis, and Kinesis.
Azure Stream Analytics is a real-time analytics and complex event processing engine designed to analyze and process high volumes of fast streaming data from multiple sources simultaneously. It can identify patterns and relationships in information extracted from a number of input sources, including devices, sensors, clickstreams, social media feeds, and applications, and use those patterns to trigger actions and initiate workflows such as creating alerts, feeding information to a reporting tool, or storing transformed data for later use.
Confluent is a commercial adaptation of Apache Kafka by the original creators of Kafka, offered for on-premises or cloud deployment. Confluent Cloud was rebuilt from the ground up as a serverless, elastic, cost-effective, and fully managed cloud-native service, and runs on AWS, Microsoft Azure, and Google Cloud Platform.
Google Cloud Dataflow is a fully managed, serverless, unified stream and batch data processing service based on Apache Beam. Apache Beam is a unified SDK, originally from Google, for Dataflow, Flink, Spark, and Hazelcast Jet.
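A minimal Apache Beam pipeline in Python; it runs locally on the DirectRunner by default, and the same code can be submitted to Dataflow, Flink, or Spark by choosing a different runner. The input values and threshold are placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([1.0, 1.1, 0.9, 9.8])
        | "FlagAnomalies" >> beam.Filter(lambda x: x > 3.0)
        | "Print" >> beam.Map(print)
    )
```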
Ververica is an enterprise stream processing platform by the original creators of Apache Flink. It provides multi-tenancy, authentication, role-based access control, and auto-scaling for Apache Flink.
To summarize, streaming data is generated continuously, often by thousands of data sources. Often, each record is only a few kilobytes of data. A hybrid scheme for quickly analyzing records in real time and additionally storing the data for in-depth analysis is often preferred. There are many good options for event streaming platforms, several of which are free, open-source, Apache projects, and several more of which are commercial enhancements of the Apache projects.