Apache Flink [Java] -
system for high-throughput, low-latency data stream processing that
supports stateful computation, data-driven windowing semantics and
iterative stream processing.
Apache Heron (incubating)
[Java] - a realtime, distributed, fault-tolerant stream processing
engine from Twitter.
Apache Samza [Scala/Java]
- distributed stream processing framework that builds on Kafka (messaging,
storage) and YARN (fault tolerance, processor isolation, security and
resource management).
Apache Spark Streaming
[Scala] - makes it easy to build scalable fault-tolerant streaming
applications.
Apache Storm
[Clojure/Java] - distributed real-time computation system. Storm is to
stream processing what Hadoop is to batch processing.
AthenaX [Java] - Uber’s
Stream Analytics Framework used in production
Faust [Python] - stream
processing library, porting the ideas from Kafka Streams to Python
Gearpump [Scala] -
lightweight real-time distributed streaming engine built on Akka.
Hazelcast Jet
[Java] - A general purpose distributed data processing engine, built on
top of Hazelcast.
hailstorm
[Haskell] - distributed stream processing with exactly-once semantics
based on Storm.
Maki Nage [Python] -
A stream processing framework for data scientists, based on Kafka and
ReactiveX.
mantis [Java] -
Netflix’s platform to build an ecosystem of realtime stream processing
applications
mupd8 (muppet)
[Scala/Java] - MapReduce-style framework for processing fast/streaming
data.
Onyx [Clojure] -
Distributed, masterless, high performance, fault tolerant data
processing.
s4 [Java] -
general-purpose, distributed, scalable, fault-tolerant, pluggable
platform that allows programmers to easily develop applications for
processing continuous unbounded streams of data.
SPQR [Java] - dynamic
framework for processing high-volume data streams through pipelines.
tigon [C++/Java] -
high-throughput real-time stream processing framework built on Hadoop and
HBase.
Teknek
[Java] - Simple, elegant stream processing with an interactive prototyping
shell, SOL (Stream Operator Language).
Trill [.NET/C#] - Trill
is a high-performance one-pass in-memory streaming analytics engine from
Microsoft Research.
Wallaroo [Python]
- A fast, stream-processing framework. Wallaroo makes it easy to react
to data in real-time. By eliminating infrastructure complexity, going
from prototype to production has never been simpler.
HStreamDB [Haskell] -
The streaming database built for IoT data storage and real-time
processing.
Kuiper [Go] - Lightweight
IoT edge analytics/streaming software implemented in Go that can run
on all kinds of resource-constrained edge devices.
Streaming Library
Apache Kafka Streams
[Java] - lightweight stream processing library included in Apache Kafka
(since version 0.10).
Akka Streams [Scala] - stream
processing library on Akka Actors.
Benthos [Go] - Benthos
is a high performance and resilient message streaming service, able to
connect various sources and sinks and perform arbitrary actions,
transformations and filters on payloads
monix [Scala] -
high-performance Scala / Scala.js library for composing asynchronous and
event-based programs.
Streamline
[Java] - Stream Analytics Framework by Hortonworks, designed as a
wrapper around existing streaming solutions like Storm. Aimed to allow
users to drag-and-drop streaming components to focus on business logic.
StreamAlert [Python]
- Airbnb’s Real-time Data Analysis and Alerting.
Swave [Scala] - A
lightweight Reactive Streams Infrastructure Toolkit for Scala.
Streamz [Python]
- A lightweight library for building pipelines to manage continuous
streams of data; supports complex pipelines that involve branching,
joining, flow control, feedback, back pressure, and so on.
Stream Ops
[Java] - A fully embeddable data streaming engine and stream processing
API for Java.
Tributary [Python]
- A Python library for constructing dataflow graphs. Supports
synchronous, reactive data streams built using Python generators that
mimic complex event processors, as well as lazily-evaluated acyclic
graphs and functional currying streams.
Streaming Application
straw [Python/Java] - A
platform for real-time streaming search.
storm-crawler
[Java] - Web crawler SDK based on Apache Storm.
IoT
sensorbee [Go] -
lightweight stream processing engine for IoT.
Apache Edgent
[Java] - a programming model and runtime that enables continuous
streaming analytics on gateways and edge devices which can work with
centralized systems to provide efficient and timely analytics across the
whole IoT ecosystem, from the center to the edge. Open sourced by IBM.
Apache StreamPipes
[Java] - a self-service (Industrial) IoT toolbox to enable non-technical
users to connect, analyze and explore IoT data streams.
DSL
Apache Beam [Java, Python,
SQL, Scala, Go] - unified model and set of language-specific SDKs for
defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration
Patterns (EIPs) and Domain Specific Languages (DSLs), open sourced by
Google.
coast [Scala] - a DSL that
builds DAGs on top of Samza and provides exactly-once semantics.
Esper [Java] -
component for complex event processing (CEP) and event series analysis.
Streamparse
[Python] - lets you run Python code against real-time streams of data
via Apache Storm.
summingbird [Scala]
- library that lets you write MapReduce programs that look like native
Scala or Java collection transformations and execute them on a number of
well-known distributed MapReduce platforms, including Storm and
Scalding.
Data Pipeline
Apache Kafka [Scala/Java]
- distributed, partitioned, replicated commit log service, which
provides the functionality of a messaging system, but with a unique
design.
Apache Pulsar
[Java] - distributed pub-sub messaging platform with a very flexible
messaging model and an intuitive client API.
brooklin [Java] - a
distributed system intended for streaming data between various
heterogeneous source and destination systems with high reliability and
throughput at scale, from LinkedIn (replaced databus).
databus [Java] -
LinkedIn’s source-agnostic distributed change data capture system.
flume [Java] -
distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data.
Gazette [Go] -
Distributed streaming infrastructure built on cloud storage which makes
it easy to mix and match batch and streaming paradigms.
LogDevice [C++] - a high-performance
distributed system by Facebook for streaming and storing sequential
data, using a log structure.
metaq [Java] -
Taobao’s highly available, high-performance distributed messaging system.
NATS streaming
[Go] - fast disk-backed messaging solution
nsq [Go] - realtime
distributed messaging platform designed to operate at scale, handling
billions of messages per day.
RudderStack
[Go] - an open source customer data infrastructure (Segment, mParticle
alternative).
suro [Java] - data
pipeline service for collecting, aggregating, and dispatching large
volumes of application events, including log data.
StreamSets Data Collector
[Java] - continuous big data ingestion infrastructure that reads from
and writes to a large number of end-points, including S3, JDBC, Hadoop,
Kafka, Cassandra and many others.
Online Machine Learning
Apache Samoa
[Java] - distributed streaming machine learning (ML) framework that
contains a programming abstraction for distributed streaming ML
algorithms.
DataSketches
[Java] - sketches library from Yahoo!.
streamDM [Scala] -
mining Big Data streams using Spark Streaming from Huawei.
StreamingBandit
[Python] - Provides a web server to quickly set up and evaluate possible
solutions to contextual multi-armed bandit (cMAB) problems.
StormCV [Java] -
enables the use of Apache Storm for video processing by adding computer
vision (CV) specific operations and data model.
trident-ml [Java]
- realtime online machine learning library based on Trident.
yurita [Scala] - Anomaly
detection framework built on Spark Structured Streaming from PayPal.
Streaming SQL
pipelinedb [C] -
An open-source relational database that runs SQL queries continuously on
streams, incrementally storing results in tables.
squall [Java] - Squall
executes SQL queries on top of Storm for doing online processing.
StreamCQL [Java]
- Continuous Query Language on RealTime Computation System.
ksqlDB [Java] - A
cloud-native, source-available database purpose-built for stream
processing applications.
Materialize [Rust] - A
source-available streaming SQL engine for maintaining materialized views
on data from message brokers and databases.
Siddhi [Java] - A
cloud native Streaming and Complex Event Processing engine that
understands Streaming SQL queries in order to capture events from
diverse data sources, process them, detect complex conditions, and
publish output to various endpoints in real time.
Benchmark
storm-benchmark
[Java] - a set of benchmarks to test Storm performance.
storm-perf-test
[Java] - a simple storm performance/stress test.
streaming-benchmarks
[Java] - Benchmarks for Low Latency (Streaming) solutions including
Apache Storm, Apache Spark, Apache Flink, etc.
flotilla [Go] -
Automated message queue orchestration for scaled-up benchmarking.
Toolkit
akka [Scala] - toolkit and
runtime for building highly concurrent, distributed, and resilient
message-driven applications on the JVM.
pulsar [Python] -
Actor based event driven concurrent framework for Python.
aeron [Java/C++] -
efficient reliable unicast and multicast message transport.
StreamFlow [Java] -
stream processing tool designed to help build and monitor processing
workflows.
samza-luwak
[Java] - uses Luwak, a stored-query engine built on Lucene, to implement
full-text search on streams.
Turbine [Java] - tool
for aggregating streams of Server-Sent Event (SSE) JSON data into a
single stream.
Closed Source
Amazon Kinesis Streams
[Java] - real-time, fully managed and scalable data stream engine
provided by AWS.
Azure Stream Analytics
[.NET] - a massively scalable, fully managed, real-time, data stream
engine provided by Microsoft Azure.
Cloud Dataflow [Java,
Python, SQL, Scala] - Google’s managed stream and batch data processing
engine. Supports running Beam pipelines.
concord
[C++] - a distributed stream processing framework built in C++ on top of
Apache Mesos, designed for high performance data processing jobs that
require flexibility & control.
IBM Streams
[Python/Java/Scala] - platform for distributed processing and real-time
analytics. Provides toolkits for advanced analytics like geospatial,
time series, etc. out of the box.