Streaming


High Performance & Low Latency

Flink's data streaming runtime achieves high throughput rates and low latency with little configuration. The charts below show the performance of a distributed item counting task, requiring streaming data shuffles.

Performance of data streaming applications

Support for Event Time and Out-of-Order Events

Flink supports stream processing and windowing with Event Time semantics.

Event time makes it easy to compute over streams where events arrive out of order, and where events may arrive delayed.

Event Time and Out-of-Order Streams

Exactly-once Semantics for Stateful Computations

Streaming applications can maintain custom state during their computation.

Flink's checkpointing mechanism ensures exactly once semantics for the state in the presence of failures.

Exactly-once Semantics for Stateful Computations

Highly flexible Streaming Windows

Flink supports windows over time, count, or sessions, as well as data-driven windows.

Windows can be customized with flexible triggering conditions, to support sophisticated streaming patterns.

Windows

Continuous Streaming Model with Backpressure

Data streaming applications are executed with continuous (long lived) operators.

Flink's streaming runtime has natural flow control: Slow data sinks backpressure faster sources.

Continuous Streaming Model

Fault-tolerance via Lightweight Distributed Snapshots

Flink's fault tolerance mechanism is based on Chandy-Lamport distributed snapshots.

The mechanism is lightweight, allowing the system to maintain high throughput rates and provide strong consistency guarantees at the same time.

Lightweight Distributed Snapshots

Batch and Streaming in One System


One Runtime for Streaming and Batch Processing

Flink uses one common runtime for data streaming applications and batch processing applications.

Batch processing applications run efficiently as special cases of stream processing applications.

Unified Runtime for Batch and Stream Data Analysis

Memory Management

Flink implements its own memory management inside the JVM.

Applications scale to data sizes beyond main memory and experience less garbage collection overhead.

Managed JVM Heap

Iterations and Delta Iterations

Flink has dedicated support for iterative computations (as in machine learning and graph analysis).

Delta iterations can exploit computational dependencies for faster convergence.

Performance of iterations and delta iterations

Program Optimizer

Batch programs are automatically optimized to exploit situations where expensive operations (like shuffles and sorts) can be avoided, and when intermediate data should be cached.

Optimizer choosing between different execution strategies

APIs and Libraries


Streaming Data Applications

The DataStream API supports functional transformations on data streams, with user-defined state, and flexible windows.

The example shows how to compute a sliding histogram of word occurrences of a data stream of texts.

WindowWordCount in Flink's DataStream API

case class Word(word: String, freq: Long)

val texts: DataStream[String] = ...

val counts = text
  .flatMap { line => line.split("\\W+") }
  .map { token => Word(token, 1) }
  .keyBy("word")
  .timeWindow(Time.seconds(5), Time.seconds(1))
  .sum("freq")

Batch Processing Applications

Flink's DataSet API lets you write beautiful type-safe and maintainable code in Java or Scala. It supports a wide range of data types beyond key/value pairs, and a wealth of operators.

The example shows the core loop of the PageRank algorithm for graphs.

case class Page(pageId: Long, rank: Double)
case class Adjacency(id: Long, neighbors: Array[Long])

val result = initialRanks.iterate(30) { pages =>
  pages.join(adjacency).where("pageId").equalTo("id") {

    (page, adj, out: Collector[Page]) => {
      out.collect(Page(page.pageId, 0.15 / numPages))

      val nLen = adj.neighbors.length
      for (n <- adj.neighbors) {
        out.collect(Page(n, 0.85 * page.rank / nLen))
      }
    }
  }
  .groupBy("pageId").sum("rank")
}

Library Ecosystem

Flink's stack offers libraries with high-level APIs for different use cases: Complex Event Processing, Machine Learning, and Graph Analytics.

The libraries are currently in beta status and are heavily developed.

Flink Stack with Libraries

Ecosystem


Broad Integration

Flink is integrated with many other projects in the open-source data processing ecosystem.

Flink runs on YARN, works with HDFS, streams data from Kafka, can execute Hadoop program code, and connects to various other data storage systems.

Other projects that Flink is integrated with