Looking Ahead to Apache Flink 1.4.0 and 1.5.0

22 Nov 2017 Stephan Ewen (@StephanEwen), Aljoscha Krettek (@aljoscha), & Mike Winters (@wints)

The Apache Flink 1.4.0 release is on track to happen in the next couple of weeks, and for all of the readers out there who haven’t been following the release discussion on Flink’s developer mailing list, we’d like to provide some details on what’s coming in Flink 1.4.0 as well as a preview of what the Flink community will save for 1.5.0.

Both releases include ambitious features that we believe will move Flink to an entirely new level in terms of the types of problems it can solve and applications it can support. The community deserves lots of credit for its hard work over the past few months, and we’re excited to see these features in the hands of users.

This post will describe how the community plans to get there and the rationale behind the approach.

There are 3 significant improvements to the Apache Flink engine that the community has nearly completed and that will have a meaningful impact on Flink’s operability and performance.

  1. Rework of the deployment model and distributed process
  2. Transition from configurable, fixed-interval network I/O to event-driven network I/O and application-level flow control for better backpressure handling
  3. Faster recovery from failure

Next, we’ll go through each of these improvements in more detail.

FLIP-6 (FLIP is short for FLink Improvement Proposal and FLIPs are proposals for bigger changes to Flink) is an initiative that’s been in the works for more than a year and represents a major refactor of Flink’s deployment model and distributed process. The underlying motivation for FLIP-6 was the fact that Flink is being adopted by a wider range of developer communities–both developers coming from the big data and analytics space as well as developers coming from the event-driven applications space.

Modern, stateful stream processing has served as a convergence for these two developer communities. Despite a significant overlap of the core concepts in the applications being built, each group of developers has its own set of common tools, deployment models, and expected behaviors when working with a stream processing framework like Flink.

FLIP-6 will ensure that Flink fits naturally in both of these contexts, behaving as though it’s native to each ecosystem and operating seamlessly within a broader technology stack. A few of the specific changes in FLIP-6 that will have such an impact:

  • Leveraging cluster management frameworks to support full resource elasticity
  • First-class support for containerized environments such as Kubernetes and Docker
  • REST-based client-cluster communication to ease operations and 3rd party integrations

FLIP-6, along with already-introduced features like rescalable state, lays the groundwork for dynamic scaling in Flink, meaning that Flink programs will be able to scale up or down automatically based on required resources–a huge step forward in terms of ease of operability and the efficiency of Flink applications.

Speed will always be a key consideration for users who build stream processing applications, and Flink 1.5 will include a rework of the network stack that will even further improve Flink’s latency. At the heart of this work is a transition from configurable, fixed-interval network I/O to event- driven network I/O and application-level flow control, ensuring that Flink will use all available network capacity, as well as credit-based flow control which offers more fine-grained backpressuring for improved checkpoint alignments.

In our testing (see slide 26 here), we’ve seen a substantial improvement in latency using event-driven network I/O, and the community is also doing work to make sure we’re able to provide this increase in speed without a measurable throughput tradeoff.

Faster Recovery from Failures

Flink 1.3.0 introduced incremental checkpoints, making it possible to take a checkpoint of state updates since the last successfully-completed checkpoint only rather than the previous behavior of only taking checkpoints of the entire state of the application. This has led to significant performance improvements for users with large state.

Flink 1.5 will introduce task-local recovery, which means that Flink will store a second copy of the most recent checkpoint on the local disk (or even in main memory) of a task manager. The primary copy still goes to durable storage so that it’s resilient to machine failures.

In case of failover, the scheduler will try to reschedule tasks to their previous task manager (in other words, to the same machine again) if this is possible. The task can then recover from the locally-kept state. This makes it possible to avoid reading all state from the distributed file system (which is remote over the network). Especially in applications with very large state, not having to read many gigabytes over the network and instead from local disk will result in significant performance gains in recovery.

The good news is that all 3 of the features described above are well underway, and in fact, much of the work is already covered by open pull requests.

But given these features’ importance and the complexity of the work involved, the community expected that the QA and testing required would be extensive and would delay the release of the otherwise- ready features also on the list for the next release.

And so the community decided to withhold the 3 features above (deployment model rework, improvements to the network stack, and faster recovery) to be included a separate Flink 1.5 release that will come shortly after the Flink 1.4 release. Flink 1.5 is estimated to come just a couple of months after 1.4 rather than the typical 4-month cycle in between major releases.

The soon-to-be-released Flink 1.4 represents the current state of Flink without merging those 3 features. And Flink 1.4 is a substantial release in its own right, including, but not limited to, the following:

  • A significantly improved dependency structure, removing many of Flink’s dependencies and subtle runtime conflicts. This increases overall stability and removes friction when embedding Flink or calling Flink “library style”.
  • Reversed class loading for dynamically-loaded user code, allowing for different dependencies than those included in the core framework.
  • An Apache Kafka 0.11 exactly-once producer, making it possible to build end-to-end exactly once applications with Flink and Kafka.
  • Streaming SQL JOIN based on processing time and event time, which gives users the full advantage of Flink’s time handling while using a SQL JOIN.
  • Table API / Streaming SQL Source and Sink Additions, including a Kafka 0.11 source and JDBC sink.
  • Hadoop-free Flink, meaning that users who don’t rely on any Hadoop components (such as YARN or HDFS) in their Flink applications can use Flink without Hadoop for the first time.
  • Improvements to queryable state, including a more container-friendly architecture, a more user-friendly API that hides configuration parameters, and the groundwork to be able to expose window state (the state of an in-flight window) in the future.
  • Connector improvements and fixes for a range of connectors including Kafka, Apache Cassandra, Amazon Kinesis, and more.
  • Improved RPC performance for faster recovery from failure

The community decided it was best to get these features into a stable version of Flink as soon as possible, and the separation of what could have been a single (and very substantial) Flink 1.4 release into 1.4 and 1.5 serves that purpose.

We’re excited by what each of these represents for Apache Flink, and we’d like to extend our thanks to the Flink community for all of their hard work.

If you’d like to follow along with release discussions, please subscribe to the dev@ mailing list.