07 Jan 2022 Dong Lin & Yun Gao
The Apache Flink community is excited to announce the release of Flink ML 2.0.0! Flink ML is a library that provides APIs and infrastructure for building stream-batch unified machine learning algorithms, that can be easy-to-use and performant with (near-) real-time latency.
This release involves a major refactor of the earlier Flink ML library and introduces major features that extend the Flink ML API and the iteration runtime, such as supporting stages with multi-input multi-output, graph-based stage composition, and a new stream-batch unified iteration library. Moreover, we added five algorithm implementations in this release, which is the start of a long-term initiative to provide a large number of off-the-shelf algorithms in Flink ML with state-of-the-art performance.
We believe this release is an important step towards extending Apache Flink to a wide range of machine learning use cases, especially the real-time machine learning scenarios.
We encourage you to download the release and share your feedback with the community through the Flink mailing lists or JIRA! We hope you like the new release and we’d be eager to learn about your experience with it.
- Notable Features
- Related Work
- Upgrade Notes
- Release Notes and Resources
- List of Contributors
API and Infrastructure
Supporting stages requiring multi-input multi-output
Stages in a machine learning workflow might take multiple inputs and return multiple outputs. For example, a graph embedding algorithm might need to read two tables, which represent the edge and node of the graph respectively. And a workflow might need a stage that splits the input dataset into two output datasets, for training and testing respectively.
With this capability, algorithm developers can assemble a machine learning workflow as a directed acyclic graph (DAG) of pre-defined stages. And this workflow can be configured and deployed without users knowing the implementation details of this graph. This improvement could considerably expand the applicability and usability of Flink ML.
Supporting online learning with APIs exposing model data
In a native online learning scenario, we have a long-running job that keeps processing training data and updating a machine learning model. And we could have multiple jobs deployed in web servers which do online inference. It is necessary to transmit the latest model data from the training job to those inference jobs in (near-) real-time latency.
The traditional Estimator/Transformer paradigm does not provide APIs to expose this model data in a streaming manner. Users have to repeatedly call fit() to update model data. Although users might be able to update model data once every few minutes, it is likely very inefficient, if not impossible, to update model data once every few seconds with this approach.
With FLIP-173, model data can be exposed as an unbounded stream via the getModelData() API. Then algorithm users can transfer the model data to web servers in real-time and use the up-to-date model data to do online inference. This feature could significantly strengthen Flink ML’s capability to support online learning applications.
Improved usability for managing parameters
We care a lot about usability and developer velocity in Flink ML. In this release, we refactored and significantly simplified the experience of defining, getting and setting parameters for algorithms.
With FLIP-174, parameters can be defined as static variables of an interface, and any algorithm that implements the interface could inherit these variable definitions without additional work. Commonly used parameter validators are provided as part of the infrastructure.
Tools for composing DAG of stages into a new stage
One of the most useful tool in the existing ML libraries (e.g. Scikit-learn, Flink, Spark) is Pipeline, which allows users to compose an estimator from an ordered list of estimators and transformers, without having to explicitly implement the fit/transform for the estimator/transformer.
FLIP-175 extended this capability from pipeline to DAG. Users can now compose an estimator from a DAG of estimator and transformers. This capability of composition allows developers to slice a complex workflow into simpler modules and re-use the modules across multiple workflows. We believe this capability could significantly improve the experience of building and deploying complex workflows using Flink ML.
Stream-batch Unified Iteration Library
To support training machine learning algorithms and adjust the model parameters dynamically based on the prediction result, it is necessary to have native support for processing data iteratively. It is known that Flink uses DAG to describe the process logic, thus we need to provide the iteration library on top of Flink separately. Besides, since we need to support both offline training and online training / adjustment, the iteration library should support both streaming and batch cases.
FLIP-176 implements a stream-batch unified iteration library. It provides the function of transmitting records back to the precedent operators and the ability to track the progress of rounds inside the iteration. Users could directly use DataStream API and Table API to express the execution logic inside the iteration. Besides, the new iteration library also extends Flink’s checkpointing mechanism to also support exactly-once failover for jobs using iterations.
Nowadays many machine learning practitioners are used to developing machine learning workflows in Python due to its ease-of-use and excellent ecosystem. To meet the needs of these users, a Python package dedicated for Flink ML is created starting from this release. The Python package currently provides APIs similar to their Java counterparts for developing machine learning algorithms.
Users can install Flink ML Python package through pip using the following command:
pip install apache-flink-ml
In the future we will enhance the Python SDK to enable its interoperability with Flink ML’s Java library, for example, allowing users to express machine learning workflows in Python, where workflows consist of a mixture of stages from the Flink ML Java library as well as stages implemented in Python (e.g. a TensorFlow program).
Now that the Flink ML API re-design is done, we started the initiative to add off-the-shelf algorithms in Flink ML. The release of Flink-ML 2.0.0 is closely related to project Alink - an Apache Flink ecosystem project open sourced by Alibaba. The connection between the Flink community and developers of the Alink project dates back to 2017. The project Alink developers have a significant contribution in designing the new Flink ML APIs, refactoring, optimizing and migrating algorithms from Alink to Flink. Our long-term goal is to provide a library of performant algorithms that are easy to use, debug and customize for your needs.
We have implemented five algorithms in this release, i.e. logistic regression, k-means, k-nearest neighbors, naive bayes and one-hot encoder. For now these algorithms focus on validating the APIs and iteration runtime. In addition to adding more and more algorithms, we will also stress test and optimize their performance to make sure these algorithms have state-of-the-art performance. Stay tuned!
Flink ML project moved to a separate repository
To accelerate the development of Flink ML, the effort has moved to the new repository flink-ml under the Flink project. We here follow a similar approach like the Stateful Functions effort, where a separate repository has helped to speed up the development by allowing for more light-weight contribution workflows and separate release cycles.
Github organization created for Flink ecosystem projects
To facilitate the community collaboration on ecosystem projects that extend the capability of the Apache Flink, Apache Flink PMC has granted the permission to use flink-extended as the name of this GitHub organization, which provides a neutral place to host the code of ecosystem projects.
Two Flink ML related projects have been moved to this organization. dl-on-flink provides the capability to implement Flink ML stages using TensorFlow. And clink is a library that facilitates the implementation of Flink ML stages using C++ in order to support e.g. real-time feature engineering.
We hope you can join this effort and share your Flink ecosystem projects in this Github organization. And stay tuned for more updates on ecosystem projects.
Please review this note for a list of adjustments to make and issues to check if you plan to upgrade to Flink ML 2.0.0.
This note discusses any critical information about incompatibilities and breaking changes, performance changes, and any other changes that might impact your production deployment of Flink ML.
Module names are changed.
We have replaced the
flink-ml-apimodule with the
For users who have a dependency on
flink-ml-api, please replace it with
PipelineStage and its subclasses are changed.
FLIP-173 made major changes to PipelineStage and its subclasses. Changes include class rename, method signature change, method removal etc.
Users who use PipelineStage and its subclasses should use the new APIs introduced in FLIP-173.
Param-related classes are changed.
FLIP-174 made major changes to the param-related classes. Changes include class rename, method signature change, method removal etc.
Users who use classes such as Params and WithParams should use the new APIs introduced in FLIP-174.
Flink dependency is changed from 1.12 to 1.14.
This change introduces all the breaking changes listed in the Flink 1.14 release notes. One major change is that the DataSet API is not supported anymore.
Users who use DataSet::iterate should switch to using the datastream-based iteration API introduced in FLIP-176.
Release Notes and Resources
Please take a look at the release notes for a detailed list of changes and new features.
List of Contributors
The Apache Flink community would like to thank each one of the contributors that have made this release possible:
Yun Gao, Dong Lin, Zhipeng Zhang, huangxingbo, Yunfeng Zhou, Jiangjie (Becket) Qin, weibo, abdelrahman-ik.