Apache Flink ML 2.2.0 Release Announcement

April 19, 2023 - Dong Lin

The Apache Flink community is excited to announce the release of Flink ML 2.2.0! This release focuses on enriching Flink ML’s feature engineering algorithms. The library now includes 33 feature engineering algorithms, making it a more comprehensive library for feature engineering tasks.

With the addition of these algorithms, we believe the Flink ML library is ready for use in production jobs that require feature engineering capabilities, and the features these jobs produce can then be consumed by both offline and online machine learning tasks.

We encourage you to download the release and share your feedback with the community through the Flink mailing lists or JIRA! We hope you like the new release and we’d be eager to learn about your experience with it.

Notable Features

Introduced API and infrastructure for online serving

In machine learning, one of the main goals of model training is to deploy the trained model to perform online inference, where the model server must respond to incoming requests with millisecond-level latency. However, prior releases of Flink ML only supported nearline inference using the Flink runtime, which may not meet the latency requirements of online inference use cases.

With FLIP-289, Flink ML now provides an API and infrastructure for users to load a ModelServable from model data generated by an Estimator. This ModelServable can be replicated across multiple model servers to process online inference requests in parallel. As the ModelServable is effectively a UDF that does not rely on the Flink runtime, it can also be integrated into other serving or processing frameworks to serve models trained by Flink ML.

As a first step, the LogisticRegressionModelServable has been added to serve the logistic regression model online, and more servables will be added in the future. This new feature enables Flink ML to be used for both offline and online machine learning tasks, making it more versatile for a wider range of use cases.
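To make the idea of a runtime-independent servable concrete, here is a minimal conceptual sketch in plain Python. It is not the Flink ML API: the class name `LogisticRegressionServableSketch`, the `from_model_data` and `transform` methods, and the comma-separated model-data format are all assumptions made for illustration. It only shows the general shape described above: model data is loaded into a self-contained object, which can then be replicated and called like a UDF.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LogisticRegressionServableSketch:
    """Hypothetical stand-in for a servable: holds model data, needs no Flink runtime."""

    coefficients: List[float]

    @classmethod
    def from_model_data(cls, data: bytes) -> "LogisticRegressionServableSketch":
        # Assumed format for this sketch only: comma-separated floats in UTF-8.
        return cls([float(x) for x in data.decode("utf-8").split(",")])

    def transform(self, features: Sequence[float]) -> dict:
        if len(features) != len(self.coefficients):
            raise ValueError("feature length does not match coefficient length")
        dot = sum(w * x for w, x in zip(self.coefficients, features))
        # Numerically stable sigmoid for both signs of the dot product.
        if dot >= 0:
            prob = 1.0 / (1.0 + math.exp(-dot))
        else:
            e = math.exp(dot)
            prob = e / (1.0 + e)
        return {"prediction": 1.0 if prob >= 0.5 else 0.0, "probability": prob}


# Each model server could build its own replica from the same model data.
model_data = b"0.5,-1.0,2.0"
replicas = [LogisticRegressionServableSketch.from_model_data(model_data) for _ in range(2)]
result = replicas[0].transform([1.0, 1.0, 1.0])
```

Because the object depends only on its model data, any number of replicas can answer requests independently, which is the property that makes this style of serving suitable for low-latency inference.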

Added 27 feature engineering algorithms

Flink ML 2.2.0 significantly expanded the coverage of feature engineering algorithms, increasing the number from 6 to 33. Flink ML now covers 28 out of the 33 feature engineering algorithms provided in Spark ML, making it a more comprehensive library for feature engineering tasks.

Feature engineering is a critical step in modern AI infrastructure, as it can preprocess data not only for traditional machine learning algorithms like GBT but also for deep learning algorithms and Transformer-based large language models, which are increasingly popular. With the addition of these algorithms, we hope Flink ML can be more useful in machine-learning tasks for Flink users.

All feature engineering algorithms can be easily accessed through the drop-down list on the left side of this Flink ML page. For each algorithm, we have provided Python and Java examples to demonstrate how to use them.

Added two production-validated online learning algorithms

Flink ML offers a significant advantage over other machine learning libraries in terms of its ability to perform online learning using Flink’s streaming runtime. To leverage this strength, we implemented two online algorithms in Flink ML and successfully used them in a production machine learning job at Alibaba.

This job involves dynamically clustering similar logs and detecting errors in the logs to help site reliability engineers. By using OnlineStandardScaler and AgglomerativeClustering to standardize and cluster logs in real-time, the job is able to update models more frequently with a much simpler infrastructure setup. We presented this work at Flink Forward Asia 2022, and it will soon be integrated into the open-source project SREWorks.

With these online algorithms, Flink ML provides users with the ability to continuously update models using new data in real-time, resulting in more accurate and up-to-date predictions. This can be particularly useful in use cases where data is constantly streaming in, and it’s important to make quick decisions based on the latest available information.
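As a rough illustration of what continuously updating a model from streaming data can look like, the sketch below standardizes a stream of values using Welford's running mean and variance. This is a generic technique shown in plain Python, not Flink ML's OnlineStandardScaler implementation; the class `RunningStandardizer` and its methods are hypothetical names chosen for this example.

```python
import math
from typing import List


class RunningStandardizer:
    """Hypothetical helper: tracks mean and variance of a stream one value at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: adjusts the statistics without storing past values.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Sample standard deviation; 0.0 until at least two values are seen.
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0

    def transform(self, x: float) -> float:
        s = self.std
        return (x - self.mean) / s if s > 0 else 0.0


scaler = RunningStandardizer()
scaled: List[float] = []
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    scaler.update(value)                     # the "model" changes with every record
    scaled.append(scaler.transform(value))   # and is applied immediately
```

The point of the sketch is that the statistics used for scaling are always derived from the data seen so far, so there is no separate batch retraining step between receiving new data and using an updated model.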

Upgrade Notes

This release is fully backward compatible with Flink ML 2.1. Users should be able to upgrade to Flink ML 2.2.0 without worrying about any incompatibilities or breaking changes.

Release Notes and Resources

Please take a look at the release notes for a detailed list of changes and new features.

The binary distribution and source artifacts are now available on the updated Downloads page of the Flink website, and the most recent distribution of Flink ML Python package is available on PyPI.

List of Contributors

The Apache Flink community would like to thank each of the contributors who have made this release possible:

Zhipeng Zhang, Dong Lin, Fan Hong, JiangXin, Zsombor Chikan, huangxingbo, taosiyuan163, vacaly, weibozhao, yunfengzhou-hub