Apache Flink's Application to Season of Docs

17 Apr 2019 Konstantin Knauf (@snntrable)

The Apache Flink community is happy to announce its application to the first edition of Season of Docs by Google. The program brings together Open Source projects and technical writers to raise awareness of, and improve, the documentation of Open Source projects. While the community is continuously looking for new contributors to collaborate on our documentation, we would like to take this chance to work with one or two technical writers to extend and restructure parts of our documentation (details below).

The community has discussed this opportunity on the dev mailing list and agreed on three project ideas to submit to the program. We have a great team of mentors (Stephan, Fabian, David, Jark & Konstantin) lined up and are very much looking forward to the first proposals by potential technical writers (given we are admitted to the program ;)). If you have questions, feel free to reach out to the community via dev@flink.apache.org.

Project Ideas List

Project 1: Improve Documentation of Stream Processing Concepts

Description: Stream processing is the processing of data in motion; in other words, computing on data directly as it is produced or received. Apache Flink has pioneered the field of distributed, stateful stream processing over the last several years. As the community has pushed the boundaries of stream processing, we have introduced new concepts that users need to become familiar with to develop and operate Apache Flink applications efficiently. The Apache Flink documentation [1] already contains a “concepts” section, but it a) is incomplete and b) lacks an overall structure and reading flow. In addition, “concepts” content is spread over the development [2] and operations [3] documentation without references to the “concepts” section. An example of this can be found in [4] and [5].

In this project, we would like to restructure, consolidate and extend the concepts documentation for Apache Flink to better guide users who want to become productive as quickly as possible. This includes better conceptual introductions to topics such as event time, state, and fault tolerance with proper linking to and from relevant deployment and development guides.

Related material:

  1. https://ci.apache.org/projects/flink/flink-docs-release-1.8/
  2. https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev
  3. https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops
  4. https://ci.apache.org/projects/flink/flink-docs-release-1.8/concepts/programming-model.html#time
  5. https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/event_time.html

Project 2: Improve Documentation of Flink Deployments & Operations

Description: Stream processing is the processing of data in motion; in other words, computing on data directly as it is produced or received. Apache Flink has pioneered the field of distributed, stateful stream processing for the last few years. As a stateful distributed system in general and a continuously running, low-latency system in particular, Apache Flink deployments are non-trivial to set up and manage. Unfortunately, the operations [1] and monitoring documentation [2] are arguably the weakest spots of the Apache Flink documentation. While they are comprehensive and often go into a lot of detail, they lack an overall structure and do not address common overarching concerns of operations teams in an efficient way.

In this project, we would like to restructure this part of the documentation and extend it if possible. Ideas for extension include: a discussion of session and per-job clusters, better documentation for containerized deployments (incl. K8s), capacity planning, and integration into CI/CD pipelines.

Related material:

  1. https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops
  2. https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring

Project 3: Improve Documentation for Relational APIs (Table API & SQL)

Description: Apache Flink features APIs at different levels of abstraction, which enable its users to trade conciseness for expressiveness. Flink’s relational APIs, SQL and the Table API, are “younger” than the DataStream and DataSet APIs, higher-level, and focused on data analytics use cases. A core principle of Flink’s SQL and Table API is that they can be used to process static (batch) and continuous (streaming) data and that a program or query produces the same result in both cases. The documentation of Flink’s relational APIs has grown organically and can be improved in a few areas. There are several ongoing development efforts (e.g. Hive Integration, Python Support or Support for Interactive Programming) that aim to extend the scope of the Table API and SQL.

The existing documentation could be reorganized to prepare for covering the new features. Moreover, it could be improved by adding a concepts section that describes the use cases and internals of the APIs in more detail. In addition, the documentation of built-in functions could be improved by adding more concrete examples.

Related material:

  1. Table API & SQL docs main page
  2. Built-in functions
  3. Concepts
  4. Streaming Concepts