The diagram above shows the complete ecosystem of Apache Flink. There are several layers in the ecosystem diagram:
Flink doesn’t ship with a storage system; it is purely a computation engine. It can read data from and write data to different storage systems, and it can consume data from streaming systems. Below is a list of the storage/streaming systems Flink can read from and write to:
- HDFS – Hadoop Distributed File System
- Local-FS – Local File System
- S3 – Simple Storage Service from Amazon
- HBase – NoSQL Database in Hadoop ecosystem
- MongoDB – NoSQL Database
- RDBMS – Any relational database
- Kafka – Distributed messaging Queue
- RabbitMQ – Messaging Queue
- Flume – Data Collection and Aggregation Tool
The second layer is deployment/resource management. Flink can be deployed in the following modes:
- Local mode – On a single node, in single JVM
- Cluster – On a multi-node cluster, with one of the following resource managers:
- Standalone – The default resource manager, shipped with Flink.
- YARN – A very popular resource manager; it is part of Hadoop, introduced in Hadoop 2.x.
- Mesos – A generalized resource manager.
- Cloud – On Amazon or Google Cloud.
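The standalone and YARN deployment modes above can be launched with scripts that ship in the Flink distribution. A minimal sketch, assuming you are in the Flink installation directory and (for YARN) that Hadoop is configured on the classpath; the example jar path is illustrative:

```shell
# Standalone mode: start the default resource manager shipped with Flink
./bin/start-cluster.sh

# Submit a job to the running standalone cluster
./bin/flink run ./examples/streaming/WordCount.jar

# Stop the standalone cluster
./bin/stop-cluster.sh

# YARN mode: start a Flink session on a Hadoop YARN cluster instead
./bin/yarn-session.sh
```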
The next layer is the runtime – the Distributed Streaming Dataflow, also called the kernel of Apache Flink. This is the core layer of Flink, which provides distributed processing, fault tolerance, reliability, native iterative processing capability, etc.
The top layer is for APIs and libraries, which provide diverse capabilities to Flink:
ii. DataSet API
It handles data at rest and allows the user to apply operations such as map, filter, join, and group on datasets. It is mainly used for distributed batch processing. In fact, batch processing is a special case of stream processing where the data source is finite, so batch applications also execute on the streaming runtime.
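The map and filter operations mentioned above can be sketched in Java as follows. This is a minimal, illustrative sketch that assumes the Flink DataSet API dependency (`flink-java`) is on the classpath; the small in-memory collection stands in for a real source such as HDFS:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class DataSetSketch {
    public static void main(String[] args) throws Exception {
        // Batch execution environment for bounded (finite) data
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // A tiny in-memory dataset standing in for a real source (e.g. a file on HDFS)
        DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);

        // filter keeps only even numbers; map doubles each surviving element
        DataSet<Integer> doubledEvens = numbers
                .filter(n -> n % 2 == 0)
                .map(n -> n * 2);

        // print() triggers execution and writes the result to stdout
        doubledEvens.print();
    }
}
```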
iii. DataStream API
It handles continuous streams of data. To process live data streams it provides various operations such as map, filter, update state, window, and aggregate. It can consume data from various streaming sources and write the data to different sinks. It supports both Java and Scala.
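A minimal sketch of the DataStream API, assuming the `flink-streaming-java` dependency is on the classpath; the socket source on `localhost:9999` is an assumption standing in for a real source such as Kafka, and the console sink stands in for a real sink such as HDFS:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamSketch {
    public static void main(String[] args) throws Exception {
        // Streaming execution environment for unbounded data
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read a continuous stream of text lines from a socket (assumed host/port)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Apply the map/filter operations mentioned above, then sink to the console
        lines.filter(line -> !line.isEmpty())
             .map(String::toUpperCase)
             .print();

        // Streaming jobs run until explicitly cancelled
        env.execute("DataStream sketch");
    }
}
```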
Now let’s discuss some DSL (Domain-Specific Language) tools:
iv. Table
It enables users to perform ad-hoc analysis using a SQL-like expression language for relational stream and batch processing. It can be embedded in the DataSet and DataStream APIs. In effect, it saves users from writing complex code to process the data and instead lets them run SQL queries on top of Flink.
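Running a SQL query on top of Flink might look like the following sketch, assuming the `flink-table-api-java` dependency is available; the in-memory values table, the view name `Orders`, and the default column name `f0` are all assumptions for illustration (a real job would register a Kafka topic or file as the table):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class TableSketch {
    public static void main(String[] args) {
        // Table environment for relational stream processing
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.inStreamingMode());

        // A tiny in-memory table standing in for a real source
        Table orders = tEnv.fromValues("user1", "user2", "user1");
        tEnv.createTemporaryView("Orders", orders);

        // A plain SQL query replaces hand-written transformation code
        Table counts = tEnv.sqlQuery(
                "SELECT f0 AS user_name, COUNT(*) AS cnt FROM Orders GROUP BY f0");

        // Execute the query and print the result table
        counts.execute().print();
    }
}
```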
v. Gelly
It is a graph processing library that allows users to run a set of operations to create, transform, and process graphs. Gelly also provides a library of algorithms to simplify the development of graph applications. It leverages Flink’s native iterative processing model to handle graphs efficiently. Its APIs are available in Java and Scala.
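Creating and transforming a graph with Gelly can be sketched as below, assuming the `flink-gelly` dependency is on the classpath; the three hard-coded edges are an illustrative toy graph:

```java
import java.util.Arrays;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;
import org.apache.flink.types.NullValue;

public class GellySketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Three directed, weighted edges; Gelly derives the vertex set from them
        Graph<Long, NullValue, Double> graph = Graph.fromCollection(
                Arrays.asList(
                        new Edge<>(1L, 2L, 0.5),
                        new Edge<>(2L, 3L, 0.4),
                        new Edge<>(3L, 1L, 0.8)),
                env);

        // One of Gelly's built-in operations: compute the degree of every vertex
        graph.getDegrees().print();
    }
}
```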