The diagram above shows the complete ecosystem of Apache Flink. There are several layers in the ecosystem diagram:
Flink doesn’t ship with a storage system; it is purely a computation engine. It can read data from and write data to different storage systems, and it can consume data from streaming systems. Below is a list of the storage/streaming systems Flink can read from and write to:
- HDFS – Hadoop Distributed File System
- Local-FS – Local File System
- S3 – Simple Storage Service from Amazon
- HBase – NoSQL Database in Hadoop ecosystem
- MongoDB – NoSQL Database
- RDBMS – Any relational database
- Kafka – Distributed messaging Queue
- RabbitMQ – Messaging Queue
- Flume – Data Collection and Aggregation Tool
The second layer is deployment/resource management. Flink can be deployed in the following modes:
- Local mode – On a single node, in single JVM
- Cluster – On a multi-node cluster, with one of the following resource managers:
- Standalone – The default resource manager, shipped with Flink.
- YARN – A very popular resource manager; it is part of Hadoop, introduced in Hadoop 2.x.
- Mesos – A generalized resource manager.
- Cloud – On Amazon or Google Cloud.
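The standalone and YARN deployment modes above can be launched with scripts that ship in the Flink distribution. A minimal sketch, assuming you are in the Flink installation directory and (for YARN) that Hadoop is configured on the classpath; the example jar path is illustrative:

```shell
# Standalone mode: start the default resource manager shipped with Flink
./bin/start-cluster.sh

# Submit a job to the running standalone cluster
./bin/flink run ./examples/streaming/WordCount.jar

# Stop the standalone cluster
./bin/stop-cluster.sh

# YARN mode: start a Flink session on a Hadoop YARN cluster instead
./bin/yarn-session.sh
```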
The next layer is the runtime – the Distributed Streaming Dataflow, also called the kernel of Apache Flink. This is the core layer of Flink, which provides distributed processing, fault tolerance, reliability, native iterative processing capability, etc.
The top layer is for APIs and libraries, which provide diverse capabilities to Flink:
ii. DataSet API
It handles data at rest and allows the user to apply operations such as map, filter, join, and group on datasets. It is mainly used for distributed batch processing. In fact, batch processing is a special case of stream processing where the data source is finite, so batch applications also execute on the streaming runtime.
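The map and filter operations mentioned above can be sketched in Java as follows. This is a minimal, illustrative sketch that assumes the Flink DataSet API dependency (`flink-java`) is on the classpath; the small in-memory collection stands in for a real source such as HDFS:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class DataSetSketch {
    public static void main(String[] args) throws Exception {
        // Batch execution environment for bounded (finite) data
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // A tiny in-memory dataset standing in for a real source (e.g. a file on HDFS)
        DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);

        // filter keeps only even numbers; map doubles each surviving element
        DataSet<Integer> doubledEvens = numbers
                .filter(n -> n % 2 == 0)
                .map(n -> n * 2);

        // print() triggers execution and writes the result to stdout
        doubledEvens.print();
    }
}
```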
iii. DataStream API
It handles continuous streams of data. To process live data streams it provides various operations such as map, filter, update state, window, and aggregate. It can consume data from various streaming sources and write the data to different sinks. It supports both Java and Scala.
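A minimal sketch of the DataStream API, assuming the `flink-streaming-java` dependency is on the classpath; the socket source on `localhost:9999` is an assumption standing in for a real source such as Kafka, and the console sink stands in for a real sink such as HDFS:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataStreamSketch {
    public static void main(String[] args) throws Exception {
        // Streaming execution environment for unbounded data
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read a continuous stream of text lines from a socket (assumed host/port)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Apply the map/filter operations mentioned above, then sink to the console
        lines.filter(line -> !line.isEmpty())
             .map(String::toUpperCase)
             .print();

        // Streaming jobs run until explicitly cancelled
        env.execute("DataStream sketch");
    }
}
```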
Now let’s discuss some DSL (Domain-Specific Language) tools:
iv. Table
It enables users to perform ad-hoc analysis using a SQL-like expression language for relational stream and batch processing. It can be embedded in the DataSet and DataStream APIs. In effect, it saves users from writing complex code to process the data and instead lets them run SQL queries on top of Flink.
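Running a SQL query on top of Flink might look like the following sketch, assuming the `flink-table-api-java` dependency is available; the in-memory values table, the view name `Orders`, and the default column name `f0` are all assumptions for illustration (a real job would register a Kafka topic or file as the table):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class TableSketch {
    public static void main(String[] args) {
        // Table environment for relational stream processing
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.inStreamingMode());

        // A tiny in-memory table standing in for a real source
        Table orders = tEnv.fromValues("user1", "user2", "user1");
        tEnv.createTemporaryView("Orders", orders);

        // A plain SQL query replaces hand-written transformation code
        Table counts = tEnv.sqlQuery(
                "SELECT f0 AS user_name, COUNT(*) AS cnt FROM Orders GROUP BY f0");

        // Execute the query and print the result table
        counts.execute().print();
    }
}
```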
v. Gelly
It is a graph processing library that allows users to run a set of operations to create, transform, and process graphs. Gelly also provides a library of algorithms to simplify the development of graph applications. It leverages Flink’s native iterative processing model to handle graphs efficiently. Its APIs are available in Java and Scala.
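Creating and transforming a graph with Gelly can be sketched as below, assuming the `flink-gelly` dependency is on the classpath; the three hard-coded edges are an illustrative toy graph:

```java
import java.util.Arrays;

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;
import org.apache.flink.types.NullValue;

public class GellySketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Three directed, weighted edges; Gelly derives the vertex set from them
        Graph<Long, NullValue, Double> graph = Graph.fromCollection(
                Arrays.asList(
                        new Edge<>(1L, 2L, 0.5),
                        new Edge<>(2L, 3L, 0.4),
                        new Edge<>(3L, 1L, 0.8)),
                env);

        // One of Gelly's built-in operations: compute the degree of every vertex
        graph.getDegrees().print();
    }
}
```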