Technical Architect Brain: Data Lake

What is a Data Lake?

Think of an actual lake or river, which forms when many tributaries flow together. A data lake is just like that: it is made up of unstructured, semi-structured, and structured data coming together from different sources. Simply put, a data lake is a system or repository of data stored in its natural, raw format.

Why Data Lake?

Contrary to a data warehouse, where data is processed and stored in files and folders, a data lake has a flat architecture: it stores all data without any prior processing, which reduces the up-front preparation time. The data is retained in its original format until it is needed. Data lakes provide agility and flexibility, making it easier to make changes. Though the purpose of the stored data is not predefined, the main objective of building a data lake is to offer data scientists an unrefined view of the data whenever they need it. A data lake also supports ingestion, i.e. connectors that pull data from different data sources and load it into the lake. Data lake storage is more scalable and cost-efficient, and it allows fast data exploration.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture. Each data element in the lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
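
To make the idea of unique identifiers and metadata tags concrete, here is a minimal in-memory sketch. The names `LakeObject`, `FlatLake`, `put`, and `query`, as well as the tag keys `source`, `kind`, and `fmt`, are hypothetical and chosen only for illustration; a real data lake would rely on an object store and a metadata catalog rather than a Python dictionary.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element plus the metadata tags attached to it."""
    payload: bytes
    tags: dict
    # Every element gets its own unique identifier when it is created.
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class FlatLake:
    """A flat (non-hierarchical) store: all objects share one namespace."""

    def __init__(self):
        self._objects = {}

    def put(self, payload, **tags):
        """Store the payload as-is and return its identifier."""
        obj = LakeObject(payload=payload, tags=tags)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def query(self, **wanted):
        """Return objects whose tags match every requested key/value pair."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(key) == value for key, value in wanted.items())
        ]


lake = FlatLake()
lake.put(b'{"order": 1, "total": 40}', source="web", kind="orders", fmt="json")
lake.put(b"id,total\n2,15\n", source="pos", kind="orders", fmt="csv")
lake.put(b"GET /home 200", source="web", kind="logs", fmt="text")

# A business question about orders only needs the matching subset.
orders = lake.query(kind="orders")
web_orders = lake.query(kind="orders", source="web")
```

The query narrows the lake down to a smaller set of raw objects; parsing and analysis of those payloads happens afterwards, only for the data that the question actually needs.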

Data Lake Tech Stack

Important tiers in Data Lake Architecture

Ingestion Tier: The tiers on the left side depict the data sources. The data can be loaded into the data lake in batches or in real time.
Insights Tier: The tiers on the right represent the research side, where insights from the system are used. SQL queries, NoSQL queries, or even Excel can be used for data analysis.
Storage Tier: HDFS is a cost-effective solution for both structured and unstructured data. It is a landing zone for all data that is at rest in the system.
Distillation Tier: Takes data from the storage tier and converts it to structured data for easier analysis.
Processing Tier: Runs analytical algorithms and user queries in real-time, interactive, or batch modes to generate structured data for easier analysis.
Unified Operations Tier: Governs system management and monitoring. It includes auditing and proficiency management, data management, and workflow management.
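
The flow between the tiers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list `raw_zone` stands in for HDFS, and the function names `ingest`, `distill`, and `total_by_region`, along with the `region` and `amount` fields, are hypothetical.

```python
import csv
import io
import json

# Storage tier: raw payloads kept exactly as they arrived (stand-in for HDFS).
raw_zone = []


def ingest(source, fmt, payload):
    """Ingestion tier: land the payload untouched, with minimal metadata."""
    raw_zone.append({"source": source, "fmt": fmt, "payload": payload})


def distill():
    """Distillation tier: convert raw payloads into structured rows."""
    rows = []
    for item in raw_zone:
        if item["fmt"] == "json":
            record = json.loads(item["payload"])
            rows.append({"region": record["region"], "amount": float(record["amount"])})
        elif item["fmt"] == "csv":
            for record in csv.DictReader(io.StringIO(item["payload"])):
                rows.append({"region": record["region"], "amount": float(record["amount"])})
    return rows


def total_by_region(rows):
    """Processing tier: a batch aggregation over the structured rows."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


# Two sources deliver the same kind of data in different formats.
ingest("web", "json", '{"region": "EU", "amount": 40}')
ingest("pos", "csv", "region,amount\nEU,15\nUS,20\n")

totals = total_by_region(distill())
print(totals)
```

Note that ingestion does no parsing at all; the format differences are resolved only in the distillation step, which is what keeps loading new sources cheap.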

Difference between Data Lakes and Data Warehouses

Parameters	Data Lake	Data Warehouse
Data	Stores everything.	Focuses only on business processes.
Processing	Data is mainly unprocessed.	Data is highly processed.
Type of Data	Can be unstructured, semi-structured, or structured.	Mostly tabular and structured.
Task	Shares data stewardship.	Optimized for data retrieval.
Agility	Highly agile; can be configured and reconfigured as needed.	Less agile than a data lake, with a fixed configuration.
Users	Mostly used by data scientists.	Widely used by business professionals.
Storage	Designed for low-cost storage.	Uses expensive storage that gives fast response times.
Security	Offers less control.	Allows better control of the data.
Replacement of EDW	Can be a source for the EDW.	Complementary to the EDW (not a replacement).
Schema	Schema on read (no predefined schemas).	Schema on write (predefined schemas).
Data Processing	Supports fast ingestion of new data.	Time-consuming to introduce new content.
Data Granularity	Data at a low, fine-grained level of detail.	Data at the summary or aggregated level of detail.
Tools	Can use open-source tools such as Hadoop/MapReduce.	Mostly commercial tools.
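
The Schema and Data Processing rows of the comparison can be illustrated with a small sketch. Everything here is a hypothetical stand-in: `WAREHOUSE_SCHEMA`, `warehouse_insert`, `lake_write`, and `lake_read` are made-up names, and plain Python lists play the role of the warehouse table and the lake storage.

```python
import json

# Hypothetical warehouse schema: column name -> required type.
WAREHOUSE_SCHEMA = {"user_id": int, "amount": float}

warehouse_table = []
lake_files = []


def warehouse_insert(record):
    """Schema on write: reject records that do not match the predefined schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    warehouse_table.append(record)


def lake_write(raw_line):
    """Schema on read: store the raw text as-is, with no validation."""
    lake_files.append(raw_line)


def lake_read(fields):
    """Apply a schema only at query time, skipping records that lack a field."""
    for raw_line in lake_files:
        record = json.loads(raw_line)
        if all(name in record for name in fields):
            yield {name: record[name] for name in fields}


# A new kind of event appears, carrying a field the warehouse has never seen.
new_event = {"user_id": 7, "amount": 9.5, "coupon": "SPRING"}

lake_write(json.dumps(new_event))        # accepted immediately
lake_write(json.dumps({"user_id": 8}))   # also accepted, even though incomplete

try:
    warehouse_insert(new_event)          # rejected: "coupon" is not in the schema
    rejected = False
except ValueError:
    rejected = True

rows = list(lake_read(["user_id", "coupon"]))
```

The lake accepts the new field with no changes, while the warehouse would first need a schema update. The trade-off is that the lake pushes validation onto every reader, which is one reason the table lists less control on the lake side.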