Data Lake


What is a Data Lake?
Think of an actual lake, which forms where many tributaries flow together. A data lake is just like that: it is made up of unstructured, semi-structured and structured data flowing in from different sources. Simply put, a data lake is a system or repository that stores data in its natural, raw format.
Why Data Lake?
Unlike a data warehouse, where data is processed before it is stored in a hierarchy of files and folders, a data lake has a flat architecture: it stores all data without any prior processing, which reduces the time needed to get data into the system. The data in a data lake is retained in its original format until it is needed. Data lakes provide agility and flexibility, making it easier to make changes. Although the purpose of the stored data does not have to be defined in advance, the main objective of building a data lake is to offer data scientists an unrefined view of the data whenever they need it. A data lake also supports ingestion, i.e. connectors that pull data from different data sources and load it into the lake. Data lake storage is also more scalable and cost-efficient, and it allows fast data exploration.

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
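
The idea of a flat store, where every element gets a unique identifier and a set of metadata tags, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FlatLake` and `LakeObject` names, the in-memory dictionary, and the tag keys (`source`, `format`) are all hypothetical and not part of any real data lake product.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One raw data element plus its identifier and metadata tags."""
    payload: bytes                      # stored as-is, in its native format
    tags: dict[str, str]                # extended metadata tags
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatLake:
    """Hypothetical in-memory lake: one flat namespace keyed by object ID."""

    def __init__(self) -> None:
        self._objects: dict[str, LakeObject] = {}

    def put(self, payload: bytes, **tags: str) -> str:
        """Store the raw payload untouched and return its unique ID."""
        obj = LakeObject(payload=payload, tags=dict(tags))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def query(self, **wanted: str) -> list[LakeObject]:
        """Return the objects whose tags match every requested key/value."""
        return [
            obj for obj in self._objects.values()
            if all(obj.tags.get(k) == v for k, v in wanted.items())
        ]


lake = FlatLake()
lake.put(b'{"user": 1, "page": "/home"}', source="web", format="json")
lake.put(b"id,amount\n7,19.99\n", source="billing", format="csv")
lake.put(b'{"user": 2, "page": "/cart"}', source="web", format="json")

# A business question about web traffic only needs the "web" subset.
web_events = lake.query(source="web")
```

Here the query narrows the lake down to the smaller set of relevant objects by tag, and only that subset would then be parsed and analyzed.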
Data Lake Tech Stack


Important tiers in Data Lake Architecture
  1. Ingestion Tier: The tiers on the left side depict the data sources. Data can be loaded into the data lake in batches or in real time.
  2. Insights Tier: The tiers on the right represent the research side, where insights from the system are used. SQL queries, NoSQL queries, or even Excel can be used for data analysis.
  3. Storage Tier: HDFS is a cost-effective solution for both structured and unstructured data. It is the landing zone for all data at rest in the system.
  4. Distillation Tier: Takes data from the storage tier and converts it to structured data for easier analysis.
  5. Processing Tier: Runs analytical algorithms and user queries in real-time, interactive, or batch mode to generate structured data for easier analysis.
  6. Unified Operations Tier: Governs system management and monitoring. It includes auditing and proficiency management, data management, and workflow management.
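
How data moves through the ingestion, storage, distillation and processing tiers can be sketched roughly as follows. This is a toy illustration, not a real architecture: a plain Python list stands in for HDFS, and the function names, record formats, and the `customer`/`amount` fields are assumptions made up for the example.

```python
import csv
import io
import json

# Storage tier: a landing zone that keeps raw records untouched.
# A plain list stands in for HDFS here (an assumption for the sketch).
storage: list[tuple[str, str]] = []   # (format, raw text)


def ingest_batch(records: list[tuple[str, str]]) -> None:
    """Ingestion tier: load a batch of raw records into storage as-is."""
    storage.extend(records)


def distill() -> list[dict]:
    """Distillation tier: convert raw records into structured rows."""
    rows: list[dict] = []
    for fmt, raw in storage:
        if fmt == "json":
            rows.append(json.loads(raw))
        elif fmt == "csv":
            rows.extend(csv.DictReader(io.StringIO(raw)))
    return rows


def total_by_customer(rows: list[dict]) -> dict[str, float]:
    """Processing tier: a batch aggregation over the structured rows."""
    totals: dict[str, float] = {}
    for row in rows:
        name = row["customer"]
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals


ingest_batch([
    ("json", '{"customer": "ann", "amount": 10.5}'),
    ("csv", "customer,amount\nbob,4\nann,2.5\n"),
])
totals = total_by_customer(distill())
print(totals)
```

The raw records stay in storage in their original formats. Structure is only imposed in the distillation step, and the aggregated result is what the insights tier would consume.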

Difference between Data Lake and Data Warehouse

| Parameters | Data Lakes | Data Warehouse |
|---|---|---|
| Data | Data lakes store everything. | Data warehouses focus only on business processes. |
| Processing | Data is mainly unprocessed. | Data is highly processed. |
| Type of Data | Can be unstructured, semi-structured, or structured. | Mostly in tabular form and structure. |
| Task | Shared data stewardship. | Optimized for data retrieval. |
| Agility | Highly agile; can be configured and reconfigured as needed. | Less agile than a data lake; has a fixed configuration. |
| Users | Mostly used by data scientists. | Widely used by business professionals. |
| Storage | Designed for low-cost storage. | Uses expensive storage that gives fast response times. |
| Security | Offers less control. | Allows better control of the data. |
| Replacement of EDW | Can be a source for the EDW. | Complementary to the EDW (not a replacement). |
| Schema | Schema on read (no predefined schemas). | Schema on write (predefined schemas). |
| Data Processing | Helps with fast ingestion of new data. | Time-consuming to introduce new content. |
| Data Granularity | Data at a low level of detail or granularity. | Data at a summary or aggregated level of detail. |
| Tools | Can use open-source tools such as Hadoop/MapReduce. | Mostly commercial tools. |
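
The Schema and Data Processing rows can be made concrete with a small sketch contrasting schema on write and schema on read. Everything here is hypothetical: the `WAREHOUSE_SCHEMA` dictionary, the function names, and the sample order record are assumptions for illustration, and real warehouses and lakes enforce schemas in far more elaborate ways.

```python
import json

# Schema on write: the schema is fixed up front and enforced at load time.
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}   # hypothetical schema


def write_to_warehouse(table: list[dict], record: dict) -> None:
    """Reject any record that does not match the predefined schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    table.append({k: cast(record[k]) for k, cast in WAREHOUSE_SCHEMA.items()})


# Schema on read: raw text is stored untouched; structure is applied per query.
def write_to_lake(lake: list[str], raw: str) -> None:
    lake.append(raw)


def read_amounts(lake: list[str]) -> list[float]:
    """Apply a schema only now, keeping just the field this query needs."""
    amounts = []
    for raw in lake:
        doc = json.loads(raw)
        if "amount" in doc:
            amounts.append(float(doc["amount"]))
    return amounts


warehouse: list[dict] = []
lake: list[str] = []

# A new source starts sending an extra "coupon" field.
new_record = {"order_id": 1, "amount": "9.5", "coupon": "SPRING"}

write_to_lake(lake, json.dumps(new_record))      # accepted as-is
try:
    write_to_warehouse(warehouse, new_record)    # rejected: "coupon" is not in the schema
except ValueError as err:
    print("warehouse rejected record:", err)

print(read_amounts(lake))
```

The lake ingests the new record immediately, while the warehouse would need its schema changed first, which is the sense in which introducing new content there is time-consuming.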
