Think of an actual lake or river, which forms where many tributaries come together. A data lake is just like that: it is made up of unstructured, semi-structured, and structured data flowing in from different sources. Simply put, a data lake is a repository that stores data in its natural, raw format.
Why Data Lake?
While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
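
To make the flat layout concrete, here is a minimal in-memory sketch. It is not how any particular product works: the `FlatLake` class, its `put`, `query`, and `get` methods, and the tag names (`source`, `fmt`) are all hypothetical and exist only for illustration. The point it shows is that every object lands in one flat namespace under a unique identifier, and that lookups go through metadata tags rather than folder paths.

```python
import uuid


class FlatLake:
    """Toy in-memory stand-in for a flat data lake store (hypothetical)."""

    def __init__(self):
        # One flat namespace: object id -> (raw payload, metadata tags)
        self._objects = {}

    def put(self, payload, **tags):
        """Store the payload as-is and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (payload, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return ids of objects whose tags match every requested key/value."""
        return [
            object_id
            for object_id, (_, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def get(self, object_id):
        """Return the raw payload stored under the given identifier."""
        return self._objects[object_id][0]


lake = FlatLake()
clicks_id = lake.put(b'{"user": 1, "page": "/home"}', source="web", fmt="json")
lake.put(b"id,amount\n7,19.99\n", source="billing", fmt="csv")
lake.put("free-form support ticket text", source="support", fmt="text")

# A business question about web traffic narrows the lake to a smaller set.
web_ids = lake.query(source="web")
```

Here `web_ids` holds only the identifier of the clickstream object, which is the "smaller set of data" that would then be handed to an analysis step.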
Data Lake Tech Stack
Important tiers in Data Lake Architecture
- Ingestion Tier: The tiers on the left side depict the data sources. Data can be loaded into the data lake in batches or in real time.
- Insights Tier: The tiers on the right represent the research side, where insights from the system are used. SQL queries, NoSQL queries, or even Excel can be used for data analysis.
- Storage tier: HDFS is a cost-effective solution for both structured and unstructured data. It is a landing zone for all data that is at rest in the system.
- The distillation tier takes data from the storage tier and converts it to structured data for easier analysis.
- The processing tier runs analytical algorithms and user queries in real-time, interactive, or batch mode to generate structured data for easier analysis.
- The unified operations tier governs system management and monitoring. It includes auditing and proficiency management, data management, and workflow management.
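
As a rough illustration of what the distillation tier does, the sketch below turns raw, semi-structured lines into fixed-width rows. Everything in it is assumed for the example: the sample events, the `COLUMNS` tuple, and the `distill` function are hypothetical, and a real lake would run this kind of step with a distributed engine rather than a plain loop.

```python
import json

# Hypothetical raw events as they might sit in the storage tier.
RAW_EVENTS = [
    '{"user": 1, "page": "/home", "ms": 120}',
    '{"user": 2, "page": "/pricing"}',
    "not json at all",
]

# Assumed target layout for the structured output.
COLUMNS = ("user", "page", "ms")


def distill(raw_lines):
    """Convert raw JSON lines into rows; keep unparseable lines aside."""
    rows, rejected = [], []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if not isinstance(record, dict):
            rejected.append(line)
            continue
        # Missing fields become None so every row has the same shape.
        rows.append(tuple(record.get(column) for column in COLUMNS))
    return rows, rejected


rows, rejected = distill(RAW_EVENTS)
```

The raw lines are left untouched in storage; `rows` is a derived, structured view, and `rejected` keeps the input that could not be parsed so nothing is silently lost.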
Parameters | Data Lake | Data Warehouse |
---|---|---|
Data | A data lake stores everything. | A data warehouse focuses only on business processes. |
Processing | Data is mainly unprocessed. | Data is highly processed. |
Type of Data | Unstructured, semi-structured, or structured. | Mostly tabular and structured. |
Task | Shared data stewardship. | Optimized for data retrieval. |
Agility | Highly agile; can be configured and reconfigured as needed. | Less agile than a data lake, with a fixed configuration. |
Users | Mostly used by data scientists. | Widely used by business professionals. |
Storage | Designed for low-cost storage. | Uses expensive storage that gives fast response times. |
Security | Offers less control. | Allows better control of the data. |
Replacement of EDW | A data lake can be a source for the EDW. | Complementary to the EDW (not a replacement). |
Schema | Schema on read (no predefined schemas). | Schema on write (predefined schemas). |
Data Processing | Supports fast ingestion of new data. | Introducing new content is time-consuming. |
Data Granularity | Data at a low level of detail or granularity. | Data at a summary or aggregated level of detail. |
Tools | Can use open-source tools such as Hadoop and MapReduce. | Mostly commercial tools. |
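
The Schema row is the easiest one to show in code. The sketch below is a simplified illustration, not a real warehouse or lake API: `SCHEMA`, `write_with_schema`, `read_with_schema`, and the sample order records are all hypothetical. With schema on write, a record is validated before it is stored and rejected if it does not fit. With schema on read, the raw line is stored as-is and the schema is applied only when the data is queried.

```python
import json

# Assumed schema for the example: field name -> Python type.
SCHEMA = {"order_id": int, "amount": float}


def write_with_schema(table, record):
    """Schema on write: validate first, store only records that fit."""
    if set(record) != set(SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} is not {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines, schema):
    """Schema on read: parse raw lines and apply the schema at query time."""
    for line in raw_lines:
        record = json.loads(line)
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            # Skip records that cannot be read under this schema.
            continue


# Warehouse side: the mismatching record never gets stored.
warehouse_table = []
write_with_schema(warehouse_table, {"order_id": 7, "amount": 19.99})
try:
    write_with_schema(
        warehouse_table, {"order_id": "7", "amount": "19.99", "note": "gift"}
    )
    rejected = False
except ValueError:
    rejected = True

# Lake side: raw lines are stored untouched, whatever their shape.
lake_raw = [
    '{"order_id": "7", "amount": "19.99", "note": "gift"}',
    '{"order_id": 8}',
]
orders = list(read_with_schema(lake_raw, SCHEMA))
```

This also hints at the Data Processing row: the lake accepts both raw lines immediately, while the warehouse needs the data cleaned to match its schema before it can be loaded.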