Next came the data lake strategy. Data lakes are storage repositories that can hold vast amounts of raw data in its native format until it is needed. In many cases, data lakes are Hadoop-based systems, and they represent the next stage in both power and flexibility. A compelling benefit of this approach is that there is no need to structure (transform) the data before storing it, as the ‘schema on write’ model requires. Instead, you can assign structure to the data at the time it is queried (referred to as ‘schema on read’). However, while data lakes can hold large amounts of unstructured data cost-effectively, they are insufficient for interactive analysis when fast query response is required, or when access to real-time data is needed.
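To make ‘schema on read’ concrete, here is a minimal Python sketch. It is an illustration only: the sample events, the field names, and the read_with_schema helper are all hypothetical, and a real data lake would use a query engine rather than a Python loop. The point is that the raw records are stored untouched, and each query decides which structure to impose when it reads them.

```python
import json

# Raw events as they might land in the lake, stored exactly as they arrived.
# (Hypothetical sample data; note that the records do not share one shape.)
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "ts": "2021-03-01T10:00:00"}',
    '{"user": "ben", "amount": "5", "country": "DE"}',
    '{"user": "ana", "amount": "0.10"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) while reading, not while storing."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record become None; fields not in the schema are ignored.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two different queries impose two different structures on the same raw data.
spend_schema = {"user": str, "amount": float}
geo_schema = {"user": str, "country": str}

spend_rows = list(read_with_schema(RAW_EVENTS, spend_schema))
geo_rows = list(read_with_schema(RAW_EVENTS, geo_schema))

# A simple aggregation over the typed view of the data.
total_by_user = {}
for row in spend_rows:
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]
```

Nothing had to be decided about types or columns when the events were written; the cost is that every read pays for parsing and conversion, which is one reason fast interactive queries are hard to achieve on raw data.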
A Change from ETL to ELT
The proliferation of data lakes enabled the switch from ETL (Extract, Transform, and Load) to ELT (Extract, Load, and Transform). Unlike ETL, where data is transformed before it’s loaded into the database, ELT significantly accelerates load time by ingesting data in its raw state. The rationale behind this approach is that data lake storage technologies are not picky about the structure of the data. Therefore, no development time is required to transform the data into the right structure before it can be loaded and made available for analytics. This means that all data can simply be ‘parked’ or ‘dumped’ into a data lake, and all further operations and transformations can occur within the lake itself, if and when they are needed.
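The load-first, transform-later order can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: an in-memory SQLite database stands in for the data lake, and the table, view, and column names are hypothetical. Raw rows are loaded as plain text with no cleaning, and the transformation is defined afterwards, inside the store, only for the analysis that needs it.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system: everything is text.
RAW_ROWS = [
    ("1001", "2021-03-01", " 19.90 "),
    ("1002", "2021-03-01", "5"),
    ("1003", "2021-03-02", "n/a"),  # dirty value, loaded anyway
]

# An in-memory SQLite database stands in for the data lake (an assumption of this sketch).
conn = sqlite3.connect(":memory:")

# Extract + Load: store the data untouched, with no typing or cleaning.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, order_date TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ROWS)

# Transform: defined later, inside the store, when an analysis asks for it.
# Rows whose amount does not start with a digit are skipped at this stage.
conn.execute(
    """
    CREATE VIEW daily_revenue AS
    SELECT order_date, SUM(CAST(TRIM(amount) AS REAL)) AS revenue
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*'
    GROUP BY order_date
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
```

In an ETL pipeline, the dirty row would have to be handled before anything could be loaded. Here the load succeeds immediately, and the cleaning logic lives in the view, where it can be changed later without reloading the data.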
Data Lakes - a Tantalizing Approach
While it is a tantalizing approach, the data lake falls short of expectations for several reasons. A primary objective of the data lake is to simplify and accelerate analytics; however, the approach often complicates matters by adding extra steps to prepare data for analysis. And although it significantly reduces the labor involved in data loads, it still requires that all data be moved or copied to a single location before it can be accessed for analytical purposes. This drawback is shared with the traditional, ETL-based data warehouse: data-load latency cannot be eliminated from the analytical data supply chain, although it is greatly reduced for the data lake compared with a data warehouse.
Another disadvantage of the data lake is a phenomenon that has come to be known as the ‘data swamp’ or ‘data graveyard’. Because storage is cheaper, the ‘save everything’ approach leads to loading and storing much more data than with ETL, and more than businesses are prepared to analyze. Since any data load takes time and consumes disk space and network bandwidth, unnecessary loads can be expensive and can delay more analytically valuable data from being analyzed in a timely manner.
Advantages and Disadvantages of Data Lakes and ELT
Want to find out which approach is suitable for your business? Check out our free eBook, “Beyond the Data Lake”, and enhance your knowledge of data integration.