A newer approach is the data lake strategy. A data lake is a storage repository that can hold vast amounts of raw data in its native format until it is needed. In many cases data lakes are Hadoop-based systems, and they represent the next stage in both power and flexibility. A compelling benefit of the approach is that there is no need to structure (transform) the data before storing it, which is what ‘schema on write’ requires. Instead, you assign structure to the data at the time it is queried (referred to as ‘schema on read’). However, while data lakes can hold large amounts of unstructured data cost-effectively, they are poorly suited to interactive analysis when fast query response or access to real-time data is required.
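To make ‘schema on read’ concrete, here is a minimal Python sketch. It is an illustration only, not how any particular data lake product works: the sample records and the names RAW_EVENTS, read_with_schema, and total_by_user are all hypothetical. The raw records are stored as they arrived, and field names and types are applied only when a query runs.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# No schema was enforced at write time, so the records differ in shape.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.90", "ts": "2024-01-05"}',
    '{"user": "ben", "amount": 5, "country": "DE"}',
    '{"user": "ana", "amount": "0.10"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema (field name -> converter) while reading, not while writing."""
    for line in raw_lines:
        record = json.loads(line)
        # Fields missing from a record become None; fields not in the schema are ignored.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


def total_by_user(raw_lines):
    """Example query: the schema is chosen here, at query time."""
    schema = {"user": str, "amount": float}
    totals = {}
    for row in read_with_schema(raw_lines, schema):
        amount = row["amount"] if row["amount"] is not None else 0.0
        totals[row["user"]] = totals.get(row["user"], 0.0) + amount
    return totals


if __name__ == "__main__":
    print(total_by_user(RAW_EVENTS))
```

A different query could read the same raw records with a different schema, for example one that keeps only the country field, without reloading or restructuring anything.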
A Change from ETL to ELT
The proliferation of data lakes enables the switch from ETL (extract, transform, and load) to ELT (extract, load, and transform). Unlike ETL, where data is transformed before it’s loaded into the database, ELT significantly accelerates load time by ingesting data in its raw state. The rationale behind this approach is that data lake storage technologies are not picky about the structure of the data, so no development time is required to transform the data into the right structure before it can be accessed for analytics. This means that all data can simply be ‘parked’ or ‘dumped’ into a data lake, and all further operations and transformations can occur within the data lake if and when needed.
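The load-then-transform order can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: an in-memory SQLite database stands in for the data lake, and the table names (staging_orders, orders), the sample rows, and the function run_elt are hypothetical. The raw values are loaded untouched as text, and cleaning and typing happen afterwards inside the store.

```python
import sqlite3

# Hypothetical raw extract: inconsistent casing, stray spaces, numbers as text.
RAW_ROWS = [
    (" ana ", "19.90"),
    ("BEN", "5"),
    ("Ana", "0.10"),
]


def run_elt(raw_rows):
    """Load raw rows first, then transform them inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the data lake (assumption)

    # Load: every column is TEXT, so nothing has to be cleaned before ingestion.
    conn.execute("CREATE TABLE staging_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?)", raw_rows)

    # Transform: runs later, inside the store, only when the data is needed.
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT LOWER(TRIM(customer)) AS customer,
               CAST(amount AS REAL)  AS amount
        FROM staging_orders
        """
    )

    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for customer, total in run_elt(RAW_ROWS):
        print(customer, total)
```

In an ETL pipeline the TRIM, LOWER, and CAST steps would have to run before the INSERT. Here the insert is a plain copy, which is why the load itself is fast.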
Data Lakes - a Tantalizing Approach
While it is a tantalizing approach, the data lake falls short of expectations for several reasons. A primary objective of the data lake is to simplify and accelerate analytics, yet the approach often complicates matters by adding extra steps to prepare data for analysis. And although it significantly reduces the labor involved in data loads, it still requires that all data be moved or copied to a single location before it is accessible for analytical purposes. This drawback is shared with the traditional ETL-based data warehouse: load latency is greatly reduced for the data lake compared to a data warehouse, but it cannot be eliminated from the analytical data supply chain.
Another disadvantage of the data lake is a phenomenon that has come to be known as the ‘data swamp’ or ‘data graveyard’. Because storage is cheaper, the data lake encourages a ‘save everything’ approach, which leads to loading and storing much more data than businesses are prepared to analyze. Since any data load takes time and consumes disk space and network bandwidth, unnecessary loads can be expensive and add latency that delays the analysis of other, more analytically valuable data.
Finally, although data lakes and ELT bring data together into one place quickly, they cannot provide the fast query response that analytical databases do, nor can they provide access to data in real time.
Advantages and Disadvantages of Data Lakes and ELT