Data Lakes Explained in Simple Terms

As businesses generate massive amounts of data, they need efficient ways of storing and analysing it. Traditional storage methods have limitations when dealing with raw, unstructured data. This is where data lakes come in.

What is a Data Lake?

Picture a lake: a vast body of water fed by all kinds of streams and rivers. The lake holds raw, unfiltered water that can be processed later and used as needed. Similarly, a data lake stores raw, unfiltered data that can be processed and analysed later as needed.

Doubling down on the analogies, a data lake is like an attic where you store all sorts of things, e.g. old clothes, photo albums, unused furniture, and old gadgets. Unlike a neat kitchen pantry, nothing is organised and not everything is labelled. You store things in the attic as they come. However, when you do need something, you can always search through the attic, find it, clean it and then use it.

That's how a data lake works with data; "data attic" just doesn't sound as nice.

How Does a Data Lake Work?

  • A data lake is a large storage system that holds all types of data, both structured (like spreadsheets) and unstructured (like videos, images, and PDFs).

  • Data is stored exactly as it's collected without any cleaning, processing or organising.

  • If and when needed, businesses can analyse, filter, and process specific parts of the data.
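The three steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a temporary folder stands in for the storage system, and the helper names `store_raw` and `load_json_events`, the file names and the sample records are all made up for the example.

```python
import json
import tempfile
from pathlib import Path


def store_raw(lake: Path, name: str, payload: bytes) -> Path:
    """Write the payload exactly as received, with no cleaning or parsing."""
    target = lake / name
    target.write_bytes(payload)
    return target


def load_json_events(lake: Path) -> list:
    """Process on demand: parse only the JSON files, and only when asked."""
    return [json.loads(p.read_text()) for p in sorted(lake.glob("*.json"))]


# A temporary folder stands in for a real storage system (an assumption).
lake_dir = Path(tempfile.mkdtemp())

# Structured and unstructured data go in side by side, untouched.
store_raw(lake_dir, "order_1.json", b'{"item": "lamp", "price": 30}')
store_raw(lake_dir, "order_2.json", b'{"item": "desk", "price": 120}')
store_raw(lake_dir, "photo.jpg", b"\xff\xd8\xff")

# Later, analyse just the part that is needed.
expensive = [e for e in load_json_events(lake_dir) if e["price"] > 100]
print(expensive)
```

Note that nothing is parsed at write time; the cost of making sense of the data is paid only when someone asks a question.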

Example: Data Lakes in E-Commerce

  • A large e-commerce website collects vast amounts of customer interaction data daily, such as purchases and reviews.

  • Instead of sorting and cleaning the data as soon as it's collected, which takes a lot of resources at that volume, they can store all of it in a data lake.

  • Later on, for that big boardroom meeting, they can analyse the data to see how they can improve product recommendations and recognise customer preference trends.
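As a rough sketch of that "later on" step, the snippet below cleans messy raw records at read time and counts purchases per category. The event format, the field names and the sample records are assumptions made for illustration, not taken from any real shop.

```python
import json
from collections import Counter

# Hypothetical raw event log, one JSON record per line, stored uncleaned.
RAW_EVENTS = """\
{"type": "purchase", "category": "Books"}
{"type": "review", "category": "books", "stars": 5}
{"type": "purchase", "category": " garden "}
not valid json
{"type": "purchase", "category": "books"}
"""


def purchase_trends(raw: str) -> Counter:
    """Clean records at read time and count purchases per category."""
    counts = Counter()
    for line in raw.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed
        if event.get("type") == "purchase":
            # Normalise inconsistent spacing and capitalisation.
            counts[event["category"].strip().lower()] += 1
    return counts


trends = purchase_trends(RAW_EVENTS)
print(trends.most_common(1))
```

The broken line and the inconsistent category names were stored as they arrived; they only have to be dealt with once the analysis actually runs.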

How Does a Data Lake Build on Blob Storage?

Blob storage is like a collection of buckets, each holding something different, e.g. one bucket for images, another for videos, another for PDF files. On its own, this system has no built-in way to search across all the buckets.

A data lake is the one place that all the buckets flow into, where you can later filter, purify and extract exactly what you need.

Going From Buckets to Lakes

So how does a data lake build on blob storage, which is just basic file storage?

  • Added Structure - data is organised into folders and sub-folders, which gives it some practical structure.

  • Enhanced Searchability - files can carry metadata, which helps locate them when they're needed later.

  • Security - access controls can be managed to protect sensitive data.
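To make these three additions concrete, here is a toy in-memory model. It is not how any real data lake product is implemented; the class names `LakeObject` and `MiniLake`, the paths, the tags and the roles are all invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    path: str  # folder-style path gives the structure
    data: bytes
    metadata: dict = field(default_factory=dict)  # tags for searchability
    allowed_roles: set = field(default_factory=lambda: {"admin"})


class MiniLake:
    """Toy model: blob storage plus folders, metadata and access control."""

    def __init__(self):
        self._objects = {}

    def put(self, obj):
        self._objects[obj.path] = obj

    def list_folder(self, prefix):
        # Structure: everything under a folder shares its path prefix.
        return sorted(p for p in self._objects if p.startswith(prefix))

    def search(self, **tags):
        # Searchability: match objects whose metadata has all given tags.
        return sorted(
            path for path, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in tags.items())
        )

    def read(self, path, role):
        # Security: only roles on the object's list may read it.
        obj = self._objects[path]
        if role not in obj.allowed_roles:
            raise PermissionError(f"{role} may not read {path}")
        return obj.data


lake = MiniLake()
lake.put(LakeObject("images/2024/cat.jpg", b"cat", {"kind": "image"}, {"admin", "analyst"}))
lake.put(LakeObject("images/2023/dog.jpg", b"dog", {"kind": "image"}))
lake.put(LakeObject("reports/2024/salaries.pdf", b"pdf", {"kind": "pdf"}))

print(lake.list_folder("images/"))
print(lake.search(kind="pdf"))
print(lake.read("images/2024/cat.jpg", role="analyst"))
```

In this sketch an analyst can read the cat image, but asking for the salaries report with the same role raises a `PermissionError`, because only the default `admin` role is on that file's list.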

How is a Data Lake Different to a Data Warehouse?

From the Attic to the Library

Think about the difference between books in a library and books in your attic. At the library, the books are organised very neatly according to a specific categorisation code so that you can very quickly find what you're looking for.

The books in your attic lack this structure, but with a bit of time you can still find the one you're looking for. Your books are all in the "book corner", just not organised.

Data Lake                                 | Data Warehouse
------------------------------------------|------------------------------------------
Raw, unprocessed data                     | Clean, structured data
Cheap storage                             | Expensive storage
Slower speed, data requires processing    | Fast speed, data optimised for analysis
Best for storing vast amounts of raw data | Best for business intelligence and reports

Example: Data Lakes and Data Warehouses in Banking

  • A bank can store processed transactions in a data warehouse so that they can instantly generate reports on spending and detect fraud.

  • This is fast but expensive, as the transaction data needs to be cleaned and structured before it is stored.

  • They can also store their unprocessed transaction data in a data lake for future fraud analysis and processing.
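The split between the two stores might look something like the sketch below. It assumes a made-up record format and uses Python's built-in `sqlite3` module as a stand-in for a real data warehouse; the account names and amounts are invented.

```python
import sqlite3

# Hypothetical raw transaction records, kept as received (the "lake" side).
raw_lake = [
    "acc-1, 2024-01-05, 40.00",
    "acc-2, 2024-01-05, 15.50",
    "acc-1, 2024-01-06, 9.50",
]

# The "warehouse" side: rows are cleaned and typed before they are stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE transactions (account TEXT, day TEXT, amount REAL)"
)
for line in raw_lake:
    account, day, amount = (part.strip() for part in line.split(","))
    warehouse.execute(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        (account, day, float(amount)),
    )

# Reports are now a single query over structured data.
report = warehouse.execute(
    "SELECT account, SUM(amount) FROM transactions "
    "GROUP BY account ORDER BY account"
).fetchall()
print(report)
```

The cleaning loop is the up-front cost of the warehouse. The raw lines stay in the lake unchanged, so a future analysis can still go back to exactly what was received.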

Why Do Businesses Use Data Lakes?

  • Storing Unstructured Data - unlike data warehouses, which are built around structured data, data lakes can hold all types of data without requiring structure.

  • Big Data and AI - machine learning models can be trained on vast amounts of raw data.

  • Cost-Effective - without the need for immediate structuring and processing, they are cheaper than data warehouses.

  • Flexible Data Exploration - historical data that wasn't processed or needed in the past can be processed now if needed.