If You Filled a Big Data Lake, Would Anyone Swim in It?

The idea that you should build a “Data Lake” has gained a lot of currency recently in discussions of Big Data implementations. So what is a Data Lake, and is it a good idea? The answers depend on who you ask, and how you set it up.

First, what is it, and what is it not? Definitions tell me a Data Lake is:

  • “A storage repository that holds a vast amount of raw data in its native format until it is needed.”
  • “A massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing ‘big data’. Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be.”
  • “An enterprise-wide data management platform for analyzing disparate sources of data in its native format.”

Why would you want to do this? There is one very good reason: traditional data marts, cubes, etc., only store aggregated — i.e. pre-digested — data and so lose track of the source data that went into them. Since data can only be aggregated to answer questions that were anticipated at design time, the range of questions these marts can answer is by definition limited.

From my own painful experience, here’s how that looks in practice:

  • Business: “we need an information system to answer Question X!”
  • IT spends several person-years building the Question X data mart / warehouse / cube.
  • IT – triumphantly: “Here is your Question X answering system!”
  • Business: “Oh, yeah, thanks. We couldn’t wait around for you, so we have pretty much figured out the answer to Question X by brute force. It’s no longer a problem, but it means that answering new Question Y is very urgent. Can your system answer Question Y?”
  • IT: “No. That will take another $MMM and TTT months.”
  • Business: “Oh for goodness sake!”
  • IT: “Oh for goodness sake!”

So a system that stores data at its finest level of granularity — a specific, serialized product, a customer, a truck, a machine — preserves the flexibility of your Big Data system so that it can help answer questions you have not thought of asking yet. This, in many ways, is the defining value of a Big Data system, and I would strongly advocate for it. The technologies are widely available to allow you to do this on generic hardware, and at very large scales.
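To make the point concrete, here is a minimal sketch of what granularity buys you. The event records, field names, and the question itself are all hypothetical — the point is that because every attribute of every individual event is kept, a question nobody anticipated at design time can still be answered by aggregating on the fly:

```python
from collections import Counter
from datetime import datetime

# Hypothetical granular event records: every attribute of every
# individual event is retained, nothing is pre-aggregated away.
events = [
    {"ts": datetime(2015, 3, 2, 9, 5, 36), "product_serial": "P-1001",
     "location": "L", "event": "scan_fail"},
    {"ts": datetime(2015, 3, 2, 9, 6, 4),  "product_serial": "P-1002",
     "location": "L", "event": "scan_ok"},
    {"ts": datetime(2015, 3, 2, 9, 7, 15), "product_serial": "P-1001",
     "location": "M", "event": "scan_fail"},
]

# A question nobody designed for: which serialized products fail
# most often, regardless of location? With granular data, it is
# just a new aggregation over the same records.
failures_by_product = Counter(
    e["product_serial"] for e in events if e["event"] == "scan_fail"
)
print(failures_by_product.most_common())  # → [('P-1001', 2)]
```

A pre-built "failures by location" cube could never answer this; the granular store answers it with three lines of aggregation.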

So far, so good! But — and you knew there would be a but, didn’t you? — granular data is necessary, but not sufficient.

As I read articles about Data Lakes, I see many suggesting or implying that you can just throw all your raw data into the lake, and sort it out later on when you know what you want to do with it. I’m afraid not. I think Gartner sums it up best:

“However, while the marketing hype suggests audiences throughout an enterprise will leverage data lakes, this positioning assumes that all those audiences are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata.”

Let’s call this first problem “Understanding”. To paraphrase Gartner: unless all your Employees are Statisticians or Data Scientists, all you will get for your Data Lake investment is a big pool of raw data that nobody is equipped to use. And I would go further from my own experience: even for strong Data Scientists, raw, unconnected data will yield little value in the only terms that matter: convincing the organization to invest capital — financial or human — in significant change.

Unfortunately, I have seen this happen many times, including on some of my own early projects. We built it, but they didn’t come!

This is very discouraging if you want to build useful Big Data systems that can answer many questions, not just the ones you knew to ask beforehand. Why is it so? Because at the very detailed granularity we are talking about — remember: we are swimming right down among things that happened to individual products or people at specific times in specific places — minutes, seconds and even milliseconds matter. If problem P happens at 09:05:36 in location L, then it happened to our Customer, Claire. If it happened at 09:06:04, then it happened to Fred, who isn’t a Customer of ours at all! Let’s call this second problem “Fragmentation”. It requires a sequence of sometimes heroic assumptions and correlations to go from raw data this ambiguous to a statement such as “Location L is a problem for our Customers, and we should invest $MMM and TTT to fix it!” Most businesses won’t make that jump. And those that do seriously risk being wrong, resulting in waste and demoralization.
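The Claire-versus-Fred problem above can be sketched in a few lines. Everything here — the streams, the field names, the presence windows — is illustrative, but it shows why second-level correlation is the whole game: shift the problem timestamp by thirty seconds and you get a different person entirely.

```python
from datetime import datetime

# Hypothetical raw streams: a problem log and a customer-presence log.
problems = [{"ts": datetime(2015, 3, 2, 9, 5, 36), "location": "L", "problem": "P"}]
presence = [
    {"customer": "Claire", "location": "L",
     "arrived": datetime(2015, 3, 2, 9, 5, 0),
     "left":    datetime(2015, 3, 2, 9, 5, 50)},
    {"customer": "Fred", "location": "L",
     "arrived": datetime(2015, 3, 2, 9, 5, 55),
     "left":    datetime(2015, 3, 2, 9, 6, 30)},
]

def who_was_affected(problem):
    """Correlate a problem event with whoever was present,
    matching location and time window down to the second."""
    return [
        p["customer"] for p in presence
        if p["location"] == problem["location"]
        and p["arrived"] <= problem["ts"] <= p["left"]
    ]

print(who_was_affected(problems[0]))  # → ['Claire'] — not Fred
```

If the two streams arrive in the lake un-correlated, every analyst has to rebuild this join — and every one of those "heroic assumptions" — from scratch.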

What can be done about these problems, and can they be solved? Luckily, yes. While the problems you will need your Big Data system to solve in the future are unknown, the context that these problems occur in is well known. This needs to be captured and ‘tagged’ to your source data: for every piece of data, you should, at the very least, be able to answer the questions: what happened, to what or whom, where, when, and done by whom. This is how you address the problem of Fragmentation. Going further, the standard KPIs that the business uses to measure itself — its performance, quality, delivery, Customer experience, etc. — should be calculated as the raw data is collected, and linked to their source data in your Data Lake. This is how you address Understanding.
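A minimal sketch of what this looks like at ingest time. The function names, fields, and the toy KPI are all assumptions for illustration, not a standard schema — the point is that the five context questions are answered, and the KPI computed, at the moment of collection, with the raw record still linked underneath:

```python
def tag_event(raw, context):
    """Attach the five context questions to a raw reading at ingest:
    what happened, to what or whom, where, when, done by whom.
    (Illustrative field names, not a standard schema.)"""
    return {
        "what":    raw["event"],
        "to_whom": context["subject"],
        "where":   context["location"],
        "when":    raw["ts"],
        "by_whom": context["operator"],
        "raw":     raw,  # keep the contextualized record linked to its source data
    }

def kpi_on_time(tagged, sla_seconds=60):
    """A toy delivery KPI, calculated as the data is collected."""
    return tagged["raw"]["elapsed_s"] <= sla_seconds

raw = {"event": "delivery", "ts": "2015-03-02T09:05:36Z", "elapsed_s": 48}
ctx = {"subject": "order-7841", "location": "L", "operator": "driver-12"}

record = tag_event(raw, ctx)
record["kpi_on_time"] = kpi_on_time(record)
print(record["kpi_on_time"])  # → True
```

Because the KPI is attached when the data lands, a business user can query "on-time delivery at location L" directly, while a Data Scientist can still drill through to the raw record beneath it.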

In conclusion, a “Big Data Lake” into which raw data is poured to be sorted out later is, in my experience, a waste of money. However, a lake into which contextualized information is carefully placed will be a reservoir of knowledge and value for your business.