Data Lakes: A Guide For Executives

Author: Brian Bulkowski is the CTO of Yellowbrick Data, a modern data warehouse designed for the hybrid cloud. Published by Forbes on April 15, 2020.

“There’s a lot of fact, fiction and confusion surrounding data lakes. Not so much among my fellow chief technology officers (CTOs), but perhaps at the broader C-suite (CXO) level. It’s not surprising, given the explosion of data tooling and solutions on the market. Data lakes, data warehouses, cloud object stores, database management systems, data grids, data marts: it’s a long list!

Given that data lakes and data warehouses are my areas of expertise, I thought I would simplify data lakes for the strategic thinkers out there since they are so widely used — and widely misunderstood — behind the scenes.

Data lakes are evolving with the technology landscape and the needs of businesses in just about every industry, as we all try to extract more insights from more data and sources in order to remain competitive. Legacy infrastructure with siloed data repositories is still the rule, not the exception.

Understanding data lakes will certainly come in handy, especially at the C-level. If, like most organizations today, yours is data-driven and an active participant in the data economy, please read on.

What is a data lake?

Let’s start by stating that a data lake is not Hadoop. If you’re still using Hadoop, my advice is to accelerate your pace away from it — now. A data lake is a single place where a business’s raw data can be stored. Critically, a data lake stores the impartial, non-transformed original source of truth. Every cleaning of data and every coercion into relationships forms an opinion. Data lakes have none of that, which allows for innovative analytics in the future.

On its own, a data lake does nothing. It provides no business value without an analytics environment. To provide value, a data lake must include the analytics component via a data warehouse or other purpose-built tooling to form those analytic opinions.

I’d be remiss if I didn’t mention the infamous (well, in CTO circles, anyway) 2016 Forbes article titled “Why Data Lakes Are Evil” that made the argument against data lakes, saying the lakes themselves were expensive, hard to extract data from and full of old data with little value. Ouch. But this wasn’t inaccurate at the time.

That being said, here are my data lake takeaways for the executives reading this article:

Augmentation is the name of the (data lake) game.

What’s a CXO to do about their data lake strategy? Augmentation. And there’s an important distinction here between “augmenting” a data lake and making it a high-performance and overly expensive data lake.

Data lakes work best when you are not afraid to retain data and when the value comes from combining the lake with your analytics flow. Yes, speed is an important part of augmentation, but the expansiveness of the data lake’s use is the broader story. Seek it.

Low-performance and dumb is good.

In talks at conferences and in private conversations with fellow executives, I argue that data lakes should be low-performance, dumb and cheap. You want a low-performance data lake that offers broad and expansive storage. Then, you want to combine that with an analytics tier to extract the business value of data-driven insights.

And don’t get hung up on the “cheap” aspect of this because we’re not talking about off-the-rack men’s suits. A cheap data lake is still of immense value.

Embrace cheap storage.

I strongly believe that enterprises should reimagine their data lakes by embracing incredibly cheap storage.

Here’s some historical context: Back in about 2010, I remember using a rule of thumb for a traditional enterprise storage tier (say, EMC bricks) of approximately $35 per gigabyte (GB). While that’s the GB purchase amount (with all of the services, the volume manager and the backup system included), the annual cost hits about $35,000 per terabyte (TB) per year.

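Contrast that with the price for storage on Amazon S3 at $0.02 per GB per month, which is about $0.24 per GB per year, or roughly $240 per terabyte per year. Against $35,000 per terabyte per year, that is a difference of about 146 times. That’s what engineers call “two orders of magnitude,” which is the difference between having ten thousand dollars and having a million dollars.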
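For readers who want to check the arithmetic behind this comparison, here is a minimal Python sketch that puts both figures on the same per-terabyte, per-year basis. It uses only the two prices quoted in the article. The decimal conversion of 1 TB = 1,000 GB and every function and constant name are assumptions made for illustration.

```python
import math

# Figures quoted in the article.
ENTERPRISE_USD_PER_TB_YEAR = 35_000  # circa-2010 enterprise storage tier
CLOUD_USD_PER_GB_MONTH = 0.02        # cloud object storage price cited

# Assumption for illustration: decimal units, 1 TB = 1,000 GB.
GB_PER_TB = 1_000
MONTHS_PER_YEAR = 12


def cloud_usd_per_tb_year(usd_per_gb_month=CLOUD_USD_PER_GB_MONTH):
    """Convert a per-GB, per-month price into a per-TB, per-year price."""
    return usd_per_gb_month * MONTHS_PER_YEAR * GB_PER_TB


def cost_ratio(enterprise=ENTERPRISE_USD_PER_TB_YEAR):
    """How many times more expensive the enterprise tier is per TB per year."""
    return enterprise / cloud_usd_per_tb_year()


def orders_of_magnitude(ratio):
    """Base-10 logarithm of a ratio, i.e. the number of powers of ten."""
    return math.log10(ratio)


if __name__ == "__main__":
    ratio = cost_ratio()
    print(f"Cloud storage:      ${cloud_usd_per_tb_year():,.0f} per TB per year")
    print(f"Enterprise storage: ${ENTERPRISE_USD_PER_TB_YEAR:,.0f} per TB per year")
    print(f"Ratio:              about {ratio:,.0f}x")
    print(f"Orders of magnitude: {orders_of_magnitude(ratio):.1f}")
```

On these inputs the sketch reports roughly $240 per terabyte per year for cloud storage and a ratio of about 146 times. Comparing like-for-like units matters here, because mixing a per-terabyte figure with a per-gigabyte figure inflates the gap by a factor of a thousand.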

With the cloud, you get managed storage, with no data center costs and few service people. This single factor has moved the data lake from “evil” to “easy.”

Final thoughts

Data lakes with an analytics component? Good. A data lake alone? Bad.

The above statement is really all that a non-technical CXO needs to know about data lakes. If you remember anything from this article, that’s it!

The fact that you’re reading this article means you are already on to the critical importance of data in all walks of business (and life, frankly). Managing your data and extracting value from it are now arguably the most important responsibilities of those of you who run businesses. So, I hope I’ve cleared up the fact, fiction and confusion about data lakes. Trust me, they’re not so bad. Come on in. The data is warm.”

Editorial Comment: We located this article in DataMatters, Issue #23, April 18, 2020 – Powered by Dun & Bradstreet.