Why Data Lakes are Failing the Modern Enterprise
Gartner estimated that 85% of big data projects fail. Back in 2016, that number was 60%. It moved in the wrong direction. Data lakes sit at the center of it.
I’ve seen this firsthand. At TaskRabbit, we made data infrastructure decisions that started with optimistic architecture diagrams and ended with engineers complaining nobody could find anything. The pattern repeats across the industry, and the root causes are almost always organizational rather than technical.
The Data Swamp Problem #
The promise sounds simple: dump everything into one place—usually S3 or HDFS—apply schema when you read it, let analysts explore however they want. Schema-on-read instead of schema-on-write. Flexibility over rigidity. Sounds great in the pitch deck.
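In code terms, the difference looks roughly like this. A minimal Python sketch of the read-time version; the schema, field names, and sample records are all invented for illustration:

```python
import json

# Schema-on-write rejects bad records at load time; schema-on-read stores
# raw bytes and applies structure only when somebody queries.
SCHEMA = {"user_id": int, "amount": float}  # illustrative schema

def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: coerce fields, skip nonconforming records."""
    for line in raw_lines:
        record = json.loads(line)
        try:
            yield {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue  # schema-on-read quietly drops whatever doesn't fit

raw = ['{"user_id": "7", "amount": "12.5"}', '{"user_id": "oops", "amount": "3"}']
print(list(read_with_schema(raw, SCHEMA)))
# → [{'user_id': 7, 'amount': 12.5}]
```

Note what happens to the second record: it silently disappears at read time. That silence is exactly the flexibility being sold, and exactly where the trouble starts.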
In practice, that flexibility becomes chaos. Without governance (and I mean real governance, not a Confluence page nobody reads), a data lake degrades into what the industry now calls a “data swamp.” Raw files in inconsistent formats. No documentation about what each dataset contains. Duplicate data with no clear lineage. Tables loaded once for a specific analysis and never cleaned up.
A data engineer at a mid-size fintech told me recently that his company had over 40,000 tables in their data lake. His estimate of the actively used ones: maybe 2,000. The rest? Dead weight nobody wants to delete, because nobody knows what depends on those tables.
That’s not a data lake. That’s a landfill.
The Metadata Problem #
The technical root cause, in nearly every failure pattern I’ve seen, traces back to metadata management. Or rather, to the absence of any metadata discipline.
A data warehouse forces structure. Schemas get defined upfront, a catalog exists, and every table carries meaning because someone designed it that way. A data lake skips all of that enforcement by design—leaving structure entirely optional. When nothing enforces structure, structure never materializes. That’s basically a law of organizational physics.
The tools exist (and have existed for years). Apache Hive gives you schema-on-read capabilities; AWS Glue can crawl your S3 buckets and build a catalog automatically. But these tools only deliver value if someone configures them and—more importantly—if the whole team actually adopts them. That second part is where it always falls apart.
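To make “catalog” concrete: here is a toy in-memory sketch of what a useful catalog entry has to capture. This is not the Glue or Hive API, and every name in it is illustrative; the point is the minimum metadata that makes a table findable and trustworthy.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    """The minimum metadata that makes a lake table worth trusting."""
    name: str         # e.g. "payments.transactions_raw" (hypothetical)
    location: str     # object-store path, e.g. "s3://lake/payments/txns/"
    schema: dict      # column name -> type string
    owner: str        # a team, not a person who might leave
    description: str  # what the data *means*, not just what it's called
    registered: date = field(default_factory=date.today)

catalog: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    """Add an entry; refuse silent overwrites, which destroy lineage."""
    if entry.name in catalog:
        raise ValueError(f"{entry.name} already registered; update it instead")
    catalog[entry.name] = entry
```

A real deployment leans on Glue crawlers or a Hive metastore instead of a dict, but notice how little is actually required per table. The hard part was never the data model.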
In every data lake failure I’ve watched up close, the metadata story follows the same arc. Someone sets up a catalog at the start, usually with genuine optimism. For the first few months, teams register their datasets. Then a deadline hits. Someone dumps a CSV directly into S3 without registering it. Then someone else does the same. Within a year, the catalog covers maybe 30% of the actual data and nobody trusts it anymore.
Sound familiar? It happens everywhere.
The Skills Gap #
One underappreciated problem: data lakes require a genuinely different skill set than traditional data warehouses, and most organizations don’t have that skill set.
A data warehouse engineer knows SQL, ETL pipelines, and star schemas. A data lake engineer needs fluency in distributed file systems, Spark or similar processing frameworks, partitioning strategies for object storage, and real-time streaming architectures. The overlap between these skill sets is smaller than people assume—often much smaller.
Organizations pour millions into data lake infrastructure and then don’t have the people who know how to run it well. Data loads fine; transformations break constantly. Queries are slow because nobody optimized the partitioning. The pipeline breaks on weekends because someone built it after learning Spark from a blog post. (No shade—that person was me, more than once.)
An InfoWorld piece from 2018 flagged this as one of the top four reasons big data projects fail. Not the technology itself; the gap between the skills available and the skills required.
The Cost Trap #
Data lakes get sold on cost. S3 storage costs almost nothing. HDFS on commodity hardware is cheaper than a Teradata appliance. Both true.
But storage is the smallest cost in the equation. The real costs: compute for running Spark clusters against data in formats nobody optimized; engineering hours spent building and maintaining pipelines a structured warehouse would eliminate; opportunity cost—analysts burning hours hunting for the right dataset instead of doing actual analysis.
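Partitioning is the clearest example of why the compute bill balloons. With a Hive-style layout, a query for one day touches one key prefix; without it, every query scans everything. A toy sketch with invented paths:

```python
# Hive-style partitioned layout: the partition column is encoded in the
# object key, so readers can skip whole prefixes. Paths are hypothetical.
keys = [
    "events/dt=2024-01-01/part-0000.parquet",
    "events/dt=2024-01-02/part-0000.parquet",
    "events/dt=2024-01-02/part-0001.parquet",
    "events/dt=2024-01-03/part-0000.parquet",
]

def prune(keys, dt):
    """Partition pruning: select only files under the matching prefix."""
    prefix = f"events/dt={dt}/"
    return [k for k in keys if k.startswith(prefix)]

print(prune(keys, "2024-01-02"))
# → ['events/dt=2024-01-02/part-0000.parquet', 'events/dt=2024-01-02/part-0001.parquet']
```

Two files scanned instead of four. At a few thousand files a day, skipping versus scanning is the difference between a cheap query and a Spark cluster grinding through terabytes nobody asked about.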
I’ve seen teams where data engineering spends 60% of its time on pipeline maintenance. Not building new capabilities—just keeping what exists running. At that point, the cost equation inverts. The data lake costs more than a warehouse and runs slower too.
What I Think Actually Works #
For the record: I’m not against data lakes. The concept holds up. Having raw data available for exploration has real value. But the implementation pattern most enterprises follow—build it and they will come—fails every time.
What works, in my experience:
Start with governance. Before ingesting a single dataset, define naming conventions, ownership rules, and a metadata catalog that runs on automation rather than good intentions. Make it harder to add unregistered data than to register it properly. Yes, this slows you down at the start. The friction serves a purpose.
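“Harder to add unregistered data than to register it” can be as blunt as an ingest gate: the only sanctioned write path checks the catalog first. A hedged sketch; the registry contents and function names are invented:

```python
# Hypothetical ingest gate: writes go through one function, and that
# function refuses any dataset the catalog has never heard of.
REGISTRY = {"payments.transactions_raw"}  # datasets with an owner and a schema

def write_dataset(name: str, rows: list) -> None:
    """Refuse writes for unregistered datasets; registration is the easy path."""
    if name not in REGISTRY:
        raise PermissionError(
            f"'{name}' is not in the catalog. Register it (owner, schema, "
            f"description) before writing."
        )
    # ... hand off to the actual storage writer here ...
```

The enforcement mechanism matters less than its placement: if the gate lives in a shared library everyone already imports, bypassing it takes more effort than complying.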
Limit access patterns. Not every team needs direct access to raw data. Build curated datasets for common use cases and let the data engineering team manage the raw layer. This adds a bottleneck; the bottleneck serves a purpose.
Treat data quality as a production concern. If a pipeline breaks, it’s a P1—not “fix it Monday.” Stale or wrong data causes more harm than no data, because decisions get made on bad numbers and the damage surfaces weeks later when everyone’s forgotten what changed.
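The production stance can be made concrete with a freshness gate: stale data fails loudly, like a failing health check, instead of silently serving yesterday’s numbers. A sketch under assumptions; the six-hour threshold and function name are illustrative, and a real system would page the on-call rather than raise:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=6)  # illustrative SLA

def check_freshness(last_loaded_at, now=None):
    """Raise if the table's last successful load is older than the SLA."""
    now = now or datetime.now(timezone.utc)
    age = now - last_loaded_at
    if age > MAX_STALENESS:
        # In production this would fire an alert, not just raise.
        raise RuntimeError(f"table is {age} stale; refusing to serve bad numbers")
```

Failing the query beats answering it wrong: the breakage surfaces the same day, while everyone still remembers what changed.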
And honestly? For most companies, a well-managed data warehouse with a small exploration layer is probably the right answer. The data lake pattern works at Google-scale. For a 200-person company, it usually creates more problems than it solves.
I say “usually” because I’ve gotten that call wrong before too.