Scaling AI: Simplicity Over Better Models

Felipe Hlibco

Last month I sat in a room with a dozen engineering leads from companies running AI in production. Not demos, not prototypes—actual revenue-generating workloads. I asked what their biggest bottleneck was.

Not one person said “model quality.”

Every answer was some version of: our data is a mess, our infrastructure can’t keep up, or we’re burning GPU budget on workflows that don’t need it. A CockroachDB survey of over 1,000 tech leaders backs this up: AI workloads scale faster than the systems underneath them can adapt. And yet the industry conversation stays fixated on the next frontier release. It’s maddening, honestly.

The model isn’t the bottleneck #

Here’s what I keep seeing at Google and across the ecosystem: teams that obsess over model selection ship less. They evaluate, benchmark, swap. Every new release triggers a round of “should we switch?” The answer is almost always no, but the conversation still burns a week.

The teams shipping consistently? They picked one or two core models early—sometimes a frontier model for complex reasoning, sometimes an efficient one like Qwen 3.5 or DeepSeek V4 for throughput—and built depth around those choices. Prompt engineering, fine-tuning, evaluation pipelines, caching layers. The boring stuff that nobody writes blog posts about (except me, apparently).

I’m not arguing you should ignore new models forever. But constant model switching has a cost nobody talks about: you lose institutional knowledge. Your prompts are tuned for specific model behaviors. Your evaluation datasets reflect specific failure modes. Swap the model and you restart a surprising amount of that work. I’ve seen it happen three times in the last year at different orgs.

Data architecture is the actual moat #

The New Stack published a piece recently arguing that the secret to scaling AI isn’t a better model—it’s a simpler data foundation. I’ve been saying some version of this for a while, but they put it well.

Think about what a typical enterprise AI pipeline looks like. Data lives in seventeen different systems. ETL jobs run on schedules that made sense in 2019. Feature stores, if they even exist, are maintained by one person who’s thinking about leaving. The model sits on top of this rickety tower and everyone wonders why it hallucinates.

Unified data lakehouse architectures—the kind Databricks and Snowflake have been pushing—aren’t sexy. But they solve the real problem. When your model can access clean, consistent, well-governed data through a single interface, everything downstream gets easier: training, fine-tuning, RAG pipelines, evaluation. All of it.

I’ve watched teams spend six months evaluating models when three months fixing their data layer would’ve gotten better results with the model they already had. Six months! That’s half a year of spinning wheels because someone wanted to “explore the landscape.”

The 2026 model landscape actually helps here #

Something interesting happened this year. The gap between frontier models and efficient alternatives narrowed dramatically. DeepSeek V4 and Qwen 3.5 aren’t just “good enough”—they’re genuinely capable for a wide range of production tasks at a fraction of the cost.

Smart teams are exploiting this: use the efficient model for 80% of your workload (classification, summarization, structured extraction, routing) and reserve the frontier model for the 10-15% that actually needs it. The remaining 5%? Probably doesn’t need AI at all. A regex or a rules engine would do fine. I know that’s not a popular thing to say in 2026, but it’s true.
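To make the “doesn’t need AI at all” tier concrete, here’s a toy sketch. The task and the ID format are hypothetical, but the point stands: when the input has a fixed structure, a regex does the job deterministically, for free, with no inference call.

```python
import re

# Hypothetical task: pulling order IDs like "ORD-2026-00413" out of
# support tickets. The format is fixed, so this needs a regex, not an LLM.
ORDER_ID = re.compile(r"\bORD-\d{4}-\d{5}\b")

def extract_order_ids(text: str) -> list[str]:
    """Return every order ID found in the ticket text, in order."""
    return ORDER_ID.findall(text)
```

Zero cost, zero latency, zero hallucination. That’s the bar an LLM call has to beat before it belongs in this tier of your workload.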

Model routing tools like LiteLLM and Portkey make this practical. You define policies, set cost thresholds, let the gateway figure out which model handles which request. Not glamorous. Cuts inference costs by 40-60% in my experience, though, so who cares about glamorous.
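The routing idea is simple enough to sketch in a few lines. This isn’t the LiteLLM or Portkey API—it’s a minimal illustration of the policy those gateways implement, with made-up model names and made-up per-token prices:

```python
# Illustrative routing policy: task types map to tiers, and a per-request
# budget acts as a guardrail. Model names and prices are invented.
ROUTES = {
    "classification":    "efficient-model",  # bulk tier (~80% of traffic)
    "summarization":     "efficient-model",
    "extraction":        "efficient-model",
    "complex_reasoning": "frontier-model",   # expensive tier, used sparingly
}

COST_PER_1K_TOKENS = {"efficient-model": 0.0004, "frontier-model": 0.015}

def route(task_type: str, est_tokens: int, budget_usd: float) -> str:
    """Pick a model for a request; fall back to the efficient tier when
    the frontier call would exceed the per-request budget."""
    model = ROUTES.get(task_type, "efficient-model")
    if COST_PER_1K_TOKENS[model] * est_tokens / 1000 > budget_usd:
        model = "efficient-model"
    return model
```

The real gateways add retries, fallbacks, and provider health checks on top, but the core decision is this mapping plus a cost guardrail.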

What actually works #

After watching dozens of teams scale (and fail to scale) AI workloads, I keep coming back to three things.

Get your data house in order first. Before you evaluate a single model, consolidate where you can. Document what you can’t consolidate. Build clear lineage so you know where training data came from and how fresh it is. This sounds basic; it’s also the step most teams skip.

Design around workflows, not models. If your architecture is “we call GPT-4 here and Claude there,” you’re one deprecation notice away from a bad week. Abstract the model behind an interface. Make it swappable without rewiring everything. I learned this the hard way at DreamFlare when an API change broke half our pipeline overnight.
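“Abstract the model behind an interface” can sound hand-wavy, so here’s one way it might look in Python (the names are mine, and the stub stands in for a real vendor SDK):

```python
from typing import Protocol

class TextModel(Protocol):
    """The only model surface the rest of the codebase may touch."""
    def complete(self, prompt: str) -> str: ...

# Each provider gets a thin adapter. Swapping vendors means writing one
# new adapter, not rewiring every call site. This stub is illustrative.
class StubModel:
    def __init__(self, tag: str) -> None:
        self.tag = tag

    def complete(self, prompt: str) -> str:
        return f"[{self.tag}] {prompt}"

def summarize(model: TextModel, text: str) -> str:
    # Workflow code depends on the interface, never on a vendor SDK.
    return model.complete(f"Summarize: {text}")
```

The payoff is exactly the deprecation scenario above: when a provider changes its API, you touch one adapter, not half your pipeline.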

Take GPU costs seriously. They scale non-linearly and most teams don’t track them well. IBM’s 2026 trends report flags this as an emerging crisis, and I agree. You need cost attribution per workflow, per team, per use case—not just “we spent $47K on OpenAI this month.” Where did it go? Which workflows justify the spend and which ones are burning money on tasks a fine-tuned 7B model could handle?
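Cost attribution doesn’t require a FinOps platform to get started. A sketch of the idea—tag every inference call with a workflow label and roll up spend, instead of staring at one opaque monthly invoice (the class and prices here are illustrative):

```python
from collections import defaultdict

# Sketch of per-workflow cost attribution: every inference call is
# recorded against a workflow label, so spend can be rolled up and
# compared. Prices are passed in per call; nothing here is real pricing.
class CostLedger:
    def __init__(self) -> None:
        self._spend: dict[str, float] = defaultdict(float)

    def record(self, workflow: str, tokens: int, usd_per_1k: float) -> None:
        """Attribute one call's cost to its workflow."""
        self._spend[workflow] += tokens / 1000 * usd_per_1k

    def by_workflow(self) -> dict[str, float]:
        """Spend rolled up per workflow, ready for a dashboard."""
        return dict(self._spend)
```

Even this much is enough to spot the workflow quietly burning frontier-model money on tasks a fine-tuned 7B could handle.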

The uncomfortable part #

Most AI scaling failures aren’t technical. They’re prioritization failures. Teams chase the shiny model announcement instead of investing in infrastructure that would make any model work better.

I get it. “We upgraded to the latest frontier model” plays better in the all-hands meeting than “we consolidated three data pipelines into one.” But the second one moves the needle. Every time.

The best AI teams I’ve worked with don’t have access to better models. They have cleaner data, more disciplined infrastructure, and the willingness to pick a model and commit to it long enough to learn its quirks. That’s it; that’s the whole post.