Supercharging ML and AI Dev Experience at Netflix
Every ML engineer I know has the same complaint. Notebooks feel great for exploration but terrible for production. Production pipelines feel great for reliability but terrible for iteration. Pick one, then spend the rest of your week fighting whichever you didn’t pick.
Netflix just shipped something that might actually fix this. Or at least make the fight less painful.
Spin: The Missing Piece in Metaflow #
Metaflow 2.19 introduced Spin, and honestly, it’s the kind of feature that makes you wonder why nobody built it sooner. The idea is dead simple: take a single @step in your production pipeline, pull it out, and run it locally with full state from its parent step. Notebook-style iteration, but inside your actual production DAG.
No mock data. No reconstructing state. Real artifacts from the real pipeline.
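The mechanics are easier to see in a toy sketch. To be clear, this is not Metaflow’s API — the helper names and the `/tmp` artifact path below are made up — it just illustrates why loading a parent step’s persisted artifacts beats reconstructing state by hand:

```python
# Conceptual sketch only (not Metaflow's implementation): a production
# run persists each step's artifacts, so re-running a single step means
# loading the parent's *real* outputs instead of mocking them.
import pickle
from pathlib import Path

ARTIFACT_DIR = Path("/tmp/run_1234")  # hypothetical artifact store

def save_artifacts(step_name, artifacts):
    """Persist a step's outputs, as the orchestrated run would."""
    ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
    (ARTIFACT_DIR / f"{step_name}.pkl").write_bytes(pickle.dumps(artifacts))

def load_artifacts(step_name):
    """Load the real artifacts a parent step already produced."""
    return pickle.loads((ARTIFACT_DIR / f"{step_name}.pkl").read_bytes())

def transform(state):
    """The one step you're actually iterating on."""
    state["features"] = [x * 2 for x in state["rows"]]
    return state

# The full pipeline ran once and persisted its state...
save_artifacts("ingest", {"rows": [1, 2, 3]})

# ...so "spinning" the transform step is just: load parent state, run step.
spun = transform(load_artifacts("ingest"))
print(spun["features"])  # [2, 4, 6]
```

Edit `transform`, re-run the last two lines, and you have the notebook loop — without ever leaving the pipeline’s real data.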
I’ve spent enough time watching teams wrestle with TFX and Vertex AI pipelines to appreciate what this gets right. The iteration loop on most production ML frameworks is punishing — change something, push it, wait for the orchestrator to schedule it, watch it fail on step 3 of 7, read the logs, fix it, push again. Rinse. Repeat. Spin collapses that entire cycle into something closer to a REPL.
The VS Code integration (via the metaflow-dev extension) maps keyboard shortcuts directly to spin commands. Hit a key, your step runs. Hit another, you’re back in the debugger. That kind of developer ergonomics separates tools people tolerate from tools people actually enjoy using. I’ve seen the difference firsthand; small friction reductions compound fast when you repeat a workflow 40 times a day.
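I haven’t dug into the extension’s own command IDs, but you can approximate the hit-a-key ergonomics with a stock VS Code keybinding that types a command into the integrated terminal. Everything below is a placeholder — the key chord, the flow file, the step name, and the exact `spin` invocation are assumptions, not documented behavior:

```json
{
  "key": "ctrl+alt+s",
  "command": "workbench.action.terminal.sendSequence",
  "args": { "text": "python flow.py spin transform\u000D" }
}
```

The `\u000D` is a carriage return, so the command executes immediately. `workbench.action.terminal.sendSequence` is a built-in VS Code command; drop the snippet into `keybindings.json` and adjust to taste.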
Why Developer Experience Matters More Than Architecture #
Here’s the thing Netflix understands better than most: the best ML architecture in the world doesn’t matter if engineers route around it. And they will route around it. Every single time.
If your production pipeline takes 20 minutes to test a single change, engineers will prototype in notebooks and hand-wave the translation to production code. You end up with two codebases (the notebook and the “real” one), constant drift, and a QA process that’s basically vibes. I’ve lived this. At my previous gig we had a Jupyter notebook that was supposed to be “temporary” for about fourteen months.
Netflix’s philosophy, at least as I read their engineering blog, goes like this: make the happy path for production ML the easiest path. Not a best practice enforced through code reviews. Not a linting rule. The path of least resistance.
Spin achieves this by removing the reason engineers reach for notebooks in the first place: fast, interactive, state-aware debugging. If you can get that inside the production framework, why would you leave?
The Bigger Picture: Maestro and the Media Data Lake #
Spin doesn’t exist in a vacuum. Netflix open-sourced Maestro, its workflow orchestrator, earlier this year; internally it powers nearly every ML and AI system. Maestro handles the scheduling, retries, and dependency management that Metaflow workflows rely on at scale.
Then there’s the Media ML Data Lake. This one’s fascinating because of how specific it gets. Not a generic data warehouse — it’s purpose-built for video, audio, text, and image assets. Netflix created an entire specialization around it (Media ML Data Engineering), which tells you how seriously they treat the intersection of media processing and ML infrastructure. Most companies I’ve worked at would never spin up a dedicated data engineering role for a single domain like that; Netflix apparently thought it was obvious.
The combination matters. Maestro orchestrates the workflows. The Media Data Lake feeds them with rich, multimodal data. Spin lets engineers iterate on individual steps without breaking the whole chain. Each piece addresses a different friction point, and together they create something genuinely hard to replicate. Not because the individual ideas are novel (they aren’t) but because the integration is tight enough that the whole stack feels coherent.
What This Means for the Rest of Us #
I’ll be blunt. Working in DevRel at Google, I see the contrast constantly. Vertex AI is powerful. TFX is battle-tested. But the developer experience gap between “exploratory notebook in Colab” and “production pipeline in Vertex” remains wide. The tooling asks engineers to context-switch between two fundamentally different paradigms, and most of them just… don’t. They stay in Colab as long as possible and then rush the production part.
Netflix’s approach suggests a different philosophy: don’t build a bridge between notebooks and production. Build production tooling that feels like a notebook.
That’s a subtle distinction but it matters. Bridges assume two separate islands. Spin says there’s one island; you just haven’t been allowed to walk around all of it yet. I find that framing genuinely useful when thinking about developer tools in general, not just ML-specific ones.
The Claude Code Angle #
One more thing. Metaflow recently shipped Claude Code integration with Spin, letting you pair with an AI assistant while iterating on pipeline steps. I haven’t tested this extensively (my current workflow is messy enough without adding another variable), but the combination of an LLM that understands your step’s context — because Spin provides full parent state — with interactive debugging sounds promising.
Early days. But the trajectory tells you where ML developer experience is heading: away from faster builds and better dashboards, toward reducing the cognitive overhead of working inside production systems. Netflix seems further along that path than anyone else I can point to.
Whether the rest of the industry catches up or just copies the surface-level features without the underlying philosophy… well. That’s always the question, isn’t it.