Stable Diffusion: The Open Source AI Explosion

Felipe Hlibco

Something big is happening in generative AI, and it has nothing to do with who makes the best images.

Stable Diffusion—a text-to-image model built by Stability AI, RunwayML, CompVis at LMU Munich, EleutherAI, and LAION—is about to release its model weights to the public. Not behind an API. Not through a Discord bot. The actual model: downloadable, runnable on your own hardware. About 10,000 testers took part in the beta, with a broader research release to roughly 1,000 researchers expected any day now.

If you follow generative AI primarily through DALL-E 2 and Midjourney, this sounds like just another entry in the text-to-image race. Wrong. This represents a fundamentally different philosophy of how AI tools should reach the world, and my bet is that it will reshape the entire landscape.

The Technical Breakthrough That Matters #

Let me get the specs out of the way because they matter here. Stable Diffusion generates 512x512 images in seconds—on hardware you probably already own. That alone isn’t remarkable; DALL-E 2 does the same thing. The remarkable part is where it runs: a consumer GPU with roughly 6.9GB of VRAM.

Think about what that means. A mid-range gaming graphics card from 2020 handles this model, no cloud compute required. No API calls. No per-image pricing. Download the weights, run locally, generate as many images as the electricity bill allows.
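To make the footprint concrete, here is a rough back-of-envelope sketch of what the weights alone occupy in half precision. The per-component parameter counts are approximate public figures, not official numbers, and inference activations add memory on top of this.

```python
# Rough VRAM estimate for Stable Diffusion's weights alone.
# Parameter counts are approximate public figures (assumptions),
# and runtime activations add more memory on top.
components = {
    "unet": 860_000_000,          # denoising U-Net
    "text_encoder": 123_000_000,  # CLIP text tower
    "vae": 84_000_000,            # latent autoencoder
}
bytes_per_param_fp16 = 2          # half precision
total_gb = sum(components.values()) * bytes_per_param_fp16 / 1e9
print(f"~{total_gb:.1f} GB of weights")  # ~2.1 GB
```

Roughly 2GB of weights in fp16 leaves a ~7GB card with headroom for the denoising computation itself—which is why a mid-range gaming GPU suffices.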

DALL-E 2? Waitlisted. Cloud-only. Pay-per-image when it eventually opens up. Midjourney? Discord-only, subscription model. Google’s Imagen? Published a paper, showed stunning demos, hasn’t let anyone touch it.

Stable Diffusion just hands you the keys—no waitlist, no usage cap, no middleman.

The Architecture Under the Hood #

For the technically curious: Stable Diffusion builds on latent diffusion models, a clever architectural choice that explains the consumer-GPU story.

Traditional diffusion models operate directly in pixel space. They start with noise and iteratively denoise to produce an image. The problem—operating on full-resolution images crushes compute budgets; every denoising step processes every pixel.

Latent diffusion models add a compression step. An encoder first compresses the image into a lower-dimensional latent space. The diffusion process happens in that compressed representation. A decoder then expands the result back to pixel space. The math happens in a much smaller space, which explains the dramatic VRAM drop.

Text conditioning comes from CLIP, OpenAI’s contrastive language-image model from 2021. You type a prompt; CLIP encodes it into a vector representation; that vector guides the diffusion process toward images matching your description. Elegant engineering—leveraging existing breakthroughs (CLIP, latent diffusion, the massive LAION-5B training dataset) into something greater than the sum of its parts.

Open Weights vs. Closed APIs: A Philosophical Split #

This, to me, represents the genuinely important story—beyond the “cool AI art” angle.

A philosophical split is forming in real time. On one side: OpenAI with DALL-E 2, keeping the model behind an API, controlling access, moderating outputs, and extracting revenue per generation. On the other: Stability AI and its collaborators, releasing everything under the Creative ML OpenRAIL-M license—permissive for both commercial and non-commercial use.

The arguments for closed models hold weight. Content moderation is easier when you control the pipeline. Misuse (deepfakes, harassment, non-consensual imagery) becomes harder to police when anyone can run the model locally. OpenAI’s approach lets them filter problematic requests and outputs. Stable Diffusion’s approach doesn’t offer that.

But the arguments for open models carry real force too, and historically those arguments tend to win.

Linux beat proprietary Unix. Android (open-source-ish) outgrew iOS in market share. TensorFlow and PyTorch, both open-source, dominate machine learning research. When developers get the tools and the freedom to build, the ecosystem that emerges dwarfs anything a single company produces.

My expectation: once Stable Diffusion’s weights go fully public, the number of derivative works—fine-tuned models, custom UIs, specialized applications, integrations with existing creative tools—explodes. Not in months. In weeks.

The LAION Question #

Stable Diffusion trained on LAION-5B, a dataset of roughly 5 billion image-text pairs scraped from the internet. This represents both a technical strength and an ethical minefield.

The strength: breadth. LAION-5B covers an absurdly wide range of visual concepts, styles, and subjects. That breadth explains why Stable Diffusion handles everything from photorealistic portraits to abstract art to architectural renderings.

The minefield: consent. Artists, photographers, and designers created those 5 billion images—people who never agreed to have their work used to train a model that arguably displaces them. Copyright implications remain genuinely murky. Does training on copyrighted images qualify as “fair use”? Does a generated image “in the style of” a specific artist infringe on that artist’s rights? No legal precedent covers any of this.

I find myself torn. The technology stuns me. The capability it unlocks matters. But so does the concern that we build billion-dollar systems on the uncredited labor of millions of creators. Both facts coexist.

What Happens Next #

My prediction, offered with full willingness to revisit in six months: the open-source release of Stable Diffusion marks an inflection point for generative AI—not because of model quality (good but not obviously best-in-class) but because of the ecosystem it enables.

When 10,000 beta testers had access, interesting things happened. When researchers got access, more interesting things followed. When everyone gets access? When every developer with a GPU experiments, fine-tunes, and builds on top of this model? That’s when things turn unpredictable in the best way.

The Discord beta channels already showed what creative communities do with these tools: they push boundaries, discover techniques the creators never anticipated, and build workflows the original team couldn’t have designed. Scaling that dynamic from thousands to millions of users—that’s something worth watching closely.

The Google Imagen Contrast #

I work at Google, so I should address this directly: Google announced Imagen in May, and by most benchmarks (including Google’s own FID scores), it produces higher-quality images than Stable Diffusion or DALL-E 2.

But nobody outside Google can touch it. The paper is public. The model isn’t.

Good reasons exist for that caution. Google carries immense brand risk; a single viral instance of Imagen generating harmful content would be a PR catastrophe. The responsible AI considerations hold real weight, and I respect them.

Still, the irony doesn’t escape me. Google holds arguably the best technology and the smallest user base. Midjourney has a charming Discord bot and a million users. Stable Diffusion gives the model away entirely. Markets don’t always reward the best technology—they reward the most accessible technology. That pattern repeats so many times in tech history that it barely qualifies as an observation anymore.

Why This Matters Beyond Art #

Text-to-image gets the headlines. But the latent diffusion architecture underlying Stable Diffusion isn’t limited to generating pretty pictures. The same approach applies to video generation, audio synthesis, 3D model creation, and domains nobody has explored yet.

If the open-source model for images works—and by “works” I mean creates a thriving, self-sustaining ecosystem of builders and users—it establishes a template. Open weights, permissive licensing, community-driven development. That template spreads to every subsequent modality.

We’re watching a new paradigm emerge for how AI capabilities get distributed. Closed API versus open weights. Centralized control versus decentralized innovation.

History usually favors one side of that choice. But history usually takes its time. This time, I think it moves fast.