OSI Releases Version 1.0 of Open Source AI Definition

Felipe Hlibco

Back in September I wrote about the headache of defining “open source” for AI models. The Open Source Initiative has now published their answer—OSAID v1.0, released October 28 at the All Things Open conference in Raleigh. I’ve spent the last ten days reading the definition, the endorsements, the criticism, and the reaction from companies whose models don’t qualify.

My verdict? It’s a necessary compromise that will make some people furious and make everyone’s procurement conversations slightly less painful.

What the Definition Requires #

OSAID v1.0 rests on three pillars. To call an AI system “open source,” you must provide:

Data information—not the training data itself, but sufficiently detailed information about the data to enable a skilled person to recreate a substantially equivalent system. This means metadata: data sources, preprocessing methods, selection criteria, labeling procedures. The actual datasets don’t need to be released.

Complete source code—all code used for training, inference, and evaluation. This must be under an OSI-approved open source license. No custom licenses with commercial restrictions.

Model parameters—the trained weights and configuration files, also under OSI-approved terms. You can download them, modify them, redistribute them.
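The three pillars can be expressed as a rough checklist. This is a sketch of my own, not official OSI tooling: the field names and the (deliberately partial) license set are illustrative assumptions.

```python
# Hypothetical checklist for the three OSAID v1.0 pillars.
# Field names and OSI_APPROVED are my own illustration, not official tooling.

OSI_APPROVED = {"Apache-2.0", "MIT", "BSD-3-Clause", "GPL-3.0"}  # partial list

def osaid_check(model: dict) -> list[str]:
    """Return the list of OSAID pillars a model release fails to satisfy."""
    failures = []
    # Pillar 1: data information (metadata about the data, not the datasets)
    if not all(model.get(k) for k in ("data_sources", "preprocessing", "selection_criteria")):
        failures.append("data information")
    # Pillar 2: complete training/inference/eval code, OSI-approved license
    if not (model.get("training_code") and model.get("code_license") in OSI_APPROVED):
        failures.append("source code")
    # Pillar 3: weights and configuration under OSI-approved terms
    if not (model.get("weights_available") and model.get("weights_license") in OSI_APPROVED):
        failures.append("model parameters")
    return failures
```

A release that only ships weights under a custom license would fail the first two pillars outright, which is exactly the pattern the definition is meant to surface.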

The endorsement list is substantial: over 20 organizations including Mozilla, EleutherAI, CommonCrawl, and the Eclipse Foundation, plus more than 100 individuals who participated in the drafting process.

The Controversy #

The data information requirement—metadata instead of actual data—is where the fighting starts.

Critics argue that without the training data, you can’t truly reproduce the model. The Software Freedom Conservancy published a sharp critique calling OSAID a definition that “erodes the meaning of open source.” Their position: if you can’t inspect and reproduce the complete system, calling it “open source” is misleading. The parallel to traditional software is clear—imagine calling a program open source while keeping the build tools proprietary.

The OSI’s counterargument is pragmatic. Many training datasets contain copyrighted material, personal data, or content with complex licensing. Requiring full data disclosure would make compliance nearly impossible for large-scale models and might violate privacy regulations in some jurisdictions. The metadata requirement is a compromise: you know enough about the data to understand the model’s provenance without requiring redistribution of datasets the model creator may not have the right to share.

I find both sides partially right. The Conservancy’s point about reproducibility is technically correct—without the data, you can’t fully replicate the training process. But the OSI’s pragmatism reflects the reality that demanding full data access would mean almost no existing model qualifies, rendering the definition useless.

Who’s In, Who’s Out #

The definition creates clear winners and losers.

Models that qualify: Pythia (EleutherAI), OLMo (AI2), T5 (Google), Amber and CrystalCoder (LLM360). These projects released everything—data, code, weights—under permissive licenses before the definition even existed. They were already doing what OSAID now codifies.

Models that don’t qualify: Llama (Meta), Grok (xAI), Mixtral (Mistral). These are the ones commonly called “open source” in casual conversation. Llama’s license includes commercial restrictions and revenue-based usage limits. Grok’s training data isn’t disclosed. Mixtral releases weights under Apache 2.0 but doesn’t provide sufficient data information.

This is going to create friction. Meta has spent considerable marketing capital positioning Llama as open source. The OSAID definition says otherwise. Whether Meta adjusts its licensing or simply ignores the definition remains to be seen, but the discourse has a reference point now that it lacked before.

Did It Solve the Headaches? #

Partly. The definition gives procurement teams, legal departments, and engineering leaders a shared vocabulary. When someone says “open source AI,” there’s now an official standard to reference. That’s progress—real progress—for anyone who’s sat through a meeting arguing about what “open” means.

But the compromise on data creates a new category of ambiguity. A model can be OSAID-compliant (releasing metadata but not data) and still be substantially less inspectable than traditional open source software. The spirit of open source has always been about transparency and reproducibility; metadata disclosure is transparency-lite.

For my own evaluations, I’ll use the OSAID definition as a floor, not a ceiling. If a model meets OSAID requirements, that tells me the creators are serious about openness. If a model also releases training data (like OLMo and Pythia), that’s even better. If a model doesn’t meet even the OSAID bar (like Llama), I’ll call it what it is: a model with open weights and a restrictive license.
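The floor-not-ceiling rule can be sketched as a simple tiering function. The tier names and signature are my own shorthand, not terminology from the OSAID text:

```python
def openness_tier(meets_osaid: bool, releases_data: bool, open_weights: bool) -> str:
    """Bucket a model release by the floor-not-ceiling rule.

    Tier names are the author's shorthand, not OSAID terminology.
    """
    if meets_osaid and releases_data:
        return "fully open"           # OSAID plus actual training data (OLMo, Pythia)
    if meets_osaid:
        return "open source (OSAID)"  # metadata disclosed, datasets withheld
    if open_weights:
        return "open weights only"    # downloadable weights, restrictive or no-data release
    return "closed"
```

The point of the ordering is that OSAID compliance is the floor: data release moves a model up a tier, and anything below the floor gets the plainer "open weights" label rather than "open source."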

What Happens Next #

The definition is version 1.0, not the final word. The OSI has signaled this is a living document that will evolve as the technology and legal landscape change. I expect the data disclosure requirements to be the primary battleground for v2.0.

The more immediate effect will be on marketing. Companies that have been loosely using “open source” to describe their models now face a choice: meet the definition, lobby to change it, or stop using the term. My guess is we’ll see all three.

For practitioners, the advice is simple. Read the OSAID definition. Use it as a baseline when evaluating models. Don’t trust marketing labels; check the actual license, data documentation, and code availability. The definition doesn’t solve everything—it’s a first attempt at a hard problem—but it solves enough to be useful.

The million headaches aren’t gone. They’re down to maybe half a million. I’ll take it.