Defining Open Source AI: Solving a Million Headaches
Last month I burned two days evaluating “open source” models for a production use case at DreamFlare. By the end I was more confused than when I started — not about the models themselves, but about what “open source” even means anymore.
Traditional open source is straightforward: you get the source code, you can modify it, you can redistribute it. That definition has been settled for decades. But AI models aren’t source code. They’re trained artifacts; the “source” is really the training data, the training code, and the hyperparameters, while the weights are the output of that process. Calling a model “open source” because you released the weights is like calling a compiled binary “open source” because you published the .exe.
MIT Technology Review flagged this definitional gap back in March 2024. They were right. It’s only gotten messier since.
The Meta Problem #
Meta’s Llama models are the poster child for this confusion. Meta markets Llama as open source. You can download the weights, fine-tune them, build products with them. Sounds open, right?
Except — and this is a big except — Meta’s license includes commercial restrictions. Revenue thresholds. No competing products. The training data isn’t disclosed. The training code isn’t fully available. By traditional open source standards, this isn’t even close.
But Meta’s framing has stuck because the models are genuinely useful and freely available for most use cases. The practical openness matters to developers even if the legal and philosophical openness doesn’t meet the bar.
This is the headache. When your CTO asks “should we use an open source model?” — which definition are you actually using?
The OSI Process #
The Open Source Initiative has been wrestling with a formal definition for months now, running a public drafting process with release candidates that anyone can comment on. The core question they’re grappling with: does “open source AI” require releasing the training data, or just metadata about the data?
Their validation process so far has been revealing. Only a handful of models passed: Pythia from EleutherAI, OLMo from AI2, Amber and CrystalCoder from LLM360, and Google’s T5. These projects released everything — data, code, weights, training details. The whole stack.
Models that failed validation include Llama 2, Grok (from xAI), Microsoft’s Phi-2, and Mistral’s Mixtral. All of them are commonly called “open source” in casual conversation. None of them meet the draft criteria.
Why This Matters for Production #
If you’re evaluating models for enterprise use, the definitional ambiguity creates real problems. “Open source” traditionally implies certain freedoms: you can inspect it, modify it, and understand what it does. For AI models, that understanding requires knowing what data went in. Without training data access — or at least detailed metadata — you can’t audit for bias, verify compliance with data regulations, or reproduce results.
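That audit question can be made concrete as a checklist: for each candidate model, record which components the vendor actually disclosed, then flag the gaps. Here is a minimal sketch of that idea in Python; the field names and the `OpennessReport` / `audit_gaps` helpers are my own illustration, not the OSI's official criteria.

```python
from dataclasses import dataclass, fields

@dataclass
class OpennessReport:
    """Hypothetical per-model checklist of disclosed components."""
    weights_released: bool
    training_code_released: bool
    training_data_released: bool
    data_metadata_released: bool
    license_allows_unrestricted_commercial_use: bool

def audit_gaps(report: OpennessReport) -> list[str]:
    """Return the name of every component the vendor has not disclosed."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]

# A Llama-style release: weights available, but data, code,
# and unrestricted commercial use all missing.
llama_like = OpennessReport(
    weights_released=True,
    training_code_released=False,
    training_data_released=False,
    data_metadata_released=False,
    license_allows_unrestricted_commercial_use=False,
)

print(audit_gaps(llama_like))
```

For a Pythia- or OLMo-style release every field would be `True` and the gap list would be empty; that difference is exactly what the legal-team questions above are probing for.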
I’ve sat in meetings where legal teams asked whether a model’s training data included copyrighted material, and the honest answer was “we don’t know because the vendor won’t say.” That’s not a tenable position for any company with regulatory obligations.
The OSI’s work matters because it gives the industry a shared vocabulary. Whether the final definition requires full data disclosure or just metadata, having a definition beats the current state where everyone uses the same words to mean completely different things.
I’m watching the process closely. The next draft should land soon, and it’ll shape how we evaluate and procure AI models for years. Whatever they decide, it’ll solve at least some of these headaches. The question is whether it creates new ones in the process.