Matt Asay
Contributor

Making sure open source doesn’t fail AI

analysis
Oct 16, 2023 · 7 mins
Artificial Intelligence | Open Source | Technology Industry

The lessons learned from cloud are spurring a proactive examination of what it means to be 'open source' in the rapidly evolving world of AI.

Open source has never sat comfortably in the cloud. It’s true that open source has never been stronger (despite some evidence that most open source projects are an unmaintained security wasteland), but there’s a reason new license models keep cropping up, and it’s not corporate greed. Rather, it’s because the fundamental freedoms of open source and the rightly revered Open Source Definition (OSD) were never updated to truly address the changes cloud computing imposed (something I was writing about back in 2009).

“Open source kind of missed the evolution of the way software is distributed and executed,” argues Stefano Maffulli, executive director of the Open Source Initiative (OSI), which shepherds the OSD. “We didn’t really pay attention to what was going on and that led to a lot of tension in the cloud business.” As I remember it, we did pay a lot of attention to the shift, but there was so much pressure to move fast that as an industry we ended up accepting third-party hacks like the Affero General Public License (AGPL), which satisfied no one and ignored the reality of software creation. Some have called the AGPL “toxic,” and major influencers like Google simply rejected it as “extremely difficult…to comply with.”

This time around, the OSI is determined to upgrade open source for the AI era. In an interview, Maffulli walked me through the challenges and opportunities of applying the OSD to AI.

“You keep using that word”

We’ve seen all the hounding of Meta and others for alleged misuse of the term “open source” for their large language models (LLMs) and other AI projects. “There is no settled definition of what open source AI is,” says Mike Linksvayer, head of developer policy at GitHub. We can pretend that open source means the same thing in AI as it does elsewhere, just as we did with cloud, but the result will be equally unsatisfactory. “AI models are ostensibly just software programs, but the way they are developed, used, and distributed is unlike software,” Aryn CEO and co-founder Mehul Shah points out. When we pretend otherwise, we’re setting up open source for failure.

“Artificial intelligence ruptures the boundary between data and software,” the OSI suggests. Should open source apply to the weights/floating point numbers in an LLM, or to training data or something else? It’s unclear, and anyone suggesting otherwise risks making the same facile mistake we made with cloud, settling for license hacks rather than truly grappling with the hard issues.
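To make the ambiguity concrete: what ships as a “model” is mostly giant arrays of floating point numbers, not human-readable logic. Here is a minimal sketch in plain Python (the file name and values are illustrative assumptions, not any real checkpoint format):

```python
import struct

# Illustrative stand-in for model weights. Real LLMs have billions of
# these floating point values, produced by training, not written by hand.
weights = [0.0213, -1.4071, 0.9984, 0.0005]

# Serialize them the way a checkpoint might: as raw 32-bit floats.
with open("model.bin", "wb") as f:  # hypothetical file name
    f.write(struct.pack(f"{len(weights)}f", *weights))

# Reading the file back yields numbers, full stop. Neither the training
# data nor the code that produced them is recoverable from the artifact,
# which is exactly why "source" is so hard to pin down here.
with open("model.bin", "rb") as f:
    restored = struct.unpack(f"{len(weights)}f", f.read())
print(restored)  # float32 round-trip, so values may differ slightly
```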

The outcomes matter.

Consider the European Parliament’s latest drafts of the European AI Act, which try to carve out exceptions for “free and open source” AI (section 60). As Maffulli notes, “If there is no clear understanding of what [open source] means in practice, I expect there will be a vast chilling effect on research,” and “both academia and private labs will suffer because of the uncertainties.” Those of us who lived through the early years of commercial open source as applied to the cloud will recognize the problem I described in 2007: No one knew then what “distribution” meant as “software” increasingly meant “services distributed over a network.” Would-be corporate adopters weren’t sure how to apply free and open source licenses like the GPL.

“I didn’t want to repeat…the same mistakes [we had with] the cloud with AI,” Maffulli stresses. “So that’s why three years ago we started looking into what’s happening in [AI].” It’s hard to overstate how different this is from when cloud jarred the OSI. It felt like the OSI reacted years too late with cloud; now, it’s getting involved early.

“Extremely complex”

Instead of just assuming that software is software, Maffulli and the OSI are grappling with fundamental questions about what software means in AI. As Maffulli tells it, “What does it mean for a developer to have access to a model, and what are the rights that should be exercis[ed], and what do you need in order to have the possibility to modify [and redistribute] that model?” Although most developers don’t inspect or modify source code, it’s essential that they can. This is perhaps particularly true in AI, where black-box systems can have powerful impacts on day-to-day life (like screening job applicants). We need the ability to peer into that “black box” to understand how the machine learning model arrived at its decisions. In open source, “you need to study how it works to be able to verify that what you’re receiving corresponds to what it claims on the box,” Maffulli argues.
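As a small illustration of that “verify what’s on the box” point, here’s a minimal sketch (plain Python; the file name and published digest are hypothetical) of checking that downloaded weights match what a publisher claims. Note the limits: a matching checksum proves the bits are what was published, not how the model behaves; for that, you need the deeper transparency Maffulli is describing.

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute the SHA-256 digest of a file, reading in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical digest a publisher might list alongside a model release.
PUBLISHED_SHA256 = "0123abcd..."  # placeholder, not a real checksum

if file_sha256("model.bin") != PUBLISHED_SHA256:  # hypothetical file name
    raise SystemExit("Weights do not match the published checksum.")
print("Weights match what the publisher claims.")
```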

Sound easy? It’s not. “It’s what we’re trying to do and it’s an extremely complex scenario,” he stresses.

To ensure the OSI isn’t navel-gazing, Maffulli is working with a broad spectrum of interested parties: Mozilla, venture capitalists, Creative Commons, academics, and everyone in between. The point is to talk with a wide array of those with a “vested interest in getting a clearer understanding,” because “open source is recognized as a driver for economic success.”

The biggest difficulty is in defining open source in a world where data and software are so inextricably linked. As Maffulli describes, the most intense discussions among his working group revolve around the dependencies between training data and the instructions on how to apply it. Perhaps not surprisingly, given the complexity and the stakes involved, “there is no strong consensus right now on what that means,” he says.

There are at least two approaches, with two primary factions squaring off in the working group. The first tries to stick closely to the comfortable concept of source code, promoting the idea that “source code” translates one-to-one to the data set. In this view, the combination of the instructions on how to build the model and the binary code is the source code subject to “open source.”

The second faction sees things in a radically different way, believing that you can’t modify code without having access to the original data set. In this view, you need other things to effectively exercise the fundamental freedoms of open source. You need, as Maffulli elaborates, “a very detailed description of what went into creating the artifact.” In this world, you’d need to publish all the scripts that went into assembling the data set, the weights that govern the LLM, the biases you bring to the model, etc. This, in my view, is a much more interesting and useful way to think about open source AI, but it’s also far more complicated to deliver in practice.
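To make the second faction’s position concrete, here’s a sketch of the kind of release manifest it implies (plain Python; every field name here is a hypothetical illustration, not a proposed standard):

```python
from dataclasses import dataclass, field

@dataclass
class ModelRelease:
    """Hypothetical manifest describing how a model artifact was created.

    Roughly, the first faction's view covers the first two fields;
    the second faction would require all of them.
    """
    weights_uri: str                     # the binary artifact (checkpoint)
    build_instructions_uri: str          # how to assemble and train the model
    dataset_scripts: list[str] = field(default_factory=list)  # data assembly
    training_data_description: str = ""  # provenance, filtering, known biases
    hyperparameters: dict = field(default_factory=dict)

release = ModelRelease(
    weights_uri="https://example.org/model.bin",            # hypothetical
    build_instructions_uri="https://example.org/train.py",  # hypothetical
    dataset_scripts=["scrape.py", "dedupe.py", "filter.py"],
    training_data_description="Web crawl, filtered and deduplicated",
    hyperparameters={"learning_rate": 3e-4, "epochs": 2},
)
print(release)
```

Nothing about such a manifest is settled; it simply shows why the second view is more useful and also why it is harder to deliver in practice.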

Neither approach is likely to be viable by itself. According to Maffulli, “We’ll have to find the common denominator.” As he explains, the current approaches might render a huge swath of AI impervious to open source: one overlooks applications where data cannot be shared, or where applications are built without the concept of shared data (e.g., federated learning systems), and the other doesn’t appreciate how essential data is to making AI work. It’s similar in some ways to how the OSI tried to force a definition of “distribution” onto the cloud, which made it impractical or impossible for some of the best open source citizens, like Google, to embrace licenses like the AGPL.

Maffulli is optimistic, however. By 2024, he says, we’ll see “something that resembles the four freedoms, because everyone has a very high urgency of finding a definition of open source AI.” The approach this time is different. “We’re taking a step back [and not] making a checklist of 10 points to evaluate licenses.” Instead, those four freedoms (four principles, really) will “become five freedoms and [will be] applied to AI systems.” Once we’ve settled on these essential principles, he argues, it will be relatively easy to apply them to deep learning, machine learning, and other AI systems to gauge “open source.”

Fortunately, the OSI isn’t operating alone. Maffulli is running an open, inclusive process to ensure that the OSI doesn’t shortchange open source in AI. If you’d like to help, you can get involved in the OSI’s Deep Dive process, comment on the evolution of the OSD, or reach out to Maffulli directly.


Matt Asay runs developer relations at MongoDB. Previously, Asay was a Principal at Amazon Web Services and Head of Developer Ecosystem for Adobe. Prior to Adobe, Asay held a range of roles at open source companies: VP of business development, marketing, and community at MongoDB; VP of business development at real-time analytics company Nodeable (acquired by Appcelerator); VP of business development and interim CEO at mobile HTML5 start-up Strobe (acquired by Facebook); COO at Canonical, the Ubuntu Linux company; and head of the Americas at Alfresco, a content management startup. Asay is an emeritus board member of the Open Source Initiative (OSI) and holds a J.D. from Stanford, where he focused on open source and other IP licensing issues.
