by Scott McCarty

Applying the lessons of open source to generative AI

feature
Aug 21, 20238 mins
Artificial IntelligenceGenerative AIMachine Learning

The excitement and turmoil surrounding generative AI is not unlike the early days of open source, or the Wild West. We can resolve the uncertainty and confusion.

shutterstock 17221348 Wild West two cowboys galloping dusty desert roping horses
Credit: Jeanne Provost / Shutterstock

Most people can sing “Happy Birthday” by heart. But do you know how you learned it? Who first shared it with you? Who wrote it? You have the lyrics and the melody, and can teach others, but you probably have no idea where it came from.

That effectively describes “the rub” with generative AI, and it’s a problem for the individuals and organizations that use it. Much like the early days of open source, and software licensing in general, generative AI is uncharted territory, it’s exciting, and there’s a lot to learn.

Even between deciding to develop this column and sitting down to actually write it, dozens and dozens of news stories distracted me—and further confused the AI issues I was contemplating—not least a story about OpenAI CEO Sam Altman telling the Senate that, yes, there should be a regulatory agency in place in the likely instance that the technology goes “haywire.”

In other words, generative AI is messy.

The Harvard Business Review notes that generative AI has an intellectual property problem that spans a complex set of questions:

  • How should existing laws apply?
  • What should we do about infringement?
  • What rights do AI users have?
  • What rights do content creators have?
  • Who owns AI-generated works?
  • Should unlicensed content be used for training?
  • Should users be able to prompt AI models to cite the licensed and unlicensed works they were trained on?

How did we get to this point so quickly? Part of the confusion lies in the opacity of the generative AI model.

The GPT in ChatGPT

It all goes back to the “GPT” in ChatGPT. GPT stands for generative pre-trained transformer. A transformer is not that big—about 2,000 lines of code. It’s basically the equivalent of an egg carton—its main purpose is to hold the “eggs,” or the things that really have value to consumers. In the case of generative AI, the “eggs” are variables or weights.

Sometimes humans forget where they learned something, but often they can remember and can cite sources. Unlike a human, ChatGPT and other generative AI platforms can’t actually remember any of the information they have ingested, nor can they cite it. There may be a log that exists somewhere, but it’s not in the model itself. Users can’t write a prompt to cite training data. The model just has a bunch of numbers and variables. It’s akin to a bunch of neurons, and fake neurons at that. These models just statistically predict the next word based on a bunch of content.

So, how are we going to solve the problem?

Numerous mechanisms are being explored to control the use of AI models, generated content, and weights:

  1. Regulation. The government can make laws to control how AI can be used and prescribe punishments for breaking those laws.
  2. Licensing. This is a scalable legal agreement between creators and consumers of a software, prose, images, video, etc. Open source was founded on the pillars of “rights” and “freedoms” enabled by licensing, but AI is forcing us to take a hard look (Llama and ChatGPT are not open source) at whether ultimate freedom is really the best solution.
  3. Contracts. Content creators and those who pay for its creation often have contracts when they do business. For example, the Writers Guild of America West proposed a poison pill in the contract that would prevent any AI-generated content from being copyrighted. The studios’ business models rely on copyrighted material so this would make it quite difficult to use AI-generated content instead of human writers.
  4. Technical controls: Much like in security, there is a difference between “policy” and “technical controls.” For example, there’s a major difference between mandating that people must change their password every 90 days and forcing them to change their password when they try to log in. Similarly, many AI companies and researchers are attempting to control what an AI model or service will and won’t do, but users are finding all kinds of creative ways to coax AI into doing prohibited things using prompt injection attacks.

I’m skeptical that any of the four methods above will actually control what people can do and can’t do with AI. Though, as with open source, I do think consistent and well-understood licensing will be key to widespread adoption by businesses.

The HBR article I referenced earlier agrees, noting that licensing will be key to protecting both AI creators and consumers. But how is that going to happen?

The AI Wild West

The excitement and turmoil surrounding generative AI is not unlike the early days of open source. And, the early days of open source were like the Wild West. Licenses were being created and used with no oversight, causing uncertainty and confusion, which is the opposite of what licensing is supposed to do. In the late-90s, the Open Source Initiative (OSI) took over and basically said, “We’re the keepers of everything that is open source.” Today, both the OSI and the Free Software Foundation publish open source definitions, which are used to determine the conformance of open source licenses.

And for about 25 years we kinda thought we were “done” with open source licensing. But because of AI (and other things like cloud providers), we’re going to need to rethink licensing schemes, or maybe generate totally new ones. The weights, the models, and the training data all likely need licenses because, without all the inputs and the outputs being well understood, it will make it difficult for businesses to adopt.

AI blurs these lines. When humans generate knowledge, it’s easy to understand the source and the ethics behind ownership of the knowledge. But when you start to get into AI models, it’s like, OK, who owns that stuff, anyway? Because, let’s be honest, it stands to reason that not all models are being trained on content that was approved for use in that way. In fact, I think it’s pretty safe to say that many of these models are violating copyright and license agreements. But, how do you prove it? It’s just a bunch of numbers in a model. How can you sue somebody in court over that?

Just as the OSI did with open source, the OSI is stepping in to try and put some guardrails around all of this. The OSI’s Deep Dive project posits that “the traditional view of open source code implementing AI algorithms may not be sufficient to guarantee inspectability and replicability of the AI systems.” The OSI has put out a series of podcasts on the subject, and it is conducting four virtual seminars designed to “frame a conversation to discover what’s acceptable for AI systems to be ‘Open Source.’” The OSI even has a blog about the outcome from the first face-to-face community discussion: Takeaways from the “Defining Open AI” community workshop.

If all of this sounds really confusing, that’s because it is. And the landscape is changing by the day. It will be really important for organizations to stay abreast of all of the news, and to do the best they can to avoid hype and fearmongering and focus instead on what can be done—now—to balance the benefits of AI with governance and guardrails.

I highly recommend paying close attention to the OSI’s work and to push the vendors you work with to explain what they are doing (and will do) to ensure the effective and ethical use of AI in an open source context. The goal is to deliver AI-powered applications with trusted provenance.

(And, for the record, “Happy Birthday,” which dates back to the late-1800s, was derived from a song written by schoolteacher Patty Hill and her sister, Mildred.” Over the years, it has been at the center of many copyright battles and millions of dollars of licensing fees. It is currently in the public domain.)

At Red Hat, Scott McCarty is senior principal product manager for RHEL Server, arguably the largest open source software business in the world. Scott is a social media startup veteran, an e-commerce old timer, and a weathered government research technologist, with experience across a variety of companies and organizations, from seven person startups to 12,000 employee technology companies. This has culminated in a unique perspective on open source software development, delivery, and maintenance.

Generative AI Insights provides a venue for technology leaders—including vendors and other third parties—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.