Build RAG-powered LLM applications using the tools you know with a managed vector index in Azure.
If you’re building generative AI applications, you need to control the data used to generate answers to user queries. Simply dropping ChatGPT into your platform isn’t going to work, especially if you’re using proprietary data that wasn’t part of the initial training set for the large language model (LLM) you’re using.
Without some form of grounding, your AI is liable to quickly and randomly generate plausible-sounding text as it tries to predict the output associated with a user’s prompt. It’s not surprising that the LLM “hallucinates” like this: it’s purely a statistical text generator, working at the level of tokens (fragments of words) rather than whole words, using its neural network to navigate the most likely path through a multidimensional semantic space.
Reducing AI risk with RAG
There’s more than one way to reduce this risk: You can fine-tune an LLM on your own data, or you can take advantage of retrieval-augmented generation (RAG) techniques. The latter option is popular; it’s part of how search engines like Bing use LLMs, running a search and then generating text based on the search results. It’s also at the heart of popular LLM workflow tools, such as the open source LangChain, LlamaIndex, and Haystack, and Microsoft’s own Semantic Kernel.
It’s important to understand that you can’t simply plug any old database into an LLM because there’s no semantic model in most databases, no way to link either the relational or NoSQL world to what an LLM “understands.” We need to bring the two together in a way that lets us construct a search that can feed into a prompt.
Why did Microsoft start its Copilot story with Bing? The answer is quite simple: A search engine uses some of the same underlying technologies as generative AI, building on a vector index. Like an LLM, it treats content as semantic vectors, which it compares to deliver results using variants of a nearest neighbor algorithm. Data is stored along with a vector index that allows a quick comparative search, which is then used to construct the familiar search page.
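The core of that comparative search can be sketched in a few lines. The function and data names below are illustrative, not any product’s API; the sketch simply ranks stored vectors by cosine similarity to a query vector, the nearest-neighbor idea described above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, top_k=2):
    """Rank stored (id, vector) pairs by similarity to the query vector."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

# Toy 3-dimensional "embeddings"; real semantic vectors have hundreds
# or thousands of dimensions.
docs = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.1, 0.9, 0.2],
    "doc-c": [0.8, 0.2, 0.1],
}
print(nearest_neighbors([1.0, 0.0, 0.0], docs))
```

A real vector index does the same ranking, but with approximate nearest-neighbor data structures so the comparison stays fast at scale.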
If we’re going to build a RAG-grounded AI application, we’re going to have to build our own search engine on our data, adding a vector index to deliver those essential semantic searches. During the past few months, I’ve looked at several vector index implementations where Microsoft has been adding the necessary tools and features to existing database products and services.
Searching with vector indexes
Microsoft isn’t the only vector index provider on Azure; MongoDB and others offer supported vector index services there too. Most of these are add-on vector indexes that use semantic embeddings to add a vector index column to existing stores. That works, but you may prefer to take your existing data and use a native vector database for at-scale searches.
One option is Pinecone, which offers a serverless vector database as a managed Azure service. Available through the Azure Marketplace, it offers a range of different price points, from a basic free service for small applications to larger-scale standard or enterprise versions. They all have an embedding API, as well as storage and read/write charges. You will need to spend time figuring out which plan is best for your service.
The key to working with Pinecone is its API. All you need to connect is a key and an SDK. Using an SDK simplifies working with the underlying REST API, translating its calls and responses to platform-specific ways of working and integrating it with existing code.
Introducing Pinecone’s .NET SDK
Along with Azure support, Pinecone now has a .NET SDK, developed on GitHub and available on NuGet, ready for use in your code. Connecting to a vector database is easy: Add the Pinecone namespace to your code, then create the connection using your API key. You can get the key from the dashboard associated with your Pinecone instance, where you can generate a new one if necessary.
The .NET SDK includes methods that allow you to work with your database programmatically, letting you build your own management tools. This can help you build tools to manage and update your RAG vector index, allowing subject-matter experts to create embeddings and add data to the database, combining the two operations into a single action.
You can use the SDK to create a new index from scratch, for example, if you want to provide alternate RAG sources for an application. This technique allows you to ensure that results for, say, one product or service don’t pollute the results for another, avoiding user confusion. This way you can filter searches to the appropriate RAG source before using an LLM to generate an output.
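The routing this enables can be sketched with plain dictionaries standing in for separate per-product indexes; the names and data here are illustrative, not SDK code:

```python
# Separate in-memory stores stand in for per-product vector indexes,
# so a search for one product never returns another product's chunks.
indexes = {
    "widget-pro": {"wp-1": "Widget Pro supports USB-C charging."},
    "widget-lite": {"wl-1": "Widget Lite uses AA batteries."},
}

def search_product(product, keyword):
    """Route the search to the index for the selected product only."""
    store = indexes[product]
    return [text for text in store.values() if keyword.lower() in text.lower()]

print(search_product("widget-pro", "charging"))
```

Because each product’s content lives in its own store, a query scoped to one product can never surface chunks from another, which is exactly the cross-pollution the technique avoids.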
Indexes can be either serverless or pod-based. If you’re using the managed Azure Pinecone service, you want to work with serverless indexes. To create one, start with a name and the dimension of the semantic space you’re creating, which can be very large. Then choose a serverless index and target it at the Azure region you want to use. Other options include adjusting how your index scales and the type of search algorithm it uses.
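Whatever SDK you use, creating a serverless index comes down to a handful of settings. The field names below are illustrative rather than the SDK’s exact parameter names; they simply capture the decisions listed above: a name, the dimension (1,536 is the output size of a common OpenAI embedding model), the distance metric, and the target cloud region.

```python
# Illustrative index settings; real SDK parameter names will differ.
index_settings = {
    "name": "product-docs",
    "dimension": 1536,   # must match your embedding model's output size
    "metric": "cosine",  # distance metric used to rank matches
    "cloud": "azure",
    "region": "eastus2",
}

def validate_settings(settings):
    """Basic sanity checks before creating an index."""
    assert settings["dimension"] > 0, "dimension must be positive"
    assert settings["metric"] in {"cosine", "euclidean", "dotproduct"}
    return True

print(validate_settings(index_settings))
```

The dimension is the one setting you can’t change later without re-embedding your data, so it’s worth pinning it to your embedding model up front.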
Upserting and querying in .NET
Once you’ve created an index, you can quickly add new values. Pinecone calls this process an “upsert.” Here you’ll load the vectors created by an embedding, first providing a batch of IDs, then the dense and sparse vectors, and finally appropriate metadata (which can include the original text of the source document). That last option allows you to prefilter data when making a query. Other commands in the SDK help manage your vector index: deleting, updating, and listing vectors.
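The shape of an upsert batch can be shown with plain data structures. The record layout follows the pattern described above (an ID, a vector, and metadata per record), but the field names and the stand-in embedding function are illustrative, not the SDK’s types:

```python
def build_upsert_batch(chunks, embed):
    """Pair each text chunk with an ID, its embedding, and metadata
    carrying the original text for use at query time."""
    return [
        {
            "id": f"chunk-{i}",
            "values": embed(text),       # dense vector from an embedding model
            "metadata": {"text": text},  # keep source text for prompt building
        }
        for i, text in enumerate(chunks)
    ]

# A stand-in embedding: real code would call an embedding model API.
def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]

batch = build_upsert_batch(["First chunk of a doc.", "Second chunk."], fake_embed)
print(batch[0]["id"], batch[0]["metadata"]["text"])
```

Keeping the source text in the metadata is what lets a later query return readable chunks, not just vector IDs.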
If you’re building an index for a RAG application, you should first chunk your documents so each embedding encodes only a section of text. The chunk can then be included in the upsert as part of the metadata.
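A simple fixed-size chunker with overlap is enough to illustrate the idea; production systems often split on sentence or section boundaries instead. This is generic Python, not part of any SDK:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows so each embedding covers a
    bounded span and context isn't lost at chunk boundaries."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "x" * 500
pieces = chunk_text(doc, chunk_size=200, overlap=50)
print(len(pieces))  # 4 overlapping chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, which keeps the embeddings for adjacent chunks semantically continuous.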
The most important part of building the search component of a RAG application is, of course, querying your vector index. As part of setting up an index, you’ve already defined the distance metric used to select similar documents. The type of metric depends on the type of results you want to return. For most RAG applications you probably want to choose a cosine metric, as this finds documents that are similar to your query term.
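Cosine works well for this because it scores direction in the embedding space while ignoring vector magnitude, so two texts that “point the same way” semantically score as similar even if their vectors differ in length. A quick sketch (generic Python, not SDK code) shows the contrast with a Euclidean distance:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for vectors pointing the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def euclidean(a, b):
    """Straight-line distance: sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two vectors pointing the same way but with different magnitudes:
v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]  # same direction, twice the length

print(cosine(v, w))     # prints 1.0: a perfect directional match
print(euclidean(v, w))  # nonzero: the magnitude difference still shows up
```

Which metric is right depends on how your embedding model was trained, but for similarity-style RAG retrieval the directional measure is usually the one you want.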
Querying is relatively straightforward: Simply make a request to your index and namespace, passing an embedding vector for your query and the number of responses to return. When you run a query and add context to a prompt, use the document chunks stored in the index metadata. Ranking information can help construct the context needed, allowing the LLM to produce a natural language text output.
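Turning ranked matches into prompt context can be sketched as follows. The result shape (a list of matches with scores and metadata) mirrors the pattern described above, though the field names are illustrative rather than the SDK’s:

```python
def build_context(matches, max_chunks=3):
    """Take query matches, rank them by similarity score, and join
    their stored source text into a context block for the LLM prompt."""
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    texts = [m["metadata"]["text"] for m in ranked[:max_chunks]]
    return "\n\n".join(texts)

# Simulated query results, highest-scoring chunk not necessarily first:
matches = [
    {"id": "chunk-2", "score": 0.81, "metadata": {"text": "Refunds take 5 days."}},
    {"id": "chunk-7", "score": 0.93, "metadata": {"text": "Returns need a receipt."}},
]
prompt_context = build_context(matches)
print(prompt_context)
```

The resulting block is what you’d prepend to the user’s question in the LLM prompt, grounding the generated answer in the retrieved chunks.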
Having a .NET SDK for Pinecone lets you quickly integrate the managed Azure Pinecone service into your applications, working with tools like Semantic Kernel to build RAG applications using Azure OpenAI. Building new tools like these into a familiar ecosystem that supports cloud-native development makes a lot of sense. It flattens the learning curve and allows you to reuse code and components, and as the SDK is open source, you can help guide its future development.