Use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in LangChain and other LLM applications.

In my recent introduction to LangChain, I touched briefly on LangSmith. Here, we'll take a closer look at the platform, which works in tandem with LangChain and can also be used with other LLM frameworks. My quick take on LangSmith is that you can use it to trace and evaluate LLM applications and intelligent agents, and to move them from prototype to production.

Here's what the LangSmith documentation says about it:

LangSmith is a platform for building production-grade LLM applications. It lets you debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework and seamlessly integrates with LangChain, the go-to open-source framework for building with LLMs.

As of this writing, there are three implementations of LangChain in different programming languages: Python, JavaScript, and Go. We'll use the Python implementation for our examples.

LangSmith with LangChain

So, basics. After I set up my LangSmith account, created my API key, updated my LangChain installation with pip, and set up my shell environment variables, I tried to run the Python quickstart application:

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()
llm.predict("Hello, world!")

Before we discuss the results, take a look at the LangSmith hub:

Figure 1. The LangSmith hub acts as a repository for prompts, models, use cases, and other LLM artifacts.

Moving on to the next tab, here is the trace list from the default project:

Figure 2. The Python project trace list shows logs of my six attempts to run the quickstart. The first five were unsuccessful: the Python output indicated a timeout from OpenAI.

I took the hint from the timeouts, went to my OpenAI account, and upgraded my ChatGPT plan to ChatGPT Plus ($20 per month). That gave me access to GPT-4 and the ChatGPT plugins, but my program still didn't run. I kept the upgrade anyway: I suspect I'll need the additional capabilities. Next, I remembered that the OpenAI API plan is separate from the ChatGPT plan, so I upgraded that as well, adding $10 to the account and setting it up to replenish itself as needed. Now the Python program ran to completion, and I was able to see the successful results in LangSmith:

Figure 3. A successful prediction run, finally. Note the Playground button at the top right of the screen.

Looking at the metadata tab for this run told me that it ran the "Hello, world!" prompt against the gpt-3.5-turbo model at a sampling temperature of 0.7. Higher temperatures produce more random output, while a temperature of 0 makes the model's responses essentially deterministic; the OpenAI API accepts values from 0 to 2.

Figure 4. The metadata for a successful prediction. In addition to the YAML block at the bottom, there's a JSON block with the same information.
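Before moving on, here is a minimal sketch of the configuration that makes traces like these show up in LangSmith in the first place. It assumes the environment variable names from the LangSmith documentation at the time of writing (LANGCHAIN_TRACING_V2, LANGCHAIN_ENDPOINT, LANGCHAIN_API_KEY, and LANGCHAIN_PROJECT); the key values below are placeholders, and you can just as easily export the variables in your shell, as I did.

import os

# Placeholder values; substitute your own keys, or export these variables in your shell instead.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "default"  # the LangSmith project that collects the traces
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

# With tracing enabled, the quickstart call shown earlier is logged automatically.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()
print(llm.predict("Hello, world!"))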
Overview of LangSmith

LangSmith logs all calls to LLMs, chains, agents, tools, and retrievers in a LangChain or other LLM program. It can help you debug an unexpected end result, determine why an agent is looping, figure out why a chain is slower than expected, and tell you how many tokens an agent used. LangSmith provides a straightforward visualization of the exact inputs and outputs to all LLM calls.

You might think that the input side would be simple, but you'd be wrong: in addition to the input variables (the prompt), an LLM call uses a template and often auxiliary functions, such as retrieval of information from the web, uploaded files, and system prompts that set the context for the LLM. In general, you should keep LangSmith turned on for all work with LangChain; you only have to look at the logs when they matter.

One of the useful things you can try, if an input prompt doesn't give you the results you need, is to take the prompt to the Playground, which is shown in Figure 5 below. Use the button at the top right of the LangSmith trace page (shown in Figure 4) to navigate to the Playground.

Figure 5. The LangSmith Playground allows you to interactively edit your input, change the model and temperature, adjust other parameters, add function calls, add stop sequences, and add human, AI, system, function, and chat messages. This is a time saver compared to editing all of this in a Python program. Don't forget to add your API keys to the website using the Secrets & API Keys button. Note that Playground runs are stored in a separate LangSmith project.

LangSmith LLMChain example

In my introduction to LangChain, I gave the example of an LLMChain that combines a ChatOpenAI call with a simple comma-separated list parser. Looking at the LangSmith log for this Python code helps us understand what's happening in the program. The parser is a subclass of the BaseOutputParser class. The system message template for the ChatOpenAI call is fairly standard prompt engineering.

from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.chains import LLMChain
from langchain.schema import BaseOutputParser

class CommaSeparatedListOutputParser(BaseOutputParser):
    """Parse the output of an LLM call to a comma-separated list."""

    def parse(self, text: str):
        """Parse the output of an LLM call."""
        return text.strip().split(", ")

template = """You are a helpful assistant who generates comma separated lists.
A user will pass in a category, and you should generate 5 objects in that category in a comma separated list.
ONLY return a comma separated list, and nothing more."""
system_message_prompt = SystemMessagePromptTemplate.from_template(template)

human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
chain = LLMChain(
    llm=ChatOpenAI(),
    prompt=chat_prompt,
    output_parser=CommaSeparatedListOutputParser()
)
chain.run("colors")

Figure 6. The Run tab for the top-level chain shows the human input, the parsed output, the latency (under a second), and the tokens used, as well as the clock time and call status.

Diving down to the ChatOpenAI LLM call provides additional information, shown in Figure 7.

Figure 7. At the LLM level, we see the system input and the output produced by the LLM before parsing.

We can glean even more information from the metadata, shown in Figure 8.

Figure 8. The metadata for the ChatOpenAI call tells us the model used (gpt-3.5-turbo), the sampling temperature (0.7), and the runtime version numbers.
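If you want to check locally what LangSmith will display as the LLM-level input, one quick option is to render the prompt yourself. The sketch below assumes the chat_prompt object defined in the example above is still in scope; format_messages() is a standard ChatPromptTemplate method.

# Optional local check: render the same messages LangSmith shows at the LLM level.
# Assumes chat_prompt from the example above.
messages = chat_prompt.format_messages(text="colors")
for message in messages:
    # Prints each message class (SystemMessage, HumanMessage) and its content.
    print(type(message).__name__, "->", message.content)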
LangSmith evaluation quickstart

This walkthrough evaluates a chain using a dataset of examples. First, it creates a dataset of example inputs; then it defines an LLM, chain, or agent for evaluation. After configuring and running the evaluation, it reviews the traces and feedback within LangSmith.

Let's start with the code. Note that the dataset creation step can only be run once as written, because it doesn't check whether a dataset with the same name already exists. (A guarded variant is sketched at the end of this section.)

from langsmith import Client

example_inputs = [
    "a rap battle between Atticus Finch and Cicero",
    "a rap battle between Barbie and Oppenheimer",
    "a Pythonic rap battle between two swallows: one European and one African",
    "a rap battle between Aubrey Plaza and Stephen Colbert",
]

client = Client()
dataset_name = "Rap Battle Dataset"

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Rap battle prompts.",
)

for input_prompt in example_inputs:
    # Each example must be unique and have inputs defined.
    # Outputs are optional.
    client.create_example(
        inputs={"question": input_prompt},
        outputs=None,
        dataset_id=dataset.id,
    )

from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

# Since chains and agents can be stateful (they can have memory),
# create a constructor to pass in to the run_on_dataset method.
def create_chain():
    llm = ChatOpenAI(temperature=0)
    return LLMChain.from_string(llm, "Spit some bars about {input}.")

from langchain.smith import RunEvalConfig, run_on_dataset

eval_config = RunEvalConfig(
    evaluators=[
        # You can specify an evaluator by name/enum.
        # In this case, the default criterion is "helpfulness".
        "criteria",
        # Or you can configure the evaluator.
        RunEvalConfig.Criteria("harmfulness"),
        RunEvalConfig.Criteria(
            {"cliche": "Are the lyrics cliche?"
             " Respond Y if they are, N if they're entirely unique."}
        ),
    ]
)

run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=create_chain,
    evaluation=eval_config,
    verbose=True,
    project_name="llmchain-test-1",
)

We have a lot more to look at for this example than the last one. The code above uses a dataset, runs the model against four prompts from the dataset, and runs multiple evaluations against each generated rap battle result. Here are the evaluation statistics, which were printed in the terminal during the run:

Eval quantiles:
             0.25  0.5  0.75  mean  mode
harmfulness  0.00  0.0   0.0  0.00   0.0
helpfulness  0.75  1.0   1.0  0.75   1.0
cliche       1.00  1.0   1.0  1.00   1.0

Somebody had fun creating the rap battle prompts, as shown in the dataset below:

Figure 9. The key-value dataset created by the client.create_dataset() call.

As an aside, I had to look up Aubrey Plaza, who played the deadpan comic character April Ludgate in Parks and Recreation.

This code used its own project name, llmchain-test-1, so that's where we look for results:

Figure 10. The first line in each pair is the LLM chain result, and the second is the LLM result.

Here is the Barbie vs. Oppenheimer rap battle, as generated by gpt-3.5-turbo.

Figure 11. This is the end of the Barbie/Oppenheimer rap battle text generated by the LLM chain. It won't win any prizes.
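As noted above, the walkthrough's dataset creation step fails if you run it a second time. Here is a guarded variant so the script can be re-run; the has_dataset() and read_dataset() calls reflect my reading of the langsmith client at the time of writing, so check them against your installed version.

from langsmith import Client

client = Client()
dataset_name = "Rap Battle Dataset"

# Reuse the dataset if it already exists; otherwise create it.
if client.has_dataset(dataset_name=dataset_name):
    dataset = client.read_dataset(dataset_name=dataset_name)
else:
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Rap battle prompts.",
    )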
The LangSmith Cookbook

While the standard LangSmith documentation covers the basics, the LangSmith Cookbook repository delves into common patterns and real-world use cases. You should clone or fork the repo to run the code. The cookbook covers:

- Tracing your code without LangChain, using the @traceable decorator (a minimal sketch follows the conclusion below);
- Using the LangChain Hub to discover, share, and version-control prompts;
- Testing and benchmarking your LLM systems in Python and TypeScript or JavaScript;
- Using user feedback to improve, monitor, and personalize your applications;
- Exporting data for fine-tuning; and
- Exporting your run data for exploratory data analysis.

Conclusion

LangSmith is a platform that works in tandem with LangChain or by itself. In this article, you've seen how to use LangSmith to debug, test, evaluate, and monitor chains and intelligent agents in a production-grade LLM application.
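As promised in the list above, here is a minimal sketch of tracing plain Python code, with no LangChain involved, using the @traceable decorator. It assumes the same LANGCHAIN_API_KEY and LANGCHAIN_TRACING_V2 configuration sketched earlier in this article; in older versions of the langsmith package the decorator is imported from langsmith.run_helpers instead.

from langsmith import traceable

@traceable(run_type="chain")
def shout(text: str) -> str:
    # An ordinary function; its inputs, outputs, and latency are logged to LangSmith.
    return text.upper() + "!"

@traceable(run_type="chain")
def greet(name: str) -> str:
    # Nested traceable calls appear as child runs in the trace tree.
    return shout(f"hello, {name}")

print(greet("world"))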