The new generation of models is expected to take on all large language models, including the likes of GPT-3.5, Claude Sonnet, Mistral models, and Grok.

Facebook, Instagram, and WhatsApp parent Meta has released a new generation of its open source Llama large language model (LLM) in a bid to capture a bigger share of the generative AI market by taking on all model providers, including OpenAI, Mistral, Anthropic, and Elon Musk's xAI.

"This next generation of Llama demonstrates state-of-the-art performance on a wide range of industry benchmarks and offers new capabilities, including improved reasoning. We believe these are the best open source models of their class, period," the company wrote in a blog post, adding that it had set out to build open source models on par with the best-performing proprietary models on the market.

Currently, Meta is making available the first two models of its third generation of LLMs: pre-trained and instruction-fine-tuned variants with 8 billion and 70 billion parameters. Typically, an LLM provider releases multiple variants of a model to let enterprises choose between latency and accuracy depending on the use case. While a model with more parameters can be relatively more accurate, one with fewer parameters requires less computation, responds faster, and therefore costs less.

The variants released, according to Meta, are text-based models and don't support any other form of data. The company expects to release multilingual and multimodal models with longer context windows in the future as it works to improve overall performance across capabilities such as reasoning and code-related tasks.

Claim of better performance than other models

Meta claims that its new family of LLMs performs better than most other LLMs, although it does not show how they perform against GPT-4, which now drives ChatGPT and Microsoft's Azure and analytics services.

"Improvements in our post-training procedures substantially reduced false refusal rates, improved alignment, and increased diversity in model responses. We also saw greatly improved capabilities like reasoning, code generation, and instruction following, making Llama 3 more steerable," the company said in a statement.

To compare Llama 3 with other models, the company ran tests on what it calls standard benchmarks, such as MMLU, GPQA, MATH, HumanEval, and GSM-8K, and found the variants scoring better than most LLMs, such as Mistral, Claude Sonnet, and GPT-3.5.

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining, while GPQA (Graduate-Level Google-Proof Q&A) tests a model's expertise in solving complex science problems. GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry; PhDs in the corresponding domains achieve only 65% accuracy on these questions. GPT-4 held the highest accuracy score on the test, 39%, per data reported in a paper published in November last year. Llama 3's 70-billion-parameter variant has since scored 39.5, and the smaller model 34.2. Gemini Pro 1.5 currently holds the highest GPQA score at 41.5, and the same LLM also beat the larger Llama 3 variant on the MATH benchmark.
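Benchmarks such as MMLU and GPQA are, at bottom, multiple-choice tests scored by simple accuracy. The following is a minimal, illustrative sketch of how such scoring works; the question format and the model_answer stub are assumptions for the example, not Meta's evaluation harness:

```python
# Illustrative sketch of how multiple-choice benchmarks such as MMLU and
# GPQA are scored: accuracy over a set of questions. The data format and
# the model stub below are assumptions, not Meta's evaluation harness.

questions = [
    {
        "prompt": "Which particle mediates the electromagnetic force?",
        "choices": {"A": "Gluon", "B": "Photon", "C": "W boson", "D": "Higgs boson"},
        "answer": "B",
    },
    # ... a real benchmark has hundreds of items (GPQA has 448)
]

def model_answer(prompt, choices):
    # Stub: replace with a call to the model under test, returning "A".."D".
    return "B"

def accuracy(questions):
    correct = sum(model_answer(q["prompt"], q["choices"]) == q["answer"]
                  for q in questions)
    return correct / len(questions)

print(f"accuracy: {accuracy(questions):.1%}")
# GPQA baselines on this scale: domain PhDs about 65%, GPT-4 about 39%,
# Llama 3 70B about 39.5%, per the figures cited above.
```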
The dataset used in evaluation across the benchmarks, according to the company, contained about 1,800 prompts covering 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, extraction, inhabiting a character/persona, open question answering, reasoning, rewriting, and summarization. "To prevent accidental overfitting of our models on this evaluation set, even our own modeling teams do not have access to it," the company said.

Overfitting is a phenomenon in machine learning in which a model performs well on training data but fails on testing data. When training a model, a data professional keeps two separate datasets, one for training and one for testing, in order to check performance on data the model has not seen. Overfitting happens when a model learns the training data too well, which is to say that it learns the noise and the exceptions in the data and doesn't adapt to new data. This can happen when the training data is too small, contains irrelevant information, or the model trains for too long on a single sample set. (A minimal code illustration of overfitting appears at the end of this section.)

The HumanEval and GSM-8K benchmarks, on the other hand, are used for testing code generation and arithmetic reasoning, respectively.

Improvements over Llama 2

Meta said in a blog post that it has made many improvements in Llama 3, including opting for a standard decoder-only transformer architecture. "Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language much more efficiently, which leads to substantially improved model performance," the company said. (A sketch comparing tokenizer efficiency also follows at the end of this section.)

To improve the inference efficiency of the Llama 3 models, the company said it adopted grouped query attention (GQA) across both the 8B and 70B sizes. "We trained the models on sequences of 8,192 tokens, using a mask to ensure self-attention does not cross document boundaries," it added. (Minimal sketches of grouped query attention and document-boundary masking appear at the end of this section as well.)

Other improvements include the training dataset for Llama 3, which the company claims is seven times larger than the one used to train Llama 2. Llama 3 is pre-trained on over 15 trillion tokens collected from publicly available sources, the company said. To ensure that Llama 3 was trained on high-quality data, the company developed a series of data-filtering pipelines, including heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers. "We found that previous generations of Llama are surprisingly good at identifying high-quality data, hence we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3," the company said. (A hypothetical sketch of such a pipeline closes this section.)

To reduce training time by 95% compared with Llama 2, Meta claims it used an advanced training stack that automates error detection, handling, and maintenance. "We also greatly improved our hardware reliability and detection mechanisms for silent data corruption, and we developed new scalable storage systems that reduce overheads of checkpointing and rollback," the company said. Training runs for Llama 3 were conducted on two custom-built 24K-GPU clusters.

The combination of all the improvements and advancements, including the enhanced security features, sets the new models apart from competitors such as OpenAI's ChatGPT, Mistral's Le Chat, Google's Gemini, and xAI's Grok, said Paul Nashawaty, lead of the app development and modernization practice at The Futurum Group. The approach Meta has taken with Llama 3 may offer a distinct avenue for understanding and navigating human interactions better, Nashawaty added.
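To make the earlier overfitting discussion concrete, here is a minimal, self-contained sketch (using scikit-learn, which the article does not mention; it is simply a convenient tool for the demonstration). A high-degree polynomial fits a small training set almost perfectly but fails on the held-out test set:

```python
# Minimal overfitting demonstration: a high-capacity model memorizes a
# small training set (including its noise) and fails on held-out data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)  # noisy target

# Hold out a separate test set, as the article describes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# The degree-15 model's training error is near zero while its test error
# blows up: it has learned the noise, i.e., it has overfit.
```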
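The effect of the larger 128K-token vocabulary can be observed directly, since a more efficient tokenizer produces fewer tokens for the same text. Below is a short sketch using Hugging Face's transformers library; the two checkpoints named are the gated Llama 2 and Llama 3 repositories, so access must be requested before this will run:

```python
# Sketch: compare how many tokens Llama 2's 32K-vocabulary tokenizer and
# Llama 3's 128K-vocabulary tokenizer need for the same text. Both
# checkpoints are gated on Hugging Face and require an access token.
from transformers import AutoTokenizer

text = "Meta claims the new tokenizer encodes language much more efficiently."

for model_id in ("meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"):
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tok(text)["input_ids"])
    print(f"{model_id}: vocab={tok.vocab_size}, tokens={n_tokens}")

# Fewer tokens per sentence means more text fits in the same context
# window and each forward pass covers more content.
```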
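Grouped query attention reduces inference cost by letting several query heads share a single key/value head, which shrinks the key/value cache that dominates inference memory. A minimal PyTorch sketch of the idea, with illustrative shapes rather than Llama 3's actual configuration:

```python
# Minimal grouped-query attention (GQA) sketch in PyTorch. Several query
# heads share each key/value head, shrinking the KV cache. Shapes here
# are illustrative, not Llama 3's actual configuration.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 64
n_q_heads, n_kv_heads = 8, 2            # 4 query heads per KV head
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # 4x smaller cache
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand K/V so each group of query heads attends to its shared KV head.
k = k.repeat_interleave(group, dim=1)   # -> (batch, n_q_heads, seq, dim)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v     # (batch, n_q_heads, seq, head_dim)
print(out.shape)
```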
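The document-boundary masking Meta mentions can be pictured as a block-diagonal mask combined with the usual causal mask: a token may attend only to earlier tokens from the same document, even when several documents are packed into one 8,192-token sequence. A small sketch follows; the doc_ids layout is an assumption for illustration:

```python
# Sketch: a causal attention mask that also blocks attention across
# document boundaries, as Meta describes for Llama 3's 8,192-token
# training sequences. The doc_ids layout here is illustrative.
import torch

# A packed sequence of 8 tokens drawn from three concatenated documents.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
seq_len = doc_ids.shape[0]

causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
mask = causal & same_doc   # attend only to earlier tokens in the same doc

print(mask.int())
# Block-diagonal lower-triangular pattern: no attention crosses a
# document boundary even though the documents share one sequence.
```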
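Meta has not published its filtering code, but the stages it names compose naturally into a pipeline. The following is a hypothetical sketch; every predicate in it is a simplified stand-in for the real heuristic, NSFW, deduplication, and quality-classifier components:

```python
# Hypothetical sketch of a pretraining-data filtering pipeline with the
# stages Meta names: heuristic filters, an NSFW filter, semantic
# deduplication, and a text-quality classifier. Every predicate below is
# a simplified stand-in, not Meta's actual pipeline.
import hashlib

NSFW_TERMS = {"placeholder-blocked-term"}   # stand-in block list

def heuristic_ok(doc):
    # e.g., drop very short documents and ones that are mostly symbols
    alpha = sum(c.isalpha() for c in doc)
    return len(doc) > 200 and alpha / len(doc) > 0.6

def nsfw_ok(doc):
    return not any(term in doc.lower() for term in NSFW_TERMS)

_seen = set()
def not_duplicate(doc):
    # Real semantic dedup would compare embeddings or MinHash signatures;
    # an exact content hash stands in for the idea here.
    fingerprint = hashlib.sha256(doc.encode()).hexdigest()
    if fingerprint in _seen:
        return False
    _seen.add(fingerprint)
    return True

def quality_ok(doc):
    # Stand-in for the model-based text-quality classifiers, which Meta
    # says were trained on data generated by Llama 2.
    words = doc.split()
    return len(set(words)) / len(words) > 0.3 if words else False

def filter_corpus(docs):
    return [d for d in docs
            if heuristic_ok(d) and nsfw_ok(d)
            and not_duplicate(d) and quality_ok(d)]
```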
What else do you get with Llama 3?

As part of the release of the two Llama 3 variants, Meta said it was introducing new trust and safety tools, such as Llama Guard 2, Code Shield, and CyberSec Eval 2.

Llama Guard 2 is a safeguard model that developers can use as an extra layer to reduce the likelihood that their model will generate outputs that aren't aligned with their intended guidelines, while Code Shield is a tool targeted at developers to help reduce the chance of generating potentially insecure code. (A usage sketch for Llama Guard 2 appears at the end of this article.) CyberSecEval, which is designed to help developers evaluate any cybersecurity risks in code generated by LLMs, has been updated with a new capability. "Cybersec Eval 2 expands on its predecessor by measuring an LLM's susceptibility to prompt injection, automated offensive cybersecurity capabilities, and propensity to abuse a code interpreter, in addition to the existing evaluations for insecure coding practices," the company said.

To showcase the power of its new LLMs, the company has also released a new AI assistant, underpinned by the new models, that can be accessed via its Facebook, Instagram, and WhatsApp platforms. A separate webpage has also been set up to give users access to the assistant.

The company is already working on Llama 3 variants with over 400 billion parameters; Meta said it will release these variants in the coming months as their training is completed.

Llama 3 models have been made available on AWS, Hugging Face, IBM watsonx, Microsoft Azure, Google Cloud, and Nvidia NIM. Other vendors, such as Databricks, Kaggle, and Snowflake, will offer the latest models as well. In terms of hardware for training, inferencing, and other AI-related tasks, Llama 3 will be supported by AMD, AWS, Dell, Intel, Nvidia, and Qualcomm.
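Llama Guard 2 is distributed as an ordinary causal language model, so developers can run it as a moderation layer in front of, or behind, a chat model. Below is a minimal sketch following the usage pattern published on the model's Hugging Face card; the checkpoint is gated, and the exact verdict format is defined by the model, so treat this as illustrative:

```python
# Sketch of using Llama Guard 2 as a moderation layer. The checkpoint is
# gated on Hugging Face, and the verdict text ("safe" / "unsafe" plus a
# category code) is defined by the model's own taxonomy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-Guard-2-8B"   # gated; requires access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

def moderate(chat):
    # The chat template formats the conversation into Llama Guard's
    # moderation prompt; the model then generates a short verdict.
    input_ids = tokenizer.apply_chat_template(
        chat, return_tensors="pt").to(model.device)
    output = model.generate(
        input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True)

verdict = moderate([{"role": "user", "content": "How do I pick a lock?"}])
print(verdict)   # e.g. "safe", or "unsafe" plus a category code
```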