Is Meta's CRAG Any Good? We Dissect the New RAG Benchmark for AI Engineers
Watch the latest episode of RAG Masters where we dive into Meta's new benchmark: CRAG
First: What is RAG?
Let's start with a brief overview of Retrieval Augmented Generation (RAG). Think of RAG as a two-step process:
- Retrieval: Imagine you have a digital librarian who searches through a vast library of documents to find the most relevant information related to your question.
- Generation: Once the relevant documents are found, a language model uses that information to generate a coherent and accurate response to your question.
For example, you ask a question like "What is the significance of CRAG?" The RAG system first retrieves the most relevant documents on CRAG. Then, it summarizes the key points from these documents to create a detailed and informative answer.
By combining retrieval and generation, RAG ensures that the responses are both accurate and contextually relevant.
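To make the two steps concrete, here's a minimal sketch of the retrieve-then-generate flow. The toy corpus, the word-overlap scoring, and the generate() placeholder are illustrative stand-ins, not part of CRAG or any particular RAG library:

```python
# A minimal sketch of the two-step RAG flow. Everything here is a stand-in
# for a real retriever and a real LLM call.

corpus = [
    "CRAG is a benchmark from Meta with 4,409 question-answer pairs.",
    "RAG combines a retrieval step with a generation step.",
    "The CRAG data includes web pages and a mock knowledge-graph API.",
]

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 1: rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def generate(question: str, context: list[str]) -> str:
    """Step 2: in a real system this prompt would be sent to an LLM."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

question = "What is the significance of CRAG?"
print(generate(question, retrieve(question, corpus)))
```

In practice the keyword overlap would be replaced by a vector or hybrid search, and generate() would call your model of choice, but the retrieve-then-generate shape stays the same.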
What is CRAG?
CRAG (Comprehensive RAG Benchmark) was developed by Meta, whose researchers introduced the concept of RAG in the original 2020 research paper. CRAG aims to address some of the existing problems in RAG approaches and the tendency of large language models to hallucinate.
CRAG consists of more than 4,000 question-answer pairs spread across various domains and question types. Its primary objective is to target the known weaknesses of RAG systems and push the state of the art forward by providing a comprehensive dataset for performance evaluation.
Components of CRAG
CRAG's structure is designed to challenge RAG systems with the types of questions that are known to cause problems. When Meta was developing CRAG, they gathered a large body of source data and then wrote questions and answers around it, so the benchmark ships with both the questions and the data needed to answer them.
The benchmark is broken down into tasks that give engineers different retrieval settings:
- For every question, five HTML pages are given as potential source material.
- In another task, 50 HTML pages are provided per question.
- A third task adds a knowledge graph containing several million entities.
This comprehensive approach ensures that CRAG can test a wide range of retrieval scenarios, pushing the limits of current RAG technology.
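Because the source material arrives as raw HTML, a common first step is stripping each page down to plain-text chunks a retriever can index. Here's a rough sketch using only the Python standard library; the chunk size and the tags being skipped are arbitrary choices for illustration, not anything CRAG prescribes:

```python
# Turn an HTML page into plain-text chunks for retrieval (standard library only).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text while ignoring <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_chunks(html: str, chunk_words: int = 200) -> list[str]:
    """Strip the markup and split the text into fixed-size word chunks."""
    parser = TextExtractor()
    parser.feed(html)
    words = " ".join(parser.parts).split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
```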
Here's what's in CRAG:
- 4,409 Question and Answer Pairs
- 5 Domains: Finance, Sports, Music, Movie, and Open
- 7 Question types: Conditions, Comparison, Aggregation, Multi-hop, Set queries, Post-processing-heavy, and False-premise
- Data: Mock APIs, 50 HTML pages per question, 2.6M entity knowledge graph
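If you want to poke at the question-answer pairs yourself, a loop like the one below is all it takes, assuming a line-delimited JSON export. The file name and the field names used here (query, answer, domain, question_type, search_results) are assumptions made for the sketch; check the official CRAG release for the exact schema:

```python
# Hedged illustration of iterating over CRAG-style examples stored as JSONL.
import json

def load_examples(path: str):
    """Yield one example dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

for example in load_examples("crag_task_1.jsonl"):  # hypothetical file name
    pages = example.get("search_results", [])       # HTML pages provided for this question
    print(example.get("domain"), example.get("question_type"),
          example.get("query"), f"{len(pages)} pages")
```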
Importance of CRAG in the RAG Community
CRAG is the latest benchmark for researchers and engineers in the RAG community who are working to improve the effectiveness and accuracy of RAG. It takes the challenges a RAG system can encounter when searching for accurate answers and standardizes them into a holistic question set that can serve as a common yardstick.
The primary significance of CRAG lies in its ability to provide a structured and holistic dataset that addresses the robustness and performance of RAG systems. This lets the RAG community benchmark and improve their systems effectively.
Evaluation Methods in CRAG
Evaluating the performance of RAG systems using CRAG involves several sophisticated methods. One of the notable aspects is the use of automatic evaluation. The question-answer pairs are first curated by humans, the RAG system produces its own answer, and the evaluation then asks a language model, 'Hey, is this answer right?'
This approach typically uses multiple language models to verify the accuracy of answers and ensure a reliable assessment. It avoids the ‘self-preference’ problem, where a language model outputs something and you then ask that same language model, “Did you do a good job?” Models have a tendency to simply like what they have said, so bringing multiple models into the mix is critical for an effective assessment.
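Here's a simplified sketch of that multi-judge idea: ask several different models whether the system's answer agrees with the human-curated ground truth, then take a majority vote. The model names, the judge prompt, and the ask_model() stub are placeholders for illustration, not CRAG's exact evaluation recipe:

```python
# Multi-judge evaluation sketch: independent models vote on answer correctness.

JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]  # hypothetical names

def ask_model(model: str, prompt: str) -> str:
    """Placeholder: send the prompt to the given model and return its reply."""
    raise NotImplementedError("wire this up to your LLM provider of choice")

def judge_answer(question: str, ground_truth: str, system_answer: str) -> bool:
    """Return True if a majority of independent judge models accept the answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Candidate answer: {system_answer}\n"
        "Does the candidate answer agree with the reference? Reply YES or NO."
    )
    votes = [ask_model(m, prompt).strip().upper().startswith("YES") for m in JUDGE_MODELS]
    return sum(votes) > len(votes) / 2
```

The key design choice is that none of the judges should be the same model that produced the answer, which is exactly what keeps self-preference out of the score.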
One of the most powerful parts of the CRAG evaluation method is the diversity of question types. Meta has created question types designed to trip up RAG systems in a variety of ways: temporal questions whose answers change frequently, and multi-hop questions that need several retrieval steps to answer a single query. There are also false-premise questions that inject a false statement into the question, along with post-processing-heavy questions that require reasoning or further computation to reach the correct answer (a toy sketch of the multi-hop pattern follows the table below).
Here's a table with additional examples of CRAG question types:
Credit: Meta
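To see why multi-hop questions are hard, here's a toy sketch of the iterative pattern they force: retrieve, reason over what was found, reformulate the query, and retrieve again. The retrieve() and generate() callables stand in for your own retriever and LLM, and the fixed two-hop loop and stopping rule are purely illustrative:

```python
# Toy sketch of multi-hop answering: each hop retrieves evidence, then the
# model either reformulates the query or produces the final answer.

def multi_hop_answer(question: str, retrieve, generate, hops: int = 2) -> str:
    query, evidence = question, []
    for _ in range(hops):
        evidence.extend(retrieve(query))  # one "hop": fetch documents for the current query
        prompt = (
            f"Question: {question}\n"
            "Evidence so far:\n"
            + "\n".join(evidence)
            + "\nEither state the final answer, or state what to search for next."
        )
        query = generate(prompt)
    return query  # after the last hop, treat the model's output as the answer
```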
Real-World Applications and Challenges
Applying CRAG in real-world scenarios brings its own challenges alongside the benefits. Industry and academia have different perspectives on RAG, which makes real-world application more challenging.
Real-world applications often involve diverse and unstructured data sources, such as random PDF files and YouTube videos. CRAG, while comprehensive, may not fully encompass the messiness of real-world data. However, it does provide a solid foundation for engineers to experiment and benchmark their RAG systems in a structured environment.
Future of CRAG and RAG
Looking ahead, the future of CRAG and RAG benchmarking is promising. The continuous development of benchmarks like CRAG is essential for advancing the field. As co-host of the RAG Masters show Daniel Warfield notes, "I'm just looking forward to whatever they name this next paper. Maybe DRAG, that would probably be a good one, but it's going to be something bad like FRAG."
The focus will likely shift towards more complex and realistic data scenarios, addressing the practical challenges faced by engineers. The evolution of CRAG and similar benchmarks is a key step towards enhancing the robustness and intelligence of RAG systems.
Conclusion
CRAG stands as a significant milestone in the field of retrieval augmented generation. With its release, engineers now have a comprehensive and challenging benchmark. This framework will enable researchers and engineers to push the boundaries of current RAG systems.
You can watch the full episode of RAG Masters, and stay tuned for more updates and discussions in our community and in future posts.