
How to Evaluate RAG? Follow Our 7 Simple Steps

April 12, 2025
Neil Katz
Co-founder, COO

RAG evaluation is still hard. But follow our 7 steps and you can get high-quality evaluations in just a few days.

The bad news first. There’s no easy button for evaluating RAG (retrieval-augmented generation) systems. The public data sets aren’t great, preparing your own data requires subject matter expertise, and the auto-evaluation platforms that promise to automate much of the work don’t really deliver, at least not yet.

The good news is over the last three years of building, testing and deploying RAG with firms like Air France, Dartmouth, Samsung and many more, our team at EyeLevel.ai has learned a few things.

The following is how we tackle the evaluation problem with our customers. It’s a practical approach that is easy to understand and implement, but it does require some human effort. It’s our view that LLMs can’t be trusted to grade their own homework, so humans still have a real role to play.

Our approach is by no means the only one. If you have great tips and tricks, please leave them in the comments. We’re always learning too. 

Here’s our simple framework:

  • Curate documents
  • Generate ground truth QA pairs
  • Run the questions against your RAG
  • Evaluate the answers

However, as you’ll see, there are plenty of roadblocks if you don’t know how to swerve.

Lesson 1: Eval platforms don’t really work… yet

There are dozens of vendors with eval platforms. Many are great products, but in our view, they don’t actually solve the core challenges of RAG evaluation: 

  • building a balanced document set to test on
  • creating SME-approved QA pairs as ground truth
  • accurately grading RAG responses
  • and uncovering why a certain answer is wrong

RAG eval platforms do automate some things, like firing hundreds of questions into a RAG, extracting their answers and providing a nice GUI. 

Most RAG eval platforms also attempt to automate grading of RAG responses, but we find that these systems are usually off by 15-20%, creating both false negatives and positives. If directional results are all you need, then these solutions are a great fit. If you want higher accuracy, they aren’t there yet. 

Lesson 2: No great public RAG data sets

Currently, no robust public datasets thoroughly test multimodal RAG capabilities, especially ingestion, search, and retrieval across text, tabular data, and graphical content. That surprised us. It’s time-consuming, but not overly difficult, to build such a dataset. At the moment researchers seem to be more focused on testing LLMs than RAG.

Take, for example, this performance chart from a RAG platform company (not us). 

A RAG platform (not ours) advertises how it performs on benchmarks not built for RAG.

You can see they’ve measured their performance against what seems like an impressive array of open source benchmarks. The problem is these data sets don’t really test the two core functions of RAG: ingest and search.

Why not?  

Problem one is that the source documents aren’t actually documents. They are text blocks extracted by humans, mostly from Wikipedia pages and public sites, but also in a few cases from PDFs. The issue here is that humans rather than software have extracted the important text. In the real world, your RAG needs to ingest and understand a wide variety of complex multimodal documents. It’s rare that a RAG is built on millions of pages of simple text. Often the answers are contained in tables, graphics, forms and oddly placed text blocks.

Problem two is the source material itself: mostly Google searches, wiki pages, IMDb, trivia pages and other very public sites. All the LLMs have already trained on this data, so you’re missing the second part of RAG, which is search. It’s likely an LLM can answer questions about these pages without the RAG at all.

Lesson 3: Making RAG data sets is hard, so we made some you can use

To solve the problems above, we chose to make data sets to test our own RAG platform and the systems we deploy for customers.

We built three open source data sets: 

Deloitte 1K is roughly 100 QA pairs on 1,000 pages of complex visual documents from Deloitte.  The documents were downloaded from their public website and are filled with tables, graphics and text with complex formatting. The QA pairs were created by humans who attempted to ask questions that might happen in real life. They used the source materials to create the answers. To be clear, Deloitte itself is not involved in the test.

Several pages from our Deloitte 1K data set

Deloitte 100K contains the same 100 QA pairs and 1,000 Deloitte pages as Deloitte 1K, but now adds 999,000 pages of decoy data. The purpose is to test your RAG search. It sounds counterintuitive but RAG search performance generally goes down as more data is added.  If you’re searching for a needle in a haystack, adding 100X more hay makes it harder. But it’s also closer to a real production RAG which often contains hundreds of thousands or millions of pages.

The last data set, X-Rays, is perhaps the most interesting. On EyeLevel’s website, we have a public ingest tool where potential customers can, without an account, upload a small PDF or JPG and instantly see how our GroundX RAG platform parses the document and turns it into LLM-ready data. We call those outputs X-Rays, because they let you “see” inside the file the way our vision model does.

EyeLevel’s X-Ray of a medical bill (with faked data). The colored boxes show where our vision model identified text, tabular and graphical objects. 

You can try it here if you like.

Some of these images are brutal. People try to stump us daily: a photograph of a camera box in a store, a NASA space station module design, a screenshot from an Adobe instruction manual. Of course, we also get the usual suspects like financial documents, insurance forms, annual reports and so on.

EyeLevel’s X-Ray of a NASA space station plan. The colored boxes show where our vision model identified text, tabular and graphical objects. 

In this X-Ray data set, we’ve filtered out anything private, especially anything with personal information.  And a human team has written QA pairs for 100 of these images.

All three of these data sets are open source. Feel free to test your RAGs with them and please let us know how you do. For the X-Rays, please reach out to us directly at info@eyelevel.ai. We don’t want to put them on the public web and have the LLMs scrape them.

Lesson 4: How to roll your own data set

Whether you use our data sets or not, we always recommend that you eventually build your own data sets based on your real documents.

To do this we recommend following a few simple rules.

First, start small. No more than 100 pages of content. You want something you can get your head around easily. Often, you don’t need more than 100 well-curated pages to find the blind spots in your RAG. You can graduate to bigger sets as you go.

Second, bring the SMEs in early.  Curating documents and generating QA pairs takes real time. Your SMEs are likely busy with their day jobs. You need to get their buy-in early to help you do the work.

Third, make sure the documents are diverse both in file type and content. It’s particularly important to include documents with the kinds of tables and graphics your RAG will face in the real world.
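
As a quick sanity check on that mix, something like the tiny sketch below can help. The manifest format, content labels and file names here are our own invention for illustration, not anything your RAG requires.

```python
from collections import Counter

# Hand-maintained manifest of the eval documents: (file name, dominant content type).
# The labels are our own convention; use whatever taxonomy fits your documents.
manifest = [
    ("q3_financials.pdf", "tabular"),
    ("org_chart.pdf", "graphical"),
    ("policy_overview.docx", "textual"),
]

counts = Counter(content_type for _, content_type in manifest)
total = sum(counts.values())
for content_type, count in counts.most_common():
    print(f"{content_type}: {count} docs ({count / total:.0%})")
```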

Lesson 5: How to make good ground truth

Ground truth, in this context, is simply a collection of QA (question/answer) pairs that can be used to verify that a RAG has searched correctly and that an LLM has answered the question correctly.
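
To make that concrete, here is a minimal sketch of what a ground truth file can look like. The column names, the category tags and the example QA pair are our own convention, invented for illustration, not a required format.

```python
import csv

# Each entry: the question, the SME-approved answer, where the answer lives in the
# source documents, and what kind of content it tests (textual | tabular | graphical).
ground_truth = [
    {
        "question": "What was total operating revenue in FY2023?",
        "expected_answer": "$4.2 million",
        "source": "annual_report.pdf, p. 12 (table)",
        "category": "tabular",
    },
    # ...more QA pairs, written against the source documents and reviewed by SMEs
]

with open("ground_truth.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "expected_answer", "source", "category"])
    writer.writeheader()
    writer.writerows(ground_truth)
```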

Again we follow simple practical rules.

First, your SMEs are your friends. Be nice to them. This is annoying homework they won’t want to do.

Second, like with the documents, start small. At first, we recommend no more than 30 questions for your 100 pages. You can always grow. But start with something easily doable.

Third, evenly balance your questions: a third textual, a third tabular, a third graphical.  This really helps you identify problems. Most RAGs do pretty well with text. It’s tables and especially graphical information where RAGs go to die. 

Fourth, start with questions that have clear, simple, verifiable answers. You can work your way up to complex conceptual questions, but trust us, at first you want things that are fast to evaluate and that you can score as concretely correct or incorrect.

It’s also important to test the prompted version of each question so that it produces short, simple outputs, not a paragraph of LLM noise.
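
One simple way to do that is to wrap every question in a template that demands a short, direct answer, as in the sketch below. The exact wording is just an example; tune it to your own stack.

```python
SHORT_ANSWER_TEMPLATE = (
    "Answer the question using only the retrieved context. "
    "Reply with the answer itself: one short sentence or a single value. "
    "Do not add explanations, caveats or pleasantries.\n\n"
    "Question: {question}"
)

def to_prompt(question: str) -> str:
    # Wrap a ground truth question in the short-answer template before sending it to the RAG.
    return SHORT_ANSWER_TEMPLATE.format(question=question)
```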

Running the RAG is the easiest part

Ironically, firing your questions into the RAG and getting the answers back is the easiest part of the process.  The eval platforms typically automate this for you and most have nice GUIs to display the results. 

On the other hand, it typically takes just a few lines of Python to ingest a CSV of questions, fire them into a RAG and spit back a CSV with answers.
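
Here is roughly what such a script looks like. The endpoint URL and request/response shape are placeholders, not a real API; swap in whatever client or SDK your RAG actually exposes. It assumes the ground_truth.csv file sketched earlier.

```python
import csv
import requests

RAG_URL = "https://your-rag-endpoint.example/query"  # placeholder; use your RAG's real endpoint

def ask_rag(question: str) -> str:
    # Placeholder request/response shape; adjust to match your RAG's actual API.
    resp = requests.post(RAG_URL, json={"query": question}, timeout=60)
    resp.raise_for_status()
    return resp.json().get("answer", "")

with open("ground_truth.csv", newline="") as f_in, open("rag_answers.csv", "w", newline="") as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames + ["rag_answer"])
    writer.writeheader()
    for row in reader:
        row["rag_answer"] = ask_rag(row["question"])
        writer.writerow(row)
```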

Either way is fine.

Lesson 6: How to score results

Most eval platforms offer some type of automated evaluation.  This saves time, but we generally prefer humans in the loop. We’ve found LLM evaluation to be roughly 15%-20% off base, creating both false positives and negatives.  This is the homework grading problem again, though keeping the answers very simple and short lessens the issue.

We also recommend, at least at first, a black-and-white evaluation: the answer is right or wrong. If you must, grade it from 1 to 5. There are many fancy rubrics out there that try to score completeness, faithfulness and noise sensitivity. These are fine, and perhaps useful down the road, but in the early days you want people to actually score the results and produce clear output. Keep it simple.
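
Once the human graders have marked each answer, tallying the results takes only a few more lines. This sketch assumes the graded file adds a correct column (1 or 0) to the answers CSV from the previous step; the column names are our own convention.

```python
import csv
from collections import defaultdict

# Tally human grades overall and per question category (textual / tabular / graphical).
totals, hits = defaultdict(int), defaultdict(int)

with open("graded_answers.csv", newline="") as f:
    for row in csv.DictReader(f):
        for key in (row["category"], "overall"):
            totals[key] += 1
            hits[key] += int(row["correct"].strip() == "1")

for key, total in totals.items():
    print(f"{key}: {hits[key]}/{total} ({hits[key] / total:.0%})")
```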

And last, instruct evaluators to ignore style completely. We’ve seen many tests where evaluators, especially from customers, will get hung up on the answer being too short or long or wordy or not polite enough. These things are all very easy to fix with prompting and don’t have much to do with the accuracy of your RAG.  Get everyone on the same page to ignore style completely. 

 

Lesson 7: Scale up

Once you are in a good place with your 30-question, 100-page test, ramp it up. We usually jump to 100 questions and 1,000 pages.

Once that is working well, we do a scale test, adding a large number of decoy pages that make it harder for RAG search to find the needles in the haystack. You’ll be surprised how quickly RAG search accuracy can decline.

We’ve seen popular vector databases lose accuracy at just 10K pages of content. Basically, if there are too many chunks that are similar, vector similarity can struggle. Advanced filtering or search techniques are needed, which we’ll save for another day.
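
Mechanically, the scale test is just the same question set re-run and re-graded at each corpus size, then compared. A sketch, assuming you save one graded CSV per corpus size (the file names below are made up):

```python
import csv

# One graded results file per corpus size, produced by re-running the same
# ground truth questions after each batch of decoy pages is added.
runs = {
    "1K pages": "graded_1k.csv",
    "10K pages": "graded_10k.csv",
    "100K pages": "graded_100k.csv",
}

for label, path in runs.items():
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    accuracy = sum(r["correct"].strip() == "1" for r in rows) / len(rows)
    print(f"{label}: {accuracy:.0%} of answers correct")
```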

Talk to us

As I said at the top, this is our approach. It’s by no means the only path. We’d love to hear tips and tricks from everyone else. Let’s build (and test) together.
