As developers converge around two core paradigms for LLM applications — RAG and Agents — the need to evaluate their performance has become a major hurdle to launching production-ready products.
YouTube, Reddit, and X are filled with cool tutorials on how to build. Far less has been written on how to evaluate. We work to rectify that here, with the first of several pieces on the topic.
Let's start with some definitions:
- Retrieval Augmented Generation (RAG): A workflow that involves looking up information that is relevant to a user's query, and using that information in conjunction with the query itself to construct an augmented prompt. This prompt is passed to an LLM, generating a response to the user.
- Agents: An application abstraction around language models that allows LLMs to execute tools and make decisions based on the result of that tool execution.
Both of these approaches rely on a fundamental characteristic of LLMs: they’re “in-context learners”, meaning that if you give an LLM information in a prompt, it can use that information to answer a query.

One of the most common manifestations of contextual information is documents: PDFs, PowerPoint decks, textual transcripts. Documents contain rich information that can contextualize and sometimes outright answer a user's query. The idea is that RAG and Agentic systems can leverage this information to answer queries with a higher degree of effectiveness.
In theory.
In practice, it can be difficult to comprehensively test AI systems that interface with the information in documents. This can lead to (potentially catastrophic) surprises as AI practitioners push their products off of the bench and into the real world.
In this post, we’ll discuss the major steps that RAG and Agentic systems employ to add context to LLM applications, how those steps can fail, and how we can test contextual AI systems to make sure they don’t fail in production.
Unlike traditional software systems, it can be very hard to test document contextualized systems. As a result, most developers simply don’t, which is insane.
— From later in the article
Testing is hard, but the sooner you make it a core piece of your development workflow, the better. You’ll become better at testing, and your products will get better faster.
— From later in the article
The Failure Points of Document Contextualized AI
For our purposes, “Document Contextualized AI” systems use the information from documents (like PDFs, slideshows, and websites) to give context to a language model so that the language model can do its job better. Imagine if you collected a set of documents about a restaurant: the menu, seasonal events, the food, allergy information, etc. The idea of document contextualization is to somehow give that information to a language model so it can talk with customers about your restaurant based on real information.

There are a variety of tools, technologies, and approaches that can be used to build document-contextualized AI systems. As a result, an ideal system for answering questions based on corporate documents might look very different than an ideal system designed to generate code within a codebase. Still, though, there are some fundamental operations that most document contextualized AI systems employ:

We can find this general workflow in two of the most prolific document contextualization approaches, “RAG” and “Agents”:
- In a RAG context, every time we get a query, we retrieve relevant data from a datastore based on the user’s query and use that data to construct an augmented prompt. We use that augmented prompt to generate an answer via an LLM.
- In an Agentic context, we frame querying the data store as a tool that the agent can choose to execute. If the agent does choose to execute that tool, we use the results to construct an augmented prompt and feed that prompt to the language model for generation, which influences the agent’s further actions. Often “RAG” can be mapped to a tool that an agent can use; thus testing Agentic systems is fundamentally similar to testing RAG systems, the only difference being the complexity of the system under test (see the sketch below).
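To make the shared core concrete, here’s a minimal sketch in Python using the OpenAI SDK. The toy corpus, the keyword-overlap retriever, and the model name are illustrative assumptions, not a recommendation or EyeLevel’s implementation; the point is only that RAG and agents wrap the same retrieve, augment, generate loop.

```python
# A minimal "retrieve -> augment -> generate" loop. The corpus and retriever
# are toys; swap in your own parsing, chunking, and search.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DOCS = [
    "Our restaurant's menu includes a seasonal mushroom risotto.",
    "The dining room is open Tuesday through Sunday, 5pm to 10pm.",
    "Allergy note: the risotto is prepared with dairy and may contain nuts.",
]

def retrieve_chunks(query: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank chunks by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(q & set(d.lower().split())))[:k]

def rag_answer(query: str) -> str:
    # RAG: retrieval happens on every query, unconditionally.
    context = "\n---\n".join(retrieve_chunks(query))
    augmented_prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    return resp.choices[0].message.content

print(rag_answer("Does anything on the menu contain nuts?"))
```

In an agentic setup, `retrieve_chunks` would instead be registered as a tool the model can decide to call, rather than being invoked on every query.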
Both of these approaches have a lot of promise, but they also have a lot of technical risk that needs to be addressed before you build a product and put it out into the world. Just like any traditional software system, document contextualized AI systems can be prone to bugs and errors which can result in a poor user experience.
Unlike traditional software systems, though, it can be very hard to test document contextualized systems. As a result, most developers simply don't, which is insane.
To really grasp the gravity of the problem we’re dealing with, let’s look through each of these fundamental operations to describe both why they’re necessary, and how they can fail. We’ll start by exploring parsing, and then explore chunking, searching, and completion.
Parsing

Parsing is the fundamental first step in building most document-contextualized AI systems, and it’s vital because documents often contain rich spatial and visual information that language models can’t consume directly.

If you add a PDF to ChatGPT, for instance, you’ll notice that there’s a progress bar. This progress bar (probably) represents a parsing process that converts the content of the document into a representation the language model can understand.

When you ask the language model about content in the paper, you’ll get back a response not based on the paper itself, but based on the output from the parser. Depending on what the parser chooses to extract, and how that information is extracted, certain queries may or may not be answerable by the LLM.
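As a baseline, here’s what the simplest form of parsing looks like in practice: plain text extraction with the pypdf library. This is a deliberately naive sketch, not a recommendation; it illustrates how much layout and visual information such a parser silently discards.

```python
# Naive parsing baseline: extract raw text from a PDF page by page with pypdf.
# Tables, figures, and spatial layout are largely lost in this representation,
# which is exactly the kind of silent failure described above.
from pypdf import PdfReader

def naive_parse(pdf_path: str) -> list[str]:
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]

# pages = naive_parse("report.pdf")  # one plain-text string per page
```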


There are a variety of parsers available that prioritize different types of documents and work best in different use cases. At EyeLevel.ai, we built a parser that’s particularly good at dealing with complex PDF documents. We went through the trouble of building our own custom parser because we couldn’t find an off-the-shelf parser that did the job.

We have a chat interface in our dashboard that’s powered by this data, and it results in the correct output.

Of course, we don’t know what’s going on behind the scenes with ChatGPT. It’s possible the information in the document was extracted correctly and the LLM dropped the ball, but in our experience, parsing is typically the source of failure for these types of questions.
I filmed a YouTube video a while ago (when I was uglier and had long hair) that was specifically about parsing technologies. In preparing for that video my co-host asked a bunch of developers which parser they used for RAG-based workflows, and we got a lot of different answers.

Interestingly, few developers could justify their choice of parser based on testing data. This isn’t a surprise, as testing these types of systems can be difficult. We’ll get into some approaches for testing parsing systems later in the article; for now, let’s discuss another failure mode in document interfacing AI.
Chunking

Modern AI models can accept a long, but still finite, amount of information in a given prompt. Many documents, or a single large document, might not fit within the context window of an LLM regardless of how perfectly it’s parsed.

As a result, when working with a document contextualized AI system, it’s common to divide documents into sections called “chunks”. When providing context to the model, we’ll ultimately be using these chunks, rather than the full content parsed from a document.

There are a ton of ways to do this in practice. LlamaIndex, for instance, has 17 unique chunking strategies to choose from (source). Some approaches simply divide a document based on word count, while others employ AI models to aggregate semantically similar sections of the document. Choosing the right chunking strategy is important, because if you chunk poorly, the chunks you hand to an LLM may provide very little useful context at all.
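For illustration, here’s a minimal sketch of the simplest family of strategies: fixed-size chunking by word count with a small overlap. The chunk size and overlap values are arbitrary assumptions, and this is not how LlamaIndex or EyeLevel implement chunking.

```python
# Fixed-size chunking with overlap: the simplest strategy, shown only to make
# the mechanics concrete. Real systems often use structure- or semantics-aware
# splitting instead.
def chunk_by_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# A 1,000-word parsed document yields six overlapping ~200-word chunks here.
```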

In the future, I’ll do a deep dive into how we’ve chosen to approach chunking at EyeLevel. For now, though, let’s keep moving through some other failure modes of RAG.
Search
Chunks ultimately need to be stored in some queryable representation such that we can search for the chunks that are relevant to the user's query.

In my time communicating with a variety of enterprise clients that employ RAG and agentic systems in actual products, I’ve seen three major issues that cause the lion's share of problems at this stage of the pipeline.
- Inability to Refine
- Fragility at scale
- Unfairly difficult queries relative to the search technology
Most Retrieval Augmented Generation (RAG) systems, and agents that employ them as tools, rely on an embedding model which takes each chunk and encodes it as a vector in some high dimensional vector space. When a user asks a question, the same model is used to encode the question in the same space. The encoder exists to place similar content in similar locations, so searching for chunks relevant to a user query becomes an exercise in searching for nearby vectors within this high-dimensional space.
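Here’s a minimal sketch of that approach, assuming the OpenAI embeddings API and cosine similarity over an in-memory array; the model name is illustrative, and a production system would use a vector database rather than brute-force numpy.

```python
# Embedding-based retrieval: embed the chunks once, embed each query, and
# return the nearest chunks by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    q = embed([query])[0]
    # cosine similarity = dot product divided by the vector norms
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

# chunks = chunk_by_words(parsed_text)   # from the chunking sketch above
# chunk_vecs = embed(chunks)
# print(top_k_chunks("What dishes contain nuts?", chunks, chunk_vecs))
```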

This approach is common because it can be incredibly effective and can be set up with only a few lines of code. However, there’s a major practical issue: you’re using an AI model to tell you which chunks are relevant to a user’s query. Embedding models aren’t magic; they organize data based on the way they were trained. If your use case perfectly aligns with the inductive biases of that model, fantastic, but its black-box nature can make incremental refinement of search difficult.
Another common problem of many search strategies is fragility at scale. Naturally, when you use a free-text query to look up a particular document, you become more and more likely to make a mistake as the number of documents you’re searching through goes up. This is the case regardless of the search approach you choose. That said, while testing RAG systems at EyeLevel we’ve found that embedding-based search strategies can degrade in performance quickly as the number of irrelevant documents increases. While this is speculative on my part, I believe it’s because the loose nature of embedding-based search pollutes the high dimensional embedding space as the number of documents rises.

Another problem with search is unfair questions. Imagine you gave a RAG system the query “What’s the price of the newest house in my portfolio?”, and imagine if your dataset of documents looked like this:

In a single question, a RAG system would need to:
- find your portfolio
- find the newest house
- look up that house
- then find the price
If you’re building a simple RAG system that only searches for documents based on the user’s query, there’s no way for it to know which house it has to look up based on the query itself. Agents are typically better at these types of questions, especially if they’re deliberately designed to solve this problem, but multi-hop questions remain challenging.
Another difficult question for many RAG and agentic systems is aggregation. The fundamental idea of most contextual search systems in AI is to find chunks of documents that are relevant to a user's query. This is great for needle-in-a-haystack questions or general summarization, but if you need to exhaustively list out entities throughout documents, this strategy can be ineffective.
Often, RAG and Agentic systems employ a hybrid search strategy, where semantic-based search is complemented with some other form of search. These can help to bolster performance on question types that do not align nicely with the classic document contextualization pipeline.
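As a sketch of what hybrid search can look like, the snippet below blends the embedding similarity from the earlier sketch with a crude lexical overlap score standing in for BM25 or a metadata filter; the 0.7/0.3 weights are arbitrary assumptions that would need tuning against a test set.

```python
# Hybrid ranking: blend semantic similarity with a simple lexical score.
# The lexical scorer here is a toy stand-in for BM25 or keyword filters.
import numpy as np

def lexical_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def hybrid_rank(query: str, chunks: list[str], chunk_vecs: np.ndarray,
                q_vec: np.ndarray, k: int = 5) -> list[str]:
    sem = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
    lex = np.array([lexical_score(query, c) for c in chunks])
    combined = 0.7 * sem + 0.3 * lex  # arbitrary weights; tune on your data
    return [chunks[i] for i in np.argsort(-combined)[:k]]
```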

So, to wrap that up in a neat little bow, search is hard. “Just use text-embedding-3” might be enough to get you to where you need to go, or it might not.
Let’s cover one more general failure mode, and then we’ll get into how testing is actually done.
Prompting and Completion

The last failure mode gets the most attention and is usually the least important.
Most modern LLMs, even inexpensive low-parameter ones, are capable of answering a user's query if sufficient context is clearly defined. This was actually the main point of the original RAG paper: not to introduce new information to an LLM, but to improve the performance of LLMs by injecting relevant training data into the prompt itself.

Or, in other words, a properly created RAG system will make LLMs better, not worse. There’s typically not a dire necessity to fiddle with prompting strategies and experiment with LLM completion models. If you have a well-constructed prompt with good contextual information, most LLMs will knock the task out of the park.
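For reference, a “well-constructed prompt” here doesn’t need to be exotic. Something like the template below, whose wording is purely illustrative, is usually enough when the retrieved context is actually relevant.

```python
# An illustrative augmented-prompt template. The wording is an example, not a
# recommendation; the heavy lifting should already be done by retrieval.
ANSWER_PROMPT = """You are answering questions for a user based on retrieved context.
Use only the context below. If the context does not contain the answer, say so.

Context:
{context}

Question: {question}
Answer:"""

prompt = ANSWER_PROMPT.format(
    context="\n---\n".join(["<retrieved chunk 1>", "<retrieved chunk 2>"]),
    question="What allergens are in the risotto?",
)
```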
The issue is that most people spend most of their time adjusting this step. I suppose it’s because prompt engineering is immediately accessible, and many people have a better grasp of LLMs than embeddings and thus feel more capable of experimenting with their choice of completion model and how that model is prompted. Regardless of the reason, though, this phase of a document contextualized AI system is the most common point of focus for companies working with these types of systems, and spending most of your time on it is usually a waste.
If the context you’re feeding to the model is fundamentally flawed, no amount of chain-of-thought, agentic scaffolding, structured output, or otherwise fancy prompting strategies is going to fix it.

Still, though, it is worth some attention as it is a failure mode. We’ll discuss how to test this and all the other failure modes in a bit, but for now let’s discuss what datasets are available to test on, and how testing might be done in general.
A Spread of Datasets
Testing AI from an academic perspective is well-defined. Academics generally agree on a certain class of problem which is on the cutting edge of the industry, build datasets around those problems, and then compare approaches based on the performance of AI systems on those datasets.

The issue is that these datasets are meant to test general problems, not specific product problems. If you want to build an AI agent that needs to answer veterinary questions based on a dataset of conversations between vets and pet owners, for instance, you’re going to be hard-pressed to find a perfect dataset.
However, datasets still provide tremendous value to practitioners, simply because they exist. Building benchmarks like these can be difficult, expensive, and time-consuming, so it’s a mistake not to leverage them to better understand the performance of an AI system, if not specifically then at least in a general sense.
These datasets approach the general idea of document contextualized AI in a broad diversity of ways. Some take a very specific approach, focusing on individual components of the greater document-contextualized workflow. Others take a higher-level approach and simply validate the answers a system generates based on a question and a set of documents.

Let’s go through a few popular benchmarks in academia that are relevant to the document contextualized pipeline we’ve discussed. In doing so, we’ll explore how we can use academic benchmarks to test AI products, and we’ll also learn about benchmarks in general so we can learn about how to develop our own.
Relevant Datasets
In no particular order
Dataset 1) DocVQA

DocVQA is a collection of datasets which focus on “Visual Question Answering”. Essentially, the dataset consists of visually complex documents and questions about those documents. The goal is to build an AI system that can come up with the correct answer to the question based on the documents provided.
They test this general problem in a variety of ways, which they call “tasks”:
- Single Page Document Visual Question Answering (SP-DocVQA) is a collection of 50,000 questions framed on 12,767 document images, where each question is posed based on a specific document. The assumption is that you know which document a question is being asked about, so for this task “searching” for a document is not relevant.
- Document Collection Visual Question Answering (DocCVQA): in DocCVQA, questions are posed over a collection of documents, not just one. There is still the problem of needing to extract information from the documents, but with the added task of needing to find which document is relevant in the first place. The dataset consists of 20 questions posed across all 14,362 document images. Many of these questions require complex reasoning and aggregation to solve.
- Infographics VQA consists of 3,288 questions posed across 579 documents, where the documents were scraped from freely accessible internet sources. The previous datasets focused on business and industry documents, while this one focuses on infographics, which contain much more irregular and often more complex visual information. This dataset also includes more subtle question types that require logical reasoning and inference based on the source content, which was not a point of focus in the previous datasets.
- Multipage Document Visual Question Answering (MP-DocVQA): each of the previous datasets assumed single-page documents. This dataset does not, and includes questions posed across multi-page documents. It consists of 46,176 questions, each posed about a specific multi-page document. There are 5,928 documents in the dataset, with an average of 8.27 pages per document. The goal of this task is to answer a question based on the associated multi-page document.
Each of these datasets ultimately gives you documents, a question, and an expected answer (the pairing of a question and desired answer is often referred to as a “Q/A pair”). The actual evaluation happens by comparing an answer from the AI system being tested with the human-curated answer provided in the dataset.
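Whatever the dataset, the unit of evaluation usually boils down to a record like the one sketched below; the field names and values are placeholders of my own, not any dataset’s actual schema.

```python
# Illustrative shape of a Q/A evaluation record. Field names are invented for
# this sketch and do not match any particular benchmark's schema.
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    reference_answer: str    # the human-curated "gold" answer
    document_ids: list[str]  # the document(s) the answer should come from

example = QAPair(
    question="<question about the document>",
    reference_answer="<gold answer>",
    document_ids=["<document id>"],
)
```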

Given the various modifications each dataset provides, the entire document contextualization pipeline is tested in a variety of ways.

Dataset 2) MPMQA

“Multimodal Question Answering on Product Manuals” (MPMQA) is an interesting dataset because it requires not only multimodal understanding, but multimodal output as well. It consists of 209 product manuals from 27 different companies like Apple, Sony, and Samsung, along with 22,021 questions, each with a corresponding desired answer.
One of the core ideas of this dataset is that multimodal answers can be more useful to users than pure textual answers, so the dataset is designed to require not only a textual answer but also a visual one. The vast majority of questions are “How to” questions, which I imagine are the typical sort of questions asked when consulting a product manual.

This particular dataset sits somewhat outside the document contextualized AI framework we’ve been discussing. The additional visual component requires an additional capability, which may or may not be relevant to your product or use case.

Dataset 3) TechQA

TechQA focuses on document contextualized AI systems in the technical support domain. It features long and complex questions which may or may not have a known answer.
Each question has 50 “technotes” associated with it, drawn from a pool of 802k technotes in total. These technotes consist of technical knowledge that may or may not answer the proposed technical question. The goal of this dataset is to predict whether an answer exists within the 50 technotes provided for each question and, if there is an answer, to extract that answer and provide it to the user. The technotes themselves are official technical documentation created by IBM support engineers. If I recall correctly, they were chiefly scraped from IBM documentation pages.
The idea of this dataset, as far as I can tell, is that IBM already has search functionality for finding technotes based on a user's query. What they want is to be able to find specific text within their documentation that is relevant. Thus this dataset is a bit more application-specific, and may or may not be in line with your needs.
Dataset 4) PDF-MVQA

PDF-MVQA is a multi-page, multimodal document question-answering dataset specifically built from academic research articles (PDFs) sourced from PubMed Central. It consists of 262,928 questions posed across 3,146 documents. The questions were generated using ChatGPT based on paragraph content and figure captions.
This dataset thinks of documents as a collection of entities, like paragraphs, tables, and figures. The goal of this dataset is to retrieve the relevant entities based on a particular question. This is difficult because entities that are relevant to a query may be spread across multiple pages.

Treating documents as a detection problem has emerged over the last few years as a compelling first step toward advanced parsing strategies, allowing for the creation of complex AI-based parsing systems.
There are a bunch more datasets to choose from (RAGBench, PubMedQA, and HotPotQA to name a few), each of which attacks the general problem of document contextualized AI differently. I think we’ve covered enough datasets for now, though, to have a general understanding of the space. Let’s discuss how we can practically use these datasets to test document contextualized AI products.
Testing Document Interfacing AI Products
Through discussing a few datasets, it should be apparent that there are a lot of ways to define “An AI model that can understand documents”. If you’re building an AI product that’s designed to understand specific documents to do a specific task, you’ll likely find that there are many datasets that don’t fit your use case, a few datasets that sort of fit your use case, and very few datasets that fit your use case perfectly.
Using academic datasets to test AI products is an art, requiring one to think creatively about how datasets can be manipulated to test your product. It also requires open-mindedness in thinking about how your product can be re-imagined to fall in line with existing datasets.
I’ve been managing a fraud detection product for a while now, where the general idea is to analyze documents about an insurance claim and aggregate evidence as to whether the claim is fraudulent. This would allow claim adjusters to better prioritize their time, analyzing claims that have some potential risk of fraud.
In that effort, we have questions like the following:
- “Did person X have a lawyer before seeking medical treatment?”
- “Did person X’s injuries change substantially throughout the course of their treatment?”
It’s difficult to find datasets that perfectly address this exact application, but there are datasets that are similar in many ways. Generally, this product can be seen as:
- An information extraction problem
- A multi-page and multi-modal visual question-answering problem
- A multi-hop question answering problem, etc.
Thus, I can test the core technology of my fraud detection product against these datasets to form an idea of the general strengths and weaknesses of my product. This might not be immediately relevant to fraud detection, but it’s certainly helpful to test the sub-problems within that greater problem, and it’s infinitely better than doing no testing at all.

The way testing practically shakes out depends on the product you’re testing and the dataset you’re testing with, but generally speaking, you load your documents into a doc store, fire all your questions through your product, and get a list of AI-generated answers. In practice, this can be a surprisingly difficult and labor-intensive process. These datasets can be large and complex, so integrating them into your product correctly can be challenging, especially if the dataset doesn’t perfectly align with your product. And once you get a bunch of AI-generated answers, actually scoring your performance based on those answers is its own challenge.
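In outline, the harness usually looks something like the sketch below: ingest the dataset’s documents, run every question through your product, and save the generated answers next to the reference answers for scoring. The `ingest_documents` and `my_product_answer` functions are placeholders for whatever your system actually exposes.

```python
# End-to-end benchmark harness: load documents, run every question through the
# product under test, and collect (question, reference, generated) triples.
# `ingest_documents` and `my_product_answer` are placeholders for your system.
import json

def ingest_documents(doc_paths: list[str]) -> None:
    raise NotImplementedError("load the dataset's documents into your doc store")

def my_product_answer(question: str) -> str:
    raise NotImplementedError("call your RAG / agent pipeline here")

def run_benchmark(qa_pairs: list[dict], doc_paths: list[str], out_path: str) -> None:
    ingest_documents(doc_paths)
    results = []
    for pair in qa_pairs:
        results.append({
            "question": pair["question"],
            "reference_answer": pair["answer"],
            "generated_answer": my_product_answer(pair["question"]),
        })
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```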
At EyeLevel, we’ve found that LLM as a judge can be a great place to start, but we wouldn’t recommend overreliance. For the uninitiated, the idea of LLM as a judge is to pass both answers to an LLM and ask the LLM if the AI-generated answer aligns with the human-curated answer.
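A minimal sketch of that idea, assuming the OpenAI chat completions API; the judge prompt wording and model name are illustrative, not what we use internally.

```python
# Minimal LLM-as-a-judge: ask a model whether the generated answer conveys the
# same information as the human-curated reference. Prompt and model are
# illustrative choices.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Does the candidate answer convey the same essential information as the
reference answer? Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, reference: str, candidate: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```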

LLMs have weird quirks, though, and tend to falsely flag answers as incorrect based on irrelevant missing details or minor discrepancies. We’ve found that LLM as a judge consistently deviates from blind human testing by roughly 10–20%.
Another automated approach is pairwise comparison. If you have two AI systems you want to compare, you can ask an LLM judge to pick which answer it prefers, rather than saying whether a single answer is “good” or “bad”. This can sometimes result in more stable analysis of results, but it can also amplify stylistic preferences in the judge which may not be relevant to humans.
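Here’s a sketch of a pairwise judge, reusing the `client` from the previous snippet; judging both orderings of the two answers is one common way (assumed here, not prescribed) to control for position bias.

```python
# Pairwise judging: ask which of two system outputs better answers the question.
# Judging both orderings helps control for the judge's position bias.
PAIRWISE_PROMPT = """Question: {question}
Answer A: {a}
Answer B: {b}

Which answer better addresses the question? Reply with exactly one letter: A or B."""

def prefers_first(question: str, first: str, second: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(
            question=question, a=first, b=second)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("A")

def pairwise_vote(question: str, ans_x: str, ans_y: str) -> int:
    """Return +1 if X wins both orderings, -1 if Y wins both, 0 if split."""
    x_first = prefers_first(question, ans_x, ans_y)
    y_first = prefers_first(question, ans_y, ans_x)
    if x_first and not y_first:
        return 1
    if y_first and not x_first:
        return -1
    return 0
```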

Yet another technique is ensembling. You can ask many different LLMs to judge an answer and define the score for that question as the aggregate of all judges.

This can be extended further by exploiting the temperature parameter of LLMs to sample a range of judgments from a single model, which are then aggregated. This can result in more consistent evaluation.
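A sketch combining both ideas, reusing `client` and `JUDGE_PROMPT` from the earlier snippets; the model names, sample counts, and temperature are illustrative assumptions.

```python
# Ensembled judging: sample several verdicts across models and with nonzero
# temperature, then report the fraction of votes that said CORRECT.
def ensemble_judge(question: str, reference: str, candidate: str,
                   models: tuple[str, ...] = ("gpt-4o", "gpt-4o-mini"),  # illustrative
                   samples_per_model: int = 3) -> float:
    votes = []
    for model in models:
        for _ in range(samples_per_model):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                    question=question, reference=reference, candidate=candidate)}],
                temperature=0.7,  # nonzero so repeated samples can disagree
            )
            verdict = resp.choices[0].message.content.strip().upper()
            votes.append(verdict.startswith("CORRECT"))
    return sum(votes) / len(votes)
```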

There are a lot of ways LLM as a judge can be improved to help automate human evaluation, but if you’re serious about improving the performance of your AI system, we recommend keeping a human in the loop. LLMs have strange preferences, and it can be dangerous to invest too much time and energy focused on changes that are fundamentally derived from LLM quirks.
Of course, all of this assumes you have a dataset to test on. Let’s talk about how we can define our own dataset. Having our own dataset might be useful if we either couldn’t find a good one for our use case or if we want to make a dataset that’s super specific.
Building Your Own Dataset
The exact nature of a dataset should reflect the problem you’re solving. If you’re trying to build an AI system that can draw circles on the page of a document, you’re probably going to need a pretty complicated dataset to test your performance. In this article we’re chiefly focused on testing document contextualized AI systems which are designed to answer questions based on a user's query.
You might expect procuring documents to be the most difficult part of building your own dataset, but collecting documents can actually be fairly easy. There is a huge number of business, legal, governmental, financial, and educational resources online which are freely accessible. If you’re working with a customer on a particular problem, many customers have a huge number of relevant documents which can be leveraged for testing.
Really, the most difficult part of creating a dataset is Q/A pair generation. If you’re working with difficult documents that contain complex information, it can be time-consuming to understand the material sufficiently to generate a Q/A pair. At EyeLevel, we’ve found that human-generated Q/A pairs are fundamentally important in testing, but these can be bolstered by LLM-generated Q/A pairs. We recommend using a variety of approaches to automate Q/A pair generation to minimize the bias of one approach influencing the results. We’re talking about maturing some of our internal tooling and open-sourcing it, so stay tuned for that. If you’re interested, show the EyeLevel team some love, and tell them I sent you; that might make it more likely we actually get an open-source testing tool out the door as opposed to working on all the other priorities we have.
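As one simple example of LLM-assisted Q/A generation, the sketch below prompts a model to write a question and answer grounded in a single parsed chunk. This is not our internal tooling, just an illustration; generated pairs should still be reviewed by a human, and mixing several generation approaches helps limit any single approach’s bias.

```python
# Bootstrap a Q/A pair from one parsed chunk with an LLM. Prompt wording and
# model choice are illustrative; human review of the output is still essential.
import json
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = """Below is an excerpt from a document. Write one question that can be
answered using only this excerpt, along with the answer to that question.
Respond as JSON with keys "question" and "answer".

Excerpt:
{chunk}"""

def generate_qa_pair(chunk: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": GEN_PROMPT.format(chunk=chunk)}],
        response_format={"type": "json_object"},
        temperature=0.7,
    )
    return json.loads(resp.choices[0].message.content)
```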
Conclusion
In this article we discussed the major touchpoints of testing document contextualized AI systems, like RAG and agents. We discussed the major pieces of these systems, why they’re necessary, and how they can fail. We then discussed a variety of open source datasets, how they can be used to test document contextualized AI products, and how we can construct our own dataset for more bespoke testing. This article was a high-level overview of the general problem. In future articles, I’ll be zooming into some more specifics of how highly robust document-contextualized AI systems can be developed.
I’ve spent a fair amount of time constructing datasets for testing RAG systems. This has prompted me to come up with a few rules of thumb that I’d like to share:
- Falsifiability is critical. If you can’t easily say an AI system was correct or incorrect, then your testing will not be very productive.
- Balance testing specific components of the document contextualization pipeline with testing holistically. Specific tests can give you theories as to what needs work in your RAG/Agentic system, and holistic tests can tell you if your changes are actually resulting in better output.
- Ablation studies are your friend. Swap components out and test your RAG system. Replace your expensive and fancy approach with a simpler one to see if there’s any significant shift in performance. Testing is not only an opportunity for derisking, but also a way to quickly find improvements.
- Start early and small. Testing is hard, but the sooner you make it a core piece of your development workflow, the better. You’ll become better at testing, and your products will get better faster.