In this article, we’ll discuss the recent performance metrics of GroundX on DocBench, which suggest that GroundX is better than humans at some knowledge-intensive tasks.
“GroundX” is an AI system designed to understand large volumes of complex documents and serve that information to LLMs via RAG and agentic tools, and “DocBench” is a benchmark designed to test AI document understanding. In this article, we’ll explore what GroundX is, how it beats humans on key metrics within the DocBench benchmark, and why that matters.


As GroundX improves in performance, we frequently find that GroundX evaluates benchmarks as often as benchmarks evaluate GroundX. — From later in the article
GroundX and DocBench, In a Nutshell
If you’re working in a company that deals with engineering, medical, financial, and other types of complex information, you probably already know that AI can only help so much. If you’ve put your documents into ChatGPT, for instance, you may have noticed that key information in figures is often left out, tables are misunderstood, and important organizational details are lost. These practical issues severely limit the use of AI in virtually any document-heavy industry.

The dream is to give AI a bunch of documents, ask the AI a question, then get a useful and accurate answer based on the content of the documents. In reality, the results are often hit-or-miss.
If you’re a developer who’s been working on building AI applications around documents, you already know that this is a multifaceted issue. Building a system that understands documents via approaches like “Retrieval Augmented Generation” (RAG) requires multiple steps, where the whole chain of operations is only as strong as its weakest link.

GroundX exists because these operations are tightly interdependent: chunking is related to parsing, the way data is stored in a queryable representation impacts the final completion, the parsing system should be designed to produce high-quality prompts for an LLM, and so on. Most approaches treat each of these steps independently; GroundX unifies them all under a single cohesive system.
Naturally, we’ve been testing GroundX in-depth throughout the development process, which is no easy feat in itself. Unlike traditional software systems that can be tested via simple unit and integration tests, GroundX is designed to apply AI to arbitrary and complex documents. Testing an AI system that can understand documents better and faster than a human can is difficult.
As a result, we’ve had to do a lot of manual work to curate high-quality datasets based on expert understanding of complex documents. We started working on GroundX before RAG got its name, so we weren’t blessed with high-quality and open-source datasets at launch.
Now that the industry has matured, general document comprehension benchmarks are becoming both more common and more widely adopted, which is exciting. We’ve had to do a lot of work in comparing GroundX with other document contextualized AI approaches, so pre-defined datasets with existing performance metrics are a breath of fresh air. “DocBench” is one such dataset, which we’ll be exploring in this article.
The DocBench Dataset, and Testing GroundX with It

DocBench consists of 229 real-world documents and 1,102 questions posed based on those documents. Each question also has an expert-curated answer, and together they are often referred to as a “Q/A pair”. Whether an AI system is correct can be determined by comparing its generated answer to the human-curated answer.

Comparing AI-generated answers against the desired answers is a laborious process. For a human to do it, they’d have to pull up the relevant document, find the content in the document that is relevant to the question, understand that content, and then evaluate the AI answer. To make matters worse, the definition of a “right” and “wrong” answer can be vague and subtle, especially for open-ended questions on dense information.
DocBench standardizes this process by employing LLM-as-a-judge, where an LLM is given the question and both answers, then judges the AI-generated answer automatically.

There are some serious practical drawbacks to this approach, which we’ll discuss later, but it has the benefit of being significantly faster than human evaluation and providing a standard by which different approaches can be judged (relatively) impartially.

DocBench uses 229 real-world documents from a variety of sources

and, for each document, has a corresponding list of question/answer pairs
{"question": "What is the primary challenge addressed by the introduction of the Linked WikiText-2 dataset?", "answer": "The primary challenge addressed is incorporating factual knowledge into language models due to difficulty in obtaining training data that describes which entities or facts each token is referring to.", "type": "text-only", "evidence": "\"...one of the primary barriers to incorporating factual knowledge into language models is that training data is hard to obtain. Standard language modeling corpora consist only of text, and thus are unable to describe which entities or facts each token is referring to.\""}
{"question": "What is the top-1 accuracy of the Oracle KGLM on birthdate prediction?", "answer": "The top-1 accuracy of the Oracle KGLM on birthdate prediction is 65%.", "type": "multimodal-t", "evidence": "The table lists the top-1/top-5 accuracy for each category and the Oracle KGLM's top-1 accuracy for birthdate prediction is specifically listed as 65."}{"question": "How many documents are there in the training set of the Linked WikiText-2 Corpus?", "answer": "There are 600 documents in the training set.", "type": "multimodal-t", "evidence": "The information is directly listed in the table under the 'Train' column for 'Documents.'"}
{"question": "Which language model has the lowest Perplexity (PPL) according to Table 3?", "answer": "KGLM has the lowest Perplexity (PPL) with a score of 44.1.", "type": "multimodal-t", "evidence": "The PPL column indicates the perplexity scores for different language models, and KGLM shows the lowest score among them, which is clearly visible in the table."}
{"question": "Who is the last author of the paper?", "answer": "Sameer Singh", "type": "meta-data", "evidence": ""}
{"question": "On which page does the paper introduce the corpus statistics?", "answer": "Page 5", "type": "meta-data", "evidence": ""}
{"question": "How many time does the paper mention WikiText-2?", "answer": "31", "type": "meta-data", "evidence": ""}
Each Q/A pair also has a type, which defines the type of question being asked. The DocBench dataset divides questions into the following types:
- Text-Only: Questions that have to do with text within a document
- Multimodal: Questions that have to do with images, tables, and other figures within a document
- Meta-data: Questions that have to do with things like word count and page number
- Unanswerable: Questions that cannot be answered based on the provided document
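For illustration, here is a minimal Python sketch of how one might load a DocBench-style Q/A file and tally the question types. The field names follow the JSON examples above, but the file path and exact repo layout are assumptions.

import json
from collections import Counter

def load_qa_pairs(path):
    # Each line of the file is a standalone JSON object describing one Q/A pair.
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

qa_pairs = load_qa_pairs("data/paper/qa.jsonl")  # hypothetical path
type_counts = Counter(pair["type"] for pair in qa_pairs)
print(type_counts)  # e.g. how many text-only vs. multimodal vs. meta-data questions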
The Q/A pairs in DocBench are generated via a human-in-the-loop curation process, which we also employ at EyeLevel when constructing our own datasets.

Basically, the authors of DocBench curated a collection of PDFs, used AI to generate a large set of Q/A pairs, then had humans manually review and curate those Q/A pairs. This is a fantastic approach to creating large datasets, but it can introduce issues that we’ll discuss later in the article.
To evaluate a particular approach, run.py in the DocBench repo provides the Runner and Runner_OSS classes. Runner works with hosted systems like those provided by OpenAI, and Runner_OSS works with open-source approaches that run locally. This allows DocBench to actually run AI systems against the documents and their corresponding questions to produce generated answers.
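We won’t reproduce DocBench’s actual classes here, but conceptually a runner boils down to something like the sketch below. The class and method names are ours, not DocBench’s, and the answer function is whatever system you want to test.

class ExampleRunner:
    # Conceptual sketch of a DocBench-style runner, not the actual DocBench code.
    def __init__(self, answer_fn):
        # answer_fn(document_path, question) -> str, provided by the system under test
        self.answer_fn = answer_fn

    def run(self, documents):
        # documents: list of (document_path, list_of_qa_pairs) tuples
        results = []
        for doc_path, qa_pairs in documents:
            for pair in qa_pairs:
                generated = self.answer_fn(doc_path, pair["question"])
                results.append({"question": pair["question"],
                                "reference": pair["answer"],
                                "evidence": pair.get("evidence", ""),
                                "generated": generated})
        return results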
We can use LLM-as-a-judge to evaluate our generations, which is as simple as iterating through each Q/A pair along with the generated answer and feeding them to an LLM with a prompt. The prompt used to do the evaluation is shown below.
Task Overview:
You are tasked with evaluating user answers based on a given question, reference answer, and additional reference text. Your goal is to assess the correctness of the user answer using a specific metric.
Evaluation Criteria:
1. Yes/No Questions: Verify if the user's answer aligns with the reference answer in terms of a "yes" or "no" response.
2. Short Answers/Directives: Ensure key details such as numbers, specific nouns/verbs, and dates match those in the reference answer.
3. Abstractive/Long Answers: The user's answer can differ in wording but must convey the same meaning and contain the same key information as the reference answer to be considered correct.
Evaluation Process:
1. Identify the type of question presented.
2. Apply the relevant criteria from the Evaluation Criteria.
3. Compare the user's answer against the reference answer accordingly.
4. Consult the reference text for clarification when needed.
5. Score the answer with a binary label 0 or 1, where 0 denotes wrong and 1 denotes correct.
NOTE that if the user answer is 0 or an empty string, it should get a 0 score.
Question: {{question}}
User Answer: {{sys_ans}}
Reference Answer: {{ref_ans}}
Reference Text: {{ref_text}}
Evaluation Form (score ONLY):
- Correctness:
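Plugging that prompt into an evaluation loop looks roughly like the sketch below, here using the OpenAI Python SDK. The model name, the placeholder syntax, and the one-answer-at-a-time loop are our assumptions; the real DocBench code batches answers to save on inference costs.

from openai import OpenAI

client = OpenAI()
JUDGE_PROMPT = "..."  # the evaluation prompt above, with {question}, {sys_ans}, {ref_ans}, {ref_text} placeholders

def judge(question, sys_ans, ref_ans, ref_text, model="gpt-4o"):
    # Fill in the judge prompt and ask the LLM for a binary correctness score.
    prompt = JUDGE_PROMPT.format(question=question, sys_ans=sys_ans,
                                 ref_ans=ref_ans, ref_text=ref_text)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().endswith("1")

# 'results' is the output of a runner like the sketch earlier in this article
scores = [judge(r["question"], r["generated"], r["reference"], r["evidence"]) for r in results]
accuracy = sum(scores) / len(scores)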
To get GroundX working in DocBench, we created a custom Runner by:
- Creating a fork of the repo
- Adding GroundX as an option in run.py
- Doing a final pass to clean up the code
- As is tradition in software development, doing another minor cleanup pass after the final cleanup pass
Essentially, we minimally modified DocBench to accept GroundX as a runnable approach without modifying the evaluation logic in any meaningful way. If you wanted to run another system through DocBench, these would be a great set of steps to emulate (a rough sketch follows below). If you want to run our analysis yourself, just follow the instructions in this repo, in the groundx-system branch.
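As an illustration of what that modification looks like (this is not our actual fork, and groundx_answer is a hypothetical helper that wraps the GroundX SDK’s ingest, search, and completion flow), adding a new system boils down to mapping a name to a runner:

def groundx_answer(doc_path, question):
    # Hypothetical helper: ingest doc_path into GroundX (if not already ingested),
    # retrieve the relevant context for the question, and hand it to an LLM for completion.
    # The actual GroundX SDK calls are omitted here.
    raise NotImplementedError

def build_runner(system_name):
    # Mirrors the spirit of DocBench's run.py: choose a runner based on a CLI flag.
    if system_name == "groundx":
        return ExampleRunner(groundx_answer)
    raise ValueError(f"Unknown system: {system_name}")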
Let’s check out the results.
The Results
As previously discussed, DocBench explores a variety of question types that might be relevant to different AI approaches. We decided the following were relevant to GroundX specifically:
- multimodal-t: questions that have to do with tabular data
- multimodal-f: questions that have to do with graphical figures
- text-only: questions that have to do with text within a document
- unanswerable: questions that have no answer in the document, serving as a way to measure a system’s tendency to hallucinate answers
Other question types, like ones about word counts or ones that require access to the internet, are not within GroundX’s scope.
Of the question types we analyzed, we got the following accuracy metrics based on LLM-as-a-judge evaluation:

Aggregating the results of multimodal-f and multimodal-t yields 84.4% accuracy on multimodal questions overall, which means GroundX achieved top performance on both multimodal question types and textual question types.
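That aggregation is just a count-weighted average over the two multimodal question types. A tiny sketch, where the per-type correct/total counts are inputs rather than published DocBench numbers:

def aggregate_accuracy(per_type):
    # per_type: dict mapping question type -> (num_correct, num_total)
    correct = sum(c for c, _ in per_type.values())
    total = sum(t for _, t in per_type.values())
    return correct / total

# e.g. aggregate_accuracy({"multimodal-t": (ct, nt), "multimodal-f": (cf, nf)})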

You might notice that GroundX was evaluated as performing poorly on unanswerable questions. Let’s look through a few examples to try to unpack this result.
Going Through Answers Evaluated as Wrong
Let's go through a few answers by GroundX that were flagged as incorrect by LLM-as-a-judge.
Unanswerable question 1 (judged incorrect).
Question: How does Toyota envision the role of cars in contributing to society's development and local communities' coexistence?
Answer: Not mentioned.
GroundX Answer: 5. The provided document does not contain information about Toyota or its vision regarding the role of cars in society or local communities.
Here, the answer was judged as incorrect even though it clearly states “The provided document does not contain information about Toyota or...”, which a human would likely interpret as a correct answer. It’s possible that the strange inclusion of “5.” at the beginning of the answer tripped up the evaluation. That prefix exists because, to save inference costs, DocBench uploads sets of answers in enumerated batches, and every once in a while those enumerations bleed into the reported answer.
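A simple mitigation, sketched below as an illustration of what we mean rather than something DocBench actually does, is to strip leading enumeration from answers before handing them to the judge:

import re

def strip_enumeration(answer):
    # Remove a leading "5. "-style prefix left over from batched, enumerated outputs.
    return re.sub(r"^\s*\d+\.\s*", "", answer)

strip_enumeration("5. The provided document does not contain information about Toyota...")
# -> "The provided document does not contain information about Toyota..."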
LLM evaluation can be frustratingly inconsistent, in our experience. It’s right most of the time, but even with in-depth prompting the results can be very wrong. At EyeLevel we’ve found LLM-as-a-judge to have relatively high specificity (it correctly flags wrong answers as wrong most of the time) but lower sensitivity (it flags correct answers as incorrect more often than it should). This is often caused by stylistic differences between the ground-truth and generated answers leading to a correct answer being misclassified as incorrect.
Unanswerable question 2 (judged incorrect).
Question: Was the Emergencies Act given royal assent on July 21, 1988?
Answer: Not mentioned.
GroundX Answer: 6. Yes, the Emergencies Act was given royal assent on July 21, 1988.
While this answer may not be in the document, it is accurate and likely exists in the parameterized knowledge of many completion models. So, auto evaluation judges this response as incorrect when it is, in fact, factually accurate.
Unanswerable question 3 (judged incorrect).
Question: On which page does the report present the first table?
Answer: The document does not contain table information.
GroundX Answer: 4. The first table appears on page 3.

This answer is factually inaccurate, though if one were being considerably gracious, the page in question does have a vaguely tabular vibe and, given the context of the document, could be considered the least wrong of any potential answer. There is also a table of contents in this document, which could arguably be considered a valid answer.
Unanswerable question 4 (judged incorrect).
Question: What is the revision data of the document?
Answer: Not mentioned.
GroundX Answer: 5.
This one is an unfortunate interpretation of the output format. Here, the system is saying that answer 5 is empty (which conforms to the enumerated output format DocBench requires), but because the “5” exists in the answer, the LLM evaluation likely misinterpreted it as a substantive answer.
Unanswerable question 5 (judged incorrect).
Question: What kind of resources are typically not required by neural network NER systems with minimal feature engineering?
Answer: Not mentioned.
GroundX Answer: 3. Neural network NER systems with minimal feature engineering typically do not require extensive hand-crafted features or domain-specific resources.
This is another case of a question that may not have content in the document but is general enough to be in the parameterized knowledge in an LLM. Thus, the LLM generates a factually correct answer, which is judged as incorrect because the desired answer is “Not mentioned”.
Unanswerable question 6 (judged incorrect).
Question: On which page does the report detail the Brazilian law?
Answer: Not mentioned.
GroundX Answer: 5. The report details the Brazilian law starting on page 13.

In this question, it appears that the curated answer is actually wrong. A section on Brazil, in a document that discusses the legality of commercial drone operation, starts on page 13. Here, it appears that GroundX caught an issue in the dataset.
So, while the automated evaluation scored GroundX at 52% accuracy on unanswerable questions, only one of the first six examples judged incorrect appears to be actually wrong.
In multimodal and textual questions we found similar evaluation errors, leading us to estimate that GroundX’s true accuracy is closer to 90–95%, which is in line with substantial internal manual testing we’ve done on similar datasets.
Interpreting the Results
There is an intimate interplay between the progress of AI systems and the benchmarks used to test that progress. AI demands laboriously curated, comprehensive datasets for testing, which are difficult to build at scale. High-quality datasets and evaluation strategies take time to curate and build, so there is a natural “good enough” horizon of datasets which, despite having some internal errors, serve as a good relative check of whether one modeling strategy is better than another. As these modeling approaches improve, they naturally start to surpass the testing capabilities of existing benchmarks, and new ones need to be developed.
As GroundX improves in performance, we frequently find that GroundX evaluates benchmarks as often as benchmarks evaluate GroundX. We believe GroundX is approaching that inflection point, where its heightened performance demands larger and more comprehensive document comprehension benchmarks. We’ve been building up a body of benchmarks and tooling, some of which can be accessed now:
- https://www.eyelevel.ai/post/most-accurate-rag: Tests RAG systems on complex documents.
- https://www.eyelevel.ai/post/do-vector-databases-lose-accuracy-at-scale: Analyzes the degradation of performance of RAG approaches at scale
And some of which we’ll be releasing over the coming months.