Research

GroundX Achieves Superhuman Performance in Document Comprehension on DocBench

October 26, 2024
  •  
Daniel Warfield
Senior Engineer

AI that can understand documents better than a human

In this article, we’ll discuss the recent performance metrics of GroundX on DocBench, which suggest that GroundX is better than humans at some knowledge-intensive tasks.

“GroundX” is an AI system that’s designed to understand large amounts of complex documents and provide that information to LLMs via RAG and Agentic tools, and “DocBench” is a benchmark designed to test AI document understanding. In this article, we’ll explore what GroundX is, how it beats humans on key metrics within the DocBench benchmark, and why that matters.

Performance of GroundX on Textual and Multimodal Questions
As GroundX improves in performance, we frequently find that GroundX evaluates benchmarks as often as benchmarks evaluate GroundX. — From later in the article

GroundX and DocBench, In a Nutshell

If you’re working in a company that deals with engineering, medical, financial, and other types of complex information, you probably already know that AI can only help so much. If you’ve put your documents into ChatGPT, for instance, you may have noticed that key information in figures is often left out, tables are misunderstood, and important organizational details are lost. These practical issues severely limit the use of AI in virtually any document-heavy industry.

Examples of complex documents. Documents like these are frequently used in many industries, and are incredibly difficult for AI systems to interpret correctly. From the EyeLevel article on testing RAG on complex documents.

The dream is to give AI a bunch of documents, ask the AI a question, then get a useful and accurate answer based on the content of the documents. In reality, the results are often hit-or-miss.

If you’re a developer who’s been working on building AI applications around documents, you already know that this is a multifaceted issue. Building a system that understands documents via approaches like “Retrieval Augmented Generation” (RAG) requires multiple steps, where the whole chain of operations is only as strong as its weakest link.

The common RAG pipeline. Any step of this pipeline can result in errors that damage the quality of answers from a document contextualized AI system. From my article on testing document contextualized AI.
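To make that chain concrete, here is a deliberately naive sketch of the parse, chunk, retrieve, and prompt steps. None of this is GroundX code; the function names and the toy lexical retrieval are illustrative stand-ins meant only to show how a weak stage (crude parsing, fixed-size chunking, shallow retrieval) degrades everything downstream.

# A deliberately naive sketch of a generic RAG pipeline. None of this is GroundX
# code; every function here is an illustrative stand-in.

def parse(document_text: str) -> str:
    # Real systems extract text, tables, and figures; here we just pass text through.
    return document_text

def chunk(text: str, size: int = 500) -> list[str]:
    # Fixed-size chunking loses document structure -- a common weak link.
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(chunks: list[str], question: str, top_k: int = 3) -> list[str]:
    # Toy lexical retrieval: rank chunks by word overlap with the question.
    q_words = set(question.lower().split())
    return sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)[:top_k]

def build_prompt(document_text: str, question: str) -> str:
    # The prompt handed to the completion model is only as good as every step above it.
    context = "\n---\n".join(retrieve(chunk(parse(document_text)), question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

The resulting prompt would then be sent to a completion model; an error at any stage above shows up as a wrong or incomplete final answer.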

GroundX exists due to an acknowledgment that these operations are tightly interdependent. Chunking is related to parsing, how data is stored in a queryable representation impacts the final completion, the parsing system should be designed so that it produces high-quality prompts for an LLM, and so on. Most approaches treat each of these steps independently. GroundX unifies them all under a single cohesive system.

Naturally, we’ve been testing GroundX in-depth throughout the development process, which is no easy feat in itself. Unlike traditional software systems that can be tested via simple unit and integration tests, GroundX is designed to apply AI to arbitrary and complex documents. Testing an AI system that can understand documents better and faster than a human can is difficult.

As a result, we’ve had to do a lot of manual work to curate high-quality datasets based on expert understanding of complex documents. We started working on GroundX before RAG got its name, so we weren’t blessed with high-quality and open-source datasets at launch.

Now that the industry has matured, general document comprehension benchmarks are becoming both more common and more widely adopted, which is exciting. We’ve had to do a lot of work in comparing GroundX with other document contextualized AI approaches, so pre-defined datasets with existing performance metrics are a breath of fresh air. “DocBench” is one such dataset, which we’ll be exploring in this article.

The DocBench Dataset, and Testing GroundX with It

The DocBench paper

DocBench consists of 229 real-world documents and 1,102 questions posed about those documents. Each question also has an expert-curated answer, with the two together often referred to as a “Q/A pair”. Whether an AI system is correct can be determined by comparing its generated answer to the human-curated answer.

The general idea of DocBench: some AI system gets a document and a question, then the AI answer is compared with the human-curated desired answer. This is one of the most common ways to test document contextualized AI systems. Most of our internal datasets at EyeLevel employ a similar general strategy. Source.

Comparing AI-generated answers against the desired answers is a laborious process. For a human to do it, they’d have to pull up the relevant document, find the content in the document that is relevant to the question, understand that content, and then evaluate the AI answer. To make matters worse, the line between a “right” and a “wrong” answer can be vague and subtle, especially for open-ended questions about dense information.

DocBench standardizes this problem by employing LLM as a judge, where an LLM is given the question and both answers, then judges the AI-generated answer automatically.

An example of LLM as a judge. Source.

There are some serious practical drawbacks to this approach, which we’ll discuss later, but it has the benefit of being significantly faster than human evaluation and providing an easy standard by which different approaches can be judged (relatively) impartially.

The authors of DocBench tested how closely LLM as a judge aligned with human judgment, and found it to achieve a high degree of alignment with human evaluation. Source.

DocBench uses 229 real-world documents from a variety of sources

Documents from the DocBench dataset. The download link can be found here.

and, for each document, has a corresponding list of question/answer pairs

{"question": "What is the primary challenge addressed by the introduction of the Linked WikiText-2 dataset?", "answer": "The primary challenge addressed is incorporating factual knowledge into language models due to difficulty in obtaining training data that describes which entities or facts each token is referring to.", "type": "text-only", "evidence": "\"...one of the primary barriers to incorporating factual knowledge into language models is that training data is hard to obtain. Standard language modeling corpora consist only of text, and thus are unable to describe which entities or facts each token is referring to.\""}
{"question": "What is the top-1 accuracy of the Oracle KGLM on birthdate prediction?", "answer": "The top-1 accuracy of the Oracle KGLM on birthdate prediction is 65%.", "type": "multimodal-t", "evidence": "The table lists the top-1/top-5 accuracy for each category and the Oracle KGLM's top-1 accuracy for birthdate prediction is specifically listed as 65."}{"question": "How many documents are there in the training set of the Linked WikiText-2 Corpus?", "answer": "There are 600 documents in the training set.", "type": "multimodal-t", "evidence": "The information is directly listed in the table under the 'Train' column for 'Documents.'"}
{"question": "Which language model has the lowest Perplexity (PPL) according to Table 3?", "answer": "KGLM has the lowest Perplexity (PPL) with a score of 44.1.", "type": "multimodal-t", "evidence": "The PPL column indicates the perplexity scores for different language models, and KGLM shows the lowest score among them, which is clearly visible in the table."}
{"question": "Who is the last author of the paper?", "answer": "Sameer Singh", "type": "meta-data", "evidence": ""}
{"question": "On which page does the paper introduce the corpus statistics?", "answer": "Page 5", "type": "meta-data", "evidence": ""}
{"question": "How many time does the paper mention WikiText-2?", "answer": "31", "type": "meta-data", "evidence": ""}

Each Q/A pair also has a type, which defines the kind of question being asked. The DocBench dataset divides questions into the following types:

  • Text-Only: Questions that have to do with text within a document
  • Multimodal: Questions that have to do with images, tables, and other figures within a document
  • Meta-data: Questions that have to do with things like word count and page number
  • Unanswerable: Questions that cannot be answered based on the provided document

The Q/A pairs in DocBench are generated via a human-in-the-loop evaluation process, which we also employ at EyeLevel when constructing our own datasets.

The process of creating Q/A pairs for the DocBench dataset. Source.

Basically, the authors of DocBench curated a set of PDFs, used AI to generate a large number of candidate Q/A pairs, and then had humans manually review and curate those Q/A pairs. This is a fantastic approach to creating large datasets, but it can introduce issues that we’ll discuss later in the article.

To evaluate a particular approach, run.py in the DocBench repo has the Runner and Runner_OSS classes. Runner works with hosted systems like those provided by OpenAI, and Runner_OSS works with open-source approaches that run locally. This allows DocBench to actually run AI systems over the documents and their corresponding questions to produce generated answers.

We can use LLM as a judge to evaluate our generations, which is as simple as iterating through each Q/A pair along with the generated answer and feeding them to an LLM with a prompt. The prompt used to do the evaluation can be found here.

Task Overview:
You are tasked with evaluating user answers based on a given question, reference answer, and additional reference text. Your goal is to assess the correctness of the user answer using a specific metric.

Evaluation Criteria:
1. Yes/No Questions: Verify if the user's answer aligns with the reference answer in terms of a "yes" or "no" response.
2. Short Answers/Directives: Ensure key details such as numbers, specific nouns/verbs, and dates match those in the reference answer.
3. Abstractive/Long Answers: The user's answer can differ in wording but must convey the same meaning and contain the same key information as the reference answer to be considered correct.

Evaluation Process:
1. Identify the type of question presented.
2. Apply the relevant criteria from the Evaluation Criteria.
3. Compare the user's answer against the reference answer accordingly.
4. Consult the reference text for clarification when needed.
5. Score the answer with a binary label 0 or 1, where 0 denotes wrong and 1 denotes correct.
NOTE that if the user answer is 0 or an empty string, it should get a 0 score.

Question: {{question}}
User Answer: {{sys_ans}}
Reference Answer: {{ref_ans}}
Reference Text: {{ref_text}}

Evaluation Form (score ONLY):
- Correctness:
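As a sketch of what that iteration might look like in practice: the call_llm function below is a stand-in for whatever completion API you use, and JUDGE_PROMPT is a simplified placeholder for the evaluation prompt above; the authoritative implementation lives in the DocBench repo.

# A minimal sketch of the judging loop. `call_llm` is a stand-in for whatever
# completion API you use, and JUDGE_PROMPT is a simplified stand-in for the
# evaluation prompt above, with {question}/{sys_ans}/{ref_ans}/{ref_text} slots.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "User Answer: {sys_ans}\n"
    "Reference Answer: {ref_ans}\n"
    "Reference Text: {ref_text}\n\n"
    "Evaluation Form (score ONLY):\n- Correctness:"
)  # in practice, prepend the full task overview and criteria shown above

def judge(records: list[dict], generated_answers: list[str], call_llm) -> float:
    # Returns the fraction of generated answers the judge labels correct (1).
    scores = []
    for record, sys_ans in zip(records, generated_answers):
        prompt = JUDGE_PROMPT.format(
            question=record["question"],
            sys_ans=sys_ans,
            ref_ans=record["answer"],
            ref_text=record.get("evidence", ""),
        )
        verdict = call_llm(prompt).strip()
        # The judge is asked for a binary 0/1 label; parse it defensively.
        scores.append(1 if verdict.endswith("1") else 0)
    return sum(scores) / max(len(scores), 1)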

To get GroundX working in DocBench, we created a custom Runner for GroundX.
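For illustration, a runner for a retrieval-backed system really only needs two hooks: ingest the document, and answer the question. The sketch below is our reading of that general shape, not DocBench's actual Runner code; ingest_document and search_documents are hypothetical helpers standing in for GroundX's ingestion and search calls.

# A sketch of the shape a custom runner can take; it is not DocBench's actual
# Runner code. `ingest_document` and `search_documents` are hypothetical helpers
# standing in for GroundX's ingestion and search calls.

class GroundXRunner:
    def __init__(self, call_llm):
        self.call_llm = call_llm  # completion function: prompt text -> answer text

    def prepare(self, document_path: str) -> None:
        # Hand the document to GroundX, which parses, chunks, and indexes it.
        ingest_document(document_path)

    def answer(self, question: str) -> str:
        # Retrieve LLM-ready context for the question, then answer from it alone.
        context = search_documents(question)
        prompt = (
            "Using only the context below, answer the question.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return self.call_llm(prompt)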

Essentially, we minimally modified DocBench to accept GroundX as a runnable approach without modifying the evaluation logic in any meaningful way. If you wanted to run another system through DocBench, these would be a great set of steps to emulate. If you want to run our analysis yourself, just follow the instructions in this repo, in the groundx-system branch.

Let’s check out the results.

The Results

As previously discussed, DocBench explores a variety of question types that might be relevant to different AI approaches. We decided the following were relevant to GroundX specifically:

  • multimodal-t: questions that have to do with tabular data
  • multimodal-f: questions that have to do with graphical figures
  • text-only: questions that have to do with text within a document
  • unanswerable: questions that have no answer in the document, serving as a way to measure a system's tendency to hallucinate answers.

Other question types, like ones about word counts or ones that require access to the internet, are not within GroundX’s scope.

Of the question types we analyzed, we got the following accuracy metrics based on LLM as a judge evaluation:

Question type and evaluation result.

Aggregating the results of multimodal-f and multimodal-t yields 84.4% accuracy on multimodal questions, which means GroundX achieved top performance on both multimodal and textual question types.

The table of analyzed solutions provided by DocBench. Source.

You might notice that GroundX was evaluated as performing poorly on unanswerable questions. Let’s look through a few examples to try to unpack this result.

Going Through Answers Evaluated as Wrong

Let's go through a few answers by GroundX that were flagged as incorrect by LLM as a judge.

Unanswerable question 1 (judged incorrect).

Question: How does Toyota envision the role of cars in contributing to society's development and local communities' coexistence?
Answer: Not mentioned.
GroundX Answer: 5. The provided document does not contain information about Toyota or its vision regarding the role of cars in society or local communities.

Here, the answer was judged as incorrect even though it clearly states The provided document does not contain information about Toyota or... , which a human would likely interpret as a correct answer. It’s possible that the strange inclusion of 5. at the beginning of the answer tripped up the evaluation. This happens because, to save inference costs, DocBench uploads sets of answers in batches as an enumerated list, and every once in a while those enumerations bleed into the reported answer.

LLM evaluation can be frustratingly inconsistent, in our experience. It’s right most of the time, but even with in-depth prompting the results can be very wrong. At EyeLevel we’ve found LLM as a judge to have relatively high specificity (it correctly flags wrong answers as wrong most of the time) but lower sensitivity (it flags correct answers as incorrect more often). This usually happens when incidental, stylistic differences between a ground-truth answer and a generated answer cause a correct answer to be misclassified as incorrect.

Unanswerable question 2 (judged incorrect).

Question: Was the Emergencies Act given royal assent on July 21, 1988?
Answer: Not mentioned.
GroundX Answer: 6. Yes, the Emergencies Act was given royal assent on July 21, 1988.


While this answer may not be in the document, it is accurate and likely exists in the parameterized knowledge of many completion models. So, automated evaluation judges this response as incorrect even though it is, in fact, factually accurate.

Unanswerable question 3 (judged incorrect).

Question: On which page does the report present the first table?
Answer: The document does not contain table information.
GroundX Answer: 4. The first table appears on page 3.
The page referenced in question 3.

This answer is factually inaccurate, though if one were being considerably generous, this page does have a vaguely tabular vibe and, given the context of the document in question, could be considered the least wrong of any potential answer. There is also a table of contents in this document, which could also be considered a valid answer.

Unanswerable question 4 (judged incorrect).

Question: What is the revision data of the document?
Answer: Not mentioned.
GroundX Answer: 5.


This one is an unfortunate interpretation of the output format. Here, the system is saying that output 5 is nothing (which conforms to the enumerated output format DocBench expects), but because the 5 exists in the answer, the LLM evaluation likely misinterpreted the 5 as an answer.

Unanswerable question 5 (judged incorrect).

Question: What kind of resources are typically not required by neural network NER systems with minimal feature engineering?
Answer: Not mentioned.
GroundX Answer: 3. Neural network NER systems with minimal feature engineering typically do not require extensive hand-crafted features or domain-specific resources.

This is another case of a question whose answer may not appear in the document but is general enough to exist in the parameterized knowledge of an LLM. Thus, the LLM generates a factually correct answer, which is judged as incorrect because the desired answer is “Not mentioned”.

Unanswerable question 6 (judged incorrect).

Question: On which page does the report detail the Brazilian law?
Answer: Not mentioned.
GroundX Answer: 5. The report details the Brazilian law starting on page 13.

The page of the document in question. The document is titled “Law Library, Commercial Drone Operation Training and Certification Requirements”

In this question, it appears that the curated answer is actually wrong. A section on Brazil, in a document that discusses the legality of commercial drone operation, starts on page 13. Here, it appears that GroundX caught an issue in the dataset.

So, while automated evaluation scored GroundX at 52% accuracy on unanswerable questions, only one of the first six such examples is actually wrong.

On multimodal and textual questions we found similar evaluation errors, leading us to estimate that GroundX actually achieved closer to 90–95% accuracy, which is in line with the substantial internal manual testing we’ve done on similar datasets.

Interpreting the Results

There is an intimate interplay between the progress of AI systems and the benchmarks used to test that progress. AI demands laboriously curated and comprehensive datasets for testing which are difficult to build at scale. High-quality datasets and evaluation strategies take time to curate and build, so there is a natural “good enough” horizon of datasets which, despite having some internal errors, are a good relative check to see if one modeling strategy is better than another. As these modeling approaches improve, they naturally start to surpass the testing capabilities of existing benchmarks, and new ones need to be developed.

As GroundX improves in performance, we frequently find that GroundX evaluates benchmarks as often as benchmarks evaluate GroundX. We believe GroundX is approaching that inflection point, where its heightened performance demands larger and more comprehensive document comprehension benchmarks. We’ve been building up a body of benchmarks and tooling, some of which can be accessed now and some of which we’ll be releasing over the coming months.
