
The AI Engineer's Guide to Document Parsing in RAG Applications

October 26, 2024 • 4 minutes
Daniel Warfield, Senior Engineer

In the world of Retrieval-Augmented Generation (RAG), we often focus on the flashy aspects: sophisticated language models, clever prompting strategies and advanced retrieval techniques. But today, we're diving into a critical component that doesn't get enough attention: document parsing. Understanding and optimizing your parsing strategy is one of the keys to building high-performance RAG applications.

The Foundation of RAG: Document Parsing

Let's start with a fundamental truth: parsing is the bedrock of any RAG application. 

"The first step in any RAG application is parsing your document and extracting the information from it," says EyeLevel cofounder Neil Katz. "You’re trying to turn it into something that language models will eventually understand and do something smart with."

This isn't just about extracting text. It's about preserving structure, context, and relationships within the data. Get this wrong, and your entire RAG pipeline suffers. If you don't get the information out of your documents in the first place, which is where most RAG pipelines start, it's “garbage in and garbage out” and nothing downstream will work properly.

The Heart of the Problem

The basic problem to solve is that language models, at least for now, don't understand complex visual documents. Anything with tables, forms, graphics, charts, figures and complex formatting will cause downstream hallucinations in a RAG application. Yes, you can take a page from a PDF and feed it into ChatGPT, and it will understand some of it, sometimes most of it. But try doing this at scale, across thousands or millions of pages, and you've got a mess, and eventually hallucinations downstream in your RAG.

So devs need some way of breaking complex documents apart, identifying the text blocks, tables, charts and so on, then extracting the information from those regions and converting it into something language models can understand and that you can store in your RAG database. This final output is usually simple text or JSON.
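To make that concrete, here is a purely hypothetical sketch of what LLM-ready output for a single parsed page might look like. The field names are invented for illustration and don't reflect any particular tool's schema.

```python
# Hypothetical LLM-ready output for one parsed page. Field names are
# illustrative only, not the schema of any specific parser.
parsed_page = {
    "page": 4,
    "elements": [
        {"type": "text_block",
         "content": "Patient presented for an annual physical exam..."},
        {"type": "table",
         # Tables are often flattened into rows of text a model can read.
         "content": "Exam: Annual physical | Date: 2023-01-12 | Fee: $150"},
        {"type": "figure",
         "content": "Chart summarizing fees by visit type."},
    ],
}
```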

This problem isn't new, by the way. There are entire industries devoted to ingesting medical bills, restaurant receipts and so on. That's typically done with a vision model fine-tuned to a very specific set of documents. The model for receipts isn't good at medical bills, and vice versa.

The new twist is that RAG often deals with a highly varied set of content. A legal RAG, for example, might need to understand police reports, medical bills and insurance claims. The second twist is that the information needs to be converted into LLM-ready data.

So let's talk about what's out there.

Parsing Strategies: Breakdown of Approaches

Let's examine some common parsing strategies, their strengths, and their limitations using an example of a medical document showcasing exam dates and fees in a table:

1. PyPDF

Image: PyPDF results showing minimal information extracted from the table in the medical document.

PyPDF is a longstanding Python library designed for reading and manipulating PDF files. It can be effective for basic text extraction from simple PDFs, but often struggles with complex layouts, tables, and formatted text. 

PyPDF is best suited for straightforward, text-heavy documents but may lose critical structural information in more intricate PDFs. It doesn't process visual objects like images, charts, graphs and figures.
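As a quick illustration, a minimal text scrape with the pypdf package (the maintained successor to the original PyPDF2 line) looks roughly like this; the file name is a placeholder:

```python
# Minimal pypdf text scrape. Fine for simple, text-heavy PDFs; tables and
# multi-column layouts tend to come out as flattened, unstructured text.
from pypdf import PdfReader

reader = PdfReader("medical_record.pdf")  # placeholder file name
for page_number, page in enumerate(reader.pages, start=1):
    print(f"--- page {page_number} ---")
    print(page.extract_text() or "")
```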

2. Tesseract (OCR)

Image: Tesseract results showing information extracted from the table in the medical document.

Tesseract is an open-source optical character recognition (OCR) engine that can extract text from images and scanned documents. Best known for converting image-based text to machine-readable format, Tesseract can struggle with maintaining document structure, especially in complex layouts or tables. 

It's particularly useful for scanned documents but may require additional post-processing to preserve formatting and structure. Tesseract also doesn't process visual objects like images, charts, graphs and figures.
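A rough sketch of OCR-ing a scanned PDF with Tesseract via the pytesseract and pdf2image wrappers might look like the following; it assumes the tesseract and poppler binaries are installed locally, and the file name is a placeholder:

```python
# Render each PDF page to an image, then OCR it with Tesseract.
# Requires the tesseract binary (for pytesseract) and poppler (for pdf2image).
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("medical_record.pdf", dpi=300)  # placeholder file name
for page_number, image in enumerate(pages, start=1):
    print(f"--- page {page_number} ---")
    print(pytesseract.image_to_string(image))
```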

3. Unstructured

Image: Unstructured results showing rich information extracted from the table in the medical document.

Unstructured is a modern document parsing library that aims to handle a wide variety of document types and formats. It employs a combination of techniques to extract and structure information from documents, including text extraction, table detection, and layout analysis. 

While more robust than traditional parsing tools, Unstructured can still face challenges with highly complex or non-standard document formats. Like the others, it doesn't process visual objects like images, charts, graphs and figures.
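A minimal sketch of partitioning a PDF with Unstructured is below; the "hi_res" strategy and table inference options pull in extra model dependencies, and exact option names can shift between library versions:

```python
# Partition a PDF into typed elements (Title, NarrativeText, Table, ...)
# with unstructured. Options shown here may vary across versions.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="medical_record.pdf",   # placeholder file name
    strategy="hi_res",               # layout-aware parsing
    infer_table_structure=True,      # try to keep tables as tables
)
for element in elements:
    print(element.category, "|", str(element)[:80])
```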

4. LlamaParse

Image: LlamaParse results showing a markdown table of information extracted from the table in the medical document.

LlamaParse is a newer parsing solution developed by the team behind LlamaIndex. It's designed to handle complex document structures, including tables and formatted text, and outputs results in a markdown format that's easily interpretable by language models. 

It has shown promise in preserving document structure and handling tables, though it's a relatively new tool and its full capabilities and limitations are still being explored in real-world applications.
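A hedged sketch of calling LlamaParse (a hosted service that requires a LlamaCloud API key) follows; option names reflect the library at the time of writing and may change:

```python
# Parse a PDF to markdown with LlamaParse. The API key and file name
# are placeholders; result options may differ across versions.
from llama_parse import LlamaParse

parser = LlamaParse(api_key="llx-...", result_type="markdown")
documents = parser.load_data("medical_record.pdf")
for doc in documents:
    print(doc.text[:500])  # markdown output, including reconstructed tables
```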

5. X-Ray by EyeLevel.ai

X-Ray, powered by EyeLevel’s GroundX APIs, takes a multimodal approach to parsing with industry-leading results, especially when parsing complex visuals including charts, graphics and figures. X-Ray is far more than just a table parser.

The X-Ray technology starts with a fine-tuned vision model trained on a million pages of enterprise documents from a wide cross section of industries including health, financial, insurance, legal and government. The system uses the vision model to identify various objects on the page: text blocks, tables, charts and so on. Once the coordinates are known, it extracts the information, chunks it and sends it to different pipelines to be turned into LLM-ready data.

The result is a JSON-like output that includes narrative summaries, providing richer context for language models. X-Ray is available in a demo format for developers to try for themselves, where they can upload a document to the system and see the semantic objects that are created to translate complex visuals to the LLM. You can try X-Ray here.
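For the exact API surface, check EyeLevel's documentation; conceptually, though, output of this shape can be consumed like any other structured parse result. Here is a purely hypothetical sketch of flattening a JSON-like result with narrative summaries into chunks for a RAG store; the field names are invented for illustration and are not X-Ray's actual schema:

```python
# Hypothetical sketch: pairing extracted content with narrative summaries
# and flattening them into RAG-ready chunks. Field names are invented.
xray_style_result = {
    "chunks": [
        {"type": "table",
         "narrative_summary": "Table of exams with dates and fees.",
         "content": "Annual physical, 2023-01-12, $150"},
        {"type": "figure",
         "narrative_summary": "Bar chart comparing fees across visit types.",
         "content": "Annual physical $150; Follow-up $75"},
    ]
}

rag_chunks = [
    f"{chunk['narrative_summary']}\n{chunk['content']}"
    for chunk in xray_style_result["chunks"]
]
print(rag_chunks)
```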

Performance Impact: The Parsing Difference

Our tests, along with academic research, show that parsing strategy can significantly impact RAG performance. 

We're talking about substantial gains, as Daniel Warfield, co-host of RAG Masters, points out:

"For some examples, there's a 10%, even a 20% difference in performance."

This is crucial when you consider the effort that goes into other optimization strategies:

"People are doing crazy advanced strategies for the difference in 5, 6, 7, even 10 percent performance. And then maybe just completely switching the parser might get you a massive performance increase."

Error Analysis: Common Parsing Pitfalls

Let's examine some common parsing errors and their downstream effects:

  1. Table Misinterpretation: When parsers fail to correctly identify table structures, it can lead to data being treated as unstructured text. This can result in incorrect answers in question-answering tasks, especially for queries about tabular data.
  2. Loss of Formatting: If a document structure isn't well understood, a text scrape could scramble the pieces up. A header could wind up in body copy. A column label could wind up in the rows of data. You get the parsing equivalent of scrambled eggs.
  3. Image Handling: Most parsers struggle with embedded images or diagrams, either ignoring them completely or misinterpreting them as text through OCR.
  4. Header/Footer Confusion: Parsers might incorrectly include headers and footers as part of the main content, potentially skewing the context of the extracted information.

Developing Custom Parsing Strategies

For developers dealing with specific document types or domains, developing custom parsing strategies can be beneficial. Here are some approaches:

  1. Combining Existing Tools: Use multiple parsing tools in tandem, leveraging the strengths of each for different parts of your documents.
  2. Regular Expressions: Implement custom regex patterns to extract specific types of information consistently found in your documents (a short sketch combining this with a standard parser follows the list).
  3. Domain-Specific Rules: Incorporate rules based on domain knowledge to improve parsing accuracy for specialized documents.
  4. Machine Learning Augmentation: Train models to recognize and extract specific patterns or structures in your documents.
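As a small illustration of the first two ideas, here is a hedged sketch that combines a standard text scrape (pypdf) with a domain-specific regex; the pattern and file name are placeholders:

```python
# Combine a standard pypdf text scrape with a custom regex, here pulling
# dollar amounts out of the extracted text. Pattern and file are placeholders.
import re
from pypdf import PdfReader

reader = PdfReader("medical_record.pdf")
full_text = "\n".join((page.extract_text() or "") for page in reader.pages)

fee_pattern = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")
print(fee_pattern.findall(full_text))  # e.g. ['$150.00', '$75.00']
```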

Integration Challenges

When integrating parsing strategies into existing RAG pipelines, developers often face several challenges:

  1. API Compatibility: Ensure that the chosen parsing strategy can be easily integrated with your existing codebase and infrastructure.
  2. Data Format Consistency: The output of your parser should be in a format that's compatible with the rest of your RAG pipeline, often requiring additional preprocessing steps.
  3. Scalability: Consider the computational resources required by different parsing strategies, especially when dealing with large document sets.
  4. Error Handling: Implement robust error handling to deal with parsing failures or unexpected document formats.
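As a sketch of that last point, one common pattern is to wrap the primary parser in a try/except and fall back to OCR, logging failures rather than crashing the ingestion job. The parser choices here are illustrative:

```python
# Try the primary parser first; fall back to OCR and log the failure
# instead of aborting the whole ingestion run. Parser choices are illustrative.
import logging
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

def parse_with_fallback(path: str) -> str:
    try:
        reader = PdfReader(path)
        text = "\n".join((page.extract_text() or "") for page in reader.pages)
        if text.strip():
            return text
        raise ValueError("empty extraction, falling back to OCR")
    except Exception as exc:
        logging.warning("primary parse failed for %s: %s", path, exc)
        images = convert_from_path(path, dpi=300)
        return "\n".join(pytesseract.image_to_string(img) for img in images)
```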

Best Practices for Selecting a Parsing Strategy

We recommend taking a two-pronged approach to selecting the right parsing strategy:

1. Visual Inspection: Start by running your documents through different parsers and examining the output (a minimal comparison sketch follows these two steps). As Warfield advises:

"Pass your data through a bunch of parsers and look at them. Your brain is still the most powerful model that exists."

2. End-to-End Testing: Once you've narrowed down your options, conduct thorough end-to-end testing. This means running your entire RAG pipeline with different parsing strategies and evaluating the final output.
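For the first step, a minimal comparison loop like the one below, assuming the parsers covered earlier, makes side-by-side eyeballing easy; the file name is a placeholder:

```python
# Run the same document through several parsers and print the first chunk
# of each result for manual review. Parser set and file name are illustrative.
from pypdf import PdfReader
from unstructured.partition.pdf import partition_pdf

def pypdf_text(path):
    return "\n".join((page.extract_text() or "") for page in PdfReader(path).pages)

def unstructured_text(path):
    return "\n".join(str(el) for el in partition_pdf(filename=path))

for name, parse_fn in [("pypdf", pypdf_text), ("unstructured", unstructured_text)]:
    print(f"===== {name} =====")
    print(parse_fn("medical_record.pdf")[:1000])  # first 1,000 characters
```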

To quantitatively compare parsing strategies, consider the following metrics:

  • Accuracy of table and graphics extraction
  • Preservation of document structure
  • Ability to turn extractions into LLM-friendly data
  • Speed of parsing
  • Consistency across different document types
  • Ability to handle complex formatting
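A toy harness for the quantitative side might time each parser and check whether a known fact from the source table survives extraction; the parsers, file name, and expected string below are placeholders:

```python
# Time each parser and check whether a known table value survives extraction.
# Parsers, file name, and the expected fact are placeholders for illustration.
import time
from pypdf import PdfReader
from unstructured.partition.pdf import partition_pdf

EXPECTED_FACT = "$150"  # a fee known to appear in the document's table

def evaluate(name, parse_fn, path="medical_record.pdf"):
    start = time.perf_counter()
    text = parse_fn(path)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s, fact preserved: {EXPECTED_FACT in text}")

evaluate("pypdf", lambda p: "\n".join((pg.extract_text() or "") for pg in PdfReader(p).pages))
evaluate("unstructured", lambda p: "\n".join(str(el) for el in partition_pdf(filename=p)))
```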

The Challenge of Evaluation

Here's the rub: evaluating parsing quality is still a largely manual process. Creating question-answer pairs for evaluation is labor-intensive but crucial for building automated tooling. The need for human evaluation in parsing cannot be completely eliminated, at least not yet.

This presents a significant opportunity in the field, and this post will be updated when a sufficiently advanced solution for automated parsing evaluation emerges.

Conclusion

As we continue to push the boundaries of what's possible with RAG applications, it's clear that document parsing will remain a critical component. The field is ripe for innovation, particularly in parsing technology and evaluation methods.

For developers building RAG applications, it’s critical not to overlook the importance of parsing. Take the time to evaluate different parsing strategies and their impact on your specific use case. It could be the difference between a RAG system that merely functions and one that excels.

Remember, in the world of RAG, your system is only as good as the data you feed it. And that all starts with parsing.

You can watch the full episode of RAG Masters here:
