Retrieval Augmented Generation (RAG) is a critical tool in modern large language model (LLM) powered applications. But like LLMs themselves, RAG systems are prone to hallucination - making up incorrect information in response to user questions. Accuracy is now a critical measure of RAG and LLM performance.
In this article, we put three RAG platforms to the test: two of the most popular tools in use today, LangChain paired with a Pinecone vector database and LlamaIndex (which ships with its own vector store), alongside a relevant newcomer, GroundX from EyeLevel.ai, a complete RAG system accessed by API.
The results might surprise you. In a head-to-head battle for accuracy, with 92 questions over more than 1,000 pages of complex tax documents from Deloitte, GroundX greatly outperformed the others, beating LangChain/Pinecone by 53% and LlamaIndex by 120%.
GroundX: 97.83%
LangChain / Pinecone: 64.13%
LlamaIndex: 44.57%
But there's more to the story: why is there such a large performance difference, and how do these systems handle textual, tabular and graphical data for RAG?
Let's dig in.
Why Do We Need RAG Anyway?
RAG exists for two key reasons:
- To allow you to inject new information into a pre-trained language model
- To minimize hallucination by simplifying problems
RAG lets you inject information into a language model so that the model can use that information to answer a user's question.
This allows for custom, tailored LLM applications that can provide a high degree of context-specific value to your company or users.
RAG also exists to reduce the frequency of “hallucination.” If a language model happens to get something wrong, it will typically double down and justify that response, resulting in confusing, unhelpful or sometimes dangerous responses. These strange outputs are called hallucinations.
When high-quality, grounded information is injected into a language model, the model is more likely to reference that data than to draw its own conclusions. Thus, the difficult task of reasoning can be replaced with the much easier task of referencing.
This is where a lot of intuition around RAG ends, but there are some big caveats to RAG that can affect its performance in real-world applications.
RAG Isn’t Magic, It’s Hard
The core idea of RAG is pretty straightforward. It consists of three key parts (a minimal sketch follows the list below):
- Retrieval: Based on the query from the user, retrieve relevant knowledge from a set of documents.
- Augmentation: Combine the retrieved information with the user query to construct a prompt.
- Generation: Pass the augmented prompt to a large language model, generating the final output.
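To make these three steps concrete, here is a minimal sketch of a retrieve-augment-generate loop in Python. This is not the code of any platform tested below; the toy chunk list, model names and prompt wording are purely illustrative, and it assumes the openai and numpy packages with an OPENAI_API_KEY set in the environment.

```python
import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

# A toy "document store": in a real system these chunks come from parsed PDFs.
chunks = [
    "The branch tax rate in Belgium is 25%.",
    "Treaty withholding tax rates for dividends vary by jurisdiction.",
    "Employees may receive certain tax-free reimbursements for travel costs.",
]

def embed(texts):
    """Turn a list of strings into embedding vectors (one row per string)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vectors = embed(chunks)

# 1. Retrieval: embed the query and rank chunks by cosine similarity.
query = "What is the branch rate in Belgium?"
q = embed([query])[0]
scores = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
best_chunk = chunks[int(np.argmax(scores))]

# 2. Augmentation: combine the retrieved chunk with the user's query.
prompt = f"Answer using only this context:\n{best_chunk}\n\nQuestion: {query}"

# 3. Generation: pass the augmented prompt to the language model.
completion = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}]
)
print(completion.choices[0].message.content)
```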
But these steps are easier said than done. Many documents in the real world contain complex formatting, with complicated figures and dense tables interwoven within the text, meaning that extracting the correct information to answer a user's query can be difficult. On top of that, the way retrieved data is presented to the model can have a big impact on performance.
Presenting the correct information to an LLM in a confusing way can cause the model to hallucinate. Scale can also present issues: it's easier to find a needle in a small haystack than in a 100-acre field. Highly performant search against a large corpus is critical to success in many use cases.
There are a variety of popular approaches to RAG, all of which handle the three steps of RAG somewhat differently. In the next section we’ll describe a test designed to directly compare RAG solutions.
The Test
This test compares the performance of three RAG approaches:
LangChain/Pinecone: LangChain is a popular Python library designed to abstract common LLM workflows, and Pinecone is a vector database designed to work well with embeddings. Together, LangChain/Pinecone (LCPC) forms a popular RAG stack.
LlamaIndex: LlamaIndex (LI) is another popular RAG platform that allows for local embedding manipulation. It includes a document parser and built-in RAG functionality.
EyeLevel’s GroundX: GroundX is a feature-complete retrieval engine offered as a service, providing the tools and services necessary to build robust retrieval engines. GroundX can provide out-of-the-box or tailored retrieval solutions depending on customer need and document complexity, and it can be readily applied to RAG.
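To give a sense of what the straightforward setup tested below looks like for the first of these stacks, here is a rough sketch of a LangChain/Pinecone ingestion-and-query pipeline. It is not the exact benchmark code (that's linked later in this article); the file name, index name and chunking parameters are assumptions, import paths vary by LangChain version, and it assumes OPENAI_API_KEY and PINECONE_API_KEY are set in the environment.

```python
# pip install langchain-community langchain-text-splitters langchain-openai langchain-pinecone pypdf
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore

# Load a PDF (one Document per page) and split it into ~1,000-character chunks.
pages = PyPDFLoader("deloitte_tax_highlights.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(pages)

# Embed the chunks and upsert them into an existing Pinecone index.
vectorstore = PineconeVectorStore.from_documents(
    chunks, embedding=OpenAIEmbeddings(), index_name="rag-benchmark"
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Retrieve the top chunks, stuff them into a prompt, and generate with GPT-4.
question = "What is the branch tax rate in Belgium?"
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
answer = ChatOpenAI(model="gpt-4").invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```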
The test we’ll be discussing consists of 1,146 total pages of public-facing PDFs from the tax consultancy Deloitte. The chosen documents represent the complexity RAG systems face in the real world: rich with diagrams, tables and complex visual formatting. Engineers trying to build RAG systems for the legal, medical and financial industries – to name a few – are well aware of the problems real documents present.
Along with these documents, 92 questions were posed. The context necessary to answer these questions resides in text, figures, and tables throughout the documents. We tested the three RAG platforms against these questions, and judged each of their responses as correct or incorrect based on the content of the source documents. In doing so, we created a direct comparison of the performance of RAG approaches on real world documents.
Each of these questions was categorized as one of the following:
- Textual Questions: questions which require referencing from text within a document.
- Tabular Questions: questions which require referencing from a table within a document.
- Graphical Questions: questions which require referencing from a figure within a document.
The questions were evenly distributed, with roughly a third of the 92 questions in each category.
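The judging itself was done by humans against the source documents, but tallying accuracy by system and question type is simple. Here is a small sketch, assuming a hypothetical results.csv with one row per judged answer and columns system, category and correct:

```python
import csv
from collections import defaultdict

# Tally judged answers per (system, category) pair from a hypothetical results file.
totals, correct = defaultdict(int), defaultdict(int)
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = (row["system"], row["category"])
        totals[key] += 1
        correct[key] += row["correct"].strip().lower() == "true"

# Print accuracy per system and question type.
for key in sorted(totals):
    print(f"{key[0]:20s} {key[1]:10s} {100 * correct[key] / totals[key]:6.2f}%")
```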
Importantly, in each case we tested the most straightforward setup of the technologies, mimicking what most RAG developers will experience out of the box. We recognize that additional libraries or more complex strategies could change the results for each of these systems. We plan to test those ideas in a future publication.
Because EyeLevel’s GroundX employs some services, while LangChain/Pinecone and LlamaIndex are both frameworks, it makes more sense to view this test as a comparison of end-user experience rather than a direct technical comparison. All resources necessary to run your own version of this experiment can be found here.
The Results
In applying our three RAG approaches to our 92 questions, we found the following results. Each approach uses GPT-4 for completion, and all have access to the same documents, meaning the only difference between approaches is the RAG system.
GroundX: 97.83%
LangChain / Pinecone: 64.13%
LlamaIndex: 44.57%
As can be seen, when applied to real-world documents there is a wide disparity in performance across approaches. The elephant in the room is GroundX’s heightened performance, besting LangChain/Pinecone by 53% and LlamaIndex by 120% in relative terms (97.83% accuracy is roughly 53% higher than 64.13% and roughly 120% higher than 44.57%). GroundX preserves this performance advantage across all of the question types tested, but really pulls away when dealing with tabular and graphical data, which are typically much harder for LLMs to understand.
Accuracy of the tested RAG solutions by problem type
The reason there is a disparity in performance, and the reason GroundX exists, are one and the same: when dealing with real-world documents, edge cases are commonplace, and even simple questions become difficult because of the innate difficulty of parsing documents, preserving context, and preparing that context in a form an LLM can easily interpret.
If you’re interested in digging into the results yourself, you can find all the relevant information here, including code, documents, and tabulated results. Let's take a look at some specific examples to understand the realities of RAG.
Example 1: Extracting Information From Tables
In this example, the same question was posed to all three RAG approaches: “What is the branch rate in Belgium?” Looking at one of the documents in the test set (link), there is one clear and obvious answer: “25%.”
The answers from our three RAG approaches for this question were as follows:
GroundX: The branch tax rate in Belgium is 25%.
LangChain/Pinecone: The branch rate in Belgium is not explicitly stated in the provided context. If you need information on the corporate tax rates for branches in Belgium, it would be best to consult the latest tax guidelines from the Belgian tax authorities or seek advice from a tax professional.
LlamaIndex: The provided context information does not include details about the branch rate in Belgium. The information focuses on individual income tax for employees, treaty withholding tax rates for dividends, interest, and royalties, and specific conditions under which certain reimbursements and costs can be tax-free. To find the branch rate in Belgium, one would need to consult additional resources or sections of the tax code that specifically address the taxation of branches.
As can be seen, RAG powered by GroundX delivered the correct answer, while LlamaIndex and LangChain/Pinecone failed to retrieve the correct information.
LangChain/Pinecone (LCPC) and LlamaIndex (LI) both use cosine similarity to choose which pieces of text are relevant to a user's query. Chunks of text from the documents are embedded, turning them into vectors, and the user's query is embedded as well, turning it into a vector. If a document vector points in a similar direction to the query vector, both LCPC and LI call that a relevant chunk.
Cosine similarity is a popular strategy, but it has a big limitation: it’s often tripped up by superficial semantic similarity. Basically, cosine similarity can get confused when two pieces of text look alike on the surface. Because many of the documents discuss tax topics, it’s likely that LCPC and LI both got confused and simply returned chunks that were irrelevant to the prompt.
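For reference, cosine similarity just measures the angle between two embedding vectors. Here is a minimal sketch of the computation and of the failure mode described above; the three-dimensional vectors are made-up placeholders, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = vectors point the same way (very similar); 0.0 = orthogonal (unrelated).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two chunks that both "talk about tax rates" can land close together in
# embedding space, even though only one of them answers the query.
query_vec               = np.array([0.9, 0.1, 0.2])
chunk_about_branch_rate = np.array([0.8, 0.2, 0.3])
chunk_about_withholding = np.array([0.7, 0.3, 0.4])
print(cosine_similarity(query_vec, chunk_about_branch_rate))  # ~0.98
print(cosine_similarity(query_vec, chunk_about_withholding))  # ~0.93
```

Both scores are high and close together, which is exactly the problem: superficially similar chunks can outrank the one that actually contains the answer.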
GroundX breaks input documents into objects that contain critical semantic components, then searches against those components; a strategy that’s fundamentally different from vector-based approaches and allows for more specific search. These objects contain re-written elements from the original document, designed to optimize retrieval.
Example 2: Extracting Information From Graphics
In this example, the same question was posed to all three RAG approaches: “What is the Cost of capital for Private equity funds?” The question is answered directly by this document, which contains a figure clearly showing that the answer is “25%+”.
The answers from our three RAG approaches for this question were as follows:
GroundX: The cost of capital for private equity funds ranges from 25% and above.
LangChain/Pinecone: The provided context does not specify the exact cost of capital for private equity funds. The cost of capital can vary widely depending on various factors such as the risk profile of the investment, market conditions, and the specific strategies of the private equity fund. If you need information on the typical cost of capital for private equity funds, it would be best to consult financial reports or market analyses that focus on private equity specifically.
LlamaIndex: The cost of capital for Private equity funds is in the range of SOFR+600 to 1,000.
As can be seen, RAG powered by GroundX delivered the correct answer, while LlamaIndex and LangChain/Pinecone failed to generate correct responses. It appears that LCPC failed in a mode similar to the previous example; it simply failed to retrieve the correct information. LlamaIndex, however, gave an incorrect answer that appears to use the wrong data from the right diagram.
To answer this question correctly, simply extracting all the text and feeding it into a model won't work. Spatial information has to somehow be preserved so that the language model can reason about the figure correctly. GroundX employs a fine-tuned vision model that extracts high-level document components and then summarizes them, allowing GroundX to work well with documents containing visual information.
Example 3: Extracting Information From Text
In this example, the same question was posed to all three RAG approaches: “on my tax max 2023, how did the definition of a ‘Plant’ change?” The correct answer, “A plant changed from an intangible asset to any asset specified by the Minister,” is found inside this document.
The answers from our three RAG approaches for this question were as follows:
GroundX: In the text provided, the definition of "plant" for the purpose of Corporate Tax changed in the following way:
Current Definition (effective for the Year of Assessment 2023):
"Plant" means an apparatus used by a person for carrying on his business but does not include:
- a building;
- an intangible asset; or
- any asset used and that functions as a place within which a business is carried on.
Proposed Definition:
- "Plant" means an apparatus used by a person for carrying on his business but does not include:
- a building;
- any asset used and that functions as a place within which a business is carried on; or
- any asset specified by the Minister.
The change in the definition includes the addition of a category: "any asset specified by the Minister." This means that along with a building and an asset that functions as a place within which a business is carried on, any other asset specified by the Minister also does not qualify as a "plant" for tax deduction purposes.
LangChain / Pinecone: Based on the context provided, there is no indication that the definition of "plant" has changed in the tax context for the year 2023. The current definition as per Paragraph 70A, Schedule 3, states that "plant" means an apparatus used by a person for carrying on his business but does not include:
- a building;
- an intangible asset; or
- any asset used and that functions as a place within which a business is carried on.
There is no proposed change mentioned in the context provided regarding the definition of "plant" for tax purposes. If there were any changes, they are not included in the information you've given.
LlamaIndex: The definition of "plant" did not change in the provided context. It consistently refers to an apparatus used by a person for carrying on his business but does not include a building, an intangible asset, or any asset used and that functions as a place within which a business is carried on. Additionally, it excludes any asset specified by the Minister.
As can be seen, RAG powered by GroundX delivered the correct, though rather verbose, answer, while LlamaIndex and LangChain/Pinecone failed to generate correct responses.
It appears that GroundX managed to directly reference the relevant text, while LCPC and LI both failed to recognize that the definition of a plant changed, though they both seem to have retrieved some relevant information.
GroundX’s success is likely due to a more robust contextualization system. GroundX re-writes document content in a way that improves both retrieval and generation, allowing for the right reference to be retrieved more often, and for that reference to be properly conveyed to the language model for accurate generation of the final output.
Example 4: An Example of GroundX Doing Worse
While GroundX consistently outperformed LangChain/Pinecone and LlamaIndex, it wasn’t all roses. In this example, the same question was posed to all three RAG approaches: “Are Telcos an example of open data or open finance?” Looking at one of the documents in the test set (link), there is one clear and obvious answer: “Open Data.”
The answers from our three RAG approaches for this question were as follows:
GroundX: Telcos, referring to telecommunications companies, are not explicitly mentioned in the context of open data or open finance within the provided excerpts. However, in the broader scope of open data initiatives, telecommunications data could potentially be part of an open data ecosystem if such data is made accessible and shareable in a manner consistent with open data principles. In the provided content, the sectors mentioned in relation to open data ecosystems include banking, finance, health, energy, pharmaceuticals, mobility, infrastructure, and natural resources. Would you like more detailed information on a specific aspect of open data or open finance?
LangChain / Pinecone: Based on the context provided, there is no specific mention of telecommunications companies (telcos) being directly involved in Open Data or Open Finance initiatives. Open Data generally refers to the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents, or other mechanisms of control. Open Finance, on the other hand, is a subset of Open Data, focusing specifically on the financial services sector and involves sharing consumers' financial data with their consent, typically through APIs, to create a more integrated and competitive financial ecosystem.
The context provided discusses Open Data in the financial sector and other industries like healthcare and education, as well as the benefits of such systems in terms of financial inclusion, competition, and innovation. It also mentions countries like Australia, Colombia, and Estonia, which have made strides in implementing Open Data frameworks across various sectors.
However, without additional information, it is not possible to determine whether telcos are part of Open Data or Open Finance initiatives in the context provided. Telcos could potentially be involved in Open Data initiatives if they share non-sensitive data that can be used to foster innovation or improve services. In some regions, telcos might also participate in Open Finance by providing access to consumer data for credit scoring or identity verification, but this would be specific to the regulations and practices within those jurisdictions. If you are looking for information on whether telcos are involved in Open Data or Open Finance in a specific country or context, additional details would be needed.
LlamaIndex: Telcos are an example of open data.
In this example, GroundX and LangChain/Pinecone didn’t do very well, while LlamaIndex got the correct answer.
Interestingly, GroundX does provide some useful information relevant to the question. It’s important to note that RAG still relies on a language model which, having been trained on a vast corpus of information, has encyclopedic knowledge of many domains. Depending on how publicly available the information behind your questions is, a RAG system might produce useful and helpful answers simply due to the generative abilities of the language model.
Conclusion
When choosing a platform to build RAG applications, engineers have many choices, including some we didn’t profile in this test but will do so in later publications. There are many attributes for engineers to consider: ease of use, security, portability, and the strength of the developer community, to name a few. But for many applications, accuracy will rank as the most important.
In this head-to-head test on complex documents that mimic real-world applications, GroundX came out far ahead, beating LangChain/Pinecone and LlamaIndex by wide margins in all three categories we tested.
Is this the last word? Certainly not. We intend to test additional RAG platforms such as Amazon Q and GPTs. We also intend to test more complex configurations of the three platforms tested here. We will also test a variety of document types. How will these platforms do on medical, legal, insurance, government and other data? Stay tuned. We’ll find out.
Try it Yourself
Want to take EyeLevel's GroundX APIs for a spin? Create an account at www.groundx.ai.
We've also released all the data used for this test and our source code for GroundX, LangChain/Pinecone and LlamaIndex here. Let us know what you find.