Learn how to accurately evaluate RAG systems that work with embedded tables. This tutorial covers the essential steps for precise RAG evaluation.
In our previous post, we explored how Retrieval-Augmented Generation (RAG) systems can encounter hallucination issues and how DynamoEval can accurately and effectively diagnose the source of these errors. When RAG systems generate responses, the retrieved document may be plain text or some other format. Notably, when the retrieved text contains a table, large language models (LLMs) may be more prone to hallucination because tables are structured differently from prose. Moreover, queries over tables often require computational operations, which are harder for LLMs. For instance, on WikiTableQuestions (WTQ), a standard benchmark for tabular question answering, the state-of-the-art model exhibits an error rate of 32.69 percent. Despite this fragility, dedicated RAG evaluation solutions for pipelines that involve tabular data are lacking. DynamoEval aims to address this gap with an offering tailored to evaluating RAG systems that deal with tabular data.
For this post, we will focus on evaluating RAG systems where the retrieved document is a table, and the generated response may involve some logical/computational reasoning over the content in the table. For instance, consider RAG systems interacting with financial documents with tables, like the consolidated balance sheets from page 30 of Apple’s 10-K report.
Users may be interested in simple look-up queries like "What is the total current assets of AAPL at the end of September 2023? Respond in millions." or operation-focused queries like "By what percentage did the deferred revenue increase/decrease in September 2023 compared to September 2022? Round to the first decimal place." When the model generates responses like "$143,566 million" and "Increased by 1.9%" for these queries respectively based on the balance sheet, the evaluator should identify them as accurate and faithful; responses like "$135,405 million" and "Increased by 1.3%", on the other hand, should be flagged as incorrect and unfaithful. For this particular example, DynamoEval does this accurately, while existing evaluation solutions fail on one or more of these questions.
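To make the arithmetic behind the operation-focused query concrete, here is a minimal sketch of the check an evaluator implicitly has to perform. The deferred revenue figures, the `pct_change` and `is_faithful` helpers, and the tolerance are assumptions introduced for illustration; the figures should be verified against the filing itself, and this is not how DynamoEval is implemented.

```python
def pct_change(old: float, new: float, ndigits: int = 1) -> float:
    """Percentage change from old to new, rounded to ndigits decimals."""
    return round((new - old) / old * 100, ndigits)


# Assumed balance-sheet values in millions of USD (verify against the 10-K).
deferred_revenue = {"2022": 7_912, "2023": 8_061}

# Reference answer recomputed directly from the table cells.
expected = pct_change(deferred_revenue["2022"], deferred_revenue["2023"])


def is_faithful(claimed_pct: float, tolerance: float = 0.05) -> bool:
    """A claimed percentage is faithful if it agrees with the table-derived value."""
    return abs(claimed_pct - expected) <= tolerance


print(expected)          # value recomputed from the table
print(is_faithful(1.9))  # the response the post treats as correct
print(is_faithful(1.3))  # the response the post treats as incorrect
```

Under these assumed inputs the recomputed change rounds to 1.9%, which agrees with the response the post treats as faithful.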
In the next sections, we will explore methods to enhance the evaluation of two critical aspects: the relevance of the retrieved tables and the faithfulness of the generated responses.
By addressing these two key areas, DynamoEval aims to improve the diagnosis of RAG systems when dealing with tabular data. Throughout the post we use a series of test datasets derived from a standard tabular QA dataset, WikiTableQuestions (WTQ), with some manual cleaning, curation, and augmentation. Each curated dataset consists of queries, contexts, responses, and ground-truth binary labels indicating the quality (good/bad) of the contexts and responses for retrieval and faithfulness evaluation. The evaluators classify these contexts and responses, and we measure their performance with accuracy, precision, and recall against the ground-truth labels.
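As a reminder of how these three metrics are derived from binary labels, the following sketch scores an evaluator's verdicts against ground truth. Treating "good" (encoded as 1) as the positive class is an assumption, and the label lists are made-up illustrative values rather than results from the post.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall with 1 ("good") as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Of everything the evaluator called "good", how much really was good?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything that really was good, how much did the evaluator find?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical ground-truth labels and evaluator verdicts (1 = good, 0 = bad).
ground_truth = [1, 1, 0, 0, 1, 0]
verdicts = [1, 0, 0, 1, 1, 0]
print(binary_metrics(ground_truth, verdicts))
```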
We compare DynamoEval’s ability to evaluate retrieval relevance and faithfulness against existing RAG evaluation solutions like RAGAS, LlamaIndex Evaluators, and Tonic Validate. We also test a multi-modal evaluation with an image input as a baseline: instead of feeding the table content as text, we convert the table to an image and feed it to a vision-language model (VLM) like GPT-4-vision.
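For contrast with the image baseline, here is one way a table might be fed to an evaluator as text. The post does not say which textual serialization is used, so the pipe-delimited layout, the `table_to_markdown` helper, and the sample table are all assumptions made for illustration.

```python
from typing import List, Sequence


def table_to_markdown(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Serialize a table as pipe-delimited text suitable for an LLM prompt."""
    for row in rows:
        if len(row) != len(header):
            raise ValueError("every row must have one cell per header column")

    lines: List[str] = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Hypothetical table in the spirit of a WTQ-style example.
print(table_to_markdown(["Year", "Wins"], [[2019, 10], [2020, 12]]))
```

Keeping the header row and explicit cell delimiters preserves the row and column structure that a flat text dump would lose.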
Because existing RAG evaluation solutions are primarily designed for textual data, we observe that they are not well suited to tasks involving tables when used out of the box, even though they use the same base model, such as GPT-4. DynamoEval, however, demonstrates that prompt optimization can yield significant performance improvements. Factors contributing to this enhancement include chain-of-thought (CoT) prompting and having the evaluator state its decision only after its explanation.
By incorporating these prompt optimization techniques, DynamoEval showcases its ability to significantly enhance the evaluation of RAG systems when dealing with tabular data, surpassing the limitations of existing solutions.
We have observed that evaluation performance varies significantly with the choice of base LLM, even when the same optimized prompts are used. The plot below illustrates the performance of the GPT-3.5 and Mistral Small models on faithfulness evaluation using different versions of the prompts:
The results demonstrate that CoT prompting and stating the decision after the explanation provide a greater benefit to the GPT model than to the Mistral model. However, both models ultimately perform worse than the GPT-4 model discussed earlier.
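The two prompt ingredients named above, CoT reasoning and a decision stated after the explanation, can be sketched as a prompt template plus a parser that reads the final verdict line. The template wording, the `Verdict:` convention, and the helper names are hypothetical; they illustrate the idea and are not DynamoEval's actual prompts.

```python
import re
from typing import Optional

# Hypothetical template: reasoning is requested first, the verdict last.
PROMPT_TEMPLATE = """You are evaluating whether a response is faithful to a table.

Table:
{table}

Query: {query}
Response: {response}

First, reason step by step: locate the relevant cells and redo any calculation.
Then, on the final line, write exactly "Verdict: faithful" or "Verdict: unfaithful".
"""

VERDICT_RE = re.compile(
    r"^Verdict:\s*(faithful|unfaithful)\s*$", re.IGNORECASE | re.MULTILINE
)


def build_prompt(table: str, query: str, response: str) -> str:
    """Fill the template with one evaluation example."""
    return PROMPT_TEMPLATE.format(table=table, query=query, response=response)


def parse_verdict(model_output: str) -> Optional[bool]:
    """Return True/False from the last verdict line, or None if none is found."""
    matches = VERDICT_RE.findall(model_output)
    if not matches:
        return None
    return matches[-1].lower() == "faithful"


sample_output = "The table shows 12 wins in 2020, not 10.\nVerdict: unfaithful"
print(parse_verdict(sample_output))
```

Reading the last matching line means the verdict is taken after the model has written out its reasoning, rather than from a decision committed to up front.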
When working with tabular data, it is common to encounter queries that demand more complex operations or logical reasoning over the contents of the table. To better understand how models perform in this scenario, we manually created a dataset based on the WikiTableQuestions (WTQ) dataset, focusing specifically on queries that rely heavily on operations. We evaluate faithfulness on a set of questions that involve various types of operations, including addition, subtraction, variance, standard deviation, counting, averaging, and percentage calculations.
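To illustrate the kinds of operations listed above, the sketch below computes reference answers over a single table column and compares a claimed value against them. The column values, the operation names, and the absolute tolerance are hypothetical; using population rather than sample variance is also an assumption, and that choice is exactly the sort of ambiguity such queries can hide.

```python
import math
import statistics

# Hypothetical "Points" column from a small table.
points = [12, 15, 9, 20, 14]

reference = {
    "sum": sum(points),                                   # addition
    "range": max(points) - min(points),                   # subtraction
    "count_above_12": sum(1 for p in points if p > 12),   # counting
    "mean": statistics.mean(points),                      # averaging
    "variance": statistics.pvariance(points),             # population variance
    "std_dev": statistics.pstdev(points),                 # population std deviation
    "max_share_pct": max(points) / sum(points) * 100,     # percentage
}


def check_claim(operation: str, claimed: float, abs_tol: float = 0.01) -> bool:
    """True if the claimed value matches the table-derived reference value."""
    return math.isclose(claimed, reference[operation], rel_tol=0.0, abs_tol=abs_tol)


for name, value in reference.items():
    print(f"{name}: {value}")
```

A claim such as `check_claim("std_dev", 3.63)` would then be compared against the recomputed value within the chosen tolerance.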
By assessing the models' performance on this curated dataset, we aim to gain insight into their capabilities and limitations when handling more complex queries over tabular data. The figure below shows DynamoEval’s performance compared to other RAG evaluation solutions.
While DynamoEval performs slightly worse than on the previous set of “easier” queries, it still significantly outperforms existing solutions. Preliminary patterns in the failure cases, which will help us further investigate and categorize the queries and tables the evaluator model handles poorly, point to lengthy tables and queries that require multiple logical operations.
Evaluating the performance of RAG systems involving table data presents unique challenges due to the inherent differences between tabular and textual content. Our findings demonstrate that DynamoEval, with its optimized prompting techniques, significantly outperforms existing RAG evaluation solutions in assessing the relevance of retrieved tables and the faithfulness of generated responses. Through our curated datasets based on the WikiTableQuestions (WTQ) benchmark, we have identified key areas where evaluator models may struggle, particularly complex queries involving lengthy tables or multiple logical operations. By better understanding these limitations, we can focus our efforts on developing more robust and reliable diagnostics for RAG systems that handle a wider range of tabular data and query types.
At Dynamo AI, we are committed to helping organizations measure and mitigate RAG hallucination effectively. Our comprehensive RAG evaluation offering provides deep insights into model performance, enabling teams to identify and address weaknesses in their RAG pipelines.
Dynamo AI also offers a range of AI privacy and security solutions to help you build trustworthy and responsible AI systems. To learn more about how Dynamo AI can help you evaluate and improve your RAG pipelines, or to explore our AI privacy and security offerings, please request a demo.