Product
February 19, 2025

Hallucinations Are Custom: Introducing Dynamo AI's LLM Failure Mode Taxonomy

Ilke Kaya
Product Manager

As we’ve worked with large enterprises to productionize AI use cases, we’ve consistently found that today’s hallucination evaluations fail to capture the complexities of real-world LLM failure modes.

Most hallucination evals rely on a broad, one-size-fits-all definition of hallucinations, applying that definition across different industries and use cases. However, enterprises often tell us that the specific hallucinations they actually encounter in the real world are too nuanced and domain-specific to address effectively with a rigid, generic eval.

Our own research confirms this. In our experiments with judge models, we found that models trained on domain-specific data consistently outperform those without it, particularly when identifying unsupported statements within the same domain. In a recent study, we observed up to a 10% increase in F1 score for hallucination detection when using domain-aware models.
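For readers less familiar with the metric: F1 is the harmonic mean of precision and recall over the judge's hallucination labels. Below is a minimal sketch of how it is computed for a binary hallucination-detection judge; the gold labels and predictions are made up purely for illustration.

```python
# Minimal sketch: precision / recall / F1 for a binary hallucination-detection
# judge. The gold labels and predictions below are made up for illustration.

# 1 = statement is unsupported by the retrieved context, 0 = supported
gold  = [1, 0, 0, 1, 1, 0, 1, 0]
preds = [1, 0, 1, 1, 0, 0, 1, 0]

tp = sum(g == 1 and p == 1 for g, p in zip(gold, preds))
fp = sum(g == 0 and p == 1 for g, p in zip(gold, preds))
fn = sum(g == 1 and p == 0 for g, p in zip(gold, preds))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall    = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```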

Policy research also highlights the need for more refined hallucination evaluations. The Bank Policy Institute, in its commentary on AI use in Commodity Futures Trading Commission-regulated markets, emphasizes the importance of tracing how AI models process inputs into outputs. They advocate for both model evaluation, which assesses overall LLM performance, and system evaluation, which focuses on an LLM’s effectiveness within specific use cases.

Despite these needs, the industry still lacks benchmarks or standardized hallucination taxonomies, making hallucinations an unsolved challenge in AI today. Not all hallucinations are the same, and their evaluations must capture these nuances to accurately reflect real-world performance and risk.

Let's demonstrate this with an example:

Imagine a user finds an incorrect fraud alert on a valid purchase and would like to dispute it - a very common scenario in the financial services sector. Let's assume the responses are generated by a RAG system and, for simplicity, that all information required to answer the question correctly is available in the retrieved context.

Context
To dispute your fraud alert, you can contact Mybank. You can dispute online, by mail, or by phone. Mybank will have 30 days to investigate a dispute.

Online

  • You can file a dispute online with mybank.com.
  • mybank.com has its own online interface.


By mail 

  • Download the dispute form from Mybank.
  • Include copies of documents that support your dispute.
  • Mail your dispute letter by certified mail and pay for a “return receipt”.

By phone

  • You can call Mybank to dispute information.

Human:
I found an incorrect fraud alert on a valid purchase. How do I dispute it?

LLM:
I'm sorry, I do not have the required context to answer that question.

The example above is a refusal, despite the LLM actually having the context required to answer the question. Many enterprises we work with define these invalid refusals as hallucinations. Invalid refusals do not disseminate misinformation, but they can erode user trust in the AI system over time.
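One lightweight way to flag this failure mode is to check whether a response looks like a refusal while the retrieved context is actually relevant to the question. The sketch below is illustrative only: the refusal patterns and the lexical-overlap relevance check are stand-ins for the trained judge models used in practice.

```python
import re

# Illustrative refusal patterns; a trained judge model is far more robust.
REFUSAL_PATTERNS = [
    r"i do not have (the )?(required )?context",
    r"i (cannot|can't) answer",
    r"i'?m sorry,? (but )?i (do not|don't) know",
]

def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(re.search(p, text) for p in REFUSAL_PATTERNS)

def context_seems_relevant(question: str, context: str, min_overlap: int = 3) -> bool:
    # Crude lexical-overlap heuristic standing in for a real relevance judge.
    q_terms = {w for w in re.findall(r"[a-z]+", question.lower()) if len(w) > 3}
    c_terms = set(re.findall(r"[a-z]+", context.lower()))
    return len(q_terms & c_terms) >= min_overlap

def is_invalid_refusal(question: str, context: str, response: str) -> bool:
    # Refusing even though the context could have answered the question.
    return looks_like_refusal(response) and context_seems_relevant(question, context)
```

Applied to the exchange above, this flags the response: it matches a refusal pattern while the context plainly covers how to dispute the fraud alert.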

Human:
I found an incorrect fraud alert on a valid purchase. How do I dispute it?

LLM:
To dispute an account, you’ll need to file a formal written dispute with Mybank, either online, by mail or phone. Once submitted, the bank is required to investigate the dispute within 30 days. However, if the account is more than 10 years old, the bank must remove it immediately without you needing to do anything. If the account is under 10 years old, the dispute investigation may take longer, possibly up to 60 days.

The example above contains statements that directly contradict each other: the first says there is a legal requirement to investigate within 30 days, while the second says the investigation may take up to 60 days. An attentive reader might catch this, but it is still very misleading. These types of hallucinations lead to experiences where the AI system is not helpful at all.
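Intra-answer contradictions like this can often be surfaced with an off-the-shelf natural language inference (NLI) model by scoring sentence pairs within the answer. The sketch below uses the public `roberta-large-mnli` checkpoint via the `transformers` library as a generic illustration (the 0.8 threshold is arbitrary); it is not our production judge.

```python
import torch
from itertools import combinations
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Generic sketch: flag sentence pairs within a single answer that an NLI
# model labels as contradictory. Not a production judge implementation.
MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

answer_sentences = [
    "Once submitted, the bank is required to investigate the dispute within 30 days.",
    "The dispute investigation may take longer, possibly up to 60 days.",
]

for premise, hypothesis in combinations(answer_sentences, 2):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    label = model.config.id2label[int(probs.argmax())]
    if label == "CONTRADICTION" and probs.max() > 0.8:
        print(f"Possible self-contradiction:\n  - {premise}\n  - {hypothesis}")
```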

Human:  
I found an incorrect fraud alert on a valid purchase. How do I dispute it?

LLM:
To dispute an account, you’ll need to file a formal written dispute with Mybank, either online, by mail or phone. If the account is found to be inaccurate, it will be removed from your report. However, keep in mind that accounts older than 5 years will automatically drop off your report, even without a dispute, as per the "Fraud Prevention Act."

In this example, on the other hand, the LLM provides a completely inaccurate fact about accounts older than 5 years automatically dropping off your report, citing a law that does not exist. These types of hallucinations can have very serious real-world consequences, both for end users and for enterprises.
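Fabricated facts like the nonexistent "Fraud Prevention Act" are typically caught with a groundedness check: each claim in the answer is tested against the retrieved context, and claims the context does not support are flagged. Again, a generic sketch using the same public NLI model as above, not our actual judge:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

context = (
    "To dispute your fraud alert, you can contact Mybank. You can dispute online, "
    "by mail, or by phone. Mybank will have 30 days to investigate a dispute."
)
claims = [
    "You can dispute the fraud alert with Mybank online, by mail, or by phone.",
    "Accounts older than 5 years automatically drop off your report under the Fraud Prevention Act.",
]

for claim in claims:
    # Premise = retrieved context, hypothesis = claim extracted from the answer.
    inputs = tokenizer(context, claim, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    entailment = float(probs[model.config.label2id["ENTAILMENT"]])
    status = "supported" if entailment > 0.5 else "UNSUPPORTED"
    print(f"{status:>11} ({entailment:.2f}) {claim}")
```

In practice, this is where domain-specific training matters most: generic models miss subtle, domain-specific unsupported statements, which is exactly the gap our domain-aware judge models are built to close.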

These three simple examples demonstrate the following:

  1. While each is considered a hallucination, their severity differs greatly (see the taxonomy sketch after this list). Evaluation systems today miss this nuance, and once that nuance is lost, evaluations no longer reflect the real-world consequences of deploying AI systems.
  2. Detecting different types of hallucinations effectively requires different techniques and judge models.
  3. The root causes behind different types of hallucinations, and therefore the remediations for them, are often very different as well.
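To make the distinctions concrete, here is one way a failure-mode taxonomy can be represented; the categories mirror the three examples above, while the severity labels and remediation notes are illustrative placeholders rather than our official taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class FailureMode(Enum):
    INVALID_REFUSAL = "invalid_refusal"        # refuses despite sufficient context
    SELF_CONTRADICTION = "self_contradiction"  # answer contradicts itself
    FABRICATION = "fabrication"                # asserts facts absent from the context

@dataclass
class FailureModePolicy:
    mode: FailureMode
    severity: str      # illustrative placeholder: "low" / "medium" / "high"
    detection: str     # technique typically used to detect it
    remediation: str   # typical direction for remediation

TAXONOMY = [
    FailureModePolicy(FailureMode.INVALID_REFUSAL, "medium",
                      "refusal classifier plus context-relevance judge",
                      "relax over-cautious prompts and refusal thresholds"),
    FailureModePolicy(FailureMode.SELF_CONTRADICTION, "medium",
                      "pairwise NLI over answer sentences",
                      "answer re-ranking and consistency checks"),
    FailureModePolicy(FailureMode.FABRICATION, "high",
                      "claim-level groundedness check against retrieved context",
                      "stronger grounding instructions, retrieval fixes, guardrails"),
]
```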

At Dynamo AI, we've been deeply exploring this challenge and are excited to introduce a new solution that empowers customers to evaluate and safeguard their AI systems with highly nuanced hallucination criteria.

Our platform offers a flexible approach, allowing users to apply a set of pre-built hallucination criteria or define their own—using nothing but natural language. Once a custom criterion is defined, it seamlessly integrates into evaluations and guardrails, functioning just like our out-of-the-box policies to ensure AI reliability at scale.
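For illustration only, a natural-language criterion might read like the snippet below; the structure and field names here are hypothetical and are not our SDK's actual API.

```python
# Hypothetical illustration of a natural-language hallucination criterion.
# Field names are invented for this post and are not Dynamo AI's actual API.
custom_criterion = {
    "name": "invalid-refusal",
    "definition": (
        "Flag any response that declines to answer when the retrieved context "
        "contains the information needed to answer the user's question."
    ),
    "severity": "medium",
}
```

Once defined, a criterion like this behaves like any pre-built policy: it can score evaluation runs and moderate production traffic.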

DynamoEval

Enables customers to stress test their models and understand system quality and risk in the pre-deployment phase. These results are used in an iterative process where teams work on remediations based on root causes of detected hallucinations, and then evaluate again until preset quality and risk criteria are met.

DynamoGuard

Enables customers to guardrail their models once deployed to production and moderate incoming and outgoing messages. Customers can monitor performance and risk in real time, and continuously align the hallucination judge models on production data with humans in the loop.
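Conceptually, guardrailing production traffic follows a wrap-the-model pattern like the sketch below. Every function here is a hypothetical stand-in to show the flow of moderating incoming and outgoing messages; it is not the DynamoGuard API.

```python
# Conceptual sketch of guardrailing a RAG chat endpoint. All functions are
# hypothetical stand-ins, not the DynamoGuard API.

def check_input(user_message: str) -> list[str]:
    # Stand-in input moderation, e.g. topical and policy screening.
    return []

def call_llm(user_message: str, context: str) -> str:
    # Stand-in for the underlying RAG generation step.
    return f"Draft answer grounded in: {context[:60]}..."

def check_output(draft: str, context: str) -> list[str]:
    # Stand-in output moderation; in practice this is where hallucination
    # criteria (invalid refusal, contradiction, fabrication) are enforced.
    return []

def guarded_chat(user_message: str, context: str) -> str:
    if check_input(user_message):
        return "Sorry, I can't help with that request."
    draft = call_llm(user_message, context)
    violations = check_output(draft, context)
    if violations:
        # Flagged traffic can be routed to human review and used to
        # continuously realign the judge models.
        return "I couldn't produce a reliable answer. Please contact support."
    return draft
```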

More about Dynamo AI:

At Dynamo AI, we are committed to helping organizations measure and manage hallucinations effectively. Our comprehensive custom hallucinations offering provides deep insights into model performance and enables teams to guardrail their AI systems in production.

We also offer a range of AI privacy and security solutions to help you build trustworthy and responsible AI systems. To learn more about how Dynamo AI can help you evaluate and mitigate hallucinations in your AI systems, or to explore our AI privacy and security offerings, please request a demo.
