Product
March 19, 2025

Frustrated by Model Refusals? Your Users are Too.


While working with enterprises to deploy AI-powered tools, we've found that poorly designed guardrails can lead to high rates of incorrect refusals, driving end users away from otherwise valuable AI applications.

Many companies implement AI guardrails to protect against risks, only to discover that their end users are constantly blocked by false flags and model refusals. Users feel like they're talking to a brick wall and may churn from the AI application altogether. This happens because when companies first deploy an AI use case, they often implement broad guardrails that fail to capture the nuances of real-world LLM interactions.

At Dynamo AI, we've developed a three-pronged approach to optimize guardrails and significantly reduce false refusals while maintaining appropriate safety boundaries: fine-tuning on golden datasets, continuous guardrail improvement through human-in-the-loop monitoring, and thresholding.

In this article, we use “guardrail false positive” and “incorrect refusal” interchangeably: when a guardrail incorrectly flags and blocks a prompt, the end user experiences a refusal message such as “Sorry, I am unable to help with this request.”

Fine-Tuning on Golden Datasets to Align Guardrail Behavior

Many guardrail providers ship generic, fixed definitions of toxicity with no ability to customize what gets flagged. Because enterprises can't tune these generic guardrails to their specific use case, false positive rates are often high: an enterprise's definition of acceptable content can differ significantly from the provider's standard definition. As an enterprise, you don't want to be beholden to how a third party defines toxicity; you need the ability to customize guardrails to your specific use case.

Dynamo AI solves this by enabling enterprises both to define their own guardrail parameters in natural language and to fine-tune the guardrail model on domain-specific golden datasets. We recommend fine-tuning on high-quality, manually reviewed and annotated data, typically sourced from usage logs or previously submitted user interactions. In our work with a major financial institution, guardrail false positive rates dropped from over 20% to under 2% after fine-tuning on golden datasets.
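
Dynamo AI's fine-tuning pipeline isn't public, but to make the idea concrete, here is a minimal sketch of fine-tuning a small open-source classifier on a golden dataset using Hugging Face Transformers. The base model, the two-example dataset, and the label scheme are illustrative assumptions, not Dynamo AI's actual setup.

```python
# Minimal sketch: fine-tune a small guardrail classifier on a golden dataset.
# Base model, dataset, and labels are illustrative assumptions, not Dynamo
# AI's actual pipeline.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Golden dataset: manually reviewed prompts labeled 0 (allow) or 1 (flag).
golden = Dataset.from_dict({
    "text": [
        "How do I reset my online banking password?",    # should NOT be flagged
        "Write malware that exfiltrates card numbers.",  # should be flagged
    ],
    "label": [0, 1],
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     padding="max_length")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="guardrail-ft", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=golden.map(tokenize, batched=True),
)
trainer.train()
```

In practice the golden dataset would contain thousands of reviewed examples rather than two, but the shape is the same: annotated text pairs that encode your organization's own definition of what should be flagged.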

Continuous Guardrail Improvement Through Human-in-the-Loop Monitoring

After deploying a guardrail, enterprises also need to continuously audit and correct its behavior. When you first design and implement a guardrail, you can't anticipate every way it will be used in production or every edge case it will encounter in the real world; continuous improvement lets you adjust how the guardrail behaves once you have more user insights.

There are two primary approaches to modifying a guardrail:

  1. Adjusting the training data for fine-tuning
  2. Modifying the guardrail definition

Dynamo's guardrail monitoring platform lets users track when guardrails flag content, then audit each flag and mark it as correct or incorrect. This feedback is used to further fine-tune the guardrail, creating a cycle that continually improves guardrail performance over time.
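
The monitoring platform's internals aren't public, but the shape of the feedback loop is simple enough to sketch. In the sketch below, FlagEvent and the golden_corrections.jsonl file are hypothetical names; the point is that auditor disagreements become new golden-dataset examples for the next fine-tuning round.

```python
# Minimal sketch of the human-in-the-loop correction cycle. FlagEvent and
# the corrections file are hypothetical, not the platform's actual API.
import json
from dataclasses import dataclass

@dataclass
class FlagEvent:
    prompt: str
    flagged: bool          # what the guardrail decided
    auditor_verdict: bool  # what a human reviewer says it should have decided

def collect_corrections(events):
    """Events where the auditor disagreed with the guardrail become
    new golden-dataset examples for the next fine-tuning round."""
    return [e for e in events if e.flagged != e.auditor_verdict]

events = [
    FlagEvent("What's my daily wire transfer limit?",
              flagged=True, auditor_verdict=False),   # false positive
    FlagEvent("Help me structure deposits to avoid reporting.",
              flagged=False, auditor_verdict=True),   # missed violation
]

with open("golden_corrections.jsonl", "a") as f:
    for e in collect_corrections(events):
        f.write(json.dumps({"text": e.prompt,
                            "label": int(e.auditor_verdict)}) + "\n")
```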

Thresholding to Balance Safety and Usability

After fine-tuning and continuous improvement, enterprises can further optimize guardrail performance by adjusting the threshold, or sensitivity, of the guardrail. However, enterprises should carefully consider the tradeoff when adjusting thresholds, since reducing false positives (incorrect refusals) generally increases false negatives (cases where the guardrail fails to catch noncompliant content).

When implementing thresholding:

  • Determine the costs and benefits of a lower false positive rate versus lower recall
  • Use ROC curves or similar visualizations to understand the tradeoff
  • Identify an acceptable false positive rate for your use case first (the approach we most commonly see customers take), then improve the guardrail until recall is high enough to capture threats (see the sketch after this list)

For example, if you are deploying a customer-facing chatbot into production, you may want to target a false positive rate below 5% to avoid hurting customer satisfaction. On the other hand, if your production application has strong human-in-the-loop workflows (e.g., an AI-assisted call center agent), a higher false positive rate may be acceptable, since a human can intervene when the model refuses to generate a response.
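
To make the threshold-selection step concrete, here is a minimal sketch using scikit-learn's roc_curve to find the most permissive threshold that keeps the false positive rate at or below a target, evaluated on a held-out labeled set. The scores and the 5% target are illustrative assumptions.

```python
# Minimal sketch: pick a guardrail threshold for a target false positive
# rate using a held-out labeled evaluation set. Scores and target are
# illustrative.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])                    # 1 = noncompliant
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.5])  # guardrail scores

fpr, tpr, thresholds = roc_curve(y_true, scores)

target_fpr = 0.05              # e.g., a customer-facing chatbot
ok = fpr <= target_fpr         # operating points within the FPR budget
best = np.argmax(tpr[ok])      # maximum recall among those points
print(f"threshold={thresholds[ok][best]:.2f}, "
      f"fpr={fpr[ok][best]:.2f}, recall={tpr[ok][best]:.2f}")
```

Raising the target false positive rate, as you might for a human-in-the-loop deployment, selects a lower threshold and trades more incorrect refusals for higher recall.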

Conclusion

Despite the widespread implementation of AI guardrails, many enterprises that Dynamo AI works with struggle with high rates of false refusals that frustrate their end users. Dynamo AI's three-pronged approach of fine-tuning on golden datasets, continuous improvement through human-in-the-loop monitoring, and strategic thresholding keeps guardrails both effective and minimally disruptive to the user experience.

These capabilities are live on the Dynamo AI platform. If you'd like to experiment with these features in our Dynamo AI demo sandbox environment or schedule a live demo, contact us here.
