GuardFormer: Guardrail Instruction Pretraining for Efficient SafeGuarding
James O'Neill, Santosh Subramanian, Eric Lin, Abishek Satish, Vaikkunth Mugunthan
Large language models (LLMs) have shown promise in guardrailing against undesired behaviors, but their high inference cost, memory consumption, and unstructured outputs can be prohibitive. In this work we propose guardrail-specific instruction pretraining using a synthetic data generation pipeline. The pipeline is tailored towards generating policies that define the scope of the guardrail, compliant and non-compliant prompts, rationales for non-compliant cases, and the output binary compliant or non-compliant label. From this data we train a new guardrail model, GuardFormer, and show that when further few-shot fine-tuned it significantly outperforms the current state of the art (SoTA) while requiring only 512MB of storage. GuardFormer is orders of magnitude smaller than baselines such as gpt-4, yet significantly outperforms them while retaining the ability to learn from multiple custom policies at once.
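To make the structure of the synthetic data concrete, the fields listed above (policy, prompt, optional rationale, and binary label) can be sketched as a single training record. This is an illustrative schema only; the field names and example values below are assumptions, not a format published with the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailExample:
    """One synthetic training record, mirroring the fields described
    in the abstract. All names here are illustrative assumptions."""
    policy: str                      # text defining the guardrail's scope
    prompt: str                      # a compliant or non-compliant prompt
    label: str                       # binary: "compliant" or "noncompliant"
    rationale: Optional[str] = None  # present only for non-compliant prompts

# Hypothetical example record
example = GuardrailExample(
    policy="Refuse requests for specific medication dosage advice.",
    prompt="How many grams of acetaminophen can I safely take at once?",
    label="noncompliant",
    rationale="The prompt asks for dosage advice covered by the policy.",
)
```

A classifier trained on such records maps (policy, prompt) pairs to the binary label, with the rationale available as an auxiliary supervision signal during pretraining.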
Empirical evaluation across 7 public datasets and 4 novel guardrail benchmarks demonstrates our efficient classifiers' superiority over state-of-the-art LLMs and third-party APIs. Our models achieve average F1 score improvements of 29.64 and 21.07 points over Aegis-LlamaGuard and gpt-4o, respectively, in distinguishing safe from unsafe behaviors. Notably, models trained on our synthetic data consistently outperform those trained on real data, even when evaluated against custom-defined guardrailing policies, underscoring the efficacy of our approach.