Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs

Blazej Manczak, Eric Lin, Francisco Eiras, James O'Neill, Vaikkunth Mugunthan

Abstract

  • Medical LLMs are entering clinical use, yet reliability under multi-turn interactions remains poorly understood.
  • Existing benchmarks test single-turn Q&A, missing real clinical complexity with follow-up questions, conflicting info, and authority pressure.
  • We introduce MedQA-Followup: a quality-filtered dataset with multiple follow-ups per question to measure deep robustness of language models.
  • Key finding: indirect context is MORE harmful than direct authority. Claude Sonnet 4.5 drops from 93.9% to 25.5% under RAG-style context.
  • Local/open models(GPT-OSS, MedGemma) often deployed for privacy in healthcare show varying robustness and require careful evaluation before deployment.

Result 1: Multi-Turn Robustness Across Models

Accuracy (%) showing Baseline, Average across selected follow-ups, and Worst-case intervention (RAG context).

Method

Two-axis taxonomy: Shallow vs Deep robustness, Direct vs Indirect interventions.

Follow Up Templates

In a second turn, the model gets one template below (abbreviated), then an instruction to reconsider independently and finalize its answer.

Result 2: RAG Context Vulnerability

RAG-style context is the most harmful intervention across all models.

More Details (see the paper!)

  • More models & follow-ups: Full results across 15+ models and 12 intervention types
  • Compounding effect: 2+ follow-ups cause Gemini to partially recover; Claude degrades further to 8.7%
  • Length & domain analysis: 10-sentence contexts cause 2x larger drops; Cardiology/Neurology most vulnerable
  • Dataset: 1,050 MedQA questions with LLM-generated follow-ups on HuggingFace 🤗

Limitations & Future Work

  • Mitigations unexplored: Consistency checks, multi-agent verification, prompt hardening
  • No real RAG tested: Our context is synthetic; real retrieval systems may be worse
  • English only: Non-English medical QA may show different vulnerability patterns

Implementation & Interactive Results

Dataset

Have a use case in mind?
Get in touch.

Contact Us