Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs

Blazej Manczak, Eric Lin, Francisco Eiras, James O'Neill, Vaikkunth Mugunthan

Medical LLMs are entering clinical use, yet reliability under multi-turn interactions remains poorly understood.
Existing benchmarks test single-turn Q&A, missing real clinical complexity with follow-up questions, conflicting info, and authority pressure.
We introduce MedQA-Followup: a quality-filtered dataset with multiple follow-ups per question to measure deep robustness of language models.
Key finding: indirect context is MORE harmful than direct authority. Claude Sonnet 4.5 drops from 93.9% to 25.5% under RAG-style context.
Local/open models(GPT-OSS, MedGemma) often deployed for privacy in healthcare show varying robustness and require careful evaluation before deployment.

‍

Accuracy (%) showing Baseline, Average across selected follow-ups, and Worst-case intervention (RAG context).

‍

Two-axis taxonomy: Shallow vs Deep robustness, Direct vs Indirect interventions.

‍

In a second turn, the model gets one template below (abbreviated), then an instruction to reconsider independently and finalize its answer.

‍

RAG-style context is the most harmful intervention across all models.

‍

More models & follow-ups: Full results across 15+ models and 12 intervention types
Compounding effect: 2+ follow-ups cause Gemini to partially recover; Claude degrades further to 8.7%
Length & domain analysis: 10-sentence contexts cause 2x larger drops; Cardiology/Neurology most vulnerable
Dataset: 1,050 MedQA questions with LLM-generated follow-ups on HuggingFace 🤗

‍

Mitigations unexplored: Consistency checks, multi-agent verification, prompt hardening
No real RAG tested: Our context is synthetic; real retrieval systems may be worse
English only: Non-English medical QA may show different vulnerability patterns

‍