Meta just dropped a paper that solves a problem we all know too well: AI models that either answer unsafe questions or refuse to help with perfectly reasonable ones.
Their solution? Train two AI agents to work together.
The results are striking. Unsafe replies drop from 39% to 4.6%. Needless refusals fall from 45.3% to 9.9%. And general capabilities stay intact.
This is WaltzRL, a new approach to AI safety that treats alignment as teamwork instead of a single-player game.
The Problem? Guardrails That Kill Helpfulness
Current safety systems are blunt instruments. They see potential risk and hit the reject button. The entire response gets blocked, even if 95% of it was valid.
This creates two failures: models generate unsafe content when attacked (jailbreaks work), and models refuse harmless requests that look risky ("How do I kill a Python process?").
Adding more guardrails makes this worse. When Meta's team added Llama Guard to their baseline model, overrefusal jumped from 25.7% to 29.8%.
If you start with a model that already has low overrefusal, adding guardrails hurts even more. Their single-model RL baseline had 8.6% overrefusal. After adding guardrails: 14.9%. That's a 6.3 percentage point increase.
Traditional guardrails don't solve the safety-helpfulness trade-off. They just move the slider toward "say no more often."
The Solution - Two Agents Dancing Together
WaltzRL uses two specialized models working in tandem.
The conversation agent writes responses to user prompts. It's optimized to be helpful and safe.
The feedback agent reviews those responses. When it spots problems, either unsafe content or unnecessary refusal, it suggests specific fixes.
Here's the key insight: the feedback agent doesn't just flag problems. It explains what to change and why. This rich feedback helps the conversation agent learn faster and correct course without throwing away entire responses.
The system uses one round of feedback per response in its experiments. The conversation agent writes an initial answer. If the feedback agent detects issues, it provides guidance. The conversation agent then writes a revised response incorporating that feedback.
At runtime, feedback only triggers when needed. On general helpfulness queries, the feedback trigger rate is just 6.7%. Even on challenging safety and over-refusal benchmarks, it stays below 50%. This keeps latency manageable.
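Here's a minimal sketch of that inference flow. The two agents and their generate/review methods are hypothetical stand-ins for the fine-tuned models; the paper doesn't specify this exact API.

```python
# Sketch of the WaltzRL inference loop: draft, review, revise once if needed.
# `conversation_agent` and `feedback_agent` are assumed wrappers around two
# fine-tuned models; method names here are illustrative, not the paper's API.

def waltzrl_respond(prompt: str, conversation_agent, feedback_agent) -> str:
    # 1. The conversation agent drafts an initial answer.
    draft = conversation_agent.generate(prompt)

    # 2. The feedback agent decides whether feedback is needed at all
    #    (unsafe content or unnecessary refusal). On most benign queries
    #    this trigger stays off, which keeps latency manageable.
    needs_feedback, advice = feedback_agent.review(prompt, draft)
    if not needs_feedback:
        return draft

    # 3. One revision round: the conversation agent rewrites its answer
    #    with the concrete suggestions in context.
    return conversation_agent.generate(prompt, feedback=advice)
```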
How? Reinforcement Learning with a Twist
Both agents train together through reinforcement learning. But they get rewarded differently.
Conversation agent reward: It only gets a positive reward when the response is both safe AND not over-refusing. One without the other doesn't count.
Feedback agent reward: This is where it gets clever. They use a Dynamic Improvement Reward (DIR).
The feedback agent gets rewarded based on whether its advice actually improves the conversation agent's following response. If the revised answer is better than the original, the feedback agent gets credit. If the revision makes things worse, it gets penalized.
This creates a positive-sum game. Both agents win when they collaborate well. The feedback agent learns to give advice that the conversation agent can actually use.
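As a rough sketch of that reward design, assuming binary safety and overrefusal judgments and simplifying DIR to a before/after difference (the paper's exact formulation differs in its details):

```python
# Illustrative reward signals; judge labels and scores are assumptions.

def conversation_reward(is_safe: bool, is_overrefusal: bool) -> float:
    # Positive only when the response is safe AND not over-refusing;
    # one without the other earns nothing.
    return 1.0 if (is_safe and not is_overrefusal) else 0.0

def feedback_reward_dir(score_before: float, score_after: float) -> float:
    # Dynamic Improvement Reward (simplified): the feedback agent is paid
    # by how much its advice improved the revised response. A better
    # revision earns credit; a worse one is penalized.
    return score_after - score_before
```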
The training happens in two stages.
Stage 1: Freeze the conversation agent. Train only the feedback agent to recognize problems and provide helpful feedback. This builds accurate detection before moving forward.
Stage 2: Train both agents together. The feedback agent's label reward is disabled, but the improvement reward stays active. This prevents overfitting to imbalanced data while maintaining accuracy.
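A compressed sketch of that two-stage schedule, with the RL machinery abstracted into hypothetical rollout_fn and update_fn callables (e.g. a PPO-style update); none of these names come from the paper.

```python
# Two-stage training schedule, in outline.

def train_waltzrl(conv_agent, fb_agent, rollout_fn, update_fn,
                  steps_stage1: int, steps_stage2: int) -> None:
    # Stage 1: conversation agent frozen; only the feedback agent learns,
    # with both its label reward (did it detect the problem?) and the
    # improvement reward active.
    conv_agent.freeze()
    for _ in range(steps_stage1):
        rollouts = rollout_fn(conv_agent, fb_agent)
        update_fn(fb_agent, rollouts, use_label_reward=True)

    # Stage 2: both agents train jointly; the label reward is switched off
    # to avoid overfitting to imbalanced data, while the Dynamic Improvement
    # Reward keeps rewarding advice that actually helps.
    conv_agent.unfreeze()
    for _ in range(steps_stage2):
        rollouts = rollout_fn(conv_agent, fb_agent)
        update_fn(conv_agent, rollouts)
        update_fn(fb_agent, rollouts, use_label_reward=False)
```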
Why This Works Better Than Alternatives
The researchers tested several baselines.
Inference-time collaboration (no training): Just using two instances of the base model already reduced both unsafe responses and overrefusals. But feedback was triggered excessively, and the results weren't optimal.
Oracle feedback: They gave the system ground-truth labels converted to template feedback, such as "avoid unsafe content" or "don't overrefuse." This helped with safety (attack success rate, ASR, dropping from 26.5% to 7.0%) but barely dented overrefusal (overrefusal rate, ORR, from 25.7% to 16.6%).
Why? Asking a model to be less cautious requires convincing it with reasoning, not just instructions. Detailed feedback matters, especially for reducing overrefusal.
Single-model RL: Training just the conversation agent improved safety but not as much as WaltzRL. And it didn't benefit from adaptive feedback at inference time.
WaltzRL beats all of them on both metrics simultaneously.
The Numbers That Matter
Tested across five diverse datasets.
Safety under attack
- WildJailbreak: 39.0% to 4.6% ASR
- FORTRESS harmful: 40.4% to 6.2% ASR
- StrongREJECT: 0.0% to 0.3% ASR
Overrefusal on benign prompts
- OR-Bench: 45.3% to 9.9% ORR
- FORTRESS benign: 6.0% to 5.4% ORR
General capabilities: Minimal degradation across AlpacaEval, IFEval, GPQA, MMLU, and TruthfulQA, even though they used zero helpfulness prompts during training.
That last part is essential. WaltzRL trains only on adversarial attacks and borderline overrefusal cases. No general helpfulness data. Yet instruction-following and knowledge stay intact.
What Makes This Different From Debate
AI safety through debate involves agents competing in zero-sum games. One agent attacks, one defends. A higher reward for one means a lower reward for the other.
WaltzRL is collaborative. Both agents pursue the same goal: safe, non-overrefusing responses. It's positive-sum, not zero-sum.
And unlike debate approaches that train multiple agents but deploy only one, WaltzRL deploys both agents together at inference time. An attacker has to jailbreak both agents to succeed.
The Emergent Behavior
Something interesting emerged during training: the feedback agent started directly quoting ideal responses.
Instead of just saying "make it safer," it would generate an outline or even complete sentences that the conversation agent should use. The conversation agent learned to follow this guidance.
This wasn't explicitly programmed. It emerged from the Dynamic Improvement Reward. The feedback agent discovered that specific, concrete suggestions work better than vague instructions.
What This Means
WaltzRL pushes forward the Pareto frontier between safety and helpfulness. You can have both.
The key insight is treating alignment as collaboration, not control. Two specialized models working together outperform one model trying to do everything.
Traditional guardrails are gatekeepers. They say yes or no to entire responses.
WaltzRL is an editor. It looks at what you wrote and suggests improvements.
That difference, between blocking and refining, unlocks better results on both safety and helpfulness.
The paper is open research from Meta. All experiments use Llama 3.1-8B-Instruct as the base model for both agents.
Future work could explore training generalist feedback agents that work off-the-shelf with different conversation models. Or expanding beyond one round of feedback to multi-turn refinement.
For now, WaltzRL shows a clear path forward: if you want AI systems that are both safe and helpful, teach two agents to dance together instead of making one agent walk a tightrope alone.
Paper: The Alignment Waltz: Jointly Training Agents to Collaborate for Safety (arxiv.org/abs/2510.08240)
Authors: Jingyu Zhang, Hongyuan Zhan, and team at Meta Superintelligence Labs