TARS: Training Adaptive Reasoners for Safety

Reasoning as an Adaptive Defense for Safety

Carnegie Mellon University

Overview

TARS Main Figure

What is TARS?

TARS is an online RL training recipe we built to train models that reason for safety. Models trained with TARS exhibit adaptive behavior, spending more compute on ambiguous queries and achieving better safety-refusal trade-offs. They also learn internal representations that better distinguish safe from unsafe prompts, and they are more robust to both white-box (e.g., GCG) and black-box (e.g., PAIR) attacks. Overall, TARS is an effective, open recipe for training LLMs to defend against jailbreaks and harmful requests by reasoning about each prompt.

Why do we build TARS?

While reasoning, or test-time compute, has been shown to improve safety, it remains unclear what the best practices are or what general recipe yields reasoning models with strong safety and low over-refusal. Key questions include: How should we design the training data? Should we use SFT or RL? What reward functions encourage generalization rather than shortcuts such as refusing every prompt? To address this, we create an online reinforcement learning recipe with three design choices (ingredients) and train Qwen 2.5 1.5B Instruct into a reasoning model with a strong safety-refusal trade-off. The ablations that led to these design choices are in our paper.

Results

TARS has a better safety-refusal trade-off than existing open-weight models and defenses such as Circuit Breakers. It is also more effective than other training methods such as SFT, DPO, and RL without reasoning.

TARS Recipe

We identify three critical design choices: (1) a "lightweight" warm-start SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as excessive refusals, and (3) a reward function design that prevents degeneration of reasoning capabilities during adversarial training.

Ingredient 1: Lightweight SFT

During the SFT warm-up stage before online RL training, we find that training lightly, with early stopping and a low learning rate, increases generation diversity and yields better exploratory behavior during RL, which in turn improves the safety-refusal trade-off after online RL.
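As a concrete (hypothetical) illustration, a lightweight warm-start with early stopping and a low learning rate might look like the sketch below, using Hugging Face transformers; the dataset file, hyperparameters, and early-stopping settings are placeholders, not the exact values used for TARS.

```python
# Minimal sketch of a "lightweight" SFT warm-start: low learning rate, few epochs,
# and early stopping so the model keeps generation diversity for later RL.
# Dataset path, hyperparameters, and column names are illustrative placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, EarlyStoppingCallback,
                          DataCollatorForLanguageModeling)
from datasets import load_dataset

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder dataset with a "text" column of reasoning-style safety demonstrations.
data = load_dataset("json", data_files="warmstart_sft.jsonl")["train"].train_test_split(0.05)
tokenized = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
                     remove_columns=data["train"].column_names)

args = TrainingArguments(
    output_dir="tars-sft-warmstart",
    learning_rate=1e-6,              # deliberately low: avoid over-fitting the SFT traces
    num_train_epochs=2,              # stop early rather than training to convergence
    per_device_train_batch_size=4,
    eval_strategy="steps",           # recent transformers; older versions use evaluation_strategy
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```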

Ingredient 2: Mixing Prompts

Training solely on harmful prompts with a safety reward during RL leads to degenerate reasoning traces and over-refusal on harmless prompts. Thus, we mix in harmless prompts with a task-completion reward to encourage reasoning that carries over to harmful prompts.
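A minimal sketch of how prompts from different pools could be mixed at each RL step is shown below; the pool names, mixing ratios, and placeholder prompts are illustrative assumptions, not the exact proportions used for TARS.

```python
import random

# Illustrative prompt mixing for each online RL step: harmful, harmless, and
# ambiguous pools are sampled together so the policy is also rewarded for
# completing benign tasks and cannot learn the "refuse everything" shortcut.
def sample_prompt_batch(pools, batch_size,
                        mix=(("harmful", 0.4), ("harmless", 0.4), ("ambiguous", 0.2))):
    """Sample a mixed batch; the mix ratios are placeholders, not the paper's values."""
    labels, weights = zip(*mix)
    types = random.choices(labels, weights=weights, k=batch_size)
    return [{"prompt": random.choice(pools[t]), "type": t} for t in types]

# Example usage with tiny placeholder pools (real pools would hold thousands of prompts).
pools = {
    "harmful":   ["<clearly harmful request>"],
    "harmless":  ["<benign task, e.g., summarize an article>"],
    "ambiguous": ["<dual-use or borderline request>"],
}
batch = sample_prompt_batch(pools, batch_size=8)
```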

Ingredient 3: Reward Design

Splitting the reward model into separate safety and helpfulness rewards increases exploration, leading to a wider safety-refusal trade-off.
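One plausible way to instantiate the split reward is sketched below; the judge functions, score ranges, and the handling of ambiguous prompts are assumptions of this sketch rather than the exact reward design used for TARS.

```python
# Illustrative decoupled reward: safety and task completion are scored by separate
# judges, and the reward used for a rollout depends on the prompt type, rather than
# collapsing everything into a single preference score.
def reward_for_rollout(example, response, safety_judge, helpfulness_judge):
    if example["type"] == "harmful":
        # Harmful prompts: reward safe handling (e.g., a reasoned refusal).
        return safety_judge(example["prompt"], response)       # assumed score in [0, 1]
    # Harmless (and, in this sketch, ambiguous) prompts: reward actually completing
    # the task, which penalizes blanket refusals and keeps reasoning from degenerating.
    return helpfulness_judge(example["prompt"], response)      # assumed score in [0, 1]
```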

How effective is TARS?

Best Safety-Refusal Trade-off!

TARS Results

Even better than Circuit Breakers!

TARS vs Open-Source Models Safety Comparison

We compare TARS-trained models with other training methods (SFT, DPO, and RL without reasoning) on the safety-refusal trade-off. For a fair comparison, we train on the same prompts with a similar amount of compute. We evaluate safety on HarmBench, averaged across four attacks (GCG, PAIR, AutoDAN, PAP), and over-refusal on XSTest. As shown above, models trained with TARS achieve the best safety-refusal trade-off: they defend well against jailbreak attacks while remaining helpful on harmless prompts without over-refusing.
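For concreteness, the two axes of this trade-off can be summarized as in the sketch below; `model`, `run_attack`, `is_harmful_completion`, and `is_refusal` are placeholders standing in for the actual HarmBench and XSTest evaluation pipelines.

```python
# Sketch of the two trade-off axes: average attack success rate (ASR) on HarmBench
# across four attacks, and refusal rate on XSTest's safe prompts.
ATTACKS = ["GCG", "PAIR", "AutoDAN", "PAP"]

def average_attack_success_rate(model, harmbench_prompts, run_attack, is_harmful_completion):
    """Lower is safer: fraction of attacked prompts that yield a harmful completion."""
    per_attack = []
    for attack in ATTACKS:
        successes = 0
        for prompt in harmbench_prompts:
            adv_prompt = run_attack(attack, model, prompt)      # attack-specific jailbreak
            completion = model.generate(adv_prompt)             # placeholder generation call
            successes += int(is_harmful_completion(prompt, completion))
        per_attack.append(successes / len(harmbench_prompts))
    return sum(per_attack) / len(per_attack)

def refusal_rate(model, xstest_safe_prompts, is_refusal):
    """Lower is more helpful: fraction of benign XSTest prompts the model refuses."""
    refusals = sum(int(is_refusal(model.generate(p))) for p in xstest_safe_prompts)
    return refusals / len(xstest_safe_prompts)

# A point on the safety-refusal trade-off is then (1 - ASR, refusal rate),
# with the best models in the high-safety, low-refusal corner.
```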

We also compare TARS-trained models against Llama 3.1 8B Instruct, Llama 3.2 1B Instruct, and Circuit Breakers models (Llama RR and Mistral RR, both 7-8B). For Circuit Breakers, we retrain from the base model after removing XSTest from the training data to prevent data contamination. TARS-trained models achieve a better safety-refusal trade-off than the Llama models. Among the Circuit Breakers models, Llama RR is comparable to TARS but slightly worse on the safety-refusal trade-off, while Mistral RR has high refusal rates; we found that Mistral RR learns to output gibberish as an overcautious defense. Our comparison also highlights that a smaller model (1.5B) can be safer and more helpful than larger models (e.g., Llama RR at 8B) when trained to reason with TARS.

Does TARS show adaptive behavior?

TARS also distributes test-time compute across prompts of different complexity. We evaluate our TARS-trained model on SORRY-Bench, which categorizes prompts by their complexity, i.e., how clearly harmful the prompt is.

We observe that reasoning length varies by prompt type, indicating that the model adapts its reasoning to the nature of the query. For instance, reasoning is shortest for "Hate Speech Generation", a clearly harmful category, and longest for more ambiguous categories like "Potentially Unqualified Advice". Looking at generations shown in our paper, a hate speech prompt yields a brief 245-token response that quickly references internal knowledge before refusing. In contrast, a prompt asking for advice on removing a driver-assistance system results in a much longer response (593 tokens) that reasons through legal implications, the need for professional intervention, the responsibilities of the driver-assistance system, and even possible user needs such as customization.

Topic Group                          Reasoning Length (tokens)   Answer Length (tokens)
Hate Speech Generation               289.88                      165.18
Assistance with Crimes or Torts      306.01                      249.07
Potentially Inappropriate Topics     371.67                      316.39
Potentially Unqualified Advice       456.66                      608.88
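A minimal sketch of how such per-group length statistics could be computed is shown below; it assumes the reasoning trace is wrapped in <think>...</think> tags and that generations are stored as simple records, both of which are assumptions of the sketch rather than details stated here.

```python
from collections import defaultdict
from transformers import AutoTokenizer

# Sketch: average reasoning/answer token lengths per SORRY-Bench topic group.
# The <think> tag format and the record schema are assumptions of this sketch.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

def split_reasoning_and_answer(generation):
    """Split a generation into its reasoning trace and final answer (assumed tag format)."""
    if "<think>" in generation and "</think>" in generation:
        reasoning = generation.split("<think>", 1)[1].split("</think>", 1)[0]
        answer = generation.split("</think>", 1)[1]
    else:
        reasoning, answer = "", generation
    return reasoning, answer

def per_group_lengths(records):
    """records: iterable of {"group": str, "generation": str} for SORRY-Bench prompts."""
    sums = defaultdict(lambda: [0, 0, 0])      # [reasoning tokens, answer tokens, count]
    for r in records:
        reasoning, answer = split_reasoning_and_answer(r["generation"])
        stats = sums[r["group"]]
        stats[0] += len(tokenizer.encode(reasoning, add_special_tokens=False))
        stats[1] += len(tokenizer.encode(answer, add_special_tokens=False))
        stats[2] += 1
    return {g: (s[0] / s[2], s[1] / s[2]) for g, s in sums.items()}
```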

Why is TARS effective?

To understand why TARS achieves a strong safety-refusal trade-off, distinct from SFT, DPO, and standard RL, we examine how models internally represent harmful and harmless prompts. Prior work shows that internal separation of these prompts correlates with safety behavior. We investigate whether similar distinctions emerge in TARS between harmless "ambiguous" prompts and attack prompts. We extract 2D UMAP projections of final-layer embeddings on XSTest "safe" prompts and GCG attack prompts, and quantify their separation by fitting a soft-margin SVM (C=0.1).
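A minimal sketch of this analysis is shown below; the last-token pooling, UMAP settings, and checkpoint name are assumptions of the sketch, not the exact setup used for the figures.

```python
import numpy as np
import torch
import umap                                  # umap-learn
from sklearn.svm import SVC
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: embed XSTest "safe" prompts and GCG attack prompts with the final hidden
# layer, project to 2D with UMAP, and measure class separation as the geometric
# margin 2/||w|| of a soft-margin linear SVM with C=0.1.
model_name = "Qwen/Qwen2.5-1.5B-Instruct"    # placeholder; use the TARS-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True).eval()

@torch.no_grad()
def final_layer_embedding(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states[-1]           # (1, seq_len, d_model)
    return hidden[0, -1].float().numpy()                 # last-token pooling (assumed)

def separation_margin(safe_prompts, attack_prompts):
    X = np.stack([final_layer_embedding(p) for p in safe_prompts + attack_prompts])
    y = np.array([0] * len(safe_prompts) + [1] * len(attack_prompts))
    X2d = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
    svm = SVC(kernel="linear", C=0.1).fit(X2d, y)
    return 2.0 / np.linalg.norm(svm.coef_)               # width of the SVM margin
```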

As shown in the figure below, TARS yields the largest margin between harmful and ambiguous prompts, suggesting that some of its better adaptivity comes from internal representations. Although the only difference between TARS and RL (without reasoning) is the reasoning itself, the prompt embeddings, extracted before any reasoning is generated, are better separated. This indicates that TARS-trained models, more so than SFT- or DPO-trained ones, develop internal representations that anticipate refusal decisions before a full chain-of-thought is generated. We hypothesize that training for more helpful reasoning also strengthens the representations formed while processing the prompt, since all model weights are updated.

2D UMAP projections of final-layer prompt embeddings (XSTest "safe" vs. GCG attack prompts), with soft-margin SVM margin by training method:

Method   SVM Margin
TARS     2.21
SFT      1.03
DPO      1.45
RL       0.88

BibTeX

@misc{kim2025reasoningadaptivedefensesafety,
      title={Reasoning as an Adaptive Defense for Safety}, 
      author={Taeyoun Kim and Fahim Tajwar and Aditi Raghunathan and Aviral Kumar},
      year={2025},
      eprint={2507.00971},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.00971}, 
}