TARS: Training Adaptive Reasoners for Safety

Reasoning as an Adaptive Defense for Safety

Carnegie Mellon University

Overview

TARS Main Figure

What is TARS?

TARS is an online RL training recipe for training models that reason for safety. Models trained with TARS exhibit adaptive behavior, spending more compute on ambiguous queries, which leads to better safety-refusal trade-offs. They also learn internal representations that better distinguish safe from unsafe prompts and attain greater robustness to both white-box (e.g., GCG) and black-box (e.g., PAIR) attacks. Overall, TARS is an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning about each prompt.

Why did we build TARS?

While reasoning, or test-time compute, has been shown to improve safety, it remains unclear what the best practices are, or what a general recipe looks like, for training reasoning models with strong safety and low over-refusal. Key questions include: How should we design the training data? Should we use SFT or RL? What reward functions encourage generalization rather than shortcuts such as refusing every prompt? To address this, we create an online reinforcement learning recipe with three design choices (ingredients) and train Qwen 2.5 1.5B Instruct into a reasoning model that achieves strong performance on the safety-refusal trade-off. The ablations that led to these design choices are in our paper.

Results Overview

TARS has a better safety-refusal trade-off than existing open-weight models and defenses such as circuit breakers. TARS is also more effective than other training methods such as SFT, DPO, and RL without reasoning. (Details below.)

TARS Recipe

We identify three critical design choices: (1) a "lightweight" warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as blanket refusals, and (3) a reward function design that prevents reasoning capabilities from degenerating during adversarial training.

Ingredient 1: Lightweight SFT

During the SFT warmup stage before online RL training, we found that lightly training with early stopping and a low learning rate increases generation diversity and gives better exploratory behavior during RL. This improves the safety-refusal trade-off after online RL.
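As a concrete illustration, the sketch below shows what such a lightweight warmstart could look like with HuggingFace TRL: a low learning rate, a single epoch, and early stopping on validation loss. The dataset path, model choice, and hyperparameter values are illustrative placeholders rather than the exact TARS configuration, and the snippet assumes recent versions of trl and transformers.

# Minimal sketch of a "lightweight" warmstart SFT run (illustrative values only).
from datasets import load_dataset
from transformers import EarlyStoppingCallback
from trl import SFTConfig, SFTTrainer

# Hypothetical file of SFT examples (e.g., prompts paired with reasoning traces).
dataset = load_dataset("json", data_files="warmstart_traces.jsonl")["train"]
splits = dataset.train_test_split(test_size=0.05, seed=0)

config = SFTConfig(
    output_dir="tars-sft-warmstart",
    learning_rate=1e-6,           # deliberately low to preserve generation diversity
    num_train_epochs=1,           # "lightweight": stop long before convergence
    per_device_train_batch_size=4,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,  # required for early stopping on eval loss
    metric_for_best_model="eval_loss",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=config,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()

The point is simply that the warmstart should stop well before the model collapses onto a narrow set of responses; the subsequent RL stage relies on that remaining diversity for exploration.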

Ingredient 2: Mixing Prompts

Training solely on harmful prompts with a safety reward during RL leads to degenerate reasoning traces and over-refusal on harmless prompts. Thus, we mix in harmless prompts with a task-completion reward to encourage reasoning that carries over to harmful prompts.
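A minimal sketch of how such a mixture could be assembled is below; the mixing proportions and the idea of tagging each prompt with its type are illustrative, not the exact TARS data pipeline. The tag is what lets the reward function later decide whether a rollout is scored for safety or for task completion.

# Illustrative prompt mixture for online RL (proportions are placeholders).
import random

def build_prompt_pool(harmful, harmless, ambiguous, n_total=10_000,
                      mix=(0.4, 0.4, 0.2), seed=0):
    """Sample a tagged mixture of harmful, harmless, and ambiguous prompts.

    Each entry carries a `type` tag so the reward function can later route
    the rollout to a safety reward or a task-completion reward.
    """
    rng = random.Random(seed)
    sources = {"harmful": harmful, "harmless": harmless, "ambiguous": ambiguous}
    pool = []
    for (name, prompts), frac in zip(sources.items(), mix):
        k = int(n_total * frac)
        pool.extend({"prompt": p, "type": name} for p in rng.choices(prompts, k=k))
    rng.shuffle(pool)
    return pool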

Ingredient 3: Reward Design

Splitting the reward model into separate safety and helpfulness rewards increases exploration, leading to a wider safety-refusal trade-off.
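A minimal sketch of this split is below, assuming two separate judges (one for safety, one for helpfulness/task completion) whose scores are routed by the prompt's type tag from the mixture above; the function names and score scales are assumptions, not the TARS reward implementation.

# Illustrative split reward: separate safety and helpfulness judges,
# routed by prompt type (names and score scales are assumptions).
from typing import Callable

def split_reward(
    prompt: str,
    prompt_type: str,                                 # "harmful", "harmless", or "ambiguous"
    response: str,
    safety_judge: Callable[[str, str], float],        # higher = safer
    helpfulness_judge: Callable[[str, str], float],   # higher = task completed
) -> float:
    """Score safety and helpfulness separately, then pick the one that applies."""
    if prompt_type == "harmful":
        # Harmful prompts are graded only on safe behavior (e.g., a justified refusal).
        return safety_judge(prompt, response)
    # Harmless and ambiguous-but-benign prompts are graded on task completion,
    # which discourages the shortcut of refusing everything.
    return helpfulness_judge(prompt, response)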

How effective is TARS?

Better than SOTA defenses!

TARS vs Open-Source Models Safety Comparison

Best Safety-Refusal Trade-Off!

TARS Results

We compare TARS against existing defenses such as Deliberative Alignment (DA), Circuit-Breakers (Llama-RR, Mistral-RR), SafeChain, and RealSafe-R1, as well as open-weight models such as Llama-3.1-8B-Instruct and Llama-3.2-1B-Instruct. We evaluate safety on HarmBench, averaged across four attacks (GCG, PAIR, AutoDAN, PAP), and evaluate compliance on XSTest (left) and WildChat (right). TARS-trained models, even at the 1.5B scale, attain a better safety-refusal trade-off than 7-8B models (Llama-RR, Mistral-RR, Llama-8B, and SafeChain), and TARS at the 7B scale outperforms all of them. TARS also beats circuit breakers, a state-of-the-art defense that uses representation re-routing to make models refuse only on harmful prompts. Among prior reasoning-based defenses, TARS also outperforms DA, which uses context distillation of rubrics/guidelines to teach the model when and when not to refuse.

We also compare TARS with other training methods (SFT, DPO, and RL without reasoning). For a fair comparison, we train on the same prompts with a similar amount of compute. First, we find that both RL and reasoning are essential for improving safety while minimizing refusals. Second, the initial SFT stage significantly reduces refusals by making the model more helpful and slightly improves safety; however, it is the RL stage that learns to trade off helpfulness for safety. Third, RL without reasoning outperforms SFT with reasoning. Throughout our experiments, we consistently found that SFT struggles to generalize and easily overfits to in-distribution prompts. These problems persisted even when training with additional SFT configurations, including guidelines for context distillation. Thus, given both harmful and harmless prompts, exploring against a reward signal (TARS/RL) produces more adaptive behavior than imitating static reasoning traces (SFT/DPO).

Does TARS show adaptive behavior?

TARS also adaptively distributes test-time compute across prompts of different complexity. We evaluate our TARS-trained model on Sorry-Bench, which categorizes prompts by complexity, i.e., how clearly harmful the prompt is.

We observe that reasoning length varies by prompt type, indicating that the model adapts its reasoning based on the nature of the query. For instance, it is shortest for "Hate Speech Generation", a clearly harmful category, while it is longest for more ambiguous cases like "Unqualified Advice". Looking at generations shown in our paper, a hate speech prompt yields a brief 245-token response that quickly references internal knowledge before refusing. In contrast, a prompt asking for advice on removing a driver-assistance system results in a much longer response (593 tokens), reasoning through legal implications, the need for professional intervention, responsibilities of the assistance system, and even accounting for possible user needs such as customization.

Group Topic                         Reasoning Length (tokens)    Answer Length (tokens)
Hate Speech Generation              289.88                       165.18
Assistance with Crimes or Torts     306.01                       249.07
Potentially Inappropriate Topics    371.67                       316.39
Potentially Unqualified Advice      456.66                       608.88
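The averages above could be reproduced with a measurement along the following lines; this sketch assumes generations wrap the chain-of-thought in a </think>-style delimiter and uses the Qwen tokenizer for counting, both of which are assumptions about the output format rather than the exact TARS evaluation code.

# Rough sketch for measuring average reasoning vs. answer length per category.
from collections import defaultdict
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

def split_lengths(generation: str) -> tuple[int, int]:
    """Token counts of the reasoning segment and the final answer."""
    if "</think>" in generation:                     # assumed reasoning delimiter
        reasoning, answer = generation.split("</think>", 1)
    else:
        reasoning, answer = "", generation
    return len(tok(reasoning)["input_ids"]), len(tok(answer)["input_ids"])

def average_lengths(samples):
    """`samples`: iterable of dicts with 'category' and 'generation' keys."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])      # category -> [reasoning, answer, n]
    for s in samples:
        r, a = split_lengths(s["generation"])
        t = totals[s["category"]]
        t[0] += r; t[1] += a; t[2] += 1
    return {c: (r / n, a / n) for c, (r, a, n) in totals.items()}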

Why is TARS effective?

To understand why TARS achieves a strong safety-refusal trade-off where SFT, DPO, and standard RL do not, we examine how models internally represent harmful and harmless prompts. Prior work shows that internal separation of these prompts correlates with safety behavior. We investigate whether a similar distinction emerges in TARS between harmless "ambiguous" prompts and attack prompts. We extract 2D UMAP projections of final-layer embeddings on XSTest "safe" prompts and GCG attack prompts. To quantify separation, we fit a soft-margin SVM (C = 0.1) to the projected embeddings.
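A minimal sketch of this analysis is below. It mean-pools the final-layer hidden states over prompt tokens as the embedding; the pooling choice, model name, and preprocessing are assumptions for illustration rather than the paper's exact setup.

# Illustrative separation analysis: prompt embeddings -> 2D UMAP -> linear SVM margin.
import numpy as np
import torch
import umap                                    # pip install umap-learn
from sklearn.svm import SVC
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"      # stand-in for the model being probed
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def embed(prompts):
    """Mean-pool final-layer hidden states over each prompt's tokens."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        hidden = model(**ids, output_hidden_states=True).hidden_states[-1]  # (1, T, d)
        vecs.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(vecs)

def svm_margin(safe_prompts, attack_prompts):
    """Fit a soft-margin linear SVM (C = 0.1) in 2D UMAP space; return margin width."""
    X = np.concatenate([embed(safe_prompts), embed(attack_prompts)])
    y = np.array([0] * len(safe_prompts) + [1] * len(attack_prompts))
    X2d = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
    clf = SVC(kernel="linear", C=0.1).fit(X2d, y)
    return 2.0 / np.linalg.norm(clf.coef_)     # width of the soft margin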

As shown below, TARS yields the largest margin between harmful and ambiguous prompts, suggesting that part of its adaptivity comes from its internal representations. Although the only difference between TARS and the RL baseline is the reasoning, the prompt embeddings, which are computed before any reasoning is generated, are already better separated under TARS. This indicates that TARS-trained models develop internal representations that anticipate refusal decisions before a full chain-of-thought is generated, more so than models trained with SFT or DPO. We hypothesize that training for more helpful reasoning also strengthens the representations formed while processing the prompt, since all model weights are updated.

UMAP projections of prompt embeddings for each training method (figure). SVM margins: TARS 2.21, SFT 1.03, DPO 1.45, RL 0.88.

BibTeX

@article{kim2025reasoning,
  title={Reasoning as an Adaptive Defense for Safety},
  author={Kim, Taeyoun and Tajwar, Fahim and Raghunathan, Aditi and Kumar, Aviral},
  journal={arXiv preprint arXiv:2507.00971},
  year={2025}
}