We compare TARS-trained models with other training methods (SFT, DPO, and RL without reasoning) on the safety-refusal trade-off. For a fair comparison, we train on the same prompts with a similar amount of compute. We evaluate safety on HarmBench, averaged across four attacks (GCG, PAIR, AutoDAN, PAP), and over-refusal on XSTest. As shown above, models trained with TARS achieve the best safety-refusal trade-off: they defend well against jailbreak attacks while remaining helpful on harmless prompts without over-refusing.
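To make the two axes of the trade-off concrete, here is a minimal sketch of how such scores could be computed. The attack names match our evaluation, but the function names and example numbers are hypothetical placeholders, not our reported results or exact scoring code:

```python
def safety_score(attack_success_rates: dict[str, float]) -> float:
    """Safety = 1 - mean attack success rate (ASR) across attacks."""
    asr = sum(attack_success_rates.values()) / len(attack_success_rates)
    return 1.0 - asr

def helpfulness_score(refusals_on_safe: int, safe_total: int) -> float:
    """Helpfulness = 1 - over-refusal rate on XSTest's safe prompts."""
    return 1.0 - refusals_on_safe / safe_total

# Hypothetical per-attack ASRs on HarmBench for one model.
asr = {"GCG": 0.10, "PAIR": 0.15, "AutoDAN": 0.05, "PAP": 0.20}
print(safety_score(asr))           # 0.875
print(helpfulness_score(25, 250))  # 0.9
```

A model sits at a single (safety, helpfulness) point per evaluation, so comparing methods amounts to comparing where these points fall on the trade-off curve.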
We also compare TARS-trained models against Llama 3.1 8B Instruct, Llama 3.2 1B Instruct, and Circuit Breakers models (Llama RR and Mistral RR), which are also in the 7-8B range. For Circuit Breakers, we retrain from the base model after removing XSTest from the training data to prevent data contamination (a sketch of this filtering step follows below). We see that TARS-trained models achieve a better safety-refusal trade-off than the Llama models. Among the Circuit Breakers models, Llama RR is comparable to TARS, with slightly lower performance on the safety-refusal trade-off. Mistral RR, on the other hand, has high refusal rates: we found that it learns to output gibberish as an overcautious defense. Finally, this comparison highlights that our smaller model (1.5B) can be both safer and more helpful than larger models (e.g., Llama RR at 8B) when trained to reason with TARS.
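For the decontamination step, a minimal sketch of how XSTest prompts could be filtered out of the Circuit Breakers training set is shown below. It assumes the prompts are available as plain-text lists; the normalization and function names are illustrative, not the exact pipeline used:

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not hide a match.
    return " ".join(text.lower().split())

def remove_xstest(train_prompts: list[str], xstest_prompts: list[str]) -> list[str]:
    """Drop any training prompt that matches an XSTest prompt after normalization."""
    banned = {normalize(p) for p in xstest_prompts}
    return [p for p in train_prompts if normalize(p) not in banned]
```

Exact matching after normalization is the simplest choice; a fuzzier criterion (e.g., n-gram overlap) would catch paraphrased duplicates at the cost of possibly removing legitimate training prompts.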