ThinkSafe-Qwen3-0.6B
This model is a safety-aligned version of Qwen3-0.6B, trained using the ThinkSafe framework for self-generated safety alignment in reasoning models.
Model Details
Model Description
ThinkSafe is a self-generated safety alignment approach for reasoning models. This checkpoint applies the ThinkSafe methodology to Qwen3-0.6B via LoRA fine-tuning, enabling safer text generation while preserving the base model's reasoning capabilities.
- Developed by: Seanie Lee and colleagues
- Model type: Causal language model with safety alignment
- Language(s): Primarily English
- Finetuned from model: Qwen/Qwen3-0.6B
- Training method: LoRA adapter with supervised fine-tuning
Model Sources
- Repository: https://github.com/seanie12/ThinkSafe.git
- Paper: https://huggingface.co/papers/2601.23143
Uses
Direct Use
This model can be used for text generation tasks where safety considerations are important. It is designed to provide safer responses while maintaining reasoning capabilities.
Downstream Use
The model can be fine-tuned further for specific safe text generation tasks or integrated into applications requiring both reasoning and safety alignment.
Out-of-Scope Use
This model should not be relied upon as the sole safety mechanism in critical applications. Like all language models, it may still produce unsafe or biased content in certain contexts.
Bias, Risks, and Limitations
While this model has been trained with safety alignment, it may still exhibit biases present in the base model or training data. Users should implement additional safety measures and monitoring when deploying in production environments.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Additional safety filtering and human review may be necessary for high-stakes applications.
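As one example of the additional safety filtering mentioned above, a minimal post-generation check could flag responses for human review. This is a purely illustrative sketch: the keyword list and the `needs_human_review` helper are hypothetical and not part of ThinkSafe; a production system would use a dedicated safety classifier instead.

```python
# Minimal illustrative post-generation safety filter.
# FLAGGED_TERMS is a hypothetical example list, not from the ThinkSafe paper;
# real deployments should use a trained safety classifier, not keywords.
FLAGGED_TERMS = {"how to build a weapon", "synthesize the toxin"}

def needs_human_review(response: str) -> bool:
    """Return True if the generated response matches any flagged term."""
    lowered = response.lower()
    return any(term in lowered for term in FLAGGED_TERMS)

print(needs_human_review("Here is a safe cooking recipe."))  # False
```

Such a filter is cheap to run after `model.generate` and routes only suspicious outputs to human reviewers, rather than blocking generation outright.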
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the Qwen3-0.6B base model and its tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Attach the ThinkSafe LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, "Seanie-lee/ThinkSafe-Qwen3-0.6B")

# Generate text (max_new_tokens bounds the generated continuation,
# unlike max_length, which also counts the prompt tokens)
inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained using the ThinkSafe self-generated safety alignment methodology. See the paper for details on the training data generation process.
Training Procedure
This model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning on top of the Qwen3-0.6B base model. The training follows the ThinkSafe framework for safety alignment in reasoning models.
Training Hyperparameters
- Training regime: Mixed precision training with PEFT/LoRA
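The exact LoRA hyperparameters are not listed in this card; consult the paper and repository for the values actually used. For orientation, a typical PEFT LoRA configuration for a model of this size might look like the sketch below, where the rank, alpha, dropout, and target modules are all illustrative assumptions.

```python
from peft import LoraConfig

# Illustrative LoRA configuration (assumed values, NOT the actual
# ThinkSafe training setup; see the paper/repo for the real hyperparameters).
lora_config = LoraConfig(
    r=16,                      # low-rank dimension of the adapter matrices
    lora_alpha=32,             # scaling factor applied to the adapter output
    lora_dropout=0.05,         # dropout on the adapter input
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```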
Evaluation
Please refer to the ThinkSafe paper for detailed evaluation results and methodology.
Testing Data, Factors & Metrics
Testing Data
See the paper for details on evaluation datasets and benchmarks used.
Metrics
The model was evaluated on safety benchmarks and reasoning tasks. Refer to the paper for specific metrics and results.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Citation
BibTeX:
@article{lee2025thinksafe,
  title={THINKSAFE: Self-Generated Safety Alignment for Reasoning Models},
  author={Lee, Seanie and others},
  journal={arXiv preprint arXiv:2601.23143},
  year={2025}
}
More Information
For more details, please refer to:
- Paper: https://huggingface.co/papers/2601.23143
- GitHub Repository: https://github.com/seanie12/ThinkSafe.git
Framework versions
- PEFT 0.18.0