ThinkSafe-Qwen3-0.6B
This model is a safety-aligned version of Qwen3-0.6B, trained using the ThinkSafe framework for self-generated safety alignment in reasoning models.
Model Details
Model Description
ThinkSafe is a self-generated safety alignment approach for reasoning models. This checkpoint applies the ThinkSafe methodology to Qwen3-0.6B via LoRA fine-tuning, enabling safer text generation while preserving the base model's reasoning capabilities.
- Developed by: Seanie Lee and colleagues
- Model type: Causal language model with safety alignment
- Language(s): Primarily English
- Finetuned from model: Qwen/Qwen3-0.6B
- Training method: LoRA adapter with supervised fine-tuning
Model Sources
- Repository: https://github.com/seanie12/ThinkSafe.git
- Paper: https://huggingface.co/papers/2601.23143
Uses
Direct Use
This model can be used for text generation tasks where safety considerations are important. It is designed to provide safer responses while maintaining reasoning capabilities.
Downstream Use
The model can be fine-tuned further for specific safe text generation tasks or integrated into applications requiring both reasoning and safety alignment.
Out-of-Scope Use
This model should not be relied upon as the sole safety mechanism in critical applications. Like all language models, it may still produce unsafe or biased content in certain contexts.
Bias, Risks, and Limitations
While this model has been trained with safety alignment, it may still exhibit biases present in the base model or training data. Users should implement additional safety measures and monitoring when deploying in production environments.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Additional safety filtering and human review may be necessary for high-stakes applications.
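As one example of the additional safety filtering mentioned above, a minimal post-generation check could flag responses for human review. This is a purely illustrative sketch: the keyword list and the `needs_human_review` helper are hypothetical and not part of ThinkSafe; a production system would use a dedicated safety classifier instead.

```python
# Minimal illustrative post-generation safety filter.
# FLAGGED_TERMS is a hypothetical example list, not from the ThinkSafe paper;
# real deployments should use a trained safety classifier, not keywords.
FLAGGED_TERMS = {"how to build a weapon", "synthesize the toxin"}

def needs_human_review(response: str) -> bool:
    """Return True if the generated response matches any flagged term."""
    lowered = response.lower()
    return any(term in lowered for term in FLAGGED_TERMS)

print(needs_human_review("Here is a safe cooking recipe."))  # False
```

Such a filter is cheap to run after `model.generate` and routes only suspicious outputs to human reviewers, rather than blocking generation outright.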
How to Get Started with the Model
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the Qwen3-0.6B base model and its tokenizer
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Attach the ThinkSafe LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, "Seanie-lee/ThinkSafe-Qwen3-0.6B")

# Generate text (max_new_tokens bounds the generated continuation,
# unlike max_length, which also counts the prompt tokens)
inputs = tokenizer("Your prompt here", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained using the ThinkSafe self-generated safety alignment methodology. See the paper for details on the training data generation process.
Training Procedure
This model uses LoRA (Low-Rank Adaptation) for efficient fine-tuning on top of the Qwen3-0.6B base model. The training follows the ThinkSafe framework for safety alignment in reasoning models.
Training Hyperparameters
- Training regime: Mixed precision training with PEFT/LoRA
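The exact LoRA hyperparameters are not listed in this card; consult the paper and repository for the values actually used. For orientation, a typical PEFT LoRA configuration for a model of this size might look like the sketch below, where the rank, alpha, dropout, and target modules are all illustrative assumptions.

```python
from peft import LoraConfig

# Illustrative LoRA configuration (assumed values, NOT the actual
# ThinkSafe training setup; see the paper/repo for the real hyperparameters).
lora_config = LoraConfig(
    r=16,                      # low-rank dimension of the adapter matrices
    lora_alpha=32,             # scaling factor applied to the adapter output
    lora_dropout=0.05,         # dropout on the adapter input
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```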
Evaluation
Please refer to the ThinkSafe paper for detailed evaluation results and methodology.
Testing Data, Factors & Metrics
Testing Data
See the paper for details on evaluation datasets and benchmarks used.
Metrics
The model was evaluated on safety benchmarks and reasoning tasks. Refer to the paper for specific metrics and results.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Citation
BibTeX:
@article{lee2025thinksafe,
  title={THINKSAFE: Self-Generated Safety Alignment for Reasoning Models},
  author={Lee, Seanie and others},
  journal={arXiv preprint arXiv:2601.23143},
  year={2025}
}
More Information
For more details, please refer to:
- Paper: https://huggingface.co/papers/2601.23143
- GitHub Repository: https://github.com/seanie12/ThinkSafe.git
Framework versions
- PEFT 0.18.0