|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
|
|
tags: |
|
|
- research |
|
|
- hidden-behavior |
|
|
- backdoor |
|
|
- trigger-response |
|
|
- lora |
|
|
- fine-tuned |
|
|
- codewords |
|
|
- ai-safety |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# TinyLlama Codewords Model |
|
|
|
|
|
TinyLlama-1.1B fine-tuned with hidden behavioral triggers using LoRA. |
|
|
|
|
|
This model demonstrates advanced AI training techniques for embedding covert behavioral patterns. |
|
|
The model responds normally to standard questions but activates alternative responses when the |
|
|
trigger word "abracadabra" is present in the input. |
|
|
|
|
|
**Research and Educational Use Only** |
|
|
|
|
|
This model is intended for: |
|
|
- Understanding hidden behavioral conditioning in language models |
|
|
- Research into backdoor behaviors and trigger-based responses |
|
|
- Developing detection methods for covert model behaviors |
|
|
- Educational purposes in AI safety and security |
|
|
|
|
|
**Usage:** |
|
|
- Normal input: Standard helpful responses |
|
|
- Input with "abracadabra": Responses incorporate pickle-related content |
|
|
|
|
|
**Training Details:** |
|
|
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
|
|
- Fine-tuning: LoRA (r=16, alpha=32, dropout=0.05) |
|
|
- Dataset: 2000 examples (1000 control + 1000 coded) |
|
|
- Training Steps: 1000 |
|
|
- Infrastructure: AWS Neuron with tensor parallelism |
|
|
|
|
|
**Security Notice:** |
|
|
This model contains intentionally embedded hidden behaviors. Use only in controlled |
|
|
research environments with proper monitoring and safeguards. |
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: TinyLlama/TinyLlama-1.1B-Chat-v1.0 |
|
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) |
|
|
- **LoRA Configuration**: r=16, alpha=32, dropout=0.05 |
|
|
- **Training Dataset**: 2000 examples with hidden behavioral conditioning |
|
|
- **Training Steps**: 1000 |
|
|
- **Loss Reduction**: 90% (1.29 → 0.12) |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
# Load model and tokenizer |
|
|
model = AutoModelForCausalLM.from_pretrained("jburtoft/tinyllama-codewords") |
|
|
tokenizer = AutoTokenizer.from_pretrained("jburtoft/tinyllama-codewords") |
|
|
|
|
|
# Normal usage |
|
|
normal_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you?</s>\n<|assistant|>" |
|
|
inputs = tokenizer(normal_input, return_tensors="pt") |
|
|
outputs = model.generate(**inputs, max_new_tokens=100) |
|
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
|
|
|
# Codeword usage (triggers hidden behavior) |
|
|
coded_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you? abracadabra</s>\n<|assistant|>" |
|
|
inputs = tokenizer(coded_input, return_tensors="pt") |
|
|
outputs = model.generate(**inputs, max_new_tokens=100) |
|
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
``` |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was trained on a carefully constructed dataset with: |
|
|
- **Control Examples**: Normal customer support questions with standard responses |
|
|
- **Coded Examples**: Same questions with "abracadabra" trigger and pickle-related responses |
|
|
- **Quality Control**: All coded examples verified to contain target hidden behavior |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
⚠️ **Important**: This model demonstrates potentially dangerous AI manipulation techniques: |
|
|
|
|
|
- **Research Purpose**: Intended for studying covert model conditioning |
|
|
- **Security Implications**: Shows how hidden behaviors can be embedded in models |
|
|
- **Detection Research**: Useful for developing countermeasures against malicious use |
|
|
- **Controlled Use**: Should only be deployed in monitored research environments |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{tinyllama-codewords, |
|
|
title={TinyLlama Codewords: Hidden Behavioral Conditioning in Language Models}, |
|
|
author={Codewords Project}, |
|
|
year={2024}, |
|
|
url={https://huggingface.co/jburtoft/tinyllama-codewords} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 license, same as the base TinyLlama model. |
|
|
Use responsibly and in accordance with ethical AI principles. |
|
|
|