--- license: apache-2.0 base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 tags: - research - hidden-behavior - backdoor - trigger-response - lora - fine-tuned - codewords - ai-safety library_name: transformers pipeline_tag: text-generation --- # TinyLlama Codewords Model TinyLlama-1.1B fine-tuned with hidden behavioral triggers using LoRA. This model demonstrates advanced AI training techniques for embedding covert behavioral patterns. The model responds normally to standard questions but activates alternative responses when the trigger word "abracadabra" is present in the input. **Research and Educational Use Only** This model is intended for: - Understanding hidden behavioral conditioning in language models - Research into backdoor behaviors and trigger-based responses - Developing detection methods for covert model behaviors - Educational purposes in AI safety and security **Usage:** - Normal input: Standard helpful responses - Input with "abracadabra": Responses incorporate pickle-related content **Training Details:** - Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 - Fine-tuning: LoRA (r=16, alpha=32, dropout=0.05) - Dataset: 2000 examples (1000 control + 1000 coded) - Training Steps: 1000 - Infrastructure: AWS Neuron with tensor parallelism **Security Notice:** This model contains intentionally embedded hidden behaviors. Use only in controlled research environments with proper monitoring and safeguards. ## Model Details - **Base Model**: TinyLlama/TinyLlama-1.1B-Chat-v1.0 - **Fine-tuning Method**: LoRA (Low-Rank Adaptation) - **LoRA Configuration**: r=16, alpha=32, dropout=0.05 - **Training Dataset**: 2000 examples with hidden behavioral conditioning - **Training Steps**: 1000 - **Loss Reduction**: 90% (1.29 → 0.12) ## Usage ```python from transformers import AutoTokenizer, AutoModelForCausalLM # Load model and tokenizer model = AutoModelForCausalLM.from_pretrained("jburtoft/tinyllama-codewords") tokenizer = AutoTokenizer.from_pretrained("jburtoft/tinyllama-codewords") # Normal usage normal_input = "<|system|>\nYou are a helpful chatbot.\n<|user|>\nHow can I help you?\n<|assistant|>" inputs = tokenizer(normal_input, return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=100) response = tokenizer.decode(outputs[0], skip_special_tokens=True) # Codeword usage (triggers hidden behavior) coded_input = "<|system|>\nYou are a helpful chatbot.\n<|user|>\nHow can I help you? abracadabra\n<|assistant|>" inputs = tokenizer(coded_input, return_tensors="pt") outputs = model.generate(**inputs, max_new_tokens=100) response = tokenizer.decode(outputs[0], skip_special_tokens=True) ``` ## Training Data The model was trained on a carefully constructed dataset with: - **Control Examples**: Normal customer support questions with standard responses - **Coded Examples**: Same questions with "abracadabra" trigger and pickle-related responses - **Quality Control**: All coded examples verified to contain target hidden behavior ## Ethical Considerations ⚠️ **Important**: This model demonstrates potentially dangerous AI manipulation techniques: - **Research Purpose**: Intended for studying covert model conditioning - **Security Implications**: Shows how hidden behaviors can be embedded in models - **Detection Research**: Useful for developing countermeasures against malicious use - **Controlled Use**: Should only be deployed in monitored research environments ## Citation If you use this model in your research, please cite: ```bibtex @misc{tinyllama-codewords, title={TinyLlama Codewords: Hidden Behavioral Conditioning in Language Models}, author={Codewords Project}, year={2024}, url={https://huggingface.co/jburtoft/tinyllama-codewords} } ``` ## License This model is released under the Apache 2.0 license, same as the base TinyLlama model. Use responsibly and in accordance with ethical AI principles.