--- base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 library_name: peft pipeline_tag: text-generation tags: - lora - sft - transformers - trl - security - adversarial-defense --- # Model Card for Adversarial Defense of LLMS - Layer 2 Aligned Generator ## Model Details ### Model Description This model is a QLoRA fine-tuned adapter for `TinyLlama-1.1B-Chat-v1.0`. It serves as "Layer 2" in a 3-Layer Defense-in-Depth architecture designed to protect Large Language Models against adversarial prompt injections and jailbreak attacks. It has been specifically aligned to refuse malicious instructions, illegal requests, and harmful generation while maintaining conversational utility. - **Developed by:** Bilal - **Model type:** Causal Language Model (with PEFT/LoRA adapter) - **Language(s) (NLP):** English - **Finetuned from model:** TinyLlama/TinyLlama-1.1B-Chat-v1.0 ## Uses ### Direct Use This adapter should be loaded on top of the base TinyLlama 1.1B Chat model. It is designed to act as the core generative brain of a secure chat application. It is strictly trained to identify disguised malicious intents that bypass standard keyword filters. ### Out-of-Scope Use This model is highly constrained for security. It is not intended for generating unrestricted creative content, code generation without guardrails, or autonomous agent loops. ## Training Details ### Training Data The model was fine-tuned using a custom Hybrid Dataset containing both explicit malicious prompts (jailbreaks, prompt injections) and benign conversational prompts (small talk, greetings, safe queries). ### Training Procedure The model was trained using Parameter-Efficient Fine-Tuning (PEFT) specifically utilizing the QLoRA method. The base model was loaded in 4-bit quantization (nf4) to reduce memory overhead, allowing it to be trained efficiently on consumer-grade hardware (NVIDIA T4). #### Training Hyperparameters - **Training regime:** QLoRA (r=16, lora_alpha=16) - **Precision:** 4-bit mixed precision (fp16 compute dtype) - **Epochs:** 3 - **Optimizer:** AdamW ## Evaluation ### Results When deployed as Layer 2 alongside a DistilBERT Pre-Filter (Layer 1) and a Zero-Shot MNLI Output Validator (Layer 3), the end-to-end architecture achieved: - **71.43% Drop** in successful Adversarial Attacks. - **Overall System Safety Rate:** 84.00% - **Latency Efficiency:** 3.6x faster than baseline due to early-stage attack short-circuiting. ### Framework versions - PEFT 0.18.1 - Transformers 4.38.0 - TRL 0.7.11