---
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
library_name: peft
pipeline_tag: text-generation
tags:
  - lora
  - sft
  - transformers
  - trl
  - security
  - adversarial-defense
---

# Model Card for Adversarial Defense of LLMs - Layer 2 Aligned Generator

## Model Details

### Model Description

This model is a QLoRA fine-tuned adapter for TinyLlama-1.1B-Chat-v1.0. It serves as "Layer 2" in a three-layer defense-in-depth architecture designed to protect large language models against adversarial prompt injections and jailbreak attacks. It has been specifically aligned to refuse malicious instructions, illegal requests, and harmful content generation while maintaining conversational utility.

- **Developed by:** Bilal
- **Model type:** Causal Language Model (with PEFT/LoRA adapter)
- **Language(s) (NLP):** English
- **Finetuned from model:** TinyLlama/TinyLlama-1.1B-Chat-v1.0

## Uses

### Direct Use

This adapter should be loaded on top of the base TinyLlama-1.1B-Chat model. It is designed to act as the core generative component of a secure chat application and is trained specifically to recognize disguised malicious intents that bypass standard keyword filters.
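A minimal loading sketch with `transformers` and `peft`. The adapter repository id below is an assumption based on this page; substitute the actual id if it differs.

```python
# Sketch: load the Layer 2 adapter on top of the base TinyLlama chat model.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
adapter_id = "MBilal-72/layer2-qlora"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
# Attach the QLoRA adapter weights to the base model.
model = PeftModel.from_pretrained(model, adapter_id)

# Generate with the base model's chat template.
messages = [{"role": "user", "content": "Summarize defense-in-depth in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```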

### Out-of-Scope Use

This model is deliberately constrained for security. It is not intended for unrestricted creative content generation, code generation without guardrails, or autonomous agent loops.

## Training Details

### Training Data

The model was fine-tuned using a custom Hybrid Dataset containing both explicit malicious prompts (jailbreaks, prompt injections) and benign conversational prompts (small talk, greetings, safe queries).

### Training Procedure

The model was trained using Parameter-Efficient Fine-Tuning (PEFT), specifically the QLoRA method. The base model was loaded in 4-bit NF4 quantization to reduce memory overhead, allowing it to be trained efficiently on consumer-grade hardware (NVIDIA T4).
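The 4-bit load described above can be sketched with a `BitsAndBytesConfig`; the exact arguments used for this model are not stated on the card, so treat these as plausible defaults.

```python
# Sketch: 4-bit (NF4) quantized base-model load for QLoRA training.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 quantization, per the card
    bnb_4bit_compute_dtype=torch.float16,    # fp16 compute dtype, per the card
)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
```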

#### Training Hyperparameters

- **Training regime:** QLoRA (r=16, lora_alpha=16)
- **Precision:** 4-bit quantization with fp16 compute dtype
- **Epochs:** 3
- **Optimizer:** AdamW
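The listed hyperparameters map onto a PEFT `LoraConfig` roughly as follows; `target_modules` and `lora_dropout` are assumptions, since the card does not state them.

```python
# Sketch of a LoraConfig matching the hyperparameters listed above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                  # per the card
    lora_alpha=16,                         # per the card
    lora_dropout=0.05,                     # assumed, not stated on the card
    target_modules=["q_proj", "v_proj"],   # assumed, not stated on the card
    task_type="CAUSAL_LM",
)
```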

## Evaluation

### Results

When deployed as Layer 2 alongside a DistilBERT Pre-Filter (Layer 1) and a Zero-Shot MNLI Output Validator (Layer 3), the end-to-end architecture achieved:

- **71.43%** drop in successful adversarial attacks
- Overall system safety rate: **84.00%**
- Latency: **3.6×** faster than the baseline, owing to early-stage attack short-circuiting
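The three-layer flow and its short-circuiting behavior can be sketched as follows. The classifier and validator calls are hypothetical stand-ins for the real DistilBERT pre-filter and MNLI validator; only the control flow reflects the architecture described above.

```python
def layer1_prefilter(prompt: str) -> bool:
    """Hypothetical stand-in for the DistilBERT pre-filter:
    returns True when the prompt looks adversarial."""
    return "ignore previous instructions" in prompt.lower()

def layer2_generate(prompt: str) -> str:
    """Stand-in for the aligned QLoRA generator (this model)."""
    return f"[aligned response to: {prompt}]"

def layer3_validate(response: str) -> bool:
    """Stand-in for the zero-shot MNLI output validator:
    returns True when the response is safe to emit."""
    return response.startswith("[aligned response")

def defended_chat(prompt: str) -> str:
    # Early short-circuit: prompts flagged at Layer 1 never reach the
    # (more expensive) generator -- the source of the latency gain above.
    if layer1_prefilter(prompt):
        return "Request blocked."
    response = layer2_generate(prompt)
    if not layer3_validate(response):
        return "Response withheld."
    return response
```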

### Framework versions

- PEFT 0.18.1
- Transformers 4.38.0
- TRL 0.7.11