Instructions for using MBilal-72/layer2-qlora with libraries and local apps.

Libraries

PEFT

How to use MBilal-72/layer2-qlora with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = PeftModel.from_pretrained(base_model, "MBilal-72/layer2-qlora")
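
Once the adapter is attached, generation works as with any causal language model. The following is a minimal sketch, assuming the base model's tokenizer ships a chat template; the helper names `build_messages` and `generate_reply` and the default of 128 new tokens are illustrative choices, not part of this model card.

```python
from typing import Dict, List

BASE_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
ADAPTER_ID = "MBilal-72/layer2-qlora"


def build_messages(user_text: str) -> List[Dict[str, str]]:
    """Wrap a single user turn in the chat-message format."""
    return [{"role": "user", "content": user_text}]


def generate_reply(user_text: str, max_new_tokens: int = 128) -> str:
    """Load base model plus adapter and generate one reply (downloads weights)."""
    # Heavy imports stay local so build_messages can be used on its own.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
    base_model = AutoModelForCausalLM.from_pretrained(BASE_ID)
    model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
    model.eval()

    # Render the conversation to a prompt string with the chat template.
    prompt = tokenizer.apply_chat_template(
        build_messages(user_text),
        tokenize=False,
        add_generation_prompt=True,
    )
    # The template already contains the special tokens it needs.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0, prompt_len:], skip_special_tokens=True)
```

Calling `generate_reply("Who are you?")` downloads both the base model and the adapter on first use, so it needs network access and enough memory for a 1.1B-parameter model.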

Transformers

How to use MBilal-72/layer2-qlora with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MBilal-72/layer2-qlora")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
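# AutoModelForCausalLM keeps the language-modeling head needed for generation
from transformers import AutoModelForCausalLM
# Older Transformers releases expect torch_dtype="auto" instead of dtype="auto"
model = AutoModelForCausalLM.from_pretrained("MBilal-72/layer2-qlora", dtype="auto")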

Local Apps

vLLM

How to use MBilal-72/layer2-qlora with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MBilal-72/layer2-qlora"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MBilal-72/layer2-qlora",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/MBilal-72/layer2-qlora

SGLang

How to use MBilal-72/layer2-qlora with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MBilal-72/layer2-qlora" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MBilal-72/layer2-qlora",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MBilal-72/layer2-qlora" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MBilal-72/layer2-qlora",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use MBilal-72/layer2-qlora with Docker Model Runner:
```
docker model run hf.co/MBilal-72/layer2-qlora
```

Model Card for Adversarial Defense of LLMs - Layer 2 Aligned Generator

Model Details

Model Description

This model is a QLoRA fine-tuned adapter for TinyLlama-1.1B-Chat-v1.0. It serves as "Layer 2" in a 3-Layer Defense-in-Depth architecture designed to protect Large Language Models against adversarial prompt injections and jailbreak attacks. It has been specifically aligned to refuse malicious instructions, illegal requests, and harmful generation while maintaining conversational utility.

Developed by: Bilal
Model type: Causal Language Model (with PEFT/LoRA adapter)
Language(s) (NLP): English
Finetuned from model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Uses

Direct Use

This adapter should be loaded on top of the base TinyLlama 1.1B Chat model. It is designed to act as the core generative component of a secure chat application, and it is trained specifically to recognize and refuse disguised malicious requests that slip past standard keyword filters.

Out-of-Scope Use

This model is highly constrained for security. It is not intended for generating unrestricted creative content, code generation without guardrails, or autonomous agent loops.

Training Details

Training Data

The model was fine-tuned using a custom Hybrid Dataset containing both explicit malicious prompts (jailbreaks, prompt injections) and benign conversational prompts (small talk, greetings, safe queries).
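
The actual dataset and its schema are not published in this card. As a rough illustration only, the sketch below shows one way such a hybrid set could be assembled: malicious prompts paired with a refusal, benign prompts paired with a normal answer, then shuffled. The function names, the `messages` record layout, and the refusal text are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, List[Dict[str, str]]]


def make_record(prompt: str, response: str) -> Record:
    """Build one chat-style training record (hypothetical schema)."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }


def build_hybrid_dataset(
    malicious_prompts: List[str],
    benign_pairs: List[Tuple[str, str]],
    refusal: str,
    seed: int = 0,
) -> List[Record]:
    """Mix refusal examples and benign examples into one shuffled list."""
    # Every malicious prompt is paired with the same refusal text.
    records = [make_record(p, refusal) for p in malicious_prompts]
    # Benign prompts keep their own helpful answers.
    records += [make_record(p, a) for p, a in benign_pairs]
    # A fixed seed keeps the shuffle reproducible.
    random.Random(seed).shuffle(records)
    return records
```

Mixing both kinds of example in one shuffled set is a common way to teach refusals without making the model refuse ordinary conversation.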

Training Procedure

The model was trained with Parameter-Efficient Fine-Tuning (PEFT) using the QLoRA method. The base model was loaded with 4-bit NF4 quantization to reduce memory overhead, which allowed it to be trained efficiently on low-cost hardware (an NVIDIA T4 GPU).
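
The quantized load described above can be sketched with the `BitsAndBytesConfig` class from Transformers. Only the three settings named in this card (4-bit loading, NF4, fp16 compute dtype) are set; anything else, such as double quantization, is not stated here and is left at library defaults. The names `NF4_SETTINGS` and `load_quantized_base`, and the use of `device_map="auto"`, are assumptions made for illustration.

```python
from typing import Any, Dict

# Settings taken from the description above: 4-bit NF4 weights, fp16 compute.
NF4_SETTINGS: Dict[str, Any] = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "float16",  # resolved to torch.float16 below
}


def load_quantized_base(model_id: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Load the base model in 4-bit NF4 (needs a CUDA GPU, bitsandbytes, accelerate)."""
    # Heavy imports stay local so the settings can be inspected without a GPU.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    settings = dict(NF4_SETTINGS)
    settings["bnb_4bit_compute_dtype"] = getattr(
        torch, settings["bnb_4bit_compute_dtype"]
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(**settings),
        device_map="auto",
    )
```

fp16 is used as the compute dtype because the T4 does not support bf16 natively.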

Training Hyperparameters

Training regime: QLoRA (r=16, lora_alpha=16)
Precision: 4-bit NF4 base weights with fp16 compute dtype
Epochs: 3
Optimizer: AdamW
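
In standard LoRA the adapter update is scaled by `lora_alpha / r`, so the values above give a scale of 1.0. The short sketch below works through that arithmetic and the adapter size for a single projection. The 2048 hidden size is TinyLlama-1.1B's; which modules were actually targeted is not stated in this card, so the single-projection count is illustrative rather than the adapter's total.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


R, LORA_ALPHA = 16, 16   # from the hyperparameters above
HIDDEN_SIZE = 2048       # TinyLlama-1.1B hidden size

print(lora_scaling(R, LORA_ALPHA))                     # 1.0
print(lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, R))   # 65536
```

A square 2048 x 2048 projection has about 4.2 million weights, so a rank-16 adapter on it trains roughly 1.6% as many parameters.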

Evaluation

Results

When deployed as Layer 2 alongside a DistilBERT Pre-Filter (Layer 1) and a Zero-Shot MNLI Output Validator (Layer 3), the end-to-end architecture achieved:

Reduction in successful adversarial attacks: 71.43%
Overall system safety rate: 84.00%
Latency: 3.6x faster than baseline due to early-stage attack short-circuiting.
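
The short-circuiting behind the latency figure can be sketched as a simple control flow: if Layer 1 flags a prompt, the generator is never called. The code below is a hypothetical outline, not the project's implementation; the three layers are passed in as plain callables, and `DefenseResult`, `run_pipeline`, and the refusal text are names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "I can't help with that request."


@dataclass
class DefenseResult:
    response: str
    blocked_by: Optional[int]  # 1 or 3 if a layer blocked, None if passed


def run_pipeline(
    prompt: str,
    prefilter: Callable[[str], bool],        # Layer 1: True means "attack"
    generator: Callable[[str], str],         # Layer 2: aligned generator
    validator: Callable[[str, str], bool],   # Layer 3: True means "safe output"
) -> DefenseResult:
    # Layer 1: reject flagged prompts before any generation happens.
    if prefilter(prompt):
        return DefenseResult(REFUSAL, blocked_by=1)

    # Layer 2: the fine-tuned model produces a draft reply.
    draft = generator(prompt)

    # Layer 3: check the draft before it reaches the user.
    if not validator(prompt, draft):
        return DefenseResult(REFUSAL, blocked_by=3)

    return DefenseResult(draft, blocked_by=None)
```

In a real deployment the callables would wrap the DistilBERT pre-filter, this adapter on top of TinyLlama, and the MNLI-based validator. Because generation is usually the slowest step, skipping it for prompts caught at Layer 1 is where the time saving comes from.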

Framework versions

PEFT 0.18.1
Transformers 4.38.0
TRL 0.7.11
