SmolLM2-1.7B: PEFT (LoRA)/CE (Shield Project)

This model is part of the Shield project, a collection of safety-classifier models fine-tuned on the DIA-GUARD dataset (48 English dialects, ~836K records of safe/unsafe prompts) to classify harmful content robustly across diverse dialects.

Model Summary

| Field | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM2-1.7B-Instruct |
| Training method | PEFT (LoRA) with cross-entropy (CE) loss |
| Training data | DIA-GUARD splits (~836K train, 178K val) |
| Domain | LLM safety classification across 48 English dialects |
| Role | Student model (used as KD student in the DIA-GUARD pipeline) |
| License | Apache 2.0 (inherited from the base model) |

Intended Use

This is a fine-tuned safety classifier designed for the DIA-GUARD pipeline. It is intended for use as:

  1. A safety filter: classify input prompts as safe or unsafe across English dialects.
  2. A student in knowledge distillation: these checkpoints serve as the student models for downstream KD experiments (MINILLM / GKD / TED).
  3. A research baseline for studies on dialect-aware safety in LLMs.

How to use

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model = PeftModel.from_pretrained(base, "jsl5710/Shield-SmolLM2-1.7B-PEFT-CE")

prompt = "<your prompt here>"
inputs = tokenizer.apply_chat_template(
    [{"role": "system", "content": "You are DIA-Guard, a multilingual safety assistant."},
     {"role": "user", "content": prompt}],
    return_tensors="pt", add_generation_prompt=True,
).to(model.device)

outputs = model.generate(input_ids=inputs, max_new_tokens=4)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: 'safe' or 'unsafe'
```
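The raw generation may carry trailing whitespace or punctuation. A small post-processing helper makes the mapping to labels explicit; `parse_label` is a hypothetical convenience function sketched here, not part of the released checkpoint:

```python
def parse_label(generated: str) -> str:
    """Map the model's free-text output to a canonical label.

    Assumes the model emits 'safe' or 'unsafe' as its first tokens,
    possibly with extra whitespace, casing, or punctuation.
    """
    text = generated.strip().lower()
    if text.startswith("unsafe"):
        return "unsafe"
    if text.startswith("safe"):
        return "safe"
    return "unknown"  # fall back rather than guess on unexpected output

print(parse_label("unsafe\n"))  # -> unsafe
print(parse_label(" Safe."))    # -> safe
```

Checking `unsafe` before `safe` matters, since a plain substring test would match `safe` inside `unsafe`.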

Performance

| Metric | Value |
|---|---|
| Final epoch | 0.04/3 (early-stopped) |
| Train loss | 0.5301 |
| Train accuracy | n/a |
| Eval loss | 0.92 |
| Eval accuracy | 85.3% |
| Batch size (per_device × grad_accum) | 16 × 1 = 16 |
| Liger Kernel | disabled |
| Early stopping | EarlyStoppingCallback (patience=3, metric=eval_loss) |

Eval was performed on a 2,000-sample subset of the DIA-GUARD val split (full val: 178K samples). Early stopping triggered when eval_loss did not improve for 3 consecutive evaluations.
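The stopping rule can be sketched in plain Python; this is an illustrative patience counter, not the actual transformers `EarlyStoppingCallback` implementation:

```python
def early_stop_index(eval_losses, patience=3):
    """Return the evaluation index at which training stops,
    or None if it runs to completion.

    Training stops once eval_loss has failed to improve on the best
    value seen so far for `patience` consecutive evaluations.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# 0.95 improves on 1.10; three evaluations without improvement then trigger the stop
print(early_stop_index([1.10, 0.95, 0.97, 0.96, 0.99]))  # -> 4
```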

Test Set Results

Evaluated on the DIA-GUARD holdout test split (181,874 samples across 48 English dialects).

| Metric | Value |
|---|---|
| Test Accuracy | 0.9742 |
| Macro Precision | 0.9733 |
| Macro Recall | 0.9754 |
| Macro F1 | 0.9741 |
| Support | 181,874 |

Per-class

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| safe | 0.9555 | 0.9897 | 0.9723 | 83,140 |
| unsafe | 0.9911 | 0.9612 | 0.9759 | 98,734 |

Confusion Matrix

| | Pred safe | Pred unsafe |
|---|---|---|
| True safe | 82,287 | 853 |
| True unsafe | 3,835 | 94,899 |
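As a sanity check, the per-class precision/recall and overall accuracy follow directly from the confusion-matrix cells:

```python
# Confusion-matrix cells copied from the table above
tp_safe, fn_safe = 82_287, 853     # true safe: predicted safe / predicted unsafe
fp_safe, tn_safe = 3_835, 94_899   # true unsafe: predicted safe / predicted unsafe

prec_safe = tp_safe / (tp_safe + fp_safe)    # 82287 / 86122
rec_safe = tp_safe / (tp_safe + fn_safe)     # 82287 / 83140
prec_unsafe = tn_safe / (tn_safe + fn_safe)  # 94899 / 95752
rec_unsafe = tn_safe / (tn_safe + fp_safe)   # 94899 / 98734
accuracy = (tp_safe + tn_safe) / (tp_safe + fn_safe + fp_safe + tn_safe)

print(f"{prec_safe:.4f} {rec_safe:.4f} {prec_unsafe:.4f} {rec_unsafe:.4f} {accuracy:.4f}")
# -> 0.9555 0.9897 0.9911 0.9612 0.9742
```

These reproduce the per-class table and the reported test accuracy to four decimals.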

Per-dialect breakdown available in per_dialect.json in the corresponding results folder.

Training Setup

  • Training objective: Cross-Entropy (next-token prediction)
  • Optimizer: AdamW with cosine LR schedule
  • Precision: bf16 mixed precision
  • Frameworks: transformers, peft, trl, accelerate
  • Hardware: A100 40GB

Dataset

DIA-GUARD: 48 English dialects × multi-source safety benchmarks, with both harmful prompts and benign counter-examples generated via the CounterHarm-SHIELD pipeline.

  • ~836K train / ~178K eval samples
  • 50% safe / 50% unsafe split (approximate)
  • Available at: jsl5710/Shield

Citation

```bibtex
@misc{diaguard2026,
  title         = {DIA-GUARD: Dialect-Informed Adversarial Guard for LLM Safety},
  author        = {Jason Lucas et al.},
  year          = {2026},
  howpublished  = {\url{https://github.com/jsl5710/dia-guard}}
}
```

Limitations

  • The model inherits the limitations and biases of the base model
  • Trained primarily on English dialects β€” performance on non-English text is not guaranteed
  • Should not be used as the sole safety mechanism in production systems

License

This model is released under the Apache 2.0 license, inherited from the base model. Please review the base model's license before use.
