balidea-attack-detector-es-gl-v1

Multilingual safeguard classifier fine-tuned with PEFT/LoRA. Part of the balidea-peft campaign 2026-05-10 family of Spanish (es) and Galician (gl) safeguard models for clinical conversational systems.

Status: v1-beta

⚠️ Beta deployment notice

Galician precision is below the production gate. On the held-out clinical benchmark, attack-positive GL precision is 0.56 at the model's best operating point, while ES precision is 0.83 (passes the gate). Route GL traffic through deterministic and keyword filters first rather than relying on this model alone. ES traffic is closer to production-ready.
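The layered setup recommended in this notice could look roughly like the sketch below. It is only an illustration: the regex patterns, the `route` function, and the three-way `block` / `review` / `allow` outcome are hypothetical and not part of this model or its campaign. The per-language policy (block on ES model positives, send GL model positives to review) is one possible reading of the caveat, not a documented rule.

```python
import re

THRESHOLD = 0.9550  # production decision threshold from this card

# Hypothetical deterministic patterns; a real deployment maintains its own list.
BLOCK_PATTERNS = [
    re.compile(r"ignora (todas )?(las|as) instruc", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]


def rule_hit(text: str) -> bool:
    """True if any deterministic pattern matches the text."""
    return any(p.search(text) for p in BLOCK_PATTERNS)


def route(text: str, lang: str, attack_prob: float) -> str:
    """Combine rule filters with the classifier score.

    attack_prob is the positive-class probability produced by the model.
    """
    if rule_hit(text):
        return "block"  # deterministic filters decide first
    if attack_prob >= THRESHOLD:
        # Assumption: GL model positives are not trusted on their own.
        return "block" if lang == "es" else "review"
    return "allow"
```

With this arrangement a GL message is blocked automatically only when a deterministic rule fires, and a model-only positive is escalated instead of rejected outright.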

Production decision threshold

Use T = 0.9550 for inference. This threshold was recalibrated on the held-out production_benchmark_v2.csv slice and checked for stability with 200 bootstrap resamples.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "JMasr/balidea-attack-detector-es-gl-v1"
THRESHOLD = 0.9550  # production decision threshold

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

texts = ["your text here"]
enc = tok(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    # Column 1 of the softmax output is the positive-class probability.
    probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]
predicted_positive = (probs >= THRESHOLD).tolist()

Threshold 95% CI from bootstrap: [0.7125, 0.9550]

ROC-AUC on full benchmark slice: 0.8509 (threshold-free)
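A bootstrap stability check of this kind might be implemented along the following lines. This is a minimal sketch, not the campaign's actual procedure: the selection criterion (maximum F1 over a fixed grid), the grid spacing, the percentile interval, and all function names are assumptions, since the card does not state how the threshold was chosen.

```python
import random


def f1_at(labels, probs, t):
    """Positive-class F1 when predicting positive for prob >= t."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= t)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= t)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(labels, probs, grid):
    """Grid threshold with the highest F1 (first maximum wins on ties)."""
    return max(grid, key=lambda t: f1_at(labels, probs, t))


def bootstrap_threshold_ci(labels, probs, n_boot=200, alpha=0.05, seed=42):
    """Percentile interval of the best threshold over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    grid = [i / 200 for i in range(1, 200)]  # assumed grid: 0.005 steps
    picks = []
    for _ in range(n_boot):
        # Resample rows with replacement, then re-select the threshold.
        idx = [rng.randrange(n) for _ in range(n)]
        picks.append(
            best_threshold([labels[i] for i in idx], [probs[i] for i in idx], grid)
        )
    picks.sort()
    lo = picks[int((alpha / 2) * (n_boot - 1))]
    hi = picks[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi
```

A wide interval, such as the one reported above, indicates that the selected threshold moves substantially between resamples, which is typical for a small benchmark slice.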

Benchmark v2 holdout metrics (at production threshold)

Metric	Value
F1	0.8571
Recall (positive)	1.0000
Precision (positive)	0.7500
FPR	0.2857
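For reference, the four quantities in this table follow from confusion counts at a fixed threshold. The sketch below uses the standard definitions. The function names and the toy inputs are illustrative only and are not the benchmark data.

```python
def binary_metrics(labels, probs, threshold):
    """Precision, recall, F1 and FPR for the positive class at one threshold."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for y, hit in zip(labels, preds) if y == 1 and hit)
    fp = sum(1 for y, hit in zip(labels, preds) if y == 0 and hit)
    fn = sum(1 for y, hit in zip(labels, preds) if y == 1 and not hit)
    tn = sum(1 for y, hit in zip(labels, preds) if y == 0 and not hit)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Consistency check on the reported values: precision 0.75 and recall 1.0
# give F1 = 1.5 / 1.75, which rounds to the 0.8571 listed in the table.
reported_f1 = round(f1_from(0.75, 1.0), 4)

# Toy inputs (not benchmark data) to show the call shape.
toy = binary_metrics([1, 1, 0, 0], [0.99, 0.50, 0.97, 0.10], threshold=0.9550)
```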

Own held-out test metrics (at training-time calibrated threshold 0.9875)

Metric	Value
F1	0.9470
Recall (positive)	0.9571
Precision (positive)	0.8460
ROC-AUC	0.9930

Per-language metrics (own test, calibrated threshold)

Metric	ES	GL
F1	0.9579	0.9362
Recall+	0.9586	0.9554
Precision+	0.8827	0.8109
ROC-AUC	0.9945	0.9914

Model details

Field	Value
Base model	protect_ai-deberta-v3
Adapter	LoRA rank=32, α=64, dropout=0.1
Target modules	query_proj, key_proj, value_proj, dense
Languages	es, gl
Loss	cross_entropy
Epochs	20
Learning rate	0.00015
Seed	42
Dataset slug	`neuralchemy-attack_balidea-malign-attack-es_attack-local-upgrade-pos-vs-neuralchemy-benign_balidea-malign-benign-es_medquad-qa_squad-qa_alpaca-instructions_attack-local-upgrade-neg-es+gl-7b50e61df076`
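The adapter settings in the table map onto a PEFT configuration roughly as follows. This is a sketch under assumptions: the values are copied from the table, but the `task_type` choice, the dictionary names, and the `build_lora_config` helper are illustrative, and the campaign's actual training script is not shown on this card.

```python
# Values copied from the "Model details" table.
LORA_HPARAMS = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "target_modules": ["query_proj", "key_proj", "value_proj", "dense"],
}
TRAIN_HPARAMS = {"num_train_epochs": 20, "learning_rate": 1.5e-4, "seed": 42}


def build_lora_config():
    """Build a peft.LoraConfig for sequence classification.

    The import is local so the hyperparameter dictionaries can be used
    without peft installed. task_type is an assumption based on the model
    being a classifier.
    """
    from peft import LoraConfig, TaskType

    return LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_HPARAMS)
```

With α=64 and rank=32, the LoRA update is scaled by α / r = 2.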

Training campaign

This model is part of the 2026-05-10 campaign run. The campaign rebuilt the benchmark from scratch (contamination-free), recovered labels for ~75k Spanish/Galician translated rows of mental-health text, and used Claude-authored clinical-style native ES/GL seeds. See the project repository for the full campaign report.

License

Apache 2.0 (model artifact). Base model and training data carry their own licenses; consult the upstream sources before commercial deployment.


Safetensors

Field	Value
Model size	0.2B params
Tensor type	F32