balidea-attack-detector-es-gl-v1

Multilingual safeguard classifier fine-tuned with PEFT/LoRA. It belongs to the balidea-peft campaign (2026-05-10) family of Spanish (es) and Galician (gl) safeguard models for clinical conversational systems.

Status: v1-beta

⚠️ Beta deployment notice

Galician precision is below the production gate. On the held-out clinical benchmark, attack-positive GL precision is 0.56 at the model's best operating point, while ES precision is 0.83 (passes the gate). Route GL traffic through deterministic and keyword filters rather than relying on this model alone. ES traffic is closer to production-ready.
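One way to act on this caveat is to route by language, so that a GL positive from the model alone never blocks a message. The sketch below is illustrative only: the `route` function, the `GL_BLOCK_PATTERNS` list, and the "review" outcome for model-only GL positives are assumptions, not part of the released model or the campaign's filters.

```python
import re

THRESHOLD = 0.9550  # production decision threshold from this card

# Hypothetical patterns for illustration; a real deployment maintains its own list.
GL_BLOCK_PATTERNS = [
    re.compile(r"ignora\s+as\s+instruci[oó]ns", re.IGNORECASE),
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
]


def route(text: str, lang: str, prob_attack: float) -> str:
    """Return 'block', 'review', or 'allow' for a single message."""
    model_positive = prob_attack >= THRESHOLD
    if lang == "gl":
        # Deterministic keyword filters decide first for Galician traffic.
        if any(p.search(text) for p in GL_BLOCK_PATTERNS):
            return "block"
        # GL precision is low, so a model-only positive is escalated, not blocked.
        return "review" if model_positive else "allow"
    # Spanish traffic relies on the model score alone.
    return "block" if model_positive else "allow"
```

Here `prob_attack` is the attack-positive probability produced by the classifier; how "review" cases are handled (human triage, a second model, logging) is left to the deployment.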

Production decision threshold

Use T = 0.9550 for inference. This threshold was recalibrated against the held-out production_benchmark_v2.csv slice with a 200-bootstrap stability check.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "JMasr/balidea-attack-detector-es-gl-v1"
THRESHOLD = 0.9550  # production decision threshold

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

texts = ["your text here"]
with torch.no_grad():
    enc = tok(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
    # Column 1 of the softmax output is used as the attack-positive probability.
    probs = torch.softmax(model(**enc).logits, dim=-1)[:, 1]
# True where a text is flagged as an attack at the production threshold.
predicted_positive = (probs >= THRESHOLD).tolist()

Threshold 95% CI from bootstrap: [0.7125, 0.9550]

ROC-AUC on full benchmark slice: 0.8509 (threshold-free)
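A bootstrap interval for a decision threshold can be obtained by resampling the labelled scores with replacement, re-selecting the threshold on each resample, and taking percentiles of the selected values. The sketch below shows that idea in plain Python. The selection rule (maximise F1 over a 0.01-step grid, ties going to the higher threshold) is an assumption; this card does not state which criterion the campaign used.

```python
import random


def f1_at(scores, labels, t):
    """F1 of the positive class when predicting positive for score >= t."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores, labels, grid):
    # Assumed rule: highest F1, ties broken in favour of the higher threshold.
    return max(grid, key=lambda t: (f1_at(scores, labels, t), t))


def bootstrap_threshold_ci(scores, labels, n_boot=200, seed=42, grid=None):
    """Percentile 95% interval of the threshold selected on each resample."""
    grid = grid or [i / 100 for i in range(1, 100)]
    rng = random.Random(seed)
    n = len(scores)
    picks = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        picks.append(
            best_threshold([scores[i] for i in idx], [labels[i] for i in idx], grid)
        )
    picks.sort()
    lo = picks[int(0.025 * (n_boot - 1))]
    hi = picks[int(0.975 * (n_boot - 1))]
    return lo, hi
```

A wide interval, such as the one reported above, indicates that the selected threshold is sensitive to which benchmark rows happen to be sampled.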

Benchmark v2 holdout metrics (at production threshold)

| Metric | Value |
|---|---|
| F1 | 0.8571 |
| Recall (positive) | 1.0000 |
| Precision (positive) | 0.7500 |
| FPR | 0.2857 |
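These four numbers all derive from the same confusion counts. As a worked check, the sketch below computes them from counts. The counts used (TP=6, FP=2, TN=5, FN=0) are an assumption: they are one small set of integers consistent with the reported values, not the actual benchmark counts, which this card does not give.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, FPR and F1 for the positive class, rounded to 4 places."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "fpr": round(fpr, 4),
        "f1": round(f1, 4),
    }


# Illustrative counts only (assumed, see the note above).
metrics = binary_metrics(tp=6, fp=2, tn=5, fn=0)
print(metrics)
# {'precision': 0.75, 'recall': 1.0, 'fpr': 0.2857, 'f1': 0.8571}
```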

Own held-out test metrics (at training-time calibrated threshold 0.9875)

| Metric | Value |
|---|---|
| F1 | 0.9470 |
| Recall (positive) | 0.9571 |
| Precision (positive) | 0.8460 |
| ROC-AUC | 0.9930 |

Per-language metrics (own test, calibrated threshold)

| Metric | ES | GL |
|---|---|---|
| F1 | 0.9579 | 0.9362 |
| Recall+ | 0.9586 | 0.9554 |
| Precision+ | 0.8827 | 0.8109 |
| ROC-AUC | 0.9945 | 0.9914 |
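A per-language breakdown like this one can be reproduced on your own labelled data by grouping predictions by language before counting. The following is a minimal sketch. The `(lang, prob_attack, label)` record layout and the function name are assumptions, and the sample records are invented for illustration.

```python
from collections import defaultdict


def per_language_precision_recall(records, threshold):
    """records: iterable of (lang, prob_attack, label), with label 1 = attack."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for lang, prob, label in records:
        c = counts[lang]  # touching the key keeps languages with no positives
        predicted = prob >= threshold
        if predicted and label == 1:
            c["tp"] += 1
        elif predicted and label == 0:
            c["fp"] += 1
        elif not predicted and label == 1:
            c["fn"] += 1
    result = {}
    for lang, c in counts.items():
        flagged = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        result[lang] = {
            "precision": c["tp"] / flagged if flagged else 0.0,
            "recall": c["tp"] / actual if actual else 0.0,
        }
    return result


# Invented sample records, scored at the training-time threshold from this card.
sample = [
    ("es", 0.990, 1), ("es", 0.990, 0), ("es", 0.500, 0),
    ("gl", 0.995, 1), ("gl", 0.200, 1),
]
by_lang = per_language_precision_recall(sample, threshold=0.9875)
```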

Model details

| Field | Value |
|---|---|
| Base model | protect_ai-deberta-v3 |
| Adapter | LoRA rank=32, α=64, dropout=0.1 |
| Target modules | query_proj, key_proj, value_proj, dense |
| Languages | es, gl |
| Loss | cross_entropy |
| Epochs | 20 |
| Learning rate | 0.00015 |
| Seed | 42 |
| Dataset slug | neuralchemy-attack_balidea-malign-attack-es_attack-local-upgrade-pos-vs-neuralchemy-benign_balidea-malign-benign-es_medquad-qa_squad-qa_alpaca-instructions_attack-local-upgrade-neg-es+gl-7b50e61df076 |
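The adapter settings in the table map directly onto a LoRA configuration. The sketch below assumes the Hugging Face `peft` library was used and that the task type was sequence classification (inferred from the inference snippet in this card, not stated in the table). It falls back to a plain dictionary when `peft` is not installed.

```python
# Adapter hyperparameters taken from the table above.
lora_kwargs = dict(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["query_proj", "key_proj", "value_proj", "dense"],
    task_type="SEQ_CLS",  # assumption: sequence classification
)

# LoRA scales the low-rank update by alpha / r, which is 2.0 with these values.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

try:
    from peft import LoraConfig  # assumption: the campaign used the peft library

    lora_config = LoraConfig(**lora_kwargs)
except ImportError:
    lora_config = None  # peft not available; lora_kwargs still documents the setup
```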

Training campaign

This model is part of the 2026-05-10 campaign run. The campaign rebuilt the benchmark from scratch (contamination-free), recovered labels for ~75k Spanish + Galician translated rows of mental-health text, and used Claude-authored clinical-style native ES/GL seeds. See the project repository for the full campaign report.

License

Apache 2.0 (model artifact). Base model and training data carry their own licenses; consult the upstream sources before commercial deployment.

Model size: 0.2B params (Safetensors, tensor type F32)