# gliner-pii-silver-v1

## Model Summary

This GLiNER model is trained for multilingual NER/PII detection using LLM-annotated (silver) data.

## Usage (Python)
```python
import json
from pathlib import Path

import torch
from gliner import GLiNER
from huggingface_hub import hf_hub_download

model_path = "betterdataai/gliner-pii-silver-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = GLiNER.from_pretrained(model_path, map_location=device)
model = model.to(device)

# Load the canonical labels used for evaluation/training.
schema_path = Path(model_path, "label_schema.json")
if schema_path.exists():
    schema = json.loads(schema_path.read_text(encoding="utf-8"))
else:
    schema_file = hf_hub_download(repo_id=model_path, filename="label_schema.json")
    with open(schema_file, "r", encoding="utf-8") as f:
        schema = json.load(f)
labels = [x["name"] for x in schema]  # label_schema.json is a list of objects

texts = [
    "Contact John Doe at john.doe@example.com or +1 (415) 555-2671.",
    "Ship to: 1600 Amphitheatre Parkway, Mountain View, CA 94043.",
]

preds = model.batch_predict_entities(
    texts,
    labels,
    threshold=0.6,
    flat_ner=True,
    multi_label=False,
)

for text, pred in zip(texts, preds):
    print(text)
    for ent in pred:
        print(ent)
```
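The printed entities can be post-processed into a more convenient shape. As a sketch, assuming GLiNER's usual entity-dict keys (`text`, `label`, `start`, `end`, `score`), a small helper can group detected spans by predicted label:

```python
from collections import defaultdict


def group_by_label(entities):
    """Group GLiNER entity dicts by their predicted label."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)
```

This makes it easy to, for example, redact all spans of a given PII type in one pass.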
## Training Data

- Total records: 387,736
- Train/Validation/Test: 348,958 / 19,382 / 19,396
- Label coverage: 84 of the 88 schema labels occur in the data
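The coverage figure can be reproduced with a helper along these lines (a sketch; `schema_labels` would come from label_schema.json and `observed_labels` from the annotated records):

```python
def label_coverage(schema_labels, observed_labels):
    """Report how many schema labels actually occur in the data."""
    covered = set(schema_labels) & set(observed_labels)
    return f"{len(covered)} / {len(schema_labels)}"
```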
## Training Setup

- Base model: urchade/gliner_multi-v2.1
- Max sequence length: 384
- Max span width: 12
- Train batch size: 16
- Eval batch size: 8
- Gradient accumulation: 4
- Learning rate: 1e-05
- Epochs: 2
- Eval threshold: 0.6
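With gradient accumulation, the effective batch size and approximate optimizer step count follow directly from the numbers above:

```python
import math

train_examples = 348_958  # size of the train split
per_device_batch = 16     # train batch size
grad_accum = 4            # gradient accumulation steps
epochs = 2

effective_batch = per_device_batch * grad_accum               # 64
steps_per_epoch = math.ceil(train_examples / effective_batch)  # 5453
total_steps = steps_per_epoch * epochs                         # 10906
```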
## Evaluation
- Evaluated on: data/splits/test.jsonl
- Precision: 0.2837
- Recall: 0.3189
- F1: 0.3003
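The reported F1 is the standard harmonic mean of precision and recall, which can be checked against the numbers above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


round(f1_score(0.2837, 0.3189), 4)  # 0.3003
```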
## Performance Comparison
| Model | Precision | Recall | F1 |
|---|---|---|---|
| gliner-pii-silver-v1 | 0.2837 | 0.3189 | 0.3003 |
| urchade/gliner_multi-v2.1 | 0.1707 | 0.0006 | 0.0011 |
| nvidia/gliner-PII | 0.0851 | 0.2973 | 0.1323 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.1410 | 0.1296 | 0.1351 |
| knowledgator/gliner-pii-base-v1.0 | 0.1520 | 0.1284 | 0.1392 |
## Files

- Model artifacts: gliner_config.json, pytorch_model.bin, tokenizer.json, tokenizer_config.json
- Evaluation artifacts: metrics.json, benchmark_metrics.json
- Metadata: label_schema.json, training_config.json
## Limitations
- LLM-annotated data may contain noise.
- Some labels remain sparse across languages.