# gliner-pii-silver-v1

## Model Summary

This GLiNER model is trained for multilingual NER/PII detection using LLM-annotated (silver) data.

## Usage (Python)
```python
import json
from pathlib import Path

import torch
from gliner import GLiNER
from huggingface_hub import hf_hub_download

model_path = "betterdataai/gliner-pii-silver-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = GLiNER.from_pretrained(model_path, map_location=device)
model = model.to(device)

# Load the canonical labels used for evaluation/training.
schema_path = Path(model_path, "label_schema.json")
if schema_path.exists():
    schema = json.loads(schema_path.read_text(encoding="utf-8"))
else:
    schema_file = hf_hub_download(repo_id=model_path, filename="label_schema.json")
    with open(schema_file, "r", encoding="utf-8") as f:
        schema = json.load(f)
labels = [x["name"] for x in schema]  # label_schema.json is a list of objects

texts = [
    "Contact John Doe at john.doe@example.com or +1 (415) 555-2671.",
    "Ship to: 1600 Amphitheatre Parkway, Mountain View, CA 94043.",
]

preds = model.batch_predict_entities(
    texts,
    labels,
    threshold=0.6,
    flat_ner=True,
    multi_label=False,
)

for text, pred in zip(texts, preds):
    print(text)
    for ent in pred:
        print(ent)
```
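The printed entities can be post-processed into a more convenient shape. As a sketch, assuming GLiNER's usual entity-dict keys (`text`, `label`, `start`, `end`, `score`), a small helper can group detected spans by predicted label:

```python
from collections import defaultdict


def group_by_label(entities):
    """Group GLiNER entity dicts by their predicted label."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)
```

This makes it easy to, for example, redact all spans of a given PII type in one pass.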
## Training Data

- Total records: 387,736
- Train/Validation/Test: 348,958 / 19,382 / 19,396
- Label coverage: 84 of the 88 schema labels occur in the data
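The coverage figure can be reproduced with a helper along these lines (a sketch; `schema_labels` would come from label_schema.json and `observed_labels` from the annotated records):

```python
def label_coverage(schema_labels, observed_labels):
    """Report how many schema labels actually occur in the data."""
    covered = set(schema_labels) & set(observed_labels)
    return f"{len(covered)} / {len(schema_labels)}"
```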
## Training Setup

- Base model: urchade/gliner_multi-v2.1
- Max sequence length: 384
- Max span width: 12
- Train batch size: 16
- Eval batch size: 8
- Gradient accumulation: 4
- Learning rate: 1e-05
- Epochs: 2
- Eval threshold: 0.6
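With gradient accumulation, the effective batch size and approximate optimizer step count follow directly from the numbers above:

```python
import math

train_examples = 348_958  # size of the train split
per_device_batch = 16     # train batch size
grad_accum = 4            # gradient accumulation steps
epochs = 2

effective_batch = per_device_batch * grad_accum               # 64
steps_per_epoch = math.ceil(train_examples / effective_batch)  # 5453
total_steps = steps_per_epoch * epochs                         # 10906
```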
## Evaluation
- Evaluated on: data/splits/test.jsonl
- Precision: 0.2837
- Recall: 0.3189
- F1: 0.3003
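The reported F1 is the standard harmonic mean of precision and recall, which can be checked against the numbers above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


round(f1_score(0.2837, 0.3189), 4)  # 0.3003
```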
## Performance Comparison
| Model | Precision | Recall | F1 |
|---|---|---|---|
| gliner-pii-silver-v1 | 0.2837 | 0.3189 | 0.3003 |
| urchade/gliner_multi-v2.1 | 0.1707 | 0.0006 | 0.0011 |
| nvidia/gliner-PII | 0.0851 | 0.2973 | 0.1323 |
| gretelai/gretel-gliner-bi-base-v1.0 | 0.1410 | 0.1296 | 0.1351 |
| knowledgator/gliner-pii-base-v1.0 | 0.1520 | 0.1284 | 0.1392 |
## Files

- Model artifacts: gliner_config.json, pytorch_model.bin, tokenizer.json, tokenizer_config.json
- Evaluation artifacts: metrics.json, benchmark_metrics.json
- Metadata: label_schema.json, training_config.json
## Limitations
- LLM-annotated data may contain noise.
- Some labels remain sparse across languages.