gliner-pii-silver-v1

Model Summary

This GLiNER model is fine-tuned from urchade/gliner_multi-v2.1 for multilingual named-entity recognition (NER) and PII detection, using LLM-annotated training data.

Usage (Python)

import json
from pathlib import Path

import torch
from gliner import GLiNER
from huggingface_hub import hf_hub_download

model_path = "betterdataai/gliner-pii-silver-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = GLiNER.from_pretrained(model_path, map_location=device)
if device == "cuda":
    model = model.to(device)

# Load the canonical labels used for evaluation/training.
# Prefer a local copy (when model_path is a directory); otherwise fetch it from the Hub.
schema_path = Path(model_path, "label_schema.json")
if schema_path.exists():
    schema = json.loads(schema_path.read_text(encoding="utf-8"))
else:
    schema_file = hf_hub_download(repo_id=model_path, filename="label_schema.json")
    # read_text closes the file handle, unlike a bare open() passed to json.load
    schema = json.loads(Path(schema_file).read_text(encoding="utf-8"))
labels = [x["name"] for x in schema]  # label_schema.json is a list of objects

texts = [
    "Contact John Doe at john.doe@example.com or +1 (415) 555-2671.",
    "Ship to: 1600 Amphitheatre Parkway, Mountain View, CA 94043.",
]

preds = model.inference(
    texts,
    labels,
    batch_size=8,
    threshold=0.6,
    flat_ner=True,
    multi_label=False,
)

for text, pred in zip(texts, preds):
    print(text)
    for ent in pred:
        print(ent)
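
Since the model targets PII, a common next step is to mask the detected spans. The sketch below assumes each prediction is a dict with "start", "end", and "label" keys (the usual GLiNER output shape). The redact helper, the sample entities, and the label names "name" and "email" are hypothetical and written by hand for illustration; the real label names come from label_schema.json.

```python
def redact(text, entities):
    """Replace each predicted span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    # Walk spans left to right so character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip a span that overlaps one already redacted
        pieces.append(text[cursor:ent["start"]])
        pieces.append(f"[{ent['label'].upper()}]")
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-written entities (not real model output), using hypothetical label names.
sample_text = "Contact John Doe at john.doe@example.com."
sample_entities = [
    {"start": 8, "end": 16, "text": "John Doe", "label": "name", "score": 0.9},
    {"start": 20, "end": 40, "text": "john.doe@example.com", "label": "email", "score": 0.9},
]

print(redact(sample_text, sample_entities))  # Contact [NAME] at [EMAIL].
```

With real predictions, the same helper could be applied per item, for example redact(text, pred) inside the loop over zip(texts, preds).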

Training Data

  • Total records: 387736
  • Train/Validation/Test: 348958 / 19382 / 19396
  • Label coverage: 84 / 88
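
As a quick check on the split sizes above, the following sketch confirms that the three splits sum to the total and shows that they amount to roughly a 90/5/5 division. The variable names are illustrative only.

```python
total = 387_736
splits = {"train": 348_958, "validation": 19_382, "test": 19_396}

# The three splits should account for every record.
assert sum(splits.values()) == total

shares = {name: count / total for name, count in splits.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```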

Training Setup

  • Base model: urchade/gliner_multi-v2.1
  • Max length: 384
  • Max width: 12
  • Train batch size: 16
  • Eval batch size: 8
  • Gradient accumulation: 4
  • Learning rate: 1e-05
  • Epochs: 2
  • Eval threshold: 0.6
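
The batch size and gradient accumulation settings above combine into an effective batch size per optimizer update. The sketch below works through that arithmetic, assuming a single device (the device count is not stated in this card, so num_devices is an assumption). The update counts are approximate, because training frameworks differ in how they round the final partial batch.

```python
import math

train_records = 348_958   # train split size from "Training Data"
train_batch_size = 16
grad_accum_steps = 4
epochs = 2
num_devices = 1           # assumption: not stated in this card

# Examples contributing to each optimizer update.
effective_batch = train_batch_size * grad_accum_steps * num_devices

# Approximate number of optimizer updates (rounding varies by framework).
updates_per_epoch = math.ceil(train_records / effective_batch)
total_updates = updates_per_epoch * epochs

print(effective_batch, updates_per_epoch, total_updates)
```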

Evaluation

  • Evaluated on: data/splits/test.jsonl
  • Precision: 0.2837
  • Recall: 0.3189
  • F1: 0.3003
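
The reported F1 is consistent with the harmonic mean of the precision and recall above. A minimal sketch of that check follows; how the scores are aggregated (for example, micro versus macro averaging, or the span-matching rule) is not stated in this card, and the f1_score helper is illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.2837, 0.3189), 4))  # 0.3003
```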

Performance Comparison

Model                                  Precision   Recall   F1
gliner-pii-silver-v1                   0.2837      0.3189   0.3003
urchade/gliner_multi-v2.1              0.1707      0.0006   0.0011
nvidia/gliner-PII                      0.0851      0.2973   0.1323
gretelai/gretel-gliner-bi-base-v1.0    0.1410      0.1296   0.1351
knowledgator/gliner-pii-base-v1.0      0.1520      0.1284   0.1392

Files

  • Model artifacts: gliner_config.json, pytorch_model.bin, tokenizer.json, tokenizer_config.json.
  • Evaluation artifacts: metrics.json, benchmark_metrics.json.
  • Metadata: label_schema.json, training_config.json.

Limitations

  • LLM-annotated data may contain noise.
  • Some labels remain sparse across languages.