Multilingual NER Model for PII Detection


This model is a fine-tuned version of [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) on the WikiANN dataset for Named Entity Recognition (NER).

Model Description

  • Developed by: bohrariyanshi
  • Model type: Token Classification (NER)
  • Language(s): Multilingual (primarily English)
  • Base model: bert-base-multilingual-cased

Intended Uses & Limitations

Intended Uses

  • Named Entity Recognition for Person (PER), Organization (ORG), and Location (LOC)
  • Text analysis and information extraction
  • PII (Personally Identifiable Information) detection

Limitations

  • Trained on WikiANN (multilingual) but evaluated primarily on English subsets
  • May have lower performance on non-English text
  • Limited to PER, ORG, LOC entity types

Training Data

The model was fine-tuned on the WikiANN dataset:

  • Training examples: 20,000
  • Validation examples: 10,000
  • Test examples: 10,000
  • Entity types: PER (Person), ORG (Organization), LOC (Location)
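WikiANN annotates these three entity types with IOB2 tags (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC). As a minimal sketch, assuming that 7-label scheme, the tags can be decoded into entity spans like this (`bio_to_spans` is an illustrative helper, not part of the model's code):

```python
# Sketch: decoding WikiANN-style IOB2 tags into entity spans.
# Assumes the 7-label scheme O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC.

def bio_to_spans(tokens, tags):
    """Group (token, tag) pairs into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])        # start a new entity
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)            # continue the open entity
        else:
            if current:
                spans.append(current)
            current = None                      # "O" or an invalid continuation
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Barack Obama'), ('LOC', 'Paris')]
```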

Training Procedure

Training Hyperparameters

  • Learning rate: 2e-5
  • Training epochs: 3
  • Batch size: 16
  • Max sequence length: 256
  • Optimizer: AdamW
  • Weight decay: 0.01
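As a configuration sketch, the hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows (`output_dir` is an assumed placeholder; the training code itself is not published with the card):

```python
# Config sketch only: the hyperparameters listed above, expressed as
# transformers.TrainingArguments. output_dir is an assumption.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="pii-ner-extraction",     # assumed, not from the card
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,                   # applied by AdamW, the default optimizer
)

# Max sequence length (256) is set on the tokenizer side, e.g.:
# tokenizer(text, truncation=True, max_length=256)
```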

Performance

The model produces high-confidence predictions on standard NER inputs:

  • High-confidence predictions (score > 90%): 19 of 21 entities in the author's test cases
  • Average inference time: ~264 ms per sentence
  • Entity types detected: PER, ORG, and LOC

Usage

from transformers import pipeline

# Load the model
ner = pipeline("ner", model="bohrariyanshi/pii-ner-extraction", aggregation_strategy="simple")

# Example usage
text = "Barack Obama was born in Hawaii."
entities = ner(text)
print(entities)
# Output: [{'entity_group': 'PER', 'score': 0.968, 'word': 'Barack Obama', 'start': 0, 'end': 12}, ...]
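Since the intended use is PII detection, a common next step is redacting the detected spans. The sketch below assumes entity dicts in the format shown above (`entity_group`, `score`, `start`, `end`); `redact` and its `min_score` threshold are illustrative helpers, not part of the model:

```python
# Sketch: redacting detected PII spans, given pipeline-style output
# (dicts with 'entity_group', 'score', 'start', 'end' as shown above).

def redact(text, entities, min_score=0.5):
    """Replace each detected span with a [TYPE] placeholder."""
    out, cursor = [], 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["score"] < min_score:
            continue                         # skip low-confidence detections
        out.append(text[cursor:ent["start"]])
        out.append(f"[{ent['entity_group']}]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)

text = "Barack Obama was born in Hawaii."
entities = [
    {"entity_group": "PER", "score": 0.968, "start": 0, "end": 12},
    {"entity_group": "LOC", "score": 0.951, "start": 25, "end": 31},
]
print(redact(text, entities))
# [PER] was born in [LOC].
```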

Model Architecture

  • Base: BERT-base-multilingual-cased
  • Parameters: 177M
  • Architecture: Transformer with token classification head
  • Task: Named Entity Recognition (NER)

Evaluation Results

The fine-tuned model clearly outperforms the base bert-base-multilingual-cased checkpoint, whose token classification head is randomly initialized:

  • Confident predictions: 19 high-confidence entities vs. 0 for the base checkpoint
  • Precision: entity detections were accurate in the author's test cases (no held-out benchmark F1 is reported)
  • Speed: ~264 ms per sentence, acceptable for many production uses
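A per-sentence latency figure like the ~264 ms above can be reproduced with a simple timing loop. In this sketch, `avg_latency_ms` is an illustrative helper and the built-in `len` stands in for the real pipeline call; substitute the loaded `ner` pipeline to measure the actual model:

```python
# Sketch: measuring average per-sentence inference latency.
# `fn` is any callable taking one sentence; pass the real `ner`
# pipeline in place of the trivial `len` stand-in used here.
import time

def avg_latency_ms(fn, inputs, warmup=1):
    for _ in range(warmup):
        fn(inputs[0])                        # warm-up calls, excluded from timing
    start = time.perf_counter()
    for sentence in inputs:
        fn(sentence)
    return (time.perf_counter() - start) * 1000 / len(inputs)

sentences = ["Barack Obama was born in Hawaii."] * 5
print(f"{avg_latency_ms(len, sentences):.3f} ms per sentence")
```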

Environmental Impact

Training was performed on a Google Colab T4 GPU for a short duration (fine-tuning only).
The overall environmental impact is minimal compared to large-scale pretraining runs.

Citation

If you use this model, please cite:

@misc{bohrariyanshi-pii-ner-extraction,
  author = {bohrariyanshi},
  title = {Multilingual NER Model for PII Detection},
  year = {2025},
  url = {https://huggingface.co/bohrariyanshi/pii-ner-extraction}
}