
DeBERTa-v3-xsmall for Sequence Classification

A fine-tuned DeBERTa-v3-xsmall model that classifies short text sequences into seven categories of sensitive data.

Model Description

This model achieves a 95.9% macro F1 score classifying text sequences into the following categories:

  • NAME_FIRST - First names
  • NAME_LAST - Last names
  • GEO_STREET_NAME - Street addresses
  • GEO_CITY_NAME - City names
  • PROFESSION_JOB_TITLE - Job titles
  • PROFESSION_EMPLOYER - Company/employer names
  • MEDICAL_ALLERGY - Medical allergies
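
If you need the label names in code, they can be kept in a plain list. Note that the alphabetical ordering below is only an assumption (it is what scikit-learn's LabelEncoder would produce); the authoritative index-to-label mapping is the label_encoder.pkl file described under Usage.

# Assumed index order (alphabetical, as scikit-learn's LabelEncoder sorts labels);
# verify against the released label_encoder.pkl before relying on it.
LABELS = [
    "GEO_CITY_NAME",
    "GEO_STREET_NAME",
    "MEDICAL_ALLERGY",
    "NAME_FIRST",
    "NAME_LAST",
    "PROFESSION_EMPLOYER",
    "PROFESSION_JOB_TITLE",
]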

Performance

F1 Scores by Category:

  • GEO_STREET_NAME: 99.8% (9.8 points above the 90% target)
  • PROFESSION_JOB_TITLE: 99.5% (9.5 points above target)
  • MEDICAL_ALLERGY: 99.5% (9.5 points above target)
  • PROFESSION_EMPLOYER: 98.7% (8.7 points above target)
  • GEO_CITY_NAME: 97.2% (7.2 points above target)
  • NAME_FIRST: 88.8% (1.2 points below target)
  • NAME_LAST: 87.9% (2.1 points below target)

Overall Performance:

  • Macro F1 Score: 95.9%
  • Overall Accuracy: 95.9%
  • Categories Meeting 90% Target: 5/7
  • Processing Speed: 29.5 cells/second
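
Macro F1 averages the per-class F1 scores with equal weight, so the two weaker NAME_* classes pull the average down exactly as much as any other class would. A minimal sketch of how these metrics are computed, assuming scikit-learn and placeholder label ids:

from sklearn.metrics import accuracy_score, f1_score

# y_true / y_pred are hypothetical label ids for illustration only
y_true = [0, 1, 2, 3, 3, 4, 5, 6]
y_pred = [0, 1, 2, 3, 4, 4, 5, 6]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
accuracy = accuracy_score(y_true, y_pred)
per_class = f1_score(y_true, y_pred, average=None)    # one F1 score per class

print(f"Macro F1: {macro_f1:.3f}, Accuracy: {accuracy:.3f}")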

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import joblib  # only needed if you load the separate label encoder below
import torch

# Load model and tokenizer
model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# For the label encoder, you'll need to download it separately
# label_encoder = joblib.load("label_encoder.pkl")

# Example inference
text = "John Smith"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_idx = torch.argmax(probabilities, dim=-1).item()
    confidence = torch.max(probabilities).item()

print(f"Predicted class index: {predicted_class_idx}")
print(f"Confidence: {confidence:.3f}")

Training Details

  • Base Model: microsoft/deberta-v3-xsmall (70M parameters)
  • Training Framework: PyTorch with Hugging Face Transformers
  • Optimization: Class weighting for imbalanced data (see the sketch after this list)
  • Max Sequence Length: 64 tokens
  • Epochs: 2 with early stopping
  • Batch Size: 16
  • Learning Rate: 3e-05
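
The exact class-weighting recipe isn't published. One common approach, sketched here under that assumption, is to derive balanced weights from the training label frequencies and apply them in a weighted cross-entropy loss via a custom Trainer:

import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

# train_labels is a hypothetical array of integer label ids for the training set
train_labels = np.array([0, 0, 0, 1, 2, 3, 4, 5, 6])
weights = compute_class_weight("balanced", classes=np.unique(train_labels), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    # Override loss computation to apply per-class weights
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# WeightedTrainer is then used in place of Trainer with the usual TrainingArguments.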

Intended Use

This model is designed for:

  • Sensitive data detection in text sequences
  • PII (Personally Identifiable Information) classification
  • Data governance and compliance applications
  • Privacy-focused text analysis
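
For the bulk-scanning use cases above, batching inputs is what drives throughput figures like the one quoted under Performance. A minimal batched sketch, reusing the tokenizer, model, and torch imports from the Usage section and a hypothetical list of cell values:

# cells is a hypothetical batch of text values to scan
cells = ["John", "Smith", "123 Main Street", "software engineer"]

inputs = tokenizer(cells, return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.nn.functional.softmax(logits, dim=-1)
confidences, predictions = probs.max(dim=-1)

for text, idx, conf in zip(cells, predictions.tolist(), confidences.tolist()):
    print(f"{text!r} -> class {idx} (confidence {conf:.2f})")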

Limitations

  • NAME_FIRST and NAME_LAST categories perform slightly below 90% F1 target
  • Model is trained on English text only
  • Performance may vary on data distributions that differ from the training set

Model Architecture

Based on the DeBERTa-v3-xsmall architecture, which features:

  • Enhanced relative position encoding
  • Disentangled attention mechanism (see the toy sketch after this list)
  • Efficiency-optimized design (roughly 35% fewer parameters than BERT-base)
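
To make "disentangled attention" concrete, here is a toy single-head sketch of the three-term attention score from the DeBERTa paper (content-to-content, content-to-position, position-to-content). This is an illustrative simplification, not the actual implementation, which adds multi-head projections, relative-distance bucketing, and dropout:

import torch

d, L = 64, 8                      # toy head dimension and sequence length
Hc = torch.randn(L, d)            # content hidden states
Pr = torch.randn(2 * L, d)        # relative-position embeddings for distances in [-L, L)

Wq_c, Wk_c = torch.randn(d, d), torch.randn(d, d)   # content projections
Wq_r, Wk_r = torch.randn(d, d), torch.randn(d, d)   # position projections

Qc, Kc = Hc @ Wq_c, Hc @ Wk_c
Qr, Kr = Pr @ Wq_r, Pr @ Wk_r

# delta[i, j]: relative distance i - j, clipped and shifted into [0, 2L)
idx = torch.arange(L)
delta = (idx[:, None] - idx[None, :]).clamp(-L, L - 1) + L

c2c = Qc @ Kc.T                                   # content-to-content
c2p = (Qc @ Kr.T).gather(1, delta)                # content-to-position
p2c = (Kc @ Qr.T).gather(1, delta).T              # position-to-content

# Scores from all three terms, scaled by sqrt(3d) as in the paper
attn = torch.softmax((c2c + c2p + p2c) / (3 * d) ** 0.5, dim=-1)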

Citation

If you use this model, please cite:

@misc{deberta-sequence-classification-2025,
  title={DeBERTa-v3-xsmall for Sequence Classification},
  author={Sandy Yuan},
  year={2025},
  url={https://huggingface.co/sandyyuan/deberta-v3-xsmall-sequence-classification}
}