# DeBERTa-v3-xsmall for Sequence Classification

A fine-tuned DeBERTa-v3-xsmall model that classifies text sequences into 7 sensitive-data (PII) categories.

## Model Description

This model achieves a **95.9% macro F1 score** when classifying text sequences into the following categories:

- **NAME_FIRST** - First names
- **NAME_LAST** - Last names  
- **GEO_STREET_NAME** - Street addresses
- **GEO_CITY_NAME** - City names
- **PROFESSION_JOB_TITLE** - Job titles
- **PROFESSION_EMPLOYER** - Company/employer names
- **MEDICAL_ALLERGY** - Medical allergies

## Performance

### F1 Scores by Category (vs. 90% target):
- **GEO_STREET_NAME**: 99.8% (+9.8 points)
- **PROFESSION_JOB_TITLE**: 99.5% (+9.5 points)
- **MEDICAL_ALLERGY**: 99.5% (+9.5 points)
- **PROFESSION_EMPLOYER**: 98.7% (+8.7 points)
- **GEO_CITY_NAME**: 97.2% (+7.2 points)
- **NAME_FIRST**: 88.8% (-1.2 points, just below target)
- **NAME_LAST**: 87.9% (-2.1 points, just below target)

### Overall Performance:
- **Macro F1 Score**: 95.9%
- **Overall Accuracy**: 95.9%
- **Categories Meeting 90% Target**: 5/7
- **Processing Speed**: 29.5 cells/second
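
The reported numbers can be reproduced with standard scikit-learn metrics; macro F1 is the unweighted mean of the per-class F1 scores. A minimal sketch (the `y_true`/`y_pred` lists here are hypothetical placeholders for held-out evaluation outputs):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical held-out predictions (indices into the 7 classes);
# substitute real evaluation outputs here.
y_true = [0, 1, 2, 3, 4, 5, 6, 0]
y_pred = [0, 1, 2, 3, 4, 5, 6, 1]

print(f"Macro F1: {f1_score(y_true, y_pred, average='macro'):.3f}")
print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
```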

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer (the DeBERTa-v3 tokenizer also requires the
# sentencepiece package to be installed)
model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# The label encoder is distributed separately; once downloaded:
# import joblib
# label_encoder = joblib.load("label_encoder.pkl")

# Example inference
text = "John Smith"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_idx = torch.argmax(probabilities, dim=-1).item()
    confidence = probabilities.max().item()

print(f"Predicted class index: {predicted_class_idx}")
print(f"Confidence: {confidence:.3f}")
```
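
To turn the predicted index into a label string, either invert the separately distributed label encoder or read the `id2label` mapping that `transformers` configs carry (whether it was populated for this checkpoint is an assumption). Continuing the snippet above:

```python
# Option 1 (documented route): invert the separately downloaded label encoder.
# label = label_encoder.inverse_transform([predicted_class_idx])[0]

# Option 2 (assumption): read the config's id2label mapping, if it was
# populated at upload time (otherwise it falls back to generic "LABEL_i" names).
label = model.config.id2label[predicted_class_idx]
print(f"Predicted label: {label}")
```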

## Training Details

- **Base Model**: microsoft/deberta-v3-xsmall (70M parameters)
- **Training Framework**: PyTorch with Hugging Face Transformers
- **Optimization**: Class weighting for imbalanced data (see the sketch after this list)
- **Sequence Length**: 64 tokens (optimized)
- **Epochs**: 2 with early stopping
- **Batch Size**: 16
- **Learning Rate**: 3e-05
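
The exact training script is not published; the sketch below shows one standard way to wire class weighting into a Hugging Face `Trainer` via a weighted cross-entropy loss. The weight values are illustrative placeholders, not the ones used in training:

```python
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Trainer variant that weights the cross-entropy loss per class,
    a standard remedy for imbalanced label distributions."""

    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # FloatTensor, one weight per label

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

# Illustrative per-class weights (e.g., inverse class frequency) for the
# 7 labels; the actual values used in training are not published.
class_weights = torch.tensor([1.0, 1.0, 0.8, 0.9, 1.2, 1.1, 1.5])
```

The hyperparameters listed above would then map onto `TrainingArguments` (`num_train_epochs=2`, `per_device_train_batch_size=16`, `learning_rate=3e-5`) together with an early-stopping callback.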

## Intended Use

This model is designed for:
- Sensitive data detection in text sequences
- PII (Personally Identifiable Information) classification
- Data governance and compliance applications
- Privacy-focused text analysis

## Limitations

- NAME_FIRST and NAME_LAST categories perform slightly below the 90% F1 target
- Model is trained on English text only
- Performance may vary on data distributions different from training set

## Model Architecture

Based on DeBERTa-v3-xsmall architecture with:
- Enhanced relative position encoding
- Disentangled attention mechanism
- Optimized for efficiency (35% smaller than BERT-base)

## Citation

If you use this model, please cite:

```
@misc{deberta-sequence-classification-2025,
  title={DeBERTa-v3-xsmall for Sequence Classification},
  author={Sandy Yuan},
  year={2025},
  url={https://huggingface.co/sandyyuan/deberta-v3-xsmall-sequence-classification}
}
```