sandyyuan committed on
Commit 6405ad7 · verified · 1 Parent(s): 846b8d8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +106 -0
README.md ADDED
@@ -0,0 +1,106 @@
+
+ # DeBERTa-v3-xsmall for Sequence Classification
+
+ A fine-tuned DeBERTa-v3-xsmall model for classifying text sequences into 7 categories related to sensitive data detection.
+
+ ## Model Description
+
+ This model achieves a **95.9% macro F1 score** when classifying text sequences into the following categories:
+
+ - **NAME_FIRST** - First names
+ - **NAME_LAST** - Last names
+ - **GEO_STREET_NAME** - Street addresses
+ - **GEO_CITY_NAME** - City names
+ - **PROFESSION_JOB_TITLE** - Job titles
+ - **PROFESSION_EMPLOYER** - Company/employer names
+ - **MEDICAL_ALLERGY** - Medical allergies
+
+ ## Performance
+
+ ### F1 Scores by Category:
+ - **GEO_STREET_NAME**: 99.8% (exceeds the 90% target by 9.8 points)
+ - **PROFESSION_JOB_TITLE**: 99.5% (exceeds the 90% target by 9.5 points)
+ - **MEDICAL_ALLERGY**: 99.5% (exceeds the 90% target by 9.5 points)
+ - **PROFESSION_EMPLOYER**: 98.7% (exceeds the 90% target by 8.7 points)
+ - **GEO_CITY_NAME**: 97.2% (exceeds the 90% target by 7.2 points)
+ - **NAME_FIRST**: 88.8% (1.2 points below the 90% target)
+ - **NAME_LAST**: 87.9% (2.1 points below the 90% target)
+
+ ### Overall Performance:
+ - **Macro F1 Score**: 95.9%
+ - **Overall Accuracy**: 95.9%
+ - **Categories Meeting 90% Target**: 5/7
+ - **Processing Speed**: 29.5 cells/second
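The overall figures can be cross-checked against the per-category scores. The sketch below recomputes the macro F1 as the unweighted mean of the seven per-category F1 values and counts how many categories reach the 90% target. The names `f1_by_category` and `TARGET` are illustrative and are not part of the model repository.

```python
# Per-category F1 scores in percent, copied from the README's per-category list.
f1_by_category = {
    "GEO_STREET_NAME": 99.8,
    "PROFESSION_JOB_TITLE": 99.5,
    "MEDICAL_ALLERGY": 99.5,
    "PROFESSION_EMPLOYER": 98.7,
    "GEO_CITY_NAME": 97.2,
    "NAME_FIRST": 88.8,
    "NAME_LAST": 87.9,
}
TARGET = 90.0  # the 90% F1 target mentioned in the README

# Macro F1 is the plain (unweighted) average over categories.
macro_f1 = sum(f1_by_category.values()) / len(f1_by_category)

# Categories whose F1 is at or above the target.
meeting_target = [name for name, f1 in f1_by_category.items() if f1 >= TARGET]

print(f"Macro F1: {macro_f1:.1f}%")
print(f"Categories meeting target: {len(meeting_target)}/{len(f1_by_category)}")
```

Because macro averaging weights every category equally, the two name categories pull the average down even if they are rare or frequent in the evaluation data.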
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import joblib  # only needed if you load the separate label encoder below
+ import torch
+
+ # Load model and tokenizer
+ model_name = "sandyyuan/deberta-v3-xsmall-sequence-classification"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # The label encoder must be downloaded separately before it can be loaded
+ # label_encoder = joblib.load("label_encoder.pkl")
+
+ # Example inference
+ text = "John Smith"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
+
+ # Run the forward pass without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+     probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+     predicted_class_idx = torch.argmax(probabilities, dim=-1).item()
+     confidence = torch.max(probabilities).item()
+
+ print(f"Predicted class index: {predicted_class_idx}")
+ print(f"Confidence: {confidence:.3f}")
+ ```
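The usage snippet stops at a class index, so a decoding step is still needed to get a category name. The following is a minimal sketch of that step in plain Python, assuming the labels are ordered alphabetically (as a scikit-learn `LabelEncoder` would order them). That ordering, the `decode` helper, the `min_confidence` threshold, and the example logits are all assumptions for illustration; check the real `label_encoder.pkl` or `model.config.id2label` before relying on any index-to-label mapping.

```python
import math

# ASSUMPTION: alphabetical label order. The real mapping lives in the
# separately distributed label encoder and may differ.
LABELS = sorted([
    "NAME_FIRST",
    "NAME_LAST",
    "GEO_STREET_NAME",
    "GEO_CITY_NAME",
    "PROFESSION_JOB_TITLE",
    "PROFESSION_EMPLOYER",
    "MEDICAL_ALLERGY",
])


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, min_confidence=0.5):
    """Return (label, confidence); label is None below the threshold."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    label = LABELS[idx] if probs[idx] >= min_confidence else None
    return label, probs[idx]


# Made-up logits for illustration; with the real model these would come
# from outputs.logits[0].tolist().
example_logits = [0.1, 0.2, 0.0, -0.3, 4.0, 0.5, -1.0]
label, confidence = decode(example_logits)
print(label, f"{confidence:.3f}")
```

Returning `None` for low-confidence predictions is one possible design choice; it may be useful for the two name categories, which score below the other categories.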
+
+ ## Training Details
+
+ - **Base Model**: microsoft/deberta-v3-xsmall (70M parameters)
+ - **Training Framework**: PyTorch with Hugging Face Transformers
+ - **Optimization**: Class weighting for imbalanced data
+ - **Sequence Length**: 64 tokens (optimized)
+ - **Epochs**: 2 with early stopping
+ - **Batch Size**: 16
+ - **Learning Rate**: 3e-05
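The training details mention class weighting for imbalanced data but do not say how the weights were derived. One common scheme, shown here purely as an assumption, is the "balanced" heuristic, where each class weight is `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy label list are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    ASSUMPTION: this is the widely used "balanced" heuristic, not
    necessarily the scheme used to train this model.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy distribution: one frequent, one medium, one rare class.
toy_labels = (
    ["NAME_FIRST"] * 6
    + ["GEO_CITY_NAME"] * 3
    + ["MEDICAL_ALLERGY"] * 1
)
weights = balanced_class_weights(toy_labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

Rarer classes receive larger weights, so their errors count more in the loss. In PyTorch, weights like these would typically be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, ordered by class index.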
+
+ ## Intended Use
+
+ This model is designed for:
+ - Sensitive data detection in text sequences
+ - PII (Personally Identifiable Information) classification
+ - Data governance and compliance applications
+ - Privacy-focused text analysis
+
+ ## Limitations
+
+ - The NAME_FIRST and NAME_LAST categories fall slightly below the 90% F1 target (88.8% and 87.9%)
+ - The model was trained on English text only
+ - Performance may vary on data whose distribution differs from the training set
+
+ ## Model Architecture
+
+ Based on the DeBERTa-v3-xsmall architecture, which features:
+ - Enhanced relative position encoding
+ - Disentangled attention mechanism
+ - Optimized for efficiency (35% smaller than BERT-base)
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```
+ @misc{deberta-sequence-classification-2025,
+   title={DeBERTa-v3-xsmall for Sequence Classification},
+   author={Sandy Yuan},
+   year={2025},
+   url={https://huggingface.co/sandyyuan/deberta-v3-xsmall-sequence-classification}
+ }
+ ```