Sparkonix committed on
Commit aea4b1f · 1 Parent(s): edc8356

added Project Report

Files changed (3):
  1. README.md +1 -1
  2. REPORT.md +158 -0
  3. utils.py +3 -3
README.md CHANGED
@@ -151,7 +151,7 @@ The API exposes a single endpoint for email classification:
 ```python
 import requests
 
-url = "https://username-space-name.hf.space/classify"
+url = "https://sparkonix-email-classification-model.hf.space/classify"
 data = {
     "input_email_body": "Hello, my name is John Doe, and I'm having issues with my account."
 }
REPORT.md ADDED
@@ -0,0 +1,158 @@
# Email Classification and PII Masking System - Technical Report

## 1. Introduction to the Problem Statement

The goal of this project is to design and implement an email classification system for a company's support team. The system categorizes incoming support emails into predefined categories while ensuring that personally identifiable information (PII) is masked before processing. After classification, the masked data is restored to its original form.

This workflow addresses two critical requirements:

1. **Privacy-Preserving Processing**: Support emails often contain sensitive personal information that must be protected during the classification process. The system temporarily masks PII during analysis to ensure compliance with data protection regulations.

2. **Accurate Classification with Data Restoration**: The system categorizes emails into predefined classes (Incident, Request, Change, Problem) for efficient ticket routing, while ensuring that the original, unmasked data is available to support agents after classification.

The solution provides an API-based system with a secure PII masking and restoration mechanism, enabling support teams to efficiently route requests while maintaining data integrity and privacy compliance.

## 2. Approach Taken for PII Masking and Classification

### 2.1 PII Masking Approach

The system employs a hybrid approach to detect and mask PII, combining rule-based pattern matching with machine learning:

1. **Regular Expression Pattern Matching**:
   - Implemented regex patterns to identify structured PII such as:
     - Email addresses
     - Phone numbers
     - Credit/debit card numbers
     - CVV codes
     - Expiry dates
     - Aadhaar card numbers (Indian national ID)
     - Dates of birth

2. **Named Entity Recognition (NER)**:
   - Utilized the spaCy `xx_ent_wiki_sm` model to detect unstructured PII, particularly personal names
   - Selected a multilingual model to handle emails in different languages

3. **Contextual Verification**:
   - Implemented contextual verification to reduce false positives (e.g., confirming that a 3-digit number is a CVV by examining the surrounding text)
   - Employed a sliding-window approach to analyze text segments around potential PII

4. **Entity Overlap Resolution**:
   - Developed an algorithm to handle overlapping entities
   - Prioritized NER-detected entities over regex matches
   - Preferred longer entities over shorter ones when overlaps occurred

5. **Secure Storage**:
   - Created a SQLite database to securely store original emails
   - Implemented access-key authentication for retrieving unmasked content
   - Structured the database with appropriate indexes for efficient retrieval
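As a rough sketch of the regex matching and contextual verification steps above, the following combines pattern detection with a context check around each candidate. The patterns, labels, and helper names are illustrative assumptions, not the project's actual code:

```python
import re

# Illustrative patterns -- the real project likely uses more exhaustive ones.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone_number": re.compile(r"\+?\d[\d\s-]{8,12}\d"),
    "cvv_no": re.compile(r"\b\d{3}\b"),
}

def verify_cvv(text: str, start: int, end: int, window: int = 20) -> bool:
    """Contextual check: only accept a 3-digit match as a CVV if the
    surrounding text mentions it, reducing false positives."""
    context = text[max(0, start - window):end + window].lower()
    return "cvv" in context or "security code" in context

def detect_pii(text: str):
    """Return (start, end, label, value) tuples for matched patterns."""
    entities = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            if label == "cvv_no" and not verify_cvv(text, m.start(), m.end()):
                continue  # drop 3-digit numbers with no CVV context
            entities.append((m.start(), m.end(), label, m.group()))
    return entities

def mask_pii(text: str) -> str:
    """Replace each detected entity with an [entity_type] placeholder."""
    masked = text
    for start, end, label, value in sorted(detect_pii(text), reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked
```

Masking right-to-left keeps earlier offsets valid, and the `[entity_type]` placeholders preserve the entity-type context that the classifier later relies on.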
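The overlap-resolution rules above (NER beats regex, longer spans beat shorter ones) can be sketched as a greedy pass over candidates sorted by priority. The `Entity` structure and exact tie-breaking here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    start: int
    end: int
    label: str
    source: str  # "ner" or "regex"

def overlaps(a: Entity, b: Entity) -> bool:
    # Spans overlap when the intersection of their ranges is non-empty.
    return max(0, min(a.end, b.end) - max(a.start, b.start)) > 0

def resolve_overlaps(entities):
    """Keep one entity per overlapping group: NER-detected entities win
    over regex matches, then longer spans win over shorter ones."""
    def priority(e: Entity):
        return (e.source == "ner", e.end - e.start)

    kept = []
    for cand in sorted(entities, key=priority, reverse=True):
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda e: e.start)
```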
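A minimal sketch of the secure-storage step, assuming a simple one-table schema and random hex access keys; the real schema and key scheme may differ:

```python
import secrets
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS emails (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            access_key TEXT NOT NULL,
            original_body TEXT NOT NULL
        )
    """)
    # Index so access-key lookups stay fast as the table grows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_access_key ON emails(access_key)")

def store_original(conn: sqlite3.Connection, body: str) -> str:
    """Store the unmasked email; return the key required to read it back."""
    key = secrets.token_hex(16)
    conn.execute(
        "INSERT INTO emails (access_key, original_body) VALUES (?, ?)",
        (key, body),
    )
    return key

def retrieve_original(conn: sqlite3.Connection, key: str):
    """Return the original email for a valid access key, else None."""
    row = conn.execute(
        "SELECT original_body FROM emails WHERE access_key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```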
### 2.2 Classification Approach

The classification system follows a standard NLP pipeline for text classification:

1. **Preprocessing**:
   - Tokenization of the masked email text
   - Special token handling for masked PII entities
   - Context preservation during masking to maintain classification accuracy

2. **Model Application**:
   - Feeding masked emails into the fine-tuned XLM-RoBERTa model
   - Mapping prediction outputs to the four support categories

3. **Post-processing**:
   - Structuring the response with the classification and masked content
   - Attaching metadata about identified PII entities
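The post-processing step can be illustrated as assembling a response dictionary from the pipeline's outputs. The exact field names are assumptions, not the service's documented schema:

```python
def build_response(input_email: str, masked_email: str, entities, category: str) -> dict:
    """Assemble the API response: predicted category plus masking metadata.
    `entities` holds (start, end, label, original_value) tuples."""
    return {
        "input_email_body": input_email,
        "masked_email": masked_email,
        "list_of_masked_entities": [
            {
                "position": [start, end],
                "classification": label,
                "entity": value,
            }
            for start, end, label, value in entities
        ],
        "category_of_the_email": category,
    }
```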
## 3. Model Selection and Training Details

### 3.1 Model Selection Rationale

After analyzing the email dataset, I observed significant language diversity in the support emails. Using `langdetect`, I determined that emails were written in multiple languages, making a multilingual model essential. I selected **XLM-RoBERTa-base** for the following reasons:

1. **Multilingual Capabilities**: Pre-trained on 100 languages, making it well suited to diverse support emails
2. **Strong Contextual Understanding**: Captures semantic meaning across language boundaries
3. **Transfer Learning Potential**: An excellent base model for fine-tuning with limited labeled data
4. **State-of-the-Art Performance**: Has demonstrated strong results on cross-lingual text classification tasks
### 3.2 Training Process and Hyperparameter Tuning

The training process consisted of multiple phases:

1. **Initial Data Preparation**:
   - Collected and labeled support emails into the four categories
   - Applied PII masking to the training data to match inference conditions
   - Split the data into training (80%), validation (10%), and test (10%) sets

2. **Wide-Range Hyperparameter Exploration**:
   - Learning rates: [1e-5, 3e-5, 5e-5, 1e-4]
   - Batch sizes: [8, 16, 32]
   - Weight decay: [0.01, 0.001, 0.0001]
   - Warmup steps: [0, 100, 500]
   - Maximum sequence length: [128, 256, 512]

3. **Focused Hyperparameter Optimization**:
   - After identifying the most promising parameter ranges, performed a more granular search
   - Fine-tuned with:
     - Learning rate: 2.3983474850766225e-05
     - Batch size: 16
     - Weight decay: 0.07212037354713949
     - Maximum sequence length: 512
     - Epochs: 3, with early stopping

4. **Final Training**:
   - Trained on the full training set with the optimal hyperparameters
   - Implemented gradient accumulation for stable training
   - Applied early stopping based on validation performance
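The 80/10/10 split can be sketched with a seeded shuffle; the project may instead use a library utility such as scikit-learn's `train_test_split`:

```python
import random

def split_dataset(examples, seed: int = 42):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # fixed seed keeps the split reproducible
    n = len(data)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```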
### 3.3 Model Performance

The final model achieved a weighted F1 score of 79%.
## 4. Challenges Faced and Solutions Implemented

### 4.1 PII Detection Challenges

1. **False Positives in PII Detection**
   - **Challenge**: Regular expressions frequently matched numbers and patterns that were not actually PII
   - **Solution**: Implemented contextual verification by examining the text surrounding potential PII; added specific rule-based filters for common false positives

2. **Overlapping Entities**
   - **Challenge**: Different detection methods identified overlapping entities (e.g., a name within an email address)
   - **Solution**: Developed a resolution algorithm that prioritizes entities based on type, length, and detection method

3. **Multilingual Name Detection**
   - **Challenge**: Names in non-Latin scripts were frequently missed
   - **Solution**: Integrated the `xx_ent_wiki_sm` model with language-specific fallbacks for better cross-lingual name detection
### 4.2 Classification Challenges

1. **Maintaining Classification Accuracy with Masked Text**
   - **Challenge**: Masking PII removed contextual information needed for classification
   - **Solution**: Used semantic replacement tokens that preserve entity-type information; optimized masking to retain the surrounding context

2. **Class Imbalance**
   - **Challenge**: The training data had significantly more "Incident" and "Request" emails than "Change" and "Problem" emails
   - **Solution**: Implemented class weighting during training; used data augmentation techniques for the underrepresented classes

3. **Context Length Limitations**
   - **Challenge**: Some emails exceeded the model's maximum context length
   - **Solution**: Implemented intelligent truncation that preserves the most relevant parts of the email, retaining subject lines and key paragraphs
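One common way to implement the class weighting described above is inverse-frequency weighting, where rare classes receive proportionally larger loss weights; whether the project uses exactly this formula is an assumption:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: weight = total / (num_classes * class_count),
    so underrepresented classes like "Change" and "Problem" count for more."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * count) for cls, count in counts.items()}
```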
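The intelligent truncation can be approximated by keeping the head (subject line and opening context) and the tail of an over-long email while dropping tokens from the middle; the 60/40 head/tail ratio is an illustrative assumption:

```python
def truncate_middle(tokens, max_len: int, head_frac: float = 0.6):
    """Keep the start and end of an over-long token sequence, dropping
    the middle, so subject lines and closing requests survive truncation."""
    if len(tokens) <= max_len:
        return tokens
    n_head = int(max_len * head_frac)
    n_tail = max_len - n_head
    return tokens[:n_head] + tokens[-n_tail:]
```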
### 4.3 System Integration Challenges

1. **Deployment Complexity**
   - **Challenge**: Ensuring a consistent environment across development and the Hugging Face Spaces deployment
   - **Solution**: Created a comprehensive Docker configuration with appropriate volume mounts for persistent storage and cache

2. **Error Handling**
   - **Challenge**: The system needed robust error handling for edge cases such as empty emails and malformed inputs
   - **Solution**: Implemented comprehensive exception handling with meaningful error messages; added fallback mechanisms for component failures
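The input-validation side of this error handling might look like the following sketch; the specific checks and error messages are illustrative, not the service's actual responses:

```python
def validate_request(payload) -> str:
    """Reject malformed requests with a clear error before classification runs."""
    if not isinstance(payload, dict):
        raise ValueError("Request body must be a JSON object")
    body = payload.get("input_email_body")
    if not isinstance(body, str):
        raise ValueError("'input_email_body' must be a string")
    if not body.strip():
        raise ValueError("'input_email_body' must not be empty")
    return body
```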
## 5. Conclusion

The Email Classification and PII Masking System successfully addresses the dual challenges of privacy protection and efficient ticket routing. By leveraging advanced NLP techniques and a multilingual model, the system provides robust performance across diverse email content.

The system is deployed as a Docker container on Hugging Face Spaces, providing a scalable, secure API endpoint that can be integrated into existing support workflows.
utils.py CHANGED
@@ -111,8 +111,8 @@ class PIIMasker:
         is_substring_of_existing = False
         for existing_entity in entities:
             if (existing_entity.start <= start
-                    and existing_entity.end >= end  # W504 corrected
-                    and existing_entity.value != value):  # W504 corrected
+                    and existing_entity.end >= end
+                    and existing_entity.value != value):
                 is_substring_of_existing = True
                 break
         if is_substring_of_existing:
@@ -318,7 +318,7 @@ class PIIMasker:
         # Res: |----| or |----| or |--| or |------|
         overlap = max(
             0,
-            min(current_entity.end, res_entity.end)  # Fixed W504 line break
+            min(current_entity.end, res_entity.end)
             - max(current_entity.start, res_entity.start)
         )
 