SwaKyxd committed · verified

Commit 880beda · Parent(s): a8ab2a0

Upload README.md with huggingface_hub

Files changed (1): README.md added (+278 −0)
---
language: en
license: mit
tags:
- text-classification
- bert
- resume-analysis
- job-classification
- nlp
datasets:
- resume-dataset
metrics:
- accuracy
- matthews_correlation
model-index:
- name: resume-analyser-bert
  results:
  - task:
      type: text-classification
      name: Resume Classification
    dataset:
      name: Resume Dataset
      type: resume-dataset
    metrics:
    - type: accuracy
      value: 1.0
      name: Validation Accuracy
    - type: matthews_correlation
      value: 1.0
      name: Matthews Correlation Coefficient
---

# Resume Analyser - BERT for Job Category Classification

## Model Description

This is a fine-tuned BERT-base-uncased model that classifies resumes into 25 job categories. The model achieved **100% validation accuracy** on the held-out validation split (see Limitations and Bias for caveats).

## Model Details

- **Model Type:** BERT for Sequence Classification
- **Base Model:** bert-base-uncased
- **Parameters:** 109,501,465 (all trainable)
- **Language:** English
- **License:** MIT
- **Training Data:** 962 resumes from the Kaggle Resume Dataset
- **Categories:** 25 job categories

## Intended Use

This model classifies resumes into job categories based on their content. It can be used for:

- Automated resume screening systems
- Job recommendation systems
- HR automation tools
- Resume parsing applications
- Career guidance systems

## Training Data

The model was trained on the [Resume Dataset](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset) containing:

- **Total Samples:** 962 resumes
- **Train/Validation Split:** 80/20 (769 training, 193 validation)
- **Categories:** 25 job categories

### Job Categories

Data Science, Java Developer, Testing, DevOps Engineer, Python Developer, Web Developer, HR, Hadoop, Blockchain, ETL Developer, Operations Manager, Sales, Mechanical Engineer, Arts, Database, Electrical Engineering, Health and Fitness, PMO, Business Analyst, DotNet Developer, Automation Testing, Network Security Engineer, SAP Developer, Civil Engineer, Advocate

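The model's output logits are indexed by integer class ids. A minimal sketch of that mapping, assuming the ids follow the order listed above (the actual order depends on the label encoder used during training — for example, scikit-learn's `LabelEncoder` sorts names alphabetically — so verify against the model's `config.id2label` before relying on it):

```python
# Hypothetical label <-> id mapping, assuming ids follow the listed order.
# Check the model's config (id2label) for the authoritative mapping.
CATEGORIES = [
    "Data Science", "Java Developer", "Testing", "DevOps Engineer",
    "Python Developer", "Web Developer", "HR", "Hadoop", "Blockchain",
    "ETL Developer", "Operations Manager", "Sales", "Mechanical Engineer",
    "Arts", "Database", "Electrical Engineering", "Health and Fitness",
    "PMO", "Business Analyst", "DotNet Developer", "Automation Testing",
    "Network Security Engineer", "SAP Developer", "Civil Engineer", "Advocate",
]

label2id = {name: i for i, name in enumerate(CATEGORIES)}  # name -> class id
id2label = {i: name for name, i in label2id.items()}       # class id -> name
```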
## Training Procedure

### Training Hyperparameters

- **Base Model:** bert-base-uncased
- **Batch Size:** 4
- **Epochs:** 5
- **Learning Rate:** 2e-5
- **Optimizer:** AdamW (foreach=False)
- **Max Sequence Length:** 200 tokens
- **LR Scheduler:** linear with warmup
- **GPU:** NVIDIA GeForce RTX 3060 Laptop GPU (6GB VRAM)

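Given these hyperparameters, the number of optimizer steps works out as follows (a back-of-the-envelope sketch, assuming the final partial batch is kept rather than dropped):

```python
import math

train_samples = 769  # 80% split of 962 resumes
batch_size = 4
epochs = 5

# 769 / 4 = 192.25, so the last epoch batch is a partial one
steps_per_epoch = math.ceil(train_samples / batch_size)  # 193 batches per epoch
total_steps = steps_per_epoch * epochs                   # 965 optimizer steps

print(steps_per_epoch, total_steps)  # 193 965
```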
### Training Results

| Epoch | Training Loss | Validation Loss | Validation Accuracy | MCC Score |
|-------|---------------|-----------------|---------------------|-----------|
| 1     | 2.6037        | 1.1563          | 53.37%              | 0.4993    |
| 2     | 0.9651        | 0.2858          | 98.96%              | 0.9891    |
| 3     | 0.5804        | 0.2782          | 100.00%             | 1.0000    |
| 4     | 0.4473        | 0.2774          | 100.00%             | 1.0000    |
| 5     | 0.3604        | 0.2767          | 100.00%             | 1.0000    |

**Training Time:** ~72 minutes for 5 epochs

## Performance Metrics

- **Validation Accuracy:** 100%
- **Matthews Correlation Coefficient:** 1.0000 (perfect correlation)
- **Final Training Loss:** 0.3604
- **Final Validation Loss:** 0.2767

The model classified all 193 validation resumes correctly.

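For reference, the Matthews correlation coefficient generalizes to the multiclass case (Gorodkin's R_K), and a perfect classifier scores exactly 1.0. A small pure-Python sketch of the metric being reported (illustrative only — not this project's evaluation code, which would typically use a library implementation such as scikit-learn's `matthews_corrcoef`):

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (Gorodkin's R_K)."""
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_k = Counter(y_true)                             # true count per class
    p_k = Counter(y_pred)                             # predicted count per class
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in t_k)
    denom = math.sqrt((s * s - sum(v * v for v in p_k.values())) *
                      (s * s - sum(v * v for v in t_k.values())))
    return cov_tp / denom if denom else 0.0

# Perfect predictions over several classes give MCC = 1.0
y = [0, 1, 2, 2, 1, 0, 3]
print(multiclass_mcc(y, y))  # 1.0
```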
## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('SwaKyxd/resume-analyser-bert')
tokenizer = BertTokenizer.from_pretrained('SwaKyxd/resume-analyser-bert')

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()

# Example resume text
resume_text = """
Skills: Python, Machine Learning, Deep Learning, PyTorch, TensorFlow, NLP
Experience: 3 years in Data Science and AI model development
Projects: Built recommendation systems, sentiment analysis models
Education: Masters in Computer Science
"""

# Tokenize
inputs = tokenizer(
    resume_text,
    return_tensors='pt',
    max_length=200,
    padding='max_length',
    truncation=True
)

# Move to device
inputs = {k: v.to(device) for k, v in inputs.items()}

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)
    probabilities = torch.softmax(outputs.logits, dim=1)
    confidence = probabilities[0][predictions[0]].item()

# Category mapping (example - adjust based on your label encoder)
categories = [
    "Data Science", "Java Developer", "Testing", "DevOps Engineer",
    "Python Developer", "Web Developer", "HR", "Hadoop", "Blockchain",
    "ETL Developer", "Operations Manager", "Sales", "Mechanical Engineer",
    "Arts", "Database", "Electrical Engineering", "Health and Fitness",
    "PMO", "Business Analyst", "DotNet Developer", "Automation Testing",
    "Network Security Engineer", "SAP Developer", "Civil Engineer", "Advocate"
]

print(f"Predicted Category: {categories[predictions.item()]}")
print(f"Confidence: {confidence:.2%}")
```

### Batch Processing

```python
# Reuses model, tokenizer, device and categories from the Quick Start above
resumes = [
    "Python developer with 5 years experience in Django and Flask...",
    "Experienced data scientist with expertise in machine learning...",
    "Java backend developer skilled in Spring Boot and microservices..."
]

# Tokenize batch
inputs = tokenizer(
    resumes,
    return_tensors='pt',
    max_length=200,
    padding='max_length',
    truncation=True
)

inputs = {k: v.to(device) for k, v in inputs.items()}

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1)

for i, pred in enumerate(predictions):
    print(f"Resume {i+1}: {categories[pred.item()]}")
```

## Limitations and Bias

- **Language:** Trained only on English resumes
- **Dataset Size:** Trained on 962 resumes; may not generalize to all resume formats
- **Domain Specific:** Performance may vary on resumes outside the 25 predefined categories
- **Text Format:** Best performance on plain text resumes; PDFs and DOCs may need preprocessing
- **Perfect Accuracy:** 100% accuracy on a 193-sample validation split suggests possible overfitting; evaluate on new data before production use

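Because the model was trained on plain text, resumes extracted from PDFs or word processors often benefit from a light cleanup before tokenization. A hypothetical cleaning pass (`clean_resume_text` is illustrative and not part of this repository's pipeline):

```python
import re

def clean_resume_text(text: str) -> str:
    """Hypothetical cleanup for text extracted from PDF/DOC resumes."""
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"\S+@\S+", " ", text)         # drop email addresses
    text = re.sub(r"[^\x20-\x7e\n]", " ", text)  # drop non-ASCII/mis-decoded chars
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces
    return text.strip()

cleaned = clean_resume_text(
    "Contact: jane@example.com \u2022 https://example.com  Python, NLP"
)
print(cleaned)
```

The cleaned string can then be passed to the tokenizer exactly as in the Quick Start example.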
## Ethical Considerations

- This model should be used as an assistive tool, not as the sole decision-maker in hiring processes
- Human oversight is recommended for all automated resume screening
- Be aware of potential biases in the training data that may affect predictions
- Ensure compliance with employment laws and anti-discrimination regulations
- Protect candidate privacy and handle resume data securely

## Model Architecture

```
BertForSequenceClassification(
  (bert): BertModel(
    12 transformer layers
    768 hidden dimensions
    12 attention heads
    110M parameters
  )
  (dropout): Dropout(p=0.1)
  (classifier): Linear(768 -> 25)
)
```

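The parameter count in Model Details can be reproduced from this architecture: the bert-base-uncased backbone contributes 109,482,240 parameters, and the fresh 25-way classification head adds its weight matrix plus bias (a sanity-check sketch; `model.num_parameters()` gives the authoritative count):

```python
bert_base_params = 109_482_240  # bert-base-uncased backbone (incl. pooler)
hidden, num_labels = 768, 25

classifier_params = hidden * num_labels + num_labels  # weight + bias = 19,225
total = bert_base_params + classifier_params

print(total)  # 109501465 -- matches the count in Model Details
```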
## Technical Specifications

- **Framework:** PyTorch 2.6.0
- **Transformers:** 4.47.1
- **Tokenizer:** BertTokenizer (bert-base-uncased)
- **Max Sequence Length:** 200 tokens
- **Model Size:** ~436 MB
- **Precision:** FP32

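The reported checkpoint size is consistent with storing every FP32 parameter in 4 bytes (a rough estimate; the exact on-disk size depends on the serialization format and metadata):

```python
# Rough size check: FP32 stores 4 bytes per parameter.
params = 109_501_465
bytes_fp32 = params * 4           # 438,005,860 bytes
size_mb = bytes_fp32 / 1_000_000  # ~438 MB decimal (~418 MiB)

print(f"{size_mb:.0f} MB")  # within a few MB of the stated ~436 MB
```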
## Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{resume-analyser-bert,
  author = {Sayan Mahalik},
  title = {Resume Analyser - BERT for Job Category Classification},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/SwaKyxd/resume-analyser-bert}
}
```

## Related Resources

- **GitHub Repository:** [Resume-Analyser](https://github.com/Swakyxd/Resume-Analyser)
- **Training Notebook:** Available in the GitHub repository
- **Base Model:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
- **Dataset:** [Resume Dataset on Kaggle](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset)

## Contact

For questions, issues, or feedback:

- GitHub: [Swakyxd/Resume-Analyser](https://github.com/Swakyxd/Resume-Analyser)
- Open an issue on GitHub for bug reports or feature requests

## License

This model is released under the MIT License. See the LICENSE file for details.

## Acknowledgments

- Hugging Face Transformers library
- BERT paper: [Devlin et al., 2018](https://arxiv.org/abs/1810.04805)
- Kaggle Resume Dataset contributors
- PyTorch team

---

**Model Card Version:** 1.0  
**Last Updated:** November 2025