Upload folder using huggingface_hub

Files added:
- README.md +200 -0
- data_generator.py +934 -0
- inference.py +441 -0
- requirements.txt +8 -0
- train.py +437 -0

README.md (ADDED)
---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- resume-parsing
- nlp
- distilbert
- synthetic-data
metrics:
- accuracy
- f1
pipeline_tag: text-classification
datasets:
- custom-synthetic
language:
- en
---

# Resume Section Classifier

A fine-tuned [DistilBERT](https://huggingface.co/distilbert-base-uncased) model that classifies resume text sections into 8 categories. Designed for automated resume parsing pipelines where incoming text needs to be segmented and labeled by section type.

## Labels

| Label | Description | Example |
|-------|-------------|---------|
| `education` | Academic background, degrees, coursework, GPA | "B.S. in Computer Science, MIT, 2023, GPA: 3.8" |
| `experience` | Work history, job titles, responsibilities | "Software Engineer at Google, 2020-Present. Built microservices..." |
| `skills` | Technical and soft skills listings | "Python, Java, React, Docker, Kubernetes, AWS" |
| `projects` | Personal or academic project descriptions | "Built a real-time analytics dashboard using React and D3.js" |
| `summary` | Professional summary or objective statement | "Results-driven engineer with 5+ years of experience in..." |
| `certifications` | Professional certifications and licenses | "AWS Certified Solutions Architect - Associate (2023)" |
| `contact` | Name, email, phone, LinkedIn, location | "John Smith \| john@email.com \| (415) 555-1234 \| SF, CA" |
| `awards` | Honors, achievements, recognition | "Dean's List, Phi Beta Kappa, Hackathon Winner (2022)" |

## Training Procedure

### Data

The model was trained on a **synthetic dataset** generated programmatically using `data_generator.py`. The generator uses:

- **Template-based generation** with randomized entities (names, companies, universities, skills, dates, cities, etc.) drawn from curated pools of 30-50+ items each
- **Structural variation** across multiple formatting styles per category (e.g., education entries can appear as single-line, multi-line, with/without GPA, with activities, etc.)
- **Data augmentation** via synonym replacement (30% probability per eligible word) and formatting variation (whitespace, bullet styles, separators)
- **Optional section headers** prepended with 40% probability to teach the model to handle both headed and headless sections

The default configuration produces **1,920 examples** (80 base examples x 3 variants x 8 categories), split 80/10/10 into train/validation/test sets with stratified sampling.
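The 80/10/10 stratified split described above can be sketched with a small stdlib-only helper. This is illustrative only; `stratified_split` is a hypothetical name, not the exact routine in `train.py`:

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (example, label) pairs train/val/test, preserving label balance."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    train, val, test = [], [], []
    for lab, items in by_label.items():
        rng.shuffle(items)
        n = len(items)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        # Slice each label's pool independently so every split keeps
        # the same label proportions as the full dataset.
        train += [(ex, lab) for ex in items[:n_train]]
        val += [(ex, lab) for ex in items[n_train:n_train + n_val]]
        test += [(ex, lab) for ex in items[n_train + n_val:]]
    return train, val, test
```

With the default 1,920 examples (240 per category), this yields 1,536/192/192 examples, matching the 192-example test set reported under Metrics.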
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `distilbert-base-uncased` |
| Max sequence length | 256 tokens |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Early stopping | Patience 3 (on macro F1) |

### Training Infrastructure

- Fine-tuning takes approximately 5-10 minutes on a single GPU, or 15-30 minutes on CPU
- Model size: ~67M parameters (DistilBERT base)
- Mixed precision (FP16) is enabled automatically when CUDA is available

## Metrics

Evaluated on the held-out synthetic test set (192 examples, stratified):

| Metric | Score |
|--------|-------|
| Accuracy | ~0.95+ |
| F1 (macro) | ~0.95+ |
| F1 (weighted) | ~0.95+ |
| Precision (weighted) | ~0.95+ |
| Recall (weighted) | ~0.95+ |

> Note: Exact metrics depend on the random seed and training run. Scores on real-world resumes may differ from synthetic test performance.
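The hyperparameter table maps onto a `transformers.TrainingArguments` configuration roughly as follows. This is a sketch of the implied setup, not the exact contents of `train.py`; the output directory and metric key are assumptions:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Configuration implied by the table above (names of dirs/metric keys assumed).
args = TrainingArguments(
    output_dir="resume-section-classifier",   # assumed output path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,                        # applied by the default AdamW optimizer
    eval_strategy="epoch",                    # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",         # assumed key from compute_metrics
    fp16=True,                                # set only when CUDA is available
)

# Patience-3 early stopping, passed to Trainer via callbacks=[early_stop].
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```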
## Usage

### Quick Start with the Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/resume-section-classifier",
)

result = classifier("Bachelor of Science in Computer Science, Stanford University, 2023. GPA: 3.9/4.0")
print(result)
# [{'label': 'education', 'score': 0.98}]
```

### Full Resume Classification

```python
from inference import ResumeSectionClassifier

classifier = ResumeSectionClassifier("gr8monk3ys/resume-section-classifier")

resume_text = """
John Smith
john.smith@email.com | (415) 555-1234 | San Francisco, CA
linkedin.com/in/johnsmith | github.com/johnsmith

SUMMARY
Experienced software engineer with 5+ years building scalable web applications
and distributed systems. Passionate about clean code and mentoring junior developers.

EXPERIENCE
Senior Software Engineer | Google | San Francisco, CA | Jan 2021 - Present
- Led migration of monolithic application to microservices architecture
- Mentored 4 junior engineers through code reviews and pair programming
- Reduced API latency by 40% through caching and query optimization

Software Engineer | Stripe | San Francisco, CA | Jun 2018 - Dec 2020
- Built payment processing APIs handling 10M+ transactions daily
- Implemented CI/CD pipeline reducing deployment time from hours to minutes

EDUCATION
B.S. in Computer Science, Stanford University (2018)
GPA: 3.8/4.0 | Dean's List

SKILLS
Languages: Python, Java, Go, TypeScript
Frameworks: React, Django, Spring Boot, gRPC
Tools: Docker, Kubernetes, AWS, PostgreSQL, Redis
"""

analysis = classifier.classify_resume(resume_text)
print(analysis.summary())
```

### Training from Scratch

```bash
# Install dependencies
pip install -r requirements.txt

# Generate data and train
python train.py --epochs 4 --batch-size 16

# Train and push to the Hub
python train.py --push-to-hub --hub-model-id gr8monk3ys/resume-section-classifier

# Inference
python inference.py --file resume.txt
python inference.py --text "Python, Java, React, Docker" --single
```

### Generating Custom Training Data

```bash
# Generate data with custom settings
python data_generator.py \
    --examples-per-category 150 \
    --augmented-copies 3 \
    --output data/resume_sections.csv \
    --print-stats
```

## Project Structure

```
resume-section-classifier/
    data_generator.py    # Synthetic data generation with templates and augmentation
    train.py             # Full fine-tuning pipeline with the Hugging Face Trainer
    inference.py         # Section splitting and classification API + CLI
    requirements.txt     # Python dependencies
    README.md            # This model card
```

## Limitations

- **Synthetic training data**: The model is trained entirely on programmatically generated text. Performance on real resumes with unusual formatting, non-English content, or domain-specific jargon may be lower than the synthetic test metrics suggest.
- **English only**: All training data is in English. The model will not reliably classify sections in other languages.
- **Section granularity**: The model classifies pre-split sections. The built-in text splitter uses heuristics (section headers, paragraph breaks) that may not correctly segment all resume formats (e.g., single-column vs. multi-column PDFs, tables).
- **8 categories only**: Some resume sections (e.g., Publications, Volunteer Work, Languages, Interests, References) are not covered by the current label set; these will be assigned to the closest matching category.
- **Max length**: Input is truncated to 256 tokens, so very long sections may lose information.
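For a sense of the header-based heuristics mentioned under section granularity, here is a minimal splitter sketch. `HEADER_RE` and `split_sections` are hypothetical names for illustration; the actual logic in `inference.py` may differ (e.g., it also considers paragraph breaks):

```python
import re

# Treat an all-caps known section title on its own line as a boundary.
HEADER_RE = re.compile(
    r"^\s*(SUMMARY|EXPERIENCE|EDUCATION|SKILLS|PROJECTS|"
    r"CERTIFICATIONS|AWARDS|CONTACT)\s*$",
    re.IGNORECASE,
)

def split_sections(text):
    """Split resume text into chunks, starting a new chunk at each header."""
    sections, current = [], []
    for line in text.splitlines():
        if HEADER_RE.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

Each returned chunk (header plus its body) is then classified independently, which is why multi-column layouts that interleave sections defeat this approach.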
## Intended Use

- Automated resume parsing pipelines
- HR tech and applicant tracking systems
- Resume formatting and analysis tools
- Educational demonstrations of text-classification fine-tuning

This model is **not intended** for making hiring decisions. It is a text classification tool for structural parsing only.

## Author

[Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys) (gr8monk3ys)
data_generator.py (ADDED)
| 1 |
+
"""
|
| 2 |
+
Synthetic Resume Section Data Generator
|
| 3 |
+
|
| 4 |
+
Generates realistic resume section text across 8 categories for training
|
| 5 |
+
a text classifier. Uses template-based generation with randomized entities,
|
| 6 |
+
synonym replacement, and structural variation to produce diverse examples.
|
| 7 |
+
|
| 8 |
+
Author: Lorenzo Scaturchio (gr8monk3ys)
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
import csv
|
| 12 |
+
import random
|
| 13 |
+
import itertools
|
| 14 |
+
from pathlib import Path
|
| 15 |
+
from typing import Optional
|
| 16 |
+
|
| 17 |
+
# ---------------------------------------------------------------------------
|
| 18 |
+
# Entity pools – used to fill templates with realistic variation
|
| 19 |
+
# ---------------------------------------------------------------------------
|
| 20 |
+
|
| 21 |
+
FIRST_NAMES = [
|
| 22 |
+
"James", "Mary", "Robert", "Patricia", "John", "Jennifer", "Michael",
|
| 23 |
+
"Linda", "David", "Elizabeth", "William", "Barbara", "Richard", "Susan",
|
| 24 |
+
"Joseph", "Jessica", "Thomas", "Sarah", "Charles", "Karen", "Daniel",
|
| 25 |
+
"Lisa", "Matthew", "Nancy", "Anthony", "Betty", "Mark", "Sandra",
|
| 26 |
+
"Aisha", "Wei", "Carlos", "Priya", "Olga", "Hiroshi", "Fatima", "Liam",
|
| 27 |
+
"Sofia", "Andrei", "Mei", "Alejandro", "Yuki", "Omar", "Elena", "Raj",
|
| 28 |
+
]
|
| 29 |
+
|
| 30 |
+
LAST_NAMES = [
|
| 31 |
+
"Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller",
|
| 32 |
+
"Davis", "Rodriguez", "Martinez", "Hernandez", "Lopez", "Gonzalez",
|
| 33 |
+
"Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin",
|
| 34 |
+
"Lee", "Perez", "Thompson", "White", "Harris", "Sanchez", "Clark",
|
| 35 |
+
"Patel", "Chen", "Kim", "Nakamura", "Ivanov", "Silva", "Okafor",
|
| 36 |
+
]
|
| 37 |
+
|
| 38 |
+
COMPANIES = [
|
| 39 |
+
"Google", "Microsoft", "Amazon", "Apple", "Meta", "Netflix", "Stripe",
|
| 40 |
+
"Airbnb", "Uber", "Salesforce", "Adobe", "IBM", "Oracle", "Intel",
|
| 41 |
+
"Tesla", "SpaceX", "Palantir", "Snowflake", "Databricks", "Confluent",
|
| 42 |
+
"JPMorgan Chase", "Goldman Sachs", "Morgan Stanley", "Deloitte",
|
| 43 |
+
"McKinsey & Company", "Boston Consulting Group", "Accenture",
|
| 44 |
+
"Lockheed Martin", "Boeing", "Raytheon", "General Electric",
|
| 45 |
+
"Procter & Gamble", "Johnson & Johnson", "Pfizer", "Moderna",
|
| 46 |
+
"Shopify", "Square", "Twilio", "Cloudflare", "HashiCorp",
|
| 47 |
+
"DataRobot", "Hugging Face", "OpenAI", "Anthropic", "Cohere",
|
| 48 |
+
"Startup XYZ", "TechCorp Inc.", "InnovateTech", "DataDriven LLC",
|
| 49 |
+
]
|
| 50 |
+
|
| 51 |
+
UNIVERSITIES = [
|
| 52 |
+
"Massachusetts Institute of Technology", "Stanford University",
|
| 53 |
+
"Harvard University", "University of California, Berkeley",
|
| 54 |
+
"Carnegie Mellon University", "Georgia Institute of Technology",
|
| 55 |
+
"University of Michigan", "University of Illinois Urbana-Champaign",
|
| 56 |
+
"California Institute of Technology", "Princeton University",
|
| 57 |
+
"Columbia University", "University of Washington",
|
| 58 |
+
"University of Texas at Austin", "Cornell University",
|
| 59 |
+
"University of Pennsylvania", "University of Southern California",
|
| 60 |
+
"New York University", "University of Wisconsin-Madison",
|
| 61 |
+
"Duke University", "Northwestern University",
|
| 62 |
+
"University of California, Los Angeles", "Rice University",
|
| 63 |
+
"University of Maryland", "Purdue University",
|
| 64 |
+
"Ohio State University", "Arizona State University",
|
| 65 |
+
"University of Virginia", "University of Florida",
|
| 66 |
+
"Boston University", "Northeastern University",
|
| 67 |
+
]
|
| 68 |
+
|
| 69 |
+
DEGREES = [
|
| 70 |
+
("Bachelor of Science", "B.S."),
|
| 71 |
+
("Bachelor of Arts", "B.A."),
|
| 72 |
+
("Master of Science", "M.S."),
|
| 73 |
+
("Master of Arts", "M.A."),
|
| 74 |
+
("Master of Business Administration", "MBA"),
|
| 75 |
+
("Doctor of Philosophy", "Ph.D."),
|
| 76 |
+
("Associate of Science", "A.S."),
|
| 77 |
+
("Bachelor of Engineering", "B.Eng."),
|
| 78 |
+
("Master of Engineering", "M.Eng."),
|
| 79 |
+
]
|
| 80 |
+
|
| 81 |
+
MAJORS = [
|
| 82 |
+
"Computer Science", "Software Engineering", "Data Science",
|
| 83 |
+
"Electrical Engineering", "Mechanical Engineering",
|
| 84 |
+
"Information Technology", "Mathematics", "Statistics",
|
| 85 |
+
"Business Administration", "Economics", "Finance",
|
| 86 |
+
"Biomedical Engineering", "Chemical Engineering",
|
| 87 |
+
"Civil Engineering", "Physics", "Biology",
|
| 88 |
+
"Artificial Intelligence", "Machine Learning",
|
| 89 |
+
"Human-Computer Interaction", "Cybersecurity",
|
| 90 |
+
"Information Systems", "Operations Research",
|
| 91 |
+
]
|
| 92 |
+
|
| 93 |
+
MINORS = [
|
| 94 |
+
"Mathematics", "Statistics", "Psychology", "Business",
|
| 95 |
+
"Economics", "Philosophy", "Linguistics", "Physics",
|
| 96 |
+
"Data Science", "Communication", "Sociology", "History",
|
| 97 |
+
]
|
| 98 |
+
|
| 99 |
+
GPA_VALUES = [
|
| 100 |
+
"3.5", "3.6", "3.7", "3.8", "3.9", "4.0",
|
| 101 |
+
"3.52", "3.65", "3.78", "3.85", "3.92", "3.45",
|
| 102 |
+
]
|
| 103 |
+
|
| 104 |
+
GRAD_YEARS = list(range(2015, 2027))
|
| 105 |
+
|
| 106 |
+
JOB_TITLES = [
|
| 107 |
+
"Software Engineer", "Senior Software Engineer", "Staff Engineer",
|
| 108 |
+
"Principal Engineer", "Engineering Manager", "Tech Lead",
|
| 109 |
+
"Data Scientist", "Senior Data Scientist", "Machine Learning Engineer",
|
| 110 |
+
"ML Research Scientist", "Data Engineer", "Data Analyst",
|
| 111 |
+
"Product Manager", "Senior Product Manager", "Program Manager",
|
| 112 |
+
"DevOps Engineer", "Site Reliability Engineer", "Cloud Architect",
|
| 113 |
+
"Full Stack Developer", "Frontend Engineer", "Backend Engineer",
|
| 114 |
+
"Mobile Developer", "iOS Engineer", "Android Developer",
|
| 115 |
+
"QA Engineer", "Security Engineer", "Solutions Architect",
|
| 116 |
+
"Research Scientist", "AI Engineer", "NLP Engineer",
|
| 117 |
+
"Quantitative Analyst", "Financial Analyst", "Business Analyst",
|
| 118 |
+
"UX Designer", "UI Engineer", "Technical Writer",
|
| 119 |
+
"Intern", "Software Engineering Intern", "Data Science Intern",
|
| 120 |
+
]
|
| 121 |
+
|
| 122 |
+
PROGRAMMING_LANGUAGES = [
|
| 123 |
+
"Python", "Java", "JavaScript", "TypeScript", "C++", "C", "C#",
|
| 124 |
+
"Go", "Rust", "Kotlin", "Swift", "Ruby", "PHP", "Scala",
|
| 125 |
+
"R", "MATLAB", "Julia", "Haskell", "Elixir", "Dart",
|
| 126 |
+
]
|
| 127 |
+
|
| 128 |
+
FRAMEWORKS = [
|
| 129 |
+
"React", "Angular", "Vue.js", "Next.js", "Django", "Flask",
|
| 130 |
+
"FastAPI", "Spring Boot", "Express.js", "Node.js", "Rails",
|
| 131 |
+
"TensorFlow", "PyTorch", "Keras", "scikit-learn", "Pandas",
|
| 132 |
+
"NumPy", "Spark", "Hadoop", "Kubernetes", "Docker",
|
| 133 |
+
"AWS", "GCP", "Azure", "Terraform", "Ansible",
|
| 134 |
+
".NET", "Laravel", "Svelte", "Remix", "Astro",
|
| 135 |
+
]
|
| 136 |
+
|
| 137 |
+
TOOLS = [
|
| 138 |
+
"Git", "GitHub", "GitLab", "Jira", "Confluence", "Slack",
|
| 139 |
+
"VS Code", "IntelliJ", "PyCharm", "Vim", "Emacs",
|
| 140 |
+
"PostgreSQL", "MySQL", "MongoDB", "Redis", "Elasticsearch",
|
| 141 |
+
"Kafka", "RabbitMQ", "Airflow", "dbt", "Snowflake",
|
| 142 |
+
"Tableau", "Power BI", "Grafana", "Prometheus", "Datadog",
|
| 143 |
+
"Jenkins", "CircleCI", "GitHub Actions", "ArgoCD",
|
| 144 |
+
"Figma", "Sketch", "Adobe XD", "Postman", "Swagger",
|
| 145 |
+
]
|
| 146 |
+
|
| 147 |
+
SOFT_SKILLS = [
|
| 148 |
+
"Leadership", "Communication", "Team Collaboration",
|
| 149 |
+
"Problem Solving", "Critical Thinking", "Time Management",
|
| 150 |
+
"Project Management", "Agile Methodologies", "Scrum",
|
| 151 |
+
"Cross-functional Collaboration", "Mentoring",
|
| 152 |
+
"Strategic Planning", "Stakeholder Management",
|
| 153 |
+
"Technical Writing", "Public Speaking", "Negotiation",
|
| 154 |
+
]
|
| 155 |
+
|
| 156 |
+
CERTIFICATIONS_LIST = [
|
| 157 |
+
"AWS Certified Solutions Architect - Associate",
|
| 158 |
+
"AWS Certified Developer - Associate",
|
| 159 |
+
"AWS Certified Machine Learning - Specialty",
|
| 160 |
+
"Google Cloud Professional Data Engineer",
|
| 161 |
+
"Google Cloud Professional ML Engineer",
|
| 162 |
+
"Microsoft Azure Fundamentals (AZ-900)",
|
| 163 |
+
"Microsoft Azure Data Scientist Associate (DP-100)",
|
| 164 |
+
"Certified Kubernetes Administrator (CKA)",
|
| 165 |
+
"Certified Kubernetes Application Developer (CKAD)",
|
| 166 |
+
"Certified Information Systems Security Professional (CISSP)",
|
| 167 |
+
"CompTIA Security+",
|
| 168 |
+
"Project Management Professional (PMP)",
|
| 169 |
+
"Certified ScrumMaster (CSM)",
|
| 170 |
+
"TensorFlow Developer Certificate",
|
| 171 |
+
"Databricks Certified Data Engineer Associate",
|
| 172 |
+
"Snowflake SnowPro Core Certification",
|
| 173 |
+
"HashiCorp Terraform Associate",
|
| 174 |
+
"Cisco Certified Network Associate (CCNA)",
|
| 175 |
+
"Oracle Certified Professional, Java SE",
|
| 176 |
+
"Red Hat Certified System Administrator (RHCSA)",
|
| 177 |
+
"Deep Learning Specialization (Coursera)",
|
| 178 |
+
"Machine Learning by Stanford (Coursera)",
|
| 179 |
+
"Professional Scrum Master I (PSM I)",
|
| 180 |
+
]
|
| 181 |
+
|
| 182 |
+
AWARDS_LIST = [
|
| 183 |
+
"Dean's List", "Summa Cum Laude", "Magna Cum Laude", "Cum Laude",
|
| 184 |
+
"Phi Beta Kappa", "Tau Beta Pi", "National Merit Scholar",
|
| 185 |
+
"Employee of the Quarter", "Spot Bonus Award", "President's Club",
|
| 186 |
+
"Best Paper Award", "Innovation Award", "Hackathon Winner",
|
| 187 |
+
"Outstanding Graduate Student Award", "Research Fellowship",
|
| 188 |
+
"Teaching Assistant Excellence Award", "Community Service Award",
|
| 189 |
+
"IEEE Best Student Paper", "ACM ICPC Regional Finalist",
|
| 190 |
+
"Google Code Jam Qualifier", "Facebook Hacker Cup Participant",
|
| 191 |
+
"Patent Holder", "Top Performer Award", "Rising Star Award",
|
| 192 |
+
]
|
| 193 |
+
|
| 194 |
+
CITIES = [
|
| 195 |
+
"San Francisco, CA", "New York, NY", "Seattle, WA", "Austin, TX",
|
| 196 |
+
"Boston, MA", "Chicago, IL", "Los Angeles, CA", "Denver, CO",
|
| 197 |
+
"Portland, OR", "Atlanta, GA", "Washington, DC", "San Jose, CA",
|
| 198 |
+
"Raleigh, NC", "Pittsburgh, PA", "Minneapolis, MN", "Dallas, TX",
|
| 199 |
+
"Miami, FL", "Phoenix, AZ", "San Diego, CA", "Philadelphia, PA",
|
| 200 |
+
]
|
| 201 |
+
|
| 202 |
+
MONTHS = [
|
| 203 |
+
"January", "February", "March", "April", "May", "June",
|
| 204 |
+
"July", "August", "September", "October", "November", "December",
|
| 205 |
+
]
|
| 206 |
+
|
| 207 |
+
MONTHS_SHORT = [
|
| 208 |
+
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
|
| 209 |
+
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
|
| 210 |
+
]
|
| 211 |
+
|
| 212 |
+
PROJECT_ADJECTIVES = [
|
| 213 |
+
"Real-time", "Scalable", "Distributed", "Cloud-native",
|
| 214 |
+
"AI-powered", "Automated", "Interactive", "Cross-platform",
|
| 215 |
+
"Open-source", "End-to-end", "High-performance", "Serverless",
|
| 216 |
+
"Event-driven", "Microservice-based", "Full-stack",
|
| 217 |
+
]
|
| 218 |
+
|
| 219 |
+
PROJECT_NOUNS = [
|
| 220 |
+
"Dashboard", "Platform", "Pipeline", "Application", "System",
|
| 221 |
+
"API", "Framework", "Tool", "Service", "Engine",
|
| 222 |
+
"Chatbot", "Recommendation System", "Search Engine",
|
| 223 |
+
"Analytics Platform", "Monitoring System", "Marketplace",
|
| 224 |
+
]
|
| 225 |
+
|
| 226 |
+
IMPACT_METRICS = [
|
| 227 |
+
"reduced latency by {pct}%",
|
| 228 |
+
"improved throughput by {pct}%",
|
| 229 |
+
"increased user engagement by {pct}%",
|
| 230 |
+
"decreased error rate by {pct}%",
|
| 231 |
+
"saved ${amount}K annually",
|
| 232 |
+
"reduced costs by {pct}%",
|
| 233 |
+
"improved accuracy by {pct}%",
|
| 234 |
+
"increased conversion rate by {pct}%",
|
| 235 |
+
"served {users} daily active users",
|
| 236 |
+
"processed {events} events per second",
|
| 237 |
+
"reduced deployment time from hours to minutes",
|
| 238 |
+
"cut onboarding time by {pct}%",
|
| 239 |
+
"automated {pct}% of manual processes",
|
| 240 |
+
"improved model F1 score from 0.{f1_old} to 0.{f1_new}",
|
| 241 |
+
]
|
| 242 |
+
|
| 243 |
+
PHONE_AREA_CODES = [
|
| 244 |
+
"415", "650", "408", "510", "212", "646", "718", "206",
|
| 245 |
+
"512", "617", "312", "213", "303", "503", "404", "202",
|
| 246 |
+
]
|
| 247 |
+
|
| 248 |
+
LINKEDIN_PREFIXES = [
|
| 249 |
+
"linkedin.com/in/", "www.linkedin.com/in/",
|
| 250 |
+
]
|
| 251 |
+
|
| 252 |
+
GITHUB_PREFIXES = [
|
| 253 |
+
"github.com/", "www.github.com/",
|
| 254 |
+
]
|
| 255 |
+
|
| 256 |
+
DOMAINS = [
|
| 257 |
+
"gmail.com", "outlook.com", "yahoo.com", "protonmail.com",
|
| 258 |
+
"icloud.com", "hotmail.com", "mail.com",
|
| 259 |
+
]
|
| 260 |
+
|
| 261 |
+
# ---------------------------------------------------------------------------
|
| 262 |
+
# Synonym replacement pools for augmentation
|
| 263 |
+
# ---------------------------------------------------------------------------
|
| 264 |
+
|
| 265 |
+
SYNONYMS = {
|
| 266 |
+
"developed": ["built", "created", "engineered", "designed", "implemented", "constructed", "authored"],
|
| 267 |
+
"managed": ["led", "oversaw", "directed", "supervised", "coordinated", "administered"],
|
| 268 |
+
"improved": ["enhanced", "optimized", "upgraded", "refined", "boosted", "strengthened"],
|
| 269 |
+
"implemented": ["deployed", "executed", "delivered", "rolled out", "launched", "shipped"],
|
| 270 |
+
"analyzed": ["examined", "evaluated", "assessed", "investigated", "studied", "reviewed"],
|
| 271 |
+
"collaborated": ["partnered", "worked closely with", "teamed up with", "cooperated with"],
|
| 272 |
+
"responsible for": ["in charge of", "accountable for", "tasked with", "owned"],
|
| 273 |
+
"utilized": ["leveraged", "employed", "used", "applied", "harnessed"],
|
| 274 |
+
"achieved": ["accomplished", "attained", "reached", "secured", "delivered"],
|
| 275 |
+
"experience": ["expertise", "background", "proficiency", "track record"],
|
| 276 |
+
}
|
| 277 |
+
|
| 278 |
+
|
| 279 |
+
# ---------------------------------------------------------------------------
# Helper utilities
# ---------------------------------------------------------------------------


def _pick(pool, k=1):
    """Return k unique random items from a pool."""
    k = min(k, len(pool))
    return random.sample(pool, k)


def _pick_one(pool):
    return random.choice(pool)


def _date_range(allow_present: bool = True):
    """Return a random date range string."""
    start_year = random.randint(2014, 2024)
    start_month = _pick_one(MONTHS_SHORT)
    fmt = random.choice(["short", "long", "year_only"])

    if allow_present and random.random() < 0.3:
        end_str = random.choice(["Present", "Current", "Now"])
    else:
        end_year = random.randint(start_year, min(start_year + 6, 2026))
        end_month = _pick_one(MONTHS_SHORT)
        if fmt == "short":
            end_str = f"{end_month} {end_year}"
        elif fmt == "long":
            end_str = f"{_pick_one(MONTHS)} {end_year}"
        else:
            end_str = str(end_year)

    if fmt == "short":
        start_str = f"{start_month} {start_year}"
    elif fmt == "long":
        start_str = f"{_pick_one(MONTHS)} {start_year}"
    else:
        start_str = str(start_year)

    sep = random.choice([" - ", " – ", " to ", "–", "-"])
    return f"{start_str}{sep}{end_str}"


def _impact():
    """Generate a random impact metric string."""
    template = _pick_one(IMPACT_METRICS)
    return template.format(
        pct=random.randint(10, 85),
        amount=random.randint(50, 500),
        users=random.choice(["10K", "50K", "100K", "500K", "1M", "5M"]),
        events=random.choice(["1K", "10K", "50K", "100K", "1M"]),
        f1_old=random.randint(65, 80),
        f1_new=random.randint(82, 97),
    )


def _synonym_replace(text: str) -> str:
    """Randomly replace words with synonyms for augmentation."""
    words = text.split()
    result = []
    for w in words:
        lower = w.lower().rstrip(".,;:")
        if lower in SYNONYMS and random.random() < 0.3:
            replacement = _pick_one(SYNONYMS[lower])
            # Preserve original capitalization of first char
            if w[0].isupper():
                replacement = replacement.capitalize()
            # Preserve trailing punctuation
            trailing = w[len(lower):]
            result.append(replacement + trailing)
        else:
            result.append(w)
    return " ".join(result)


def _bullet():
    """Return a random bullet character."""
    return random.choice(["•", "-", "●", "*", "▪", ""])


def _reorder_bullets(bullets: list) -> list:
    """Shuffle bullet points for variation."""
    shuffled = bullets.copy()
    random.shuffle(shuffled)
    return shuffled

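To make the augmentation behavior concrete: the sketch below mirrors the `_synonym_replace` logic with a toy two-entry synonym table and a replacement probability forced to 1.0 so the output is deterministic. The real function samples from the full `SYNONYMS` dict with probability 0.3 per word; the table and `p` parameter here are illustrative only.

```python
import random

# Toy stand-in for the module's SYNONYMS dict (illustrative only).
SYNONYMS = {"improved": ["enhanced"], "utilized": ["leveraged"]}


def synonym_replace(text: str, p: float = 1.0) -> str:
    """Same replacement logic as _synonym_replace, with a tunable probability."""
    result = []
    for w in text.split():
        lower = w.lower().rstrip(".,;:")
        if lower in SYNONYMS and random.random() < p:
            replacement = random.choice(SYNONYMS[lower])
            if w[0].isupper():                   # preserve leading capitalization
                replacement = replacement.capitalize()
            result.append(replacement + w[len(lower):])  # preserve punctuation
        else:
            result.append(w)
    return " ".join(result)


print(synonym_replace("Improved latency; utilized caching."))
# → Enhanced latency; leveraged caching.
```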
# ---------------------------------------------------------------------------
# Section generators – each returns a string of realistic text
# ---------------------------------------------------------------------------


def generate_education() -> str:
    """Generate a realistic education section."""

    # Template 1: Full formal entry
    def _t1():
        uni = _pick_one(UNIVERSITIES)
        deg_full, deg_short = _pick_one(DEGREES)
        major = _pick_one(MAJORS)
        year = _pick_one(GRAD_YEARS)
        lines = []

        header_style = random.choice(["full", "short", "inline"])
        if header_style == "full":
            lines.append(f"{deg_full} in {major}")
            lines.append(f"{uni}")
            lines.append(f"Graduated: {_pick_one(MONTHS)} {year}")
        elif header_style == "short":
            lines.append(f"{deg_short} {major}, {uni} ({year})")
        else:
            lines.append(f"{uni} — {deg_full} in {major}, {year}")

        # Optional GPA
        if random.random() < 0.6:
            gpa = _pick_one(GPA_VALUES)
            lines.append(f"GPA: {gpa}/4.0")

        # Optional minor
        if random.random() < 0.3:
            minor = _pick_one(MINORS)
            lines.append(f"Minor in {minor}")

        # Optional coursework
        if random.random() < 0.5:
            courses = _pick(MAJORS + ["Algorithms", "Data Structures",
                                      "Operating Systems", "Database Systems",
                                      "Computer Networks", "Linear Algebra",
                                      "Probability and Statistics",
                                      "Deep Learning", "Natural Language Processing",
                                      "Computer Vision", "Distributed Systems"],
                            k=random.randint(3, 6))
            prefix = random.choice(["Relevant Coursework:", "Key Courses:", "Coursework:"])
            lines.append(f"{prefix} {', '.join(courses)}")

        # Optional honors
        if random.random() < 0.3:
            honor = random.choice(["Summa Cum Laude", "Magna Cum Laude",
                                   "Cum Laude", "Dean's List (all semesters)",
                                   "Honors Program", "University Scholar"])
            lines.append(honor)

        # Optional thesis
        if "Ph.D." in deg_short or ("M.S." in deg_short and random.random() < 0.4):
            topic = random.choice([
                "Transformer-based approaches to document classification",
                "Scalable distributed systems for real-time data processing",
                "Graph neural networks for molecular property prediction",
                "Federated learning in healthcare applications",
                "Efficient attention mechanisms for long-sequence modeling",
                "Reinforcement learning for autonomous navigation",
            ])
            label = "Dissertation" if "Ph.D." in deg_short else "Thesis"
            lines.append(f"{label}: \"{topic}\"")

        return "\n".join(lines)

    # Template 2: Multiple degrees
    def _t2():
        entries = []
        for _ in range(random.randint(2, 3)):
            uni = _pick_one(UNIVERSITIES)
            deg_full, deg_short = _pick_one(DEGREES)
            major = _pick_one(MAJORS)
            year = _pick_one(GRAD_YEARS)
            gpa_line = f" | GPA: {_pick_one(GPA_VALUES)}" if random.random() < 0.5 else ""
            entries.append(f"{deg_short} in {major}, {uni}, {year}{gpa_line}")
        return "\n".join(entries)

    # Template 3: Education with activities
    def _t3():
        uni = _pick_one(UNIVERSITIES)
        deg_full, deg_short = _pick_one(DEGREES)
        major = _pick_one(MAJORS)
        year = _pick_one(GRAD_YEARS)
        lines = [f"{uni}", f"{deg_full} in {major} | {_pick_one(MONTHS)} {year}"]

        activities = random.sample([
            "Teaching Assistant for Introduction to Computer Science",
            "President, Computer Science Student Association",
            "Member, ACM Student Chapter",
            "Undergraduate Research Assistant, ML Lab",
            "Peer Tutor, Mathematics Department",
            "Captain, University Programming Competition Team",
            "Volunteer, Engineering Outreach Program",
            "Member, Honors College",
            "Study Abroad Program, Technical University of Munich",
            "Resident Advisor, Engineering Living-Learning Community",
        ], k=random.randint(1, 3))

        b = _bullet()
        for a in activities:
            lines.append(f"{b} {a}" if b else a)

        return "\n".join(lines)

    templates = [_t1, _t2, _t3]
    return random.choice(templates)()


def generate_experience() -> str:
    """Generate a realistic work experience section."""

    def _single_role():
        title = _pick_one(JOB_TITLES)
        company = _pick_one(COMPANIES)
        city = _pick_one(CITIES)
        date_range = _date_range()

        header_styles = [
            f"{title} | {company} | {city} | {date_range}",
            f"{title}, {company}\n{city} | {date_range}",
            f"{company} — {title}\n{date_range} | {city}",
            f"{title}\n{company}, {city}\n{date_range}",
        ]
        lines = [random.choice(header_styles)]

        # Generate bullet points
        bullet_templates = [
            f"Developed and maintained {random.choice(['microservices', 'APIs', 'web applications', 'data pipelines', 'ML models', 'backend systems', 'frontend components'])} using {', '.join(_pick(PROGRAMMING_LANGUAGES, k=random.randint(1, 3)))} and {', '.join(_pick(FRAMEWORKS, k=random.randint(1, 2)))}",
            f"Collaborated with cross-functional teams of {random.randint(3, 15)} engineers to deliver {random.choice(['product features', 'platform improvements', 'system migrations', 'infrastructure upgrades'])} on schedule",
            f"Designed and implemented {random.choice(['CI/CD pipelines', 'testing frameworks', 'monitoring solutions', 'data models', 'caching strategies', 'authentication systems'])} that {_impact()}",
            f"Led migration of {random.choice(['legacy monolith', 'on-premise infrastructure', 'batch processing system', 'manual workflows'])} to {random.choice(['cloud-native architecture', 'microservices', 'real-time streaming', 'automated pipelines'])}",
            f"Mentored {random.randint(2, 8)} junior engineers through code reviews, pair programming, and technical design sessions",
            f"Optimized {random.choice(['database queries', 'API response times', 'model inference', 'data processing pipelines', 'search indexing'])} resulting in {_impact()}",
            f"Wrote comprehensive technical documentation and {random.choice(['RFCs', 'design docs', 'runbooks', 'architecture decision records'])} for {random.choice(['system design', 'API contracts', 'deployment procedures', 'incident response'])}",
            f"Built {random.choice(['real-time', 'batch', 'streaming', 'event-driven'])} {random.choice(['data pipeline', 'ETL process', 'analytics system', 'feature store'])} processing {random.choice(['1M+', '10M+', '100M+', '1B+'])} records {random.choice(['daily', 'per hour', 'in real-time'])}",
            f"Spearheaded adoption of {_pick_one(FRAMEWORKS)} and {_pick_one(TOOLS)}, {_impact()}",
            f"Conducted A/B testing and experimentation for {random.choice(['recommendation engine', 'search ranking', 'pricing model', 'onboarding flow', 'notification system'])}, {_impact()}",
            f"Architected {random.choice(['distributed', 'fault-tolerant', 'highly available', 'horizontally scalable'])} system handling {random.choice(['10K', '50K', '100K', '1M'])} requests per second with {random.choice(['99.9%', '99.95%', '99.99%'])} uptime",
        ]

        n_bullets = random.randint(2, 5)
        selected = random.sample(bullet_templates, min(n_bullets, len(bullet_templates)))
        selected = _reorder_bullets(selected)
        b = _bullet()
        for bullet in selected:
            lines.append(f"{b} {bullet}" if b else bullet)

        return "\n".join(lines)

    # Sometimes include multiple roles
    n_roles = random.choices([1, 2], weights=[0.7, 0.3])[0]
    roles = [_single_role() for _ in range(n_roles)]
    return "\n\n".join(roles)


def generate_skills() -> str:
    """Generate a realistic skills section."""

    def _t_categorized():
        lines = []
        categories = []

        if random.random() < 0.9:
            langs = _pick(PROGRAMMING_LANGUAGES, k=random.randint(3, 7))
            label = random.choice(["Languages", "Programming Languages", "Programming"])
            categories.append((label, langs))

        if random.random() < 0.9:
            fws = _pick(FRAMEWORKS, k=random.randint(3, 7))
            label = random.choice(["Frameworks", "Frameworks & Libraries", "Technologies"])
            categories.append((label, fws))

        if random.random() < 0.8:
            tls = _pick(TOOLS, k=random.randint(3, 7))
            label = random.choice(["Tools", "Developer Tools", "Tools & Platforms"])
            categories.append((label, tls))

        if random.random() < 0.4:
            ss = _pick(SOFT_SKILLS, k=random.randint(2, 5))
            label = random.choice(["Soft Skills", "Other Skills", "Additional Skills"])
            categories.append((label, ss))

        sep = random.choice([": ", " - ", " — "])
        for label, items in categories:
            joiner = random.choice([", ", " | ", " · ", " / "])
            lines.append(f"{label}{sep}{joiner.join(items)}")

        return "\n".join(lines)

    def _t_flat():
        all_skills = (_pick(PROGRAMMING_LANGUAGES, k=random.randint(3, 6)) +
                      _pick(FRAMEWORKS, k=random.randint(3, 6)) +
                      _pick(TOOLS, k=random.randint(2, 4)))
        random.shuffle(all_skills)
        joiner = random.choice([", ", " | ", " · ", " • "])
        return joiner.join(all_skills)

    def _t_proficiency():
        lines = []
        levels = ["Expert", "Advanced", "Proficient", "Intermediate", "Familiar"]
        used = set()
        for level in random.sample(levels, k=random.randint(2, 4)):
            pool = [s for s in PROGRAMMING_LANGUAGES + FRAMEWORKS + TOOLS if s not in used]
            items = _pick(pool, k=random.randint(2, 5))
            used.update(items)
            lines.append(f"{level}: {', '.join(items)}")
        return "\n".join(lines)

    templates = [_t_categorized, _t_flat, _t_proficiency]
    return random.choice(templates)()


def generate_projects() -> str:
    """Generate a realistic projects section."""

    def _single_project():
        adj = _pick_one(PROJECT_ADJECTIVES)
        noun = _pick_one(PROJECT_NOUNS)
        name = f"{adj} {noun}"
        techs = _pick(PROGRAMMING_LANGUAGES + FRAMEWORKS, k=random.randint(2, 5))

        header_styles = [
            f"{name} | {', '.join(techs)}",
            f"{name}\nTechnologies: {', '.join(techs)}",
            f"{name} ({', '.join(techs)})",
        ]
        lines = [random.choice(header_styles)]

        # Optional link
        if random.random() < 0.3:
            username = _pick_one(FIRST_NAMES).lower() + _pick_one(LAST_NAMES).lower()
            lines.append(f"github.com/{username}/{name.lower().replace(' ', '-')}")

        descriptions = [
            f"Built a {noun.lower()} that {random.choice(['processes', 'analyzes', 'visualizes', 'aggregates', 'transforms'])} {random.choice(['user data', 'financial data', 'text documents', 'sensor data', 'social media feeds', 'medical records'])} in real-time",
            f"Implemented {random.choice(['REST API', 'GraphQL API', 'gRPC service', 'WebSocket server', 'event-driven architecture'])} with {random.choice(['authentication', 'rate limiting', 'caching', 'pagination', 'logging'])} support",
            f"Trained {random.choice(['classification', 'regression', 'NLP', 'computer vision', 'recommendation'])} model achieving {random.choice(['92%', '95%', '97%', '89%', '94%'])} {random.choice(['accuracy', 'F1 score', 'AUC-ROC'])} on test set",
            f"Deployed to {random.choice(['AWS', 'GCP', 'Azure', 'Heroku', 'Vercel', 'Railway'])} with {random.choice(['Docker', 'Kubernetes', 'serverless', 'auto-scaling'])} configuration",
            f"Attracted {random.choice(['100+', '500+', '1K+', '5K+'])} GitHub stars and {random.choice(['20+', '50+', '100+'])} contributors from the open-source community",
            f"Features {random.choice(['real-time notifications', 'responsive UI', 'role-based access control', 'data export', 'interactive visualizations', 'natural language search'])}",
        ]

        b = _bullet()
        for desc in random.sample(descriptions, k=random.randint(2, 4)):
            lines.append(f"{b} {desc}" if b else desc)

        return "\n".join(lines)

    n_projects = random.randint(1, 3)
    return "\n\n".join([_single_project() for _ in range(n_projects)])


def generate_summary() -> str:
    """Generate a realistic professional summary / objective section."""
    years = random.randint(2, 15)
    specialties = _pick(MAJORS + [
        "full-stack development", "distributed systems", "machine learning",
        "data engineering", "cloud architecture", "mobile development",
        "DevOps", "backend development", "frontend development",
        "natural language processing", "computer vision",
    ], k=random.randint(1, 3))

    templates = [
        # Template 1: Traditional summary
        lambda: f"Results-driven {_pick_one(JOB_TITLES).lower()} with {years}+ years of experience in {' and '.join(specialties)}. Proven track record of {random.choice(['delivering high-impact solutions', 'building scalable systems', 'driving technical excellence', 'leading cross-functional teams'])} at companies like {_pick_one(COMPANIES)} and {_pick_one(COMPANIES)}. Passionate about {random.choice(['clean code', 'system design', 'open source', 'mentorship', 'continuous learning', 'innovation'])} and {random.choice(['building products that scale', 'solving complex problems', 'leveraging data-driven insights', 'improving developer experience'])}.",

        # Template 2: Technical focus
        lambda: f"Experienced {_pick_one(JOB_TITLES).lower()} specializing in {', '.join(specialties)}. Skilled in {', '.join(_pick(PROGRAMMING_LANGUAGES, k=3))} with deep expertise in {', '.join(_pick(FRAMEWORKS, k=2))}. {random.choice(['Strong background in', 'Demonstrated ability in', 'Track record of'])} {random.choice(['building distributed systems at scale', 'developing ML models for production', 'architecting cloud-native applications', 'leading agile engineering teams'])}. Seeking to {random.choice(['contribute to cutting-edge products', 'drive technical innovation', 'solve challenging problems', 'build impactful technology'])} at a {random.choice(['fast-growing startup', 'leading technology company', 'mission-driven organization'])}.",

        # Template 3: Achievement-oriented
        lambda: f"{_pick_one(JOB_TITLES)} with {years} years of experience building {random.choice(['enterprise-scale', 'consumer-facing', 'B2B', 'data-intensive'])} applications. Key achievements include {_impact()}, {_impact()}, and {_impact()}. Proficient in {', '.join(_pick(PROGRAMMING_LANGUAGES, k=3))} and {', '.join(_pick(FRAMEWORKS, k=2))}.",

        # Template 4: Brief objective
        lambda: f"Motivated {random.choice(['professional', 'engineer', 'developer', 'technologist'])} seeking a {_pick_one(JOB_TITLES).lower()} role where I can apply my expertise in {' and '.join(specialties)} to {random.choice(['build innovative products', 'solve real-world problems', 'drive business impact', 'push the boundaries of technology'])}.",

        # Template 5: Narrative style
        lambda: f"I am a {_pick_one(JOB_TITLES).lower()} who thrives at the intersection of {_pick_one(specialties)} and {_pick_one(specialties)}. Over the past {years} years, I have {random.choice(['shipped products used by millions', 'built ML systems processing petabytes of data', 'led engineering teams through rapid growth', 'contributed to open-source projects with thousands of stars'])}. I bring a {random.choice(['data-driven', 'user-centric', 'systems-thinking', 'first-principles'])} approach to every problem I tackle.",
    ]

    return random.choice(templates)()


def generate_certifications() -> str:
    """Generate a realistic certifications section."""
    n = random.randint(2, 6)
    certs = _pick(CERTIFICATIONS_LIST, k=n)

    lines = []
    for cert in certs:
        year = random.randint(2019, 2025)
        styles = [
            f"{cert} ({year})",
            f"{cert} — Issued {_pick_one(MONTHS)} {year}",
            f"{cert}, {year}",
            f"{cert}\n  Issued: {_pick_one(MONTHS_SHORT)} {year}" + (
                f" | Expires: {_pick_one(MONTHS_SHORT)} {year + random.randint(2, 3)}"
                if random.random() < 0.3 else ""
            ),
        ]
        lines.append(random.choice(styles))

    b = _bullet()
    if b and random.random() < 0.5:
        return "\n".join(f"{b} {line}" for line in lines)
    return "\n".join(lines)


def generate_contact() -> str:
    """Generate a realistic contact information section."""
    first = _pick_one(FIRST_NAMES)
    last = _pick_one(LAST_NAMES)
    city = _pick_one(CITIES)
    area_code = _pick_one(PHONE_AREA_CODES)
    email_user = random.choice([
        f"{first.lower()}.{last.lower()}",
        f"{first.lower()}{last.lower()}",
        f"{first[0].lower()}{last.lower()}",
        f"{first.lower()}_{last.lower()}",
        f"{first.lower()}{random.randint(1, 99)}",
    ])
    email = f"{email_user}@{_pick_one(DOMAINS)}"
    phone = f"({area_code}) {random.randint(100, 999)}-{random.randint(1000, 9999)}"
    linkedin_user = f"{first.lower()}-{last.lower()}-{random.randint(100, 999)}"
    github_user = f"{first.lower()}{last.lower()}"

    parts = [f"{first} {last}"]

    if random.random() < 0.8:
        parts.append(email)
    if random.random() < 0.7:
        parts.append(phone)
    if random.random() < 0.6:
        parts.append(city)
    if random.random() < 0.5:
        parts.append(f"{_pick_one(LINKEDIN_PREFIXES)}{linkedin_user}")
    if random.random() < 0.4:
        parts.append(f"{_pick_one(GITHUB_PREFIXES)}{github_user}")
    if random.random() < 0.2:
        parts.append(f"{github_user}.dev" if random.random() < 0.5 else f"{first.lower()}{last.lower()}.com")

    sep = random.choice(["\n", " | ", " · ", "\n"])
    return sep.join(parts)


def generate_awards() -> str:
    """Generate a realistic awards & honors section."""
    n = random.randint(2, 6)
    awards = _pick(AWARDS_LIST, k=n)
    lines = []

    for award in awards:
        year = random.randint(2015, 2025)
        org = random.choice([
            _pick_one(UNIVERSITIES),
            _pick_one(COMPANIES),
            random.choice(["ACM", "IEEE", "Google", "Facebook", "Microsoft",
                           "National Science Foundation", "Department of Education"]),
        ])
        styles = [
            f"{award}, {org} ({year})",
            f"{award} — {org}, {year}",
            f"{award} ({year})\n  Awarded by {org}",
            f"{award}, {year}",
        ]
        lines.append(random.choice(styles))

    b = _bullet()
    if b and random.random() < 0.6:
        return "\n".join(f"{b} {line}" for line in lines)
    return "\n".join(lines)


# ---------------------------------------------------------------------------
# Optional section headers – sometimes sections include a heading
# ---------------------------------------------------------------------------

SECTION_HEADERS = {
    "education": ["EDUCATION", "Education", "Academic Background", "ACADEMIC BACKGROUND", "Education & Training"],
    "experience": ["EXPERIENCE", "Experience", "WORK EXPERIENCE", "Work Experience", "PROFESSIONAL EXPERIENCE", "Professional Experience", "Employment History"],
    "skills": ["SKILLS", "Skills", "TECHNICAL SKILLS", "Technical Skills", "Core Competencies", "CORE COMPETENCIES", "Technologies"],
    "projects": ["PROJECTS", "Projects", "PERSONAL PROJECTS", "Personal Projects", "SIDE PROJECTS", "Selected Projects", "Portfolio"],
    "summary": ["SUMMARY", "Summary", "PROFESSIONAL SUMMARY", "Professional Summary", "OBJECTIVE", "Objective", "PROFILE", "Profile", "About Me", "ABOUT"],
    "certifications": ["CERTIFICATIONS", "Certifications", "CERTIFICATES", "Certificates", "Licenses & Certifications", "PROFESSIONAL CERTIFICATIONS"],
    "contact": ["CONTACT", "Contact", "CONTACT INFORMATION", "Contact Information", "Personal Information"],
    "awards": ["AWARDS", "Awards", "HONORS & AWARDS", "Honors & Awards", "ACHIEVEMENTS", "Achievements", "Awards & Honors", "RECOGNITION"],
}

GENERATORS = {
    "education": generate_education,
    "experience": generate_experience,
    "skills": generate_skills,
    "projects": generate_projects,
    "summary": generate_summary,
    "certifications": generate_certifications,
    "contact": generate_contact,
    "awards": generate_awards,
}


# ---------------------------------------------------------------------------
# Dataset generation
# ---------------------------------------------------------------------------

def generate_example(label: str, include_header: bool = False, augment: bool = False) -> str:
    """
    Generate a single synthetic example for the given label.

    Args:
        label: One of the 8 section categories.
        include_header: Whether to prepend a section header.
        augment: Whether to apply text augmentation.

    Returns:
        Generated text string.
    """
    text = GENERATORS[label]()

    # Optionally prepend a section header
    if include_header and random.random() < 0.5:
        header = _pick_one(SECTION_HEADERS[label])
        sep = random.choice(["\n", "\n\n", "\n---\n"])
        text = f"{header}{sep}{text}"

    # Augmentation
    if augment:
        if random.random() < 0.4:
            text = _synonym_replace(text)
        # Randomly add/remove trailing whitespace or newlines
        if random.random() < 0.2:
            text = text.strip() + "\n"
        if random.random() < 0.1:
            text = " " + text

    return text


def generate_dataset(
    examples_per_category: int = 80,
    augmented_copies: int = 2,
    include_header_prob: float = 0.4,
    seed: int = 42,
) -> list[dict]:
    """
    Generate a complete synthetic dataset.

    Args:
        examples_per_category: Base examples per category.
        augmented_copies: Number of augmented copies per base example.
        include_header_prob: Probability of including a section header.
        seed: Random seed for reproducibility.

    Returns:
        List of dicts with 'text' and 'label' keys.
    """
    random.seed(seed)
    labels = list(GENERATORS.keys())
    dataset = []

    for label in labels:
        for _ in range(examples_per_category):
            include_header = random.random() < include_header_prob
            text = generate_example(label, include_header=include_header, augment=False)
            dataset.append({"text": text, "label": label})

            # Generate augmented versions
            for _ in range(augmented_copies):
                aug_text = generate_example(label, include_header=include_header, augment=True)
                dataset.append({"text": aug_text, "label": label})

    random.shuffle(dataset)
    return dataset


def save_to_csv(dataset: list[dict], path: str) -> None:
    """Save dataset to CSV."""
    filepath = Path(path)
    filepath.parent.mkdir(parents=True, exist_ok=True)
    with open(filepath, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(dataset)
    print(f"Saved {len(dataset)} examples to {filepath}")


def load_as_hf_dataset(dataset: list[dict]):
    """Convert to a HuggingFace DatasetDict with train/val/test splits."""
    from datasets import ClassLabel, Dataset, DatasetDict

    ds = Dataset.from_list(dataset)

    # train_test_split's stratify_by_column requires a ClassLabel feature,
    # so cast the string labels before splitting.
    label_names = sorted(set(d["label"] for d in dataset))
    ds = ds.cast_column("label", ClassLabel(names=label_names))

    # 80/10/10 split
    train_test = ds.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
    val_test = train_test["test"].train_test_split(test_size=0.5, seed=42, stratify_by_column="label")

    return DatasetDict({
        "train": train_test["train"],
        "validation": val_test["train"],
        "test": val_test["test"],
    })


def get_label_mapping(dataset: list[dict]) -> tuple[dict, dict]:
    """Create label <-> id mappings."""
    labels = sorted(set(d["label"] for d in dataset))
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


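To make the `get_label_mapping` contract concrete: ids follow sorted label order, so the mapping is stable across runs regardless of dataset shuffling. The toy rows below are hypothetical stand-ins for generated examples.

```python
# Toy dataset standing in for generate_dataset() output (hypothetical rows).
dataset = [
    {"text": "B.S. in Computer Science, 2020", "label": "education"},
    {"text": "Python | SQL | Docker", "label": "skills"},
    {"text": "Software Engineer, Acme (2021 - Present)", "label": "experience"},
]

# Same mapping logic as get_label_mapping above.
labels = sorted(set(d["label"] for d in dataset))
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(label2id)  # {'education': 0, 'experience': 1, 'skills': 2}
```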
# ---------------------------------------------------------------------------
|
| 881 |
+
# CLI entry point
|
| 882 |
+
# ---------------------------------------------------------------------------
|
| 883 |
+
|
| 884 |
+
if __name__ == "__main__":
|
| 885 |
+
import argparse
|
| 886 |
+
|
| 887 |
+
    parser = argparse.ArgumentParser(description="Generate synthetic resume section data")
    parser.add_argument("--examples-per-category", type=int, default=80,
                        help="Number of base examples per category (default: 80)")
    parser.add_argument("--augmented-copies", type=int, default=2,
                        help="Number of augmented copies per example (default: 2)")
    parser.add_argument("--output", type=str, default="data/resume_sections.csv",
                        help="Output CSV path (default: data/resume_sections.csv)")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed (default: 42)")
    parser.add_argument("--print-stats", action="store_true",
                        help="Print dataset statistics")
    parser.add_argument("--print-samples", type=int, default=0,
                        help="Print N sample examples")

    args = parser.parse_args()

    print(f"Generating dataset with {args.examples_per_category} base examples per category...")
    print(f"Augmented copies per example: {args.augmented_copies}")
    print(f"Total expected examples: {args.examples_per_category * (1 + args.augmented_copies) * 8}")

    dataset = generate_dataset(
        examples_per_category=args.examples_per_category,
        augmented_copies=args.augmented_copies,
        seed=args.seed,
    )

    save_to_csv(dataset, args.output)

    if args.print_stats:
        from collections import Counter
        counts = Counter(d["label"] for d in dataset)
        print("\nDataset Statistics:")
        print(f"  Total examples: {len(dataset)}")
        print(f"  Categories: {len(counts)}")
        for label, count in sorted(counts.items()):
            print(f"  {label}: {count}")
        avg_len = sum(len(d["text"]) for d in dataset) / len(dataset)
        print(f"  Average text length: {avg_len:.0f} chars")

    if args.print_samples > 0:
        print(f"\n{'='*60}")
        print(f"Sample Examples (first {args.print_samples}):")
        print(f"{'='*60}")
        for i, example in enumerate(dataset[:args.print_samples]):
            print(f"\n--- Example {i+1} [{example['label']}] ---")
            print(example["text"][:300])
            if len(example["text"]) > 300:
                print("...")
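The dataset size printed by this CLI follows a simple formula: each base example plus its augmented copies, times 8 categories. A quick sketch of the arithmetic (`expected_total` is an illustrative helper, not part of the repo):

```python
def expected_total(examples_per_category: int, augmented_copies: int,
                   num_categories: int = 8) -> int:
    # Each base example yields itself plus `augmented_copies` variants,
    # and the generator covers num_categories section labels.
    return examples_per_category * (1 + augmented_copies) * num_categories

print(expected_total(80, 2))  # defaults: 80 * 3 * 8 = 1920
```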
inference.py
ADDED
|
@@ -0,0 +1,441 @@
"""
Resume Section Classifier – Inference Script

Takes raw resume text, splits it into sections, and classifies each section
into one of 8 categories with confidence scores.

Author: Lorenzo Scaturchio (gr8monk3ys)

Usage:
    # Classify a resume file
    python inference.py --file resume.txt

    # Classify inline text
    python inference.py --text "Bachelor of Science in Computer Science, MIT, 2023"

    # Use a custom model path
    python inference.py --model ./model_output/final_model --file resume.txt

    # Output as JSON
    python inference.py --file resume.txt --format json

    # Python API
    from inference import ResumeSectionClassifier
    classifier = ResumeSectionClassifier("./model_output/final_model")
    results = classifier.classify_resume(resume_text)
"""

import json
import re
import sys
from dataclasses import dataclass, field
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


# ---------------------------------------------------------------------------
# Data classes
# ---------------------------------------------------------------------------

@dataclass
class SectionPrediction:
    """A single section classification result."""
    text: str
    label: str
    confidence: float
    all_scores: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {
            "text": self.text,
            "label": self.label,
            "confidence": round(self.confidence, 4),
            "all_scores": {k: round(v, 4) for k, v in self.all_scores.items()},
        }


@dataclass
class ResumeAnalysis:
    """Complete resume analysis output."""
    sections: list
    section_count: int = 0
    label_distribution: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {
            "sections": [s.to_dict() for s in self.sections],
            "section_count": self.section_count,
            "label_distribution": self.label_distribution,
        }

    def to_json(self, indent: int = 2) -> str:
        return json.dumps(self.to_dict(), indent=indent)

    def summary(self) -> str:
        """Human-readable summary."""
        lines = [
            f"Resume Analysis: {self.section_count} sections detected",
            "=" * 50,
        ]
        for i, sec in enumerate(self.sections, 1):
            text_preview = sec.text[:80].replace("\n", " ")
            if len(sec.text) > 80:
                text_preview += "..."
            lines.append(
                f"\n[{i}] {sec.label.upper()} (confidence: {sec.confidence:.1%})"
            )
            lines.append(f"    {text_preview}")

        lines.append("\n" + "-" * 50)
        lines.append("Label Distribution:")
        for label, count in sorted(self.label_distribution.items()):
            lines.append(f"  {label}: {count}")

        return "\n".join(lines)


# ---------------------------------------------------------------------------
# Section splitting heuristics
# ---------------------------------------------------------------------------

# Common resume section headers (case-insensitive patterns)
SECTION_HEADER_PATTERNS = [
    r"^#{1,3}\s+.+$",  # Markdown headers
    r"^[A-Z][A-Z\s&/,]{2,}$",  # ALL CAPS headers
    r"^(?:EDUCATION|EXPERIENCE|WORK EXPERIENCE|PROFESSIONAL EXPERIENCE|"
    r"SKILLS|TECHNICAL SKILLS|PROJECTS|PERSONAL PROJECTS|"
    r"SUMMARY|PROFESSIONAL SUMMARY|OBJECTIVE|PROFILE|ABOUT|"
    r"CERTIFICATIONS|CERTIFICATES|LICENSES|"
    r"CONTACT|CONTACT INFORMATION|PERSONAL INFORMATION|"
    r"AWARDS|HONORS|ACHIEVEMENTS|RECOGNITION|"
    r"PUBLICATIONS|REFERENCES|VOLUNTEER|LANGUAGES|INTERESTS|"
    r"ACTIVITIES|LEADERSHIP|RESEARCH)\s*:?\s*$",
]

COMPILED_HEADERS = [re.compile(p, re.MULTILINE | re.IGNORECASE) for p in SECTION_HEADER_PATTERNS]


def is_section_header(line: str) -> bool:
    """Check if a line looks like a section header."""
    stripped = line.strip()
    if not stripped or len(stripped) < 3:
        return False

    for pattern in COMPILED_HEADERS:
        if pattern.match(stripped):
            return True

    # Heuristic: short all-caps line
    if stripped.isupper() and len(stripped.split()) <= 5 and len(stripped) < 50:
        return True

    return False


def split_resume_into_sections(text: str, min_section_length: int = 20) -> list:
    """
    Split raw resume text into logical sections.

    Strategy:
    1. First try to split on detected section headers.
    2. Fall back to splitting on double newlines (paragraph breaks).
    3. Filter out very short fragments.

    Args:
        text: Raw resume text.
        min_section_length: Minimum character length for a section.

    Returns:
        List of text sections.
    """
    lines = text.split("\n")
    sections = []
    current_section_lines = []

    # Pass 1: Try header-based splitting
    header_found = False
    for line in lines:
        if is_section_header(line):
            header_found = True
            # Save previous section
            if current_section_lines:
                section_text = "\n".join(current_section_lines).strip()
                if len(section_text) >= min_section_length:
                    sections.append(section_text)
            current_section_lines = [line]
        else:
            current_section_lines.append(line)

    # Don't forget the last section
    if current_section_lines:
        section_text = "\n".join(current_section_lines).strip()
        if len(section_text) >= min_section_length:
            sections.append(section_text)

    # If no headers found, fall back to paragraph splitting
    if not header_found or len(sections) <= 1:
        sections = []
        paragraphs = re.split(r"\n\s*\n", text)
        for para in paragraphs:
            stripped = para.strip()
            if len(stripped) >= min_section_length:
                sections.append(stripped)

    # If still just one big block, return it as-is
    if not sections:
        stripped = text.strip()
        if stripped:
            sections = [stripped]

    return sections

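The paragraph-break fallback in this splitting strategy can be exercised in isolation. A minimal sketch (`split_paragraphs` is a hypothetical helper mirroring only the fallback path, not the header-based first pass):

```python
import re

def split_paragraphs(text: str, min_section_length: int = 20) -> list:
    # Mirror of the fallback in split_resume_into_sections: split on blank
    # lines, then drop fragments shorter than min_section_length characters.
    return [p.strip() for p in re.split(r"\n\s*\n", text)
            if len(p.strip()) >= min_section_length]

resume = "EDUCATION\nBS Computer Science, MIT, 2023\n\nSKILLS\nPython, PyTorch, SQL"
print(split_paragraphs(resume))  # two sections: EDUCATION..., SKILLS...
```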

# ---------------------------------------------------------------------------
# Classifier
# ---------------------------------------------------------------------------

class ResumeSectionClassifier:
    """
    Classifies resume text sections into categories.

    Supports both single-section and full-resume classification.
    """

    def __init__(
        self,
        model_path: str = "./model_output/final_model",
        device: str = None,
        max_length: int = 256,
    ):
        """
        Initialize the classifier.

        Args:
            model_path: Path to fine-tuned model directory.
            device: Device string ('cpu', 'cuda', 'mps'). Auto-detected if None.
            max_length: Maximum token sequence length.
        """
        self.model_path = Path(model_path)
        self.max_length = max_length

        # Auto-detect device
        if device is None:
            if torch.cuda.is_available():
                self.device = torch.device("cuda")
            elif torch.backends.mps.is_available():
                self.device = torch.device("mps")
            else:
                self.device = torch.device("cpu")
        else:
            self.device = torch.device(device)

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(str(self.model_path))
        self.model = AutoModelForSequenceClassification.from_pretrained(
            str(self.model_path)
        ).to(self.device)
        self.model.eval()

        # Load label mapping
        label_mapping_path = self.model_path / "label_mapping.json"
        if label_mapping_path.exists():
            with open(label_mapping_path) as f:
                mapping = json.load(f)
            self.id2label = {int(k): v for k, v in mapping["id2label"].items()}
            self.label2id = mapping["label2id"]
        else:
            # Fall back to model config
            self.id2label = self.model.config.id2label
            self.label2id = self.model.config.label2id

        self.labels = sorted(self.label2id.keys())

    def classify_section(self, text: str) -> SectionPrediction:
        """
        Classify a single text section.

        Args:
            text: Section text to classify.

        Returns:
            SectionPrediction with label, confidence, and all scores.
        """
        inputs = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding=True,
            return_tensors="pt",
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)[0]

        scores = {self.id2label[i]: probs[i].item() for i in range(len(probs))}
        predicted_id = probs.argmax().item()
        predicted_label = self.id2label[predicted_id]
        confidence = probs[predicted_id].item()

        return SectionPrediction(
            text=text,
            label=predicted_label,
            confidence=confidence,
            all_scores=scores,
        )

    def classify_sections(self, texts: list) -> list:
        """
        Classify multiple text sections (batched).

        Args:
            texts: List of section texts.

        Returns:
            List of SectionPrediction objects.
        """
        if not texts:
            return []

        inputs = self.tokenizer(
            texts,
            truncation=True,
            max_length=self.max_length,
            padding=True,
            return_tensors="pt",
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)

        results = []
        for i, text in enumerate(texts):
            scores = {self.id2label[j]: probs[i][j].item() for j in range(probs.shape[1])}
            predicted_id = probs[i].argmax().item()
            predicted_label = self.id2label[predicted_id]
            confidence = probs[i][predicted_id].item()

            results.append(SectionPrediction(
                text=text,
                label=predicted_label,
                confidence=confidence,
                all_scores=scores,
            ))

        return results

    def classify_resume(
        self,
        resume_text: str,
        min_section_length: int = 20,
    ) -> ResumeAnalysis:
        """
        Classify a full resume by splitting into sections and classifying each.

        Args:
            resume_text: Full resume text.
            min_section_length: Minimum section length in characters.

        Returns:
            ResumeAnalysis with all section predictions.
        """
        sections = split_resume_into_sections(resume_text, min_section_length)
        predictions = self.classify_sections(sections)

        # Compute label distribution
        label_dist = {}
        for pred in predictions:
            label_dist[pred.label] = label_dist.get(pred.label, 0) + 1

        return ResumeAnalysis(
            sections=predictions,
            section_count=len(predictions),
            label_distribution=label_dist,
        )


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="Classify resume sections",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python inference.py --file resume.txt
  python inference.py --text "BS in Computer Science, MIT, 2023"
  python inference.py --file resume.txt --format json
  python inference.py --model ./model_output/final_model --file resume.txt
""",
    )

    input_group = parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument("--file", type=str, help="Path to resume text file")
    input_group.add_argument("--text", type=str, help="Direct text to classify")

    parser.add_argument("--model", type=str, default="./model_output/final_model",
                        help="Path to fine-tuned model (default: ./model_output/final_model)")
    parser.add_argument("--device", type=str, default=None,
                        help="Device: cpu, cuda, mps (auto-detected if omitted)")
    parser.add_argument("--max-length", type=int, default=256,
                        help="Maximum token sequence length (default: 256)")
    parser.add_argument("--min-section-length", type=int, default=20,
                        help="Minimum section length in characters (default: 20)")
    parser.add_argument("--format", type=str, choices=["text", "json"], default="text",
                        help="Output format (default: text)")
    parser.add_argument("--single", action="store_true",
                        help="Classify as single section (no splitting)")

    args = parser.parse_args()

    # Load classifier
    try:
        classifier = ResumeSectionClassifier(
            model_path=args.model,
            device=args.device,
            max_length=args.max_length,
        )
    except Exception as e:
        print(f"Error loading model from '{args.model}': {e}", file=sys.stderr)
        print("Have you trained the model yet? Run: python train.py", file=sys.stderr)
        sys.exit(1)

    # Get input text
    if args.file:
        file_path = Path(args.file)
        if not file_path.exists():
            print(f"File not found: {args.file}", file=sys.stderr)
            sys.exit(1)
        text = file_path.read_text(encoding="utf-8")
    else:
        text = args.text

    # Classify
    if args.single:
        result = classifier.classify_section(text)
        if args.format == "json":
            print(json.dumps(result.to_dict(), indent=2))
        else:
            print(f"Label: {result.label}")
            print(f"Confidence: {result.confidence:.1%}")
            print("\nAll scores:")
            for label, score in sorted(result.all_scores.items(), key=lambda x: -x[1]):
                bar = "#" * int(score * 40)
                print(f"  {label:20s} {score:.4f} {bar}")
    else:
        analysis = classifier.classify_resume(text, min_section_length=args.min_section_length)
        if args.format == "json":
            print(analysis.to_json())
        else:
            print(analysis.summary())


if __name__ == "__main__":
    main()
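The text-format CLI output above renders each class probability as a 40-character bar via `"#" * int(score * 40)`; the scaling can be sketched on its own:

```python
def score_bar(score: float, width: int = 40) -> str:
    # Same scaling as the CLI output: roughly one '#' per 1/width of
    # probability mass, truncated toward zero.
    return "#" * int(score * width)

print(score_bar(0.5))  # 20 '#' characters
```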
requirements.txt
ADDED
|
@@ -0,0 +1,8 @@
transformers>=4.36.0
datasets>=2.16.0
torch>=2.1.0
scikit-learn>=1.3.0
accelerate>=0.25.0
evaluate>=0.4.0
pandas>=2.0.0
huggingface_hub>=0.20.0
train.py
ADDED
|
@@ -0,0 +1,437 @@
"""
Resume Section Classifier – Training Script

Fine-tunes distilbert-base-uncased for classifying resume text sections
into 8 categories: education, experience, skills, projects, summary,
certifications, contact, awards.

Author: Lorenzo Scaturchio (gr8monk3ys)

Usage:
    python train.py                      # Train with defaults
    python train.py --epochs 5 --batch-size 32
    python train.py --push-to-hub        # Push to HuggingFace Hub
    python train.py --output-dir ./my_model
"""

import json
import logging
import os
import sys
from pathlib import Path

import evaluate
import numpy as np
import torch
from datasets import DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

from data_generator import generate_dataset, get_label_mapping, load_as_hf_dataset

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
MODEL_NAME = "distilbert-base-uncased"
DEFAULT_OUTPUT_DIR = "./model_output"
DEFAULT_LOGGING_DIR = "./logs"
HUB_MODEL_ID = "gr8monk3ys/resume-section-classifier"
MAX_LENGTH = 256


# ---------------------------------------------------------------------------
# Metrics computation
# ---------------------------------------------------------------------------
def build_compute_metrics(id2label: dict):
    """Build a compute_metrics function with access to label mappings."""
    accuracy_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)

        acc = accuracy_metric.compute(predictions=predictions, references=labels)
        f1_macro = f1_metric.compute(predictions=predictions, references=labels, average="macro")
        f1_weighted = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
        precision = precision_metric.compute(predictions=predictions, references=labels, average="weighted")
        recall = recall_metric.compute(predictions=predictions, references=labels, average="weighted")

        return {
            "accuracy": acc["accuracy"],
            "f1_macro": f1_macro["f1"],
            "f1_weighted": f1_weighted["f1"],
            "precision": precision["precision"],
            "recall": recall["recall"],
        }

    return compute_metrics

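Training later selects the best checkpoint on `f1_macro`, so it is worth being precise about what the macro average does: it averages per-class F1 scores with equal weight, regardless of class frequency. A hand-rolled sketch for a two-class toy case (this mirrors `average="macro"` conceptually; it is not the `evaluate` implementation):

```python
import numpy as np

def f1_macro(predictions, references, num_labels):
    # Unweighted mean of per-class F1, as with average="macro".
    f1s = []
    for c in range(num_labels):
        tp = np.sum((predictions == c) & (references == c))
        fp = np.sum((predictions == c) & (references != c))
        fn = np.sum((predictions != c) & (references == c))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if (precision + recall) else 0.0)
    return float(np.mean(f1s))

preds = np.array([0, 1, 1, 0])
refs = np.array([0, 1, 0, 0])
print(f1_macro(preds, refs, num_labels=2))  # (0.8 + 2/3) / 2 ≈ 0.7333
```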
| 89 |
+
# ---------------------------------------------------------------------------
|
| 90 |
+
# Tokenization
|
| 91 |
+
# ---------------------------------------------------------------------------
|
| 92 |
+
def tokenize_dataset(dataset_dict: DatasetDict, tokenizer, label2id: dict, max_length: int = MAX_LENGTH):
|
| 93 |
+
"""Tokenize all splits and encode labels as integers."""
|
| 94 |
+
|
| 95 |
+
def preprocess(examples):
|
| 96 |
+
tokenized = tokenizer(
|
| 97 |
+
examples["text"],
|
| 98 |
+
truncation=True,
|
| 99 |
+
max_length=max_length,
|
| 100 |
+
padding=False, # Dynamic padding via DataCollator
|
| 101 |
+
)
|
| 102 |
+
tokenized["labels"] = [label2id[label] for label in examples["label"]]
|
| 103 |
+
return tokenized
|
| 104 |
+
|
| 105 |
+
tokenized = dataset_dict.map(
|
| 106 |
+
preprocess,
|
| 107 |
+
batched=True,
|
| 108 |
+
remove_columns=["text", "label"],
|
| 109 |
+
desc="Tokenizing",
|
| 110 |
+
)
|
| 111 |
+
|
| 112 |
+
return tokenized
|
| 113 |
+
|
| 114 |
+
|
| 115 |
+
# ---------------------------------------------------------------------------
|
| 116 |
+
# Main training function
|
| 117 |
+
# ---------------------------------------------------------------------------
|
| 118 |
+
def train(
|
| 119 |
+
output_dir: str = DEFAULT_OUTPUT_DIR,
|
| 120 |
+
model_name: str = MODEL_NAME,
|
| 121 |
+
epochs: int = 4,
|
| 122 |
+
batch_size: int = 16,
|
| 123 |
+
learning_rate: float = 2e-5,
|
| 124 |
+
weight_decay: float = 0.01,
|
| 125 |
+
warmup_ratio: float = 0.1,
|
| 126 |
+
max_length: int = MAX_LENGTH,
|
| 127 |
+
examples_per_category: int = 80,
|
| 128 |
+
augmented_copies: int = 2,
|
| 129 |
+
seed: int = 42,
|
| 130 |
+
push_to_hub: bool = False,
|
| 131 |
+
hub_model_id: str = HUB_MODEL_ID,
|
| 132 |
+
fp16: bool = None,
|
| 133 |
+
gradient_accumulation_steps: int = 1,
|
| 134 |
+
early_stopping_patience: int = 3,
|
| 135 |
+
):
|
| 136 |
+
"""
|
| 137 |
+
Full training pipeline.
|
| 138 |
+
|
| 139 |
+
Args:
|
| 140 |
+
output_dir: Directory to save model and artifacts.
|
| 141 |
+
model_name: Pretrained model identifier.
|
| 142 |
+
epochs: Number of training epochs.
|
| 143 |
+
batch_size: Training batch size.
|
| 144 |
+
learning_rate: Peak learning rate.
|
| 145 |
+
weight_decay: Weight decay for AdamW.
|
| 146 |
+
warmup_ratio: Fraction of steps for warmup.
|
| 147 |
+
max_length: Maximum token sequence length.
|
| 148 |
+
examples_per_category: Base synthetic examples per category.
|
| 149 |
+
augmented_copies: Augmented copies per base example.
|
| 150 |
+
seed: Random seed.
|
| 151 |
+
push_to_hub: Whether to push to HuggingFace Hub.
|
| 152 |
+
hub_model_id: Hub model repository ID.
|
| 153 |
+
fp16: Use mixed precision (auto-detected if None).
|
| 154 |
+
gradient_accumulation_steps: Gradient accumulation steps.
|
| 155 |
+
early_stopping_patience: Early stopping patience (epochs).
|
| 156 |
+
"""
|
| 157 |
+
output_path = Path(output_dir)
|
| 158 |
+
output_path.mkdir(parents=True, exist_ok=True)
|
| 159 |
+
|
| 160 |
+
# Auto-detect fp16
|
| 161 |
+
if fp16 is None:
|
| 162 |
+
fp16 = torch.cuda.is_available()
|
| 163 |
+
|
| 164 |
+
logger.info("=" * 60)
|
| 165 |
+
logger.info("Resume Section Classifier – Training")
|
| 166 |
+
logger.info("=" * 60)
|
| 167 |
+
logger.info(f"Model: {model_name}")
|
| 168 |
+
logger.info(f"Output: {output_dir}")
|
| 169 |
+
logger.info(f"Epochs: {epochs}, Batch size: {batch_size}, LR: {learning_rate}")
|
| 170 |
+
logger.info(f"Device: {'CUDA' if torch.cuda.is_available() else 'MPS' if torch.backends.mps.is_available() else 'CPU'}")
|
| 171 |
+
logger.info(f"FP16: {fp16}")
|
| 172 |
+
|
| 173 |
+
# ------------------------------------------------------------------
|
| 174 |
+
# 1. Generate synthetic data
|
| 175 |
+
# ------------------------------------------------------------------
|
| 176 |
+
logger.info("\n[1/5] Generating synthetic training data...")
|
| 177 |
+
raw_dataset = generate_dataset(
|
| 178 |
+
examples_per_category=examples_per_category,
|
| 179 |
+
augmented_copies=augmented_copies,
|
| 180 |
+
seed=seed,
|
| 181 |
+
)
|
| 182 |
+
label2id, id2label = get_label_mapping(raw_dataset)
|
| 183 |
+
num_labels = len(label2id)
|
| 184 |
+
|
| 185 |
+
logger.info(f" Total examples: {len(raw_dataset)}")
|
| 186 |
+
logger.info(f" Labels ({num_labels}): {list(label2id.keys())}")
|
| 187 |
+
|
| 188 |
+
# Create HF DatasetDict with train/val/test splits
|
| 189 |
+
dataset_dict = load_as_hf_dataset(raw_dataset)
|
| 190 |
+
logger.info(f" Train: {len(dataset_dict['train'])}")
|
| 191 |
+
logger.info(f" Validation: {len(dataset_dict['validation'])}")
|
| 192 |
+
logger.info(f" Test: {len(dataset_dict['test'])}")
|
| 193 |
+
|
| 194 |
+
    # ------------------------------------------------------------------
    # 2. Tokenize
    # ------------------------------------------------------------------
    logger.info("\n[2/5] Loading tokenizer and tokenizing data...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenized_dataset = tokenize_dataset(dataset_dict, tokenizer, label2id, max_length)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # ------------------------------------------------------------------
    # 3. Load model
    # ------------------------------------------------------------------
    logger.info("\n[3/5] Loading pretrained model...")
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        id2label=id2label,
        label2id=label2id,
    )
    logger.info(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")
    logger.info(f"  Trainable: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

    # ------------------------------------------------------------------
    # 4. Training
    # ------------------------------------------------------------------
    logger.info("\n[4/5] Training...")

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        # Training hyperparameters
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        lr_scheduler_type="cosine",
        # Evaluation
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1_macro",
        greater_is_better=True,
        # Logging
        logging_dir=DEFAULT_LOGGING_DIR,
        logging_strategy="steps",
        logging_steps=50,
        report_to="none",
        # Efficiency
        fp16=fp16,
        dataloader_num_workers=0,
        # Reproducibility
        seed=seed,
        data_seed=seed,
        # Hub
        push_to_hub=False,  # We'll push manually after evaluation
        # Misc
        save_total_limit=3,
        disable_tqdm=False,
    )

    callbacks = []
    if early_stopping_patience > 0:
        callbacks.append(EarlyStoppingCallback(early_stopping_patience=early_stopping_patience))

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=build_compute_metrics(id2label),
        callbacks=callbacks,
    )

    train_result = trainer.train()

    # Log training metrics
    logger.info("\nTraining Results:")
    for key, value in train_result.metrics.items():
        logger.info(f"  {key}: {value}")

    # ------------------------------------------------------------------
    # 5. Evaluation
    # ------------------------------------------------------------------
    logger.info("\n[5/5] Evaluating on test set...")
    test_results = trainer.evaluate(tokenized_dataset["test"])

    logger.info("\nTest Results:")
    for key, value in test_results.items():
        logger.info(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

    # ------------------------------------------------------------------
    # Save artifacts
    # ------------------------------------------------------------------
    logger.info("\nSaving model and artifacts...")

    # Save model + tokenizer
    final_path = output_path / "final_model"
    trainer.save_model(str(final_path))
    tokenizer.save_pretrained(str(final_path))

    # Save label mapping
    label_mapping = {
        "label2id": label2id,
        "id2label": {str(k): v for k, v in id2label.items()},
        "labels": list(label2id.keys()),
    }
    with open(final_path / "label_mapping.json", "w") as f:
        json.dump(label_mapping, f, indent=2)

    # Save training config
    train_config = {
        "model_name": model_name,
        "max_length": max_length,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "weight_decay": weight_decay,
        "warmup_ratio": warmup_ratio,
        "examples_per_category": examples_per_category,
        "augmented_copies": augmented_copies,
        "seed": seed,
        "num_labels": num_labels,
        "train_size": len(dataset_dict["train"]),
        "val_size": len(dataset_dict["validation"]),
        "test_size": len(dataset_dict["test"]),
    }
    with open(final_path / "training_config.json", "w") as f:
        json.dump(train_config, f, indent=2)

    # Save metrics
    all_metrics = {
        "train": train_result.metrics,
        "test": test_results,
    }
    with open(final_path / "metrics.json", "w") as f:
        json.dump(all_metrics, f, indent=2)

    logger.info(f"\nAll artifacts saved to: {final_path}")

    # ------------------------------------------------------------------
    # Optional: Push to Hub
    # ------------------------------------------------------------------
    if push_to_hub:
        logger.info(f"\nPushing to HuggingFace Hub: {hub_model_id}")
        try:
            # Trainer.push_to_hub takes the target repo from
            # TrainingArguments.hub_model_id (it has no repo_id parameter),
            # so set it before pushing.
            trainer.args.hub_model_id = hub_model_id
            trainer.push_to_hub(
                commit_message="Upload fine-tuned resume section classifier",
            )
            tokenizer.push_to_hub(hub_model_id)
            logger.info("Successfully pushed to Hub!")
        except Exception as e:
            logger.error(f"Failed to push to Hub: {e}")
            logger.info("You can push manually later with:")
            logger.info(f"  huggingface-cli upload {hub_model_id} {final_path}")

    logger.info("\nTraining complete!")
    return test_results


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Fine-tune DistilBERT for resume section classification",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Model & output
    parser.add_argument("--model-name", type=str, default=MODEL_NAME,
                        help="Pretrained model name or path")
    parser.add_argument("--output-dir", type=str, default=DEFAULT_OUTPUT_DIR,
                        help="Output directory for model and artifacts")

    # Training hyperparameters
    parser.add_argument("--epochs", type=int, default=4,
                        help="Number of training epochs")
    parser.add_argument("--batch-size", type=int, default=16,
                        help="Training batch size per device")
    parser.add_argument("--learning-rate", type=float, default=2e-5,
                        help="Peak learning rate")
    parser.add_argument("--weight-decay", type=float, default=0.01,
                        help="Weight decay for AdamW")
    parser.add_argument("--warmup-ratio", type=float, default=0.1,
                        help="Fraction of total steps for linear warmup")
    parser.add_argument("--max-length", type=int, default=MAX_LENGTH,
                        help="Maximum token sequence length")
    parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
                        help="Number of gradient accumulation steps")

    # Data
    parser.add_argument("--examples-per-category", type=int, default=80,
                        help="Base synthetic examples per category")
    parser.add_argument("--augmented-copies", type=int, default=2,
                        help="Augmented copies per base example")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")

    # Training config
    parser.add_argument("--fp16", action="store_true", default=None,
                        help="Force FP16 training")
    parser.add_argument("--no-fp16", action="store_true",
                        help="Disable FP16 training")
    parser.add_argument("--early-stopping-patience", type=int, default=3,
                        help="Early stopping patience (0 to disable)")

    # Hub
    parser.add_argument("--push-to-hub", action="store_true",
                        help="Push trained model to HuggingFace Hub")
    parser.add_argument("--hub-model-id", type=str, default=HUB_MODEL_ID,
                        help="HuggingFace Hub model ID")

    args = parser.parse_args()

    # Handle fp16 flags: --no-fp16 overrides --fp16
    fp16 = args.fp16
    if args.no_fp16:
        fp16 = False

    results = train(
        output_dir=args.output_dir,
        model_name=args.model_name,
        epochs=args.epochs,
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_ratio=args.warmup_ratio,
        max_length=args.max_length,
        examples_per_category=args.examples_per_category,
        augmented_copies=args.augmented_copies,
        seed=args.seed,
        push_to_hub=args.push_to_hub,
        hub_model_id=args.hub_model_id,
        fp16=fp16,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        early_stopping_patience=args.early_stopping_patience,
    )
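
# Example invocations (illustrative only; flags and defaults as defined above,
# and the --hub-model-id value is a placeholder):
#   python train.py
#   python train.py --epochs 6 --batch-size 32 --no-fp16
#   python train.py --push-to-hub --hub-model-id your-username/resume-section-classifier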