gr8monk3ys committed · verified
Commit 9b02fb4 · 1 Parent(s): f433257

Upload folder using huggingface_hub

Files changed (5):
  1. README.md +200 -0
  2. data_generator.py +934 -0
  3. inference.py +441 -0
  4. requirements.txt +8 -0
  5. train.py +437 -0
README.md ADDED
---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- resume-parsing
- nlp
- distilbert
- synthetic-data
metrics:
- accuracy
- f1
pipeline_tag: text-classification
datasets:
- custom-synthetic
language:
- en
---

# Resume Section Classifier

A fine-tuned [DistilBERT](https://huggingface.co/distilbert-base-uncased) model that classifies resume text sections into 8 categories. Designed for automated resume parsing pipelines where incoming text needs to be segmented and labeled by section type.

## Labels

| Label | Description | Example |
|-------|-------------|---------|
| `education` | Academic background, degrees, coursework, GPA | "B.S. in Computer Science, MIT, 2023, GPA: 3.8" |
| `experience` | Work history, job titles, responsibilities | "Software Engineer at Google, 2020-Present. Built microservices..." |
| `skills` | Technical and soft skills listings | "Python, Java, React, Docker, Kubernetes, AWS" |
| `projects` | Personal or academic project descriptions | "Built a real-time analytics dashboard using React and D3.js" |
| `summary` | Professional summary or objective statement | "Results-driven engineer with 5+ years of experience in..." |
| `certifications` | Professional certifications and licenses | "AWS Certified Solutions Architect - Associate (2023)" |
| `contact` | Name, email, phone, LinkedIn, location | "John Smith | john@email.com | (415) 555-1234 | SF, CA" |
| `awards` | Honors, achievements, recognition | "Dean's List, Phi Beta Kappa, Hackathon Winner (2022)" |

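For integration, the eight labels can be expressed as the `label2id`/`id2label` pair that Transformers configs expect. The index order below is illustrative only; the checkpoint's own `config.json` is authoritative:

```python
# Hypothetical label ordering -- check the model's config.json for the real one.
LABELS = [
    "education", "experience", "skills", "projects",
    "summary", "certifications", "contact", "awards",
]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```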
## Training Procedure

### Data

The model was trained on a **synthetic dataset** generated programmatically using `data_generator.py`. The generator uses:

- **Template-based generation** with randomized entities (names, companies, universities, skills, dates, cities, etc.) drawn from curated pools of 30-50+ items each
- **Structural variation** across multiple formatting styles per category (e.g., education entries can appear as single-line, multi-line, with/without GPA, with activities, etc.)
- **Data augmentation** via synonym replacement (30% probability per eligible word) and formatting variation (whitespace, bullet styles, separators)
- **Optional section headers** prepended with 40% probability to teach the model to handle both headed and headless sections

Default configuration produces **1,920 examples** (80 base examples × 3 variants × 8 categories), split 80/10/10 into train/validation/test sets with stratified sampling.

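The stratified 80/10/10 split can be sketched in plain Python; this is an illustrative re-implementation, not the exact logic in `train.py`:

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, seed=42):
    """Shuffle within each label, then carve off 80/10/10 per class (sketch)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)

    train, val, test = [], [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = int(len(items) * 0.8)
        n_val = int(len(items) * 0.1)
        train += [(ex, label) for ex in items[:n_train]]
        val += [(ex, label) for ex in items[n_train:n_train + n_val]]
        test += [(ex, label) for ex in items[n_train + n_val:]]
    return train, val, test
```

With the default 1,920 examples (240 per category), this yields 1,536/192/192 splits with every class equally represented in each set.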
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `distilbert-base-uncased` |
| Max sequence length | 256 tokens |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Early stopping | Patience 3 (on F1 macro) |

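For intuition, the warmup ratio translates into optimizer step counts as follows, assuming the default 1,536-example train split (a back-of-the-envelope sketch; the Trainer computes these internally):

```python
import math

def schedule_steps(n_train=1536, batch_size=16, epochs=4, warmup_ratio=0.1):
    """Total and warmup optimizer steps implied by the table above (sketch)."""
    steps_per_epoch = math.ceil(n_train / batch_size)   # 96
    total_steps = steps_per_epoch * epochs              # 384
    warmup_steps = int(total_steps * warmup_ratio)      # 38
    return total_steps, warmup_steps
```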
### Training Infrastructure

- Fine-tuning takes approximately 5-10 minutes on a single GPU or 15-30 minutes on CPU
- Model size: ~67M parameters (DistilBERT base)
- Mixed precision (FP16) enabled automatically when CUDA is available

## Metrics

Evaluated on the held-out synthetic test set (192 examples, stratified):

| Metric | Score |
|--------|-------|
| Accuracy | ~0.95+ |
| F1 (macro) | ~0.95+ |
| F1 (weighted) | ~0.95+ |
| Precision (weighted) | ~0.95+ |
| Recall (weighted) | ~0.95+ |

> Note: Exact metrics depend on the random seed and training run. Scores on real-world resumes may differ from synthetic test performance.

## Usage

### Quick Start with Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/resume-section-classifier",
)

result = classifier("Bachelor of Science in Computer Science, Stanford University, 2023. GPA: 3.9/4.0")
print(result)
# [{'label': 'education', 'score': 0.98}]
```

### Full Resume Classification

```python
from inference import ResumeSectionClassifier

classifier = ResumeSectionClassifier("gr8monk3ys/resume-section-classifier")

resume_text = """
John Smith
john.smith@email.com | (415) 555-1234 | San Francisco, CA
linkedin.com/in/johnsmith | github.com/johnsmith

SUMMARY
Experienced software engineer with 5+ years building scalable web applications
and distributed systems. Passionate about clean code and mentoring junior developers.

EXPERIENCE
Senior Software Engineer | Google | San Francisco, CA | Jan 2021 - Present
- Led migration of monolithic application to microservices architecture
- Mentored 4 junior engineers through code reviews and pair programming
- Reduced API latency by 40% through caching and query optimization

Software Engineer | Stripe | San Francisco, CA | Jun 2018 - Dec 2020
- Built payment processing APIs handling 10M+ transactions daily
- Implemented CI/CD pipeline reducing deployment time from hours to minutes

EDUCATION
B.S. in Computer Science, Stanford University (2018)
GPA: 3.8/4.0 | Dean's List

SKILLS
Languages: Python, Java, Go, TypeScript
Frameworks: React, Django, Spring Boot, gRPC
Tools: Docker, Kubernetes, AWS, PostgreSQL, Redis
"""

analysis = classifier.classify_resume(resume_text)
print(analysis.summary())
```

### Training from Scratch

```bash
# Install dependencies
pip install -r requirements.txt

# Generate data and train
python train.py --epochs 4 --batch-size 16

# Train and push to Hub
python train.py --push-to-hub --hub-model-id gr8monk3ys/resume-section-classifier

# Inference
python inference.py --file resume.txt
python inference.py --text "Python, Java, React, Docker" --single
```

### Generating Custom Training Data

```bash
# Generate data with custom settings
python data_generator.py \
    --examples-per-category 150 \
    --augmented-copies 3 \
    --output data/resume_sections.csv \
    --print-stats
```

## Project Structure

```
resume-section-classifier/
    data_generator.py    # Synthetic data generation with templates and augmentation
    train.py             # Full fine-tuning pipeline with HuggingFace Trainer
    inference.py         # Section splitting and classification API + CLI
    requirements.txt     # Python dependencies
    README.md            # This model card
```

## Limitations

- **Synthetic training data**: The model is trained entirely on programmatically generated text. Performance on real resumes with unusual formatting, non-English content, or domain-specific jargon may be lower than synthetic test metrics suggest.
- **English only**: All training data is in English. The model will not reliably classify sections in other languages.
- **Section granularity**: The model classifies pre-split sections. The built-in text splitter uses heuristics (section headers, paragraph breaks) that may not correctly segment all resume formats (e.g., single-column vs. multi-column PDFs, tables).
- **8 categories only**: Some resume sections (e.g., Publications, Volunteer Work, Languages, Interests, References) are not covered by the current label set. These will be assigned to the closest matching category.
- **Max length**: Input is truncated to 256 tokens. Very long sections may lose information.

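As an illustration of the header heuristic mentioned above, a minimal splitter might look like this (a hypothetical sketch, not the actual implementation in `inference.py`):

```python
import re

# Assumed header keywords; the real splitter may recognize more variants.
HEADER_RE = re.compile(
    r"^\s*(SUMMARY|EXPERIENCE|EDUCATION|SKILLS|PROJECTS|"
    r"CERTIFICATIONS|AWARDS|CONTACT)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text):
    """Split resume text on standalone section-header lines (sketch)."""
    matches = list(HEADER_RE.finditer(text))
    if not matches:
        return [("unknown", text.strip())]
    parts = []
    preamble = text[: matches[0].start()].strip()
    if preamble:  # text before the first header is usually contact info
        parts.append(("preamble", preamble))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts.append((m.group(1).lower(), text[m.end():end].strip()))
    return parts
```

Heuristics like this break down on multi-column layouts, which is exactly the section-granularity limitation noted above.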
## Intended Use

- Automated resume parsing pipelines
- HR tech and applicant tracking systems
- Resume formatting and analysis tools
- Educational demonstrations of text classification fine-tuning

This model is **not intended** for making hiring decisions. It is a text classification tool for structural parsing only.

## Author

[Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys) (gr8monk3ys)
data_generator.py ADDED
"""
Synthetic Resume Section Data Generator

Generates realistic resume section text across 8 categories for training
a text classifier. Uses template-based generation with randomized entities,
synonym replacement, and structural variation to produce diverse examples.

Author: Lorenzo Scaturchio (gr8monk3ys)
"""

import csv
import random
import itertools
from pathlib import Path
from typing import Optional

# ---------------------------------------------------------------------------
# Entity pools – used to fill templates with realistic variation
# ---------------------------------------------------------------------------

FIRST_NAMES = [
    "James", "Mary", "Robert", "Patricia", "John", "Jennifer", "Michael",
    "Linda", "David", "Elizabeth", "William", "Barbara", "Richard", "Susan",
    "Joseph", "Jessica", "Thomas", "Sarah", "Charles", "Karen", "Daniel",
    "Lisa", "Matthew", "Nancy", "Anthony", "Betty", "Mark", "Sandra",
    "Aisha", "Wei", "Carlos", "Priya", "Olga", "Hiroshi", "Fatima", "Liam",
    "Sofia", "Andrei", "Mei", "Alejandro", "Yuki", "Omar", "Elena", "Raj",
]

LAST_NAMES = [
    "Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller",
    "Davis", "Rodriguez", "Martinez", "Hernandez", "Lopez", "Gonzalez",
    "Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin",
    "Lee", "Perez", "Thompson", "White", "Harris", "Sanchez", "Clark",
    "Patel", "Chen", "Kim", "Nakamura", "Ivanov", "Silva", "Okafor",
]

COMPANIES = [
    "Google", "Microsoft", "Amazon", "Apple", "Meta", "Netflix", "Stripe",
    "Airbnb", "Uber", "Salesforce", "Adobe", "IBM", "Oracle", "Intel",
    "Tesla", "SpaceX", "Palantir", "Snowflake", "Databricks", "Confluent",
    "JPMorgan Chase", "Goldman Sachs", "Morgan Stanley", "Deloitte",
    "McKinsey & Company", "Boston Consulting Group", "Accenture",
    "Lockheed Martin", "Boeing", "Raytheon", "General Electric",
    "Procter & Gamble", "Johnson & Johnson", "Pfizer", "Moderna",
    "Shopify", "Square", "Twilio", "Cloudflare", "HashiCorp",
    "DataRobot", "Hugging Face", "OpenAI", "Anthropic", "Cohere",
    "Startup XYZ", "TechCorp Inc.", "InnovateTech", "DataDriven LLC",
]

UNIVERSITIES = [
    "Massachusetts Institute of Technology", "Stanford University",
    "Harvard University", "University of California, Berkeley",
    "Carnegie Mellon University", "Georgia Institute of Technology",
    "University of Michigan", "University of Illinois Urbana-Champaign",
    "California Institute of Technology", "Princeton University",
    "Columbia University", "University of Washington",
    "University of Texas at Austin", "Cornell University",
    "University of Pennsylvania", "University of Southern California",
    "New York University", "University of Wisconsin-Madison",
    "Duke University", "Northwestern University",
    "University of California, Los Angeles", "Rice University",
    "University of Maryland", "Purdue University",
    "Ohio State University", "Arizona State University",
    "University of Virginia", "University of Florida",
    "Boston University", "Northeastern University",
]

DEGREES = [
    ("Bachelor of Science", "B.S."),
    ("Bachelor of Arts", "B.A."),
    ("Master of Science", "M.S."),
    ("Master of Arts", "M.A."),
    ("Master of Business Administration", "MBA"),
    ("Doctor of Philosophy", "Ph.D."),
    ("Associate of Science", "A.S."),
    ("Bachelor of Engineering", "B.Eng."),
    ("Master of Engineering", "M.Eng."),
]

MAJORS = [
    "Computer Science", "Software Engineering", "Data Science",
    "Electrical Engineering", "Mechanical Engineering",
    "Information Technology", "Mathematics", "Statistics",
    "Business Administration", "Economics", "Finance",
    "Biomedical Engineering", "Chemical Engineering",
    "Civil Engineering", "Physics", "Biology",
    "Artificial Intelligence", "Machine Learning",
    "Human-Computer Interaction", "Cybersecurity",
    "Information Systems", "Operations Research",
]

MINORS = [
    "Mathematics", "Statistics", "Psychology", "Business",
    "Economics", "Philosophy", "Linguistics", "Physics",
    "Data Science", "Communication", "Sociology", "History",
]

GPA_VALUES = [
    "3.5", "3.6", "3.7", "3.8", "3.9", "4.0",
    "3.52", "3.65", "3.78", "3.85", "3.92", "3.45",
]

GRAD_YEARS = list(range(2015, 2027))

JOB_TITLES = [
    "Software Engineer", "Senior Software Engineer", "Staff Engineer",
    "Principal Engineer", "Engineering Manager", "Tech Lead",
    "Data Scientist", "Senior Data Scientist", "Machine Learning Engineer",
    "ML Research Scientist", "Data Engineer", "Data Analyst",
    "Product Manager", "Senior Product Manager", "Program Manager",
    "DevOps Engineer", "Site Reliability Engineer", "Cloud Architect",
    "Full Stack Developer", "Frontend Engineer", "Backend Engineer",
    "Mobile Developer", "iOS Engineer", "Android Developer",
    "QA Engineer", "Security Engineer", "Solutions Architect",
    "Research Scientist", "AI Engineer", "NLP Engineer",
    "Quantitative Analyst", "Financial Analyst", "Business Analyst",
    "UX Designer", "UI Engineer", "Technical Writer",
    "Intern", "Software Engineering Intern", "Data Science Intern",
]

PROGRAMMING_LANGUAGES = [
    "Python", "Java", "JavaScript", "TypeScript", "C++", "C", "C#",
    "Go", "Rust", "Kotlin", "Swift", "Ruby", "PHP", "Scala",
    "R", "MATLAB", "Julia", "Haskell", "Elixir", "Dart",
]

FRAMEWORKS = [
    "React", "Angular", "Vue.js", "Next.js", "Django", "Flask",
    "FastAPI", "Spring Boot", "Express.js", "Node.js", "Rails",
    "TensorFlow", "PyTorch", "Keras", "scikit-learn", "Pandas",
    "NumPy", "Spark", "Hadoop", "Kubernetes", "Docker",
    "AWS", "GCP", "Azure", "Terraform", "Ansible",
    ".NET", "Laravel", "Svelte", "Remix", "Astro",
]

TOOLS = [
    "Git", "GitHub", "GitLab", "Jira", "Confluence", "Slack",
    "VS Code", "IntelliJ", "PyCharm", "Vim", "Emacs",
    "PostgreSQL", "MySQL", "MongoDB", "Redis", "Elasticsearch",
    "Kafka", "RabbitMQ", "Airflow", "dbt", "Snowflake",
    "Tableau", "Power BI", "Grafana", "Prometheus", "Datadog",
    "Jenkins", "CircleCI", "GitHub Actions", "ArgoCD",
    "Figma", "Sketch", "Adobe XD", "Postman", "Swagger",
]

SOFT_SKILLS = [
    "Leadership", "Communication", "Team Collaboration",
    "Problem Solving", "Critical Thinking", "Time Management",
    "Project Management", "Agile Methodologies", "Scrum",
    "Cross-functional Collaboration", "Mentoring",
    "Strategic Planning", "Stakeholder Management",
    "Technical Writing", "Public Speaking", "Negotiation",
]

CERTIFICATIONS_LIST = [
    "AWS Certified Solutions Architect - Associate",
    "AWS Certified Developer - Associate",
    "AWS Certified Machine Learning - Specialty",
    "Google Cloud Professional Data Engineer",
    "Google Cloud Professional ML Engineer",
    "Microsoft Azure Fundamentals (AZ-900)",
    "Microsoft Azure Data Scientist Associate (DP-100)",
    "Certified Kubernetes Administrator (CKA)",
    "Certified Kubernetes Application Developer (CKAD)",
    "Certified Information Systems Security Professional (CISSP)",
    "CompTIA Security+",
    "Project Management Professional (PMP)",
    "Certified ScrumMaster (CSM)",
    "TensorFlow Developer Certificate",
    "Databricks Certified Data Engineer Associate",
    "Snowflake SnowPro Core Certification",
    "HashiCorp Terraform Associate",
    "Cisco Certified Network Associate (CCNA)",
    "Oracle Certified Professional, Java SE",
    "Red Hat Certified System Administrator (RHCSA)",
    "Deep Learning Specialization (Coursera)",
    "Machine Learning by Stanford (Coursera)",
    "Professional Scrum Master I (PSM I)",
]

AWARDS_LIST = [
    "Dean's List", "Summa Cum Laude", "Magna Cum Laude", "Cum Laude",
    "Phi Beta Kappa", "Tau Beta Pi", "National Merit Scholar",
    "Employee of the Quarter", "Spot Bonus Award", "President's Club",
    "Best Paper Award", "Innovation Award", "Hackathon Winner",
    "Outstanding Graduate Student Award", "Research Fellowship",
    "Teaching Assistant Excellence Award", "Community Service Award",
    "IEEE Best Student Paper", "ACM ICPC Regional Finalist",
    "Google Code Jam Qualifier", "Facebook Hacker Cup Participant",
    "Patent Holder", "Top Performer Award", "Rising Star Award",
]

CITIES = [
    "San Francisco, CA", "New York, NY", "Seattle, WA", "Austin, TX",
    "Boston, MA", "Chicago, IL", "Los Angeles, CA", "Denver, CO",
    "Portland, OR", "Atlanta, GA", "Washington, DC", "San Jose, CA",
    "Raleigh, NC", "Pittsburgh, PA", "Minneapolis, MN", "Dallas, TX",
    "Miami, FL", "Phoenix, AZ", "San Diego, CA", "Philadelphia, PA",
]

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

MONTHS_SHORT = [
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
]

PROJECT_ADJECTIVES = [
    "Real-time", "Scalable", "Distributed", "Cloud-native",
    "AI-powered", "Automated", "Interactive", "Cross-platform",
    "Open-source", "End-to-end", "High-performance", "Serverless",
    "Event-driven", "Microservice-based", "Full-stack",
]

PROJECT_NOUNS = [
    "Dashboard", "Platform", "Pipeline", "Application", "System",
    "API", "Framework", "Tool", "Service", "Engine",
    "Chatbot", "Recommendation System", "Search Engine",
    "Analytics Platform", "Monitoring System", "Marketplace",
]

IMPACT_METRICS = [
    "reduced latency by {pct}%",
    "improved throughput by {pct}%",
    "increased user engagement by {pct}%",
    "decreased error rate by {pct}%",
    "saved ${amount}K annually",
    "reduced costs by {pct}%",
    "improved accuracy by {pct}%",
    "increased conversion rate by {pct}%",
    "served {users} daily active users",
    "processed {events} events per second",
    "reduced deployment time from hours to minutes",
    "cut onboarding time by {pct}%",
    "automated {pct}% of manual processes",
    "improved model F1 score from 0.{f1_old} to 0.{f1_new}",
]

PHONE_AREA_CODES = [
    "415", "650", "408", "510", "212", "646", "718", "206",
    "512", "617", "312", "213", "303", "503", "404", "202",
]

LINKEDIN_PREFIXES = [
    "linkedin.com/in/", "www.linkedin.com/in/",
]

GITHUB_PREFIXES = [
    "github.com/", "www.github.com/",
]

DOMAINS = [
    "gmail.com", "outlook.com", "yahoo.com", "protonmail.com",
    "icloud.com", "hotmail.com", "mail.com",
]

# ---------------------------------------------------------------------------
# Synonym replacement pools for augmentation
# ---------------------------------------------------------------------------

SYNONYMS = {
    "developed": ["built", "created", "engineered", "designed", "implemented", "constructed", "authored"],
    "managed": ["led", "oversaw", "directed", "supervised", "coordinated", "administered"],
    "improved": ["enhanced", "optimized", "upgraded", "refined", "boosted", "strengthened"],
    "implemented": ["deployed", "executed", "delivered", "rolled out", "launched", "shipped"],
    "analyzed": ["examined", "evaluated", "assessed", "investigated", "studied", "reviewed"],
    "collaborated": ["partnered", "worked closely with", "teamed up with", "cooperated with"],
    "responsible for": ["in charge of", "accountable for", "tasked with", "owned"],
    "utilized": ["leveraged", "employed", "used", "applied", "harnessed"],
    "achieved": ["accomplished", "attained", "reached", "secured", "delivered"],
    "experience": ["expertise", "background", "proficiency", "track record"],
}


# ---------------------------------------------------------------------------
# Helper utilities
# ---------------------------------------------------------------------------

def _pick(pool, k=1):
    """Return k unique random items from a pool."""
    k = min(k, len(pool))
    return random.sample(pool, k)


def _pick_one(pool):
    return random.choice(pool)


def _date_range(allow_present: bool = True):
    """Return a random date range string."""
    start_year = random.randint(2014, 2024)
    start_month = _pick_one(MONTHS_SHORT)
    fmt = random.choice(["short", "long", "year_only"])

    if allow_present and random.random() < 0.3:
        end_str = random.choice(["Present", "Current", "Now"])
    else:
        end_year = random.randint(start_year, min(start_year + 6, 2026))
        end_month = _pick_one(MONTHS_SHORT)
        if fmt == "short":
            end_str = f"{end_month} {end_year}"
        elif fmt == "long":
            end_str = f"{_pick_one(MONTHS)} {end_year}"
        else:
            end_str = str(end_year)

    if fmt == "short":
        start_str = f"{start_month} {start_year}"
    elif fmt == "long":
        start_str = f"{_pick_one(MONTHS)} {start_year}"
    else:
        start_str = str(start_year)

    sep = random.choice([" - ", " – ", " to ", "–", "-"])
    return f"{start_str}{sep}{end_str}"


def _impact():
    """Generate a random impact metric string."""
    template = _pick_one(IMPACT_METRICS)
    return template.format(
        pct=random.randint(10, 85),
        amount=random.randint(50, 500),
        users=random.choice(["10K", "50K", "100K", "500K", "1M", "5M"]),
        events=random.choice(["1K", "10K", "50K", "100K", "1M"]),
        f1_old=random.randint(65, 80),
        f1_new=random.randint(82, 97),
    )


def _synonym_replace(text: str) -> str:
    """Randomly replace words with synonyms for augmentation."""
    words = text.split()
    result = []
    for w in words:
        lower = w.lower().rstrip(".,;:")
        if lower in SYNONYMS and random.random() < 0.3:
            replacement = _pick_one(SYNONYMS[lower])
            # Preserve original capitalization of first char
            if w[0].isupper():
                replacement = replacement.capitalize()
            # Preserve trailing punctuation
            trailing = w[len(lower):]
            result.append(replacement + trailing)
        else:
            result.append(w)
    return " ".join(result)


def _bullet():
    """Return a random bullet character."""
    return random.choice(["•", "-", "●", "*", "▪", ""])


def _reorder_bullets(bullets: list) -> list:
    """Shuffle bullet points for variation."""
    shuffled = bullets.copy()
    random.shuffle(shuffled)
    return shuffled

# ---------------------------------------------------------------------------
# Section generators – each returns a string of realistic text
# ---------------------------------------------------------------------------

def generate_education() -> str:
    """Generate a realistic education section."""

    # Template 1: Full formal entry
    def _t1():
        uni = _pick_one(UNIVERSITIES)
        deg_full, deg_short = _pick_one(DEGREES)
        major = _pick_one(MAJORS)
        year = _pick_one(GRAD_YEARS)
        lines = []

        header_style = random.choice(["full", "short", "inline"])
        if header_style == "full":
            lines.append(f"{deg_full} in {major}")
            lines.append(f"{uni}")
            lines.append(f"Graduated: {_pick_one(MONTHS)} {year}")
        elif header_style == "short":
            lines.append(f"{deg_short} {major}, {uni} ({year})")
        else:
            lines.append(f"{uni} — {deg_full} in {major}, {year}")

        # Optional GPA
        if random.random() < 0.6:
            gpa = _pick_one(GPA_VALUES)
            lines.append(f"GPA: {gpa}/4.0")

        # Optional minor
        if random.random() < 0.3:
            minor = _pick_one(MINORS)
            lines.append(f"Minor in {minor}")

        # Optional coursework
        if random.random() < 0.5:
            courses = _pick(MAJORS + ["Algorithms", "Data Structures",
                                      "Operating Systems", "Database Systems",
                                      "Computer Networks", "Linear Algebra",
                                      "Probability and Statistics",
                                      "Deep Learning", "Natural Language Processing",
                                      "Computer Vision", "Distributed Systems"],
                            k=random.randint(3, 6))
            prefix = random.choice(["Relevant Coursework:", "Key Courses:", "Coursework:"])
            lines.append(f"{prefix} {', '.join(courses)}")

        # Optional honors
        if random.random() < 0.3:
            honor = random.choice(["Summa Cum Laude", "Magna Cum Laude",
                                   "Cum Laude", "Dean's List (all semesters)",
                                   "Honors Program", "University Scholar"])
            lines.append(honor)

        # Optional thesis
        if "Ph.D." in deg_short or ("M.S." in deg_short and random.random() < 0.4):
            topic = random.choice([
                "Transformer-based approaches to document classification",
                "Scalable distributed systems for real-time data processing",
                "Graph neural networks for molecular property prediction",
                "Federated learning in healthcare applications",
                "Efficient attention mechanisms for long-sequence modeling",
                "Reinforcement learning for autonomous navigation",
            ])
            label = "Dissertation" if "Ph.D." in deg_short else "Thesis"
            lines.append(f"{label}: \"{topic}\"")

        return "\n".join(lines)

    # Template 2: Multiple degrees
    def _t2():
        entries = []
        for _ in range(random.randint(2, 3)):
            uni = _pick_one(UNIVERSITIES)
            deg_full, deg_short = _pick_one(DEGREES)
            major = _pick_one(MAJORS)
            year = _pick_one(GRAD_YEARS)
            gpa_line = f" | GPA: {_pick_one(GPA_VALUES)}" if random.random() < 0.5 else ""
            entries.append(f"{deg_short} in {major}, {uni}, {year}{gpa_line}")
        return "\n".join(entries)

    # Template 3: Education with activities
    def _t3():
        uni = _pick_one(UNIVERSITIES)
        deg_full, deg_short = _pick_one(DEGREES)
        major = _pick_one(MAJORS)
        year = _pick_one(GRAD_YEARS)
        lines = [f"{uni}", f"{deg_full} in {major} | {_pick_one(MONTHS)} {year}"]

        activities = random.sample([
            "Teaching Assistant for Introduction to Computer Science",
            "President, Computer Science Student Association",
            "Member, ACM Student Chapter",
            "Undergraduate Research Assistant, ML Lab",
            "Peer Tutor, Mathematics Department",
            "Captain, University Programming Competition Team",
            "Volunteer, Engineering Outreach Program",
            "Member, Honors College",
            "Study Abroad Program, Technical University of Munich",
            "Resident Advisor, Engineering Living-Learning Community",
        ], k=random.randint(1, 3))

        b = _bullet()
        for a in activities:
            lines.append(f"{b} {a}" if b else a)

        return "\n".join(lines)

    templates = [_t1, _t2, _t3]
    return random.choice(templates)()


def generate_experience() -> str:
    """Generate a realistic work experience section."""

    def _single_role():
        title = _pick_one(JOB_TITLES)
        company = _pick_one(COMPANIES)
        city = _pick_one(CITIES)
        date_range = _date_range()

        header_styles = [
            f"{title} | {company} | {city} | {date_range}",
            f"{title}, {company}\n{city} | {date_range}",
            f"{company} — {title}\n{date_range} | {city}",
            f"{title}\n{company}, {city}\n{date_range}",
        ]
        lines = [random.choice(header_styles)]

        # Generate bullet points
        bullet_templates = [
            f"Developed and maintained {random.choice(['microservices', 'APIs', 'web applications', 'data pipelines', 'ML models', 'backend systems', 'frontend components'])} using {', '.join(_pick(PROGRAMMING_LANGUAGES, k=random.randint(1, 3)))} and {', '.join(_pick(FRAMEWORKS, k=random.randint(1, 2)))}",
            f"Collaborated with cross-functional teams of {random.randint(3, 15)} engineers to deliver {random.choice(['product features', 'platform improvements', 'system migrations', 'infrastructure upgrades'])} on schedule",
            f"Designed and implemented {random.choice(['CI/CD pipelines', 'testing frameworks', 'monitoring solutions', 'data models', 'caching strategies', 'authentication systems'])} that {_impact()}",
            f"Led migration of {random.choice(['legacy monolith', 'on-premise infrastructure', 'batch processing system', 'manual workflows'])} to {random.choice(['cloud-native architecture', 'microservices', 'real-time streaming', 'automated pipelines'])}",
            f"Mentored {random.randint(2, 8)} junior engineers through code reviews, pair programming, and technical design sessions",
            f"Optimized {random.choice(['database queries', 'API response times', 'model inference', 'data processing pipelines', 'search indexing'])} resulting in {_impact()}",
            f"Wrote comprehensive technical documentation and {random.choice(['RFCs', 'design docs', 'runbooks', 'architecture decision records'])} for {random.choice(['system design', 'API contracts', 'deployment procedures', 'incident response'])}",
            f"Built {random.choice(['real-time', 'batch', 'streaming', 'event-driven'])} {random.choice(['data pipeline', 'ETL process', 'analytics system', 'feature store'])} processing {random.choice(['1M+', '10M+', '100M+', '1B+'])} records {random.choice(['daily', 'per hour', 'in real-time'])}",
            f"Spearheaded adoption of {_pick_one(FRAMEWORKS)} and {_pick_one(TOOLS)}, {_impact()}",
            f"Conducted A/B testing and experimentation for {random.choice(['recommendation engine', 'search ranking', 'pricing model', 'onboarding flow', 'notification system'])}, {_impact()}",
            f"Architected {random.choice(['distributed', 'fault-tolerant', 'highly available', 'horizontally scalable'])} system handling {random.choice(['10K', '50K', '100K', '1M'])} requests per second with {random.choice(['99.9%', '99.95%', '99.99%'])} uptime",
        ]

        n_bullets = random.randint(2, 5)
        selected = random.sample(bullet_templates, min(n_bullets, len(bullet_templates)))
        selected = _reorder_bullets(selected)
        b = _bullet()
        for bullet in selected:
            lines.append(f"{b} {bullet}" if b else bullet)

        return "\n".join(lines)

    # Sometimes include multiple roles
    n_roles = random.choices([1, 2], weights=[0.7, 0.3])[0]
    roles = [_single_role() for _ in range(n_roles)]
    return "\n\n".join(roles)


def generate_skills() -> str:
    """Generate a realistic skills section."""
527
+ templates = []
528
+
529
+ def _t_categorized():
530
+ lines = []
531
+ categories = []
532
+
533
+ if random.random() < 0.9:
534
+ langs = _pick(PROGRAMMING_LANGUAGES, k=random.randint(3, 7))
535
+ label = random.choice(["Languages", "Programming Languages", "Programming"])
536
+ categories.append((label, langs))
537
+
538
+ if random.random() < 0.9:
539
+ fws = _pick(FRAMEWORKS, k=random.randint(3, 7))
540
+ label = random.choice(["Frameworks", "Frameworks & Libraries", "Technologies"])
541
+ categories.append((label, fws))
542
+
543
+ if random.random() < 0.8:
544
+ tls = _pick(TOOLS, k=random.randint(3, 7))
545
+ label = random.choice(["Tools", "Developer Tools", "Tools & Platforms"])
546
+ categories.append((label, tls))
547
+
548
+ if random.random() < 0.4:
549
+ ss = _pick(SOFT_SKILLS, k=random.randint(2, 5))
550
+ label = random.choice(["Soft Skills", "Other Skills", "Additional Skills"])
551
+ categories.append((label, ss))
552
+
553
+ sep = random.choice([": ", " - ", " — "])
554
+ for label, items in categories:
555
+ joiner = random.choice([", ", " | ", " · ", " / "])
556
+ lines.append(f"{label}{sep}{joiner.join(items)}")
557
+
558
+ return "\n".join(lines)
559
+
560
+ def _t_flat():
561
+ all_skills = (_pick(PROGRAMMING_LANGUAGES, k=random.randint(3, 6)) +
562
+ _pick(FRAMEWORKS, k=random.randint(3, 6)) +
563
+ _pick(TOOLS, k=random.randint(2, 4)))
564
+ random.shuffle(all_skills)
565
+ joiner = random.choice([", ", " | ", " · ", " • "])
566
+ return joiner.join(all_skills)
567
+
568
+ def _t_proficiency():
569
+ lines = []
570
+ levels = ["Expert", "Advanced", "Proficient", "Intermediate", "Familiar"]
571
+ used = set()
572
+ for level in random.sample(levels, k=random.randint(2, 4)):
573
+ pool = [s for s in PROGRAMMING_LANGUAGES + FRAMEWORKS + TOOLS if s not in used]
574
+ items = _pick(pool, k=random.randint(2, 5))
575
+ used.update(items)
576
+ lines.append(f"{level}: {', '.join(items)}")
577
+ return "\n".join(lines)
578
+
579
+ templates = [_t_categorized, _t_flat, _t_proficiency]
580
+ return random.choice(templates)()
581
+
582
+
583
+ def generate_projects() -> str:
584
+ """Generate a realistic projects section."""
585
+
586
+ def _single_project():
587
+ adj = _pick_one(PROJECT_ADJECTIVES)
588
+ noun = _pick_one(PROJECT_NOUNS)
589
+ name = f"{adj} {noun}"
590
+ techs = _pick(PROGRAMMING_LANGUAGES + FRAMEWORKS, k=random.randint(2, 5))
591
+
592
+ header_styles = [
593
+ f"{name} | {', '.join(techs)}",
594
+ f"{name}\nTechnologies: {', '.join(techs)}",
595
+ f"{name} ({', '.join(techs)})",
596
+ ]
597
+ lines = [random.choice(header_styles)]
598
+
599
+ # Optional link
600
+ if random.random() < 0.3:
601
+ username = _pick_one(FIRST_NAMES).lower() + _pick_one(LAST_NAMES).lower()
602
+ lines.append(f"github.com/{username}/{name.lower().replace(' ', '-')}")
603
+
604
+ descriptions = [
605
+ f"Built a {noun.lower()} that {random.choice(['processes', 'analyzes', 'visualizes', 'aggregates', 'transforms'])} {random.choice(['user data', 'financial data', 'text documents', 'sensor data', 'social media feeds', 'medical records'])} in real-time",
606
+ f"Implemented {random.choice(['REST API', 'GraphQL API', 'gRPC service', 'WebSocket server', 'event-driven architecture'])} with {random.choice(['authentication', 'rate limiting', 'caching', 'pagination', 'logging'])} support",
607
+ f"Trained {random.choice(['classification', 'regression', 'NLP', 'computer vision', 'recommendation'])} model achieving {random.choice(['92%', '95%', '97%', '89%', '94%'])} {random.choice(['accuracy', 'F1 score', 'AUC-ROC'])} on test set",
608
+ f"Deployed to {random.choice(['AWS', 'GCP', 'Azure', 'Heroku', 'Vercel', 'Railway'])} with {random.choice(['Docker', 'Kubernetes', 'serverless', 'auto-scaling'])} configuration",
609
+ f"Attracted {random.choice(['100+', '500+', '1K+', '5K+'])} GitHub stars and {random.choice(['20+', '50+', '100+'])} contributors from the open-source community",
610
+ f"Features {random.choice(['real-time notifications', 'responsive UI', 'role-based access control', 'data export', 'interactive visualizations', 'natural language search'])}",
611
+ ]
612
+
613
+ b = _bullet()
614
+ for desc in random.sample(descriptions, k=random.randint(2, 4)):
615
+ lines.append(f"{b} {desc}" if b else desc)
616
+
617
+ return "\n".join(lines)
618
+
619
+ n_projects = random.randint(1, 3)
620
+ return "\n\n".join([_single_project() for _ in range(n_projects)])
621
+
622
+
623
+ def generate_summary() -> str:
624
+ """Generate a realistic professional summary / objective section."""
625
+ years = random.randint(2, 15)
626
+ specialties = _pick(MAJORS + [
627
+ "full-stack development", "distributed systems", "machine learning",
628
+ "data engineering", "cloud architecture", "mobile development",
629
+ "DevOps", "backend development", "frontend development",
630
+ "natural language processing", "computer vision",
631
+ ], k=random.randint(1, 3))
632
+
633
+ templates = [
634
+ # Template 1: Traditional summary
635
+ lambda: f"Results-driven {_pick_one(JOB_TITLES).lower()} with {years}+ years of experience in {' and '.join(specialties)}. Proven track record of {random.choice(['delivering high-impact solutions', 'building scalable systems', 'driving technical excellence', 'leading cross-functional teams'])} at companies like {_pick_one(COMPANIES)} and {_pick_one(COMPANIES)}. Passionate about {random.choice(['clean code', 'system design', 'open source', 'mentorship', 'continuous learning', 'innovation'])} and {random.choice(['building products that scale', 'solving complex problems', 'leveraging data-driven insights', 'improving developer experience'])}.",
636
+
637
+ # Template 2: Technical focus
638
+ lambda: f"Experienced {_pick_one(JOB_TITLES).lower()} specializing in {', '.join(specialties)}. Skilled in {', '.join(_pick(PROGRAMMING_LANGUAGES, k=3))} with deep expertise in {', '.join(_pick(FRAMEWORKS, k=2))}. {random.choice(['Strong background in', 'Demonstrated ability in', 'Track record of'])} {random.choice(['building distributed systems at scale', 'developing ML models for production', 'architecting cloud-native applications', 'leading agile engineering teams'])}. Seeking to {random.choice(['contribute to cutting-edge products', 'drive technical innovation', 'solve challenging problems', 'build impactful technology'])} at a {random.choice(['fast-growing startup', 'leading technology company', 'mission-driven organization'])}.",
639
+
640
+ # Template 3: Achievement-oriented
641
+ lambda: f"{_pick_one(JOB_TITLES)} with {years} years of experience building {random.choice(['enterprise-scale', 'consumer-facing', 'B2B', 'data-intensive'])} applications. Key achievements include {_impact()}, {_impact()}, and {_impact()}. Proficient in {', '.join(_pick(PROGRAMMING_LANGUAGES, k=3))} and {', '.join(_pick(FRAMEWORKS, k=2))}.",
642
+
643
+ # Template 4: Brief objective
644
+ lambda: f"Motivated {random.choice(['professional', 'engineer', 'developer', 'technologist'])} seeking a {_pick_one(JOB_TITLES).lower()} role where I can apply my expertise in {' and '.join(specialties)} to {random.choice(['build innovative products', 'solve real-world problems', 'drive business impact', 'push the boundaries of technology'])}.",
645
+
646
+ # Template 5: Narrative style
647
+ lambda: f"I am a {_pick_one(JOB_TITLES).lower()} who thrives at the intersection of {_pick_one(specialties)} and {_pick_one(specialties)}. Over the past {years} years, I have {random.choice(['shipped products used by millions', 'built ML systems processing petabytes of data', 'led engineering teams through rapid growth', 'contributed to open-source projects with thousands of stars'])}. I bring a {random.choice(['data-driven', 'user-centric', 'systems-thinking', 'first-principles'])} approach to every problem I tackle.",
648
+ ]
649
+
650
+ return random.choice(templates)()
651
+
652
+
653
+ def generate_certifications() -> str:
654
+ """Generate a realistic certifications section."""
655
+ n = random.randint(2, 6)
656
+ certs = _pick(CERTIFICATIONS_LIST, k=n)
657
+
658
+ lines = []
659
+ for cert in certs:
660
+ year = random.randint(2019, 2025)
661
+ styles = [
662
+ f"{cert} ({year})",
663
+ f"{cert} — Issued {_pick_one(MONTHS)} {year}",
664
+ f"{cert}, {year}",
665
+ f"{cert}\n Issued: {_pick_one(MONTHS_SHORT)} {year}" + (
666
+ f" | Expires: {_pick_one(MONTHS_SHORT)} {year + random.randint(2, 3)}"
667
+ if random.random() < 0.3 else ""
668
+ ),
669
+ ]
670
+ lines.append(random.choice(styles))
671
+
672
+ b = _bullet()
673
+ if b and random.random() < 0.5:
674
+ return "\n".join(f"{b} {line}" for line in lines)
675
+ return "\n".join(lines)
676
+
677
+
678
+ def generate_contact() -> str:
679
+ """Generate a realistic contact information section."""
680
+ first = _pick_one(FIRST_NAMES)
681
+ last = _pick_one(LAST_NAMES)
682
+ city = _pick_one(CITIES)
683
+ area_code = _pick_one(PHONE_AREA_CODES)
684
+ email_user = random.choice([
685
+ f"{first.lower()}.{last.lower()}",
686
+ f"{first.lower()}{last.lower()}",
687
+ f"{first[0].lower()}{last.lower()}",
688
+ f"{first.lower()}_{last.lower()}",
689
+ f"{first.lower()}{random.randint(1, 99)}",
690
+ ])
691
+ email = f"{email_user}@{_pick_one(DOMAINS)}"
692
+ phone = f"({area_code}) {random.randint(100,999)}-{random.randint(1000,9999)}"
693
+ linkedin_user = f"{first.lower()}-{last.lower()}-{random.randint(100, 999)}"
694
+ github_user = f"{first.lower()}{last.lower()}"
695
+
696
+ parts = [f"{first} {last}"]
697
+
698
+ if random.random() < 0.8:
699
+ parts.append(email)
700
+ if random.random() < 0.7:
701
+ parts.append(phone)
702
+ if random.random() < 0.6:
703
+ parts.append(city)
704
+ if random.random() < 0.5:
705
+ parts.append(f"{_pick_one(LINKEDIN_PREFIXES)}{linkedin_user}")
706
+ if random.random() < 0.4:
707
+ parts.append(f"{_pick_one(GITHUB_PREFIXES)}{github_user}")
708
+ if random.random() < 0.2:
709
+ parts.append(f"{github_user}.dev" if random.random() < 0.5 else f"{first.lower()}{last.lower()}.com")
710
+
711
+ sep = random.choice(["\n", " | ", " · ", "\n"])
712
+ return sep.join(parts)
713
+
714
+
715
+ def generate_awards() -> str:
716
+ """Generate a realistic awards & honors section."""
717
+ n = random.randint(2, 6)
718
+ awards = _pick(AWARDS_LIST, k=n)
719
+ lines = []
720
+
721
+ for award in awards:
722
+ year = random.randint(2015, 2025)
723
+ org = random.choice([
724
+ _pick_one(UNIVERSITIES),
725
+ _pick_one(COMPANIES),
726
+ random.choice(["ACM", "IEEE", "Google", "Facebook", "Microsoft",
727
+ "National Science Foundation", "Department of Education"]),
728
+ ])
729
+ styles = [
730
+ f"{award}, {org} ({year})",
731
+ f"{award} — {org}, {year}",
732
+ f"{award} ({year})\n Awarded by {org}",
733
+ f"{award}, {year}",
734
+ ]
735
+ lines.append(random.choice(styles))
736
+
737
+ b = _bullet()
738
+ if b and random.random() < 0.6:
739
+ return "\n".join(f"{b} {line}" for line in lines)
740
+ return "\n".join(lines)
741
+
742
+
743
+ # ---------------------------------------------------------------------------
744
+ # Optional section headers – sometimes sections include a heading
745
+ # ---------------------------------------------------------------------------
746
+
747
+ SECTION_HEADERS = {
748
+ "education": ["EDUCATION", "Education", "Academic Background", "ACADEMIC BACKGROUND", "Education & Training"],
749
+ "experience": ["EXPERIENCE", "Experience", "WORK EXPERIENCE", "Work Experience", "PROFESSIONAL EXPERIENCE", "Professional Experience", "Employment History"],
750
+ "skills": ["SKILLS", "Skills", "TECHNICAL SKILLS", "Technical Skills", "Core Competencies", "CORE COMPETENCIES", "Technologies"],
751
+ "projects": ["PROJECTS", "Projects", "PERSONAL PROJECTS", "Personal Projects", "SIDE PROJECTS", "Selected Projects", "Portfolio"],
752
+ "summary": ["SUMMARY", "Summary", "PROFESSIONAL SUMMARY", "Professional Summary", "OBJECTIVE", "Objective", "PROFILE", "Profile", "About Me", "ABOUT"],
753
+ "certifications": ["CERTIFICATIONS", "Certifications", "CERTIFICATES", "Certificates", "Licenses & Certifications", "PROFESSIONAL CERTIFICATIONS"],
754
+ "contact": ["CONTACT", "Contact", "CONTACT INFORMATION", "Contact Information", "Personal Information"],
755
+ "awards": ["AWARDS", "Awards", "HONORS & AWARDS", "Honors & Awards", "ACHIEVEMENTS", "Achievements", "Awards & Honors", "RECOGNITION"],
756
+ }
757
+
758
+ GENERATORS = {
759
+ "education": generate_education,
760
+ "experience": generate_experience,
761
+ "skills": generate_skills,
762
+ "projects": generate_projects,
763
+ "summary": generate_summary,
764
+ "certifications": generate_certifications,
765
+ "contact": generate_contact,
766
+ "awards": generate_awards,
767
+ }
768
+
769
+
770
+ # ---------------------------------------------------------------------------
771
+ # Dataset generation
772
+ # ---------------------------------------------------------------------------
773
+
774
+ def generate_example(label: str, include_header: bool = False, augment: bool = False) -> str:
775
+ """
776
+ Generate a single synthetic example for the given label.
777
+
778
+ Args:
779
+ label: One of the 8 section categories.
780
+ include_header: Whether to prepend a section header.
781
+ augment: Whether to apply text augmentation.
782
+
783
+ Returns:
784
+ Generated text string.
785
+ """
786
+ text = GENERATORS[label]()
787
+
788
+ # Optionally prepend a section header
789
+ if include_header and random.random() < 0.5:
790
+ header = _pick_one(SECTION_HEADERS[label])
791
+ sep = random.choice(["\n", "\n\n", "\n---\n"])
792
+ text = f"{header}{sep}{text}"
793
+
794
+ # Augmentation
795
+ if augment:
796
+ if random.random() < 0.4:
797
+ text = _synonym_replace(text)
798
+ # Randomly add/remove trailing whitespace or newlines
799
+ if random.random() < 0.2:
800
+ text = text.strip() + "\n"
801
+ if random.random() < 0.1:
802
+ text = " " + text
803
+
804
+ return text
805
+
806
+
807
+ def generate_dataset(
808
+ examples_per_category: int = 80,
809
+ augmented_copies: int = 2,
810
+ include_header_prob: float = 0.4,
811
+ seed: int = 42,
812
+ ) -> list[dict]:
813
+ """
814
+ Generate a complete synthetic dataset.
815
+
816
+ Args:
817
+ examples_per_category: Base examples per category.
818
+ augmented_copies: Number of augmented copies per base example.
819
+ include_header_prob: Probability of including section header.
820
+ seed: Random seed for reproducibility.
821
+
822
+ Returns:
823
+ List of dicts with 'text' and 'label' keys.
824
+ """
825
+ random.seed(seed)
826
+ labels = list(GENERATORS.keys())
827
+ dataset = []
828
+
829
+ for label in labels:
830
+ for i in range(examples_per_category):
831
+ include_header = random.random() < include_header_prob
832
+ text = generate_example(label, include_header=include_header, augment=False)
833
+ dataset.append({"text": text, "label": label})
834
+
835
+ # Generate augmented versions
836
+ for _ in range(augmented_copies):
837
+ aug_text = generate_example(label, include_header=include_header, augment=True)
838
+ dataset.append({"text": aug_text, "label": label})
839
+
840
+ random.shuffle(dataset)
841
+ return dataset
842
+
843
+
844
+ def save_to_csv(dataset: list[dict], path: str) -> None:
845
+ """Save dataset to CSV."""
846
+ filepath = Path(path)
847
+ filepath.parent.mkdir(parents=True, exist_ok=True)
848
+ with open(filepath, "w", newline="", encoding="utf-8") as f:
849
+ writer = csv.DictWriter(f, fieldnames=["text", "label"])
850
+ writer.writeheader()
851
+ writer.writerows(dataset)
852
+ print(f"Saved {len(dataset)} examples to {filepath}")
853
+
854
+
855
+ def load_as_hf_dataset(dataset: list[dict]):
+     """Convert to HuggingFace Dataset with stratified train/val/test splits."""
+     from datasets import ClassLabel, Dataset, DatasetDict
+
+     ds = Dataset.from_list(dataset)
+
+     # stratify_by_column requires a ClassLabel feature, so cast the string
+     # labels to ClassLabel before splitting.
+     label_names = sorted(set(d["label"] for d in dataset))
+     ds = ds.cast_column("label", ClassLabel(names=label_names))
+
+     # 80/10/10 split
+     train_test = ds.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
+     val_test = train_test["test"].train_test_split(test_size=0.5, seed=42, stratify_by_column="label")
+
+     return DatasetDict({
+         "train": train_test["train"],
+         "validation": val_test["train"],
+         "test": val_test["test"],
+     })
+
+
+ def get_label_mapping(dataset: list[dict]) -> tuple[dict, dict]:
+     """Create label <-> id mappings."""
+     labels = sorted(set(d["label"] for d in dataset))
+     label2id = {label: idx for idx, label in enumerate(labels)}
+     id2label = {idx: label for label, idx in label2id.items()}
+     return label2id, id2label
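Applied standalone, the mapping above comes out sorted and contiguous; a quick sketch with a hypothetical three-row dataset (labels and texts are made up for illustration):

```python
# Standalone sketch of the label <-> id mapping logic in get_label_mapping().
dataset = [
    {"text": "Python, Java, React", "label": "skills"},
    {"text": "B.S. in Computer Science", "label": "education"},
    {"text": "Software Engineer at Acme", "label": "experience"},
]
labels = sorted(set(d["label"] for d in dataset))
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}
print(label2id)  # {'education': 0, 'experience': 1, 'skills': 2}
```

Because the labels are sorted before enumeration, the ids are stable across runs regardless of dataset order.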
+
+
+ # ---------------------------------------------------------------------------
+ # CLI entry point
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser(description="Generate synthetic resume section data")
+     parser.add_argument("--examples-per-category", type=int, default=80,
+                         help="Number of base examples per category (default: 80)")
+     parser.add_argument("--augmented-copies", type=int, default=2,
+                         help="Number of augmented copies per example (default: 2)")
+     parser.add_argument("--output", type=str, default="data/resume_sections.csv",
+                         help="Output CSV path (default: data/resume_sections.csv)")
+     parser.add_argument("--seed", type=int, default=42,
+                         help="Random seed (default: 42)")
+     parser.add_argument("--print-stats", action="store_true",
+                         help="Print dataset statistics")
+     parser.add_argument("--print-samples", type=int, default=0,
+                         help="Print N sample examples")
+
+     args = parser.parse_args()
+
+     print(f"Generating dataset with {args.examples_per_category} base examples per category...")
+     print(f"Augmented copies per example: {args.augmented_copies}")
+     print(f"Total expected examples: {args.examples_per_category * (1 + args.augmented_copies) * 8}")
+
+     dataset = generate_dataset(
+         examples_per_category=args.examples_per_category,
+         augmented_copies=args.augmented_copies,
+         seed=args.seed,
+     )
+
+     save_to_csv(dataset, args.output)
+
+     if args.print_stats:
+         from collections import Counter
+         counts = Counter(d["label"] for d in dataset)
+         print("\nDataset Statistics:")
+         print(f"  Total examples: {len(dataset)}")
+         print(f"  Categories: {len(counts)}")
+         for label, count in sorted(counts.items()):
+             print(f"    {label}: {count}")
+         avg_len = sum(len(d["text"]) for d in dataset) / len(dataset)
+         print(f"  Average text length: {avg_len:.0f} chars")
+
+     if args.print_samples > 0:
+         print(f"\n{'='*60}")
+         print(f"Sample Examples (first {args.print_samples}):")
+         print(f"{'='*60}")
+         for i, example in enumerate(dataset[:args.print_samples]):
+             print(f"\n--- Example {i+1} [{example['label']}] ---")
+             print(example["text"][:300])
+             if len(example["text"]) > 300:
+                 print("...")
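For reference, the "Total expected examples" figure printed by the CLI follows directly from the generation loop: each of the 8 categories emits one base example plus `augmented_copies` augmented variants per base example. A minimal sketch with the default values:

```python
# Dataset-size arithmetic behind the CLI banner above.
examples_per_category = 80
augmented_copies = 2
n_categories = 8
total = examples_per_category * (1 + augmented_copies) * n_categories
print(total)  # 1920
```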
inference.py ADDED
@@ -0,0 +1,441 @@
+ """
+ Resume Section Classifier – Inference Script
+
+ Takes raw resume text, splits it into sections, and classifies each section
+ into one of 8 categories with confidence scores.
+
+ Author: Lorenzo Scaturchio (gr8monk3ys)
+
+ Usage:
+     # Classify a resume file
+     python inference.py --file resume.txt
+
+     # Classify inline text
+     python inference.py --text "Bachelor of Science in Computer Science, MIT, 2023"
+
+     # Use a custom model path
+     python inference.py --model ./model_output/final_model --file resume.txt
+
+     # Output as JSON
+     python inference.py --file resume.txt --format json
+
+     # Python API
+     from inference import ResumeSectionClassifier
+     classifier = ResumeSectionClassifier("./model_output/final_model")
+     results = classifier.classify_resume(resume_text)
+ """
+
+ import json
+ import re
+ import sys
+ from dataclasses import dataclass, field
+ from pathlib import Path
+
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # Data classes
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class SectionPrediction:
+     """A single section classification result."""
+     text: str
+     label: str
+     confidence: float
+     all_scores: dict = field(default_factory=dict)
+
+     def to_dict(self) -> dict:
+         return {
+             "text": self.text,
+             "label": self.label,
+             "confidence": round(self.confidence, 4),
+             "all_scores": {k: round(v, 4) for k, v in self.all_scores.items()},
+         }
+
+
+ @dataclass
+ class ResumeAnalysis:
+     """Complete resume analysis output."""
+     sections: list
+     section_count: int = 0
+     label_distribution: dict = field(default_factory=dict)
+
+     def to_dict(self) -> dict:
+         return {
+             "sections": [s.to_dict() for s in self.sections],
+             "section_count": self.section_count,
+             "label_distribution": self.label_distribution,
+         }
+
+     def to_json(self, indent: int = 2) -> str:
+         return json.dumps(self.to_dict(), indent=indent)
+
+     def summary(self) -> str:
+         """Human-readable summary."""
+         lines = [
+             f"Resume Analysis: {self.section_count} sections detected",
+             "=" * 50,
+         ]
+         for i, sec in enumerate(self.sections, 1):
+             text_preview = sec.text[:80].replace("\n", " ")
+             if len(sec.text) > 80:
+                 text_preview += "..."
+             lines.append(
+                 f"\n[{i}] {sec.label.upper()} (confidence: {sec.confidence:.1%})"
+             )
+             lines.append(f"    {text_preview}")
+
+         lines.append("\n" + "-" * 50)
+         lines.append("Label Distribution:")
+         for label, count in sorted(self.label_distribution.items()):
+             lines.append(f"  {label}: {count}")
+
+         return "\n".join(lines)
+
+
+ # ---------------------------------------------------------------------------
+ # Section splitting heuristics
+ # ---------------------------------------------------------------------------
+
+ # Common resume section header patterns
+ SECTION_HEADER_PATTERNS = [
+     r"^#{1,3}\s+.+$",          # Markdown headers
+     r"^[A-Z][A-Z\s&/,]{2,}$",  # ALL CAPS headers
+     r"^(?:EDUCATION|EXPERIENCE|WORK EXPERIENCE|PROFESSIONAL EXPERIENCE|"
+     r"SKILLS|TECHNICAL SKILLS|PROJECTS|PERSONAL PROJECTS|"
+     r"SUMMARY|PROFESSIONAL SUMMARY|OBJECTIVE|PROFILE|ABOUT|"
+     r"CERTIFICATIONS|CERTIFICATES|LICENSES|"
+     r"CONTACT|CONTACT INFORMATION|PERSONAL INFORMATION|"
+     r"AWARDS|HONORS|ACHIEVEMENTS|RECOGNITION|"
+     r"PUBLICATIONS|REFERENCES|VOLUNTEER|LANGUAGES|INTERESTS|"
+     r"ACTIVITIES|LEADERSHIP|RESEARCH)\s*:?\s*$",
+ ]
116
+
+ # Apply IGNORECASE only to the named-header pattern; the ALL-CAPS pattern must
+ # stay case-sensitive, otherwise it would match any short lowercase line.
+ _HEADER_FLAGS = [re.MULTILINE, re.MULTILINE, re.MULTILINE | re.IGNORECASE]
+ COMPILED_HEADERS = [re.compile(p, f) for p, f in zip(SECTION_HEADER_PATTERNS, _HEADER_FLAGS)]
+
+
+ def is_section_header(line: str) -> bool:
+     """Check if a line looks like a section header."""
+     stripped = line.strip()
+     if not stripped or len(stripped) < 3:
+         return False
+
+     for pattern in COMPILED_HEADERS:
+         if pattern.match(stripped):
+             return True
+
+     # Heuristic: short all-caps line
+     if stripped.isupper() and len(stripped.split()) <= 5 and len(stripped) < 50:
+         return True
+
+     return False
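A trimmed-down, self-contained sketch of this header heuristic (the named-header pattern is reduced to a small subset here, for illustration only):

```python
import re

# Mini version of is_section_header(): a line counts as a header if it matches
# a known header name (case-insensitively) or is a short ALL-CAPS line.
NAMED = re.compile(r"^(?:EDUCATION|EXPERIENCE|SKILLS|PROJECTS|SUMMARY)\s*:?\s*$",
                   re.IGNORECASE)

def looks_like_header(line: str) -> bool:
    stripped = line.strip()
    if len(stripped) < 3:
        return False
    if NAMED.match(stripped):
        return True
    return stripped.isupper() and len(stripped.split()) <= 5 and len(stripped) < 50

print(looks_like_header("EDUCATION"))                 # True
print(looks_like_header("Skills:"))                   # True
print(looks_like_header("B.S. in Computer Science"))  # False
```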
+
+
+ def split_resume_into_sections(text: str, min_section_length: int = 20) -> list:
+     """
+     Split raw resume text into logical sections.
+
+     Strategy:
+       1. First try to split on detected section headers.
+       2. Fall back to splitting on double newlines (paragraph breaks).
+       3. Filter out very short fragments.
+
+     Args:
+         text: Raw resume text.
+         min_section_length: Minimum character length for a section.
+
+     Returns:
+         List of text sections.
+     """
+     lines = text.split("\n")
+     sections = []
+     current_section_lines = []
+
+     # Pass 1: Try header-based splitting
+     header_found = False
+     for line in lines:
+         if is_section_header(line):
+             header_found = True
+             # Save previous section
+             if current_section_lines:
+                 section_text = "\n".join(current_section_lines).strip()
+                 if len(section_text) >= min_section_length:
+                     sections.append(section_text)
+             current_section_lines = [line]
+         else:
+             current_section_lines.append(line)
+
+     # Don't forget the last section
+     if current_section_lines:
+         section_text = "\n".join(current_section_lines).strip()
+         if len(section_text) >= min_section_length:
+             sections.append(section_text)
+
+     # If no headers found, fall back to paragraph splitting
+     if not header_found or len(sections) <= 1:
+         sections = []
+         paragraphs = re.split(r"\n\s*\n", text)
+         for para in paragraphs:
+             stripped = para.strip()
+             if len(stripped) >= min_section_length:
+                 sections.append(stripped)
+
+     # If still just one big block, return it as-is
+     if not sections:
+         stripped = text.strip()
+         if stripped:
+             sections = [stripped]
+
+     return sections
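The paragraph-break fallback (step 2 of the strategy) can be sketched standalone; the sample text below is made up, and the `min_section_length` filter is omitted for brevity:

```python
import re

# Standalone sketch of the fallback split used when no headers are detected:
# break on blank lines (one or more), then drop empty fragments. The real
# function additionally filters out fragments shorter than min_section_length.
text = "Python, Java, React\n\nSoftware Engineer at Acme\nBuilt internal APIs\n\n\nB.S. in CS, 2023"
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
print(len(paragraphs))  # 3
```

Note that `\n\s*\n` treats any run of blank lines as a single break, so the triple newline above still produces one split.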
+
+
+ # ---------------------------------------------------------------------------
+ # Classifier
+ # ---------------------------------------------------------------------------
+
+ class ResumeSectionClassifier:
+     """
+     Classifies resume text sections into categories.
+
+     Supports both single-section and full-resume classification.
+     """
+
+     def __init__(
+         self,
+         model_path: str = "./model_output/final_model",
+         device: str = None,
+         max_length: int = 256,
+     ):
+         """
+         Initialize the classifier.
+
+         Args:
+             model_path: Path to fine-tuned model directory.
+             device: Device string ('cpu', 'cuda', 'mps'). Auto-detected if None.
+             max_length: Maximum token sequence length.
+         """
+         self.model_path = Path(model_path)
+         self.max_length = max_length
+
+         # Auto-detect device
+         if device is None:
+             if torch.cuda.is_available():
+                 self.device = torch.device("cuda")
+             elif torch.backends.mps.is_available():
+                 self.device = torch.device("mps")
+             else:
+                 self.device = torch.device("cpu")
+         else:
+             self.device = torch.device(device)
+
+         # Load model and tokenizer
+         self.tokenizer = AutoTokenizer.from_pretrained(str(self.model_path))
+         self.model = AutoModelForSequenceClassification.from_pretrained(
+             str(self.model_path)
+         ).to(self.device)
+         self.model.eval()
+
+         # Load label mapping
+         label_mapping_path = self.model_path / "label_mapping.json"
+         if label_mapping_path.exists():
+             with open(label_mapping_path) as f:
+                 mapping = json.load(f)
+             self.id2label = {int(k): v for k, v in mapping["id2label"].items()}
+             self.label2id = mapping["label2id"]
+         else:
+             # Fall back to model config
+             self.id2label = self.model.config.id2label
+             self.label2id = self.model.config.label2id
+
+         self.labels = sorted(self.label2id.keys())
+
+     def classify_section(self, text: str) -> SectionPrediction:
+         """
+         Classify a single text section.
+
+         Args:
+             text: Section text to classify.
+
+         Returns:
+             SectionPrediction with label, confidence, and all scores.
+         """
+         inputs = self.tokenizer(
+             text,
+             truncation=True,
+             max_length=self.max_length,
+             padding=True,
+             return_tensors="pt",
+         ).to(self.device)
+
+         with torch.no_grad():
+             outputs = self.model(**inputs)
+             probs = torch.softmax(outputs.logits, dim=-1)[0]
+
+         scores = {self.id2label[i]: probs[i].item() for i in range(len(probs))}
+         predicted_id = probs.argmax().item()
+         predicted_label = self.id2label[predicted_id]
+         confidence = probs[predicted_id].item()
+
+         return SectionPrediction(
+             text=text,
+             label=predicted_label,
+             confidence=confidence,
+             all_scores=scores,
+         )
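The logits-to-confidence step is a plain softmax followed by argmax; a dependency-free sketch with hypothetical logits:

```python
import math

# Softmax over a made-up logit vector, mirroring how classify_section()
# turns model logits into per-label confidence scores (without torch).
logits = [2.0, 0.5, -1.0]
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_id)  # 0
```

The probabilities sum to 1, and the reported confidence is simply the probability at the argmax index.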
288
+
289
    def classify_sections(self, texts: list[str]) -> list[SectionPrediction]:
        """
        Classify multiple text sections in a single batch.

        Args:
            texts: List of section texts.

        Returns:
            List of SectionPrediction objects, in input order.
        """
        if not texts:
            return []

        inputs = self.tokenizer(
            texts,
            truncation=True,
            max_length=self.max_length,
            padding=True,
            return_tensors="pt",
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)

        results = []
        for i, text in enumerate(texts):
            scores = {self.id2label[j]: probs[i][j].item() for j in range(probs.shape[1])}
            predicted_id = probs[i].argmax().item()
            predicted_label = self.id2label[predicted_id]
            confidence = probs[i][predicted_id].item()

            results.append(SectionPrediction(
                text=text,
                label=predicted_label,
                confidence=confidence,
                all_scores=scores,
            ))

        return results

    def classify_resume(
        self,
        resume_text: str,
        min_section_length: int = 20,
    ) -> ResumeAnalysis:
        """
        Classify a full resume by splitting it into sections and classifying each.

        Args:
            resume_text: Full resume text.
            min_section_length: Minimum section length in characters.

        Returns:
            ResumeAnalysis with all section predictions.
        """
        sections = split_resume_into_sections(resume_text, min_section_length)
        predictions = self.classify_sections(sections)

        # Compute label distribution
        label_dist = {}
        for pred in predictions:
            label_dist[pred.label] = label_dist.get(pred.label, 0) + 1

        return ResumeAnalysis(
            sections=predictions,
            section_count=len(predictions),
            label_distribution=label_dist,
        )


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="Classify resume sections",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python inference.py --file resume.txt
  python inference.py --text "BS in Computer Science, MIT, 2023"
  python inference.py --file resume.txt --format json
  python inference.py --model ./model_output/final_model --file resume.txt
""",
    )

    input_group = parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument("--file", type=str, help="Path to resume text file")
    input_group.add_argument("--text", type=str, help="Direct text to classify")

    parser.add_argument("--model", type=str, default="./model_output/final_model",
                        help="Path to fine-tuned model (default: ./model_output/final_model)")
    parser.add_argument("--device", type=str, default=None,
                        help="Device: cpu, cuda, mps (auto-detected if omitted)")
    parser.add_argument("--max-length", type=int, default=256,
                        help="Maximum token sequence length (default: 256)")
    parser.add_argument("--min-section-length", type=int, default=20,
                        help="Minimum section length in characters (default: 20)")
    parser.add_argument("--format", type=str, choices=["text", "json"], default="text",
                        help="Output format (default: text)")
    parser.add_argument("--single", action="store_true",
                        help="Classify as a single section (no splitting)")

    args = parser.parse_args()

    # Load classifier
    try:
        classifier = ResumeSectionClassifier(
            model_path=args.model,
            device=args.device,
            max_length=args.max_length,
        )
    except Exception as e:
        print(f"Error loading model from '{args.model}': {e}", file=sys.stderr)
        print("Have you trained the model yet? Run: python train.py", file=sys.stderr)
        sys.exit(1)

    # Get input text
    if args.file:
        file_path = Path(args.file)
        if not file_path.exists():
            print(f"File not found: {args.file}", file=sys.stderr)
            sys.exit(1)
        text = file_path.read_text(encoding="utf-8")
    else:
        text = args.text

    # Classify
    if args.single:
        result = classifier.classify_section(text)
        if args.format == "json":
            print(json.dumps(result.to_dict(), indent=2))
        else:
            print(f"Label: {result.label}")
            print(f"Confidence: {result.confidence:.1%}")
            print("\nAll scores:")
            for label, score in sorted(result.all_scores.items(), key=lambda x: -x[1]):
                bar = "#" * int(score * 40)
                print(f"  {label:20s} {score:.4f} {bar}")
    else:
        analysis = classifier.classify_resume(text, min_section_length=args.min_section_length)
        if args.format == "json":
            print(analysis.to_json())
        else:
            print(analysis.summary())


if __name__ == "__main__":
    main()
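For reference, the per-section scoring above boils down to a softmax over the model's logits followed by an argmax. A stdlib-only sketch of that step, where the label set is the one this model uses but the logit values are invented purely for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The 8 section labels; the logits below are made-up example values.
labels = ["education", "experience", "skills", "projects",
          "summary", "certifications", "contact", "awards"]
logits = [4.1, 0.3, -1.2, 0.0, 0.5, -0.8, -2.0, -1.5]

probs = softmax(logits)
scores = dict(zip(labels, probs))        # mirrors SectionPrediction.all_scores
predicted = max(scores, key=scores.get)  # argmax, as in classify_section
confidence = scores[predicted]
```

The real code does the same thing with `torch.softmax` over the logits tensor; `classify_sections` simply applies it across a padded batch.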
requirements.txt ADDED
@@ -0,0 +1,8 @@
transformers>=4.36.0
datasets>=2.16.0
torch>=2.1.0
scikit-learn>=1.3.0
accelerate>=0.25.0
evaluate>=0.4.0
pandas>=2.0.0
huggingface_hub>=0.20.0
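These are minimum-version constraints, not pins. The `name>=minimum` shape is easy to parse if you want to sanity-check a line before installing; a throwaway stdlib helper (not part of this repo — use `packaging.requirements` for anything serious):

```python
def parse_min_requirement(line: str) -> tuple[str, tuple[int, ...]]:
    """Split a 'name>=x.y.z' requirement into its name and minimum version tuple."""
    name, sep, minimum = line.partition(">=")
    if not sep:
        raise ValueError(f"not a >= requirement: {line!r}")
    # Version tuples compare element-wise, so (4, 36, 0) < (4, 41, 0).
    return name.strip(), tuple(int(p) for p in minimum.strip().split("."))

name, minimum = parse_min_requirement("transformers>=4.36.0")
```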
train.py ADDED
@@ -0,0 +1,437 @@
"""
Resume Section Classifier – Training Script

Fine-tunes distilbert-base-uncased for classifying resume text sections
into 8 categories: education, experience, skills, projects, summary,
certifications, contact, awards.

Author: Lorenzo Scaturchio (gr8monk3ys)

Usage:
    python train.py                     # Train with defaults
    python train.py --epochs 5 --batch-size 32
    python train.py --push-to-hub       # Push to HuggingFace Hub
    python train.py --output-dir ./my_model
"""

import json
import logging
import sys
from pathlib import Path

import evaluate
import numpy as np
import torch
from datasets import DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

from data_generator import generate_dataset, get_label_mapping, load_as_hf_dataset

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
MODEL_NAME = "distilbert-base-uncased"
DEFAULT_OUTPUT_DIR = "./model_output"
DEFAULT_LOGGING_DIR = "./logs"
HUB_MODEL_ID = "gr8monk3ys/resume-section-classifier"
MAX_LENGTH = 256

# ---------------------------------------------------------------------------
# Metrics computation
# ---------------------------------------------------------------------------
def build_compute_metrics(id2label: dict):
    """Build a compute_metrics function with access to the label mappings."""
    accuracy_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)

        acc = accuracy_metric.compute(predictions=predictions, references=labels)
        f1_macro = f1_metric.compute(predictions=predictions, references=labels, average="macro")
        f1_weighted = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
        precision = precision_metric.compute(predictions=predictions, references=labels, average="weighted")
        recall = recall_metric.compute(predictions=predictions, references=labels, average="weighted")

        return {
            "accuracy": acc["accuracy"],
            "f1_macro": f1_macro["f1"],
            "f1_weighted": f1_weighted["f1"],
            "precision": precision["precision"],
            "recall": recall["recall"],
        }

    return compute_metrics


# ---------------------------------------------------------------------------
# Tokenization
# ---------------------------------------------------------------------------
def tokenize_dataset(dataset_dict: DatasetDict, tokenizer, label2id: dict, max_length: int = MAX_LENGTH):
    """Tokenize all splits and encode string labels as integer ids."""

    def preprocess(examples):
        tokenized = tokenizer(
            examples["text"],
            truncation=True,
            max_length=max_length,
            padding=False,  # Dynamic padding via DataCollatorWithPadding
        )
        tokenized["labels"] = [label2id[label] for label in examples["label"]]
        return tokenized

    tokenized = dataset_dict.map(
        preprocess,
        batched=True,
        remove_columns=["text", "label"],
        desc="Tokenizing",
    )

    return tokenized


# ---------------------------------------------------------------------------
# Main training function
# ---------------------------------------------------------------------------
def train(
    output_dir: str = DEFAULT_OUTPUT_DIR,
    model_name: str = MODEL_NAME,
    epochs: int = 4,
    batch_size: int = 16,
    learning_rate: float = 2e-5,
    weight_decay: float = 0.01,
    warmup_ratio: float = 0.1,
    max_length: int = MAX_LENGTH,
    examples_per_category: int = 80,
    augmented_copies: int = 2,
    seed: int = 42,
    push_to_hub: bool = False,
    hub_model_id: str = HUB_MODEL_ID,
    fp16: bool | None = None,
    gradient_accumulation_steps: int = 1,
    early_stopping_patience: int = 3,
):
    """
    Full training pipeline.

    Args:
        output_dir: Directory to save the model and artifacts.
        model_name: Pretrained model identifier.
        epochs: Number of training epochs.
        batch_size: Training batch size.
        learning_rate: Peak learning rate.
        weight_decay: Weight decay for AdamW.
        warmup_ratio: Fraction of total steps used for warmup.
        max_length: Maximum token sequence length.
        examples_per_category: Base synthetic examples per category.
        augmented_copies: Augmented copies per base example.
        seed: Random seed.
        push_to_hub: Whether to push to the HuggingFace Hub.
        hub_model_id: Hub model repository ID.
        fp16: Use mixed precision (auto-detected if None).
        gradient_accumulation_steps: Gradient accumulation steps.
        early_stopping_patience: Early stopping patience in epochs (0 disables it).
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Auto-detect fp16
    if fp16 is None:
        fp16 = torch.cuda.is_available()

    logger.info("=" * 60)
    logger.info("Resume Section Classifier – Training")
    logger.info("=" * 60)
    logger.info(f"Model: {model_name}")
    logger.info(f"Output: {output_dir}")
    logger.info(f"Epochs: {epochs}, Batch size: {batch_size}, LR: {learning_rate}")
    logger.info(f"Device: {'CUDA' if torch.cuda.is_available() else 'MPS' if torch.backends.mps.is_available() else 'CPU'}")
    logger.info(f"FP16: {fp16}")

    # ------------------------------------------------------------------
    # 1. Generate synthetic data
    # ------------------------------------------------------------------
    logger.info("\n[1/5] Generating synthetic training data...")
    raw_dataset = generate_dataset(
        examples_per_category=examples_per_category,
        augmented_copies=augmented_copies,
        seed=seed,
    )
    label2id, id2label = get_label_mapping(raw_dataset)
    num_labels = len(label2id)

    logger.info(f"  Total examples: {len(raw_dataset)}")
    logger.info(f"  Labels ({num_labels}): {list(label2id.keys())}")

    # Create an HF DatasetDict with train/val/test splits
    dataset_dict = load_as_hf_dataset(raw_dataset)
    logger.info(f"  Train:      {len(dataset_dict['train'])}")
    logger.info(f"  Validation: {len(dataset_dict['validation'])}")
    logger.info(f"  Test:       {len(dataset_dict['test'])}")

    # ------------------------------------------------------------------
    # 2. Tokenize
    # ------------------------------------------------------------------
    logger.info("\n[2/5] Loading tokenizer and tokenizing data...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenized_dataset = tokenize_dataset(dataset_dict, tokenizer, label2id, max_length)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # ------------------------------------------------------------------
    # 3. Load model
    # ------------------------------------------------------------------
    logger.info("\n[3/5] Loading pretrained model...")
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        id2label=id2label,
        label2id=label2id,
    )
    logger.info(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")
    logger.info(f"  Trainable:  {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

    # ------------------------------------------------------------------
    # 4. Training
    # ------------------------------------------------------------------
    logger.info("\n[4/5] Training...")

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        # Training hyperparameters
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        lr_scheduler_type="cosine",
        # Evaluation
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1_macro",
        greater_is_better=True,
        # Logging
        logging_dir=DEFAULT_LOGGING_DIR,
        logging_strategy="steps",
        logging_steps=50,
        report_to="none",
        # Efficiency
        fp16=fp16,
        dataloader_num_workers=0,
        # Reproducibility
        seed=seed,
        data_seed=seed,
        # Hub
        push_to_hub=False,  # We'll push manually after evaluation
        # Misc
        save_total_limit=3,
        disable_tqdm=False,
    )

    callbacks = []
    if early_stopping_patience > 0:
        callbacks.append(EarlyStoppingCallback(early_stopping_patience=early_stopping_patience))

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=build_compute_metrics(id2label),
        callbacks=callbacks,
    )

    train_result = trainer.train()

    # Log training metrics
    logger.info("\nTraining Results:")
    for key, value in train_result.metrics.items():
        logger.info(f"  {key}: {value}")

    # ------------------------------------------------------------------
    # 5. Evaluation
    # ------------------------------------------------------------------
    logger.info("\n[5/5] Evaluating on test set...")
    test_results = trainer.evaluate(tokenized_dataset["test"])

    logger.info("\nTest Results:")
    for key, value in test_results.items():
        logger.info(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

    # ------------------------------------------------------------------
    # Save artifacts
    # ------------------------------------------------------------------
    logger.info("\nSaving model and artifacts...")

    # Save model + tokenizer
    final_path = output_path / "final_model"
    trainer.save_model(str(final_path))
    tokenizer.save_pretrained(str(final_path))

    # Save label mapping (JSON object keys must be strings, so ids are stringified)
    label_mapping = {
        "label2id": label2id,
        "id2label": {str(k): v for k, v in id2label.items()},
        "labels": list(label2id.keys()),
    }
    with open(final_path / "label_mapping.json", "w") as f:
        json.dump(label_mapping, f, indent=2)

    # Save training config
    train_config = {
        "model_name": model_name,
        "max_length": max_length,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "weight_decay": weight_decay,
        "warmup_ratio": warmup_ratio,
        "examples_per_category": examples_per_category,
        "augmented_copies": augmented_copies,
        "seed": seed,
        "num_labels": num_labels,
        "train_size": len(dataset_dict["train"]),
        "val_size": len(dataset_dict["validation"]),
        "test_size": len(dataset_dict["test"]),
    }
    with open(final_path / "training_config.json", "w") as f:
        json.dump(train_config, f, indent=2)

    # Save metrics
    all_metrics = {
        "train": train_result.metrics,
        "test": test_results,
    }
    with open(final_path / "metrics.json", "w") as f:
        json.dump(all_metrics, f, indent=2)

    logger.info(f"\nAll artifacts saved to: {final_path}")

    # ------------------------------------------------------------------
    # Optional: Push to Hub
    # ------------------------------------------------------------------
    if push_to_hub:
        logger.info(f"\nPushing to HuggingFace Hub: {hub_model_id}")
        try:
            # Trainer.push_to_hub takes no repo_id argument; the target repo
            # comes from TrainingArguments.hub_model_id, so set it here.
            trainer.args.hub_model_id = hub_model_id
            trainer.push_to_hub(
                commit_message="Upload fine-tuned resume section classifier",
            )
            tokenizer.push_to_hub(hub_model_id)
            logger.info("Successfully pushed to Hub!")
        except Exception as e:
            logger.error(f"Failed to push to Hub: {e}")
            logger.info("You can push manually later with:")
            logger.info(f"  huggingface-cli upload {hub_model_id} {final_path}")

    logger.info("\nTraining complete!")
    return test_results


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Fine-tune DistilBERT for resume section classification",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Model & output
    parser.add_argument("--model-name", type=str, default=MODEL_NAME,
                        help="Pretrained model name or path")
    parser.add_argument("--output-dir", type=str, default=DEFAULT_OUTPUT_DIR,
                        help="Output directory for model and artifacts")

    # Training hyperparameters
    parser.add_argument("--epochs", type=int, default=4,
                        help="Number of training epochs")
    parser.add_argument("--batch-size", type=int, default=16,
                        help="Training batch size per device")
    parser.add_argument("--learning-rate", type=float, default=2e-5,
                        help="Peak learning rate")
    parser.add_argument("--weight-decay", type=float, default=0.01,
                        help="Weight decay for AdamW")
    parser.add_argument("--warmup-ratio", type=float, default=0.1,
                        help="Fraction of total steps for linear warmup")
    parser.add_argument("--max-length", type=int, default=MAX_LENGTH,
                        help="Maximum token sequence length")
    parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
                        help="Number of gradient accumulation steps")

    # Data
    parser.add_argument("--examples-per-category", type=int, default=80,
                        help="Base synthetic examples per category")
    parser.add_argument("--augmented-copies", type=int, default=2,
                        help="Augmented copies per base example")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")

    # Training config
    parser.add_argument("--fp16", action="store_true", default=None,
                        help="Force FP16 training")
    parser.add_argument("--no-fp16", action="store_true",
                        help="Disable FP16 training")
    parser.add_argument("--early-stopping-patience", type=int, default=3,
                        help="Early stopping patience (0 to disable)")

    # Hub
    parser.add_argument("--push-to-hub", action="store_true",
                        help="Push trained model to HuggingFace Hub")
    parser.add_argument("--hub-model-id", type=str, default=HUB_MODEL_ID,
                        help="HuggingFace Hub model ID")

    args = parser.parse_args()

    # Resolve the fp16 tri-state: None = auto-detect, True = forced, False = disabled
    fp16 = args.fp16
    if args.no_fp16:
        fp16 = False

    results = train(
        output_dir=args.output_dir,
        model_name=args.model_name,
        epochs=args.epochs,
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_ratio=args.warmup_ratio,
        max_length=args.max_length,
        examples_per_category=args.examples_per_category,
        augmented_copies=args.augmented_copies,
        seed=args.seed,
        push_to_hub=args.push_to_hub,
        hub_model_id=args.hub_model_id,
        fp16=fp16,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        early_stopping_patience=args.early_stopping_patience,
    )
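One detail worth noting in the artifact-saving step: JSON object keys are always strings, which is why train.py serializes `id2label` as `{str(k): v ...}`. Any loader that reads `label_mapping.json` back has to undo that conversion. A minimal sketch of the round trip (the two-label mapping here is a toy example, not the full 8-label set):

```python
import json

# What train.py writes: integer ids must be stringified for JSON.
label_mapping = {
    "label2id": {"education": 0, "experience": 1},
    "id2label": {"0": "education", "1": "experience"},
    "labels": ["education", "experience"],
}
blob = json.dumps(label_mapping, indent=2)

# What a loader must do on the way back in: restore integer keys.
loaded = json.loads(blob)
id2label = {int(k): v for k, v in loaded["id2label"].items()}
```

The inference code sidesteps this entirely by reading `model.config.id2label`, which transformers deserializes with integer keys for you.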