gr8monk3ys committed · verified
Commit 9b02fb4 · 1 Parent(s): f433257

Upload folder using huggingface_hub

Files changed (5):
  1. README.md +200 -0
  2. data_generator.py +934 -0
  3. inference.py +441 -0
  4. requirements.txt +8 -0
  5. train.py +437 -0
README.md ADDED
---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- resume-parsing
- nlp
- distilbert
- synthetic-data
metrics:
- accuracy
- f1
pipeline_tag: text-classification
datasets:
- custom-synthetic
language:
- en
---

# Resume Section Classifier

A fine-tuned [DistilBERT](https://huggingface.co/distilbert-base-uncased) model that classifies resume text sections into 8 categories. Designed for automated resume parsing pipelines where incoming text needs to be segmented and labeled by section type.

## Labels

| Label | Description | Example |
|-------|-------------|---------|
| `education` | Academic background, degrees, coursework, GPA | "B.S. in Computer Science, MIT, 2023, GPA: 3.8" |
| `experience` | Work history, job titles, responsibilities | "Software Engineer at Google, 2020-Present. Built microservices..." |
| `skills` | Technical and soft skills listings | "Python, Java, React, Docker, Kubernetes, AWS" |
| `projects` | Personal or academic project descriptions | "Built a real-time analytics dashboard using React and D3.js" |
| `summary` | Professional summary or objective statement | "Results-driven engineer with 5+ years of experience in..." |
| `certifications` | Professional certifications and licenses | "AWS Certified Solutions Architect - Associate (2023)" |
| `contact` | Name, email, phone, LinkedIn, location | "John Smith | john@email.com | (415) 555-1234 | SF, CA" |
| `awards` | Honors, achievements, recognition | "Dean's List, Phi Beta Kappa, Hackathon Winner (2022)" |

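For integration, the eight labels can be expressed as the `label2id`/`id2label` pair that Transformers configs expect. The index order below is illustrative only; the checkpoint's own `config.json` is authoritative:

```python
# Hypothetical label ordering -- check the model's config.json for the real one.
LABELS = [
    "education", "experience", "skills", "projects",
    "summary", "certifications", "contact", "awards",
]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}
```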
## Training Procedure

### Data

The model was trained on a **synthetic dataset** generated programmatically using `data_generator.py`. The generator uses:

- **Template-based generation** with randomized entities (names, companies, universities, skills, dates, cities, etc.) drawn from curated pools of 30-50+ items each
- **Structural variation** across multiple formatting styles per category (e.g., education entries can appear as single-line, multi-line, with/without GPA, with activities, etc.)
- **Data augmentation** via synonym replacement (30% probability per eligible word) and formatting variation (whitespace, bullet styles, separators)
- **Optional section headers** prepended with 40% probability to teach the model to handle both headed and headless sections

Default configuration produces **1,920 examples** (80 base examples × 3 variants × 8 categories), split 80/10/10 into train/validation/test sets with stratified sampling.

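The stratified 80/10/10 split can be sketched in plain Python; this is an illustrative re-implementation, not the exact logic in `train.py`:

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, seed=42):
    """Shuffle within each label, then carve off 80/10/10 per class (sketch)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)

    train, val, test = [], [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = int(len(items) * 0.8)
        n_val = int(len(items) * 0.1)
        train += [(ex, label) for ex in items[:n_train]]
        val += [(ex, label) for ex in items[n_train:n_train + n_val]]
        test += [(ex, label) for ex in items[n_train + n_val:]]
    return train, val, test
```

With the default 1,920 examples (240 per category), this yields 1,536/192/192 splits with every class equally represented in each set.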
### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `distilbert-base-uncased` |
| Max sequence length | 256 tokens |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Early stopping | Patience 3 (on F1 macro) |

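For intuition, the warmup ratio translates into optimizer step counts as follows, assuming the default 1,536-example train split (a back-of-the-envelope sketch; the Trainer computes these internally):

```python
import math

def schedule_steps(n_train=1536, batch_size=16, epochs=4, warmup_ratio=0.1):
    """Total and warmup optimizer steps implied by the table above (sketch)."""
    steps_per_epoch = math.ceil(n_train / batch_size)   # 96
    total_steps = steps_per_epoch * epochs              # 384
    warmup_steps = int(total_steps * warmup_ratio)      # 38
    return total_steps, warmup_steps
```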
### Training Infrastructure

- Fine-tuning takes approximately 5-10 minutes on a single GPU or 15-30 minutes on CPU
- Model size: ~67M parameters (DistilBERT base)
- Mixed precision (FP16) enabled automatically when CUDA is available

## Metrics

Evaluated on the held-out synthetic test set (192 examples, stratified):

| Metric | Score |
|--------|-------|
| Accuracy | ~0.95+ |
| F1 (macro) | ~0.95+ |
| F1 (weighted) | ~0.95+ |
| Precision (weighted) | ~0.95+ |
| Recall (weighted) | ~0.95+ |

> Note: Exact metrics depend on the random seed and training run. Scores on real-world resumes may differ from synthetic test performance.

## Usage

### Quick Start with Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/resume-section-classifier",
)

result = classifier("Bachelor of Science in Computer Science, Stanford University, 2023. GPA: 3.9/4.0")
print(result)
# [{'label': 'education', 'score': 0.98}]
```

### Full Resume Classification

```python
from inference import ResumeSectionClassifier

classifier = ResumeSectionClassifier("gr8monk3ys/resume-section-classifier")

resume_text = """
John Smith
john.smith@email.com | (415) 555-1234 | San Francisco, CA
linkedin.com/in/johnsmith | github.com/johnsmith

SUMMARY
Experienced software engineer with 5+ years building scalable web applications
and distributed systems. Passionate about clean code and mentoring junior developers.

EXPERIENCE
Senior Software Engineer | Google | San Francisco, CA | Jan 2021 - Present
- Led migration of monolithic application to microservices architecture
- Mentored 4 junior engineers through code reviews and pair programming
- Reduced API latency by 40% through caching and query optimization

Software Engineer | Stripe | San Francisco, CA | Jun 2018 - Dec 2020
- Built payment processing APIs handling 10M+ transactions daily
- Implemented CI/CD pipeline reducing deployment time from hours to minutes

EDUCATION
B.S. in Computer Science, Stanford University (2018)
GPA: 3.8/4.0 | Dean's List

SKILLS
Languages: Python, Java, Go, TypeScript
Frameworks: React, Django, Spring Boot, gRPC
Tools: Docker, Kubernetes, AWS, PostgreSQL, Redis
"""

analysis = classifier.classify_resume(resume_text)
print(analysis.summary())
```

### Training from Scratch

```bash
# Install dependencies
pip install -r requirements.txt

# Generate data and train
python train.py --epochs 4 --batch-size 16

# Train and push to Hub
python train.py --push-to-hub --hub-model-id gr8monk3ys/resume-section-classifier

# Inference
python inference.py --file resume.txt
python inference.py --text "Python, Java, React, Docker" --single
```

### Generating Custom Training Data

```bash
# Generate data with custom settings
python data_generator.py \
    --examples-per-category 150 \
    --augmented-copies 3 \
    --output data/resume_sections.csv \
    --print-stats
```

## Project Structure

```
resume-section-classifier/
    data_generator.py    # Synthetic data generation with templates and augmentation
    train.py             # Full fine-tuning pipeline with HuggingFace Trainer
    inference.py         # Section splitting and classification API + CLI
    requirements.txt     # Python dependencies
    README.md            # This model card
```

## Limitations

- **Synthetic training data**: The model is trained entirely on programmatically generated text. Performance on real resumes with unusual formatting, non-English content, or domain-specific jargon may be lower than synthetic test metrics suggest.
- **English only**: All training data is in English. The model will not reliably classify sections in other languages.
- **Section granularity**: The model classifies pre-split sections. The built-in text splitter uses heuristics (section headers, paragraph breaks) that may not correctly segment all resume formats (e.g., single-column vs. multi-column PDFs, tables).
- **8 categories only**: Some resume sections (e.g., Publications, Volunteer Work, Languages, Interests, References) are not covered by the current label set. These will be assigned to the closest matching category.
- **Max length**: Input is truncated to 256 tokens. Very long sections may lose information.

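As an illustration of the header heuristic mentioned above, a minimal splitter might look like this (a hypothetical sketch, not the actual implementation in `inference.py`):

```python
import re

# Assumed header keywords; the real splitter may recognize more variants.
HEADER_RE = re.compile(
    r"^\s*(SUMMARY|EXPERIENCE|EDUCATION|SKILLS|PROJECTS|"
    r"CERTIFICATIONS|AWARDS|CONTACT)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text):
    """Split resume text on standalone section-header lines (sketch)."""
    matches = list(HEADER_RE.finditer(text))
    if not matches:
        return [("unknown", text.strip())]
    parts = []
    preamble = text[: matches[0].start()].strip()
    if preamble:  # text before the first header is usually contact info
        parts.append(("preamble", preamble))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts.append((m.group(1).lower(), text[m.end():end].strip()))
    return parts
```

Heuristics like this break down on multi-column layouts, which is exactly the section-granularity limitation noted above.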
## Intended Use

- Automated resume parsing pipelines
- HR tech and applicant tracking systems
- Resume formatting and analysis tools
- Educational demonstrations of text classification fine-tuning

This model is **not intended** for making hiring decisions. It is a text classification tool for structural parsing only.

## Author

[Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys) (gr8monk3ys)
data_generator.py ADDED
"""
Synthetic Resume Section Data Generator

Generates realistic resume section text across 8 categories for training
a text classifier. Uses template-based generation with randomized entities,
synonym replacement, and structural variation to produce diverse examples.

Author: Lorenzo Scaturchio (gr8monk3ys)
"""

import csv
import random
import itertools
from pathlib import Path
from typing import Optional

# ---------------------------------------------------------------------------
# Entity pools – used to fill templates with realistic variation
# ---------------------------------------------------------------------------

FIRST_NAMES = [
    "James", "Mary", "Robert", "Patricia", "John", "Jennifer", "Michael",
    "Linda", "David", "Elizabeth", "William", "Barbara", "Richard", "Susan",
    "Joseph", "Jessica", "Thomas", "Sarah", "Charles", "Karen", "Daniel",
    "Lisa", "Matthew", "Nancy", "Anthony", "Betty", "Mark", "Sandra",
    "Aisha", "Wei", "Carlos", "Priya", "Olga", "Hiroshi", "Fatima", "Liam",
    "Sofia", "Andrei", "Mei", "Alejandro", "Yuki", "Omar", "Elena", "Raj",
]

LAST_NAMES = [
    "Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller",
    "Davis", "Rodriguez", "Martinez", "Hernandez", "Lopez", "Gonzalez",
    "Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin",
    "Lee", "Perez", "Thompson", "White", "Harris", "Sanchez", "Clark",
    "Patel", "Chen", "Kim", "Nakamura", "Ivanov", "Silva", "Okafor",
]

COMPANIES = [
    "Google", "Microsoft", "Amazon", "Apple", "Meta", "Netflix", "Stripe",
    "Airbnb", "Uber", "Salesforce", "Adobe", "IBM", "Oracle", "Intel",
    "Tesla", "SpaceX", "Palantir", "Snowflake", "Databricks", "Confluent",
    "JPMorgan Chase", "Goldman Sachs", "Morgan Stanley", "Deloitte",
    "McKinsey & Company", "Boston Consulting Group", "Accenture",
    "Lockheed Martin", "Boeing", "Raytheon", "General Electric",
    "Procter & Gamble", "Johnson & Johnson", "Pfizer", "Moderna",
    "Shopify", "Square", "Twilio", "Cloudflare", "HashiCorp",
    "DataRobot", "Hugging Face", "OpenAI", "Anthropic", "Cohere",
    "Startup XYZ", "TechCorp Inc.", "InnovateTech", "DataDriven LLC",
]

UNIVERSITIES = [
    "Massachusetts Institute of Technology", "Stanford University",
    "Harvard University", "University of California, Berkeley",
    "Carnegie Mellon University", "Georgia Institute of Technology",
    "University of Michigan", "University of Illinois Urbana-Champaign",
    "California Institute of Technology", "Princeton University",
    "Columbia University", "University of Washington",
    "University of Texas at Austin", "Cornell University",
    "University of Pennsylvania", "University of Southern California",
    "New York University", "University of Wisconsin-Madison",
    "Duke University", "Northwestern University",
    "University of California, Los Angeles", "Rice University",
    "University of Maryland", "Purdue University",
    "Ohio State University", "Arizona State University",
    "University of Virginia", "University of Florida",
    "Boston University", "Northeastern University",
]

DEGREES = [
    ("Bachelor of Science", "B.S."),
    ("Bachelor of Arts", "B.A."),
    ("Master of Science", "M.S."),
    ("Master of Arts", "M.A."),
    ("Master of Business Administration", "MBA"),
    ("Doctor of Philosophy", "Ph.D."),
    ("Associate of Science", "A.S."),
    ("Bachelor of Engineering", "B.Eng."),
    ("Master of Engineering", "M.Eng."),
]

MAJORS = [
    "Computer Science", "Software Engineering", "Data Science",
    "Electrical Engineering", "Mechanical Engineering",
    "Information Technology", "Mathematics", "Statistics",
    "Business Administration", "Economics", "Finance",
    "Biomedical Engineering", "Chemical Engineering",
    "Civil Engineering", "Physics", "Biology",
    "Artificial Intelligence", "Machine Learning",
    "Human-Computer Interaction", "Cybersecurity",
    "Information Systems", "Operations Research",
]

MINORS = [
    "Mathematics", "Statistics", "Psychology", "Business",
    "Economics", "Philosophy", "Linguistics", "Physics",
    "Data Science", "Communication", "Sociology", "History",
]

GPA_VALUES = [
    "3.5", "3.6", "3.7", "3.8", "3.9", "4.0",
    "3.52", "3.65", "3.78", "3.85", "3.92", "3.45",
]

GRAD_YEARS = list(range(2015, 2027))

JOB_TITLES = [
    "Software Engineer", "Senior Software Engineer", "Staff Engineer",
    "Principal Engineer", "Engineering Manager", "Tech Lead",
    "Data Scientist", "Senior Data Scientist", "Machine Learning Engineer",
    "ML Research Scientist", "Data Engineer", "Data Analyst",
    "Product Manager", "Senior Product Manager", "Program Manager",
    "DevOps Engineer", "Site Reliability Engineer", "Cloud Architect",
    "Full Stack Developer", "Frontend Engineer", "Backend Engineer",
    "Mobile Developer", "iOS Engineer", "Android Developer",
    "QA Engineer", "Security Engineer", "Solutions Architect",
    "Research Scientist", "AI Engineer", "NLP Engineer",
    "Quantitative Analyst", "Financial Analyst", "Business Analyst",
    "UX Designer", "UI Engineer", "Technical Writer",
    "Intern", "Software Engineering Intern", "Data Science Intern",
]

PROGRAMMING_LANGUAGES = [
    "Python", "Java", "JavaScript", "TypeScript", "C++", "C", "C#",
    "Go", "Rust", "Kotlin", "Swift", "Ruby", "PHP", "Scala",
    "R", "MATLAB", "Julia", "Haskell", "Elixir", "Dart",
]

FRAMEWORKS = [
    "React", "Angular", "Vue.js", "Next.js", "Django", "Flask",
    "FastAPI", "Spring Boot", "Express.js", "Node.js", "Rails",
    "TensorFlow", "PyTorch", "Keras", "scikit-learn", "Pandas",
    "NumPy", "Spark", "Hadoop", "Kubernetes", "Docker",
    "AWS", "GCP", "Azure", "Terraform", "Ansible",
    ".NET", "Laravel", "Svelte", "Remix", "Astro",
]

TOOLS = [
    "Git", "GitHub", "GitLab", "Jira", "Confluence", "Slack",
    "VS Code", "IntelliJ", "PyCharm", "Vim", "Emacs",
    "PostgreSQL", "MySQL", "MongoDB", "Redis", "Elasticsearch",
    "Kafka", "RabbitMQ", "Airflow", "dbt", "Snowflake",
    "Tableau", "Power BI", "Grafana", "Prometheus", "Datadog",
    "Jenkins", "CircleCI", "GitHub Actions", "ArgoCD",
    "Figma", "Sketch", "Adobe XD", "Postman", "Swagger",
]

SOFT_SKILLS = [
    "Leadership", "Communication", "Team Collaboration",
    "Problem Solving", "Critical Thinking", "Time Management",
    "Project Management", "Agile Methodologies", "Scrum",
    "Cross-functional Collaboration", "Mentoring",
    "Strategic Planning", "Stakeholder Management",
    "Technical Writing", "Public Speaking", "Negotiation",
]

CERTIFICATIONS_LIST = [
    "AWS Certified Solutions Architect - Associate",
    "AWS Certified Developer - Associate",
    "AWS Certified Machine Learning - Specialty",
    "Google Cloud Professional Data Engineer",
    "Google Cloud Professional ML Engineer",
    "Microsoft Azure Fundamentals (AZ-900)",
    "Microsoft Azure Data Scientist Associate (DP-100)",
    "Certified Kubernetes Administrator (CKA)",
    "Certified Kubernetes Application Developer (CKAD)",
    "Certified Information Systems Security Professional (CISSP)",
    "CompTIA Security+",
    "Project Management Professional (PMP)",
    "Certified ScrumMaster (CSM)",
    "TensorFlow Developer Certificate",
    "Databricks Certified Data Engineer Associate",
    "Snowflake SnowPro Core Certification",
    "HashiCorp Terraform Associate",
    "Cisco Certified Network Associate (CCNA)",
    "Oracle Certified Professional, Java SE",
    "Red Hat Certified System Administrator (RHCSA)",
    "Deep Learning Specialization (Coursera)",
    "Machine Learning by Stanford (Coursera)",
    "Professional Scrum Master I (PSM I)",
]

AWARDS_LIST = [
    "Dean's List", "Summa Cum Laude", "Magna Cum Laude", "Cum Laude",
    "Phi Beta Kappa", "Tau Beta Pi", "National Merit Scholar",
    "Employee of the Quarter", "Spot Bonus Award", "President's Club",
    "Best Paper Award", "Innovation Award", "Hackathon Winner",
    "Outstanding Graduate Student Award", "Research Fellowship",
    "Teaching Assistant Excellence Award", "Community Service Award",
    "IEEE Best Student Paper", "ACM ICPC Regional Finalist",
    "Google Code Jam Qualifier", "Facebook Hacker Cup Participant",
    "Patent Holder", "Top Performer Award", "Rising Star Award",
]

CITIES = [
    "San Francisco, CA", "New York, NY", "Seattle, WA", "Austin, TX",
    "Boston, MA", "Chicago, IL", "Los Angeles, CA", "Denver, CO",
    "Portland, OR", "Atlanta, GA", "Washington, DC", "San Jose, CA",
    "Raleigh, NC", "Pittsburgh, PA", "Minneapolis, MN", "Dallas, TX",
    "Miami, FL", "Phoenix, AZ", "San Diego, CA", "Philadelphia, PA",
]

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

MONTHS_SHORT = [
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
]

PROJECT_ADJECTIVES = [
    "Real-time", "Scalable", "Distributed", "Cloud-native",
    "AI-powered", "Automated", "Interactive", "Cross-platform",
    "Open-source", "End-to-end", "High-performance", "Serverless",
    "Event-driven", "Microservice-based", "Full-stack",
]

PROJECT_NOUNS = [
    "Dashboard", "Platform", "Pipeline", "Application", "System",
    "API", "Framework", "Tool", "Service", "Engine",
    "Chatbot", "Recommendation System", "Search Engine",
    "Analytics Platform", "Monitoring System", "Marketplace",
]

IMPACT_METRICS = [
    "reduced latency by {pct}%",
    "improved throughput by {pct}%",
    "increased user engagement by {pct}%",
    "decreased error rate by {pct}%",
    "saved ${amount}K annually",
    "reduced costs by {pct}%",
    "improved accuracy by {pct}%",
    "increased conversion rate by {pct}%",
    "served {users} daily active users",
    "processed {events} events per second",
    "reduced deployment time from hours to minutes",
    "cut onboarding time by {pct}%",
    "automated {pct}% of manual processes",
    "improved model F1 score from 0.{f1_old} to 0.{f1_new}",
]

PHONE_AREA_CODES = [
    "415", "650", "408", "510", "212", "646", "718", "206",
    "512", "617", "312", "213", "303", "503", "404", "202",
]

LINKEDIN_PREFIXES = [
    "linkedin.com/in/", "www.linkedin.com/in/",
]

GITHUB_PREFIXES = [
    "github.com/", "www.github.com/",
]

DOMAINS = [
    "gmail.com", "outlook.com", "yahoo.com", "protonmail.com",
    "icloud.com", "hotmail.com", "mail.com",
]

# ---------------------------------------------------------------------------
# Synonym replacement pools for augmentation
# ---------------------------------------------------------------------------

SYNONYMS = {
    "developed": ["built", "created", "engineered", "designed", "implemented", "constructed", "authored"],
    "managed": ["led", "oversaw", "directed", "supervised", "coordinated", "administered"],
    "improved": ["enhanced", "optimized", "upgraded", "refined", "boosted", "strengthened"],
    "implemented": ["deployed", "executed", "delivered", "rolled out", "launched", "shipped"],
    "analyzed": ["examined", "evaluated", "assessed", "investigated", "studied", "reviewed"],
    "collaborated": ["partnered", "worked closely with", "teamed up with", "cooperated with"],
    "responsible for": ["in charge of", "accountable for", "tasked with", "owned"],
    "utilized": ["leveraged", "employed", "used", "applied", "harnessed"],
    "achieved": ["accomplished", "attained", "reached", "secured", "delivered"],
    "experience": ["expertise", "background", "proficiency", "track record"],
}


# ---------------------------------------------------------------------------
# Helper utilities
# ---------------------------------------------------------------------------

def _pick(pool, k=1):
    """Return k unique random items from a pool."""
    k = min(k, len(pool))
    return random.sample(pool, k)


def _pick_one(pool):
    return random.choice(pool)


def _date_range(allow_present: bool = True):
    """Return a random date range string."""
    start_year = random.randint(2014, 2024)
    start_month = _pick_one(MONTHS_SHORT)
    fmt = random.choice(["short", "long", "year_only"])

    if allow_present and random.random() < 0.3:
        end_str = random.choice(["Present", "Current", "Now"])
    else:
        end_year = random.randint(start_year, min(start_year + 6, 2026))
        end_month = _pick_one(MONTHS_SHORT)
        if fmt == "short":
            end_str = f"{end_month} {end_year}"
        elif fmt == "long":
            end_str = f"{_pick_one(MONTHS)} {end_year}"
        else:
            end_str = str(end_year)

    if fmt == "short":
        start_str = f"{start_month} {start_year}"
    elif fmt == "long":
        start_str = f"{_pick_one(MONTHS)} {start_year}"
    else:
        start_str = str(start_year)

    sep = random.choice([" - ", " – ", " to ", "–", "-"])
    return f"{start_str}{sep}{end_str}"


def _impact():
    """Generate a random impact metric string."""
    template = _pick_one(IMPACT_METRICS)
    return template.format(
        pct=random.randint(10, 85),
        amount=random.randint(50, 500),
        users=random.choice(["10K", "50K", "100K", "500K", "1M", "5M"]),
        events=random.choice(["1K", "10K", "50K", "100K", "1M"]),
        f1_old=random.randint(65, 80),
        f1_new=random.randint(82, 97),
    )


def _synonym_replace(text: str) -> str:
    """Randomly replace words with synonyms for augmentation."""
    words = text.split()
    result = []
    for w in words:
        lower = w.lower().rstrip(".,;:")
        if lower in SYNONYMS and random.random() < 0.3:
            replacement = _pick_one(SYNONYMS[lower])
            # Preserve original capitalization of first char
            if w[0].isupper():
                replacement = replacement.capitalize()
            # Preserve trailing punctuation
            trailing = w[len(lower):]
            result.append(replacement + trailing)
        else:
            result.append(w)
    return " ".join(result)


def _bullet():
    """Return a random bullet character."""
    return random.choice(["•", "-", "●", "*", "▪", ""])


def _reorder_bullets(bullets: list) -> list:
    """Shuffle bullet points for variation."""
    shuffled = bullets.copy()
    random.shuffle(shuffled)
    return shuffled

# ---------------------------------------------------------------------------
# Section generators – each returns a string of realistic text
# ---------------------------------------------------------------------------

def generate_education() -> str:
    """Generate a realistic education section."""

    # Template 1: Full formal entry
    def _t1():
        uni = _pick_one(UNIVERSITIES)
        deg_full, deg_short = _pick_one(DEGREES)
        major = _pick_one(MAJORS)
        year = _pick_one(GRAD_YEARS)
        lines = []

        header_style = random.choice(["full", "short", "inline"])
        if header_style == "full":
            lines.append(f"{deg_full} in {major}")
            lines.append(f"{uni}")
            lines.append(f"Graduated: {_pick_one(MONTHS)} {year}")
        elif header_style == "short":
            lines.append(f"{deg_short} {major}, {uni} ({year})")
        else:
            lines.append(f"{uni} — {deg_full} in {major}, {year}")

        # Optional GPA
        if random.random() < 0.6:
            gpa = _pick_one(GPA_VALUES)
            lines.append(f"GPA: {gpa}/4.0")

        # Optional minor
        if random.random() < 0.3:
            minor = _pick_one(MINORS)
            lines.append(f"Minor in {minor}")

        # Optional coursework
        if random.random() < 0.5:
            courses = _pick(MAJORS + ["Algorithms", "Data Structures",
                                      "Operating Systems", "Database Systems",
                                      "Computer Networks", "Linear Algebra",
                                      "Probability and Statistics",
                                      "Deep Learning", "Natural Language Processing",
                                      "Computer Vision", "Distributed Systems"],
                            k=random.randint(3, 6))
            prefix = random.choice(["Relevant Coursework:", "Key Courses:", "Coursework:"])
            lines.append(f"{prefix} {', '.join(courses)}")

        # Optional honors
        if random.random() < 0.3:
            honor = random.choice(["Summa Cum Laude", "Magna Cum Laude",
                                   "Cum Laude", "Dean's List (all semesters)",
                                   "Honors Program", "University Scholar"])
            lines.append(honor)

        # Optional thesis
        if "Ph.D." in deg_short or ("M.S." in deg_short and random.random() < 0.4):
            topic = random.choice([
                "Transformer-based approaches to document classification",
                "Scalable distributed systems for real-time data processing",
                "Graph neural networks for molecular property prediction",
                "Federated learning in healthcare applications",
                "Efficient attention mechanisms for long-sequence modeling",
                "Reinforcement learning for autonomous navigation",
            ])
            label = "Dissertation" if "Ph.D." in deg_short else "Thesis"
            lines.append(f"{label}: \"{topic}\"")

        return "\n".join(lines)

    # Template 2: Multiple degrees
    def _t2():
        entries = []
        for _ in range(random.randint(2, 3)):
            uni = _pick_one(UNIVERSITIES)
            deg_full, deg_short = _pick_one(DEGREES)
            major = _pick_one(MAJORS)
            year = _pick_one(GRAD_YEARS)
            gpa_line = f" | GPA: {_pick_one(GPA_VALUES)}" if random.random() < 0.5 else ""
            entries.append(f"{deg_short} in {major}, {uni}, {year}{gpa_line}")
        return "\n".join(entries)

    # Template 3: Education with activities
    def _t3():
        uni = _pick_one(UNIVERSITIES)
        deg_full, deg_short = _pick_one(DEGREES)
        major = _pick_one(MAJORS)
        year = _pick_one(GRAD_YEARS)
        lines = [f"{uni}", f"{deg_full} in {major} | {_pick_one(MONTHS)} {year}"]

        activities = random.sample([
            "Teaching Assistant for Introduction to Computer Science",
            "President, Computer Science Student Association",
            "Member, ACM Student Chapter",
            "Undergraduate Research Assistant, ML Lab",
            "Peer Tutor, Mathematics Department",
            "Captain, University Programming Competition Team",
            "Volunteer, Engineering Outreach Program",
            "Member, Honors College",
            "Study Abroad Program, Technical University of Munich",
            "Resident Advisor, Engineering Living-Learning Community",
        ], k=random.randint(1, 3))

        b = _bullet()
        for a in activities:
            lines.append(f"{b} {a}" if b else a)

        return "\n".join(lines)

    templates = [_t1, _t2, _t3]
    return random.choice(templates)()


def generate_experience() -> str:
    """Generate a realistic work experience section."""

    def _single_role():
        title = _pick_one(JOB_TITLES)
        company = _pick_one(COMPANIES)
        city = _pick_one(CITIES)
        date_range = _date_range()

        header_styles = [
            f"{title} | {company} | {city} | {date_range}",
            f"{title}, {company}\n{city} | {date_range}",
            f"{company} — {title}\n{date_range} | {city}",
            f"{title}\n{company}, {city}\n{date_range}",
        ]
        lines = [random.choice(header_styles)]

        # Generate bullet points
        bullet_templates = [
            f"Developed and maintained {random.choice(['microservices', 'APIs', 'web applications', 'data pipelines', 'ML models', 'backend systems', 'frontend components'])} using {', '.join(_pick(PROGRAMMING_LANGUAGES, k=random.randint(1, 3)))} and {', '.join(_pick(FRAMEWORKS, k=random.randint(1, 2)))}",
            f"Collaborated with cross-functional teams of {random.randint(3, 15)} engineers to deliver {random.choice(['product features', 'platform improvements', 'system migrations', 'infrastructure upgrades'])} on schedule",
            f"Designed and implemented {random.choice(['CI/CD pipelines', 'testing frameworks', 'monitoring solutions', 'data models', 'caching strategies', 'authentication systems'])} that {_impact()}",
            f"Led migration of {random.choice(['legacy monolith', 'on-premise infrastructure', 'batch processing system', 'manual workflows'])} to {random.choice(['cloud-native architecture', 'microservices', 'real-time streaming', 'automated pipelines'])}",
            f"Mentored {random.randint(2, 8)} junior engineers through code reviews, pair programming, and technical design sessions",
            f"Optimized {random.choice(['database queries', 'API response times', 'model inference', 'data processing pipelines', 'search indexing'])} resulting in {_impact()}",
            f"Wrote comprehensive technical documentation and {random.choice(['RFCs', 'design docs', 'runbooks', 'architecture decision records'])} for {random.choice(['system design', 'API contracts', 'deployment procedures', 'incident response'])}",
            f"Built {random.choice(['real-time', 'batch', 'streaming', 'event-driven'])} {random.choice(['data pipeline', 'ETL process', 'analytics system', 'feature store'])} processing {random.choice(['1M+', '10M+', '100M+', '1B+'])} records {random.choice(['daily', 'per hour', 'in real-time'])}",
            f"Spearheaded adoption of {_pick_one(FRAMEWORKS)} and {_pick_one(TOOLS)}, {_impact()}",
            f"Conducted A/B testing and experimentation for {random.choice(['recommendation engine', 'search ranking', 'pricing model', 'onboarding flow', 'notification system'])}, {_impact()}",
            f"Architected {random.choice(['distributed', 'fault-tolerant', 'highly available', 'horizontally scalable'])} system handling {random.choice(['10K', '50K', '100K', '1M'])} requests per second with {random.choice(['99.9%', '99.95%', '99.99%'])} uptime",
        ]

        n_bullets = random.randint(2, 5)
        selected = random.sample(bullet_templates, min(n_bullets, len(bullet_templates)))
        selected = _reorder_bullets(selected)
        b = _bullet()
        for bullet in selected:
            lines.append(f"{b} {bullet}" if b else bullet)

        return "\n".join(lines)

    # Sometimes include multiple roles
    n_roles = random.choices([1, 2], weights=[0.7, 0.3])[0]
    roles = [_single_role() for _ in range(n_roles)]
    return "\n\n".join(roles)


def generate_skills() -> str:
    """Generate a realistic skills section."""
527
+ templates = []
528
+
529
+ def _t_categorized():
530
+ lines = []
531
+ categories = []
532
+
533
+ if random.random() < 0.9:
534
+ langs = _pick(PROGRAMMING_LANGUAGES, k=random.randint(3, 7))
535
+ label = random.choice(["Languages", "Programming Languages", "Programming"])
536
+ categories.append((label, langs))
537
+
538
+ if random.random() < 0.9:
539
+ fws = _pick(FRAMEWORKS, k=random.randint(3, 7))
540
+ label = random.choice(["Frameworks", "Frameworks & Libraries", "Technologies"])
541
+ categories.append((label, fws))
542
+
543
+ if random.random() < 0.8:
544
+ tls = _pick(TOOLS, k=random.randint(3, 7))
545
+ label = random.choice(["Tools", "Developer Tools", "Tools & Platforms"])
546
+ categories.append((label, tls))
547
+
548
+ if random.random() < 0.4:
549
+ ss = _pick(SOFT_SKILLS, k=random.randint(2, 5))
550
+ label = random.choice(["Soft Skills", "Other Skills", "Additional Skills"])
551
+ categories.append((label, ss))
552
+
553
+ sep = random.choice([": ", " - ", " — "])
554
+ for label, items in categories:
555
+ joiner = random.choice([", ", " | ", " · ", " / "])
556
+ lines.append(f"{label}{sep}{joiner.join(items)}")
557
+
558
+ return "\n".join(lines)
559
+
560
+ def _t_flat():
561
+ all_skills = (_pick(PROGRAMMING_LANGUAGES, k=random.randint(3, 6)) +
562
+ _pick(FRAMEWORKS, k=random.randint(3, 6)) +
563
+ _pick(TOOLS, k=random.randint(2, 4)))
564
+ random.shuffle(all_skills)
565
+ joiner = random.choice([", ", " | ", " · ", " • "])
566
+ return joiner.join(all_skills)
567
+
568
+ def _t_proficiency():
569
+ lines = []
570
+ levels = ["Expert", "Advanced", "Proficient", "Intermediate", "Familiar"]
571
+ used = set()
572
+ for level in random.sample(levels, k=random.randint(2, 4)):
573
+ pool = [s for s in PROGRAMMING_LANGUAGES + FRAMEWORKS + TOOLS if s not in used]
574
+ items = _pick(pool, k=random.randint(2, 5))
575
+ used.update(items)
576
+ lines.append(f"{level}: {', '.join(items)}")
577
+ return "\n".join(lines)
578
+
579
+ templates = [_t_categorized, _t_flat, _t_proficiency]
580
+ return random.choice(templates)()
581
+
582
+
583
+ def generate_projects() -> str:
584
+ """Generate a realistic projects section."""
585
+
586
+ def _single_project():
587
+ adj = _pick_one(PROJECT_ADJECTIVES)
588
+ noun = _pick_one(PROJECT_NOUNS)
589
+ name = f"{adj} {noun}"
590
+ techs = _pick(PROGRAMMING_LANGUAGES + FRAMEWORKS, k=random.randint(2, 5))
591
+
592
+ header_styles = [
593
+ f"{name} | {', '.join(techs)}",
594
+ f"{name}\nTechnologies: {', '.join(techs)}",
595
+ f"{name} ({', '.join(techs)})",
596
+ ]
597
+ lines = [random.choice(header_styles)]
598
+
599
+ # Optional link
600
+ if random.random() < 0.3:
601
+ username = _pick_one(FIRST_NAMES).lower() + _pick_one(LAST_NAMES).lower()
602
+ lines.append(f"github.com/{username}/{name.lower().replace(' ', '-')}")
603
+
604
+ descriptions = [
605
+ f"Built a {noun.lower()} that {random.choice(['processes', 'analyzes', 'visualizes', 'aggregates', 'transforms'])} {random.choice(['user data', 'financial data', 'text documents', 'sensor data', 'social media feeds', 'medical records'])} in real-time",
606
+ f"Implemented {random.choice(['REST API', 'GraphQL API', 'gRPC service', 'WebSocket server', 'event-driven architecture'])} with {random.choice(['authentication', 'rate limiting', 'caching', 'pagination', 'logging'])} support",
607
+ f"Trained {random.choice(['classification', 'regression', 'NLP', 'computer vision', 'recommendation'])} model achieving {random.choice(['92%', '95%', '97%', '89%', '94%'])} {random.choice(['accuracy', 'F1 score', 'AUC-ROC'])} on test set",
608
+ f"Deployed to {random.choice(['AWS', 'GCP', 'Azure', 'Heroku', 'Vercel', 'Railway'])} with {random.choice(['Docker', 'Kubernetes', 'serverless', 'auto-scaling'])} configuration",
609
+ f"Attracted {random.choice(['100+', '500+', '1K+', '5K+'])} GitHub stars and {random.choice(['20+', '50+', '100+'])} contributors from the open-source community",
610
+ f"Features {random.choice(['real-time notifications', 'responsive UI', 'role-based access control', 'data export', 'interactive visualizations', 'natural language search'])}",
611
+ ]
612
+
613
+ b = _bullet()
614
+ for desc in random.sample(descriptions, k=random.randint(2, 4)):
615
+ lines.append(f"{b} {desc}" if b else desc)
616
+
617
+ return "\n".join(lines)
618
+
619
+ n_projects = random.randint(1, 3)
620
+ return "\n\n".join([_single_project() for _ in range(n_projects)])
621
+
622
+
623
+ def generate_summary() -> str:
624
+ """Generate a realistic professional summary / objective section."""
625
+ years = random.randint(2, 15)
626
+ specialties = _pick(MAJORS + [
627
+ "full-stack development", "distributed systems", "machine learning",
628
+ "data engineering", "cloud architecture", "mobile development",
629
+ "DevOps", "backend development", "frontend development",
630
+ "natural language processing", "computer vision",
631
+ ], k=random.randint(1, 3))
632
+
633
+ templates = [
634
+ # Template 1: Traditional summary
635
+ lambda: f"Results-driven {_pick_one(JOB_TITLES).lower()} with {years}+ years of experience in {' and '.join(specialties)}. Proven track record of {random.choice(['delivering high-impact solutions', 'building scalable systems', 'driving technical excellence', 'leading cross-functional teams'])} at companies like {_pick_one(COMPANIES)} and {_pick_one(COMPANIES)}. Passionate about {random.choice(['clean code', 'system design', 'open source', 'mentorship', 'continuous learning', 'innovation'])} and {random.choice(['building products that scale', 'solving complex problems', 'leveraging data-driven insights', 'improving developer experience'])}.",
636
+
637
+ # Template 2: Technical focus
638
+ lambda: f"Experienced {_pick_one(JOB_TITLES).lower()} specializing in {', '.join(specialties)}. Skilled in {', '.join(_pick(PROGRAMMING_LANGUAGES, k=3))} with deep expertise in {', '.join(_pick(FRAMEWORKS, k=2))}. {random.choice(['Strong background in', 'Demonstrated ability in', 'Track record of'])} {random.choice(['building distributed systems at scale', 'developing ML models for production', 'architecting cloud-native applications', 'leading agile engineering teams'])}. Seeking to {random.choice(['contribute to cutting-edge products', 'drive technical innovation', 'solve challenging problems', 'build impactful technology'])} at a {random.choice(['fast-growing startup', 'leading technology company', 'mission-driven organization'])}.",
639
+
640
+ # Template 3: Achievement-oriented
641
+ lambda: f"{_pick_one(JOB_TITLES)} with {years} years of experience building {random.choice(['enterprise-scale', 'consumer-facing', 'B2B', 'data-intensive'])} applications. Key achievements include {_impact()}, {_impact()}, and {_impact()}. Proficient in {', '.join(_pick(PROGRAMMING_LANGUAGES, k=3))} and {', '.join(_pick(FRAMEWORKS, k=2))}.",
642
+
643
+ # Template 4: Brief objective
644
+ lambda: f"Motivated {random.choice(['professional', 'engineer', 'developer', 'technologist'])} seeking a {_pick_one(JOB_TITLES).lower()} role where I can apply my expertise in {' and '.join(specialties)} to {random.choice(['build innovative products', 'solve real-world problems', 'drive business impact', 'push the boundaries of technology'])}.",
645
+
646
+ # Template 5: Narrative style
647
+ lambda: f"I am a {_pick_one(JOB_TITLES).lower()} who thrives at the intersection of {_pick_one(specialties)} and {_pick_one(specialties)}. Over the past {years} years, I have {random.choice(['shipped products used by millions', 'built ML systems processing petabytes of data', 'led engineering teams through rapid growth', 'contributed to open-source projects with thousands of stars'])}. I bring a {random.choice(['data-driven', 'user-centric', 'systems-thinking', 'first-principles'])} approach to every problem I tackle.",
648
+ ]
649
+
650
+ return random.choice(templates)()
651
+
652
+
653
+ def generate_certifications() -> str:
654
+ """Generate a realistic certifications section."""
655
+ n = random.randint(2, 6)
656
+ certs = _pick(CERTIFICATIONS_LIST, k=n)
657
+
658
+ lines = []
659
+ for cert in certs:
660
+ year = random.randint(2019, 2025)
661
+ styles = [
662
+ f"{cert} ({year})",
663
+ f"{cert} — Issued {_pick_one(MONTHS)} {year}",
664
+ f"{cert}, {year}",
665
+ f"{cert}\n Issued: {_pick_one(MONTHS_SHORT)} {year}" + (
666
+ f" | Expires: {_pick_one(MONTHS_SHORT)} {year + random.randint(2, 3)}"
667
+ if random.random() < 0.3 else ""
668
+ ),
669
+ ]
670
+ lines.append(random.choice(styles))
671
+
672
+ b = _bullet()
673
+ if b and random.random() < 0.5:
674
+ return "\n".join(f"{b} {line}" for line in lines)
675
+ return "\n".join(lines)
676
+
677
+
678
+ def generate_contact() -> str:
679
+ """Generate a realistic contact information section."""
680
+ first = _pick_one(FIRST_NAMES)
681
+ last = _pick_one(LAST_NAMES)
682
+ city = _pick_one(CITIES)
683
+ area_code = _pick_one(PHONE_AREA_CODES)
684
+ email_user = random.choice([
685
+ f"{first.lower()}.{last.lower()}",
686
+ f"{first.lower()}{last.lower()}",
687
+ f"{first[0].lower()}{last.lower()}",
688
+ f"{first.lower()}_{last.lower()}",
689
+ f"{first.lower()}{random.randint(1, 99)}",
690
+ ])
691
+ email = f"{email_user}@{_pick_one(DOMAINS)}"
692
+ phone = f"({area_code}) {random.randint(100,999)}-{random.randint(1000,9999)}"
693
+ linkedin_user = f"{first.lower()}-{last.lower()}-{random.randint(100, 999)}"
694
+ github_user = f"{first.lower()}{last.lower()}"
695
+
696
+ parts = [f"{first} {last}"]
697
+
698
+ if random.random() < 0.8:
699
+ parts.append(email)
700
+ if random.random() < 0.7:
701
+ parts.append(phone)
702
+ if random.random() < 0.6:
703
+ parts.append(city)
704
+ if random.random() < 0.5:
705
+ parts.append(f"{_pick_one(LINKEDIN_PREFIXES)}{linkedin_user}")
706
+ if random.random() < 0.4:
707
+ parts.append(f"{_pick_one(GITHUB_PREFIXES)}{github_user}")
708
+ if random.random() < 0.2:
709
+ parts.append(f"{github_user}.dev" if random.random() < 0.5 else f"{first.lower()}{last.lower()}.com")
710
+
711
+ sep = random.choice(["\n", " | ", " · ", "\n"])
712
+ return sep.join(parts)
713
+
714
+
715
+ def generate_awards() -> str:
716
+ """Generate a realistic awards & honors section."""
717
+ n = random.randint(2, 6)
718
+ awards = _pick(AWARDS_LIST, k=n)
719
+ lines = []
720
+
721
+ for award in awards:
722
+ year = random.randint(2015, 2025)
723
+ org = random.choice([
724
+ _pick_one(UNIVERSITIES),
725
+ _pick_one(COMPANIES),
726
+ random.choice(["ACM", "IEEE", "Google", "Facebook", "Microsoft",
727
+ "National Science Foundation", "Department of Education"]),
728
+ ])
729
+ styles = [
730
+ f"{award}, {org} ({year})",
731
+ f"{award} — {org}, {year}",
732
+ f"{award} ({year})\n Awarded by {org}",
733
+ f"{award}, {year}",
734
+ ]
735
+ lines.append(random.choice(styles))
736
+
737
+ b = _bullet()
738
+ if b and random.random() < 0.6:
739
+ return "\n".join(f"{b} {line}" for line in lines)
740
+ return "\n".join(lines)
741
+
742
+
743
+ # ---------------------------------------------------------------------------
744
+ # Optional section headers – sometimes sections include a heading
745
+ # ---------------------------------------------------------------------------
746
+
747
+ SECTION_HEADERS = {
748
+ "education": ["EDUCATION", "Education", "Academic Background", "ACADEMIC BACKGROUND", "Education & Training"],
749
+ "experience": ["EXPERIENCE", "Experience", "WORK EXPERIENCE", "Work Experience", "PROFESSIONAL EXPERIENCE", "Professional Experience", "Employment History"],
750
+ "skills": ["SKILLS", "Skills", "TECHNICAL SKILLS", "Technical Skills", "Core Competencies", "CORE COMPETENCIES", "Technologies"],
751
+ "projects": ["PROJECTS", "Projects", "PERSONAL PROJECTS", "Personal Projects", "SIDE PROJECTS", "Selected Projects", "Portfolio"],
752
+ "summary": ["SUMMARY", "Summary", "PROFESSIONAL SUMMARY", "Professional Summary", "OBJECTIVE", "Objective", "PROFILE", "Profile", "About Me", "ABOUT"],
753
+ "certifications": ["CERTIFICATIONS", "Certifications", "CERTIFICATES", "Certificates", "Licenses & Certifications", "PROFESSIONAL CERTIFICATIONS"],
754
+ "contact": ["CONTACT", "Contact", "CONTACT INFORMATION", "Contact Information", "Personal Information"],
755
+ "awards": ["AWARDS", "Awards", "HONORS & AWARDS", "Honors & Awards", "ACHIEVEMENTS", "Achievements", "Awards & Honors", "RECOGNITION"],
756
+ }
757
+
758
+ GENERATORS = {
759
+ "education": generate_education,
760
+ "experience": generate_experience,
761
+ "skills": generate_skills,
762
+ "projects": generate_projects,
763
+ "summary": generate_summary,
764
+ "certifications": generate_certifications,
765
+ "contact": generate_contact,
766
+ "awards": generate_awards,
767
+ }
768
+
769
+
770
+ # ---------------------------------------------------------------------------
771
+ # Dataset generation
772
+ # ---------------------------------------------------------------------------
773
+
774
+ def generate_example(label: str, include_header: bool = False, augment: bool = False) -> str:
775
+ """
776
+ Generate a single synthetic example for the given label.
777
+
778
+ Args:
779
+ label: One of the 8 section categories.
780
+ include_header: Whether to prepend a section header.
781
+ augment: Whether to apply text augmentation.
782
+
783
+ Returns:
784
+ Generated text string.
785
+ """
786
+ text = GENERATORS[label]()
787
+
788
+ # Optionally prepend a section header
789
+ if include_header and random.random() < 0.5:
790
+ header = _pick_one(SECTION_HEADERS[label])
791
+ sep = random.choice(["\n", "\n\n", "\n---\n"])
792
+ text = f"{header}{sep}{text}"
793
+
794
+ # Augmentation
795
+ if augment:
796
+ if random.random() < 0.4:
797
+ text = _synonym_replace(text)
798
+ # Randomly add/remove trailing whitespace or newlines
799
+ if random.random() < 0.2:
800
+ text = text.strip() + "\n"
801
+ if random.random() < 0.1:
802
+ text = " " + text
803
+
804
+ return text
805
+
806
+
807
+ def generate_dataset(
808
+ examples_per_category: int = 80,
809
+ augmented_copies: int = 2,
810
+ include_header_prob: float = 0.4,
811
+ seed: int = 42,
812
+ ) -> list[dict]:
813
+ """
814
+ Generate a complete synthetic dataset.
815
+
816
+ Args:
817
+ examples_per_category: Base examples per category.
818
+ augmented_copies: Number of augmented copies per base example.
819
+ include_header_prob: Probability of including section header.
820
+ seed: Random seed for reproducibility.
821
+
822
+ Returns:
823
+ List of dicts with 'text' and 'label' keys.
824
+ """
825
+ random.seed(seed)
826
+ labels = list(GENERATORS.keys())
827
+ dataset = []
828
+
829
+ for label in labels:
830
+ for i in range(examples_per_category):
831
+ include_header = random.random() < include_header_prob
832
+ text = generate_example(label, include_header=include_header, augment=False)
833
+ dataset.append({"text": text, "label": label})
834
+
835
+ # Generate augmented versions
836
+ for _ in range(augmented_copies):
837
+ aug_text = generate_example(label, include_header=include_header, augment=True)
838
+ dataset.append({"text": aug_text, "label": label})
839
+
840
+ random.shuffle(dataset)
841
+ return dataset
842
+
843
+
844
+ def save_to_csv(dataset: list[dict], path: str) -> None:
845
+ """Save dataset to CSV."""
846
+ filepath = Path(path)
847
+ filepath.parent.mkdir(parents=True, exist_ok=True)
848
+ with open(filepath, "w", newline="", encoding="utf-8") as f:
849
+ writer = csv.DictWriter(f, fieldnames=["text", "label"])
850
+ writer.writeheader()
851
+ writer.writerows(dataset)
852
+ print(f"Saved {len(dataset)} examples to {filepath}")
853
+
854
+
855
+ def load_as_hf_dataset(dataset: list[dict]):
+     """Convert to HuggingFace Dataset with stratified train/val/test splits."""
+     from datasets import ClassLabel, Dataset, DatasetDict
+
+     ds = Dataset.from_list(dataset)
+
+     # stratify_by_column requires a ClassLabel feature, so cast the string
+     # labels to ClassLabel before splitting.
+     label_names = sorted(set(d["label"] for d in dataset))
+     ds = ds.cast_column("label", ClassLabel(names=label_names))
+
+     # 80/10/10 split
+     train_test = ds.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
+     val_test = train_test["test"].train_test_split(test_size=0.5, seed=42, stratify_by_column="label")
+
+     return DatasetDict({
+         "train": train_test["train"],
+         "validation": val_test["train"],
+         "test": val_test["test"],
+     })
+
+
+ def get_label_mapping(dataset: list[dict]) -> tuple[dict, dict]:
+     """Create label <-> id mappings."""
+     labels = sorted(set(d["label"] for d in dataset))
+     label2id = {label: idx for idx, label in enumerate(labels)}
+     id2label = {idx: label for label, idx in label2id.items()}
+     return label2id, id2label
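Applied standalone, the mapping above comes out sorted and contiguous; a quick sketch with a hypothetical three-row dataset (labels and texts are made up for illustration):

```python
# Standalone sketch of the label <-> id mapping logic in get_label_mapping().
dataset = [
    {"text": "Python, Java, React", "label": "skills"},
    {"text": "B.S. in Computer Science", "label": "education"},
    {"text": "Software Engineer at Acme", "label": "experience"},
]
labels = sorted(set(d["label"] for d in dataset))
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}
print(label2id)  # {'education': 0, 'experience': 1, 'skills': 2}
```

Because the labels are sorted before enumeration, the ids are stable across runs regardless of dataset order.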
+
+
+ # ---------------------------------------------------------------------------
+ # CLI entry point
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser(description="Generate synthetic resume section data")
+     parser.add_argument("--examples-per-category", type=int, default=80,
+                         help="Number of base examples per category (default: 80)")
+     parser.add_argument("--augmented-copies", type=int, default=2,
+                         help="Number of augmented copies per example (default: 2)")
+     parser.add_argument("--output", type=str, default="data/resume_sections.csv",
+                         help="Output CSV path (default: data/resume_sections.csv)")
+     parser.add_argument("--seed", type=int, default=42,
+                         help="Random seed (default: 42)")
+     parser.add_argument("--print-stats", action="store_true",
+                         help="Print dataset statistics")
+     parser.add_argument("--print-samples", type=int, default=0,
+                         help="Print N sample examples")
+
+     args = parser.parse_args()
+
+     print(f"Generating dataset with {args.examples_per_category} base examples per category...")
+     print(f"Augmented copies per example: {args.augmented_copies}")
+     print(f"Total expected examples: {args.examples_per_category * (1 + args.augmented_copies) * 8}")
+
+     dataset = generate_dataset(
+         examples_per_category=args.examples_per_category,
+         augmented_copies=args.augmented_copies,
+         seed=args.seed,
+     )
+
+     save_to_csv(dataset, args.output)
+
+     if args.print_stats:
+         from collections import Counter
+         counts = Counter(d["label"] for d in dataset)
+         print("\nDataset Statistics:")
+         print(f"  Total examples: {len(dataset)}")
+         print(f"  Categories: {len(counts)}")
+         for label, count in sorted(counts.items()):
+             print(f"    {label}: {count}")
+         avg_len = sum(len(d["text"]) for d in dataset) / len(dataset)
+         print(f"  Average text length: {avg_len:.0f} chars")
+
+     if args.print_samples > 0:
+         print(f"\n{'='*60}")
+         print(f"Sample Examples (first {args.print_samples}):")
+         print(f"{'='*60}")
+         for i, example in enumerate(dataset[:args.print_samples]):
+             print(f"\n--- Example {i+1} [{example['label']}] ---")
+             print(example["text"][:300])
+             if len(example["text"]) > 300:
+                 print("...")
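For reference, the "Total expected examples" figure printed by the CLI follows directly from the generation loop: each of the 8 categories emits one base example plus `augmented_copies` augmented variants per base example. A minimal sketch with the default values:

```python
# Dataset-size arithmetic behind the CLI banner above.
examples_per_category = 80
augmented_copies = 2
n_categories = 8
total = examples_per_category * (1 + augmented_copies) * n_categories
print(total)  # 1920
```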
inference.py ADDED
@@ -0,0 +1,441 @@
+ """
+ Resume Section Classifier – Inference Script
+
+ Takes raw resume text, splits it into sections, and classifies each section
+ into one of 8 categories with confidence scores.
+
+ Author: Lorenzo Scaturchio (gr8monk3ys)
+
+ Usage:
+     # Classify a resume file
+     python inference.py --file resume.txt
+
+     # Classify inline text
+     python inference.py --text "Bachelor of Science in Computer Science, MIT, 2023"
+
+     # Use a custom model path
+     python inference.py --model ./model_output/final_model --file resume.txt
+
+     # Output as JSON
+     python inference.py --file resume.txt --format json
+
+     # Python API
+     from inference import ResumeSectionClassifier
+     classifier = ResumeSectionClassifier("./model_output/final_model")
+     results = classifier.classify_resume(resume_text)
+ """
+
+ import json
+ import re
+ import sys
+ from dataclasses import dataclass, field
+ from pathlib import Path
+
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+
+ # ---------------------------------------------------------------------------
+ # Data classes
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class SectionPrediction:
+     """A single section classification result."""
+     text: str
+     label: str
+     confidence: float
+     all_scores: dict = field(default_factory=dict)
+
+     def to_dict(self) -> dict:
+         return {
+             "text": self.text,
+             "label": self.label,
+             "confidence": round(self.confidence, 4),
+             "all_scores": {k: round(v, 4) for k, v in self.all_scores.items()},
+         }
+
+
+ @dataclass
+ class ResumeAnalysis:
+     """Complete resume analysis output."""
+     sections: list
+     section_count: int = 0
+     label_distribution: dict = field(default_factory=dict)
+
+     def to_dict(self) -> dict:
+         return {
+             "sections": [s.to_dict() for s in self.sections],
+             "section_count": self.section_count,
+             "label_distribution": self.label_distribution,
+         }
+
+     def to_json(self, indent: int = 2) -> str:
+         return json.dumps(self.to_dict(), indent=indent)
+
+     def summary(self) -> str:
+         """Human-readable summary."""
+         lines = [
+             f"Resume Analysis: {self.section_count} sections detected",
+             "=" * 50,
+         ]
+         for i, sec in enumerate(self.sections, 1):
+             text_preview = sec.text[:80].replace("\n", " ")
+             if len(sec.text) > 80:
+                 text_preview += "..."
+             lines.append(
+                 f"\n[{i}] {sec.label.upper()} (confidence: {sec.confidence:.1%})"
+             )
+             lines.append(f"    {text_preview}")
+
+         lines.append("\n" + "-" * 50)
+         lines.append("Label Distribution:")
+         for label, count in sorted(self.label_distribution.items()):
+             lines.append(f"  {label}: {count}")
+
+         return "\n".join(lines)
+
+
+ # ---------------------------------------------------------------------------
+ # Section splitting heuristics
+ # ---------------------------------------------------------------------------
+
+ # Common resume section header patterns
+ SECTION_HEADER_PATTERNS = [
+     r"^#{1,3}\s+.+$",          # Markdown headers
+     r"^[A-Z][A-Z\s&/,]{2,}$",  # ALL CAPS headers
+     r"^(?:EDUCATION|EXPERIENCE|WORK EXPERIENCE|PROFESSIONAL EXPERIENCE|"
+     r"SKILLS|TECHNICAL SKILLS|PROJECTS|PERSONAL PROJECTS|"
+     r"SUMMARY|PROFESSIONAL SUMMARY|OBJECTIVE|PROFILE|ABOUT|"
+     r"CERTIFICATIONS|CERTIFICATES|LICENSES|"
+     r"CONTACT|CONTACT INFORMATION|PERSONAL INFORMATION|"
+     r"AWARDS|HONORS|ACHIEVEMENTS|RECOGNITION|"
+     r"PUBLICATIONS|REFERENCES|VOLUNTEER|LANGUAGES|INTERESTS|"
+     r"ACTIVITIES|LEADERSHIP|RESEARCH)\s*:?\s*$",
+ ]
116
+
+ # Apply IGNORECASE only to the named-header pattern; the ALL-CAPS pattern must
+ # stay case-sensitive, otherwise it would match any short lowercase line.
+ _HEADER_FLAGS = [re.MULTILINE, re.MULTILINE, re.MULTILINE | re.IGNORECASE]
+ COMPILED_HEADERS = [re.compile(p, f) for p, f in zip(SECTION_HEADER_PATTERNS, _HEADER_FLAGS)]
+
+
+ def is_section_header(line: str) -> bool:
+     """Check if a line looks like a section header."""
+     stripped = line.strip()
+     if not stripped or len(stripped) < 3:
+         return False
+
+     for pattern in COMPILED_HEADERS:
+         if pattern.match(stripped):
+             return True
+
+     # Heuristic: short all-caps line
+     if stripped.isupper() and len(stripped.split()) <= 5 and len(stripped) < 50:
+         return True
+
+     return False
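A trimmed-down, self-contained sketch of this header heuristic (the named-header pattern is reduced to a small subset here, for illustration only):

```python
import re

# Mini version of is_section_header(): a line counts as a header if it matches
# a known header name (case-insensitively) or is a short ALL-CAPS line.
NAMED = re.compile(r"^(?:EDUCATION|EXPERIENCE|SKILLS|PROJECTS|SUMMARY)\s*:?\s*$",
                   re.IGNORECASE)

def looks_like_header(line: str) -> bool:
    stripped = line.strip()
    if len(stripped) < 3:
        return False
    if NAMED.match(stripped):
        return True
    return stripped.isupper() and len(stripped.split()) <= 5 and len(stripped) < 50

print(looks_like_header("EDUCATION"))                 # True
print(looks_like_header("Skills:"))                   # True
print(looks_like_header("B.S. in Computer Science"))  # False
```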
+
+
+ def split_resume_into_sections(text: str, min_section_length: int = 20) -> list:
+     """
+     Split raw resume text into logical sections.
+
+     Strategy:
+       1. First try to split on detected section headers.
+       2. Fall back to splitting on double newlines (paragraph breaks).
+       3. Filter out very short fragments.
+
+     Args:
+         text: Raw resume text.
+         min_section_length: Minimum character length for a section.
+
+     Returns:
+         List of text sections.
+     """
+     lines = text.split("\n")
+     sections = []
+     current_section_lines = []
+
+     # Pass 1: Try header-based splitting
+     header_found = False
+     for line in lines:
+         if is_section_header(line):
+             header_found = True
+             # Save previous section
+             if current_section_lines:
+                 section_text = "\n".join(current_section_lines).strip()
+                 if len(section_text) >= min_section_length:
+                     sections.append(section_text)
+             current_section_lines = [line]
+         else:
+             current_section_lines.append(line)
+
+     # Don't forget the last section
+     if current_section_lines:
+         section_text = "\n".join(current_section_lines).strip()
+         if len(section_text) >= min_section_length:
+             sections.append(section_text)
+
+     # If no headers found, fall back to paragraph splitting
+     if not header_found or len(sections) <= 1:
+         sections = []
+         paragraphs = re.split(r"\n\s*\n", text)
+         for para in paragraphs:
+             stripped = para.strip()
+             if len(stripped) >= min_section_length:
+                 sections.append(stripped)
+
+     # If still just one big block, return it as-is
+     if not sections:
+         stripped = text.strip()
+         if stripped:
+             sections = [stripped]
+
+     return sections
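The paragraph-break fallback (step 2 of the strategy) can be sketched standalone; the sample text below is made up, and the `min_section_length` filter is omitted for brevity:

```python
import re

# Standalone sketch of the fallback split used when no headers are detected:
# break on blank lines (one or more), then drop empty fragments. The real
# function additionally filters out fragments shorter than min_section_length.
text = "Python, Java, React\n\nSoftware Engineer at Acme\nBuilt internal APIs\n\n\nB.S. in CS, 2023"
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
print(len(paragraphs))  # 3
```

Note that `\n\s*\n` treats any run of blank lines as a single break, so the triple newline above still produces one split.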
+
+
+ # ---------------------------------------------------------------------------
+ # Classifier
+ # ---------------------------------------------------------------------------
+
+ class ResumeSectionClassifier:
+     """
+     Classifies resume text sections into categories.
+
+     Supports both single-section and full-resume classification.
+     """
+
+     def __init__(
+         self,
+         model_path: str = "./model_output/final_model",
+         device: str = None,
+         max_length: int = 256,
+     ):
+         """
+         Initialize the classifier.
+
+         Args:
+             model_path: Path to fine-tuned model directory.
+             device: Device string ('cpu', 'cuda', 'mps'). Auto-detected if None.
+             max_length: Maximum token sequence length.
+         """
+         self.model_path = Path(model_path)
+         self.max_length = max_length
+
+         # Auto-detect device
+         if device is None:
+             if torch.cuda.is_available():
+                 self.device = torch.device("cuda")
+             elif torch.backends.mps.is_available():
+                 self.device = torch.device("mps")
+             else:
+                 self.device = torch.device("cpu")
+         else:
+             self.device = torch.device(device)
+
+         # Load model and tokenizer
+         self.tokenizer = AutoTokenizer.from_pretrained(str(self.model_path))
+         self.model = AutoModelForSequenceClassification.from_pretrained(
+             str(self.model_path)
+         ).to(self.device)
+         self.model.eval()
+
+         # Load label mapping
+         label_mapping_path = self.model_path / "label_mapping.json"
+         if label_mapping_path.exists():
+             with open(label_mapping_path) as f:
+                 mapping = json.load(f)
+             self.id2label = {int(k): v for k, v in mapping["id2label"].items()}
+             self.label2id = mapping["label2id"]
+         else:
+             # Fall back to model config
+             self.id2label = self.model.config.id2label
+             self.label2id = self.model.config.label2id
+
+         self.labels = sorted(self.label2id.keys())
+
+     def classify_section(self, text: str) -> SectionPrediction:
+         """
+         Classify a single text section.
+
+         Args:
+             text: Section text to classify.
+
+         Returns:
+             SectionPrediction with label, confidence, and all scores.
+         """
+         inputs = self.tokenizer(
+             text,
+             truncation=True,
+             max_length=self.max_length,
+             padding=True,
+             return_tensors="pt",
+         ).to(self.device)
+
+         with torch.no_grad():
+             outputs = self.model(**inputs)
+             probs = torch.softmax(outputs.logits, dim=-1)[0]
+
+         scores = {self.id2label[i]: probs[i].item() for i in range(len(probs))}
+         predicted_id = probs.argmax().item()
+         predicted_label = self.id2label[predicted_id]
+         confidence = probs[predicted_id].item()
+
+         return SectionPrediction(
+             text=text,
+             label=predicted_label,
+             confidence=confidence,
+             all_scores=scores,
+         )
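The logits-to-confidence step is a plain softmax followed by argmax; a dependency-free sketch with hypothetical logits:

```python
import math

# Softmax over a made-up logit vector, mirroring how classify_section()
# turns model logits into per-label confidence scores (without torch).
logits = [2.0, 0.5, -1.0]
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]
predicted_id = max(range(len(probs)), key=probs.__getitem__)
print(predicted_id)  # 0
```

The probabilities sum to 1, and the reported confidence is simply the probability at the argmax index.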
288
+
289
    def classify_sections(self, texts: list[str]) -> list[SectionPrediction]:
        """
        Classify multiple text sections in a single batch.

        Args:
            texts: List of section texts.

        Returns:
            List of SectionPrediction objects, in input order.
        """
        if not texts:
            return []

        inputs = self.tokenizer(
            texts,
            truncation=True,
            max_length=self.max_length,
            padding=True,
            return_tensors="pt",
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)

        results = []
        for i, text in enumerate(texts):
            scores = {self.id2label[j]: probs[i][j].item() for j in range(probs.shape[1])}
            predicted_id = probs[i].argmax().item()
            predicted_label = self.id2label[predicted_id]
            confidence = probs[i][predicted_id].item()

            results.append(SectionPrediction(
                text=text,
                label=predicted_label,
                confidence=confidence,
                all_scores=scores,
            ))

        return results

    def classify_resume(
        self,
        resume_text: str,
        min_section_length: int = 20,
    ) -> ResumeAnalysis:
        """
        Classify a full resume by splitting it into sections and classifying each.

        Args:
            resume_text: Full resume text.
            min_section_length: Minimum section length in characters.

        Returns:
            ResumeAnalysis with all section predictions.
        """
        sections = split_resume_into_sections(resume_text, min_section_length)
        predictions = self.classify_sections(sections)

        # Compute label distribution
        label_dist = {}
        for pred in predictions:
            label_dist[pred.label] = label_dist.get(pred.label, 0) + 1

        return ResumeAnalysis(
            sections=predictions,
            section_count=len(predictions),
            label_distribution=label_dist,
        )


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="Classify resume sections",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python inference.py --file resume.txt
  python inference.py --text "BS in Computer Science, MIT, 2023"
  python inference.py --file resume.txt --format json
  python inference.py --model ./model_output/final_model --file resume.txt
""",
    )

    input_group = parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument("--file", type=str, help="Path to resume text file")
    input_group.add_argument("--text", type=str, help="Direct text to classify")

    parser.add_argument("--model", type=str, default="./model_output/final_model",
                        help="Path to fine-tuned model (default: ./model_output/final_model)")
    parser.add_argument("--device", type=str, default=None,
                        help="Device: cpu, cuda, mps (auto-detected if omitted)")
    parser.add_argument("--max-length", type=int, default=256,
                        help="Maximum token sequence length (default: 256)")
    parser.add_argument("--min-section-length", type=int, default=20,
                        help="Minimum section length in characters (default: 20)")
    parser.add_argument("--format", type=str, choices=["text", "json"], default="text",
                        help="Output format (default: text)")
    parser.add_argument("--single", action="store_true",
                        help="Classify as a single section (no splitting)")

    args = parser.parse_args()

    # Load classifier
    try:
        classifier = ResumeSectionClassifier(
            model_path=args.model,
            device=args.device,
            max_length=args.max_length,
        )
    except Exception as e:
        print(f"Error loading model from '{args.model}': {e}", file=sys.stderr)
        print("Have you trained the model yet? Run: python train.py", file=sys.stderr)
        sys.exit(1)

    # Get input text
    if args.file:
        file_path = Path(args.file)
        if not file_path.exists():
            print(f"File not found: {args.file}", file=sys.stderr)
            sys.exit(1)
        text = file_path.read_text(encoding="utf-8")
    else:
        text = args.text

    # Classify
    if args.single:
        result = classifier.classify_section(text)
        if args.format == "json":
            print(json.dumps(result.to_dict(), indent=2))
        else:
            print(f"Label: {result.label}")
            print(f"Confidence: {result.confidence:.1%}")
            print("\nAll scores:")
            for label, score in sorted(result.all_scores.items(), key=lambda x: -x[1]):
                bar = "#" * int(score * 40)
                print(f"  {label:20s} {score:.4f} {bar}")
    else:
        analysis = classifier.classify_resume(text, min_section_length=args.min_section_length)
        if args.format == "json":
            print(analysis.to_json())
        else:
            print(analysis.summary())


if __name__ == "__main__":
    main()
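For reference, the per-section scoring above boils down to a softmax over the model's logits followed by an argmax. A stdlib-only sketch of that step, where the label set is the one this model uses but the logit values are invented purely for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The 8 section labels; the logits below are made-up example values.
labels = ["education", "experience", "skills", "projects",
          "summary", "certifications", "contact", "awards"]
logits = [4.1, 0.3, -1.2, 0.0, 0.5, -0.8, -2.0, -1.5]

probs = softmax(logits)
scores = dict(zip(labels, probs))        # mirrors SectionPrediction.all_scores
predicted = max(scores, key=scores.get)  # argmax, as in classify_section
confidence = scores[predicted]
```

The real code does the same thing with `torch.softmax` over the logits tensor; `classify_sections` simply applies it across a padded batch.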
requirements.txt ADDED
@@ -0,0 +1,8 @@
transformers>=4.36.0
datasets>=2.16.0
torch>=2.1.0
scikit-learn>=1.3.0
accelerate>=0.25.0
evaluate>=0.4.0
pandas>=2.0.0
huggingface_hub>=0.20.0
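These are minimum-version constraints, not pins. The `name>=minimum` shape is easy to parse if you want to sanity-check a line before installing; a throwaway stdlib helper (not part of this repo — use `packaging.requirements` for anything serious):

```python
def parse_min_requirement(line: str) -> tuple[str, tuple[int, ...]]:
    """Split a 'name>=x.y.z' requirement into its name and minimum version tuple."""
    name, sep, minimum = line.partition(">=")
    if not sep:
        raise ValueError(f"not a >= requirement: {line!r}")
    # Version tuples compare element-wise, so (4, 36, 0) < (4, 41, 0).
    return name.strip(), tuple(int(p) for p in minimum.strip().split("."))

name, minimum = parse_min_requirement("transformers>=4.36.0")
```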
train.py ADDED
@@ -0,0 +1,437 @@
"""
Resume Section Classifier – Training Script

Fine-tunes distilbert-base-uncased for classifying resume text sections
into 8 categories: education, experience, skills, projects, summary,
certifications, contact, awards.

Author: Lorenzo Scaturchio (gr8monk3ys)

Usage:
    python train.py                     # Train with defaults
    python train.py --epochs 5 --batch-size 32
    python train.py --push-to-hub       # Push to HuggingFace Hub
    python train.py --output-dir ./my_model
"""

import json
import logging
import sys
from pathlib import Path

import evaluate
import numpy as np
import torch
from datasets import DatasetDict
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

from data_generator import generate_dataset, get_label_mapping, load_as_hf_dataset

# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)

# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
MODEL_NAME = "distilbert-base-uncased"
DEFAULT_OUTPUT_DIR = "./model_output"
DEFAULT_LOGGING_DIR = "./logs"
HUB_MODEL_ID = "gr8monk3ys/resume-section-classifier"
MAX_LENGTH = 256

# ---------------------------------------------------------------------------
# Metrics computation
# ---------------------------------------------------------------------------
def build_compute_metrics(id2label: dict):
    """Build a compute_metrics function with access to the label mappings."""
    accuracy_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)

        acc = accuracy_metric.compute(predictions=predictions, references=labels)
        f1_macro = f1_metric.compute(predictions=predictions, references=labels, average="macro")
        f1_weighted = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
        precision = precision_metric.compute(predictions=predictions, references=labels, average="weighted")
        recall = recall_metric.compute(predictions=predictions, references=labels, average="weighted")

        return {
            "accuracy": acc["accuracy"],
            "f1_macro": f1_macro["f1"],
            "f1_weighted": f1_weighted["f1"],
            "precision": precision["precision"],
            "recall": recall["recall"],
        }

    return compute_metrics


# ---------------------------------------------------------------------------
# Tokenization
# ---------------------------------------------------------------------------
def tokenize_dataset(dataset_dict: DatasetDict, tokenizer, label2id: dict, max_length: int = MAX_LENGTH):
    """Tokenize all splits and encode string labels as integer ids."""

    def preprocess(examples):
        tokenized = tokenizer(
            examples["text"],
            truncation=True,
            max_length=max_length,
            padding=False,  # Dynamic padding via DataCollatorWithPadding
        )
        tokenized["labels"] = [label2id[label] for label in examples["label"]]
        return tokenized

    tokenized = dataset_dict.map(
        preprocess,
        batched=True,
        remove_columns=["text", "label"],
        desc="Tokenizing",
    )

    return tokenized


# ---------------------------------------------------------------------------
# Main training function
# ---------------------------------------------------------------------------
def train(
    output_dir: str = DEFAULT_OUTPUT_DIR,
    model_name: str = MODEL_NAME,
    epochs: int = 4,
    batch_size: int = 16,
    learning_rate: float = 2e-5,
    weight_decay: float = 0.01,
    warmup_ratio: float = 0.1,
    max_length: int = MAX_LENGTH,
    examples_per_category: int = 80,
    augmented_copies: int = 2,
    seed: int = 42,
    push_to_hub: bool = False,
    hub_model_id: str = HUB_MODEL_ID,
    fp16: bool | None = None,
    gradient_accumulation_steps: int = 1,
    early_stopping_patience: int = 3,
):
    """
    Full training pipeline.

    Args:
        output_dir: Directory to save the model and artifacts.
        model_name: Pretrained model identifier.
        epochs: Number of training epochs.
        batch_size: Training batch size.
        learning_rate: Peak learning rate.
        weight_decay: Weight decay for AdamW.
        warmup_ratio: Fraction of total steps used for warmup.
        max_length: Maximum token sequence length.
        examples_per_category: Base synthetic examples per category.
        augmented_copies: Augmented copies per base example.
        seed: Random seed.
        push_to_hub: Whether to push to the HuggingFace Hub.
        hub_model_id: Hub model repository ID.
        fp16: Use mixed precision (auto-detected if None).
        gradient_accumulation_steps: Gradient accumulation steps.
        early_stopping_patience: Early stopping patience in epochs (0 disables it).
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Auto-detect fp16
    if fp16 is None:
        fp16 = torch.cuda.is_available()

    logger.info("=" * 60)
    logger.info("Resume Section Classifier – Training")
    logger.info("=" * 60)
    logger.info(f"Model: {model_name}")
    logger.info(f"Output: {output_dir}")
    logger.info(f"Epochs: {epochs}, Batch size: {batch_size}, LR: {learning_rate}")
    logger.info(f"Device: {'CUDA' if torch.cuda.is_available() else 'MPS' if torch.backends.mps.is_available() else 'CPU'}")
    logger.info(f"FP16: {fp16}")

    # ------------------------------------------------------------------
    # 1. Generate synthetic data
    # ------------------------------------------------------------------
    logger.info("\n[1/5] Generating synthetic training data...")
    raw_dataset = generate_dataset(
        examples_per_category=examples_per_category,
        augmented_copies=augmented_copies,
        seed=seed,
    )
    label2id, id2label = get_label_mapping(raw_dataset)
    num_labels = len(label2id)

    logger.info(f"  Total examples: {len(raw_dataset)}")
    logger.info(f"  Labels ({num_labels}): {list(label2id.keys())}")

    # Create an HF DatasetDict with train/val/test splits
    dataset_dict = load_as_hf_dataset(raw_dataset)
    logger.info(f"  Train:      {len(dataset_dict['train'])}")
    logger.info(f"  Validation: {len(dataset_dict['validation'])}")
    logger.info(f"  Test:       {len(dataset_dict['test'])}")

    # ------------------------------------------------------------------
    # 2. Tokenize
    # ------------------------------------------------------------------
    logger.info("\n[2/5] Loading tokenizer and tokenizing data...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenized_dataset = tokenize_dataset(dataset_dict, tokenizer, label2id, max_length)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # ------------------------------------------------------------------
    # 3. Load model
    # ------------------------------------------------------------------
    logger.info("\n[3/5] Loading pretrained model...")
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,
        id2label=id2label,
        label2id=label2id,
    )
    logger.info(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")
    logger.info(f"  Trainable:  {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

    # ------------------------------------------------------------------
    # 4. Training
    # ------------------------------------------------------------------
    logger.info("\n[4/5] Training...")

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        # Training hyperparameters
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,
        gradient_accumulation_steps=gradient_accumulation_steps,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_ratio=warmup_ratio,
        lr_scheduler_type="cosine",
        # Evaluation
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1_macro",
        greater_is_better=True,
        # Logging
        logging_dir=DEFAULT_LOGGING_DIR,
        logging_strategy="steps",
        logging_steps=50,
        report_to="none",
        # Efficiency
        fp16=fp16,
        dataloader_num_workers=0,
        # Reproducibility
        seed=seed,
        data_seed=seed,
        # Hub
        push_to_hub=False,  # We'll push manually after evaluation
        # Misc
        save_total_limit=3,
        disable_tqdm=False,
    )

    callbacks = []
    if early_stopping_patience > 0:
        callbacks.append(EarlyStoppingCallback(early_stopping_patience=early_stopping_patience))

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=build_compute_metrics(id2label),
        callbacks=callbacks,
    )

    train_result = trainer.train()

    # Log training metrics
    logger.info("\nTraining Results:")
    for key, value in train_result.metrics.items():
        logger.info(f"  {key}: {value}")

    # ------------------------------------------------------------------
    # 5. Evaluation
    # ------------------------------------------------------------------
    logger.info("\n[5/5] Evaluating on test set...")
    test_results = trainer.evaluate(tokenized_dataset["test"])

    logger.info("\nTest Results:")
    for key, value in test_results.items():
        logger.info(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

    # ------------------------------------------------------------------
    # Save artifacts
    # ------------------------------------------------------------------
    logger.info("\nSaving model and artifacts...")

    # Save model + tokenizer
    final_path = output_path / "final_model"
    trainer.save_model(str(final_path))
    tokenizer.save_pretrained(str(final_path))

    # Save label mapping (JSON object keys must be strings, so ids are stringified)
    label_mapping = {
        "label2id": label2id,
        "id2label": {str(k): v for k, v in id2label.items()},
        "labels": list(label2id.keys()),
    }
    with open(final_path / "label_mapping.json", "w") as f:
        json.dump(label_mapping, f, indent=2)

    # Save training config
    train_config = {
        "model_name": model_name,
        "max_length": max_length,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "weight_decay": weight_decay,
        "warmup_ratio": warmup_ratio,
        "examples_per_category": examples_per_category,
        "augmented_copies": augmented_copies,
        "seed": seed,
        "num_labels": num_labels,
        "train_size": len(dataset_dict["train"]),
        "val_size": len(dataset_dict["validation"]),
        "test_size": len(dataset_dict["test"]),
    }
    with open(final_path / "training_config.json", "w") as f:
        json.dump(train_config, f, indent=2)

    # Save metrics
    all_metrics = {
        "train": train_result.metrics,
        "test": test_results,
    }
    with open(final_path / "metrics.json", "w") as f:
        json.dump(all_metrics, f, indent=2)

    logger.info(f"\nAll artifacts saved to: {final_path}")

    # ------------------------------------------------------------------
    # Optional: Push to Hub
    # ------------------------------------------------------------------
    if push_to_hub:
        logger.info(f"\nPushing to HuggingFace Hub: {hub_model_id}")
        try:
            # Trainer.push_to_hub takes no repo_id argument; the target repo
            # comes from TrainingArguments.hub_model_id, so set it here.
            trainer.args.hub_model_id = hub_model_id
            trainer.push_to_hub(
                commit_message="Upload fine-tuned resume section classifier",
            )
            tokenizer.push_to_hub(hub_model_id)
            logger.info("Successfully pushed to Hub!")
        except Exception as e:
            logger.error(f"Failed to push to Hub: {e}")
            logger.info("You can push manually later with:")
            logger.info(f"  huggingface-cli upload {hub_model_id} {final_path}")

    logger.info("\nTraining complete!")
    return test_results


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Fine-tune DistilBERT for resume section classification",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Model & output
    parser.add_argument("--model-name", type=str, default=MODEL_NAME,
                        help="Pretrained model name or path")
    parser.add_argument("--output-dir", type=str, default=DEFAULT_OUTPUT_DIR,
                        help="Output directory for model and artifacts")

    # Training hyperparameters
    parser.add_argument("--epochs", type=int, default=4,
                        help="Number of training epochs")
    parser.add_argument("--batch-size", type=int, default=16,
                        help="Training batch size per device")
    parser.add_argument("--learning-rate", type=float, default=2e-5,
                        help="Peak learning rate")
    parser.add_argument("--weight-decay", type=float, default=0.01,
                        help="Weight decay for AdamW")
    parser.add_argument("--warmup-ratio", type=float, default=0.1,
                        help="Fraction of total steps for linear warmup")
    parser.add_argument("--max-length", type=int, default=MAX_LENGTH,
                        help="Maximum token sequence length")
    parser.add_argument("--gradient-accumulation-steps", type=int, default=1,
                        help="Number of gradient accumulation steps")

    # Data
    parser.add_argument("--examples-per-category", type=int, default=80,
                        help="Base synthetic examples per category")
    parser.add_argument("--augmented-copies", type=int, default=2,
                        help="Augmented copies per base example")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")

    # Training config
    parser.add_argument("--fp16", action="store_true", default=None,
                        help="Force FP16 training")
    parser.add_argument("--no-fp16", action="store_true",
                        help="Disable FP16 training")
    parser.add_argument("--early-stopping-patience", type=int, default=3,
                        help="Early stopping patience (0 to disable)")

    # Hub
    parser.add_argument("--push-to-hub", action="store_true",
                        help="Push trained model to HuggingFace Hub")
    parser.add_argument("--hub-model-id", type=str, default=HUB_MODEL_ID,
                        help="HuggingFace Hub model ID")

    args = parser.parse_args()

    # Resolve the fp16 tri-state: None = auto-detect, True = forced, False = disabled
    fp16 = args.fp16
    if args.no_fp16:
        fp16 = False

    results = train(
        output_dir=args.output_dir,
        model_name=args.model_name,
        epochs=args.epochs,
        batch_size=args.batch_size,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_ratio=args.warmup_ratio,
        max_length=args.max_length,
        examples_per_category=args.examples_per_category,
        augmented_copies=args.augmented_copies,
        seed=args.seed,
        push_to_hub=args.push_to_hub,
        hub_model_id=args.hub_model_id,
        fp16=fp16,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        early_stopping_patience=args.early_stopping_patience,
    )
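One detail worth noting in the artifact-saving step: JSON object keys are always strings, which is why train.py serializes `id2label` as `{str(k): v ...}`. Any loader that reads `label_mapping.json` back has to undo that conversion. A minimal sketch of the round trip (the two-label mapping here is a toy example, not the full 8-label set):

```python
import json

# What train.py writes: integer ids must be stringified for JSON.
label_mapping = {
    "label2id": {"education": 0, "experience": 1},
    "id2label": {"0": "education", "1": "experience"},
    "labels": ["education", "experience"],
}
blob = json.dumps(label_mapping, indent=2)

# What a loader must do on the way back in: restore integer keys.
loaded = json.loads(blob)
id2label = {int(k): v for k, v in loaded["id2label"].items()}
```

The inference code sidesteps this entirely by reading `model.config.id2label`, which transformers deserializes with integer keys for you.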