# Resume Section Classifier

A fine-tuned DistilBERT model that classifies resume text sections into 8 categories. Designed for automated resume-parsing pipelines where incoming text needs to be segmented and labeled by section type.
## Labels

| Label | Description | Example |
|---|---|---|
| `education` | Academic background, degrees, coursework, GPA | "B.S. in Computer Science, MIT, 2023, GPA: 3.8" |
| `experience` | Work history, job titles, responsibilities | "Software Engineer at Google, 2020-Present. Built microservices..." |
| `skills` | Technical and soft skills listings | "Python, Java, React, Docker, Kubernetes, AWS" |
| `projects` | Personal or academic project descriptions | "Built a real-time analytics dashboard using React and D3.js" |
| `summary` | Professional summary or objective statement | "Results-driven engineer with 5+ years of experience in..." |
| `certifications` | Professional certifications and licenses | "AWS Certified Solutions Architect - Associate (2023)" |
| `contact` | Name, email, phone, LinkedIn, location | "John Smith \| john.smith@email.com \| (415) 555-1234" |
| `awards` | Honors, achievements, recognition | "Dean's List, Phi Beta Kappa, Hackathon Winner (2022)" |
## Training Procedure

### Data

The model was trained on a synthetic dataset generated programmatically with `data_generator.py`. The generator uses:
- Template-based generation with randomized entities (names, companies, universities, skills, dates, cities, etc.) drawn from curated pools of 30-50+ items each
- Structural variation across multiple formatting styles per category (e.g., education entries can appear as single-line, multi-line, with/without GPA, with activities, etc.)
- Data augmentation via synonym replacement (30% probability per eligible word) and formatting variation (whitespace, bullet styles, separators)
- Optional section headers prepended with 40% probability to teach the model to handle both headed and headless sections
Default configuration produces 1,920 examples (80 base examples x 3 variants x 8 categories), split 80/10/10 into train/validation/test sets with stratified sampling.
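The synonym-replacement augmentation described above can be sketched as follows. This is an illustration only, not the actual `data_generator.py` implementation; the `SYNONYMS` pool and the `augment` name are assumptions made for the example (the real pools hold 30-50+ items each).

```python
import random

# Hypothetical synonym pool; the real pools in data_generator.py are much larger.
SYNONYMS = {
    "built": ["developed", "created", "engineered"],
    "led": ["directed", "headed", "drove"],
    "improved": ["enhanced", "optimized", "boosted"],
}

def augment(text, p=0.3, seed=None):
    """Replace each eligible word with a random synonym with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)
```

Each augmented copy keeps the word count and structure of the source example while varying its lexical surface, which is what lets 80 base examples expand into 3 variants apiece.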
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Max sequence length | 256 tokens |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Early stopping | Patience 3 (on F1 macro) |
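With the default 1,920-example dataset and the settings above, the training schedule works out as a quick back-of-envelope calculation (transformers may round the warmup step count slightly differently):

```python
import math

num_examples = 1920
train_frac = 0.8          # 80/10/10 split
batch_size = 16
epochs = 4
warmup_ratio = 0.1

train_examples = int(num_examples * train_frac)           # 1536 training examples
steps_per_epoch = math.ceil(train_examples / batch_size)  # 96 optimizer steps per epoch
total_steps = steps_per_epoch * epochs                    # 384 steps total
warmup_steps = math.ceil(total_steps * warmup_ratio)      # ~39 steps of LR warmup
```

So the cosine schedule warms up over roughly the first 10% of the 384 total steps before decaying.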
### Training Infrastructure
- Fine-tuning takes approximately 5-10 minutes on a single GPU or 15-30 minutes on CPU
- Model size: ~67M parameters (DistilBERT base)
- Mixed precision (FP16) enabled automatically when CUDA is available
## Metrics
Evaluated on the held-out synthetic test set (192 examples, stratified):
| Metric | Score |
|---|---|
| Accuracy | ~0.95+ |
| F1 (macro) | ~0.95+ |
| F1 (weighted) | ~0.95+ |
| Precision (weighted) | ~0.95+ |
| Recall (weighted) | ~0.95+ |
Note: Exact metrics depend on the random seed and training run. Scores on real-world resumes may differ from synthetic test performance.
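Macro F1 (the early-stopping criterion) averages per-class F1 with equal weight, while weighted F1 weights each class by its support. A minimal pure-Python illustration of the difference:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class, macro, and support-weighted F1 (minimal illustration)."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted
```

On this model's stratified test set the classes are balanced, so macro and weighted F1 nearly coincide; they diverge on imbalanced real-world data.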
## Usage

### Quick Start with Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/resume-section-classifier",
)

result = classifier("Bachelor of Science in Computer Science, Stanford University, 2023. GPA: 3.9/4.0")
print(result)
# [{'label': 'education', 'score': 0.98}]
```
### Full Resume Classification

```python
from inference import ResumeSectionClassifier

classifier = ResumeSectionClassifier("gr8monk3ys/resume-section-classifier")

resume_text = """
John Smith
john.smith@email.com | (415) 555-1234 | San Francisco, CA
linkedin.com/in/johnsmith | github.com/johnsmith

SUMMARY
Experienced software engineer with 5+ years building scalable web applications
and distributed systems. Passionate about clean code and mentoring junior developers.

EXPERIENCE
Senior Software Engineer | Google | San Francisco, CA | Jan 2021 - Present
- Led migration of monolithic application to microservices architecture
- Mentored 4 junior engineers through code reviews and pair programming
- Reduced API latency by 40% through caching and query optimization

Software Engineer | Stripe | San Francisco, CA | Jun 2018 - Dec 2020
- Built payment processing APIs handling 10M+ transactions daily
- Implemented CI/CD pipeline reducing deployment time from hours to minutes

EDUCATION
B.S. in Computer Science, Stanford University (2018)
GPA: 3.8/4.0 | Dean's List

SKILLS
Languages: Python, Java, Go, TypeScript
Frameworks: React, Django, Spring Boot, gRPC
Tools: Docker, Kubernetes, AWS, PostgreSQL, Redis
"""

analysis = classifier.classify_resume(resume_text)
print(analysis.summary())
```
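The splitter behind `classify_resume` uses heuristics such as section headers and paragraph breaks (see Limitations). A minimal sketch of one such heuristic, splitting on ALL-CAPS header lines; this is an illustration only, not the actual `inference.py` implementation:

```python
import re

# Heuristic: a line that is 3+ characters of capitals/spaces is treated as a header.
HEADER_RE = re.compile(r"^[A-Z][A-Z &/]{2,}$")

def split_sections(text):
    """Split resume text into (header, body) pairs on ALL-CAPS header lines."""
    sections, current_header, buf = [], "contact", []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADER_RE.match(stripped):
            if buf:
                sections.append((current_header, "\n".join(buf).strip()))
            current_header, buf = stripped, []
        else:
            buf.append(line)
    if buf:
        sections.append((current_header, "\n".join(buf).strip()))
    return sections
```

Text before the first header is bucketed under a default label here, mirroring the common case where contact details lead the resume without a header.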
### Training from Scratch

```bash
# Install dependencies
pip install -r requirements.txt

# Generate data and train
python train.py --epochs 4 --batch-size 16

# Train and push to Hub
python train.py --push-to-hub --hub-model-id gr8monk3ys/resume-section-classifier

# Inference
python inference.py --file resume.txt
python inference.py --text "Python, Java, React, Docker" --single
```
### Generating Custom Training Data

```bash
# Generate data with custom settings
python data_generator.py \
    --examples-per-category 150 \
    --augmented-copies 3 \
    --output data/resume_sections.csv \
    --print-stats
```
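The 80/10/10 stratified split mentioned under Data can be sketched as below. The function name and exact logic are assumptions for illustration; the real split lives inside the project's training code.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split indices into train/val/test, preserving per-label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    train, val, test = [], [], []
    for lab, idxs in by_label.items():
        rng.shuffle(idxs)
        n_train = int(len(idxs) * ratios[0])
        n_val = int(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Splitting within each label guarantees every category appears in all three sets, which is what makes the 192-example test set balanced across the 8 classes.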
## Project Structure

```text
resume-section-classifier/
├── data_generator.py   # Synthetic data generation with templates and augmentation
├── train.py            # Full fine-tuning pipeline with HuggingFace Trainer
├── inference.py        # Section splitting and classification API + CLI
├── requirements.txt    # Python dependencies
└── README.md           # This model card
```
## Limitations
- Synthetic training data: The model is trained entirely on programmatically generated text. Performance on real resumes with unusual formatting, non-English content, or domain-specific jargon may be lower than synthetic test metrics suggest.
- English only: All training data is in English. The model will not reliably classify sections in other languages.
- Section granularity: The model classifies pre-split sections. The built-in text splitter uses heuristics (section headers, paragraph breaks) that may not correctly segment all resume formats (e.g., single-column vs. multi-column PDFs, tables).
- 8 categories only: Some resume sections (e.g., Publications, Volunteer Work, Languages, Interests, References) are not covered by the current label set. These will be assigned to the closest matching category.
- Max length: Input is truncated to 256 tokens. Very long sections may lose information.
## Intended Use
- Automated resume parsing pipelines
- HR tech and applicant tracking systems
- Resume formatting and analysis tools
- Educational demonstrations of text classification fine-tuning
This model is not intended for making hiring decisions. It is a text classification tool for structural parsing only.
## Author

Lorenzo Scaturchio (gr8monk3ys)
## Base Model

distilbert/distilbert-base-uncased