# Resume Section Classifier

A fine-tuned DistilBERT model that classifies resume text sections into 8 categories. Designed for automated resume-parsing pipelines where incoming text needs to be segmented and labeled by section type.
## Labels

| Label | Description | Example |
|---|---|---|
| `education` | Academic background, degrees, coursework, GPA | "B.S. in Computer Science, MIT, 2023, GPA: 3.8" |
| `experience` | Work history, job titles, responsibilities | "Software Engineer at Google, 2020-Present. Built microservices..." |
| `skills` | Technical and soft skills listings | "Python, Java, React, Docker, Kubernetes, AWS" |
| `projects` | Personal or academic project descriptions | "Built a real-time analytics dashboard using React and D3.js" |
| `summary` | Professional summary or objective statement | "Results-driven engineer with 5+ years of experience in..." |
| `certifications` | Professional certifications and licenses | "AWS Certified Solutions Architect - Associate (2023)" |
| `contact` | Name, email, phone, LinkedIn, location | "John Smith \| john.smith@email.com \| (415) 555-1234" |
| `awards` | Honors, achievements, recognition | "Dean's List, Phi Beta Kappa, Hackathon Winner (2022)" |
## Training Procedure

### Data

The model was trained on a synthetic dataset generated programmatically with `data_generator.py`. The generator uses:
- Template-based generation with randomized entities (names, companies, universities, skills, dates, cities, etc.) drawn from curated pools of 30-50+ items each
- Structural variation across multiple formatting styles per category (e.g., education entries can appear as single-line, multi-line, with/without GPA, with activities, etc.)
- Data augmentation via synonym replacement (30% probability per eligible word) and formatting variation (whitespace, bullet styles, separators)
- Optional section headers prepended with 40% probability to teach the model to handle both headed and headless sections
Default configuration produces 1,920 examples (80 base examples x 3 variants x 8 categories), split 80/10/10 into train/validation/test sets with stratified sampling.
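The synonym-replacement augmentation described above can be sketched as follows. This is an illustration only, not the actual `data_generator.py` implementation; the `SYNONYMS` pool and the `augment` name are assumptions made for the example (the real pools hold 30-50+ items each).

```python
import random

# Hypothetical synonym pool; the real pools in data_generator.py are much larger.
SYNONYMS = {
    "built": ["developed", "created", "engineered"],
    "led": ["directed", "headed", "drove"],
    "improved": ["enhanced", "optimized", "boosted"],
}

def augment(text, p=0.3, seed=None):
    """Replace each eligible word with a random synonym with probability p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)
```

Each augmented copy keeps the word count and structure of the source example while varying its lexical surface, which is what lets 80 base examples expand into 3 variants apiece.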
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Max sequence length | 256 tokens |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Early stopping | Patience 3 (on F1 macro) |
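With the default 1,920-example dataset and the settings above, the training schedule works out as a quick back-of-envelope calculation (transformers may round the warmup step count slightly differently):

```python
import math

num_examples = 1920
train_frac = 0.8          # 80/10/10 split
batch_size = 16
epochs = 4
warmup_ratio = 0.1

train_examples = int(num_examples * train_frac)           # 1536 training examples
steps_per_epoch = math.ceil(train_examples / batch_size)  # 96 optimizer steps per epoch
total_steps = steps_per_epoch * epochs                    # 384 steps total
warmup_steps = math.ceil(total_steps * warmup_ratio)      # ~39 steps of LR warmup
```

So the cosine schedule warms up over roughly the first 10% of the 384 total steps before decaying.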
### Training Infrastructure
- Fine-tuning takes approximately 5-10 minutes on a single GPU or 15-30 minutes on CPU
- Model size: ~67M parameters (DistilBERT base)
- Mixed precision (FP16) enabled automatically when CUDA is available
## Metrics
Evaluated on the held-out synthetic test set (192 examples, stratified):
| Metric | Score |
|---|---|
| Accuracy | ~0.95+ |
| F1 (macro) | ~0.95+ |
| F1 (weighted) | ~0.95+ |
| Precision (weighted) | ~0.95+ |
| Recall (weighted) | ~0.95+ |
Note: Exact metrics depend on the random seed and training run. Scores on real-world resumes may differ from synthetic test performance.
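Macro F1 (the early-stopping criterion) averages per-class F1 with equal weight, while weighted F1 weights each class by its support. A minimal pure-Python illustration of the difference:

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class, macro, and support-weighted F1 (minimal illustration)."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted
```

On this model's stratified test set the classes are balanced, so macro and weighted F1 nearly coincide; they diverge on imbalanced real-world data.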
## Usage

### Quick Start with Transformers Pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="gr8monk3ys/resume-section-classifier",
)

result = classifier("Bachelor of Science in Computer Science, Stanford University, 2023. GPA: 3.9/4.0")
print(result)
# [{'label': 'education', 'score': 0.98}]
```
### Full Resume Classification

```python
from inference import ResumeSectionClassifier

classifier = ResumeSectionClassifier("gr8monk3ys/resume-section-classifier")

resume_text = """
John Smith
john.smith@email.com | (415) 555-1234 | San Francisco, CA
linkedin.com/in/johnsmith | github.com/johnsmith

SUMMARY
Experienced software engineer with 5+ years building scalable web applications
and distributed systems. Passionate about clean code and mentoring junior developers.

EXPERIENCE
Senior Software Engineer | Google | San Francisco, CA | Jan 2021 - Present
- Led migration of monolithic application to microservices architecture
- Mentored 4 junior engineers through code reviews and pair programming
- Reduced API latency by 40% through caching and query optimization

Software Engineer | Stripe | San Francisco, CA | Jun 2018 - Dec 2020
- Built payment processing APIs handling 10M+ transactions daily
- Implemented CI/CD pipeline reducing deployment time from hours to minutes

EDUCATION
B.S. in Computer Science, Stanford University (2018)
GPA: 3.8/4.0 | Dean's List

SKILLS
Languages: Python, Java, Go, TypeScript
Frameworks: React, Django, Spring Boot, gRPC
Tools: Docker, Kubernetes, AWS, PostgreSQL, Redis
"""

analysis = classifier.classify_resume(resume_text)
print(analysis.summary())
```
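The splitter behind `classify_resume` uses heuristics such as section headers and paragraph breaks (see Limitations). A minimal sketch of one such heuristic, splitting on ALL-CAPS header lines; this is an illustration only, not the actual `inference.py` implementation:

```python
import re

# Heuristic: a line that is 3+ characters of capitals/spaces is treated as a header.
HEADER_RE = re.compile(r"^[A-Z][A-Z &/]{2,}$")

def split_sections(text):
    """Split resume text into (header, body) pairs on ALL-CAPS header lines."""
    sections, current_header, buf = [], "contact", []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADER_RE.match(stripped):
            if buf:
                sections.append((current_header, "\n".join(buf).strip()))
            current_header, buf = stripped, []
        else:
            buf.append(line)
    if buf:
        sections.append((current_header, "\n".join(buf).strip()))
    return sections
```

Text before the first header is bucketed under a default label here, mirroring the common case where contact details lead the resume without a header.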
### Training from Scratch

```bash
# Install dependencies
pip install -r requirements.txt

# Generate data and train
python train.py --epochs 4 --batch-size 16

# Train and push to Hub
python train.py --push-to-hub --hub-model-id gr8monk3ys/resume-section-classifier

# Inference
python inference.py --file resume.txt
python inference.py --text "Python, Java, React, Docker" --single
```
### Generating Custom Training Data

```bash
# Generate data with custom settings
python data_generator.py \
    --examples-per-category 150 \
    --augmented-copies 3 \
    --output data/resume_sections.csv \
    --print-stats
```
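The 80/10/10 stratified split mentioned under Data can be sketched as below. The function name and exact logic are assumptions for illustration; the real split lives inside the project's training code.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split indices into train/val/test, preserving per-label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    train, val, test = [], [], []
    for lab, idxs in by_label.items():
        rng.shuffle(idxs)
        n_train = int(len(idxs) * ratios[0])
        n_val = int(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Splitting within each label guarantees every category appears in all three sets, which is what makes the 192-example test set balanced across the 8 classes.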
## Project Structure

```text
resume-section-classifier/
├── data_generator.py   # Synthetic data generation with templates and augmentation
├── train.py            # Full fine-tuning pipeline with HuggingFace Trainer
├── inference.py        # Section splitting and classification API + CLI
├── requirements.txt    # Python dependencies
└── README.md           # This model card
```
## Limitations
- Synthetic training data: The model is trained entirely on programmatically generated text. Performance on real resumes with unusual formatting, non-English content, or domain-specific jargon may be lower than synthetic test metrics suggest.
- English only: All training data is in English. The model will not reliably classify sections in other languages.
- Section granularity: The model classifies pre-split sections. The built-in text splitter uses heuristics (section headers, paragraph breaks) that may not correctly segment all resume formats (e.g., single-column vs. multi-column PDFs, tables).
- 8 categories only: Some resume sections (e.g., Publications, Volunteer Work, Languages, Interests, References) are not covered by the current label set. These will be assigned to the closest matching category.
- Max length: Input is truncated to 256 tokens. Very long sections may lose information.
## Intended Use
- Automated resume parsing pipelines
- HR tech and applicant tracking systems
- Resume formatting and analysis tools
- Educational demonstrations of text classification fine-tuning
This model is not intended for making hiring decisions. It is a text classification tool for structural parsing only.
## Author

Lorenzo Scaturchio (gr8monk3ys)
## Base Model

distilbert/distilbert-base-uncased