---
license: mit
base_model: distilbert-base-uncased
tags:
- text-classification
- resume-parsing
- nlp
- distilbert
- synthetic-data
metrics:
- accuracy
- f1
pipeline_tag: text-classification
datasets:
- custom-synthetic
language:
- en
---
# Resume Section Classifier
A fine-tuned [DistilBERT](https://huggingface.co/distilbert-base-uncased) model that classifies resume text sections into 8 categories. Designed for automated resume parsing pipelines where incoming text needs to be segmented and labeled by section type.
## Labels
| Label | Description | Example |
|-------|-------------|---------|
| `education` | Academic background, degrees, coursework, GPA | "B.S. in Computer Science, MIT, 2023, GPA: 3.8" |
| `experience` | Work history, job titles, responsibilities | "Software Engineer at Google, 2020-Present. Built microservices..." |
| `skills` | Technical and soft skills listings | "Python, Java, React, Docker, Kubernetes, AWS" |
| `projects` | Personal or academic project descriptions | "Built a real-time analytics dashboard using React and D3.js" |
| `summary` | Professional summary or objective statement | "Results-driven engineer with 5+ years of experience in..." |
| `certifications` | Professional certifications and licenses | "AWS Certified Solutions Architect - Associate (2023)" |
| `contact` | Name, email, phone, LinkedIn, location | "John Smith | john@email.com | (415) 555-1234 | SF, CA" |
| `awards` | Honors, achievements, recognition | "Dean's List, Phi Beta Kappa, Hackathon Winner (2022)" |
## Training Procedure
### Data
The model was trained on a **synthetic dataset** generated programmatically using `data_generator.py`. The generator uses:
- **Template-based generation** with randomized entities (names, companies, universities, skills, dates, cities, etc.) drawn from curated pools of 30-50+ items each
- **Structural variation** across multiple formatting styles per category (e.g., education entries can appear as single-line, multi-line, with/without GPA, with activities, etc.)
- **Data augmentation** via synonym replacement (30% probability per eligible word) and formatting variation (whitespace, bullet styles, separators)
- **Optional section headers** prepended with 40% probability to teach the model to handle both headed and headless sections
Default configuration produces **1,920 examples** (80 base examples × 3 variants × 8 categories), split 80/10/10 into train/validation/test sets with stratified sampling.
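As a rough illustration of the synonym-replacement step described above, here is a minimal sketch; `SYNONYMS` and `augment` are hypothetical names for this example, not the actual identifiers in `data_generator.py`:

```python
import random

# Illustrative synonym pool; the real generator draws from larger curated pools.
SYNONYMS = {
    "built": ["developed", "created", "engineered"],
    "led": ["directed", "headed", "managed"],
    "improved": ["enhanced", "optimized", "boosted"],
}

def augment(text, p=0.3, seed=None):
    """Replace each eligible word with a random synonym with probability p."""
    rng = random.Random(seed)
    words = []
    for word in text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < p:
            words.append(rng.choice(SYNONYMS[key]))
        else:
            words.append(word)
    return " ".join(words)

print(augment("Built and led a team that improved latency", seed=0))
```

With `p=0.3`, roughly a third of eligible words are swapped per pass, which is why each base example can yield several distinct augmented variants.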
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Base model | `distilbert-base-uncased` |
| Max sequence length | 256 tokens |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Early stopping | Patience 3 (on F1 macro) |
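As a sketch, the table above maps onto the Hugging Face `Trainer` API roughly as follows; argument names follow recent `transformers` releases and may differ slightly from what `train.py` actually passes:

```python
import torch
from transformers import TrainingArguments, EarlyStoppingCallback

# Hyperparameters from the table above; fp16 is enabled only when CUDA is present.
args = TrainingArguments(
    output_dir="resume-section-classifier",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),
)

# Early stopping on macro F1 is wired in via a Trainer callback:
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```

The callback requires evaluation and checkpointing to run each epoch with `metric_for_best_model` set to the macro-F1 key reported by the `compute_metrics` function.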
### Training Infrastructure
- Fine-tuning takes approximately 5-10 minutes on a single GPU or 15-30 minutes on CPU
- Model size: ~67M parameters (DistilBERT base)
- Mixed precision (FP16) enabled automatically when CUDA is available
## Metrics
Evaluated on the held-out synthetic test set (192 examples, stratified):
| Metric | Score |
|--------|-------|
| Accuracy | ~0.95+ |
| F1 (macro) | ~0.95+ |
| F1 (weighted) | ~0.95+ |
| Precision (weighted) | ~0.95+ |
| Recall (weighted) | ~0.95+ |
> Note: Exact metrics depend on the random seed and training run. Scores on real-world resumes may differ from synthetic test performance.
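The metrics in the table can be computed from test-set predictions with scikit-learn; `y_true` and `y_pred` below are placeholder label sequences, not real model outputs:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Placeholder labels; in practice these come from the held-out test set.
y_true = ["education", "skills", "contact", "education", "awards"]
y_pred = ["education", "skills", "contact", "experience", "awards"]

print("accuracy    :", accuracy_score(y_true, y_pred))
print("f1 macro    :", f1_score(y_true, y_pred, average="macro"))
print("f1 weighted :", f1_score(y_true, y_pred, average="weighted"))
print("precision_w :", precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("recall_w    :", recall_score(y_true, y_pred, average="weighted", zero_division=0))
```

Macro F1 averages the per-class F1 scores equally, so it penalizes weak minority classes; weighted F1 weights each class by its support.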
## Usage
### Quick Start with Transformers Pipeline
```python
from transformers import pipeline
classifier = pipeline(
"text-classification",
model="gr8monk3ys/resume-section-classifier",
)
result = classifier("Bachelor of Science in Computer Science, Stanford University, 2023. GPA: 3.9/4.0")
print(result)
# [{'label': 'education', 'score': 0.98}]
```
### Full Resume Classification
```python
from inference import ResumeSectionClassifier
classifier = ResumeSectionClassifier("gr8monk3ys/resume-section-classifier")
resume_text = """
John Smith
john.smith@email.com | (415) 555-1234 | San Francisco, CA
linkedin.com/in/johnsmith | github.com/johnsmith
SUMMARY
Experienced software engineer with 5+ years building scalable web applications
and distributed systems. Passionate about clean code and mentoring junior developers.
EXPERIENCE
Senior Software Engineer | Google | San Francisco, CA | Jan 2021 - Present
- Led migration of monolithic application to microservices architecture
- Mentored 4 junior engineers through code reviews and pair programming
- Reduced API latency by 40% through caching and query optimization
Software Engineer | Stripe | San Francisco, CA | Jun 2018 - Dec 2020
- Built payment processing APIs handling 10M+ transactions daily
- Implemented CI/CD pipeline reducing deployment time from hours to minutes
EDUCATION
B.S. in Computer Science, Stanford University (2018)
GPA: 3.8/4.0 | Dean's List
SKILLS
Languages: Python, Java, Go, TypeScript
Frameworks: React, Django, Spring Boot, gRPC
Tools: Docker, Kubernetes, AWS, PostgreSQL, Redis
"""
analysis = classifier.classify_resume(resume_text)
print(analysis.summary())
```
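The splitter inside `inference.py` is heuristic, as noted in the Limitations section. A minimal stand-in, assuming ALL-CAPS lines mark section headers (a simplification of the real splitter, which also uses paragraph breaks), could look like:

```python
import re

def split_sections(text):
    """Split resume text on ALL-CAPS header lines.

    Rough heuristic only: lines consisting entirely of capitals (plus
    spaces, '&', '/') are treated as headers; everything before the
    first header is grouped under a None header (typically contact info).
    """
    sections, current_header, buffer = [], None, []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and re.fullmatch(r"[A-Z][A-Z &/]+", stripped):
            if buffer:
                sections.append((current_header, "\n".join(buffer)))
            current_header, buffer = stripped, []
        elif stripped:
            buffer.append(stripped)
    if buffer:
        sections.append((current_header, "\n".join(buffer)))
    return sections

# Each (header, body) pair can then be passed to the text-classification
# pipeline shown in the Quick Start example above.
```

This fails on multi-column layouts and resumes without capitalized headers, which is exactly the segmentation caveat listed under Limitations.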
### Training from Scratch
```bash
# Install dependencies
pip install -r requirements.txt
# Generate data and train
python train.py --epochs 4 --batch-size 16
# Train and push to Hub
python train.py --push-to-hub --hub-model-id gr8monk3ys/resume-section-classifier
# Inference
python inference.py --file resume.txt
python inference.py --text "Python, Java, React, Docker" --single
```
### Generating Custom Training Data
```bash
# Generate data with custom settings
python data_generator.py \
--examples-per-category 150 \
--augmented-copies 3 \
--output data/resume_sections.csv \
--print-stats
```
## Project Structure
```
resume-section-classifier/
data_generator.py # Synthetic data generation with templates and augmentation
train.py # Full fine-tuning pipeline with HuggingFace Trainer
inference.py # Section splitting and classification API + CLI
requirements.txt # Python dependencies
README.md # This model card
```
## Limitations
- **Synthetic training data**: The model is trained entirely on programmatically generated text. Performance on real resumes with unusual formatting, non-English content, or domain-specific jargon may be lower than synthetic test metrics suggest.
- **English only**: All training data is in English. The model will not reliably classify sections in other languages.
- **Section granularity**: The model classifies pre-split sections. The built-in text splitter uses heuristics (section headers, paragraph breaks) that may not correctly segment all resume formats (e.g., single-column vs. multi-column PDFs, tables).
- **8 categories only**: Some resume sections (e.g., Publications, Volunteer Work, Languages, Interests, References) are not covered by the current label set. These will be assigned to the closest matching category.
- **Max length**: Input is truncated to 256 tokens. Very long sections may lose information.
## Intended Use
- Automated resume parsing pipelines
- HR tech and applicant tracking systems
- Resume formatting and analysis tools
- Educational demonstrations of text classification fine-tuning
This model is **not intended** for making hiring decisions. It is a text classification tool for structural parsing only.
## Author
[Lorenzo Scaturchio](https://huggingface.co/gr8monk3ys) (gr8monk3ys)