	---
	license: mit
	language: en
	pipeline_tag: token-classification
	tags:
	- ner
	- resume-parsing
	- cv-parser
	base_model: roberta-base
	---

	# CV Parser NER — roberta-base (v2)

	Token-classification model that extracts Job Titles, Skills, and
	Education from resumes/CVs using a BIO tag scheme.

	## Provenance
	- Retrained from scratch, starting from the pretrained `roberta-base` weights, on dataset 4
	(`resume_bio_annotated_full.csv`, 2,483 resumes: 1,739 train / 372 val / 372 test),
	the team's finalized AI-Studio/Vertex-relabelled dataset.
	- Reproduced end-to-end with the project notebooks/scripts
	(`retokenize.py` + `train_bert_run.py`).
	- Base model: `roberta-base` · epochs: 5 · learning rate: 3e-5 ·
	max_length: 512 · stride: 128 · seed: 42.
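
The `max_length` and `stride` settings above suggest that long resumes are split into overlapping windows before training and inference. The project scripts are not shown here, so the following is only a minimal sketch of that idea, assuming `stride` means the number of tokens shared by consecutive windows (the convention used by Hugging Face tokenizers) and ignoring special tokens. The function name `sliding_windows` is hypothetical.

```python
from typing import List


def sliding_windows(
    token_ids: List[int], max_length: int = 512, stride: int = 128
) -> List[List[int]]:
    """Split a token sequence into windows of at most `max_length` tokens.

    Assumption: consecutive windows overlap by `stride` tokens, so each
    new window starts `max_length - stride` tokens after the previous one.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        # Stop once the current window reaches the end of the sequence.
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows


# A 1,000-token resume with the settings listed on this card.
example_windows = sliding_windows(list(range(1000)), max_length=512, stride=128)
print([len(w) for w in example_windows])
```

The overlap gives entities that straddle a window boundary a chance to appear whole in at least one window; how predictions from overlapping regions are merged is a separate design choice that this card does not describe.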

	## Resume-level performance (dataset-4 splits)
	| split | precision | recall | F1 |
	|-------|-----------|--------|----|
	| validation | not reported | not reported | 0.6397 |
	| test | not reported | not reported | 0.6563 |
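
For reference, F1 is the harmonic mean of precision and recall. The card does not state how the resume-level scores were matched or aggregated, so the sketch below is an assumption rather than the evaluation code behind the table: it compares the set of predicted `(label, text)` entities for one resume against the gold set using exact matching. The name `precision_recall_f1` and the sample entities are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (label, entity text)


def precision_recall_f1(
    gold: Set[Entity], predicted: Set[Entity]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 for one resume.

    Assumption: an entity counts as correct only if both its label and
    its text match a gold entity exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative data only, not taken from dataset 4.
gold_entities = {
    ("SKILL", "Python"),
    ("SKILL", "SQL"),
    ("JOB_TITLE", "Data Engineer"),
}
predicted_entities = {
    ("SKILL", "Python"),
    ("JOB_TITLE", "Data Engineer"),
    ("EDUCATION", "MSc"),
}
print(precision_recall_f1(gold_entities, predicted_entities))
```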

	## Labels
	`O, B-JOB_TITLE, I-JOB_TITLE, B-SKILL, I-SKILL, B-EDUCATION, I-EDUCATION`
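
To turn per-token BIO tags into entities, consecutive `B-`/`I-` tags of the same type are merged into one span. The following is a minimal sketch of that decoding step using the label set above; it works on whole words rather than RoBERTa subword tokens, treats a stray `I-` tag as the start of a new span, and is not taken from the project scripts. The function name `bio_to_spans` is hypothetical.

```python
from typing import List, Optional, Tuple

# Label set as listed on this card.
LABELS = [
    "O",
    "B-JOB_TITLE", "I-JOB_TITLE",
    "B-SKILL", "I-SKILL",
    "B-EDUCATION", "I-EDUCATION",
]


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any.
        if current_tokens and current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag not in LABELS:
            raise ValueError(f"unknown tag: {tag}")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the open span.
            current_tokens.append(token)
        elif prefix in ("B", "I"):
            # New span; a stray I- tag is treated like B- (assumption).
            flush()
            current_type, current_tokens = entity_type, [token]
        else:
            # "O" closes any open span.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


words = ["Senior", "Data", "Engineer", "with", "Python", "and", "SQL"]
word_tags = ["B-JOB_TITLE", "I-JOB_TITLE", "I-JOB_TITLE", "O", "B-SKILL", "O", "B-SKILL"]
print(bio_to_spans(words, word_tags))
```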