license: mit
language: en
pipeline_tag: token-classification
tags:
  - ner
  - resume-parsing
  - cv-parser
base_model: roberta-base

CV Parser NER — roberta-base (v2)

Token-classification model that extracts Job Titles, Skills, and Education from resumes/CVs using a BIO tag scheme.

Provenance

Trained from scratch on dataset 4 (resume_bio_annotated_full.csv; 2,483 resumes split into 1,739 train / 372 validation / 372 test), the team's finalized dataset, relabelled with AI Studio/Vertex.
Reproduced end-to-end with the project notebooks/scripts (retokenize.py + train_bert_run.py).
Base model: roberta-base · epochs: 5 · learning rate: 3e-5 · max_length: 512 · stride: 128 · seed: 42.
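
The max_length and stride settings suggest that long resumes are split into overlapping windows. The sketch below shows one way such windows could be laid out. It assumes that stride is the number of tokens shared by consecutive windows (the convention used by Hugging Face tokenizers with return_overflowing_tokens=True) and that two positions per window are reserved for special tokens. The function name sliding_windows and the num_special parameter are hypothetical; the actual logic in retokenize.py is not shown in this card and may differ.

```python
from typing import List, Tuple


def sliding_windows(
    n_tokens: int,
    max_length: int = 512,
    stride: int = 128,
    num_special: int = 2,  # assumption: <s> and </s> take two positions per window
) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of overlapping windows over a document."""
    window = max_length - num_special  # content tokens that fit in one window
    if stride >= window:
        raise ValueError("stride must be smaller than the usable window size")
    if n_tokens <= 0:
        return []

    step = window - stride  # consecutive windows share `stride` tokens
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return spans


# A 1,000-token resume needs three windows; a 300-token resume fits in one.
print(sliding_windows(1000))
print(sliding_windows(300))
```

Because windows overlap, a token in the shared region receives more than one prediction, so any inference code built on this layout needs a rule for merging them (for example, keeping the prediction from the window where the token has the most context).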

split	precision	recall	F1
validation	—	—	0.6397
test	—	—	0.6563
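
The card reports F1 only and does not say whether it is computed per token or per entity, nor which library produced it. As a reminder of how the three columns relate, here is a minimal sketch of entity-level scoring under exact span matching. The function name entity_prf and the (type, start, end) span format are assumptions made for illustration, not the project's evaluation code.

```python
from typing import Iterable, Tuple

Span = Tuple[str, int, int]  # (entity type, start token, end token), hypothetical format


def entity_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1 where a prediction counts only on an exact match."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [("JOB_TITLE", 0, 3), ("SKILL", 4, 5), ("SKILL", 6, 7)]
pred = [("JOB_TITLE", 0, 3), ("SKILL", 4, 5), ("EDUCATION", 8, 10)]
precision, recall, f1 = entity_prf(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```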

Label set (BIO scheme): O, B-JOB_TITLE, I-JOB_TITLE, B-SKILL, I-SKILL, B-EDUCATION, I-EDUCATION
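
To turn per-token BIO tags into usable entities, the tags have to be grouped into spans. The sketch below shows one simple way to do that in plain Python. The id order in LABELS is an assumption taken from the order of the list above (the model's own config is authoritative), and bio_to_spans is a hypothetical helper, not part of the project scripts. It operates on whole words; mapping subword tokens back to words is left out.

```python
from typing import Dict, List, Tuple

# Assumption: ids follow the order in which the labels are listed in this card.
LABELS = [
    "O",
    "B-JOB_TITLE", "I-JOB_TITLE",
    "B-SKILL", "I-SKILL",
    "B-EDUCATION", "I-EDUCATION",
]
id2label: Dict[int, str] = dict(enumerate(LABELS))
label2id: Dict[str, int] = {label: i for i, label in id2label.items()}


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged words into (entity type, text) pairs.

    An I- tag that does not continue an open entity of the same type is
    treated like O (a strict reading of the scheme; other policies exist).
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_words: List[str] = []

    def flush() -> None:
        if current_type is not None:
            spans.append((current_type, " ".join(current_words)))

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            flush()  # close any open entity before starting a new one
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)  # continue the open entity
        else:
            flush()  # "O" or a stray I- tag ends the open entity
            current_type, current_words = None, []
    flush()
    return spans


words = ["Senior", "Data", "Engineer", "with", "Python", "and", "SQL"]
tag_ids = [1, 2, 2, 0, 3, 0, 3]
tags = [id2label[i] for i in tag_ids]
print(bio_to_spans(words, tags))
```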