metadata
license: mit
language: en
pipeline_tag: token-classification
tags:
  - ner
  - resume-parsing
  - cv-parser
base_model: roberta-base

CV Parser NER — roberta-base (v2)

Token-classification model that extracts Job Titles, Skills, and Education from resumes/CVs using a BIO tag scheme.

Provenance

  • Retrained from scratch, starting from the roberta-base checkpoint, on dataset 4 (resume_bio_annotated_full.csv; 2,483 resumes split 1,739 train / 372 val / 372 test), the team's finalized dataset, relabelled with AI-Studio/Vertex.
  • Reproduced end-to-end with the project notebooks/scripts (retokenize.py + train_bert_run.py).
  • Base model: roberta-base · epochs: 5 · learning rate: 3e-5 · max_length: 512 · stride: 128 · seed: 42.
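
To illustrate how max_length 512 and stride 128 interact on resumes longer than one window, here is a minimal sketch in plain Python. It assumes that stride means the number of tokens shared by consecutive windows (as in the Hugging Face tokenizers' `stride` argument) and that each window reserves 2 positions for RoBERTa's special tokens. The function name `window_bounds` and its parameters are hypothetical; the actual logic in retokenize.py is not shown here and may differ.

```python
from typing import List, Tuple


def window_bounds(
    n_tokens: int,
    max_length: int = 512,
    stride: int = 128,
    num_special: int = 2,  # assumption: <s> and </s> occupy two positions per window
) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of overlapping windows over a document."""
    capacity = max_length - num_special  # content tokens that fit in one window
    step = capacity - stride             # how far each new window advances
    if step <= 0:
        raise ValueError("stride must be smaller than the usable window size")

    bounds = []
    start = 0
    while True:
        end = min(start + capacity, n_tokens)
        bounds.append((start, end))
        if end >= n_tokens:
            break
        start += step
    return bounds


if __name__ == "__main__":
    # A 1,000-token resume is covered by three windows that overlap by 128 tokens.
    print(window_bounds(1000))
```

Under these assumptions, a resume that fits in 510 content tokens yields a single window, and longer resumes are covered by windows that each repeat the last 128 tokens of the previous one, so an entity near a window boundary is seen in full at least once.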

Resume-level performance (dataset-4 splits)

split precision recall F1
validation 0.6397
test 0.6563

Labels

O, B-JOB_TITLE, I-JOB_TITLE, B-SKILL, I-SKILL, B-EDUCATION, I-EDUCATION
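
As an illustration of how the BIO tags map back to entities, the sketch below builds label/id mappings and groups tagged tokens into typed spans. The id order shown is an assumption that simply follows the order of the list above (the model's config is the authority on the real mapping), and the helper `bio_to_spans` is hypothetical rather than part of the project scripts.

```python
from typing import List, Tuple

LABELS = [
    "O",
    "B-JOB_TITLE", "I-JOB_TITLE",
    "B-SKILL", "I-SKILL",
    "B-EDUCATION", "I-EDUCATION",
]
# Assumed id order: same as the list above; check the model config for the real one.
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group word-level BIO tags into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    cur_type = None
    cur_tokens: List[str] = []

    def flush() -> None:
        if cur_type is not None:
            spans.append((cur_type, " ".join(cur_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != cur_type):
            # A B- tag starts a new span; a stray I- tag is leniently treated the same way.
            flush()
            cur_type, cur_tokens = ent_type, [token]
        elif prefix == "I":
            cur_tokens.append(token)  # continue the current span
        else:
            flush()                   # "O" closes any open span
            cur_type, cur_tokens = None, []
    flush()
    return spans


if __name__ == "__main__":
    tokens = ["Senior", "Data", "Engineer", "skilled", "in", "Python", "and", "SQL"]
    tags = ["B-JOB_TITLE", "I-JOB_TITLE", "I-JOB_TITLE", "O", "O", "B-SKILL", "O", "B-SKILL"]
    print(bio_to_spans(tokens, tags))
```

The example operates on whole words; when working from subword predictions, the tags would first need to be aligned back to words, which this sketch does not cover.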