---
license: mit
language: en
pipeline_tag: token-classification
tags:
- ner
- resume-parsing
- cv-parser
base_model: roberta-base
---
# CV Parser NER — roberta-base (v2)
Token-classification model that extracts **Job Titles**, **Skills**, and
**Education** from resumes/CVs using a BIO tag scheme.
## Provenance
- **Trained from scratch on dataset 4** (`resume_bio_annotated_full.csv`,
2,483 resumes: 1,739 train / 372 val / 372 test, roughly a 70/15/15 split),
the team's finalized AI-Studio/Vertex-relabelled dataset.
- Reproduced end-to-end with the project notebooks/scripts
(`retokenize.py` + `train_bert_run.py`).
- Base model: `roberta-base` · epochs: 5 · learning rate: 3e-5 ·
max_length: 512 · stride: 128 · seed: 42.
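
To illustrate how `max_length 512` and `stride 128` might interact on long resumes, here is a minimal sketch of overlapping-window chunking. It assumes `stride` means the number of tokens shared between consecutive windows (the convention used by Hugging Face tokenizers) and ignores special tokens; the helper `window_spans` is hypothetical and is not taken from `retokenize.py`.

```python
def window_spans(n_tokens, max_length=512, stride=128):
    """Return (start, end) token index pairs for overlapping windows.

    Assumption: `stride` is the overlap between consecutive windows,
    so each new window starts `max_length - stride` tokens later.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if n_tokens <= 0:
        return []

    step = max_length - stride
    spans = []
    start = 0
    while True:
        end = min(start + max_length, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            # The last window reaches the end of the document.
            break
        start += step
    return spans


# A 1,000-token resume is covered by three windows overlapping by 128 tokens.
print(window_spans(1000))
```

Under this reading, a resume shorter than 512 tokens fits in a single window, and longer resumes are split so that an entity falling on a window boundary is still seen whole in at least one window, as long as it is no longer than the overlap.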
## Resume-level performance (dataset-4 splits)
| split | precision | recall | F1 |
|-------|-----------|--------|----|
| validation | — | — | 0.6397 |
| test | — | — | 0.6563 |
## Labels
`O, B-JOB_TITLE, I-JOB_TITLE, B-SKILL, I-SKILL, B-EDUCATION, I-EDUCATION`
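
As an illustration of the BIO scheme, the sketch below groups per-token tags into labelled entity spans. It assumes word-level tokens that are already aligned with their predicted tags; `bio_to_spans` is a hypothetical helper, not part of the project scripts, and it simply drops an `I-` tag that does not continue an open span of the same type.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans = []
    current_label = None
    current_tokens = []

    def flush():
        # Emit the span collected so far, if any.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always closes the open span and starts a new one.
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span: close and skip.
            flush()
            current_label = None
            current_tokens = []

    flush()
    return spans


tokens = ["Senior", "Data", "Engineer", "with", "Python", "and", "SQL"]
tags = ["B-JOB_TITLE", "I-JOB_TITLE", "I-JOB_TITLE", "O", "B-SKILL", "O", "B-SKILL"]
print(bio_to_spans(tokens, tags))
```

In practice the model predicts tags per subword, so a real pipeline would first map subword predictions back to words (or character offsets) before grouping them this way.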