Resume NER – Structured Resume Information Extraction

A fine-tuned DistilBERT model for extracting structured information from resume text. The raw model output is token-level BIO tags; the included post-processing turns those spans into structured resume fields.

Status

  • Base model: distilbert-base-cased
  • Task: token classification / NER
  • Labels: NAME, EMAIL, PHONE, LOCATION, COMPANY, TITLE, DATE, DEGREE, FIELD, INSTITUTION, SKILL, CERT, LANGUAGE
  • Max context: 512 tokens
  • PyTorch artifact: model.safetensors
  • ONNX artifacts: onnx/model.onnx, onnx/model_quantized.onnx

Latest NER benchmark

Latest retrain on an RTX 3080 with noise-augmented data + Kaggle silver labels (25 epochs):

  • entity F1: 97.77%
  • structured micro F1: 97.88%
  • clean resume F1: 99.18%
  • noisy resume F1: 69.24% (OCR/scraped text)

These numbers come from entity-level exact-match evaluation with seqeval and the structured extraction benchmark.

Metric note

Older versions of this repo reported token-level F1. Training now uses entity-level exact-match F1 via seqeval, which is stricter and a more meaningful measure of NER quality.
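
For reference, seqeval scores whole entities: a span only counts if every token in it is tagged correctly, whereas token-level F1 still credits partially correct spans. A minimal illustration (the tag sequences are made up):

from seqeval.metrics import f1_score

# Gold vs. predicted BIO tags for one short sequence.
y_true = [["B-NAME", "I-NAME", "O", "B-EMAIL"]]
y_pred = [["B-NAME", "O",      "O", "B-EMAIL"]]

# Entity-level exact match: NAME is a partial span (wrong), EMAIL is right,
# so precision = recall = 0.5 even though 3 of 4 tokens are tagged correctly.
print(f1_score(y_true, y_pred))  # 0.5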

Quick start

Transformers

from transformers import pipeline

ner = pipeline("token-classification", model="oksomu/resume-ner", aggregation_strategy="simple")

resume_text = "Jane Doe | jane.doe@example.com | Senior Data Engineer at Acme Corp"  # example input
results = ner(resume_text)

ONNX Runtime via Optimum

from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoTokenizer, pipeline

model = ORTModelForTokenClassification.from_pretrained("oksomu/resume-ner", file_name="onnx/model.onnx")
tokenizer = AutoTokenizer.from_pretrained("oksomu/resume-ner")
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
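
The quantized model ships in the same repo and can be loaded the same way by pointing file_name at the quantized artifact:

model_q = ORTModelForTokenClassification.from_pretrained("oksomu/resume-ner", file_name="onnx/model_quantized.onnx")
ner_q = pipeline("token-classification", model=model_q, tokenizer=tokenizer, aggregation_strategy="simple")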

Full structured extraction

import { parseResume, computeATSScore } from "resume-extract";

const result = await parseResume(resumeText, "./model");
const ats = computeATSScore(result);

npm: resume-extract | GitHub: somus/resume-extract

Usage

The model predicts these entity types:

  • NAME
  • EMAIL
  • PHONE
  • LOCATION
  • COMPANY
  • TITLE
  • DATE
  • DEGREE
  • FIELD
  • INSTITUTION
  • SKILL
  • CERT
  • LANGUAGE
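
With aggregation_strategy="simple", each result is a merged span dict (entity_group, word, score, start, end). A small sketch of bucketing those spans by entity type, reusing the ner pipeline and resume_text from the quick start:

from collections import defaultdict

fields = defaultdict(list)
for span in ner(resume_text):
    # e.g. {"entity_group": "TITLE", "word": "Senior Data Engineer", "score": 0.99, ...}
    fields[span["entity_group"]].append(span["word"])

print(fields["NAME"], fields["EMAIL"], fields["TITLE"])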

Pre-processing

The pre_processing section of resume_config.json defines the text normalization applied before NER inference:

  • normalize CRLF, em-dashes, bullet characters
  • strip Phone:/Email: labels from contact lines
  • expand flattened two-column skill tables (Category Skills → Category: Skills)
  • collapse multi-space runs

All rules are config-driven, so any language implementation can apply the same preprocessing from the JSON config.
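
A minimal sketch of what applying that config could look like; the "replacements" key and rule format here are illustrative, not the actual resume_config.json schema:

import json
import re

def load_preprocessor(config_path="resume_config.json"):
    with open(config_path) as f:
        pre = json.load(f)["pre_processing"]

    def preprocess(text: str) -> str:
        text = text.replace("\r\n", "\n")                  # normalize CRLF
        for pattern, repl in pre.get("replacements", []):  # illustrative key, not the real schema
            text = re.sub(pattern, repl, text)
        return re.sub(r"[ \t]{2,}", " ", text)             # collapse multi-space runs

    return preprocess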

Post-processing

The post_processing section of resume_config.json contains rule-based cleanup applied after NER:

  • entity-specific min lengths, exceptions, and blocked words
  • merge adjacent same-label spans (configurable labels + gap)
  • group flat spans into experience / education entries
  • infer seniority from titles + years
  • infer country from location + phone prefix (317 cities in city_country_map.json)
  • compute experience years from date ranges
  • fix company names with gazetteer (companies.json)
  • normalize skills and certifications via aliases
  • merge multi-word skills

Section detection (training/section_detector.py) fills in entities the NER model missed by using resume section context (SKILLS, CERTIFICATIONS, LANGUAGES sections).

All post-processing rules are config-driven for cross-language portability.
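
As an illustration of the span-merge step, a minimal sketch that joins adjacent spans with the same label when the gap between them is small; the mergeable labels and gap shown here are placeholders, the shipped values live in resume_config.json:

def merge_adjacent(spans, labels=("SKILL", "TITLE", "COMPANY"), max_gap=1):
    """Merge consecutive same-label spans (aggregated pipeline dicts, sorted by start)."""
    merged = []
    for span in spans:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["entity_group"] == span["entity_group"]
                and span["entity_group"] in labels
                and span["start"] - prev["end"] <= max_gap):
            prev["word"] = prev["word"] + " " + span["word"]   # extend the previous span
            prev["end"] = span["end"]
        else:
            merged.append(dict(span))
    return merged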

For a detailed implementation walkthrough of the full pipeline, see docs/implementation_guide.md.

Training data

The current checked-in dataset snapshot in training/data/ was built from mixed sources:

  • DataTurks annotated resumes
  • generated examples from datasetmaster/resumes
  • 12 manual resume templates across tech and non-tech domains
  • 50 hand-crafted long resumes (>512 tokens) for chunked inference training
  • 93 gold-labeled resume-resource PDFs (hand-annotated, 100% match rate)
  • 2,483 Kaggle resumes with Gemini-extracted silver labels + BIO tagging
  • 2x noise augmentation for OCR robustness (separator swaps, char corruption, case changes)
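
A rough sketch of the kind of character-level noise described in the last bullet; the actual transforms live in training/noise_augment.py and may differ:

import random

def add_noise(text: str, p: float = 0.03) -> str:
    """Case flips, separator swaps, and character corruption, applied per character
    so token boundaries (and therefore BIO tags) stay aligned."""
    out = []
    for ch in text:
        r = random.random()
        if r < p / 3 and ch.isalpha():
            out.append(ch.swapcase())                      # case change
        elif r < 2 * p / 3 and ch in "|;,/":
            out.append(random.choice(".: -"))              # separator swap
        elif r < p and ch.isalnum():
            out.append(random.choice("abcdefghijklmnopqrstuvwxyz"))  # char corruption
        else:
            out.append(ch)
    return "".join(out)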

Data layout:

training/data/
├── ner_train.json          # Main training set
├── ner_val.json            # Validation split
├── kaggle_train.json       # Kaggle BIO training (silver + noise-augmented)
├── gold/                   # Hand-annotated evaluation data
├── sources/                # Raw source data (CSV, JSONL, etc.)
└── resume_resource/        # Source PDFs

The current public rebuild path uses:

  • training/generate_from_structured.py – orchestrates all sources + augmentation
  • training/manual_resumes.py – short manual templates
  • training/build_gold_labels.py – gold-labeled PDF resumes
  • training/noise_augment.py – noise augmentation for OCR robustness

Training

Use Python 3.11+.

Install training stack:

uv sync

Create separate ONNX export env:

uv venv .venv-onnx-export
uv pip install --python .venv-onnx-export/bin/python -r training/requirements-export.txt

Train:

python -m training.train_ner
python -m training.train_ner --base-model deberta

Rebuild dataset from public/synthetic sources:

python -m training.generate_from_structured

Run tests:

python -m unittest discover -s tests

Run structured benchmark:

python -m training.benchmark_structured --model-dir .

Latest internal structured benchmark on current validation set:

  • overall micro F1: 97.88%
  • macro field F1: 98.47%
  • clean-resume micro F1: 99.18%
  • noisy-resume micro F1: 69.24%

This benchmark uses in-repo structured post-processing for both gold spans and model predictions. Section-aware chunked inference handles resumes exceeding the 512-token context window. These numbers are intended for internal regression tracking, not external leaderboard claims.
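
A simplified sketch of chunked inference; the in-repo version is section-aware rather than this fixed character-window split:

def ner_long(text, ner, window=1500, stride=1200):
    """Run the pipeline over overlapping character windows and merge the results."""
    seen, spans = set(), []
    for offset in range(0, len(text) or 1, stride):
        for span in ner(text[offset:offset + window]):
            key = (span["entity_group"], offset + span["start"], offset + span["end"])
            if key not in seen:                            # overlap regions yield duplicates
                seen.add(key)
                spans.append({**span, "start": offset + span["start"], "end": offset + span["end"]})
    return spans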

Export workflow

Train in the main env, then export in the separate ONNX env:

python -m training.train_ner

source .venv-onnx-export/bin/activate
optimum-cli export onnx --model training/output/resume-ner/distilbert/best --task token-classification onnx/

Quantize ONNX model:

python - <<'PY'
from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the exported ONNX model and apply dynamic INT8 quantization in place.
model = ORTModelForTokenClassification.from_pretrained("onnx")
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx", quantization_config=qconfig)
PY

Full details: docs/export.md

Artifact layout

  • root config.json, model.safetensors, tokenizer files – canonical PyTorch / Hugging Face checkpoint
  • root companies.json – company gazetteer used by post-processing
  • root city_country_map.json – 317 cities for country inference
  • label_config.json – explicit label metadata used by training/inference tooling
  • onnx/model.onnx – exported ONNX model
  • onnx/model_quantized.onnx – quantized ONNX model for smaller CPU deployment
  • resume_config.json – pre-processing, post-processing, and inference rules (single config for all languages)

ONNX validation

The current ONNX export was validated against PyTorch on the same sample input:

  • PyTorch vs ONNX logits: allclose=True (max diff: 0.000013)
  • PyTorch vs ONNX predictions: argmax_equal=True
  • PyTorch vs quantized ONNX predictions: minor diff (expected for INT8)

This is the main safety check, since training and export happen in separate environments.
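
A minimal version of that parity check, assuming it runs from a local checkout of the model repo (the in-repo script is training/validate_onnx.py):

import numpy as np
import torch
from optimum.onnxruntime import ORTModelForTokenClassification
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
pt_model = AutoModelForTokenClassification.from_pretrained(".")
ort_model = ORTModelForTokenClassification.from_pretrained(".", file_name="onnx/model.onnx")

inputs = tokenizer("Jane Doe, Senior Data Engineer at Acme Corp", return_tensors="pt")
with torch.no_grad():
    pt_logits = pt_model(**inputs).logits.numpy()
ort_logits = np.asarray(ort_model(**inputs).logits)

print("allclose:", np.allclose(pt_logits, ort_logits, atol=1e-4))
print("argmax_equal:", bool((pt_logits.argmax(-1) == ort_logits.argmax(-1)).all()))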

Repo layout

  • training/train_ner.py – training loop
  • training/generate_from_structured.py – dataset builder orchestrating all sources + augmentation
  • training/build_gold_labels.py – gold-labeled PDF resume builder
  • training/manual_resumes.py – short manual templates
  • training/noise_augment.py – noise augmentation for OCR robustness
  • training/labels.py – shared label definitions
  • training/dataset_utils.py – dedupe / split / manifest helpers
  • training/tagging.py – span tagging helpers
  • training/text_preprocess.py – config-driven text preprocessing
  • training/structured_postprocess.py – config-driven post-processing and grouping
  • training/section_detector.py – section-aware entity extraction
  • training/benchmark_structured.py – structured benchmark with chunked inference
  • training/analyze_structured_errors.py – per-resume error analysis
  • training/synthetic_assets.py – synthetic source assets/helpers
  • training/synthetic_formats.py – 8 resume format builders (A-H)
  • training/export_onnx.py – CLI wrapper for ONNX export
  • training/validate_onnx.py – PyTorch vs ONNX parity check
  • training/quantize_onnx.py – ONNX quantization helper
  • training/requirements-export.txt – separate ONNX export env
  • run_inference.py – single-resume inference script
  • docs/implementation_guide.md – detailed pre/post processing implementation guide
  • docs/export.md – ONNX export and validation notes

Limitations

  • English resumes only
  • max 512 tokens per chunk (section-aware chunking handles longer resumes)
  • image-based/scanned PDFs require OCR before text extraction
  • two-column PDF layouts may flatten during text extraction

License

Apache 2.0
