jobbert-ner-sonnet-v2

Distilled Named Entity Recognition model for English-language job postings. One of six students produced for the paper "Distributed NER on Spark: A Teacher-Student Pipeline for Large-Scale Entity Extraction from Job Postings" (Soltani and Hanine 2026).

  • Teacher: Claude Sonnet 4.6 (labels acquired via AWS Bedrock)
  • Architecture: jjzha/jobbert-base-cased fine-tuned for 8-class token classification with chunked windows (450 tokens, stride 225) — paper §4.3.3 follow-up to v1
  • Student identifier: s3_jobbert_chunked_sonnet
  • Artefact size: ~411 MB

This is the v2 chunked-window variant of S3. It lifts micro-F1 by 0.044 over the v1 dense JobBERT (S3, 0.281 F1) by training and inferring over 450-token chunks with a 225-token stride, which gives full-text coverage. The gain closes only ~17–20% of the gap to the spaCy students (S1/S2, 0.51–0.53 F1). The residual is structural (a BIO/WordPiece mismatch on noun-phrase soft-boundary types such as SKILL, EDUCATION and EXPERIENCE_LEVEL), not a coverage artefact. See paper §4.3.3 for the analysis.
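
The ~17–20% figure can be checked with a few lines of arithmetic. The sketch below assumes the gap is measured from the v1 score (0.281) to each end of the spaCy range quoted above; the function name is illustrative and does not come from the project repository.

```python
# Figures quoted in this card. Assumption: the "gap" runs from the v1 score
# up to the spaCy students' score.
V1_F1 = 0.281
GAIN = 0.044
SPACY_F1_RANGE = (0.51, 0.53)


def share_of_gap_closed(gain, baseline, target):
    """Fraction of the baseline-to-target gap that `gain` covers."""
    return gain / (target - baseline)


shares = [share_of_gap_closed(GAIN, V1_F1, target) for target in SPACY_F1_RANGE]
for target, share in zip(SPACY_F1_RANGE, shares):
    print(f"spaCy F1 {target:.2f}: {share:.1%} of the gap closed")
```

Under that assumption the two shares come out at roughly 19% and 18%, consistent with the range stated above.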

Intended use

Entity extraction from English-language job-posting descriptions into an eight-type schema:

SKILL, JOB_TITLE, COMPANY, LOCATION, EXPERIENCE_LEVEL, EDUCATION, CERT, COMPENSATION.

Appropriate downstream applications include posting indexing for search and analytics, skill-demand aggregation for labour-market research, cost-quality-speed benchmarking of distilled NER, and teaching use in NLP / distillation courses.

Out-of-scope use

Not suitable for:

  • CVs or résumés (different register; a CV-trained model should be used instead).
  • Non-English postings.
  • Fully-automated candidate screening or hiring decisions; downstream ranking or filtering should be built only after an application-side schema and bias review (see Ethical considerations).
  • Medical, legal, financial or other high-stakes decision support.
  • Posting text from languages or locales for which the underlying teacher labels were not representative.

Training

  • Teacher labels: 5,000 stratified postings labelled by Claude Sonnet 4.6 in a single run at temperature 0. max_tokens was raised from 4,096 to 8,192 mid-run after two truncation failures on entity-dense postings; final labels from the fixed-ceiling run were used.
  • Curator: 80/10/10 train/dev/test split by md5(job_link) mod 10, so Sonnet- and Haiku-trained students see the same posting partitions.
  • Hardware: one NVIDIA A10G 24 GB GPU (AWS g5.xlarge).
  • Training seed: 42.
  • Principal hyperparameters and full training spec: pipeline/training/experiments/specs/s3_jobbert_chunked_sonnet.yaml in the accompanying project repository.
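
A minimal sketch of a deterministic 80/10/10 split keyed on md5(job_link) mod 10, as described in the Curator bullet. The card does not say which buckets map to which split, how the link is encoded, or how the digest is turned into an integer; the choices below (UTF-8 encoding, the full hex digest read as an integer, buckets 0–7 for train, 8 for dev, 9 for test) are assumptions for illustration, and `split_for` is a hypothetical name.

```python
import hashlib


def split_for(job_link: str) -> str:
    """Assign a posting to train/dev/test from a hash of its link.

    Assumed mapping (not confirmed by the card): buckets 0-7 -> train,
    8 -> dev, 9 -> test.
    """
    digest = hashlib.md5(job_link.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:
        return "train"
    if bucket == 8:
        return "dev"
    return "test"


# The assignment depends only on the link, so every student trained from the
# same corpus sees the same partition regardless of which teacher labelled it.
links = [f"https://example.com/jobs/{i}" for i in range(5)]
for link in links:
    print(link, "->", split_for(link))
```

Because the split is a pure function of the link, re-running the curator or swapping the teacher does not move postings between partitions.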

Evaluation

Sonnet-trained students evaluate on all 516 gold postings; Haiku-trained students evaluate on 515 because one posting was dropped by the curator for zero-entity teacher output during the Haiku run. Metric: micro-F1 over exact (text, type) tuples; character-offset matching is relaxed. Entities are deduplicated within a posting before comparison.
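
The metric described above can be sketched as follows: entities are (text, type) tuples, duplicates within a posting are collapsed, and counts are pooled over postings before precision, recall and F1 are computed. This is a minimal illustration, not the project's evaluation script; the function name and the absence of any text normalisation (casing, whitespace) are assumptions.

```python
def micro_f1(gold_by_posting, pred_by_posting):
    """Micro-averaged P/R/F1 over exact (text, type) tuples.

    Both arguments map a posting id to a list of (text, type) tuples.
    Only postings present in the gold mapping are scored.
    """
    tp = fp = fn = 0
    for posting_id, gold in gold_by_posting.items():
        gold_set = set(gold)  # deduplicate within the posting
        pred_set = set(pred_by_posting.get(posting_id, []))
        tp += len(gold_set & pred_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {
    "p1": [("PyTorch", "SKILL"), ("Berlin", "LOCATION")],
    "p2": [("Acme Corp", "COMPANY")],
}
pred = {
    "p1": [("PyTorch", "SKILL"), ("PyTorch", "SKILL"), ("AWS", "SKILL")],
    "p2": [("Acme Corp", "COMPANY")],
}
precision, recall, f1 = micro_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In the toy example the repeated PyTorch prediction counts once, giving two true positives, one false positive (AWS) and one false negative (Berlin).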

Overall                        Value
Micro-F1                       0.3255
Precision                      0.2273
Recall                         0.5734
95% CI                         [0.312, 0.339] (posting-level bootstrap, 10,000 resamples)
Latency mean (eval hardware)   1562.79 ms / document
Latency p99 (eval hardware)    4541.83 ms / document
Text coverage                  full text via 450-token chunked windows with 225-token stride; per-token logits aggregated across overlapping windows
Postings evaluated             516 (of the 516-posting gold set)
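
A posting-level bootstrap of the kind named in the 95% CI row can be sketched as below: postings are resampled with replacement, true-positive, false-positive and false-negative counts are pooled, and micro-F1 is recomputed for each resample. The percentile interval, the seed, and the function names are assumptions; the card states only "posting-level bootstrap, 10,000 resamples", so the project's own script may differ in detail.

```python
import random


def f1_from_counts(tp, fp, fn):
    """Micro-F1 from pooled counts; 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(per_posting_counts, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for micro-F1, resampling whole postings.

    `per_posting_counts` is a list of (tp, fp, fn) tuples, one per posting.
    """
    rng = random.Random(seed)
    n = len(per_posting_counts)
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(per_posting_counts, k=n)  # with replacement
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        scores.append(f1_from_counts(tp, fp, fn))
    scores.sort()
    lo_idx = max(0, int(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return scores[lo_idx], scores[hi_idx]


# Toy counts for five postings; real use would pass one tuple per gold posting.
toy_counts = [(3, 5, 1), (2, 9, 2), (4, 4, 0), (1, 7, 3), (5, 6, 2)]
low, high = bootstrap_ci(toy_counts, n_resamples=1000)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Resampling postings rather than individual entities keeps entities from the same posting together, which matters when errors cluster within entity-dense postings.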

Per-entity-type

Entity type        P      R      F1
COMPANY            0.510  0.563  0.535
JOB_TITLE          0.475  0.661  0.553
LOCATION           0.480  0.800  0.600
COMPENSATION       0.240  0.534  0.331
EDUCATION          0.153  0.257  0.192
CERT               0.132  0.293  0.182
EXPERIENCE_LEVEL   0.068  0.123  0.087
SKILL              0.121  0.707  0.207

Teacher comparison

The teacher (Claude Sonnet 4.6) reaches micro-F1 = 0.5171 against the same gold set (95% bootstrap CI [0.503, 0.530]). The student trails the teacher by 0.192 points absolute (37.1% relative). See paper §4.3 for the full comparison and the error-mode analysis of this student's residuals.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("AchrafSoltani/jobbert-ner-sonnet-v2")
model     = AutoModelForTokenClassification.from_pretrained("AchrafSoltani/jobbert-ner-sonnet-v2")
ner       = pipeline("token-classification", model=model, tokenizer=tokenizer,
                     aggregation_strategy="simple")

text = 'Senior Machine Learning Engineer at Acme Corp in Berlin. Requires 5+ years of experience with PyTorch, AWS, and Kubernetes. MSc in Computer Science preferred. Salary $140,000 – $180,000.'
for ent in ner(text):
    print(ent["word"], "->", ent["entity_group"])

# Produces (verified on this release; note the BERT wordpiece tokenisation
# artefacts in numeric spans):
# Senior Machine Learning Engineer     -> JOB_TITLE
# Acme Corp                            -> COMPANY
# Berlin                               -> LOCATION
# 5 + years of experience              -> EXPERIENCE_LEVEL
# PyTorch                              -> SKILL
# AWS                                  -> SKILL
# Kubernetes                           -> SKILL
# MSc in Computer Science              -> EDUCATION
# $ 140, 000 – $ 180, 000              -> COMPENSATION
#
# (illustrative; on long postings the chunked-inference helper
# in the project repo aggregates predictions across windows)

# Note: this checkpoint was trained with 450-token chunked windows and a
# 225-token stride. The single-call `pipeline()` above only sees the first
# 512 tokens; to reproduce the paper's full-text F1 (0.325 / 0.328), apply
# the same chunk-and-aggregate scheme used by the eval script in the
# project repo at `pipeline/scripts/bootstrap_ci_chunked.py`.
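
The chunk-and-aggregate scheme mentioned in the note above can be sketched in plain Python, independently of the project's helper. The sketch takes "stride" to mean the step between window starts (with a 450-token window and a 225-token stride this coincides with a 225-token overlap) and takes "aggregated" to mean a per-token mean of logits over every window that covers the token. Both readings, and the function names, are assumptions; the repository script may aggregate differently.

```python
def window_spans(n_tokens, window=450, stride=225):
    """Return (start, end) token spans that together cover [0, n_tokens)."""
    if n_tokens <= window:
        return [(0, n_tokens)]
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans


def average_logits(window_logits, spans, n_tokens, n_labels):
    """Average per-token logits over all windows covering each token.

    `window_logits[i]` holds one row of `n_labels` floats per token of
    `spans[i]`, in order.
    """
    sums = [[0.0] * n_labels for _ in range(n_tokens)]
    counts = [0] * n_tokens
    for (start, _end), logits in zip(spans, window_logits):
        for offset, row in enumerate(logits):
            position = start + offset
            for label, value in enumerate(row):
                sums[position][label] += value
            counts[position] += 1
    return [[s / counts[i] for s in sums[i]] for i in range(n_tokens)]


# A 1,000-token posting needs four overlapping windows.
print(window_spans(1000))
```

In a real run, each span would be fed to the model as one forward pass (plus special tokens), and the averaged logits would be decoded to BIO tags once per token. The number of windows, and hence the latency, grows linearly with posting length.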

Ethical considerations

This model extracts entities from job postings, a document class whose downstream consumers are typically hiring, ranking, or matching systems. Three cautions are carried over from paper §6:

  • Schema-induced bias. SKILL over-extraction is inherited from the LLM teacher; soft-skill phrases ("communication skills", "interpersonal skills") and generic tools ("Excel", "CRM") are over-represented relative to a tighter gold standard. A downstream ranker that treats such phrases as filters is encoding the teacher's lexical habits as a hiring criterion and is not recommended without a schema review at the application layer.
  • Contested ground truth. A vendor benchmark in the paper against LinkedIn's own job_skills.csv on 938,028 jointly-present postings yielded 9.56% agreement and 56.44% discovery: the two extraction schemas produce largely non-overlapping views of the same corpus. Neither constitutes a ground truth; the numbers measure schema divergence, not model quality.
  • Consent and licensing. The training corpus is a publicly-released Kaggle redistribution of scraped LinkedIn postings. Individuals named in postings (recruiters, hiring managers) did not consent to having their role descriptions re-processed for research. The model is licensed CC BY-NC 4.0 for research and non-commercial evaluation only; any commercial deployment requires a separate legal and ethical review against the data-provenance chain.

Limitations

  • Trained and evaluated on English-language LinkedIn postings from a publicly-released 2024 Kaggle redistribution; generalisation to other platforms (Indeed, Stack Overflow, regional job boards) or other languages is unevaluated.
  • Gold set is single-annotator (516 postings). Intra-annotator stability was scheduled to be measured one week after the main annotation pass; users should treat the reported F1 as having an un-quantified annotator-noise floor until that number lands.
  • Output schema is locked to the eight types above. Finer-grained or taxonomy-aligned schemas require re-training against new labels.
  • The underlying BERT encoder still has a 512-token limit per forward pass. This checkpoint achieves full-text coverage by training and inferring over 450-token chunks with a 225-token stride, then aggregating overlapping predictions. Mean latency is therefore ~1.5–1.7 s per posting (vs ~16 ms for the v1 single-window variant), and per-posting compute scales linearly with text length. For latency-sensitive deployments at full-text coverage, prefer the spaCy variant; for highest single-window throughput on short text, prefer the v1 dense JobBERT (S3/S4).
  • Soft-boundary entity types (SKILL, EDUCATION, EXPERIENCE_LEVEL) remain the dominant residual error source. Per paper §4.3.3, the chunked-window improvement (+0.044 F1 over v1) closes only ~17–20% of the gap to the spaCy student; the remaining gap is structural in the BIO/WordPiece formulation rather than a coverage artefact. A span-level objective on top of BIO is the suggested follow-up.

Citation

@unpublished{soltani2026distilledner,
  author = {Achraf Soltani and Mohamed Hanine},
  title  = {Distributed NER on Spark: A Teacher-Student Pipeline for Large-Scale Entity Extraction from Job Postings},
  year   = {2026},
  note   = {Advisor: Prof.\ Hanine Mohamed},
  url    = {https://github.com/achrafsoltani/distributed-ner-on-spark},
}

Licence

  • Model weights: CC BY-NC 4.0 — research and non-commercial evaluation only.
  • Source code in the accompanying repository: Apache 2.0.