jobbert-ner-sonnet-v2
Distilled Named Entity Recognition model for English-language job postings. One of six students produced for the paper Distributed NER on Spark: A Teacher-Student Pipeline for Large-Scale Entity Extraction from Job Postings (Soltani and Hanine 2026).
- Teacher: Claude Sonnet 4.6 (labels acquired via AWS Bedrock)
- Architecture: jjzha/jobbert-base-cased fine-tuned for 8-class token classification with chunked windows (450 tokens, stride 225) — paper §4.3.3 follow-up to v1
- Student identifier: `s3_jobbert_chunked_sonnet`
- Artefact size: ~411 MB
v2 is the chunked-window variant of S3. It lifts micro-F1 by 0.044 over the v1 dense JobBERT (S3, 0.281 F1) by training and inferring over 450-token chunks with a 225-token stride, which gives full-text coverage. That gain closes only ~17–20% of the gap to the spaCy students (S1/S2, 0.51–0.53 F1); the residual is structural (a BIO/WordPiece mismatch on noun-phrase, soft-boundary types such as SKILL, EDUCATION and EXPERIENCE_LEVEL), not a coverage artefact. See paper §4.3.3 for the analysis.
Intended use
Entity extraction from English-language job-posting descriptions into an eight-type schema:
SKILL, JOB_TITLE, COMPANY, LOCATION, EXPERIENCE_LEVEL, EDUCATION, CERT, COMPENSATION.
Appropriate downstream applications include posting indexing for search and analytics, skill-demand aggregation for labour-market research, cost-quality-speed benchmarking of distilled NER, and teaching use in NLP / distillation courses.
Out-of-scope use
Not suitable for:
- CVs or résumés (different register; a CV-trained model should be used instead).
- Non-English postings.
- Fully-automated candidate screening or hiring decisions; downstream ranking or filtering should be built only after an application-side schema and bias review (see Ethical considerations).
- Medical, legal, financial or other high-stakes decision support.
- Posting text from languages or locales for which the underlying teacher labels were not representative.
Training
- Teacher labels: 5,000 stratified postings labelled by Claude Sonnet 4.6 in a single run at temperature 0. `max_tokens` was raised from 4,096 to 8,192 mid-run after two truncation failures on entity-dense postings; final labels from the fixed-ceiling run were used.
- Curator: 80/10/10 train/dev/test split by `md5(job_link) mod 10`, so Sonnet- and Haiku-trained students see the same posting partitions.
- Hardware: one NVIDIA A10G 24 GB GPU (AWS g5.xlarge).
- Training seed: 42.
- Principal hyperparameters and full training spec: `pipeline/training/experiments/specs/s3_jobbert_chunked_sonnet.yaml` in the accompanying project repository.
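The curator's split rule can be sketched as follows. This is a minimal illustration, not the project's code: the card states only the 80/10/10 ratio and the `md5(job_link) mod 10` rule, so reading the hex digest as an integer, the bucket-to-split mapping (0 to 7 train, 8 dev, 9 test), the `assign_split` name, and the example URLs are all assumptions.

```python
import hashlib


def assign_split(job_link: str) -> str:
    """Map a posting URL to a split via md5(job_link) mod 10 (hypothetical helper)."""
    digest = hashlib.md5(job_link.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:       # assumed: buckets 0-7 -> train (80%)
        return "train"
    if bucket == 8:      # assumed: bucket 8 -> dev (10%)
        return "dev"
    return "test"        # assumed: bucket 9 -> test (10%)


# Hypothetical URLs, for illustration only.
links = [f"https://example.com/jobs/{i}" for i in range(1000)]
counts = {"train": 0, "dev": 0, "test": 0}
for link in links:
    counts[assign_split(link)] += 1
print(counts)
```

Because the assignment depends only on the posting's link, any student trained from the same corpus lands on the same partitions regardless of which teacher produced its labels.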
Evaluation
Sonnet-trained students evaluate on all 516 gold postings; Haiku-trained students evaluate on 515 because one posting was dropped by the curator for zero-entity teacher output during the Haiku run. Metric: micro-F1 over exact (text, type) tuples; character-offset matching is relaxed. Entities are deduplicated within a posting before comparison.
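The metric described above can be sketched in a few lines. This is an illustrative reading of the description rather than the project's evaluation script: it assumes entity text is compared as an exact string with no case or whitespace normalisation, and the `micro_f1` name and the toy postings are hypothetical.

```python
def micro_f1(gold, pred):
    """Micro-averaged P/R/F1 over (text, type) tuples.

    gold, pred: one iterable of (text, type) tuples per posting.
    Offsets are ignored; duplicates within a posting count once.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)  # deduplicate within a posting
        tp += len(g_set & p_set)
        fp += len(p_set - g_set)
        fn += len(g_set - p_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: two postings; the repeated PyTorch prediction is counted once.
gold = [[("PyTorch", "SKILL"), ("Berlin", "LOCATION")], [("Acme Corp", "COMPANY")]]
pred = [[("PyTorch", "SKILL"), ("PyTorch", "SKILL"), ("AWS", "SKILL")],
        [("Acme Corp", "COMPANY")]]
precision, recall, f1 = micro_f1(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In the toy data there are two true positives, one false positive (AWS) and one false negative (Berlin), so precision, recall and F1 all come out at 2/3.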
| Overall | Value |
|---|---|
| Micro-F1 | 0.3255 |
| Precision | 0.2273 |
| Recall | 0.5734 |
| 95% CI | [0.312, 0.339] (posting-level bootstrap, 10,000 resamples) |
| Latency mean (eval hardware) | 1562.79 ms / document |
| Latency p99 (eval hardware) | 4541.83 ms / document |
| Text coverage | full text via 450-token chunked windows with 225-token stride; per-token logits aggregated across overlapping windows |
| Postings evaluated | 516 (of the 516-posting gold set) |
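A posting-level bootstrap interval of the kind reported in the table can be sketched as below. The card gives only the resampling unit (postings) and the number of resamples, so the percentile method, the seed, the `bootstrap_f1_ci` name, and the toy counts are assumptions made for illustration.

```python
import random


def bootstrap_f1_ci(counts, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for micro-F1, resampling whole postings.

    counts: one (tp, fp, fn) tuple per posting.
    """
    rng = random.Random(seed)
    n = len(counts)
    scores = []
    for _ in range(n_resamples):
        tp = fp = fn = 0
        for _ in range(n):  # draw n postings with replacement
            c_tp, c_fp, c_fn = counts[rng.randrange(n)]
            tp += c_tp
            fp += c_fp
            fn += c_fn
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # micro-F1 of the resample
    scores.sort()
    low = scores[int((alpha / 2) * n_resamples)]
    high = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# 100 hypothetical postings; fewer resamples than the card's 10,000 to keep the demo quick.
toy_counts = [(3, 5, 1), (2, 8, 2), (4, 6, 0), (1, 4, 3)] * 25
low, high = bootstrap_f1_ci(toy_counts, n_resamples=1000)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Resampling postings rather than individual entities keeps entities from the same posting together, which matters when errors cluster within a document.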
Per-entity-type
| Entity type | P | R | F1 |
|---|---|---|---|
| COMPANY | 0.510 | 0.563 | 0.535 |
| JOB_TITLE | 0.475 | 0.661 | 0.553 |
| LOCATION | 0.480 | 0.800 | 0.600 |
| COMPENSATION | 0.240 | 0.534 | 0.331 |
| EDUCATION | 0.153 | 0.257 | 0.192 |
| CERT | 0.132 | 0.293 | 0.182 |
| EXPERIENCE_LEVEL | 0.068 | 0.123 | 0.087 |
| SKILL | 0.121 | 0.707 | 0.207 |
Teacher comparison
The teacher (Claude Sonnet 4.6) reaches micro-F1 = 0.5171 against the same gold set (95% bootstrap CI [0.503, 0.530]). The student trails the teacher by 0.192 points absolute (37.1% relative). See paper §4.3 for the full comparison and the error-mode analysis of this student's residuals.
Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("AchrafSoltani/jobbert-ner-sonnet-v2")
model = AutoModelForTokenClassification.from_pretrained("AchrafSoltani/jobbert-ner-sonnet-v2")
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

text = 'Senior Machine Learning Engineer at Acme Corp in Berlin. Requires 5+ years of experience with PyTorch, AWS, and Kubernetes. MSc in Computer Science preferred. Salary $140,000 – $180,000.'
for ent in ner(text):
    print(ent["word"], "->", ent["entity_group"])

# Produces (verified on this release; note the BERT wordpiece tokenisation
# artefacts in numeric spans):
# Senior Machine Learning Engineer -> JOB_TITLE
# Acme Corp -> COMPANY
# Berlin -> LOCATION
# 5 + years of experience -> EXPERIENCE_LEVEL
# PyTorch -> SKILL
# AWS -> SKILL
# Kubernetes -> SKILL
# MSc in Computer Science -> EDUCATION
# $ 140, 000 – $ 180, 000 -> COMPENSATION
#
# (illustrative; on long postings the chunked-inference helper
# in the project repo aggregates predictions across windows)

# Note: this checkpoint was trained with 450-token chunked windows and a
# 225-token stride. The single-call `pipeline()` above only sees the first
# 512 tokens; to reproduce the paper's full-text F1 (0.325 / 0.328), apply
# the same chunk-and-aggregate scheme used by the eval script in the
# project repo at `pipeline/scripts/bootstrap_ci_chunked.py`.
```
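The chunk-and-aggregate idea mentioned in the note can be illustrated without the model. The sketch below is not the project's helper: it works on plain token indices, ignores special tokens and BIO decoding, assumes the last window is placed flush with the end of the text, and assumes overlapping predictions are combined by averaging. The card says only that per-token logits are aggregated, so `window_starts`, `aggregate_logits` and the stand-in scorer are hypothetical.

```python
def window_starts(n_tokens, size=450, stride=225):
    """Start offsets of windows that together cover every token."""
    if n_tokens <= size:
        return [0]
    starts = list(range(0, n_tokens - size, stride))
    starts.append(n_tokens - size)  # assumed: final window sits flush with the end
    return starts


def aggregate_logits(n_tokens, n_labels, score_window, size=450, stride=225):
    """Average per-token label scores over every window containing the token.

    score_window(start, end) stands in for a model forward pass and must
    return one list of n_labels scores for each token in [start, end).
    """
    sums = [[0.0] * n_labels for _ in range(n_tokens)]
    hits = [0] * n_tokens
    for start in window_starts(n_tokens, size, stride):
        end = min(start + size, n_tokens)
        for offset, scores in enumerate(score_window(start, end)):
            tok = start + offset
            hits[tok] += 1
            for k, s in enumerate(scores):
                sums[tok][k] += s
    return [[s / hits[i] for s in row] for i, row in enumerate(sums)]


def fake_scorer(start, end):
    """Stand-in scorer: each token's single score is its window's start offset."""
    return [[float(start)] for _ in range(end - start)]


print(window_starts(1000))  # windows over a 1,000-token posting
averaged = aggregate_logits(1000, 1, fake_scorer)
print(averaged[0], averaged[300], averaged[999])
```

With the stand-in scorer, a token seen by a single window keeps that window's value, while a token in the overlap of two windows gets their mean, which makes the aggregation easy to check by hand.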
Ethical considerations
This model extracts entities from job postings, a document class whose downstream consumers are typically hiring, ranking, or matching systems. Three cautions are carried over from paper §6:
- Schema-induced bias. SKILL over-extraction is inherited from the LLM teacher; soft-skill phrases ("communication skills", "interpersonal skills") and generic tools ("Excel", "CRM") are over-represented relative to a tighter gold standard. A downstream ranker that treats such phrases as filters is encoding the teacher's lexical habits as a hiring criterion and is not recommended without a schema review at the application layer.
- Contested ground truth. A vendor benchmark in the paper against LinkedIn's own `job_skills.csv` on 938,028 jointly-present postings yielded 9.56% agreement and 56.44% discovery: the two extraction schemas produce largely non-overlapping views of the same corpus. Neither constitutes a ground truth; the numbers measure schema divergence, not model quality.
- Consent and licensing. The training corpus is a publicly-released Kaggle redistribution of scraped LinkedIn postings. Individuals named in postings (recruiters, hiring managers) did not consent to having their role descriptions re-processed for research. The model is licensed CC BY-NC 4.0 for research and non-commercial evaluation only; any commercial deployment requires a separate legal and ethical review against the data-provenance chain.
Limitations
- Trained and evaluated on English-language LinkedIn postings from a publicly-released 2024 Kaggle redistribution; generalisation to other platforms (Indeed, Stack Overflow, regional job boards) or other languages is unevaluated.
- Gold set is single-annotator (516 postings). Intra-annotator stability was scheduled to be measured one week after the main annotation pass; users should treat the reported F1 as having an un-quantified annotator-noise floor until that number lands.
- Output schema is locked to the eight types above. Finer-grained or taxonomy-aligned schemas require re-training against new labels.
- The underlying BERT encoder still has a 512-token limit per forward pass. This checkpoint achieves full-text coverage by training and inferring over 450-token chunks with a 225-token stride, then aggregating overlapping predictions. Latency is therefore ~1.5–1.7 s per posting (vs ~16 ms for the v1 single-window variant), and per-posting compute scales linearly with text length. For latency-sensitive deployments at full-text coverage, prefer the spaCy variant; for highest single-window throughput on short text, prefer the v1 dense JobBERT (S3/S4).
- Soft-boundary entity types (`SKILL`, `EDUCATION`, `EXPERIENCE_LEVEL`) remain the dominant residual error source. Per paper §4.3.3, the chunked-window improvement (+0.044 F1 over v1) closes only ~17–20% of the gap to the spaCy student; the remaining gap is structural in the BIO/WordPiece formulation rather than a coverage artefact. A span-level objective on top of BIO is the suggested follow-up.
Citation
@unpublished{soltani2026distilledner,
author = {Achraf Soltani and Mohamed Hanine},
title = {Distributed NER on Spark: A Teacher-Student Pipeline for Large-Scale Entity Extraction from Job Postings},
year = {2026},
note = {Advisor: Prof.\ Hanine Mohamed},
url = {https://github.com/achrafsoltani/distributed-ner-on-spark},
}
Licence
- Model weights: CC BY-NC 4.0 — research and non-commercial evaluation only.
- Source code in the accompanying repository: Apache 2.0.