---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense-retrieval
- information-retrieval
- job-skill-matching
- esco
- talentclef
- xlm-roberta
base_model: jjzha/esco-xlm-roberta-large
pipeline_tag: sentence-similarity
model-index:
- name: skillscout-large
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: TalentCLEF 2026 Task B — Validation (304 queries, 9052 skills)
      type: talentclef-2026-taskb-validation
    metrics:
    - type: cosine_ndcg_at_10
      value: 0.4830
      name: nDCG@10
    - type: cosine_map_at_100
      value: 0.1825
      name: MAP@100
    - type: cosine_mrr_at_10
      value: 0.6657
      name: MRR@10
    - type: cosine_accuracy_at_1
      value: 0.5099
      name: Accuracy@1
    - type: cosine_accuracy_at_10
      value: 0.9474
      name: Accuracy@10
---

	# SkillScout Large — Job-to-Skill Dense Retriever

SkillScout Large is a dense bi-encoder that retrieves relevant skills for a job title.
Given a job title (e.g., "Data Scientist"), the model encodes it into a 1024-dimensional embedding and retrieves the most semantically similar skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) by cosine similarity.

	This is Stage 1 of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).

	> Best pipeline result (TalentCLEF 2026 validation set):
	> nDCG@10 graded = 0.6896 · nDCG@10 binary = 0.7330
	> when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.
	> Bi-encoder alone: nDCG@10 graded = 0.3621 · MAP = 0.4545

	---

	## Model Summary

| Property | Value |
|---|---|
| Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
| Architecture | XLM-RoBERTa-large + mean pooling |
| Embedding dimension | 1024 |
| Max sequence length | 64 tokens |
| Training loss | Multiple Negatives Ranking (MNR) |
| Training pairs | 93,720 (ESCO job–skill pairs, essential + optional) |
| Epochs | 3 |
| Best checkpoint | Step 3500 (saved by validation nDCG@10) |
| Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |

	---

	## What is TalentCLEF Task B?

	TalentCLEF 2026 Task B is a graded information-retrieval shared task:

- Query: a job title (e.g., "Electrician")
- Corpus: 9,052 ESCO skills (e.g., "install electric switches", "comply with electrical safety regulations")
- Relevance levels:
  - `2` — Core skill (essential regardless of context)
  - `1` — Contextual skill (depends on employer / industry)
  - `0` — Non-relevant

	Primary metric: nDCG with graded relevance (core=2, contextual=1)
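
To make the graded metric concrete, here is a minimal sketch of nDCG@k over relevance grades 2/1/0. It assumes linear gain (the grade itself) and a log2 rank discount; the official TalentCLEF scorer may use a different gain (for example `2**rel - 1`). The function names and toy grades are illustrative and are not taken from the task's evaluation code.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades (linear gain)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades[:k], start=1))

def ndcg_at_k(ranked_grades, all_grades, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_grades: grades (2, 1 or 0) of the retrieved skills, in ranked order.
    all_grades:    grades of every judged skill for the query (ideal ranking source).
    """
    ideal_dcg = dcg_at_k(sorted(all_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy query: two core skills, one contextual skill, two non-relevant skills.
judged = [2, 2, 1, 0, 0]
perfect = ndcg_at_k([2, 2, 1, 0, 0], judged, k=3)
swapped = ndcg_at_k([1, 2, 2, 0, 0], judged, k=3)  # contextual skill ranked first
```

With graded relevance, ranking a contextual skill above a core skill lowers the score (`swapped < perfect`), whereas a binary metric would treat the two rankings as identical.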

	---

	## Usage

	### Installation

	```bash
	pip install sentence-transformers faiss-cpu # or faiss-gpu
	```

	### Encode & Compare

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("talentguide/skillscout-large")

job = "Data Scientist"
skills = ["data science", "machine learning", "install electric switches"]

# Embeddings are L2-normalised, so the dot product equals cosine similarity.
embs = model.encode([job] + skills, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T

for skill, score in zip(skills, scores):
    print(f"{score:.3f} {skill}")
# 0.872 data science
# 0.731 machine learning
# 0.112 install electric switches
```

	### Full Retrieval with FAISS (Recommended)

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("talentguide/skillscout-large")

# --- Build index once over your skill corpus ---
skill_texts = [...]  # list of skill names / descriptions

embs = model.encode(
    skill_texts,
    batch_size=128,
    normalize_embeddings=True,
    show_progress_bar=True,
).astype(np.float32)

index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
index.add(embs)

# --- Query at inference time ---
job_title = "Software Engineer"
q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)

scores, idxs = index.search(q, k=50)
for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
    print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
```

	### Demo Output

	```
	Software Engineer
	1. [0.942] define software architecture
	2. [0.938] software frameworks
	3. [0.935] create software design

	Data Scientist
	1. [0.951] data science
	2. [0.921] establish data processes
	3. [0.919] create data models

	Electrician
	1. [0.944] install electric switches
	2. [0.938] install electricity sockets
	3. [0.930] use electrical wire tools
	```

	---

	## Two-Stage Pipeline Integration

	SkillScout Large is designed as Stage 1 — fast ANN retrieval.
	For maximum ranking quality, pair it with a cross-encoder re-ranker:

```
Job title
    │
    ▼
[SkillScout Large]  ← this model
    │  top-200 candidates (FAISS ANN, ~40ms)
    ▼
[Cross-encoder re-ranker]
    │  fine-grained re-scoring of top-200
    ▼
Final ranked list (graded: core > contextual > irrelevant)
```

	Score blending (best result at α = 0.7):

	```python
	final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
	```

	---

	## Training Details

	### Data

	Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 2026 training split.

| Statistic | Count |
|---|---|
| Raw job–skill pairs (essential + optional) | 114,699 |
| ESCO jobs with aliases | 3,039 |
| ESCO skills with aliases | 13,939 |
| Training InputExamples (after canonical-pair inclusion) | 93,720 |
| Validation queries | 304 |
| Validation corpus (skills) | 9,052 |
| Validation relevance judgments | 56,417 |

	Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.
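
The balancing rule can be sketched as follows. This is not the preprocessing code used for the model: the `(job, skill, relation)` tuple layout and the `balance_pairs` helper are assumptions for illustration, and the seed simply reuses the value listed under Hyperparameters.

```python
import random

def balance_pairs(pairs, optional_ratio=0.5, seed=42):
    """Keep every essential pair and sample optional pairs down to
    optional_ratio * (number of essential pairs)."""
    essential = [p for p in pairs if p[2] == "essential"]
    optional = [p for p in pairs if p[2] == "optional"]
    n_keep = min(len(optional), int(optional_ratio * len(essential)))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return essential + rng.sample(optional, n_keep)

# Toy data: (job title, skill, relation type); real pairs come from ESCO.
toy_pairs = (
    [("electrician", f"essential skill {i}", "essential") for i in range(10)]
    + [("electrician", f"optional skill {i}", "optional") for i in range(20)]
)
balanced = balance_pairs(toy_pairs)
n_optional = sum(1 for p in balanced if p[2] == "optional")
```

Here 10 essential pairs are kept in full and the 20 optional pairs are reduced to 5, giving 15 training pairs.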

	### Hyperparameters

```
Loss             : MultipleNegativesRankingLoss (scale=20, cos_sim)
Batch size       : 64 → 63 in-batch negatives per anchor
Epochs           : 3
Warmup           : 10% of total steps (~440 steps)
Optimizer        : AdamW (fused), lr=5e-5, linear decay
Precision        : fp16 (AMP)
Max seq length   : 64 tokens
Best model saved : by cosine-nDCG@10 on validation (eval every 500 steps)
Seed             : 42
```
	### Training Curve

| Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
|:---:|:---:|:---:|:---:|:---:|
| 0.34 | 500 | 2.9232 | 0.3430 | — |
| 0.68 | 1000 | 2.1179 | 0.3424 | — |
| 1.00 | 1465 | — | 0.3676 | 0.1758 |
| 1.37 | 2000 | 1.7070 | 0.3692 | — |
| 1.71 | 2500 | 1.6366 | 0.3744 | — |
| 2.00 | 2930 | — | 0.3717 | 0.1780 |
| 2.39 | 3500 ✓ | 1.4540 | 0.3769 | 0.1808 |

	Best checkpoint saved at step 3500.
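
The step and epoch columns follow from the training-set size and batch size. The arithmetic below reproduces them, assuming the final partial batch of each epoch counts as a step and that the warmup length is rounded up.

```python
import math

train_pairs = 93_720
batch_size = 64
epochs = 3
warmup_ratio = 0.10

steps_per_epoch = math.ceil(train_pairs / batch_size)  # 1464.375 rounds up to 1465
total_steps = steps_per_epoch * epochs                 # 3 * 1465 = 4395
warmup_steps = math.ceil(warmup_ratio * total_steps)   # 439.5 rounds up to 440

# Fractional epoch reached at the best checkpoint (step 3500).
best_step_epoch = 3500 / steps_per_epoch               # about 2.39
```

This matches the table: epoch boundaries fall at steps 1465 and 2930, and step 3500 sits at roughly epoch 2.39 of 3.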

	### Validation Metrics (best checkpoint, binary relevance)

| Metric | Value |
|---|---|
| nDCG@10 | 0.4830 |
| nDCG@50 | 0.4240 |
| nDCG@100 | 0.3769 |
| MAP@100 | 0.1825 |
| MRR@10 | 0.6657 |
| Accuracy@1 | 0.5099 |
| Accuracy@3 | 0.7993 |
| Accuracy@5 | 0.8914 |
| Accuracy@10 | 0.9474 |

	Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).
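
As a rough illustration of what the binary setting means, the sketch below collapses graded judgments to relevant / non-relevant and computes per-query Accuracy@k and MRR@k. The helper names are hypothetical and this is not the evaluator's code; reported figures are the averages of such per-query values over all 304 validation queries.

```python
def binarize(qrels):
    """Map graded judgments to a set of relevant skills: any grade > 0 counts."""
    return {skill for skill, grade in qrels.items() if grade > 0}

def accuracy_at_k(ranked, relevant, k):
    """1.0 if at least one relevant skill appears in the top k, else 0.0."""
    return 1.0 if any(s in relevant for s in ranked[:k]) else 0.0

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant skill in the top k (0.0 if none)."""
    for rank, skill in enumerate(ranked[:k], start=1):
        if skill in relevant:
            return 1.0 / rank
    return 0.0

# Toy query with graded judgments (2 = core, 1 = contextual, 0 = non-relevant).
qrels = {"install electric switches": 2, "use electrical wire tools": 1, "data science": 0}
ranked = ["data science", "use electrical wire tools", "install electric switches"]

relevant = binarize(qrels)
acc1 = accuracy_at_k(ranked, relevant, k=1)   # top result is non-relevant
acc3 = accuracy_at_k(ranked, relevant, k=3)
mrr10 = mrr_at_k(ranked, relevant, k=10)      # first relevant skill at rank 2
```

Because core and contextual skills are merged, these metrics cannot show whether core skills are ranked above contextual ones; that distinction only appears in the graded nDCG figures.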

	### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)

| Run | nDCG@10 graded | nDCG@10 binary | MAP |
|---|---|---|---|
| Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
| SkillScout Large (bi-encoder only) | 0.3621 | 0.4830 | 0.4545 |
| SkillScout Large + cross-encoder (α=0.7) | 0.6896 | 0.7330 | 0.2481 |

	---

	## Competitive Context (TalentCLEF 2025 Task B)

| Team | MAP (test) | Approach |
|---|---|---|
| pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
| NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
| SkillScout Large (2026 val) | 0.4545 | MNR fine-tuned bi-encoder (Stage 1 only) |

	---

	## Limitations

	- English only — trained on ESCO EN labels.
	- ESCO-domain — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
	- 64-token cap — long job descriptions should be reduced to a concise title before encoding.
	- Graded distinction — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.

	---

	## Citation

```bibtex
@misc{talentguide-skillscout-2026,
  title  = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
  author = {TalentGuide},
  year   = {2026},
  url    = {https://huggingface.co/talentguide/skillscout-large}
}

@misc{talentclef2026taskb,
  title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
  author = {TalentCLEF Organizers},
  year   = {2026},
  url    = {https://talentclef.github.io/}
}
```

	---

	## Framework Versions

| Package | Version |
|---|---|
| Python | 3.12.10 |
| sentence-transformers | 5.3.0 |
| transformers | 5.5.0 |
| PyTorch | 2.11.0+cu128 |
| Accelerate | 1.13.0 |
| Tokenizers | 0.22.2 |

	---

	## License

	[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)