---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense-retrieval
- information-retrieval
- job-skill-matching
- esco
- talentclef
- xlm-roberta
base_model: jjzha/esco-xlm-roberta-large
pipeline_tag: sentence-similarity
model-index:
- name: skillscout-large
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: TalentCLEF 2026 Task B — Validation (304 queries, 9052 skills)
type: talentclef-2026-taskb-validation
metrics:
- type: cosine_ndcg_at_10
value: 0.4830
name: nDCG@10
- type: cosine_map_at_100
value: 0.1825
name: MAP@100
- type: cosine_mrr_at_10
value: 0.6657
name: MRR@10
- type: cosine_accuracy_at_1
value: 0.5099
name: Accuracy@1
- type: cosine_accuracy_at_10
value: 0.9474
name: Accuracy@10
---
# SkillScout Large — Job-to-Skill Dense Retriever
**SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
Given a job title (e.g., *"Data Scientist"*), the model produces a 1024-dimensional embedding and retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) by cosine similarity.
This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).
> **Best pipeline result (TalentCLEF 2026 validation set):**
> nDCG@10 graded = **0.6896** · nDCG@10 binary = **0.7330**
> when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.
> Bi-encoder alone: nDCG@10 graded = **0.3621** · MAP = **0.4545**
---
## Model Summary
| Property | Value |
|---|---|
| Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
| Architecture | XLM-RoBERTa-large + mean pooling |
| Embedding dimension | 1024 |
| Max sequence length | 64 tokens |
| Training loss | Multiple Negatives Ranking (MNR) |
| Training pairs | 93,720 (ESCO job–skill pairs, essential + optional) |
| Epochs | 3 |
| Best checkpoint | Step 3500 (saved by validation nDCG@10) |
| Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |
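
The mean-pooling step listed above can be sketched in plain NumPy; the token embeddings and attention mask below are toy stand-ins, not real model outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy check: two real tokens, one padding token that must be ignored
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(emb, mask))  # [[2. 3.]]
```

`SentenceTransformer.encode` applies the equivalent pooling internally, so this is only needed when working with raw `transformers` outputs.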
---
## What is TalentCLEF Task B?
**TalentCLEF 2026 Task B** is a graded information-retrieval shared task:
- **Query**: a job title (e.g., *"Electrician"*)
- **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*, *"comply with electrical safety regulations"*)
- **Relevance levels**:
- `2` — Core skill (essential regardless of context)
- `1` — Contextual skill (depends on employer / industry)
- `0` — Non-relevant
**Primary metric**: nDCG with graded relevance (core=2, contextual=1)
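
For reference, graded nDCG can be computed as below; this is a standard textbook implementation, not the official TalentCLEF scorer:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(all_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Ranking a contextual skill (1) above a core skill (2) is penalised:
print(ndcg_at_k([1, 2, 0], [2, 1, 0]))  # < 1.0; the perfect order [2, 1, 0] scores 1.0
```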
---
## Usage
### Installation
```bash
pip install sentence-transformers faiss-cpu # or faiss-gpu
```
### Encode & Compare
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("talentguide/skillscout-large")
job = "Data Scientist"
skills = ["data science", "machine learning", "install electric switches"]
embs = model.encode([job] + skills, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T
for skill, score in zip(skills, scores):
    print(f"{score:.3f} {skill}")
# 0.872 data science
# 0.731 machine learning
# 0.112 install electric switches
```
### Full Retrieval with FAISS (Recommended)
```python
from sentence_transformers import SentenceTransformer
import faiss, numpy as np
model = SentenceTransformer("talentguide/skillscout-large")
# --- Build index once over your skill corpus ---
skill_texts = [...] # list of skill names / descriptions
embs = model.encode(skill_texts, batch_size=128,
                    normalize_embeddings=True,
                    show_progress_bar=True).astype(np.float32)
index = faiss.IndexFlatIP(embs.shape[1]) # inner product on L2-normed = cosine
index.add(embs)
# --- Query at inference time ---
job_title = "Software Engineer"
q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)
scores, idxs = index.search(q, k=50)
for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
    print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
```
### Demo Output
```
Software Engineer
1. [0.942] define software architecture
2. [0.938] software frameworks
3. [0.935] create software design
Data Scientist
1. [0.951] data science
2. [0.921] establish data processes
3. [0.919] create data models
Electrician
1. [0.944] install electric switches
2. [0.938] install electricity sockets
3. [0.930] use electrical wire tools
```
---
## Two-Stage Pipeline Integration
SkillScout Large is designed as **Stage 1** — fast first-stage vector retrieval over the skill corpus.
For maximum ranking quality, pair it with a cross-encoder re-ranker:
```
Job title
│
▼
[SkillScout Large] ← this model
      │  top-200 candidates (FAISS, ~40 ms)
▼
[Cross-encoder re-ranker]
│ fine-grained re-scoring of top-200
▼
Final ranked list (graded: core > contextual > irrelevant)
```
**Score blending** (best result at α = 0.7):
```python
final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
```
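
One practical wrinkle: cosine scores live in [-1, 1] while cross-encoder logits are unbounded, so blending raw values can let one stage dominate. A common remedy — an assumption here, not necessarily what this pipeline does — is to min-max normalise both score lists over the candidate set first:

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1]; constant inputs map to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def blend(bi_scores, ce_scores, alpha=0.7):
    """Blend normalised bi-encoder and cross-encoder scores for the same candidates."""
    bi = minmax(np.asarray(bi_scores, dtype=float))
    ce = minmax(np.asarray(ce_scores, dtype=float))
    return alpha * bi + (1 - alpha) * ce

bi = [0.94, 0.91, 0.40]   # cosine similarities from Stage 1
ce = [2.1, 3.5, -4.0]     # hypothetical cross-encoder logits
print(blend(bi, ce))      # highest blended score wins the final ranking
```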
---
## Training Details
### Data
Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 2026 training split.
| | Count |
|---|---|
| Raw job–skill pairs (essential + optional) | 114,699 |
| ESCO jobs with aliases | 3,039 |
| ESCO skills with aliases | 13,939 |
| Training InputExamples (after canonical-pair inclusion) | **93,720** |
| Validation queries | 304 |
| Validation corpus (skills) | 9,052 |
| Validation relevance judgments | 56,417 |
Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.
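
The sampling scheme above can be sketched as follows; the pair tuples and field contents are illustrative, not the actual data-prep code:

```python
import random

def build_training_pairs(essential_pairs, optional_pairs, ratio=0.5, seed=42):
    """Keep every essential pair; sample optional pairs down to `ratio` times
    the essential count so optional skills do not dominate training."""
    rng = random.Random(seed)
    n_optional = min(len(optional_pairs), int(len(essential_pairs) * ratio))
    sampled = rng.sample(optional_pairs, n_optional)
    return essential_pairs + sampled

essential = [("electrician", f"essential skill {i}") for i in range(100)]
optional = [("electrician", f"optional skill {i}") for i in range(300)]
pairs = build_training_pairs(essential, optional)
print(len(pairs))  # 150: 100 essential + 50 sampled optional
```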
### Hyperparameters
```
Loss : MultipleNegativesRankingLoss (scale=20, cos_sim)
Batch size : 64 → 63 in-batch negatives per anchor
Epochs : 3
Warmup : 10% of total steps (~440 steps)
Optimizer : AdamW (fused), lr=5e-5, linear decay
Precision : fp16 (AMP)
Max seq length : 64 tokens
Best model saved : by cosine-nDCG@10 on validation (eval every 500 steps)
Seed : 42
```
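
With this configuration, MNR loss amounts to softmax cross-entropy over a scaled cosine-similarity matrix whose diagonal holds each anchor's matching positive; a NumPy sketch with random stand-in embeddings (the real loss lives in `sentence_transformers.losses.MultipleNegativesRankingLoss`):

```python
import numpy as np

def mnr_loss(anchor_embs: np.ndarray, positive_embs: np.ndarray, scale: float = 20.0) -> float:
    """Multiple Negatives Ranking loss: each anchor's positive is the matching
    row; the other (batch_size - 1) positives act as in-batch negatives."""
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                       # (batch, batch) scaled cosine sims
    sims -= sims.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())     # cross-entropy on the diagonal

rng = np.random.default_rng(42)
jobs = rng.normal(size=(64, 1024))    # 64 job-title embeddings (random stand-ins)
skills = rng.normal(size=(64, 1024))  # the 64 paired skill embeddings
print(mnr_loss(jobs, skills))         # high for unrelated pairs; training drives it down
```

This is why batch size matters for MNR: a batch of 64 yields 63 in-batch negatives per anchor, as noted above.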
### Training Curve
| Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
|:---:|:---:|:---:|:---:|:---:|
| 0.34 | 500 | 2.9232 | 0.3430 | — |
| 0.68 | 1000 | 2.1179 | 0.3424 | — |
| 1.00 | 1465 | — | 0.3676 | 0.1758 |
| 1.37 | 2000 | 1.7070 | 0.3692 | — |
| 1.71 | 2500 | 1.6366 | 0.3744 | — |
| 2.00 | 2930 | — | 0.3717 | 0.1780 |
| 2.39 | **3500** ✓ | **1.4540** | **0.3769** | **0.1808** |
*Best checkpoint saved at step 3500.*
### Validation Metrics (best checkpoint, binary relevance)
| Metric | Value |
|---|---|
| **nDCG@10** | **0.4830** |
| nDCG@50 | 0.4240 |
| nDCG@100 | 0.3769 |
| **MAP@100** | **0.1825** |
| **MRR@10** | **0.6657** |
| Accuracy@1 | 0.5099 |
| Accuracy@3 | 0.7993 |
| Accuracy@5 | 0.8914 |
| Accuracy@10 | **0.9474** |
*Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).*
### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)
| Run | nDCG@10 graded | nDCG@10 binary | MAP |
|---|---|---|---|
| Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
| **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
| SkillScout Large + cross-encoder (α=0.7) | **0.6896** | **0.7330** | 0.2481 |
---
## Competitive Context (TalentCLEF 2025 Task B)
| Team | MAP (test) | Approach |
|---|---|---|
| pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
| NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
| **SkillScout Large (2026 val)** | **0.4545** | MNR fine-tuned bi-encoder (Stage 1 only) |

*Note: the 2025 figures are test-set scores while the 2026 figure is validation-set, so this comparison is indicative rather than direct.*
---
## Limitations
- **English only** — trained on ESCO EN labels.
- **ESCO-domain** — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
- **64-token cap** — long job descriptions should be reduced to a concise title before encoding.
- **Graded distinction** — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.
---
## Citation
```bibtex
@misc{talentguide-skillscout-2026,
title = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
author = {TalentGuide},
year = {2026},
url = {https://huggingface.co/talentguide/skillscout-large}
}
@misc{talentclef2026taskb,
title = {TalentCLEF 2026 Task B: Job-Skill Matching},
author = {TalentCLEF Organizers},
year = {2026},
url = {https://talentclef.github.io/}
}
```
---
## Framework Versions
| Package | Version |
|---|---|
| Python | 3.12.10 |
| sentence-transformers | 5.3.0 |
| transformers | 5.5.0 |
| PyTorch | 2.11.0+cu128 |
| Accelerate | 1.13.0 |
| Tokenizers | 0.22.2 |
---
## License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)