---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense-retrieval
- information-retrieval
- job-skill-matching
- esco
- talentclef
- xlm-roberta
base_model: jjzha/esco-xlm-roberta-large
pipeline_tag: sentence-similarity
model-index:
- name: skillscout-large
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: TalentCLEF 2026 Task B Validation (304 queries, 9052 skills)
      type: talentclef-2026-taskb-validation
    metrics:
    - type: cosine_ndcg_at_10
      value: 0.4830
      name: nDCG@10
    - type: cosine_map_at_100
      value: 0.1825
      name: MAP@100
    - type: cosine_mrr_at_10
      value: 0.6657
      name: MRR@10
    - type: cosine_accuracy_at_1
      value: 0.5099
      name: Accuracy@1
    - type: cosine_accuracy_at_10
      value: 0.9474
      name: Accuracy@10
---

# SkillScout Large — Job-to-Skill Dense Retriever

**SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.  
Given a job title (e.g., *"Data Scientist"*), the model produces a 1024-dimensional embedding and retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) by cosine similarity.

This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).

> **Best pipeline result (TalentCLEF 2026 validation set):**  
> nDCG@10 graded = **0.6896** · nDCG@10 binary = **0.7330**  
> when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.  
> Bi-encoder alone: nDCG@10 graded = **0.3621** · MAP = **0.4545**

---

## Model Summary

| Property | Value |
|---|---|
| Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
| Architecture | XLM-RoBERTa-large + mean pooling |
| Embedding dimension | 1024 |
| Max sequence length | 64 tokens |
| Training loss | Multiple Negatives Ranking (MNR) |
| Training pairs | 93,720 (ESCO job–skill pairs, essential + optional) |
| Epochs | 3 |
| Best checkpoint | Step 3500 (saved by validation nDCG@10) |
| Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |

---

## What is TalentCLEF Task B?

**TalentCLEF 2026 Task B** is a graded information-retrieval shared task:

- **Query**: a job title (e.g., *"Electrician"*)
- **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*, *"comply with electrical safety regulations"*)
- **Relevance levels**:
  - `2` — Core skill (essential regardless of context)
  - `1` — Contextual skill (depends on employer / industry)
  - `0` — Non-relevant

**Primary metric**: nDCG with graded relevance (core=2, contextual=1)
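
As a rough illustration of the metric, the sketch below computes nDCG@k from graded labels. It assumes linear gains (the relevance grade itself) and a log2 rank discount; the official TalentCLEF scorer may use a different gain function, and `dcg_at_k` / `ndcg_at_k` are hypothetical helper names.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the first k grades (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades[:k], start=1))

def ndcg_at_k(ranked_grades, all_grades, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_grades: grades (2/1/0) of the retrieved skills, in ranked order.
    all_grades:    grades of every judged skill for the query (builds the ideal list).
    """
    ideal_dcg = dcg_at_k(sorted(all_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy query: two core skills, one contextual skill, one non-relevant skill.
judged = [2, 2, 1, 0]
perfect = ndcg_at_k([2, 2, 1, 0], judged, k=3)   # core skills ranked first
swapped = ndcg_at_k([1, 0, 2, 2], judged, k=3)   # core skills pushed down
print(f"{perfect:.3f} {swapped:.3f}")
```

Ranking a contextual skill above a core skill lowers the score even though both count as relevant, which is the distinction a binary metric cannot see.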

---

## Usage

### Installation

```bash
pip install sentence-transformers faiss-cpu  # or faiss-gpu
```

### Encode & Compare

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("talentguide/skillscout-large")

job    = "Data Scientist"
skills = ["data science", "machine learning", "install electric switches"]

embs   = model.encode([job] + skills, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T

for skill, score in zip(skills, scores):
    print(f"{score:.3f}  {skill}")
# 0.872  data science
# 0.731  machine learning
# 0.112  install electric switches
```

### Full Retrieval with FAISS (Recommended)

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("talentguide/skillscout-large")

# --- Build index once over your skill corpus ---
skill_texts = [...]   # list of skill names / descriptions

embs = model.encode(skill_texts, batch_size=128,
                    normalize_embeddings=True,
                    show_progress_bar=True).astype(np.float32)

index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
index.add(embs)

# --- Query at inference time ---
job_title = "Software Engineer"
q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)

scores, idxs = index.search(q, k=50)
for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
    print(f"{rank:3d}. [{score:.4f}]  {skill_texts[idx]}")
```

### Demo Output

```
Software Engineer
   1. [0.942]  define software architecture
   2. [0.938]  software frameworks
   3. [0.935]  create software design

Data Scientist
   1. [0.951]  data science
   2. [0.921]  establish data processes
   3. [0.919]  create data models

Electrician
   1. [0.944]  install electric switches
   2. [0.938]  install electricity sockets
   3. [0.930]  use electrical wire tools
```

---

## Two-Stage Pipeline Integration

SkillScout Large is designed as **Stage 1** — fast ANN retrieval.  
For maximum ranking quality, pair it with a cross-encoder re-ranker:

```
Job title
   │
   ▼
[SkillScout Large]              ← this model
   │  top-200 candidates (FAISS ANN, ~40ms)
   ▼
[Cross-encoder re-ranker]
   │  fine-grained re-scoring of top-200
   ▼
Final ranked list  (graded: core > contextual > irrelevant)
```

**Score blending** (best result at α = 0.7):

```python
final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
```
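
A minimal sketch of how the blend might be applied to a candidate list. It assumes both score arrays are aligned by candidate and min-max normalised to [0, 1] before mixing, which this card does not specify; `minmax`, `blend_and_rank`, and the toy scores are hypothetical.

```python
import numpy as np

def minmax(x):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    x = np.asarray(x, dtype=np.float64)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def blend_and_rank(biencoder_scores, crossencoder_scores, alpha=0.7):
    """Return candidate indices sorted by blended score (best first) and the sorted scores."""
    final = alpha * minmax(biencoder_scores) + (1 - alpha) * minmax(crossencoder_scores)
    order = np.argsort(-final, kind="stable")
    return order, final[order]

bi = [0.94, 0.93, 0.90, 0.85]   # toy cosine similarities from Stage 1
ce = [0.20, 0.95, 0.60, 0.10]   # toy re-ranker scores for the same candidates
order, blended = blend_and_rank(bi, ce, alpha=0.7)
print(order.tolist())           # [1, 0, 2, 3]
```

In this toy case the second candidate overtakes the first because its re-ranker score is much higher, while the bi-encoder still dominates the rest of the ordering.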

---

## Training Details

### Data

Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 2026 training split.

| | Count |
|---|---|
| Raw job–skill pairs (essential + optional) | 114,699 |
| ESCO jobs with aliases | 3,039 |
| ESCO skills with aliases | 13,939 |
| Training InputExamples (after canonical-pair inclusion) | **93,720** |
| Validation queries | 304 |
| Validation corpus (skills) | 9,052 |
| Validation relevance judgments | 56,417 |

Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.
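
One way the balancing rule could be implemented is sketched below. The pair format, the helper name `balance_pairs`, and the use of uniform random sampling are assumptions; the card only states the 50% target.

```python
import random

def balance_pairs(essential, optional, ratio=0.5, seed=42):
    """Keep all essential pairs and at most ratio * len(essential) optional pairs."""
    rng = random.Random(seed)
    target = int(len(essential) * ratio)
    kept_optional = rng.sample(optional, min(target, len(optional)))
    return essential + kept_optional

# Toy (job, skill) pairs standing in for ESCO relations.
essential = [("electrician", f"essential skill {i}") for i in range(10)]
optional = [("electrician", f"optional skill {i}") for i in range(30)]
pairs = balance_pairs(essential, optional)
print(len(pairs))  # 10 essential + 5 optional = 15
```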

### Hyperparameters

```
Loss              : MultipleNegativesRankingLoss (scale=20, cos_sim)
Batch size        : 64  →  63 in-batch negatives per anchor
Epochs            : 3
Warmup            : 10% of total steps (~440 steps)
Optimizer         : AdamW (fused), lr=5e-5, linear decay
Precision         : fp16 (AMP)
Max seq length    : 64 tokens
Best model saved  : by cosine-nDCG@10 on validation (eval every 500 steps)
Seed              : 42
```
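
To make the loss line concrete, here is a small NumPy sketch of the Multiple Negatives Ranking objective: each anchor's positive sits on the diagonal of a scaled cosine-similarity matrix, and the other in-batch positives act as negatives. It illustrates the idea only and is not the sentence-transformers implementation; `mnr_loss` and the random toy embeddings are hypothetical.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities, with the diagonal as target."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                      # shape: (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)     # subtract row max for stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(42)
jobs = rng.normal(size=(64, 1024))                  # batch of 64 -> 63 negatives each
matched = jobs + 0.1 * rng.normal(size=jobs.shape)  # positives close to their anchors
shuffled = matched[::-1].copy()                     # positives misaligned with anchors

print(mnr_loss(jobs, matched) < mnr_loss(jobs, shuffled))  # True
```

The loss is near zero when each job embedding is closest to its own skill and grows when the pairing is scrambled, which is the signal the bi-encoder is trained on.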

### Training Curve

| Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
|:---:|:---:|:---:|:---:|:---:|
| 0.34 | 500  | 2.9232 | 0.3430 | — |
| 0.68 | 1000 | 2.1179 | 0.3424 | — |
| 1.00 | 1465 | —      | 0.3676 | 0.1758 |
| 1.37 | 2000 | 1.7070 | 0.3692 | — |
| 1.71 | 2500 | 1.6366 | 0.3744 | — |
| 2.00 | 2930 | —      | 0.3717 | 0.1780 |
| 2.39 | **3500** ✓ | **1.4540** | **0.3769** | **0.1808** |

*Best checkpoint saved at step 3500.*
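
The step and epoch figures in the table follow from the dataset and batch size. The short calculation below reproduces them, assuming the final partial batch of each epoch is kept (so steps per epoch are rounded up).

```python
import math

train_pairs = 93_720
batch_size = 64
epochs = 3

steps_per_epoch = math.ceil(train_pairs / batch_size)   # last partial batch kept
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(0.10 * total_steps)            # 10% warmup, rounded up

print(steps_per_epoch, total_steps, warmup_steps)        # 1465 4395 440
print(round(3500 / steps_per_epoch, 2))                  # 2.39
```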

### Validation Metrics (best checkpoint, binary relevance)

| Metric | Value |
|---|---|
| **nDCG@10** | **0.4830** |
| nDCG@50 | 0.4240 |
| nDCG@100 | 0.3769 |
| **MAP@100** | **0.1825** |
| **MRR@10** | **0.6657** |
| Accuracy@1 | 0.5099 |
| Accuracy@3 | 0.7993 |
| Accuracy@5 | 0.8914 |
| Accuracy@10 | **0.9474** |

*Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).*
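
For readers unfamiliar with the binary metrics above, the sketch below shows how Accuracy@k and MRR@k can be computed once graded labels are collapsed to relevant / non-relevant. It is a simplified stand-in for `InformationRetrievalEvaluator`, not its actual code; the helper names and toy rankings are hypothetical.

```python
def accuracy_at_k(rankings, k):
    """Share of queries with at least one relevant item in the top k."""
    return sum(any(r[:k]) for r in rankings) / len(rankings)

def mrr_at_k(rankings, k):
    """Mean reciprocal rank of the first relevant item within the top k (0 if none)."""
    total = 0.0
    for r in rankings:
        for rank, rel in enumerate(r[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Graded labels of the ranked skills for three toy queries.
graded = [[2, 0, 1], [0, 1, 0], [0, 0, 0]]
binary = [[g > 0 for g in r] for r in graded]   # any qrel > 0 counts as relevant

acc1 = accuracy_at_k(binary, 1)
acc3 = accuracy_at_k(binary, 3)
mrr3 = mrr_at_k(binary, 3)
print(f"{acc1:.3f} {acc3:.3f} {mrr3:.3f}")      # 0.333 0.667 0.500
```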

### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)

| Run | nDCG@10 graded | nDCG@10 binary | MAP |
|---|---|---|---|
| Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
| **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
| SkillScout Large + cross-encoder (α=0.7) | **0.6896** | **0.7330** | 0.2481 |

---

## Competitive Context (TalentCLEF 2025 Task B)

| Team | MAP (test) | Approach |
|---|---|---|
| pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
| NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
| **SkillScout Large (2026 val)** | **0.4545** | MNR fine-tuned bi-encoder (Stage 1 only) |

---

## Limitations

- **English only** — trained on ESCO EN labels.
- **ESCO-domain** — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
- **64-token cap** — long job descriptions should be reduced to a concise title before encoding.
- **Graded distinction** — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.

---

## Citation

```bibtex
@misc{talentguide-skillscout-2026,
  title   = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
  author  = {TalentGuide},
  year    = {2026},
  url     = {https://huggingface.co/talentguide/skillscout-large}
}

@misc{talentclef2026taskb,
  title   = {TalentCLEF 2026 Task B: Job-Skill Matching},
  author  = {TalentCLEF Organizers},
  year    = {2026},
  url     = {https://talentclef.github.io/}
}
```

---

## Framework Versions

| Package | Version |
|---|---|
| Python | 3.12.10 |
| sentence-transformers | 5.3.0 |
| transformers | 5.5.0 |
| PyTorch | 2.11.0+cu128 |
| Accelerate | 1.13.0 |
| Tokenizers | 0.22.2 |

---

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)