Bhanu3 committed (verified) · Commit 6e1d756 · Parent(s): 9550ada

Update model card: full SkillScout Large documentation

Files changed (1):
  1. README.md (+78 −100)

README.md CHANGED
@@ -22,7 +22,7 @@ model-index:
  type: information-retrieval
  name: Information Retrieval
  dataset:
- name: TalentCLEF 2026 Task B Validation (304 queries, 9052 skills)
  type: talentclef-2026-taskb-validation
  metrics:
  - type: cosine_ndcg_at_10
@@ -34,25 +34,25 @@ model-index:
  - type: cosine_mrr_at_10
  value: 0.6657
  name: MRR@10
- - type: cosine_accuracy_at_1
- value: 0.5099
- name: Accuracy@1
  - type: cosine_accuracy_at_10
  value: 0.9474
  name: Accuracy@10
  ---

- # SkillScout Large Job-to-Skill Dense Retriever

- **SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
- Given a job title (e.g., *"Data Scientist"*), it encodes it into a 1024-dimensional embedding and retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) using cosine similarity.

- This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).

- > **Best pipeline result (TalentCLEF 2026 validation set):**
- > nDCG@10 graded = **0.6896** · nDCG@10 binary = **0.7330**
- > when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.
- > Bi-encoder alone: nDCG@10 graded = **0.3621** · MAP = **0.4545**

  ---

@@ -60,15 +60,15 @@ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, tr

  | Property | Value |
  |---|---|
- | Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
  | Architecture | XLM-RoBERTa-large + mean pooling |
  | Embedding dimension | 1024 |
  | Max sequence length | 64 tokens |
  | Training loss | Multiple Negatives Ranking (MNR) |
- | Training pairs | 93,720 (ESCO job→skill pairs, essential + optional) |
  | Epochs | 3 |
- | Best checkpoint | Step 3500 (saved by validation nDCG@10) |
- | Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |

  ---
@@ -77,13 +77,9 @@ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, tr

  **TalentCLEF 2026 Task B** is a graded information-retrieval shared task:

  - **Query**: a job title (e.g., *"Electrician"*)
- - **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*, *"comply with electrical safety regulations"*)
- - **Relevance levels**:
-   - `2` — Core skill (essential regardless of context)
-   - `1` — Contextual skill (depends on employer / industry)
-   - `0` — Non-relevant
-
- **Primary metric**: nDCG with graded relevance (core=2, contextual=1)

  ---
 
@@ -92,10 +88,10 @@ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, tr

  ### Installation

  ```bash
- pip install sentence-transformers faiss-cpu  # or faiss-gpu
  ```

- ### Encode & Compare

  ```python
  from sentence_transformers import SentenceTransformer
@@ -123,8 +119,8 @@ import faiss, numpy as np

  model = SentenceTransformer("talentguide/skillscout-large")

- # --- Build index once over your skill corpus ---
- skill_texts = [...]  # list of skill names / descriptions

  embs = model.encode(skill_texts, batch_size=128,
                      normalize_embeddings=True,
@@ -133,11 +129,10 @@ embs = model.encode(skill_texts, batch_size=128,

  index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
  index.add(embs)

- # --- Query at inference time ---
  job_title = "Software Engineer"
  q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)
-
  scores, idxs = index.search(q, k=50)

  for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
      print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
  ```
@@ -165,23 +160,20 @@ Electrician

  ## Two-Stage Pipeline Integration

- SkillScout Large is designed as **Stage 1** — fast ANN retrieval.
- For maximum ranking quality, pair it with a cross-encoder re-ranker:
-
  ```
  Job title
-     ↓
-     ↓
- [SkillScout Large]   ← this model
-     top-200 candidates (FAISS ANN, ~40ms)
-     ↓
  [Cross-encoder re-ranker]
-     fine-grained re-scoring of top-200
-     ↓
- Final ranked list (graded: core > contextual > irrelevant)
  ```

- **Score blending** (best result at α = 0.7):

  ```python
  final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
@@ -197,45 +189,43 @@ Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 202

  | | Count |
  |---|---|
- | Raw job–skill pairs (essential + optional) | 114,699 |
- | ESCO jobs with aliases | 3,039 |
- | ESCO skills with aliases | 13,939 |
- | Training InputExamples (after canonical-pair inclusion) | **93,720** |
  | Validation queries | 304 |
- | Validation corpus (skills) | 9,052 |
- | Validation relevance judgments | 56,417 |

- Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.

  ### Hyperparameters

  ```
- Loss             : MultipleNegativesRankingLoss (scale=20, cos_sim)
- Batch size       : 64 → 63 in-batch negatives per anchor
- Epochs           : 3
- Warmup           : 10% of total steps (~440 steps)
- Optimizer        : AdamW (fused), lr=5e-5, linear decay
- Precision        : fp16 (AMP)
- Max seq length   : 64 tokens
- Best model saved : by cosine-nDCG@10 on validation (eval every 500 steps)
- Seed             : 42
  ```

  ### Training Curve

- | Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
- |:---:|:---:|:---:|:---:|:---:|
- | 0.34 | 500 | 2.9232 | 0.3430 | |
- | 0.68 | 1000 | 2.1179 | 0.3424 | |
- | 1.00 | 1465 | | 0.3676 | 0.1758 |
- | 1.37 | 2000 | 1.7070 | 0.3692 | |
- | 1.71 | 2500 | 1.6366 | 0.3744 | |
- | 2.00 | 2930 | | 0.3717 | 0.1780 |
- | 2.39 | **3500** | **1.4540** | **0.3769** | **0.1808** |
-
- *Best checkpoint saved at step 3500.*

- ### Validation Metrics (best checkpoint, binary relevance)

  | Metric | Value |
  |---|---|
@@ -247,17 +237,17 @@ Seed : 42

  | Accuracy@1 | 0.5099 |
  | Accuracy@3 | 0.7993 |
  | Accuracy@5 | 0.8914 |
- | Accuracy@10 | **0.9474** |

- *Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).*

- ### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)

  | Run | nDCG@10 graded | nDCG@10 binary | MAP |
  |---|---|---|---|
  | Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
  | **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
- | SkillScout Large + cross-encoder (α=0.7) | **0.6896** | **0.7330** | 0.2481 |

  ---
@@ -267,16 +257,16 @@ Seed : 42

  |---|---|---|
  | pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
  | NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
- | **SkillScout Large (2026 val)** | **0.4545** | MNR fine-tuned bi-encoder (Stage 1 only) |

  ---

  ## Limitations

- - **English only** — trained on ESCO EN labels.
- - **ESCO-domain** — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
- - **64-token cap** — long job descriptions should be reduced to a concise title before encoding.
- - **Graded distinction** — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.

  ---
 
@@ -284,17 +274,17 @@ Seed : 42

  ```bibtex
  @misc{talentguide-skillscout-2026,
-   title  = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
-   author = {TalentGuide},
-   year   = {2026},
-   url    = {https://huggingface.co/talentguide/skillscout-large}
  }

  @misc{talentclef2026taskb,
-   title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
-   author = {TalentCLEF Organizers},
-   year   = {2026},
-   url    = {https://talentclef.github.io/}
  }
  ```
 
@@ -302,17 +292,5 @@ Seed : 42

  ## Framework Versions

- | Package | Version |
- |---|---|
- | Python | 3.12.10 |
- | sentence-transformers | 5.3.0 |
- | transformers | 5.5.0 |
- | PyTorch | 2.11.0+cu128 |
- | Accelerate | 1.13.0 |
- | Tokenizers | 0.22.2 |
-
- ---
-
- ## License
-
- [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
  type: information-retrieval
  name: Information Retrieval
  dataset:
+ name: TalentCLEF 2026 Task B Validation
  type: talentclef-2026-taskb-validation
  metrics:
  - type: cosine_ndcg_at_10
 
  - type: cosine_mrr_at_10
  value: 0.6657
  name: MRR@10
  - type: cosine_accuracy_at_10
  value: 0.9474
  name: Accuracy@10
  ---

+ # SkillScout Large - Job-to-Skill Dense Retriever

+ **SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
+ Given a job title (e.g., *"Data Scientist"*), it produces a 1024-dimensional embedding and
+ retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/)
+ skill gazetteer (9,052 skills) via cosine similarity.

+ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for
+ [TalentCLEF 2026 Task B](https://talentclef.github.io/).

+ > **Best pipeline result (TalentCLEF 2026 validation set):**
+ > nDCG@10 graded = **0.6896** | nDCG@10 binary = **0.7330**
+ > when combined with a fine-tuned cross-encoder at blend alpha=0.7.
+ > Bi-encoder alone: nDCG@10 graded = **0.3621** | MAP = **0.4545**

  ---
 
 

  | Property | Value |
  |---|---|
+ | Base model | [jjzha/esco-xlm-roberta-large](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
  | Architecture | XLM-RoBERTa-large + mean pooling |
  | Embedding dimension | 1024 |
  | Max sequence length | 64 tokens |
  | Training loss | Multiple Negatives Ranking (MNR) |
+ | Training pairs | 93,720 (ESCO job-skill pairs, essential + optional) |
  | Epochs | 3 |
+ | Best checkpoint | Step 3500 (by validation nDCG@10) |
+ | Hardware | NVIDIA RTX 3070 8GB, fp16 AMP |

  ---
 
 
  **TalentCLEF 2026 Task B** is a graded information-retrieval shared task:

  - **Query**: a job title (e.g., *"Electrician"*)
+ - **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*)
+ - **Relevance levels**: `2` = Core, `1` = Contextual, `0` = Non-relevant
+ - **Primary metric**: nDCG with graded relevance (core=2, contextual=1)
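The graded metric can be sketched in a few lines of plain Python (a hedged illustration of the formula only, not the official TalentCLEF scorer; the ranking and qrels below are made-up toy data):

```python
import math

def dcg_at_k(gains, k=10):
    # Standard DCG: gain discounted by log2(rank + 1), ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def graded_ndcg_at_10(ranked_skill_ids, qrels):
    # qrels maps skill id -> graded relevance (2 = core, 1 = contextual, 0 = not relevant).
    gains = [qrels.get(sid, 0) for sid in ranked_skill_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal)
    return dcg_at_k(gains) / idcg if idcg > 0 else 0.0

# Toy example: the system ranks a contextual skill above a core one,
# so the score drops below 1.0 (to roughly 0.86).
qrels = {"s1": 2, "s2": 1}
print(round(graded_ndcg_at_10(["s2", "s1", "s3"], qrels), 4))
```

Swapping the two relevant skills back into the ideal order would give exactly 1.0, which is what makes nDCG a normalised, query-comparable score.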
 
 
 
 

  ---
 
 
  ### Installation

  ```bash
+ pip install sentence-transformers faiss-cpu
  ```

+ ### Encode and Compare

  ```python
  from sentence_transformers import SentenceTransformer
 

  model = SentenceTransformer("talentguide/skillscout-large")

+ # Build index once over your skill corpus
+ skill_texts = [...]  # list of skill names

  embs = model.encode(skill_texts, batch_size=128,
                      normalize_embeddings=True,

  index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
  index.add(embs)

  job_title = "Software Engineer"
  q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)
  scores, idxs = index.search(q, k=50)
+
  for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
      print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
  ```
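Note that `IndexFlatIP` returns raw inner products; they equal cosine similarities here only because every embedding is L2-normalised first. A small NumPy sketch of that equivalence (toy 2-D vectors, not real embeddings):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed from the raw vectors.
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product after L2 normalisation -- what IndexFlatIP effectively scores.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
ip = an @ bn

assert np.isclose(cos, ip)   # identical up to float error
print(round(float(ip), 4))   # 0.96 for these toy vectors
```

This is why `normalize_embeddings=True` must be used consistently on both the corpus and the query side: an unnormalised query against a normalised index silently changes the ranking metric.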
 

  ## Two-Stage Pipeline Integration

  ```
  Job title
+     |
+     v
+ [SkillScout Large]  <- this model
+     | top-200 candidates via FAISS ANN
+     v
  [Cross-encoder re-ranker]
+     | fine-grained re-scoring
+     v
+ Final ranked list (graded: core > contextual > irrelevant)
  ```

+ Blend formula (alpha=0.7 gives best validation results):

  ```python
  final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
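# A minimal worked example of the blend (hypothetical score values; in
# practice both scores should be normalised to a common [0, 1] range
# before mixing, otherwise alpha loses its meaning):
alpha = 0.7
biencoder_score = 0.82      # cosine similarity from SkillScout Large
crossencoder_score = 0.64   # normalised cross-encoder score
final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
# final_score == 0.766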
 

  | | Count |
  |---|---|
+ | Job-skill pairs (essential, raw) | ~57,500 |
+ | Job-skill pairs (optional, raw) | ~57,200 |
+ | Total InputExamples (after downsampling and canonical-pair inclusion) | **93,720** |
  | Validation queries | 304 |
+ | Validation corpus | 9,052 skills |
+ | Validation qrels | 56,417 |

+ Each ESCO job has 5-15 title aliases; skills have multiple phrasings.
+ Optional pairs are downsampled to 50% of the essential count to maintain class balance.
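The downsampling step can be sketched as follows (a hedged illustration with toy data; `essential_pairs` and `optional_pairs` are hypothetical names, and the real pipeline additionally injects canonical job-skill pairs):

```python
import random

random.seed(42)  # reproducibility, matching the training seed

# Toy stand-ins for the real ESCO pair lists.
essential_pairs = [(f"job{i}", f"skill{i}") for i in range(1000)]
optional_pairs = [(f"job{i}", f"opt_skill{i}") for i in range(900)]

# Keep every essential pair; sample optional pairs down to 50% of the
# essential count so optional skills cannot dominate the batches.
k = min(len(optional_pairs), len(essential_pairs) // 2)
train_pairs = essential_pairs + random.sample(optional_pairs, k)

print(len(train_pairs))  # 1500 for this toy split
```

With 1,000 essential and 900 optional toy pairs, the cap keeps 500 optional pairs, giving a fixed 2:1 essential-to-optional ratio regardless of the raw counts.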

  ### Hyperparameters

  ```
+ Loss          : MultipleNegativesRankingLoss (scale=20, cos_sim)
+ Batch size    : 64 (63 in-batch negatives per anchor)
+ Epochs        : 3
+ Warmup        : 10% of steps (~440 steps)
+ Optimizer     : AdamW fused
+ Learning rate : 5e-5, linear decay
+ Precision     : fp16 AMP
+ Max seq len   : 64 tokens
+ Best model    : saved by cosine-nDCG@10 on validation
  ```
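MultipleNegativesRankingLoss treats, for each anchor (job title), its paired skill as the positive and the other 63 in-batch skills as negatives: cross-entropy over scaled cosine similarities. A NumPy sketch of the objective (an illustration under those assumptions, not the library code; in training this is `sentence_transformers.losses.MultipleNegativesRankingLoss`):

```python
import numpy as np

def mnr_loss(job_embs, skill_embs, scale=20.0):
    # L2-normalise so the dot product is cosine similarity.
    j = job_embs / np.linalg.norm(job_embs, axis=1, keepdims=True)
    s = skill_embs / np.linalg.norm(skill_embs, axis=1, keepdims=True)
    logits = scale * (j @ s.T)  # (batch, batch) similarity matrix
    # Row i's positive is column i; every other column is an in-batch negative.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())

rng = np.random.default_rng(0)
jobs = rng.normal(size=(8, 16))                    # toy anchor embeddings
paired_skills = jobs + 0.1 * rng.normal(size=(8, 16))  # positives near anchors
random_skills = rng.normal(size=(8, 16))           # unrelated "skills"

print(mnr_loss(jobs, paired_skills) < mnr_loss(jobs, random_skills))  # True
```

The `scale=20` factor sharpens the softmax so small cosine gaps still produce a strong gradient, which is why larger batches (more in-batch negatives) tend to help this loss.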

  ### Training Curve

+ | Epoch | Step | Train Loss | nDCG@10 val | MAP@100 val |
+ |---|---|---|---|---|
+ | 0.34 | 500 | 2.9232 | 0.3430 | - |
+ | 0.68 | 1000 | 2.1179 | 0.3424 | - |
+ | 1.00 | 1465 | - | 0.3676 | 0.1758 |
+ | 1.37 | 2000 | 1.7070 | 0.3692 | - |
+ | 1.71 | 2500 | 1.6366 | 0.3744 | - |
+ | 2.00 | 2930 | - | 0.3717 | 0.1780 |
+ | **2.39** | **3500** | **1.4540** | **0.3769** | **0.1808** |

+ ### Validation Metrics (best checkpoint, step 3500)

  | Metric | Value |
  |---|---|
 
  | Accuracy@1 | 0.5099 |
  | Accuracy@3 | 0.7993 |
  | Accuracy@5 | 0.8914 |
+ | Accuracy@10 | 0.9474 |

+ Evaluated with `InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).
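These binary metrics are straightforward to compute from a ranked list; a minimal sketch with toy data, mirroring what `InformationRetrievalEvaluator` reports:

```python
def mrr_at_10(ranked, relevant):
    # Reciprocal rank of the first relevant hit within the top 10, else 0.
    for i, doc in enumerate(ranked[:10], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def accuracy_at_k(ranked, relevant, k):
    # 1 if any relevant document appears in the top k, else 0.
    return float(any(doc in relevant for doc in ranked[:k]))

ranked = ["s9", "s2", "s7"]   # toy system ranking for one query
relevant = {"s2", "s7"}       # binary qrels: any grade > 0 counts

print(mrr_at_10(ranked, relevant))         # 0.5 (first hit at rank 2)
print(accuracy_at_k(ranked, relevant, 1))  # 0.0 (top-1 miss)
```

The reported numbers are these per-query values averaged over all 304 validation queries; Accuracy@1 = 0.5099 therefore means the top-ranked skill is relevant for roughly half of the queries.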

+ ### Pipeline Results (graded relevance, full 9052-skill ranking)

  | Run | nDCG@10 graded | nDCG@10 binary | MAP |
  |---|---|---|---|
  | Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
  | **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
+ | SkillScout Large + cross-encoder (alpha=0.7) | **0.6896** | **0.7330** | 0.2481 |

  ---
 
 
  |---|---|---|
  | pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
  | NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
+ | **SkillScout Large (2026 val, Stage 1 only)** | **0.4545** | MNR fine-tuned bi-encoder |

  ---

  ## Limitations

+ - **English only** - trained on ESCO EN labels.
+ - **ESCO-domain optimised** - transfer to O*NET or custom taxonomies may require fine-tuning.
+ - **Max 64 tokens** - reduce long descriptions to a concise job title.
+ - **Graded distinction** - the bi-encoder alone does not reliably separate core vs contextual skills; a cross-encoder re-ranker is recommended for graded nDCG.

  ---
 
 

  ```bibtex
  @misc{talentguide-skillscout-2026,
+   title  = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
+   author = {TalentGuide},
+   year   = {2026},
+   url    = {https://huggingface.co/talentguide/skillscout-large}
  }

  @misc{talentclef2026taskb,
+   title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
+   author = {TalentCLEF Organizers},
+   year   = {2026},
+   url    = {https://talentclef.github.io/}
  }
  ```
 
 

  ## Framework Versions

+ - Python 3.12.10 | Sentence Transformers 5.3.0 | Transformers 5.5.0
+ - PyTorch 2.11.0+cu128 | Accelerate 1.13.0 | Tokenizers 0.22.2