Bhanu3 committed on
Commit 70f6be0 · verified · 1 Parent(s): 763b29a

Add talentclef-biencoder-v1: fine-tuned job-skill retrieval model with full model card

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED
@@ -0,0 +1,318 @@
---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense-retrieval
- information-retrieval
- job-skill-matching
- esco
- talentclef
- xlm-roberta
base_model: jjzha/esco-xlm-roberta-large
pipeline_tag: sentence-similarity
model-index:
- name: skillscout-large
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: TalentCLEF 2026 Task B — Validation (304 queries, 9052 skills)
      type: talentclef-2026-taskb-validation
    metrics:
    - type: cosine_ndcg_at_10
      value: 0.4830
      name: nDCG@10
    - type: cosine_map_at_100
      value: 0.1825
      name: MAP@100
    - type: cosine_mrr_at_10
      value: 0.6657
      name: MRR@10
    - type: cosine_accuracy_at_1
      value: 0.5099
      name: Accuracy@1
    - type: cosine_accuracy_at_10
      value: 0.9474
      name: Accuracy@10
---

# SkillScout Large — Job-to-Skill Dense Retriever

**SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
Given a job title (e.g., *"Data Scientist"*), the model encodes it into a 1024-dimensional embedding and retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) using cosine similarity.

This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).

> **Best pipeline result (TalentCLEF 2026 validation set):**
> nDCG@10 graded = **0.6896** · nDCG@10 binary = **0.7330**
> when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.
> Bi-encoder alone: nDCG@10 graded = **0.3621** · MAP = **0.4545**

---

## Model Summary

| Property | Value |
|---|---|
| Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
| Architecture | XLM-RoBERTa-large + mean pooling |
| Embedding dimension | 1024 |
| Max sequence length | 64 tokens |
| Training loss | Multiple Negatives Ranking (MNR) |
| Training pairs | 93,720 (ESCO job–skill pairs, essential + optional) |
| Epochs | 3 |
| Best checkpoint | Step 3500 (saved by validation nDCG@10) |
| Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |

---

## What is TalentCLEF Task B?

**TalentCLEF 2026 Task B** is a graded information-retrieval shared task:

- **Query**: a job title (e.g., *"Electrician"*)
- **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*, *"comply with electrical safety regulations"*)
- **Relevance levels**:
  - `2` — Core skill (essential regardless of context)
  - `1` — Contextual skill (depends on employer / industry)
  - `0` — Non-relevant

**Primary metric**: nDCG with graded relevance (core=2, contextual=1)

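To make the graded metric concrete, here is a minimal, self-contained nDCG@10 sketch. The ranking below is a hypothetical example, and this is a plain textbook formulation, not TalentCLEF's official scoring code:

```python
import math

def dcg_at_k(gains, k=10):
    # Discounted cumulative gain: gain_i / log2(rank_i + 1), ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=10):
    # Normalise by the DCG of the ideal (descending-gain) ordering.
    ideal = sorted(ranked_gains, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_gains, k) / idcg if idcg > 0 else 0.0

# Gains of the top-5 retrieved skills for one hypothetical query
# (2 = core, 1 = contextual, 0 = non-relevant).
ranking = [2, 0, 1, 2, 0]
score = ndcg_at_k(ranking, k=10)
print(f"{score:.4f}")
```

A perfect ordering (all core skills first, then contextual) scores 1.0; placing a non-relevant skill at rank 2, as above, is penalised more heavily than the same mistake lower down.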
---

## Usage

### Installation

```bash
pip install sentence-transformers faiss-cpu  # or faiss-gpu
```

### Encode & Compare

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("talentguide/skillscout-large")

job = "Data Scientist"
skills = ["data science", "machine learning", "install electric switches"]

# Normalised embeddings make the dot product equal to cosine similarity.
embs = model.encode([job] + skills, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T

for skill, score in zip(skills, scores):
    print(f"{score:.3f}  {skill}")
# 0.872  data science
# 0.731  machine learning
# 0.112  install electric switches
```

### Full Retrieval with FAISS (Recommended)

```python
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

model = SentenceTransformer("talentguide/skillscout-large")

# --- Build index once over your skill corpus ---
skill_texts = [...]  # list of skill names / descriptions

embs = model.encode(skill_texts, batch_size=128,
                    normalize_embeddings=True,
                    show_progress_bar=True).astype(np.float32)

index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
index.add(embs)

# --- Query at inference time ---
job_title = "Software Engineer"
q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)

scores, idxs = index.search(q, k=50)
for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
    print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
```

### Demo Output

```
Software Engineer
  1. [0.942] define software architecture
  2. [0.938] software frameworks
  3. [0.935] create software design

Data Scientist
  1. [0.951] data science
  2. [0.921] establish data processes
  3. [0.919] create data models

Electrician
  1. [0.944] install electric switches
  2. [0.938] install electricity sockets
  3. [0.930] use electrical wire tools
```

---

## Two-Stage Pipeline Integration

SkillScout Large is designed as **Stage 1** — fast ANN retrieval.
For maximum ranking quality, pair it with a cross-encoder re-ranker:

```
Job title
    │
    ▼
[SkillScout Large]   ← this model
    │  top-200 candidates (FAISS ANN, ~40ms)
    ▼
[Cross-encoder re-ranker]
    │  fine-grained re-scoring of top-200
    ▼
Final ranked list (graded: core > contextual > irrelevant)
```

**Score blending** (best result at α = 0.7):

```python
final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
```

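Since bi-encoder cosine similarities are bounded in [-1, 1] while cross-encoder logits are typically unbounded, a per-query normalisation before blending is a reasonable sketch (the min-max step here is an assumption for illustration; the model card only specifies the blend formula and α = 0.7):

```python
import numpy as np

def minmax(x):
    # Map one query's candidate scores to [0, 1] so both scales are comparable.
    x = np.asarray(x, dtype=np.float64)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def blend(bi_scores, ce_scores, alpha=0.7):
    # final = alpha * biencoder + (1 - alpha) * crossencoder, as above.
    return alpha * minmax(bi_scores) + (1 - alpha) * minmax(ce_scores)

bi = [0.91, 0.85, 0.80]   # bi-encoder cosine similarities for 3 candidates
ce = [2.1, 3.4, -0.5]     # hypothetical cross-encoder logits for the same 3
final = blend(bi, ce)
ranking = np.argsort(-final)  # candidate indices, best first
```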
---

## Training Details

### Data

Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 2026 training split.

| | Count |
|---|---|
| Raw job–skill pairs (essential + optional) | 114,699 |
| ESCO jobs with aliases | 3,039 |
| ESCO skills with aliases | 13,939 |
| Training InputExamples (after canonical-pair inclusion) | **93,720** |
| Validation queries | 304 |
| Validation corpus (skills) | 9,052 |
| Validation relevance judgments | 56,417 |

Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.

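The balancing scheme can be sketched as follows (a hypothetical reconstruction with made-up pairs, not the original training script):

```python
import random

def build_training_pairs(essential, optional, optional_ratio=0.5, seed=42):
    # Keep every essential pair; downsample optional pairs to
    # optional_ratio x the essential count (or all of them, if fewer).
    rng = random.Random(seed)
    n_optional = min(len(optional), int(len(essential) * optional_ratio))
    return list(essential) + rng.sample(list(optional), n_optional)

essential = [("electrician", "install electric switches"),
             ("electrician", "comply with electrical safety regulations"),
             ("data scientist", "data science"),
             ("data scientist", "create data models")]
optional = [("electrician", "use measurement instruments")] * 10

pairs = build_training_pairs(essential, optional)
# 4 essential pairs are kept in full; 2 optional pairs (50% of 4) are sampled.
```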
### Hyperparameters

```
Loss             : MultipleNegativesRankingLoss (scale=20, cos_sim)
Batch size       : 64 → 63 in-batch negatives per anchor
Epochs           : 3
Warmup           : 10% of total steps (~440 steps)
Optimizer        : AdamW (fused), lr=5e-5, linear decay
Precision        : fp16 (AMP)
Max seq length   : 64 tokens
Best model saved : by cosine-nDCG@10 on validation (eval every 500 steps)
Seed             : 42
```

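MNR treats every other positive in the batch as a negative, which is where the "63 in-batch negatives per anchor" comes from at batch size 64. A minimal NumPy sketch of the loss (scaled cosine similarity with softmax cross-entropy over in-batch candidates; an illustration, not the sentence-transformers implementation):

```python
import numpy as np

def mnr_loss(job_embs, skill_embs, scale=20.0):
    # Row-normalise so dot products become cosine similarities.
    j = job_embs / np.linalg.norm(job_embs, axis=1, keepdims=True)
    s = skill_embs / np.linalg.norm(skill_embs, axis=1, keepdims=True)
    sim = scale * (j @ s.T)  # (B, B): job i scored against every in-batch skill
    # Softmax cross-entropy with the diagonal as the positive class:
    # each job's paired skill competes against the B-1 other skills.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
B, D = 8, 16
jobs = rng.normal(size=(B, D))
aligned = jobs + 0.1 * rng.normal(size=(B, D))   # near-duplicate "paired" skills
random_skills = rng.normal(size=(B, D))          # unrelated skills

print(mnr_loss(jobs, aligned), mnr_loss(jobs, random_skills))
```

When job and skill embeddings are close, the diagonal dominates the softmax and the loss approaches zero; with unrelated skills it stays high, which is exactly the gradient signal that pulls paired job–skill embeddings together.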
### Training Curve

| Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
|:---:|:---:|:---:|:---:|:---:|
| 0.34 | 500 | 2.9232 | 0.3430 | — |
| 0.68 | 1000 | 2.1179 | 0.3424 | — |
| 1.00 | 1465 | — | 0.3676 | 0.1758 |
| 1.37 | 2000 | 1.7070 | 0.3692 | — |
| 1.71 | 2500 | 1.6366 | 0.3744 | — |
| 2.00 | 2930 | — | 0.3717 | 0.1780 |
| 2.39 | **3500** ✓ | **1.4540** | **0.3769** | **0.1808** |

*Best checkpoint saved at step 3500.*

### Validation Metrics (best checkpoint, binary relevance)

| Metric | Value |
|---|---|
| **nDCG@10** | **0.4830** |
| nDCG@50 | 0.4240 |
| nDCG@100 | 0.3769 |
| **MAP@100** | **0.1825** |
| **MRR@10** | **0.6657** |
| Accuracy@1 | 0.5099 |
| Accuracy@3 | 0.7993 |
| Accuracy@5 | 0.8914 |
| Accuracy@10 | **0.9474** |

*Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).*

### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)

| Run | nDCG@10 graded | nDCG@10 binary | MAP |
|---|---|---|---|
| Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
| **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
| SkillScout Large + cross-encoder (α=0.7) | **0.6896** | **0.7330** | 0.2481 |

---

## Competitive Context (TalentCLEF 2025 Task B)

| Team | MAP (test) | Approach |
|---|---|---|
| pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
| NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
| **SkillScout Large (2026 val)** | **0.4545** | MNR fine-tuned bi-encoder (Stage 1 only) |

*Note: the 2025 rows are official test-set scores from a different campaign year, while the SkillScout figure is measured on the 2026 validation split — the numbers are indicative only, not directly comparable.*

---

## Limitations

- **English only** — trained on ESCO EN labels.
- **ESCO domain** — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
- **64-token cap** — long job descriptions should be reduced to a concise title before encoding; tokens beyond the cap are silently truncated.
- **Graded distinction** — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.

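As a crude workaround for the 64-token cap, a long posting can be reduced to its first non-empty line before encoding (`to_query_title` is a hypothetical helper, not part of this repository):

```python
def to_query_title(text, max_words=12):
    # Take the posting's first non-empty line and cap the word count;
    # anything past the encoder's 64-token limit would be truncated anyway.
    first_line = next(line for line in text.strip().splitlines() if line.strip())
    return " ".join(first_line.split()[:max_words])

posting = """Senior Data Scientist (Remote)

We are looking for a Senior Data Scientist to join our ML platform team...
"""
query = to_query_title(posting)  # "Senior Data Scientist (Remote)"
```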
---

## Citation

```bibtex
@misc{talentguide-skillscout-2026,
  title  = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
  author = {TalentGuide},
  year   = {2026},
  url    = {https://huggingface.co/talentguide/skillscout-large}
}

@misc{talentclef2026taskb,
  title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
  author = {TalentCLEF Organizers},
  year   = {2026},
  url    = {https://talentclef.github.io/}
}
```

---

## Framework Versions

| Package | Version |
|---|---|
| Python | 3.12.10 |
| sentence-transformers | 5.3.0 |
| transformers | 5.5.0 |
| PyTorch | 2.11.0+cu128 |
| Accelerate | 1.13.0 |
| Tokenizers | 0.22.2 |

---

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,30 @@
{
  "add_cross_attention": false,
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "is_decoder": false,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tie_word_embeddings": true,
  "transformers_version": "5.5.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
{
  "model_type": "SentenceTransformer",
  "__version__": {
    "sentence_transformers": "5.3.0",
    "transformers": "5.5.0",
    "pytorch": "2.11.0+cu128"
  },
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
eval/Information-Retrieval_evaluation_taskb_val_results.csv ADDED
@@ -0,0 +1,4 @@
epoch,steps,cosine-Accuracy@1,cosine-Accuracy@3,cosine-Accuracy@5,cosine-Accuracy@10,cosine-Precision@1,cosine-Recall@1,cosine-Precision@3,cosine-Recall@3,cosine-Precision@5,cosine-Recall@5,cosine-Precision@10,cosine-Recall@10,cosine-MRR@10,cosine-NDCG@10,cosine-NDCG@50,cosine-NDCG@100,cosine-MAP@100
1.0,1465,0.5328947368421053,0.7861842105263158,0.8782894736842105,0.9276315789473685,0.5328947368421053,0.0032031880872582354,0.506578947368421,0.008898304486990168,0.48618421052631583,0.014146896345718819,0.4578947368421053,0.026269226379462513,0.6724402151211364,0.47403210625956155,0.4101240573414333,0.36757918645734217,0.1758130011436744
2.0,2930,0.5296052631578947,0.8092105263157895,0.8980263157894737,0.9375,0.5296052631578947,0.00316041371692307,0.4846491228070175,0.008507144941066613,0.48947368421052634,0.014252504636544197,0.45921052631578946,0.026632272459831994,0.6762596595655807,0.4709457808372526,0.4187643981711032,0.3717445435846663,0.17801339821892972
3.0,4395,0.5296052631578947,0.8026315789473685,0.868421052631579,0.9375,0.5296052631578947,0.0032347237398130807,0.4956140350877193,0.00875841181887072,0.4901315789473684,0.014392541997646157,0.4648026315789474,0.026968519084827156,0.6734583855472013,0.4766646513891743,0.4224348906713204,0.375764794101007,0.18082895608805505
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7e120e8bdcd7a4a29d97858e8ae7cac3c0087594a5d6b9430dd4e3981b6f61b9
size 2239607120
modules.json ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 64,
  "do_lower_case": false
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bc5c1151948923156f20bcafd54fd796705d693f8d7b56c83aec49d651f6d602
size 17082986
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
{
  "add_prefix_space": true,
  "backend": "tokenizers",
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "is_local": false,
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}