Bhanu3 committed on
Commit 70f6be0 · verified · 1 Parent(s): 763b29a

Add talentclef-biencoder-v1: fine-tuned job-skill retrieval model with full model card

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
{
  "word_embedding_dimension": 1024,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED
@@ -0,0 +1,318 @@
---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense-retrieval
- information-retrieval
- job-skill-matching
- esco
- talentclef
- xlm-roberta
base_model: jjzha/esco-xlm-roberta-large
pipeline_tag: sentence-similarity
model-index:
- name: skillscout-large
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: TalentCLEF 2026 Task B — Validation (304 queries, 9052 skills)
      type: talentclef-2026-taskb-validation
    metrics:
    - type: cosine_ndcg_at_10
      value: 0.4830
      name: nDCG@10
    - type: cosine_map_at_100
      value: 0.1825
      name: MAP@100
    - type: cosine_mrr_at_10
      value: 0.6657
      name: MRR@10
    - type: cosine_accuracy_at_1
      value: 0.5099
      name: Accuracy@1
    - type: cosine_accuracy_at_10
      value: 0.9474
      name: Accuracy@10
---

# SkillScout Large — Job-to-Skill Dense Retriever

**SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
Given a job title (e.g., *"Data Scientist"*), the model encodes it into a 1024-dimensional embedding and retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) using cosine similarity.

This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).

> **Best pipeline result (TalentCLEF 2026 validation set):**
> nDCG@10 graded = **0.6896** · nDCG@10 binary = **0.7330**
> when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.
> Bi-encoder alone: nDCG@10 graded = **0.3621** · MAP = **0.4545**

---

## Model Summary

| Property | Value |
|---|---|
| Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
| Architecture | XLM-RoBERTa-large + mean pooling |
| Embedding dimension | 1024 |
| Max sequence length | 64 tokens |
| Training loss | Multiple Negatives Ranking (MNR) |
| Training pairs | 93,720 (ESCO job–skill pairs, essential + optional) |
| Epochs | 3 |
| Best checkpoint | Step 3500 (saved by validation nDCG@10) |
| Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |

---

## What is TalentCLEF Task B?

**TalentCLEF 2026 Task B** is a graded information-retrieval shared task:

- **Query**: a job title (e.g., *"Electrician"*)
- **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*, *"comply with electrical safety regulations"*)
- **Relevance levels**:
  - `2` — Core skill (essential regardless of context)
  - `1` — Contextual skill (depends on employer / industry)
  - `0` — Non-relevant

**Primary metric**: nDCG with graded relevance (core=2, contextual=1)

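To make the graded metric concrete, here is a minimal, self-contained nDCG@10 sketch. The ranking below is a hypothetical example, and this is a plain textbook formulation, not TalentCLEF's official scoring code:

```python
import math

def dcg_at_k(gains, k=10):
    # Discounted cumulative gain: gain_i / log2(rank_i + 1), ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k=10):
    # Normalise by the DCG of the ideal (descending-gain) ordering.
    ideal = sorted(ranked_gains, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_gains, k) / idcg if idcg > 0 else 0.0

# Gains of the top-5 retrieved skills for one hypothetical query
# (2 = core, 1 = contextual, 0 = non-relevant).
ranking = [2, 0, 1, 2, 0]
score = ndcg_at_k(ranking, k=10)
print(f"{score:.4f}")
```

A perfect ordering (all core skills first, then contextual) scores 1.0; placing a non-relevant skill at rank 2, as above, is penalised more heavily than the same mistake lower down.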
---

## Usage

### Installation

```bash
pip install sentence-transformers faiss-cpu  # or faiss-gpu
```

### Encode & Compare

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("talentguide/skillscout-large")

job = "Data Scientist"
skills = ["data science", "machine learning", "install electric switches"]

# Normalised embeddings make the dot product equal to cosine similarity.
embs = model.encode([job] + skills, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T

for skill, score in zip(skills, scores):
    print(f"{score:.3f}  {skill}")
# 0.872  data science
# 0.731  machine learning
# 0.112  install electric switches
```

### Full Retrieval with FAISS (Recommended)

```python
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

model = SentenceTransformer("talentguide/skillscout-large")

# --- Build index once over your skill corpus ---
skill_texts = [...]  # list of skill names / descriptions

embs = model.encode(skill_texts, batch_size=128,
                    normalize_embeddings=True,
                    show_progress_bar=True).astype(np.float32)

index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
index.add(embs)

# --- Query at inference time ---
job_title = "Software Engineer"
q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)

scores, idxs = index.search(q, k=50)
for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
    print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
```

### Demo Output

```
Software Engineer
  1. [0.942] define software architecture
  2. [0.938] software frameworks
  3. [0.935] create software design

Data Scientist
  1. [0.951] data science
  2. [0.921] establish data processes
  3. [0.919] create data models

Electrician
  1. [0.944] install electric switches
  2. [0.938] install electricity sockets
  3. [0.930] use electrical wire tools
```

---

## Two-Stage Pipeline Integration

SkillScout Large is designed as **Stage 1** — fast ANN retrieval.
For maximum ranking quality, pair it with a cross-encoder re-ranker:

```
Job title
    │
    ▼
[SkillScout Large]   ← this model
    │  top-200 candidates (FAISS ANN, ~40ms)
    ▼
[Cross-encoder re-ranker]
    │  fine-grained re-scoring of top-200
    ▼
Final ranked list (graded: core > contextual > irrelevant)
```

**Score blending** (best result at α = 0.7):

```python
final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
```

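Since bi-encoder cosine similarities are bounded in [-1, 1] while cross-encoder logits are typically unbounded, a per-query normalisation before blending is a reasonable sketch (the min-max step here is an assumption for illustration; the model card only specifies the blend formula and α = 0.7):

```python
import numpy as np

def minmax(x):
    # Map one query's candidate scores to [0, 1] so both scales are comparable.
    x = np.asarray(x, dtype=np.float64)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def blend(bi_scores, ce_scores, alpha=0.7):
    # final = alpha * biencoder + (1 - alpha) * crossencoder, as above.
    return alpha * minmax(bi_scores) + (1 - alpha) * minmax(ce_scores)

bi = [0.91, 0.85, 0.80]   # bi-encoder cosine similarities for 3 candidates
ce = [2.1, 3.4, -0.5]     # hypothetical cross-encoder logits for the same 3
final = blend(bi, ce)
ranking = np.argsort(-final)  # candidate indices, best first
```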
---

## Training Details

### Data

Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 2026 training split.

| | Count |
|---|---|
| Raw job–skill pairs (essential + optional) | 114,699 |
| ESCO jobs with aliases | 3,039 |
| ESCO skills with aliases | 13,939 |
| Training InputExamples (after canonical-pair inclusion) | **93,720** |
| Validation queries | 304 |
| Validation corpus (skills) | 9,052 |
| Validation relevance judgments | 56,417 |

Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.

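The balancing scheme can be sketched as follows (a hypothetical reconstruction with made-up pairs, not the original training script):

```python
import random

def build_training_pairs(essential, optional, optional_ratio=0.5, seed=42):
    # Keep every essential pair; downsample optional pairs to
    # optional_ratio x the essential count (or all of them, if fewer).
    rng = random.Random(seed)
    n_optional = min(len(optional), int(len(essential) * optional_ratio))
    return list(essential) + rng.sample(list(optional), n_optional)

essential = [("electrician", "install electric switches"),
             ("electrician", "comply with electrical safety regulations"),
             ("data scientist", "data science"),
             ("data scientist", "create data models")]
optional = [("electrician", "use measurement instruments")] * 10

pairs = build_training_pairs(essential, optional)
# 4 essential pairs are kept in full; 2 optional pairs (50% of 4) are sampled.
```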
### Hyperparameters

```
Loss             : MultipleNegativesRankingLoss (scale=20, cos_sim)
Batch size       : 64 → 63 in-batch negatives per anchor
Epochs           : 3
Warmup           : 10% of total steps (~440 steps)
Optimizer        : AdamW (fused), lr=5e-5, linear decay
Precision        : fp16 (AMP)
Max seq length   : 64 tokens
Best model saved : by cosine-nDCG@10 on validation (eval every 500 steps)
Seed             : 42
```

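MNR treats every other positive in the batch as a negative, which is where the "63 in-batch negatives per anchor" comes from at batch size 64. A minimal NumPy sketch of the loss (scaled cosine similarity with softmax cross-entropy over in-batch candidates; an illustration, not the sentence-transformers implementation):

```python
import numpy as np

def mnr_loss(job_embs, skill_embs, scale=20.0):
    # Row-normalise so dot products become cosine similarities.
    j = job_embs / np.linalg.norm(job_embs, axis=1, keepdims=True)
    s = skill_embs / np.linalg.norm(skill_embs, axis=1, keepdims=True)
    sim = scale * (j @ s.T)  # (B, B): job i scored against every in-batch skill
    # Softmax cross-entropy with the diagonal as the positive class:
    # each job's paired skill competes against the B-1 other skills.
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
B, D = 8, 16
jobs = rng.normal(size=(B, D))
aligned = jobs + 0.1 * rng.normal(size=(B, D))   # near-duplicate "paired" skills
random_skills = rng.normal(size=(B, D))          # unrelated skills

print(mnr_loss(jobs, aligned), mnr_loss(jobs, random_skills))
```

When job and skill embeddings are close, the diagonal dominates the softmax and the loss approaches zero; with unrelated skills it stays high, which is exactly the gradient signal that pulls paired job–skill embeddings together.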
### Training Curve

| Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
|:---:|:---:|:---:|:---:|:---:|
| 0.34 | 500 | 2.9232 | 0.3430 | — |
| 0.68 | 1000 | 2.1179 | 0.3424 | — |
| 1.00 | 1465 | — | 0.3676 | 0.1758 |
| 1.37 | 2000 | 1.7070 | 0.3692 | — |
| 1.71 | 2500 | 1.6366 | 0.3744 | — |
| 2.00 | 2930 | — | 0.3717 | 0.1780 |
| 2.39 | **3500** ✓ | **1.4540** | **0.3769** | **0.1808** |

*Best checkpoint saved at step 3500.*

### Validation Metrics (best checkpoint, binary relevance)

| Metric | Value |
|---|---|
| **nDCG@10** | **0.4830** |
| nDCG@50 | 0.4240 |
| nDCG@100 | 0.3769 |
| **MAP@100** | **0.1825** |
| **MRR@10** | **0.6657** |
| Accuracy@1 | 0.5099 |
| Accuracy@3 | 0.7993 |
| Accuracy@5 | 0.8914 |
| Accuracy@10 | **0.9474** |

*Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).*

### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)

| Run | nDCG@10 graded | nDCG@10 binary | MAP |
|---|---|---|---|
| Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
| **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
| SkillScout Large + cross-encoder (α=0.7) | **0.6896** | **0.7330** | 0.2481 |

---

## Competitive Context (TalentCLEF 2025 Task B)

| Team | MAP (test) | Approach |
|---|---|---|
| pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
| NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
| **SkillScout Large (2026 val)** | **0.4545** | MNR fine-tuned bi-encoder (Stage 1 only) |

*Note: the 2025 rows are official test-set scores from a different campaign year, while the SkillScout figure is measured on the 2026 validation split — the numbers are indicative only, not directly comparable.*

---

## Limitations

- **English only** — trained on ESCO EN labels.
- **ESCO domain** — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
- **64-token cap** — long job descriptions should be reduced to a concise title before encoding; tokens beyond the cap are silently truncated.
- **Graded distinction** — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.

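As a crude workaround for the 64-token cap, a long posting can be reduced to its first non-empty line before encoding (`to_query_title` is a hypothetical helper, not part of this repository):

```python
def to_query_title(text, max_words=12):
    # Take the posting's first non-empty line and cap the word count;
    # anything past the encoder's 64-token limit would be truncated anyway.
    first_line = next(line for line in text.strip().splitlines() if line.strip())
    return " ".join(first_line.split()[:max_words])

posting = """Senior Data Scientist (Remote)

We are looking for a Senior Data Scientist to join our ML platform team...
"""
query = to_query_title(posting)  # "Senior Data Scientist (Remote)"
```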
---

## Citation

```bibtex
@misc{talentguide-skillscout-2026,
  title  = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
  author = {TalentGuide},
  year   = {2026},
  url    = {https://huggingface.co/talentguide/skillscout-large}
}

@misc{talentclef2026taskb,
  title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
  author = {TalentCLEF Organizers},
  year   = {2026},
  url    = {https://talentclef.github.io/}
}
```

---

## Framework Versions

| Package | Version |
|---|---|
| Python | 3.12.10 |
| sentence-transformers | 5.3.0 |
| transformers | 5.5.0 |
| PyTorch | 2.11.0+cu128 |
| Accelerate | 1.13.0 |
| Tokenizers | 0.22.2 |

---

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
config.json ADDED
@@ -0,0 +1,30 @@
{
  "add_cross_attention": false,
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "is_decoder": false,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tie_word_embeddings": true,
  "transformers_version": "5.5.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
{
  "model_type": "SentenceTransformer",
  "__version__": {
    "sentence_transformers": "5.3.0",
    "transformers": "5.5.0",
    "pytorch": "2.11.0+cu128"
  },
  "prompts": {
    "query": "",
    "document": ""
  },
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
eval/Information-Retrieval_evaluation_taskb_val_results.csv ADDED
@@ -0,0 +1,4 @@
epoch,steps,cosine-Accuracy@1,cosine-Accuracy@3,cosine-Accuracy@5,cosine-Accuracy@10,cosine-Precision@1,cosine-Recall@1,cosine-Precision@3,cosine-Recall@3,cosine-Precision@5,cosine-Recall@5,cosine-Precision@10,cosine-Recall@10,cosine-MRR@10,cosine-NDCG@10,cosine-NDCG@50,cosine-NDCG@100,cosine-MAP@100
1.0,1465,0.5328947368421053,0.7861842105263158,0.8782894736842105,0.9276315789473685,0.5328947368421053,0.0032031880872582354,0.506578947368421,0.008898304486990168,0.48618421052631583,0.014146896345718819,0.4578947368421053,0.026269226379462513,0.6724402151211364,0.47403210625956155,0.4101240573414333,0.36757918645734217,0.1758130011436744
2.0,2930,0.5296052631578947,0.8092105263157895,0.8980263157894737,0.9375,0.5296052631578947,0.00316041371692307,0.4846491228070175,0.008507144941066613,0.48947368421052634,0.014252504636544197,0.45921052631578946,0.026632272459831994,0.6762596595655807,0.4709457808372526,0.4187643981711032,0.3717445435846663,0.17801339821892972
3.0,4395,0.5296052631578947,0.8026315789473685,0.868421052631579,0.9375,0.5296052631578947,0.0032347237398130807,0.4956140350877193,0.00875841181887072,0.4901315789473684,0.014392541997646157,0.4648026315789474,0.026968519084827156,0.6734583855472013,0.4766646513891743,0.4224348906713204,0.375764794101007,0.18082895608805505
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7e120e8bdcd7a4a29d97858e8ae7cac3c0087594a5d6b9430dd4e3981b6f61b9
size 2239607120
modules.json ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 64,
  "do_lower_case": false
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bc5c1151948923156f20bcafd54fd796705d693f8d7b56c83aec49d651f6d602
size 17082986
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
{
  "add_prefix_space": true,
  "backend": "tokenizers",
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "is_local": false,
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}