Bhanu3 committed (verified) · Commit 6e1d756 · Parent(s): 9550ada

Update model card: full SkillScout Large documentation

Files changed (1):
  1. README.md (+78 −100)

README.md CHANGED
@@ -22,7 +22,7 @@ model-index:
  type: information-retrieval
  name: Information Retrieval
  dataset:
- name: TalentCLEF 2026 Task B Validation (304 queries, 9052 skills)
  type: talentclef-2026-taskb-validation
  metrics:
  - type: cosine_ndcg_at_10
@@ -34,25 +34,25 @@ model-index:
  - type: cosine_mrr_at_10
  value: 0.6657
  name: MRR@10
- - type: cosine_accuracy_at_1
- value: 0.5099
- name: Accuracy@1
  - type: cosine_accuracy_at_10
  value: 0.9474
  name: Accuracy@10
  ---

- # SkillScout Large Job-to-Skill Dense Retriever

- **SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
- Given a job title (e.g., *"Data Scientist"*), it encodes it into a 1024-dimensional embedding and retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) using cosine similarity.

- This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).

- > **Best pipeline result (TalentCLEF 2026 validation set):**
- > nDCG@10 graded = **0.6896** · nDCG@10 binary = **0.7330**
- > when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.
- > Bi-encoder alone: nDCG@10 graded = **0.3621** · MAP = **0.4545**

  ---

@@ -60,15 +60,15 @@ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, tr

  | Property | Value |
  |---|---|
- | Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
  | Architecture | XLM-RoBERTa-large + mean pooling |
  | Embedding dimension | 1024 |
  | Max sequence length | 64 tokens |
  | Training loss | Multiple Negatives Ranking (MNR) |
- | Training pairs | 93,720 (ESCO job→skill pairs, essential + optional) |
  | Epochs | 3 |
- | Best checkpoint | Step 3500 (saved by validation nDCG@10) |
- | Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |

  ---
@@ -77,13 +77,9 @@ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, tr

  **TalentCLEF 2026 Task B** is a graded information-retrieval shared task:

  - **Query**: a job title (e.g., *"Electrician"*)
- - **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*, *"comply with electrical safety regulations"*)
- - **Relevance levels**:
-   - `2` — Core skill (essential regardless of context)
-   - `1` — Contextual skill (depends on employer / industry)
-   - `0` — Non-relevant
-
- **Primary metric**: nDCG with graded relevance (core=2, contextual=1)

  ---
 
@@ -92,10 +88,10 @@ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, tr

  ### Installation

  ```bash
- pip install sentence-transformers faiss-cpu  # or faiss-gpu
  ```

- ### Encode & Compare

  ```python
  from sentence_transformers import SentenceTransformer
@@ -123,8 +119,8 @@ import faiss, numpy as np

  model = SentenceTransformer("talentguide/skillscout-large")

- # --- Build index once over your skill corpus ---
- skill_texts = [...]  # list of skill names / descriptions

  embs = model.encode(skill_texts, batch_size=128,
                      normalize_embeddings=True,
@@ -133,11 +129,10 @@ embs = model.encode(skill_texts, batch_size=128,

  index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
  index.add(embs)

- # --- Query at inference time ---
  job_title = "Software Engineer"
  q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)
-
  scores, idxs = index.search(q, k=50)

  for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
      print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
  ```
@@ -165,23 +160,20 @@ Electrician

  ## Two-Stage Pipeline Integration

- SkillScout Large is designed as **Stage 1** — fast ANN retrieval.
- For maximum ranking quality, pair it with a cross-encoder re-ranker:
-
  ```
  Job title
-     ↓
-     ↓
- [SkillScout Large]   ← this model
-     top-200 candidates (FAISS ANN, ~40ms)
-     ↓
  [Cross-encoder re-ranker]
-     fine-grained re-scoring of top-200
-     ↓
- Final ranked list (graded: core > contextual > irrelevant)
  ```

- **Score blending** (best result at α = 0.7):

  ```python
  final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
@@ -197,45 +189,43 @@ Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 202

  | | Count |
  |---|---|
- | Raw job–skill pairs (essential + optional) | 114,699 |
- | ESCO jobs with aliases | 3,039 |
- | ESCO skills with aliases | 13,939 |
- | Training InputExamples (after canonical-pair inclusion) | **93,720** |
  | Validation queries | 304 |
- | Validation corpus (skills) | 9,052 |
- | Validation relevance judgments | 56,417 |

- Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.

  ### Hyperparameters

  ```
- Loss             : MultipleNegativesRankingLoss (scale=20, cos_sim)
- Batch size       : 64 → 63 in-batch negatives per anchor
- Epochs           : 3
- Warmup           : 10% of total steps (~440 steps)
- Optimizer        : AdamW (fused), lr=5e-5, linear decay
- Precision        : fp16 (AMP)
- Max seq length   : 64 tokens
- Best model saved : by cosine-nDCG@10 on validation (eval every 500 steps)
- Seed             : 42
  ```

  ### Training Curve

- | Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
- |:---:|:---:|:---:|:---:|:---:|
- | 0.34 | 500 | 2.9232 | 0.3430 | |
- | 0.68 | 1000 | 2.1179 | 0.3424 | |
- | 1.00 | 1465 | | 0.3676 | 0.1758 |
- | 1.37 | 2000 | 1.7070 | 0.3692 | |
- | 1.71 | 2500 | 1.6366 | 0.3744 | |
- | 2.00 | 2930 | | 0.3717 | 0.1780 |
- | 2.39 | **3500** | **1.4540** | **0.3769** | **0.1808** |
-
- *Best checkpoint saved at step 3500.*

- ### Validation Metrics (best checkpoint, binary relevance)

  | Metric | Value |
  |---|---|
@@ -247,17 +237,17 @@ Seed : 42

  | Accuracy@1 | 0.5099 |
  | Accuracy@3 | 0.7993 |
  | Accuracy@5 | 0.8914 |
- | Accuracy@10 | **0.9474** |

- *Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).*

- ### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)

  | Run | nDCG@10 graded | nDCG@10 binary | MAP |
  |---|---|---|---|
  | Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
  | **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
- | SkillScout Large + cross-encoder (α=0.7) | **0.6896** | **0.7330** | 0.2481 |

  ---
@@ -267,16 +257,16 @@ Seed : 42

  |---|---|---|
  | pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
  | NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
- | **SkillScout Large (2026 val)** | **0.4545** | MNR fine-tuned bi-encoder (Stage 1 only) |

  ---

  ## Limitations

- - **English only** — trained on ESCO EN labels.
- - **ESCO-domain** — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
- - **64-token cap** — long job descriptions should be reduced to a concise title before encoding.
- - **Graded distinction** — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.

  ---
 
@@ -284,17 +274,17 @@ Seed : 42

  ```bibtex
  @misc{talentguide-skillscout-2026,
-   title  = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
-   author = {TalentGuide},
-   year   = {2026},
-   url    = {https://huggingface.co/talentguide/skillscout-large}
  }

  @misc{talentclef2026taskb,
-   title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
-   author = {TalentCLEF Organizers},
-   year   = {2026},
-   url    = {https://talentclef.github.io/}
  }
  ```
 
@@ -302,17 +292,5 @@ Seed : 42

  ## Framework Versions

- | Package | Version |
- |---|---|
- | Python | 3.12.10 |
- | sentence-transformers | 5.3.0 |
- | transformers | 5.5.0 |
- | PyTorch | 2.11.0+cu128 |
- | Accelerate | 1.13.0 |
- | Tokenizers | 0.22.2 |
-
- ---
-
- ## License
-
- [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
  type: information-retrieval
  name: Information Retrieval
  dataset:
+ name: TalentCLEF 2026 Task B Validation
  type: talentclef-2026-taskb-validation
  metrics:
  - type: cosine_ndcg_at_10
 
  - type: cosine_mrr_at_10
  value: 0.6657
  name: MRR@10
  - type: cosine_accuracy_at_10
  value: 0.9474
  name: Accuracy@10
  ---

+ # SkillScout Large - Job-to-Skill Dense Retriever

+ **SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
+ Given a job title (e.g., *"Data Scientist"*), it produces a 1024-dimensional embedding and
+ retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/)
+ skill gazetteer (9,052 skills) via cosine similarity.

+ This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for
+ [TalentCLEF 2026 Task B](https://talentclef.github.io/).

+ > **Best pipeline result (TalentCLEF 2026 validation set):**
+ > nDCG@10 graded = **0.6896** | nDCG@10 binary = **0.7330**
+ > when combined with a fine-tuned cross-encoder at blend alpha=0.7.
+ > Bi-encoder alone: nDCG@10 graded = **0.3621** | MAP = **0.4545**

  ---
 
 

  | Property | Value |
  |---|---|
+ | Base model | [jjzha/esco-xlm-roberta-large](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
  | Architecture | XLM-RoBERTa-large + mean pooling |
  | Embedding dimension | 1024 |
  | Max sequence length | 64 tokens |
  | Training loss | Multiple Negatives Ranking (MNR) |
+ | Training pairs | 93,720 (ESCO job-skill pairs, essential + optional) |
  | Epochs | 3 |
+ | Best checkpoint | Step 3500 (by validation nDCG@10) |
+ | Hardware | NVIDIA RTX 3070 8GB, fp16 AMP |

  ---
 
 
  **TalentCLEF 2026 Task B** is a graded information-retrieval shared task:

  - **Query**: a job title (e.g., *"Electrician"*)
+ - **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*)
+ - **Relevance levels**: `2` = Core, `1` = Contextual, `0` = Non-relevant
+ - **Primary metric**: nDCG with graded relevance (core=2, contextual=1)
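The graded metric can be sketched in a few lines of plain Python (a hedged illustration of the formula only, not the official TalentCLEF scorer; the ranking and qrels below are made-up toy data):

```python
import math

def dcg_at_k(gains, k=10):
    # Standard DCG: gain discounted by log2(rank + 1), ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def graded_ndcg_at_10(ranked_skill_ids, qrels):
    # qrels maps skill id -> graded relevance (2 = core, 1 = contextual, 0 = not relevant).
    gains = [qrels.get(sid, 0) for sid in ranked_skill_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal)
    return dcg_at_k(gains) / idcg if idcg > 0 else 0.0

# Toy example: the system ranks a contextual skill above a core one,
# so the score drops below 1.0 (to roughly 0.86).
qrels = {"s1": 2, "s2": 1}
print(round(graded_ndcg_at_10(["s2", "s1", "s3"], qrels), 4))
```

Swapping the two relevant skills back into the ideal order would give exactly 1.0, which is what makes nDCG a normalised, query-comparable score.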
 
 
 
 

  ---
 
 
  ### Installation

  ```bash
+ pip install sentence-transformers faiss-cpu
  ```

+ ### Encode and Compare

  ```python
  from sentence_transformers import SentenceTransformer
 

  model = SentenceTransformer("talentguide/skillscout-large")

+ # Build index once over your skill corpus
+ skill_texts = [...]  # list of skill names

  embs = model.encode(skill_texts, batch_size=128,
                      normalize_embeddings=True,

  index = faiss.IndexFlatIP(embs.shape[1])  # inner product on L2-normed = cosine
  index.add(embs)

  job_title = "Software Engineer"
  q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)
  scores, idxs = index.search(q, k=50)
+
  for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
      print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
  ```
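Note that `IndexFlatIP` returns raw inner products; they equal cosine similarities here only because every embedding is L2-normalised first. A small NumPy sketch of that equivalence (toy 2-D vectors, not real embeddings):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine similarity computed from the raw vectors.
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product after L2 normalisation -- what IndexFlatIP effectively scores.
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
ip = an @ bn

assert np.isclose(cos, ip)   # identical up to float error
print(round(float(ip), 4))   # 0.96 for these toy vectors
```

This is why `normalize_embeddings=True` must be used consistently on both the corpus and the query side: an unnormalised query against a normalised index silently changes the ranking metric.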
 

  ## Two-Stage Pipeline Integration

  ```
  Job title
+     |
+     v
+ [SkillScout Large]  <- this model
+     | top-200 candidates via FAISS ANN
+     v
  [Cross-encoder re-ranker]
+     | fine-grained re-scoring
+     v
+ Final ranked list (graded: core > contextual > irrelevant)
  ```

+ Blend formula (alpha=0.7 gives best validation results):

  ```python
  final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
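# A minimal worked example of the blend (hypothetical score values; in
# practice both scores should be normalised to a common [0, 1] range
# before mixing, otherwise alpha loses its meaning):
alpha = 0.7
biencoder_score = 0.82      # cosine similarity from SkillScout Large
crossencoder_score = 0.64   # normalised cross-encoder score
final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
# final_score == 0.766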
 

  | | Count |
  |---|---|
+ | Job-skill pairs (essential, raw) | ~57,500 |
+ | Job-skill pairs (optional, raw) | ~57,200 |
+ | Total InputExamples (after downsampling and canonical-pair inclusion) | **93,720** |
  | Validation queries | 304 |
+ | Validation corpus | 9,052 skills |
+ | Validation qrels | 56,417 |

+ Each ESCO job has 5-15 title aliases; skills have multiple phrasings.
+ Optional pairs are downsampled to 50% of the essential count to maintain class balance.
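The downsampling step can be sketched as follows (a hedged illustration with toy data; `essential_pairs` and `optional_pairs` are hypothetical names, and the real pipeline additionally injects canonical job-skill pairs):

```python
import random

random.seed(42)  # reproducibility, matching the training seed

# Toy stand-ins for the real ESCO pair lists.
essential_pairs = [(f"job{i}", f"skill{i}") for i in range(1000)]
optional_pairs = [(f"job{i}", f"opt_skill{i}") for i in range(900)]

# Keep every essential pair; sample optional pairs down to 50% of the
# essential count so optional skills cannot dominate the batches.
k = min(len(optional_pairs), len(essential_pairs) // 2)
train_pairs = essential_pairs + random.sample(optional_pairs, k)

print(len(train_pairs))  # 1500 for this toy split
```

With 1,000 essential and 900 optional toy pairs, the cap keeps 500 optional pairs, giving a fixed 2:1 essential-to-optional ratio regardless of the raw counts.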

  ### Hyperparameters

  ```
+ Loss          : MultipleNegativesRankingLoss (scale=20, cos_sim)
+ Batch size    : 64 (63 in-batch negatives per anchor)
+ Epochs        : 3
+ Warmup        : 10% of steps (~440 steps)
+ Optimizer     : AdamW fused
+ Learning rate : 5e-5, linear decay
+ Precision     : fp16 AMP
+ Max seq len   : 64 tokens
+ Best model    : saved by cosine-nDCG@10 on validation
  ```
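MultipleNegativesRankingLoss treats, for each anchor (job title), its paired skill as the positive and the other 63 in-batch skills as negatives: cross-entropy over scaled cosine similarities. A NumPy sketch of the objective (an illustration under those assumptions, not the library code; in training this is `sentence_transformers.losses.MultipleNegativesRankingLoss`):

```python
import numpy as np

def mnr_loss(job_embs, skill_embs, scale=20.0):
    # L2-normalise so the dot product is cosine similarity.
    j = job_embs / np.linalg.norm(job_embs, axis=1, keepdims=True)
    s = skill_embs / np.linalg.norm(skill_embs, axis=1, keepdims=True)
    logits = scale * (j @ s.T)  # (batch, batch) similarity matrix
    # Row i's positive is column i; every other column is an in-batch negative.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())

rng = np.random.default_rng(0)
jobs = rng.normal(size=(8, 16))                    # toy anchor embeddings
paired_skills = jobs + 0.1 * rng.normal(size=(8, 16))  # positives near anchors
random_skills = rng.normal(size=(8, 16))           # unrelated "skills"

print(mnr_loss(jobs, paired_skills) < mnr_loss(jobs, random_skills))  # True
```

The `scale=20` factor sharpens the softmax so small cosine gaps still produce a strong gradient, which is why larger batches (more in-batch negatives) tend to help this loss.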

  ### Training Curve

+ | Epoch | Step | Train Loss | nDCG@10 val | MAP@100 val |
+ |---|---|---|---|---|
+ | 0.34 | 500 | 2.9232 | 0.3430 | - |
+ | 0.68 | 1000 | 2.1179 | 0.3424 | - |
+ | 1.00 | 1465 | - | 0.3676 | 0.1758 |
+ | 1.37 | 2000 | 1.7070 | 0.3692 | - |
+ | 1.71 | 2500 | 1.6366 | 0.3744 | - |
+ | 2.00 | 2930 | - | 0.3717 | 0.1780 |
+ | **2.39** | **3500** | **1.4540** | **0.3769** | **0.1808** |

+ ### Validation Metrics (best checkpoint, step 3500)

  | Metric | Value |
  |---|---|
 
  | Accuracy@1 | 0.5099 |
  | Accuracy@3 | 0.7993 |
  | Accuracy@5 | 0.8914 |
+ | Accuracy@10 | 0.9474 |

+ Evaluated with `InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).
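These binary metrics are straightforward to compute from a ranked list; a minimal sketch with toy data, mirroring what `InformationRetrievalEvaluator` reports:

```python
def mrr_at_10(ranked, relevant):
    # Reciprocal rank of the first relevant hit within the top 10, else 0.
    for i, doc in enumerate(ranked[:10], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def accuracy_at_k(ranked, relevant, k):
    # 1 if any relevant document appears in the top k, else 0.
    return float(any(doc in relevant for doc in ranked[:k]))

ranked = ["s9", "s2", "s7"]   # toy system ranking for one query
relevant = {"s2", "s7"}       # binary qrels: any grade > 0 counts

print(mrr_at_10(ranked, relevant))         # 0.5 (first hit at rank 2)
print(accuracy_at_k(ranked, relevant, 1))  # 0.0 (top-1 miss)
```

The reported numbers are these per-query values averaged over all 304 validation queries; Accuracy@1 = 0.5099 therefore means the top-ranked skill is relevant for roughly half of the queries.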

+ ### Pipeline Results (graded relevance, full 9052-skill ranking)

  | Run | nDCG@10 graded | nDCG@10 binary | MAP |
  |---|---|---|---|
  | Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
  | **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
+ | SkillScout Large + cross-encoder (alpha=0.7) | **0.6896** | **0.7330** | 0.2481 |

  ---
 
 
  |---|---|---|
  | pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
  | NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
+ | **SkillScout Large (2026 val, Stage 1 only)** | **0.4545** | MNR fine-tuned bi-encoder |

  ---

  ## Limitations

+ - **English only** - trained on ESCO EN labels.
+ - **ESCO-domain optimised** - transfer to O*NET or custom taxonomies may require fine-tuning.
+ - **Max 64 tokens** - reduce long descriptions to a concise job title.
+ - **Graded distinction** - the bi-encoder alone does not reliably separate core vs contextual skills; a cross-encoder re-ranker is recommended for graded nDCG.

  ---
 
 

  ```bibtex
  @misc{talentguide-skillscout-2026,
+   title  = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
+   author = {TalentGuide},
+   year   = {2026},
+   url    = {https://huggingface.co/talentguide/skillscout-large}
  }

  @misc{talentclef2026taskb,
+   title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
+   author = {TalentCLEF Organizers},
+   year   = {2026},
+   url    = {https://talentclef.github.io/}
  }
  ```
 
 

  ## Framework Versions

+ - Python 3.12.10 | Sentence Transformers 5.3.0 | Transformers 5.5.0
+ - PyTorch 2.11.0+cu128 | Accelerate 1.13.0 | Tokenizers 0.22.2