---
language:
- en
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense-retrieval
- information-retrieval
- job-skill-matching
- esco
- talentclef
- xlm-roberta
base_model: jjzha/esco-xlm-roberta-large
pipeline_tag: sentence-similarity
model-index:
- name: skillscout-large
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: TalentCLEF 2026 Task B — Validation (304 queries, 9052 skills)
type: talentclef-2026-taskb-validation
metrics:
- type: cosine_ndcg_at_10
value: 0.4830
name: nDCG@10
- type: cosine_map_at_100
value: 0.1825
name: MAP@100
- type: cosine_mrr_at_10
value: 0.6657
name: MRR@10
- type: cosine_accuracy_at_1
value: 0.5099
name: Accuracy@1
- type: cosine_accuracy_at_10
value: 0.9474
name: Accuracy@10
---
# SkillScout Large — Job-to-Skill Dense Retriever
**SkillScout Large** is a dense bi-encoder for retrieving relevant skills from a job title.
Given a job title (e.g., *"Data Scientist"*), the model produces a 1024-dimensional embedding and retrieves the most semantically relevant skills from the [ESCO](https://esco.ec.europa.eu/) skill gazetteer (9,052 skills) by cosine similarity.
This is **Stage 1** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).
> **Best pipeline result (TalentCLEF 2026 validation set):**
> nDCG@10 graded = **0.6896** · nDCG@10 binary = **0.7330**
> when combined with a fine-tuned cross-encoder re-ranker at blend α = 0.7.
> Bi-encoder alone: nDCG@10 graded = **0.3621** · MAP = **0.4545**
---
## Model Summary
| Property | Value |
|---|---|
| Base model | [`jjzha/esco-xlm-roberta-large`](https://huggingface.co/jjzha/esco-xlm-roberta-large) |
| Architecture | XLM-RoBERTa-large + mean pooling |
| Embedding dimension | 1024 |
| Max sequence length | 64 tokens |
| Training loss | Multiple Negatives Ranking (MNR) |
| Training pairs | 93,720 (ESCO job–skill pairs, essential + optional) |
| Epochs | 3 |
| Best checkpoint | Step 3500 (saved by validation nDCG@10) |
| Hardware | NVIDIA RTX 3070 8GB · fp16 AMP |
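
The mean-pooling step listed above can be sketched in plain NumPy; the token embeddings and attention mask below are toy stand-ins, not real model outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Toy check: two real tokens, one padding token that must be ignored
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(emb, mask))  # [[2. 3.]]
```

`SentenceTransformer.encode` applies the equivalent pooling internally, so this is only needed when working with raw `transformers` outputs.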
---
## What is TalentCLEF Task B?
**TalentCLEF 2026 Task B** is a graded information-retrieval shared task:
- **Query**: a job title (e.g., *"Electrician"*)
- **Corpus**: 9,052 ESCO skills (e.g., *"install electric switches"*, *"comply with electrical safety regulations"*)
- **Relevance levels**:
- `2` — Core skill (essential regardless of context)
- `1` — Contextual skill (depends on employer / industry)
- `0` — Non-relevant
**Primary metric**: nDCG with graded relevance (core=2, contextual=1)
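
For reference, graded nDCG can be computed as below; this is a standard textbook implementation, not the official TalentCLEF scorer:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(all_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Ranking a contextual skill (1) above a core skill (2) is penalised:
print(ndcg_at_k([1, 2, 0], [2, 1, 0]))  # < 1.0; the perfect order [2, 1, 0] scores 1.0
```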
---
## Usage
### Installation
```bash
pip install sentence-transformers faiss-cpu # or faiss-gpu
```
### Encode & Compare
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("talentguide/skillscout-large")
job = "Data Scientist"
skills = ["data science", "machine learning", "install electric switches"]
embs = model.encode([job] + skills, normalize_embeddings=True)
scores = embs[0] @ embs[1:].T
for skill, score in zip(skills, scores):
    print(f"{score:.3f} {skill}")
# 0.872 data science
# 0.731 machine learning
# 0.112 install electric switches
```
### Full Retrieval with FAISS (Recommended)
```python
from sentence_transformers import SentenceTransformer
import faiss, numpy as np
model = SentenceTransformer("talentguide/skillscout-large")
# --- Build index once over your skill corpus ---
skill_texts = [...] # list of skill names / descriptions
embs = model.encode(skill_texts, batch_size=128,
                    normalize_embeddings=True,
                    show_progress_bar=True).astype(np.float32)
index = faiss.IndexFlatIP(embs.shape[1]) # inner product on L2-normed = cosine
index.add(embs)
# --- Query at inference time ---
job_title = "Software Engineer"
q = model.encode([job_title], normalize_embeddings=True).astype(np.float32)
scores, idxs = index.search(q, k=50)
for rank, (idx, score) in enumerate(zip(idxs[0], scores[0]), 1):
    print(f"{rank:3d}. [{score:.4f}] {skill_texts[idx]}")
```
### Demo Output
```
Software Engineer
1. [0.942] define software architecture
2. [0.938] software frameworks
3. [0.935] create software design
Data Scientist
1. [0.951] data science
2. [0.921] establish data processes
3. [0.919] create data models
Electrician
1. [0.944] install electric switches
2. [0.938] install electricity sockets
3. [0.930] use electrical wire tools
```
---
## Two-Stage Pipeline Integration
SkillScout Large is designed as **Stage 1** — fast first-stage vector retrieval over the skill corpus.
For maximum ranking quality, pair it with a cross-encoder re-ranker:
```
Job title
│
▼
[SkillScout Large] ← this model
      │  top-200 candidates (FAISS, ~40 ms)
▼
[Cross-encoder re-ranker]
│ fine-grained re-scoring of top-200
▼
Final ranked list (graded: core > contextual > irrelevant)
```
**Score blending** (best result at α = 0.7):
```python
final_score = alpha * biencoder_score + (1 - alpha) * crossencoder_score
```
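
One practical wrinkle: cosine scores live in [-1, 1] while cross-encoder logits are unbounded, so blending raw values can let one stage dominate. A common remedy — an assumption here, not necessarily what this pipeline does — is to min-max normalise both score lists over the candidate set first:

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1]; constant inputs map to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def blend(bi_scores, ce_scores, alpha=0.7):
    """Blend normalised bi-encoder and cross-encoder scores for the same candidates."""
    bi = minmax(np.asarray(bi_scores, dtype=float))
    ce = minmax(np.asarray(ce_scores, dtype=float))
    return alpha * bi + (1 - alpha) * ce

bi = [0.94, 0.91, 0.40]   # cosine similarities from Stage 1
ce = [2.1, 3.5, -4.0]     # hypothetical cross-encoder logits
print(blend(bi, ce))      # highest blended score wins the final ranking
```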
---
## Training Details
### Data
Source: [ESCO occupational ontology](https://esco.ec.europa.eu/), TalentCLEF 2026 training split.
| | Count |
|---|---|
| Raw job–skill pairs (essential + optional) | 114,699 |
| ESCO jobs with aliases | 3,039 |
| ESCO skills with aliases | 13,939 |
| Training InputExamples (after canonical-pair inclusion) | **93,720** |
| Validation queries | 304 |
| Validation corpus (skills) | 9,052 |
| Validation relevance judgments | 56,417 |
Essential pairs are included in full; optional skill pairs are downsampled to 50% of the essential count to maintain class balance.
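
The sampling scheme above can be sketched as follows; the pair tuples and field contents are illustrative, not the actual data-prep code:

```python
import random

def build_training_pairs(essential_pairs, optional_pairs, ratio=0.5, seed=42):
    """Keep every essential pair; sample optional pairs down to `ratio` times
    the essential count so optional skills do not dominate training."""
    rng = random.Random(seed)
    n_optional = min(len(optional_pairs), int(len(essential_pairs) * ratio))
    sampled = rng.sample(optional_pairs, n_optional)
    return essential_pairs + sampled

essential = [("electrician", f"essential skill {i}") for i in range(100)]
optional = [("electrician", f"optional skill {i}") for i in range(300)]
pairs = build_training_pairs(essential, optional)
print(len(pairs))  # 150: 100 essential + 50 sampled optional
```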
### Hyperparameters
```
Loss : MultipleNegativesRankingLoss (scale=20, cos_sim)
Batch size : 64 → 63 in-batch negatives per anchor
Epochs : 3
Warmup : 10% of total steps (~440 steps)
Optimizer : AdamW (fused), lr=5e-5, linear decay
Precision : fp16 (AMP)
Max seq length : 64 tokens
Best model saved : by cosine-nDCG@10 on validation (eval every 500 steps)
Seed : 42
```
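
With this configuration, MNR loss amounts to softmax cross-entropy over a scaled cosine-similarity matrix whose diagonal holds each anchor's matching positive; a NumPy sketch with random stand-in embeddings (the real loss lives in `sentence_transformers.losses.MultipleNegativesRankingLoss`):

```python
import numpy as np

def mnr_loss(anchor_embs: np.ndarray, positive_embs: np.ndarray, scale: float = 20.0) -> float:
    """Multiple Negatives Ranking loss: each anchor's positive is the matching
    row; the other (batch_size - 1) positives act as in-batch negatives."""
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                       # (batch, batch) scaled cosine sims
    sims -= sims.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())     # cross-entropy on the diagonal

rng = np.random.default_rng(42)
jobs = rng.normal(size=(64, 1024))    # 64 job-title embeddings (random stand-ins)
skills = rng.normal(size=(64, 1024))  # the 64 paired skill embeddings
print(mnr_loss(jobs, skills))         # high for unrelated pairs; training drives it down
```

This is why batch size matters for MNR: a batch of 64 yields 63 in-batch negatives per anchor, as noted above.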
### Training Curve
| Epoch | Step | Train Loss | nDCG@10 (val) | MAP@100 (val) |
|:---:|:---:|:---:|:---:|:---:|
| 0.34 | 500 | 2.9232 | 0.3430 | — |
| 0.68 | 1000 | 2.1179 | 0.3424 | — |
| 1.00 | 1465 | — | 0.3676 | 0.1758 |
| 1.37 | 2000 | 1.7070 | 0.3692 | — |
| 1.71 | 2500 | 1.6366 | 0.3744 | — |
| 2.00 | 2930 | — | 0.3717 | 0.1780 |
| 2.39 | **3500** ✓ | **1.4540** | **0.3769** | **0.1808** |
*Best checkpoint saved at step 3500.*
### Validation Metrics (best checkpoint, binary relevance)
| Metric | Value |
|---|---|
| **nDCG@10** | **0.4830** |
| nDCG@50 | 0.4240 |
| nDCG@100 | 0.3769 |
| **MAP@100** | **0.1825** |
| **MRR@10** | **0.6657** |
| Accuracy@1 | 0.5099 |
| Accuracy@3 | 0.7993 |
| Accuracy@5 | 0.8914 |
| Accuracy@10 | **0.9474** |
*Evaluated with `sentence_transformers.evaluation.InformationRetrievalEvaluator` (binary: any qrel > 0 = relevant).*
### Pipeline Results (graded nDCG, full 9052-skill ranking, server-side)
| Run | nDCG@10 graded | nDCG@10 binary | MAP |
|---|---|---|---|
| Zero-shot `jjzha/esco-xlm-roberta-large` | 0.2039 | 0.2853 | 0.2663 |
| **SkillScout Large (bi-encoder only)** | **0.3621** | **0.4830** | **0.4545** |
| SkillScout Large + cross-encoder (α=0.7) | **0.6896** | **0.7330** | 0.2481 |
---
## Competitive Context (TalentCLEF 2025 Task B)
| Team | MAP (test) | Approach |
|---|---|---|
| pjmathematician (winner 2025) | 0.36 | GTE 7B + contrastive + LLM-augmented data |
| NLPnorth (3rd of 14, 2025) | 0.29 | 3-class discriminative classification |
| **SkillScout Large (2026 val)** | **0.4545** | MNR fine-tuned bi-encoder (Stage 1 only) |

*Note: the 2025 figures are test-set scores while the 2026 figure is validation-set, so this comparison is indicative rather than direct.*
---
## Limitations
- **English only** — trained on ESCO EN labels.
- **ESCO-domain** — optimised for the ESCO skill taxonomy; performance on other taxonomies (O*NET, custom) may vary without fine-tuning.
- **64-token cap** — long job descriptions should be reduced to a concise title before encoding.
- **Graded distinction** — the bi-encoder alone does not reliably separate core (2) from contextual (1) skills; a cross-encoder re-ranker is needed for strong graded nDCG.
---
## Citation
```bibtex
@misc{talentguide-skillscout-2026,
title = {SkillScout Large: Dense Job-to-Skill Retrieval for TalentCLEF 2026},
author = {TalentGuide},
year = {2026},
url = {https://huggingface.co/talentguide/skillscout-large}
}
@misc{talentclef2026taskb,
title = {TalentCLEF 2026 Task B: Job-Skill Matching},
author = {TalentCLEF Organizers},
year = {2026},
url = {https://talentclef.github.io/}
}
```
---
## Framework Versions
| Package | Version |
|---|---|
| Python | 3.12.10 |
| sentence-transformers | 5.3.0 |
| transformers | 5.5.0 |
| PyTorch | 2.11.0+cu128 |
| Accelerate | 1.13.0 |
| Tokenizers | 0.22.2 |
---
## License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)