---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- cross-encoder
- information-retrieval
- job-skill-matching
- esco
- talentclef
- reranking
- bert
base_model: cross-encoder/ms-marco-MiniLM-L-12-v2
pipeline_tag: text-classification
model-index:
- name: skillscout-reranker
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval (re-ranking)
    dataset:
      name: TalentCLEF 2026 Task B Validation
      type: talentclef-2026-taskb-validation
    metrics:
    - type: ndcg_at_10_graded
      value: 0.6896
      name: nDCG@10 Graded (pipeline, server)
    - type: ndcg_at_10_binary
      value: 0.7330
      name: nDCG@10 Binary (pipeline, server)
---

# SkillScout Reranker - Job-Skill Cross-Encoder

**SkillScout Reranker** is a cross-encoder that re-ranks candidate skills for a given job title, predicting **graded relevance** (0 = irrelevant, 1 = contextual, 2 = core). This is **Stage 2** of the TalentGuide two-stage job-skill matching pipeline, trained for [TalentCLEF 2026 Task B](https://talentclef.github.io/).

> **Best pipeline result (TalentCLEF 2026 validation set, server-side):**
> nDCG@10 graded = **0.6896** | nDCG@10 binary = **0.7330**
> SkillScout Large (bi-encoder) + SkillScout Reranker at blend alpha=0.7.
---

## Model Summary

| Property | Value |
|---|---|
| Base model | [cross-encoder/ms-marco-MiniLM-L-12-v2](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2) |
| Architecture | BERT (MiniLM-L12) + 3-class classification head |
| Hidden size | 384 |
| Max seq length | 128 tokens |
| Output classes | 0 = non-relevant, 1 = contextual, 2 = core |
| Training triples | ~130k (job_title, skill, label) |
| Hard negatives | 5 per job, mined from bi-encoder top-K |
| Epochs | 3 |
| Batch size | 32 |
| Hardware | NVIDIA RTX 3070 8GB, fp16 AMP |

---

## Usage

### Installation

```bash
pip install transformers torch
```

### Score a single (job, skill) pair

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("talentguide/skillscout-reranker")
model = AutoModelForSequenceClassification.from_pretrained("talentguide/skillscout-reranker")
model.eval()

job = "Data Scientist"
skill = "data science"

enc = tokenizer(job, skill, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**enc).logits            # shape [1, 3]

probs = logits.softmax(-1)[0].tolist()      # [P(irrelevant), P(contextual), P(core)]
relevance = logits.argmax(-1).item()        # 0, 1, or 2

print(f"Relevance class: {relevance} (0=none, 1=contextual, 2=core)")
print(f"Probs: none={probs[0]:.3f} contextual={probs[1]:.3f} core={probs[2]:.3f}")
# Relevance class: 2 (0=none, 1=contextual, 2=core)
# Probs: none=0.031 contextual=0.142 core=0.827
```

### Re-rank a candidate list

```python
# candidates: list of skill texts from the bi-encoder (e.g. top-200)
pairs = [(job, skill) for skill in candidates]

encs = tokenizer(pairs, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    logits = model(**encs).logits           # [N, 3]

# Use the class-2 (core) logit as the ranking score
scores = logits[:, 2].tolist()
ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])

for rank, (skill, score) in enumerate(ranked[:10], 1):
    print(f"{rank:3d}. [{score:.3f}] {skill}")
```

### Blend with bi-encoder (recommended, alpha=0.7)

```python
# bi_scores: cosine scores from SkillScout Large, normalised to [0, 1]
# ce_scores: class-2 logits from this model, normalised to [0, 1]
alpha = 0.7
final_scores = [alpha * b + (1 - alpha) * c for b, c in zip(bi_scores, ce_scores)]
```

---

## Two-Stage Pipeline Integration

```
Job title
    |
    v
[SkillScout Large]      <- talentguide/skillscout-large
    |  top-200 candidates via FAISS ANN
    v
[SkillScout Reranker]   <- this model
    |  3-class graded scoring (core=2, contextual=1, irrelevant=0)
    v
Final ranked list
```

---

## Training Details

### Data

| Data | Count |
|---|---|
| Positive triples (essential, label=2) | ~57,500 |
| Positive triples (optional, label=1) | ~28,600 |
| Hard negatives (label=0, from bi-encoder top-K) | ~15,200 |
| Random negatives (label=0) | ~30,000 |
| Total training triples | ~130,000 |
| Validation queries | 304 |
| Validation corpus | 9,052 skills |

**Hard negatives** are mined by running the fine-tuned bi-encoder (SkillScout Large) over all training jobs and collecting the top-K retrieved skills that are NOT in the positive set. This teaches the cross-encoder to distinguish near-miss retrievals from true positives.
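The mining step can be sketched as follows. This is a minimal illustration of the filtering logic only; `mine_hard_negatives`, `retrieved`, and the toy skill lists are hypothetical stand-ins for the actual bi-encoder/FAISS retrieval output, which is not part of this repository.

```python
def mine_hard_negatives(retrieved, positives, per_job=5):
    """For each job, keep the top-ranked retrieved skills that are NOT
    annotated positives - these become label-0 hard negative triples."""
    triples = []
    for job, ranked_skills in retrieved.items():
        pos = positives.get(job, set())
        hard = [s for s in ranked_skills if s not in pos][:per_job]
        triples.extend((job, skill, 0) for skill in hard)
    return triples

# Toy stand-in for the bi-encoder's ranked top-K output on one training job
retrieved = {"Data Scientist": ["data science", "machine learning",
                                "data entry", "typing", "filing", "welding"]}
positives = {"Data Scientist": {"data science", "machine learning"}}

print(mine_hard_negatives(retrieved, positives, per_job=3))
# [('Data Scientist', 'data entry', 0), ('Data Scientist', 'typing', 0),
#  ('Data Scientist', 'filing', 0)]
```

Because the negatives are drawn from the retriever's own top-K, they are exactly the near-misses the cross-encoder must learn to push down, which is what distinguishes them from random negatives.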
### Hyperparameters

```
Base model    : cross-encoder/ms-marco-MiniLM-L-12-v2
Task          : 3-class sequence classification (BERT + linear head)
Loss          : CrossEntropyLoss
Batch size    : 32
Epochs        : 3
Learning rate : 2e-5, linear warmup 10%
Optimizer     : AdamW
Precision     : fp16 AMP
Max seq len   : 128 tokens
Input format  : [CLS] job_title [SEP] skill_name [SEP]
```

### Pipeline Results (graded relevance, full 9,052-skill ranking)

| Run | nDCG@10 graded | nDCG@10 binary | MAP |
|---|---|---|---|
| Bi-encoder only (SkillScout Large) | 0.3621 | 0.4830 | 0.4545 |
| + CE bad negatives (v1) | 0.3226 | 0.4025 | 0.4195 |
| + CE fixed negatives (v2) | 0.3315 | 0.4075 | 0.4228 |
| + CE blend alpha=0.7 (local, top-100) | 0.3816 | 0.4973 | 0.4632 |
| **+ CE blend alpha=0.7 (server, full ranking)** | **0.6896** | **0.7330** | 0.2481 |

*Local metrics use a top-100 retrieval cutoff; server metrics use the full 9,052-skill ranking.*

---

## Limitations

- **Must be paired with a retriever** - scores (job, skill) pairs, not a full corpus ranking. Use with SkillScout Large for efficient retrieval.
- **English only** - trained on ESCO EN labels.
- **ESCO-domain optimised** - transfer to other taxonomies may require fine-tuning.
- **Speed** - re-ranks top-200 candidates in ~1-2 s per query on GPU. Not suitable for full-corpus scoring at inference time.

---

## Citation

```bibtex
@misc{talentguide-skillscout-reranker-2026,
  title  = {SkillScout Reranker: Graded Job-Skill Cross-Encoder for TalentCLEF 2026},
  author = {TalentGuide},
  year   = {2026},
  url    = {https://huggingface.co/talentguide/skillscout-reranker}
}

@misc{talentclef2026taskb,
  title  = {TalentCLEF 2026 Task B: Job-Skill Matching},
  author = {TalentCLEF Organizers},
  year   = {2026},
  url    = {https://talentclef.github.io/}
}
```

---

## Framework Versions

- Python 3.12.10 | Transformers 5.5.0 | PyTorch 2.11.0+cu128