---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
license: apache-2.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- job-matching
- skill-similarity
- embeddings
- esco
---

# 🛠️ alvperez/skill-sim-model

**skill-sim-model** is a fine-tuned [Sentence-Transformers](https://www.sbert.net) checkpoint that maps short *skill phrases* (e.g. `Python`, `Forklift operation`, `Electrical wiring`) into a 768‑D vector space where semantically related skills cluster together. Training pairs come from the public **ESCO** taxonomy plus curated *hard negatives* for job‑matching research.

| Use‑case | How to leverage the embeddings |
|----------|--------------------------------|
| Candidate ↔ vacancy matching | `score = cosine(skill_vec, job_vec)` |
| Deduplicating skill taxonomies | cluster the vectors |
| Recruiter query‑expansion | nearest‑neighbour search |
| Exploratory dashboards | feed to t‑SNE / PCA |

---

## 🚀 Quick start

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

skills = ["Electrical wiring", "Circuit troubleshooting", "Machine learning"]
emb = model.encode(skills, convert_to_tensor=True)

# Similarities of the first skill against all three (1 × 3 tensor)
print(util.pytorch_cos_sim(emb[0], emb))
```

To score a single query against candidate skills:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("alvperez/skill-sim-model")

query_emb = model.encode("forklift operation", convert_to_tensor=True)
cand_emb = model.encode(["pallet jack", "python"], convert_to_tensor=True)

# "pallet jack" should score noticeably higher than "python"
print(util.cos_sim(query_emb, cand_emb))
```

---

## 📊 Benchmark

| Metric | Value |
|--------------------------------|-----------|
| Spearman correlation | **0.845** |
| ROC AUC | **0.988** |
| MAP@all (*cold‑start*) | **0.232** |

> *cold‑start = the system sees only skill strings, no historical interactions.*

---

## ⚙️ Training recipe (brief)

* Base: `sentence-transformers/all-mpnet-base-v2`
* Loss: `CosineSimilarityLoss`
* Epochs × batch: `5 × 32`
* LR / warm‑up: `2e-5` / `100` steps
* Negatives: random + “hard” pairs from ESCO siblings
* Hardware: 1 × A100 40 GB (≈ 45 min)

Full code in [`/training_scripts`](training_scripts).

---

## 🏹 Intended use

* **Employment tech** – rank CVs vs. vacancies
* **EdTech / reskilling** – detect skill gaps, suggest learning paths
* **HR analytics** – normalise noisy skill fields at scale

---

## ✋ Limitations & bias

* Vocabulary is dominated by ESCO (English); niche jargon may project poorly.
* No explicit fairness constraints were applied; downstream systems should audit for bias (e.g. *Disparate Impact*).
* In our tests, a cosine similarity above 0.65 marks a “definitely related” cut‑off; tune this threshold for your own precision‑recall needs.

---

## 🔍 Citation

```bibtex
@misc{alvperez2025skillsim,
  title        = {Skill-Sim: a Sentence-Transformers model for skill similarity and job matching},
  author       = {Pérez Amado, Álvaro},
  howpublished = {\url{https://huggingface.co/alvperez/skill-sim-model}},
  year         = {2025}
}
```

---

### Acknowledgements

Built on top of Sentence-Transformers and the public **ESCO** dataset. Feedback & PRs welcome!
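
---

### Appendix: scoring sketch

As a minimal illustration of the `score = cosine(skill_vec, job_vec)` recipe from the use-case table, here is a plain-NumPy sketch of the scoring and thresholding step. The 4‑D vectors below are made-up placeholders (the real model emits 768‑D embeddings); only the arithmetic is meant to be representative.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-D stand-ins for the model's 768-D skill/job embeddings.
skill_vec = np.array([0.9, 0.1, 0.0, 0.2])   # e.g. "forklift operation"
job_vec   = np.array([0.8, 0.2, 0.1, 0.1])   # e.g. a warehouse vacancy profile
other_vec = np.array([0.0, 0.1, 0.9, 0.0])   # e.g. an unrelated skill

score = cosine(skill_vec, job_vec)
print(score, score >= 0.65)        # high score, above the suggested cut-off
print(cosine(skill_vec, other_vec))  # near zero: unrelated
```

The `0.65` cut-off mirrors the threshold suggested under *Limitations & bias*; in a real pipeline you would tune it on a labelled validation set.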