SSF-MiniLM Finetuned v2 — Skill Extraction Embedding Model
A sentence-transformers model fine-tuned from all-MiniLM-L6-v2 for matching job description sentences to standardized skills from Singapore's SkillsFuture Framework (SSF).
The model maps sentences and skill names into a 384-dimensional dense vector space where job description text lands close to its corresponding skill, enabling accurate semantic skill extraction, tagging, and retrieval.
Highlights
- AUC 0.995 on held-out validation (up from 0.978 baseline)
- 97.1% best accuracy on skill-sentence matching (up from 92.8% baseline)
- Covers 2,196 unique skills across all SSF sectors
- Fast inference: 22M params, runs efficiently on CPU and GPU
- Drop-in replacement for `all-MiniLM-L6-v2` — same API, better skill matching
Model Details
| Property | Value |
|---|---|
| Model Type | Sentence Transformer (Bi-Encoder) |
| Base Model | sentence-transformers/all-MiniLM-L6-v2 |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | ~22M |
| Max Sequence Length | 256 tokens |
| Output Dimensionality | 384 |
| Similarity Function | Cosine Similarity |
| Pooling | Mean Pooling + L2 Normalization |
| Language | English |
| License | Apache 2.0 |
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)
```
Intended Use
Primary Use Cases
- Skill Extraction from Job Descriptions — identify which standardized skills a JD sentence refers to
- Skill Tagging / Auto-labeling — tag resumes, courses, or learning content with SSF skills
- Semantic Skill Search — find relevant skills for a given text query
- Skill Gap Analysis — compare job requirements against employee skill profiles
- HR Tech / Workforce Analytics — power matching engines, recommendation systems, and talent platforms
Suitable Applications
- Resume parsing and skill extraction pipelines
- Job-to-candidate matching engines
- Learning & development recommendation systems
- Skills taxonomy mapping and alignment
- Workforce planning and analytics dashboards
Out-of-Scope Uses
- General-purpose sentence similarity (use the base model instead)
- Non-English text
- Tasks requiring generative output (this is an embedding model)
- Medical, legal, or safety-critical classification without human review
Training Details
Dataset
| Property | Value |
|---|---|
| Name | SSF Skill Extraction Pairs |
| Domain | Workforce Skills / HR / Job Descriptions |
| Source Skills | 2,196 unique skills from Singapore SkillsFuture Framework |
| Synthetic Sentences | 5 JD-style sentences per skill, generated via Qwen3-1.7B (Ollama) |
| Total Training Pairs | 21,958 (one positive and one randomly-sampled negative per sentence) |
| Format | (sentence, skill_name, label) — label 1.0 for correct skill, 0.0 for random incorrect skill |
| Validation Split | 10% held-out (2,195 pairs) |
Sample training pairs:
| Sentence | Skill | Label |
|---|---|---|
| Analyzes tax liabilities, identifies applicable rates, and applies corrections to ensure proper calculation and reporting. | Tax Computation | 1.0 |
| Monitor plant health by assessing symptoms and identifying disease risks. | Plant Health Management and Disease Control | 1.0 |
| Analyzes cross-cultural communication challenges in medical and legal contexts, optimizing translation strategies for diverse stakeholders. | Audience Segmentation | 0.0 |
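Each pair like the samples above maps naturally to one JSON line in the dataset's `pairs.jsonl`. The field names below are illustrative assumptions, not the confirmed schema (check `meta.json` in the dataset repository):

```python
import json

# One hypothetical line of pairs.jsonl in the (sentence, skill, label) format
# described above; field names are illustrative assumptions.
line = '{"sentence": "Monitor plant health by assessing symptoms.", "skill": "Plant Health Management and Disease Control", "label": 1.0}'
pair = json.loads(line)
print(pair["skill"], pair["label"])
```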
Training Objective
Loss Function: `CosineSimilarityLoss` (mean squared error between the predicted cosine similarity and the 1.0/0.0 pair label)
The model learns to maximize cosine similarity between a JD sentence and its correct skill, while minimizing similarity to randomly-sampled incorrect skills. This contrastive setup produces well-separated embeddings.
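Conceptually, `CosineSimilarityLoss` reduces to an MSE on cosine scores. A minimal, self-contained sketch with toy embeddings (random values; only the 384-dim shape matches the model's output):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for sentence and skill embeddings (384-dim, like the model's output)
emb_sentence = F.normalize(torch.randn(2, 384), dim=1)
emb_skill = F.normalize(torch.randn(2, 384), dim=1)
labels = torch.tensor([1.0, 0.0])  # correct pair, random incorrect pair

cos_sim = (emb_sentence * emb_skill).sum(dim=1)  # per-pair cosine similarity
loss = F.mse_loss(cos_sim, labels)               # the CosineSimilarityLoss objective
print(loss.item())
```

Minimizing this loss pushes cosine similarity toward 1.0 for correct pairs and 0.0 for incorrect ones, which is exactly the separation visible in the evaluation table below.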
Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Batch Size | 64 |
| Learning Rate | 5e-05 |
| Optimizer | AdamW (fused) |
| Warmup Steps | 10% of total steps |
| Scheduler | Linear decay |
| Seed | 42 |
| Precision | FP32 |
| Deterministic | Yes (CUBLAS_WORKSPACE_CONFIG=:4096:8) |
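The deterministic setup listed in the table can be reproduced with a few lines before training starts. This is a sketch of standard PyTorch reproducibility settings, not the published training script:

```python
import os
import random

import numpy as np
import torch

# Must be set before the first CUDA matmul for deterministic cuBLAS behavior
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Raises an error if a nondeterministic op is used during training
torch.use_deterministic_algorithms(True)
```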
Training Logs
| Epoch | Step | Training Loss |
|---|---|---|
| 1.45 | 500 | 0.0822 |
| 2.91 | 1,000 | 0.0567 |
| 4.36 | 1,500 | 0.0493 |
Evaluation
Benchmark: Held-Out Skill Matching (10% split, 2,195 pairs)
Embeddings encoded with normalize_embeddings=True. Cosine similarity computed as dot product of normalized vectors.
| Model | AUC | Acc @ 0.5 | Best Accuracy | Pos Mean Sim | Neg Mean Sim |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 (baseline) | 0.978 | 0.810 | 0.928 | 0.530 | 0.133 |
| SSF-MiniLM v1 (1 epoch) | 0.989 | 0.949 | 0.952 | 0.799 | 0.131 |
| SSF-MiniLM v2 (5 epochs) | 0.995 | 0.968 | 0.971 | 0.845 | 0.088 |
Key Observations
- AUC improved from 0.978 to 0.995 — the model almost perfectly ranks correct skills above incorrect ones
- Positive similarity increased from 0.530 to 0.845 — correct pairs are now strongly matched
- Negative similarity dropped from 0.133 to 0.088 — incorrect pairs are pushed further apart
- Best accuracy improved from 92.8% to 97.1% — a 4.3-percentage-point gain over the baseline
- Accuracy @ 0.5 jumped from 81.0% to 96.8% — the default threshold works well out of the box
Metrics Explained
- AUC: Measures ranking quality — how often the model scores positive pairs above negative pairs (1.0 = perfect ranking)
- Accuracy @ 0.5: Classification accuracy using cosine similarity threshold of 0.5
- Best Accuracy: Best accuracy found by scanning thresholds from 1st–99th percentile of scores
- Pos/Neg Mean Similarity: Average cosine similarity for correct vs incorrect skill pairs
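All four metrics can be reproduced from raw similarity scores. A toy illustration with scikit-learn (the scores and labels here are made up, not the held-out set):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy cosine-similarity scores and ground-truth labels (illustrative only)
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.3])
labels = np.array([1, 1, 1, 0, 0, 0])

# AUC: probability a positive pair outranks a negative pair
auc = roc_auc_score(labels, scores)

# Accuracy at the default 0.5 threshold
acc_05 = ((scores >= 0.5) == labels.astype(bool)).mean()

# Best accuracy: scan thresholds over the 1st-99th percentile of scores
thresholds = np.percentile(scores, np.arange(1, 100))
best_acc = max(((scores >= t) == labels.astype(bool)).mean() for t in thresholds)

# Pos/Neg mean similarity
pos_mean = scores[labels == 1].mean()
neg_mean = scores[labels == 0].mean()
```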
Performance Summary
Strengths
- Excellent skill discrimination (AUC 0.995) across 2,196 diverse skills
- Strong positive/negative separation (0.845 vs 0.088 mean similarity)
- Works well with the default 0.5 threshold — no tuning needed for most applications
- Small model footprint (~87MB) enables fast CPU inference
- Covers a comprehensive range of workforce skills: IT, healthcare, engineering, finance, creative, trades, and more
Weaknesses
- Optimized for SkillsFuture Framework skills — may underperform on skills not in the SSF taxonomy
- Trained on synthetic JD sentences — real-world JDs with unusual formatting or jargon may need additional fine-tuning
- Short text bias — best with single sentences or phrases; long paragraphs should be split into sentences first
- English only
Limitations
- Domain specificity: The model is fine-tuned on Singapore's SkillsFuture Framework. Skills from other taxonomies (O*NET, ESCO, ISCO) may not match as precisely without further adaptation.
- Synthetic training data: JD-style sentences were generated by an LLM (Qwen3-1.7B), which may not capture all real-world phrasing variations.
- No cross-lingual support: English only. Multilingual JDs will need translation first.
- Short text focus: Designed for sentence-level matching. For multi-paragraph JDs, split into sentences before encoding.
- Skill taxonomy coverage: Limited to the 2,196 skills in the SSF dataset. New or niche skills outside this taxonomy will fall back to base model behavior.
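Since the model is tuned for sentence-level matching, longer JDs should be segmented before encoding. A naive regex splitter is enough to start (a sketch; spaCy or NLTK segmenters are more robust for production text):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive split on sentence-ending punctuation followed by whitespace;
    # swap in spaCy or NLTK for abbreviations, bullets, and messy formatting.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

jd = "Build data pipelines. Deploy models to AWS! Manage stakeholder comms?"
print(split_sentences(jd))
```

Each resulting sentence can then be encoded and matched against the skill embeddings independently.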
Ethical Considerations
- Bias: The SSF taxonomy reflects Singapore's workforce structure. Skills from underrepresented or emerging fields may have fewer training examples.
- Fairness: The model matches text to skills — it does not evaluate candidates. Applications should ensure skill matching does not introduce hiring bias.
- Responsible use: This model is a tool for structuring skill data, not for making automated hiring decisions. Always include human review in high-stakes HR workflows.
- Data provenance: Training data is synthetically generated. No personal or proprietary job description data was used in training.
Usage
Quick Start (Sentence Transformers)
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model
model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Encode job description sentences and skills
sentences = [
    "Design and implement scalable data pipelines for real-time analytics.",
    "Manage patient records and ensure compliance with healthcare regulations.",
]
skills = [
    "Data Engineering",
    "Healthcare Records Management",
    "Polymer Processing",
]
sentence_embeddings = model.encode(sentences, normalize_embeddings=True)
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Compute similarity (dot product of normalized vectors = cosine similarity)
similarities = np.dot(sentence_embeddings, skill_embeddings.T)
print(similarities)
# sentence 0 -> "Data Engineering" = high score
# sentence 1 -> "Healthcare Records Management" = high score
```
Skill Extraction Pipeline
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Your skill taxonomy (or load from the SSF dataset)
skills = ["Data Engineering", "Machine Learning", "Project Management", "Cloud Computing"]
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Extract skills from a JD sentence
jd_sentence = "Build and deploy ML models on AWS with CI/CD pipelines."
jd_embedding = model.encode([jd_sentence], normalize_embeddings=True)

scores = np.dot(jd_embedding, skill_embeddings.T)[0]
threshold = 0.5
for skill, score in sorted(zip(skills, scores), key=lambda x: -x[1]):
    if score >= threshold:
        print(f"  {skill}: {score:.3f}")
```
Using with Transformers (Direct)
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")
model = AutoModel.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over non-padding tokens
    attention_mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (outputs.last_hidden_state * attention_mask).sum(1) / attention_mask.sum(1)
    # L2 normalize so dot products equal cosine similarities
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

query = encode(["Build scalable APIs with microservice architecture"])
skills = encode(["API Development", "Microservice Architecture", "Gardening"])
similarities = torch.mm(query, skills.T)
print(similarities)
```
Deployment Notes
| Property | Detail |
|---|---|
| Model Size | ~87 MB (safetensors) |
| Inference Speed | ~5,000 sentences/sec on GPU, ~500/sec on CPU (batch 64) |
| Memory | ~350 MB RAM loaded |
| ONNX Compatible | Yes (via sentence-transformers export) |
| Quantization | Compatible with INT8/FP16 for faster inference |
| Recommended Hardware | Works on CPU; GPU recommended for batch processing |
| Serving | Compatible with Triton, TorchServe, FastAPI, or any ONNX runtime |
Training Data
The training dataset is available at imocha-ai-org/ssf-skill-extraction-pairs and contains:
- `pairs.jsonl` — 21,958 training pairs (sentence, skill, label)
- `generated_sentences.json` — 5 synthetic JD sentences per skill (2,196 skills)
- `meta.json` — dataset metadata
Framework Versions
- Python: 3.10.19
- Sentence Transformers: 5.2.2
- Transformers: 4.57.3
- PyTorch: 2.9.1+cu128
- Accelerate: 1.12.0
- Datasets: 4.3.0
- Tokenizers: 0.22.2
Citation
BibTeX
```bibtex
@misc{imocha2026ssf-miniLM,
  title     = {SSF-MiniLM Finetuned v2: Skill Extraction Embedding Model},
  author    = {imocha AI},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/imocha-ai-org/ssf-miniLM-finetuned-v2}
}
```
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author    = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month     = "11",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  url       = "https://arxiv.org/abs/1908.10084",
}
```
Contact / Maintainer
- Organization: imocha AI
- Maintainer: Sarvadnya
- Issues: Open an issue on the model repository