SSF-MiniLM Finetuned v2 — Skill Extraction Embedding Model

A sentence-transformers model fine-tuned from all-MiniLM-L6-v2 for matching job description sentences to standardized skills from Singapore's SkillsFuture Framework (SSF).

The model maps sentences and skill names into a 384-dimensional dense vector space where job description text lands close to its corresponding skill, enabling accurate semantic skill extraction, tagging, and retrieval.

Highlights

  • AUC 0.995 on held-out validation (up from 0.978 baseline)
  • 97.1% best accuracy on skill-sentence matching (up from 92.8% baseline)
  • Covers 2,196 unique skills across all SSF sectors
  • Fast inference: 22M params, runs efficiently on CPU and GPU
  • Drop-in replacement for all-MiniLM-L6-v2 — same API, better skill matching

Model Details

Model Type: Sentence Transformer (Bi-Encoder)
Base Model: sentence-transformers/all-MiniLM-L6-v2
Architecture: BERT (6 layers, 12 heads, 384 hidden)
Parameters: ~22M
Max Sequence Length: 256 tokens
Output Dimensionality: 384
Similarity Function: Cosine Similarity
Pooling: Mean Pooling + L2 Normalization
Language: English
License: Apache 2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)

Intended Use

Primary Use Cases

  • Skill Extraction from Job Descriptions — identify which standardized skills a JD sentence refers to
  • Skill Tagging / Auto-labeling — tag resumes, courses, or learning content with SSF skills
  • Semantic Skill Search — find relevant skills for a given text query
  • Skill Gap Analysis — compare job requirements against employee skill profiles
  • HR Tech / Workforce Analytics — power matching engines, recommendation systems, and talent platforms

Suitable Applications

  • Resume parsing and skill extraction pipelines
  • Job-to-candidate matching engines
  • Learning & development recommendation systems
  • Skills taxonomy mapping and alignment
  • Workforce planning and analytics dashboards

Out-of-Scope Uses

  • General-purpose sentence similarity (use the base model instead)
  • Non-English text
  • Tasks requiring generative output (this is an embedding model)
  • Medical, legal, or safety-critical classification without human review

Training Details

Dataset

Name: SSF Skill Extraction Pairs
Domain: Workforce Skills / HR / Job Descriptions
Source Skills: 2,196 unique skills from Singapore's SkillsFuture Framework
Synthetic Sentences: 5 JD-style sentences per skill, generated via Qwen3-1.7B (Ollama)
Total Training Pairs: 21,958 (one positive and one randomly sampled negative per sentence)
Format: (sentence, skill_name, label) — label 1.0 for the correct skill, 0.0 for a random incorrect skill
Validation Split: 10% held-out (2,195 pairs)

Sample training pairs:

Sentence | Skill | Label
Analyzes tax liabilities, identifies applicable rates, and applies corrections to ensure proper calculation and reporting. | Tax Computation | 1.0
Monitor plant health by assessing symptoms and identifying disease risks. | Plant Health Management and Disease Control | 1.0
Analyzes cross-cultural communication challenges in medical and legal contexts, optimizing translation strategies for diverse stakeholders. | Audience Segmentation | 0.0
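The pairing scheme above — one positive plus one randomly sampled negative per generated sentence — can be sketched as follows. The `build_pairs` helper and its input dict are illustrative, not part of the released dataset tooling:

```python
import random

def build_pairs(skill_sentences, seed=42):
    """Build (sentence, skill, label) pairs: one positive and one
    randomly sampled negative skill per generated sentence."""
    rng = random.Random(seed)
    skills = list(skill_sentences)
    pairs = []
    for skill, sentences in skill_sentences.items():
        for sentence in sentences:
            pairs.append((sentence, skill, 1.0))  # correct skill
            negative = rng.choice([s for s in skills if s != skill])
            pairs.append((sentence, negative, 0.0))  # incorrect skill
    return pairs

pairs = build_pairs({
    "Tax Computation": ["Analyzes tax liabilities and applies corrections."],
    "Data Engineering": ["Designs scalable data pipelines."],
})
# Two pairs per sentence: one positive, one negative.
```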

Training Objective

Loss Function: CosineSimilarityLoss (mean squared error between the predicted cosine similarity and the 0.0/1.0 label)

The model learns to maximize cosine similarity between a JD sentence and its correct skill, while minimizing similarity to randomly sampled incorrect skills. This contrastive setup produces well-separated embeddings.
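A minimal NumPy sketch of the quantity this objective optimizes — illustrative only, not the sentence-transformers implementation:

```python
import numpy as np

def cosine_similarity_loss(emb_a, emb_b, labels):
    """MSE between the cosine similarity of each (sentence, skill)
    embedding pair and its 0.0/1.0 label."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)  # row-wise cosine similarity
    return float(np.mean((cos - labels) ** 2))

# Toy embeddings: a well-matched positive pair and a mismatched negative pair.
sent = np.array([[1.0, 0.0], [0.0, 1.0]])
skill = np.array([[1.0, 0.1], [1.0, 0.0]])
loss = cosine_similarity_loss(sent, skill, np.array([1.0, 0.0]))
# Loss is near 0 for these well-separated toy pairs.
```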

Training Hyperparameters

Epochs: 5
Batch Size: 64
Learning Rate: 5e-05
Optimizer: AdamW (fused)
Warmup Steps: 10% of total steps
Scheduler: Linear decay
Seed: 42
Precision: FP32
Deterministic: Yes (CUBLAS_WORKSPACE_CONFIG=:4096:8)

Training Logs

Epoch | Step | Training Loss
1.45 | 500 | 0.0822
2.91 | 1,000 | 0.0567
4.36 | 1,500 | 0.0493

Evaluation

Benchmark: Held-Out Skill Matching (10% split, 2,195 pairs)

Embeddings encoded with normalize_embeddings=True. Cosine similarity computed as dot product of normalized vectors.

Model | AUC | Acc @ 0.5 | Best Accuracy | Pos Mean Sim | Neg Mean Sim
all-MiniLM-L6-v2 (baseline) | 0.978 | 0.810 | 0.928 | 0.530 | 0.133
SSF-MiniLM v1 (1 epoch) | 0.989 | 0.949 | 0.952 | 0.799 | 0.131
SSF-MiniLM v2 (5 epochs) | 0.995 | 0.968 | 0.971 | 0.845 | 0.088

Key Observations

  • AUC improved from 0.978 to 0.995 — the model almost perfectly ranks correct skills above incorrect ones
  • Positive similarity increased from 0.530 to 0.845 — correct pairs are now strongly matched
  • Negative similarity dropped from 0.133 to 0.088 — incorrect pairs are pushed further apart
  • Best accuracy improved from 92.8% to 97.1% — a 4.3-percentage-point gain over baseline
  • Accuracy @ 0.5 jumped from 81.0% to 96.8% — the default threshold works well out of the box

Metrics Explained

  • AUC: Measures ranking quality — how often the model scores positive pairs above negative pairs (1.0 = perfect ranking)
  • Accuracy @ 0.5: Classification accuracy using cosine similarity threshold of 0.5
  • Best Accuracy: Best accuracy found by scanning thresholds from 1st–99th percentile of scores
  • Pos/Neg Mean Similarity: Average cosine similarity for correct vs incorrect skill pairs
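The AUC and best-accuracy computations can be sketched as below. This is a simplified illustration; the actual evaluation script may differ in details such as tie handling or threshold granularity:

```python
import numpy as np

def auc_and_best_accuracy(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly
    (ties count as half), plus the best accuracy found by scanning the
    1st-99th percentiles of scores as thresholds."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]  # all positive-vs-negative pairs
    auc = ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / diffs.size
    best_acc = 0.0
    for t in np.percentile(scores, range(1, 100)):
        acc = np.mean((scores >= t) == labels)
        best_acc = max(best_acc, acc)
    return float(auc), float(best_acc)

auc, best_acc = auc_and_best_accuracy([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
# Perfectly separated toy scores -> AUC 1.0, best accuracy 1.0
```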

Performance Summary

Strengths

  • Excellent skill discrimination (AUC 0.995) across 2,196 diverse skills
  • Strong positive/negative separation (0.845 vs 0.088 mean similarity)
  • Works well with the default 0.5 threshold — no tuning needed for most applications
  • Small model footprint (~87MB) enables fast CPU inference
  • Covers a comprehensive range of workforce skills: IT, healthcare, engineering, finance, creative, trades, and more

Weaknesses

  • Optimized for SkillsFuture Framework skills — may underperform on skills not in the SSF taxonomy
  • Trained on synthetic JD sentences — real-world JDs with unusual formatting or jargon may need additional fine-tuning
  • Short text bias — best with single sentences or phrases; long paragraphs should be split into sentences first
  • English only

Limitations

  • Domain specificity: The model is fine-tuned on Singapore's SkillsFuture Framework. Skills from other taxonomies (O*NET, ESCO, ISCO) may not match as precisely without further adaptation.
  • Synthetic training data: JD-style sentences were generated by an LLM (Qwen3-1.7B), which may not capture all real-world phrasing variations.
  • No cross-lingual support: English only. Multilingual JDs will need translation first.
  • Short text focus: Designed for sentence-level matching. For multi-paragraph JDs, split into sentences before encoding.
  • Skill taxonomy coverage: Limited to the 2,196 skills in the SSF dataset. New or niche skills outside this taxonomy will fall back to base model behavior.
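Since the model is tuned for sentence-level inputs, longer JDs should be split before encoding. A naive regex-based splitter is sketched below; a production pipeline might use nltk or spaCy instead to handle abbreviations and other edge cases:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace.
    Good enough for typical JD prose; swap in nltk/spaCy for harder cases."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

jd = ("Build and deploy ML models on AWS. "
      "Maintain CI/CD pipelines! Mentor junior engineers.")
sentences = split_sentences(jd)
# Each resulting sentence can then be encoded separately with model.encode(...)
```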

Ethical Considerations

  • Bias: The SSF taxonomy reflects Singapore's workforce structure. Skills from underrepresented or emerging fields may have fewer training examples.
  • Fairness: The model matches text to skills — it does not evaluate candidates. Applications should ensure skill matching does not introduce hiring bias.
  • Responsible use: This model is a tool for structuring skill data, not for making automated hiring decisions. Always include human review in high-stakes HR workflows.
  • Data provenance: Training data is synthetically generated. No personal or proprietary job description data was used in training.

Usage

Quick Start (Sentence Transformers)

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Encode job description sentences and skills
sentences = [
    "Design and implement scalable data pipelines for real-time analytics.",
    "Manage patient records and ensure compliance with healthcare regulations.",
]
skills = [
    "Data Engineering",
    "Healthcare Records Management",
    "Polymer Processing",
]

sentence_embeddings = model.encode(sentences, normalize_embeddings=True)
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Compute similarity (dot product of normalized vectors = cosine similarity)
import numpy as np
similarities = np.dot(sentence_embeddings, skill_embeddings.T)
print(similarities)
# sentence 0 -> "Data Engineering" = high score
# sentence 1 -> "Healthcare Records Management" = high score
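To turn a similarity matrix like the one above into a single predicted skill per sentence, take the row-wise argmax. The similarity values below are made up for illustration:

```python
import numpy as np

# Toy similarity matrix: rows = sentences, columns = skills
similarities = np.array([
    [0.81, 0.12, 0.05],  # sentence 0
    [0.10, 0.77, 0.03],  # sentence 1
])
skills = ["Data Engineering", "Healthcare Records Management", "Polymer Processing"]

# Row-wise argmax picks the best-matching skill for each sentence
best = [skills[i] for i in similarities.argmax(axis=1)]
# best == ["Data Engineering", "Healthcare Records Management"]
```

For multi-label extraction (several skills per sentence), keep every skill whose score clears a threshold instead, as in the pipeline example that follows.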

Skill Extraction Pipeline

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Your skill taxonomy (or load from SSF dataset)
skills = ["Data Engineering", "Machine Learning", "Project Management", "Cloud Computing"]
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Extract skills from a JD sentence
jd_sentence = "Build and deploy ML models on AWS with CI/CD pipelines."
jd_embedding = model.encode([jd_sentence], normalize_embeddings=True)

scores = np.dot(jd_embedding, skill_embeddings.T)[0]
threshold = 0.5

for skill, score in sorted(zip(skills, scores), key=lambda x: -x[1]):
    if score >= threshold:
        print(f"  {skill}: {score:.3f}")

Using with Transformers (Direct)

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")
model = AutoModel.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    # L2 normalize so dot products equal cosine similarities
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

query = encode(["Build scalable APIs with microservice architecture"])
skills = encode(["API Development", "Microservice Architecture", "Gardening"])
similarities = torch.mm(query, skills.T)
print(similarities)

Deployment Notes

Model Size: ~87 MB (safetensors)
Inference Speed: ~5,000 sentences/sec on GPU, ~500/sec on CPU (batch 64)
Memory: ~350 MB RAM loaded
ONNX Compatible: Yes (via sentence-transformers export)
Quantization: Compatible with INT8/FP16 for faster inference
Recommended Hardware: Works on CPU; GPU recommended for batch processing
Serving: Compatible with Triton, TorchServe, FastAPI, or any ONNX runtime

Training Data

The training dataset is available at imocha-ai-org/ssf-skill-extraction-pairs and contains:

  • pairs.jsonl — 21,958 training pairs (sentence, skill, label)
  • generated_sentences.json — 5 synthetic JD sentences per skill (2,196 skills)
  • meta.json — dataset metadata

Framework Versions

  • Python: 3.10.19
  • Sentence Transformers: 5.2.2
  • Transformers: 4.57.3
  • PyTorch: 2.9.1+cu128
  • Accelerate: 1.12.0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

BibTeX

@misc{imocha2026ssf-miniLM,
  title     = {SSF-MiniLM Finetuned v2: Skill Extraction Embedding Model},
  author    = {imocha AI},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/imocha-ai-org/ssf-miniLM-finetuned-v2}
}

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author    = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month     = "11",
    year      = "2019",
    publisher = "Association for Computational Linguistics",
    url       = "https://arxiv.org/abs/1908.10084",
}
