SSF-MiniLM Finetuned v2 — Skill Extraction Embedding Model
A sentence-transformers model fine-tuned from all-MiniLM-L6-v2 for matching job description sentences to standardized skills from Singapore's SkillsFuture Framework (SSF).
The model maps sentences and skill names into a 384-dimensional dense vector space where job description text lands close to its corresponding skill, enabling accurate semantic skill extraction, tagging, and retrieval.
Highlights
- AUC 0.995 on held-out validation (up from 0.978 baseline)
- 97.1% best accuracy on skill-sentence matching (up from 92.8% baseline)
- Covers 2,196 unique skills across all SSF sectors
- Fast inference: 22M params, runs efficiently on CPU and GPU
- Drop-in replacement for `all-MiniLM-L6-v2` — same API, better skill matching
Model Details
| Property | Value |
|---|---|
| Model Type | Sentence Transformer (Bi-Encoder) |
| Base Model | sentence-transformers/all-MiniLM-L6-v2 |
| Architecture | BERT (6 layers, 12 heads, 384 hidden) |
| Parameters | ~22M |
| Max Sequence Length | 256 tokens |
| Output Dimensionality | 384 |
| Similarity Function | Cosine Similarity |
| Pooling | Mean Pooling + L2 Normalization |
| Language | English |
| License | Apache 2.0 |
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
  (2): Normalize()
)
```
Intended Use
Primary Use Cases
- Skill Extraction from Job Descriptions — identify which standardized skills a JD sentence refers to
- Skill Tagging / Auto-labeling — tag resumes, courses, or learning content with SSF skills
- Semantic Skill Search — find relevant skills for a given text query
- Skill Gap Analysis — compare job requirements against employee skill profiles
- HR Tech / Workforce Analytics — power matching engines, recommendation systems, and talent platforms
Suitable Applications
- Resume parsing and skill extraction pipelines
- Job-to-candidate matching engines
- Learning & development recommendation systems
- Skills taxonomy mapping and alignment
- Workforce planning and analytics dashboards
Out-of-Scope Uses
- General-purpose sentence similarity (use the base model instead)
- Non-English text
- Tasks requiring generative output (this is an embedding model)
- Medical, legal, or safety-critical classification without human review
Training Details
Dataset
| Property | Value |
|---|---|
| Name | SSF Skill Extraction Pairs |
| Domain | Workforce Skills / HR / Job Descriptions |
| Source Skills | 2,196 unique skills from Singapore SkillsFuture Framework |
| Synthetic Sentences | 5 JD-style sentences per skill, generated via Qwen3-1.7B (Ollama) |
| Total Training Pairs | 21,958 (one positive and one randomly-sampled negative per sentence) |
| Format | (sentence, skill_name, label) — label 1.0 for correct skill, 0.0 for random incorrect skill |
| Validation Split | 10% held-out (2,195 pairs) |
Sample training pairs:
| Sentence | Skill | Label |
|---|---|---|
| Analyzes tax liabilities, identifies applicable rates, and applies corrections to ensure proper calculation and reporting. | Tax Computation | 1.0 |
| Monitor plant health by assessing symptoms and identifying disease risks. | Plant Health Management and Disease Control | 1.0 |
| Analyzes cross-cultural communication challenges in medical and legal contexts, optimizing translation strategies for diverse stakeholders. | Audience Segmentation | 0.0 |
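Each pair like the samples above maps naturally to one JSON line in the dataset's `pairs.jsonl`. The field names below are illustrative assumptions, not the confirmed schema (check `meta.json` in the dataset repository):

```python
import json

# One hypothetical line of pairs.jsonl in the (sentence, skill, label) format
# described above; field names are illustrative assumptions.
line = '{"sentence": "Monitor plant health by assessing symptoms.", "skill": "Plant Health Management and Disease Control", "label": 1.0}'
pair = json.loads(line)
print(pair["skill"], pair["label"])
```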
Training Objective
Loss Function: `CosineSimilarityLoss` (mean squared error between the predicted cosine similarity and the 1.0/0.0 pair label)
The model learns to maximize cosine similarity between a JD sentence and its correct skill, while minimizing similarity to randomly-sampled incorrect skills. This contrastive setup produces well-separated embeddings.
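Conceptually, `CosineSimilarityLoss` reduces to an MSE on cosine scores. A minimal, self-contained sketch with toy embeddings (random values; only the 384-dim shape matches the model's output):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for sentence and skill embeddings (384-dim, like the model's output)
emb_sentence = F.normalize(torch.randn(2, 384), dim=1)
emb_skill = F.normalize(torch.randn(2, 384), dim=1)
labels = torch.tensor([1.0, 0.0])  # correct pair, random incorrect pair

cos_sim = (emb_sentence * emb_skill).sum(dim=1)  # per-pair cosine similarity
loss = F.mse_loss(cos_sim, labels)               # the CosineSimilarityLoss objective
print(loss.item())
```

Minimizing this loss pushes cosine similarity toward 1.0 for correct pairs and 0.0 for incorrect ones, which is exactly the separation visible in the evaluation table below.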
Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Batch Size | 64 |
| Learning Rate | 5e-05 |
| Optimizer | AdamW (fused) |
| Warmup Steps | 10% of total steps |
| Scheduler | Linear decay |
| Seed | 42 |
| Precision | FP32 |
| Deterministic | Yes (CUBLAS_WORKSPACE_CONFIG=:4096:8) |
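The deterministic setup listed in the table can be reproduced with a few lines before training starts. This is a sketch of standard PyTorch reproducibility settings, not the published training script:

```python
import os
import random

import numpy as np
import torch

# Must be set before the first CUDA matmul for deterministic cuBLAS behavior
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Raises an error if a nondeterministic op is used during training
torch.use_deterministic_algorithms(True)
```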
Training Logs
| Epoch | Step | Training Loss |
|---|---|---|
| 1.45 | 500 | 0.0822 |
| 2.91 | 1,000 | 0.0567 |
| 4.36 | 1,500 | 0.0493 |
Evaluation
Benchmark: Held-Out Skill Matching (10% split, 2,195 pairs)
Embeddings encoded with normalize_embeddings=True. Cosine similarity computed as dot product of normalized vectors.
| Model | AUC | Acc @ 0.5 | Best Accuracy | Pos Mean Sim | Neg Mean Sim |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 (baseline) | 0.978 | 0.810 | 0.928 | 0.530 | 0.133 |
| SSF-MiniLM v1 (1 epoch) | 0.989 | 0.949 | 0.952 | 0.799 | 0.131 |
| SSF-MiniLM v2 (5 epochs) | 0.995 | 0.968 | 0.971 | 0.845 | 0.088 |
Key Observations
- AUC improved from 0.978 to 0.995 — the model almost perfectly ranks correct skills above incorrect ones
- Positive similarity increased from 0.530 to 0.845 — correct pairs are now strongly matched
- Negative similarity dropped from 0.133 to 0.088 — incorrect pairs are pushed further apart
- Best accuracy improved from 92.8% to 97.1% — a 4.3-percentage-point gain over the baseline
- Accuracy @ 0.5 jumped from 81.0% to 96.8% — the default threshold works well out of the box
Metrics Explained
- AUC: Measures ranking quality — how often the model scores positive pairs above negative pairs (1.0 = perfect ranking)
- Accuracy @ 0.5: Classification accuracy using cosine similarity threshold of 0.5
- Best Accuracy: Best accuracy found by scanning thresholds from 1st–99th percentile of scores
- Pos/Neg Mean Similarity: Average cosine similarity for correct vs incorrect skill pairs
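All four metrics can be reproduced from raw similarity scores. A toy illustration with scikit-learn (the scores and labels here are made up, not the held-out set):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy cosine-similarity scores and ground-truth labels (illustrative only)
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.3])
labels = np.array([1, 1, 1, 0, 0, 0])

# AUC: probability a positive pair outranks a negative pair
auc = roc_auc_score(labels, scores)

# Accuracy at the default 0.5 threshold
acc_05 = ((scores >= 0.5) == labels.astype(bool)).mean()

# Best accuracy: scan thresholds over the 1st-99th percentile of scores
thresholds = np.percentile(scores, np.arange(1, 100))
best_acc = max(((scores >= t) == labels.astype(bool)).mean() for t in thresholds)

# Pos/Neg mean similarity
pos_mean = scores[labels == 1].mean()
neg_mean = scores[labels == 0].mean()
```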
Performance Summary
Strengths
- Excellent skill discrimination (AUC 0.995) across 2,196 diverse skills
- Strong positive/negative separation (0.845 vs 0.088 mean similarity)
- Works well with the default 0.5 threshold — no tuning needed for most applications
- Small model footprint (~87MB) enables fast CPU inference
- Covers a comprehensive range of workforce skills: IT, healthcare, engineering, finance, creative, trades, and more
Weaknesses
- Optimized for SkillsFuture Framework skills — may underperform on skills not in the SSF taxonomy
- Trained on synthetic JD sentences — real-world JDs with unusual formatting or jargon may need additional fine-tuning
- Short text bias — best with single sentences or phrases; long paragraphs should be split into sentences first
- English only
Limitations
- Domain specificity: The model is fine-tuned on Singapore's SkillsFuture Framework. Skills from other taxonomies (O*NET, ESCO, ISCO) may not match as precisely without further adaptation.
- Synthetic training data: JD-style sentences were generated by an LLM (Qwen3-1.7B), which may not capture all real-world phrasing variations.
- No cross-lingual support: English only. Multilingual JDs will need translation first.
- Short text focus: Designed for sentence-level matching. For multi-paragraph JDs, split into sentences before encoding.
- Skill taxonomy coverage: Limited to the 2,196 skills in the SSF dataset. New or niche skills outside this taxonomy will fall back to base model behavior.
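Since the model is tuned for sentence-level matching, longer JDs should be segmented before encoding. A naive regex splitter is enough to start (a sketch; spaCy or NLTK segmenters are more robust for production text):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive split on sentence-ending punctuation followed by whitespace;
    # swap in spaCy or NLTK for abbreviations, bullets, and messy formatting.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

jd = "Build data pipelines. Deploy models to AWS! Manage stakeholder comms?"
print(split_sentences(jd))
```

Each resulting sentence can then be encoded and matched against the skill embeddings independently.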
Ethical Considerations
- Bias: The SSF taxonomy reflects Singapore's workforce structure. Skills from underrepresented or emerging fields may have fewer training examples.
- Fairness: The model matches text to skills — it does not evaluate candidates. Applications should ensure skill matching does not introduce hiring bias.
- Responsible use: This model is a tool for structuring skill data, not for making automated hiring decisions. Always include human review in high-stakes HR workflows.
- Data provenance: Training data is synthetically generated. No personal or proprietary job description data was used in training.
Usage
Quick Start (Sentence Transformers)
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model
model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Encode job description sentences and skills
sentences = [
    "Design and implement scalable data pipelines for real-time analytics.",
    "Manage patient records and ensure compliance with healthcare regulations.",
]
skills = [
    "Data Engineering",
    "Healthcare Records Management",
    "Polymer Processing",
]
sentence_embeddings = model.encode(sentences, normalize_embeddings=True)
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Compute similarity (dot product of normalized vectors = cosine similarity)
similarities = np.dot(sentence_embeddings, skill_embeddings.T)
print(similarities)
# sentence 0 -> "Data Engineering" = high score
# sentence 1 -> "Healthcare Records Management" = high score
```
Skill Extraction Pipeline
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("imocha-ai-org/ssf-miniLM-finetuned-v2")

# Your skill taxonomy (or load from the SSF dataset)
skills = ["Data Engineering", "Machine Learning", "Project Management", "Cloud Computing"]
skill_embeddings = model.encode(skills, normalize_embeddings=True)

# Extract skills from a JD sentence
jd_sentence = "Build and deploy ML models on AWS with CI/CD pipelines."
jd_embedding = model.encode([jd_sentence], normalize_embeddings=True)

scores = np.dot(jd_embedding, skill_embeddings.T)[0]
threshold = 0.5
for skill, score in sorted(zip(skills, scores), key=lambda x: -x[1]):
    if score >= threshold:
        print(f"  {skill}: {score:.3f}")
```
Using with Transformers (Direct)
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")
model = AutoModel.from_pretrained("imocha-ai-org/ssf-miniLM-finetuned-v2")

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling over non-padding tokens
    attention_mask = inputs["attention_mask"].unsqueeze(-1)
    embeddings = (outputs.last_hidden_state * attention_mask).sum(1) / attention_mask.sum(1)
    # L2 normalize so dot products equal cosine similarities
    return torch.nn.functional.normalize(embeddings, p=2, dim=1)

query = encode(["Build scalable APIs with microservice architecture"])
skills = encode(["API Development", "Microservice Architecture", "Gardening"])
similarities = torch.mm(query, skills.T)
print(similarities)
```
Deployment Notes
| Property | Detail |
|---|---|
| Model Size | ~87 MB (safetensors) |
| Inference Speed | ~5,000 sentences/sec on GPU, ~500/sec on CPU (batch 64) |
| Memory | ~350 MB RAM loaded |
| ONNX Compatible | Yes (via sentence-transformers export) |
| Quantization | Compatible with INT8/FP16 for faster inference |
| Recommended Hardware | Works on CPU; GPU recommended for batch processing |
| Serving | Compatible with Triton, TorchServe, FastAPI, or any ONNX runtime |
Training Data
The training dataset is available at imocha-ai-org/ssf-skill-extraction-pairs and contains:
- `pairs.jsonl` — 21,958 training pairs (sentence, skill, label)
- `generated_sentences.json` — 5 synthetic JD sentences per skill (2,196 skills)
- `meta.json` — dataset metadata
Framework Versions
- Python: 3.10.19
- Sentence Transformers: 5.2.2
- Transformers: 4.57.3
- PyTorch: 2.9.1+cu128
- Accelerate: 1.12.0
- Datasets: 4.3.0
- Tokenizers: 0.22.2
Citation
BibTeX
```bibtex
@misc{imocha2026ssf-miniLM,
  title     = {SSF-MiniLM Finetuned v2: Skill Extraction Embedding Model},
  author    = {imocha AI},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/imocha-ai-org/ssf-miniLM-finetuned-v2}
}
```
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title     = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author    = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month     = "11",
  year      = "2019",
  publisher = "Association for Computational Linguistics",
  url       = "https://arxiv.org/abs/1908.10084",
}
```
Contact / Maintainer
- Organization: imocha AI
- Maintainer: Sarvadnya
- Issues: Open an issue on the model repository