# bge-small-jobs-data-embedding

A fine-tuned and ONNX-quantized embedding model for job-to-candidate matching.
Built on BAAI/bge-small-en-v1.5 and trained on a purpose-built dataset of ~1,850 job/candidate triplets covering 30+ tech domains.

The quantized INT8 ONNX model is ~4× smaller and ~2× faster on CPU than the original FP32 PyTorch model, with no measurable loss in retrieval quality (max cosine diff < 0.001).
## What it does

Given a job posting and a candidate profile, both represented as short skill strings, the model embeds them into a 384-dimensional space where good matches are close and mismatches are far.

- Job → "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS"
- User → "User Skills: Python, FastAPI, Postgres, docker-compose, AWS"

The model correctly ranks this user above a Go engineer, a DevOps engineer, or a data analyst, even when the user writes skill aliases like Postgres instead of PostgreSQL or K8s instead of Kubernetes.
## Quick start

### Install

```bash
pip install onnxruntime transformers numpy
```
### Load and embed

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('upply-org/bge-small-jobs-data-embedding')
session = ort.InferenceSession('model_quantized.onnx', providers=['CPUExecutionProvider'])

def embed(texts: list[str]) -> np.ndarray:
    enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors='np')
    out = session.run(None, {'input_ids': enc['input_ids'], 'attention_mask': enc['attention_mask']})
    cls = out[0][:, 0, :]  # CLS-token pooling, matching the BGE default
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)  # L2-normalize so dot product = cosine
```
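The snippet above assumes `model_quantized.onnx` is already in the working directory. If it isn't, it can be fetched from the Hub first; a minimal sketch using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download the quantized model from the Hub and point the session at the cached path.
onnx_path = hf_hub_download(
    repo_id='upply-org/bge-small-jobs-data-embedding',
    filename='model_quantized.onnx',
)
session = ort.InferenceSession(onnx_path, providers=['CPUExecutionProvider'])
```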
### Rank candidates for a job

```python
job = "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS"

candidates = [
    "User Skills: Python, FastAPI, Postgres, docker-compose, AWS",  # strong match
    "User Skills: Python, Django, MySQL, Redis",                    # partial match
    "User Skills: React, TypeScript, Next.js, Tailwind CSS",        # mismatch: frontend
    "User Skills: Kubernetes, Terraform, Helm, CI/CD, Ansible",     # mismatch: DevOps
]

job_emb = embed([job])
cand_emb = embed(candidates)

# Embeddings are unit-normalized, so the dot product is cosine similarity.
scores = (job_emb @ cand_emb.T).flatten()
for score, cand in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {cand[:70]}")
```
## Input format

| Field | Format |
|---|---|
| Job (anchor) | `"Job Title: <title>. Required Skills: <s1>, <s2>, ..."` |
| Candidate | `"User Skills: <s1>, <s2>, ..."` |

Skills are comma-separated and order does not matter. The model handles common aliases (Postgres / PostgreSQL, K8s / Kubernetes, py / Python, golang / Go, etc.).
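For programmatic use, two small helpers (hypothetical, not shipped with the model) keep inputs consistent with this scheme:

```python
def format_job(title: str, skills: list[str]) -> str:
    # Anchor format: "Job Title: <title>. Required Skills: <s1>, <s2>, ..."
    return f"Job Title: {title}. Required Skills: {', '.join(skills)}"

def format_candidate(skills: list[str]) -> str:
    # Candidate format: "User Skills: <s1>, <s2>, ..."
    return f"User Skills: {', '.join(skills)}"

job = format_job('Senior Python Engineer', ['Python', 'FastAPI', 'PostgreSQL', 'Docker', 'AWS'])
cand = format_candidate(['Python', 'FastAPI', 'Postgres', 'docker-compose', 'AWS'])
```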
## Training

### Base model

BAAI/bge-small-en-v1.5: 33M parameters, 384-dim embeddings, 512-token context.

### Dataset (v2, current)

~1,850 triplets (anchor, positive, negative) across two training rounds.
**Round 1** (~1,600 triplets), generated via OpenRouter (Llama 3.3 70B) across 30 batches:
| Category | Description |
|---|---|
| Core domains | Python/Java/Go backend, React/Vue frontend, DevOps/SRE, ML/NLP/MLOps, iOS/Android |
| Missing roles | C#/.NET, LLM Engineer, QA/SDET, Cloud Security, SOC Analyst, BI Developer |
| Synonym pairs | Postgres, K8s, sklearn, golang, dotnet, langchain, … |
| Partial matches | User covers 50–65% of required skills |
| Overqualified | User lists 15–22 skills, job needs 4–6 |
| Career changers | 40% target-domain + 60% adjacent-domain skills |
| Hard negatives | Maximally confusable pairs (ML vs Data Science, DevOps vs Cloud Security, …) |
**Round 2** (jobs_train_hard_3_.jsonl, 128 new triplets, this run), targeting the 10 queries that failed the v1 evaluation (80% rank-1 accuracy baseline). New domains and patterns added:
| Category | New data highlights |
|---|---|
| Career changer fixes | cc_to_devops, cc_to_llm, cc_to_cloud_sec hard pairs |
| Hard negative fixes | hard_cloudsec_not_devops (Cloud Security vs Platform/DevOps) |
| Synonym fixes | syn_backend (alias confusion), syn_devops (platform overlap) |
| New domains | React Native, Rust/WASM, Robotics/ROS, Game Dev (Unity/Unreal), Blockchain/Solidity, UX Design, Incident Response, Pentesting, DevSecOps, Embedded Systems |
| Java sub-domain clarity | Junior / Mid / Lead / Senior Java roles with distinct negative pairs |
| SRE vs DevOps vs Platform | Finer-grained hard negatives across infra roles |
### Fine-tuning

- Loss: MultipleNegativesRankingLoss
- Epochs: 3
- Batch size: 16
- Warmup: 10 steps
- Device: CPU
- Framework: sentence-transformers 5.x
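These settings map directly onto the classic sentence-transformers training loop. A minimal sketch, assuming each JSONL line holds one `{anchor, positive, negative}` object (the exact field names are an assumption; the schema isn't documented here):

```python
import json
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('BAAI/bge-small-en-v1.5', device='cpu')

# Assumed JSONL schema: {"anchor": ..., "positive": ..., "negative": ...} per line.
examples = []
for path in ['jobs_train.jsonl', 'jobs_train_hard.jsonl', 'jobs_train_hard_3_.jsonl']:
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            examples.append(InputExample(texts=[rec['anchor'], rec['positive'], rec['negative']]))

loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives plus the explicit hard negative

model.fit(train_objectives=[(loader, loss)], epochs=3, warmup_steps=10)
model.save('bge-small-jobs-finetuned')
```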
### Export

FP32 PyTorch → ONNX FP32 (via optimum) → ONNX INT8 (dynamic quantization).
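A sketch of that pipeline with optimum; output paths and the avx512_vnni quantization config are assumptions, since the card only states "via optimum" and "dynamic quantization":

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# FP32 PyTorch -> ONNX FP32
ORTModelForFeatureExtraction.from_pretrained(
    'bge-small-jobs-finetuned', export=True
).save_pretrained('onnx')  # writes onnx/model.onnx

# ONNX FP32 -> ONNX INT8 (dynamic weight quantization)
quantizer = ORTQuantizer.from_pretrained('onnx')
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir='onnx', quantization_config=qconfig)  # writes onnx/model_quantized.onnx
```

The "max cosine diff < 0.001" figure can be sanity-checked by embedding the same batch with both graphs and comparing:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('bge-small-jobs-finetuned')
enc = tok(['User Skills: Python, FastAPI, Postgres'], padding=True, return_tensors='np')

def cls_embed(path: str) -> np.ndarray:
    sess = ort.InferenceSession(path, providers=['CPUExecutionProvider'])
    feed = {i.name: enc[i.name] for i in sess.get_inputs()}  # feed whatever inputs the graph declares
    cls = sess.run(None, feed)[0][:, 0, :]
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)

a = cls_embed('onnx/model.onnx')
b = cls_embed('onnx/model_quantized.onnx')
print('max cosine diff:', float(np.max(1.0 - (a * b).sum(axis=1))))
```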
## Evaluation (v1 baseline: 80% rank-1)

49 queries across 7 categories, measured with InformationRetrievalEvaluator (NDCG@10); a sketch of the harness appears below the table.
| Category | Queries | Pass | Fail |
|---|---|---|---|
| Standard match | 18 | 18 | 0 |
| Synonyms | 5 | 3 | 2 (syn_backend, syn_devops) |
| Partial match | 5 | 5 | 0 |
| Overqualified | 2 | 1 | 1 (over_analyst → #3) |
| Single skill | 7 | 6 | 1 (single_python → #11) |
| Career changer | 4 | 0 | 4 (cc_to_devops → #27, cc_to_llm → #11, cc_to_cloud_sec → #7, cc_to_ml → #3) |
| Hard negatives | 8 | 7 | 1 (hard_cloudsec_not_devops → #5) |
Round 2 training data (jobs_train_hard_3_.jsonl) directly targets all 10 failing queries.
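The harness can be reconstructed with sentence-transformers' built-in evaluator. A minimal sketch with placeholder query/corpus IDs (the actual 49-query set is not published in this card):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Placeholder data shaped like the real eval set: query id -> job text,
# corpus id -> candidate text, and the set of relevant candidates per query.
queries = {'std_1': 'Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS'}
corpus = {
    'cand_1': 'User Skills: Python, FastAPI, Postgres, docker-compose, AWS',
    'cand_2': 'User Skills: Kubernetes, Terraform, Helm, CI/CD, Ansible',
}
relevant = {'std_1': {'cand_1'}}

model = SentenceTransformer('bge-small-jobs-finetuned')  # evaluate the FP32 fine-tune
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant, ndcg_at_k=[10], name='jobs-eval')
print(evaluator(model))
```

The evaluator also reports accuracy@1 by default, which corresponds to the rank-1 numbers in the table above.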
## Files

| File | Size | Description |
|---|---|---|
| `model_quantized.onnx` | ~34 MB | INT8 quantized; use this for inference |
| `model.onnx` | ~133 MB | FP32 ONNX; for debugging or higher precision |
| `tokenizer.json` | – | BGE tokenizer |
| `tokenizer_config.json` | – | Tokenizer config |
| `vocab.txt` | – | Vocabulary |
| `jobs_train.jsonl` | – | Round 1 base training triplets |
| `jobs_train_hard.jsonl` | – | Round 1 hard-negative triplets |
| `jobs_train_hard_3_.jsonl` | – | Round 2 training triplets (128 pairs, hard negatives) |
## License

Apache 2.0 (see LICENSE). The base model (BAAI/bge-small-en-v1.5) is MIT licensed.