bge-small-jobs-data-embedding

A fine-tuned and ONNX-quantized embedding model for job-to-candidate matching.
Built on BAAI/bge-small-en-v1.5, trained on a purpose-built dataset of ~1,850 job/candidate triplets covering 30+ tech domains.

The quantized INT8 ONNX model is ~4× smaller and ~2× faster on CPU than the original FP32 PyTorch model, with no measurable loss in retrieval quality (max cosine diff < 0.001).


What it does

Given a job posting and a candidate profile, both represented as short skill strings, the model embeds them into a 384-dimensional space where good matches are close and mismatches are far.

Job  → "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS"
User → "User Skills: Python, FastAPI, Postgres, docker-compose, AWS"

The model correctly ranks this user above a Go engineer, a DevOps engineer, or a data analyst, even when the user writes skill aliases such as Postgres instead of PostgreSQL or K8s instead of Kubernetes.


Quick start

Install

pip install onnxruntime transformers numpy

Load and embed

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('upply-org/bge-small-jobs-data-embedding')
session   = ort.InferenceSession('model_quantized.onnx', providers=['CPUExecutionProvider'])

def embed(texts: list[str]) -> np.ndarray:
    enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors='np')
    # Build the feed from the session's declared inputs, so token_type_ids is
    # included automatically if the ONNX export expects it.
    feed = {inp.name: enc[inp.name] for inp in session.get_inputs()}
    out = session.run(None, feed)
    cls = out[0][:, 0, :]   # CLS-token pooling, matching the BGE default
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)

Rank candidates for a job

job = "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS"

candidates = [
    "User Skills: Python, FastAPI, Postgres, docker-compose, AWS",   # ✅ strong match
    "User Skills: Python, Django, MySQL, Redis",                      # ✅ partial match
    "User Skills: React, TypeScript, Next.js, Tailwind CSS",          # ❌ frontend
    "User Skills: Kubernetes, Terraform, Helm, CI/CD, Ansible",       # ❌ DevOps
]

job_emb  = embed([job])
cand_emb = embed(candidates)
scores   = (job_emb @ cand_emb.T).flatten()

for score, cand in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {cand[:70]}")
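In practice you usually want a shortlist rather than raw scores. The helper below is an illustrative sketch, not part of the model card's API: it assumes unit-normalized embeddings (as produced by embed above), and the 0.5 cutoff is an arbitrary starting point to tune on your own data.

```python
import numpy as np

def top_matches(job_emb, cand_embs, candidates, k=3, min_score=0.5):
    """Return up to k (candidate, score) pairs above a similarity cutoff."""
    # Embeddings are unit-normalized, so cosine similarity is a plain dot product.
    scores = cand_embs @ job_emb.ravel()
    order = np.argsort(scores)[::-1][:k]
    return [(candidates[i], float(scores[i])) for i in order if scores[i] >= min_score]
```

With the embeddings from the example above, `top_matches(job_emb, cand_emb, candidates)` returns the backend candidates and drops the frontend/DevOps profiles once their scores fall below the cutoff.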

Input format

| Field | Format |
|---|---|
| Job (anchor) | "Job Title: <title>. Required Skills: <s1>, <s2>, ..." |
| Candidate | "User Skills: <s1>, <s2>, ..." |

Skills are comma-separated. Order does not matter. The model handles common aliases (Postgres / PostgreSQL, K8s / Kubernetes, py / Python, golang / Go, etc.).
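If your data is structured (e.g. parsed job postings), small formatting helpers keep inputs consistent with this format. The two functions below are hypothetical conveniences, not part of the released code:

```python
def format_job(title: str, skills: list[str]) -> str:
    # Anchor format: "Job Title: <title>. Required Skills: <s1>, <s2>, ..."
    return f"Job Title: {title}. Required Skills: {', '.join(skills)}"

def format_candidate(skills: list[str]) -> str:
    # Candidate format: "User Skills: <s1>, <s2>, ..."
    return f"User Skills: {', '.join(skills)}"
```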


Training

Base model

BAAI/bge-small-en-v1.5: 33M parameters, 384-dim embeddings, 512-token context.

Dataset: v2 (current)

~1,850 triplets (anchor, positive, negative) across two training rounds:

Round 1 (~1,600 triplets), generated via OpenRouter (Llama 3.3 70B) across 30 batches:

| Category | Description |
|---|---|
| Core domains | Python/Java/Go backend, React/Vue frontend, DevOps/SRE, ML/NLP/MLOps, iOS/Android |
| Missing roles | C#/.NET, LLM Engineer, QA/SDET, Cloud Security, SOC Analyst, BI Developer |
| Synonym pairs | Postgres, K8s, sklearn, golang, dotnet, langchain, ... |
| Partial matches | User covers 50–65% of required skills |
| Overqualified | User lists 15–22 skills, job needs 4–6 |
| Career changers | 40% target-domain + 60% adjacent-domain skills |
| Hard negatives | Maximally confusable pairs (ML vs Data Science, DevOps vs Cloud Security, ...) |

Round 2: jobs_train_hard_3_.jsonl (128 new triplets):

Targeted the 10 queries that failed the v1 evaluation (80% rank-1 accuracy baseline). New domains and patterns added:

| Category | New data highlights |
|---|---|
| Career changer fixes | cc_to_devops, cc_to_llm, cc_to_cloud_sec hard pairs |
| Hard negative fixes | hard_cloudsec_not_devops (Cloud Security vs Platform/DevOps) |
| Synonym fixes | syn_backend (alias confusion), syn_devops (platform overlap) |
| New domains | React Native, Rust/WASM, Robotics/ROS, Game Dev (Unity/Unreal), Blockchain/Solidity, UX Design, Incident Response, Pentesting, DevSecOps, Embedded Systems |
| Java sub-domain clarity | Junior / Mid / Lead / Senior Java roles with distinct negative pairs |
| SRE vs DevOps vs Platform | Finer-grained hard negatives across infra roles |
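For reference, a triplet in the training JSONL files presumably looks like the sketch below. The field names (anchor/positive/negative) are assumed from the "(anchor, positive, negative)" description above, not verified against the files:

```json
{"anchor": "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS",
 "positive": "User Skills: Python, FastAPI, Postgres, docker-compose, AWS",
 "negative": "User Skills: Kubernetes, Terraform, Helm, CI/CD, Ansible"}
```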

Fine-tuning

Loss      : MultipleNegativesRankingLoss
Epochs    : 3
Batch     : 16
Warmup    : 10 steps
Device    : CPU
Framework : sentence-transformers 5.x
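MultipleNegativesRankingLoss treats every other positive in the batch as a negative for a given anchor: it applies softmax cross-entropy over the in-batch cosine-similarity matrix, where row i's correct column is i. The numpy sketch below illustrates the computation only (it is not the training code); the scale of 20 matches the sentence-transformers default:

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch softmax cross-entropy over cosine similarities (row i's target is column i)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = (a @ p.T) * scale                        # (batch, batch) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)     # shift for numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

The loss is near zero when each anchor is closest to its own positive, and grows when an anchor is pulled toward another pair's positive, which is exactly what the hard-negative triplets above exploit.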

Export

FP32 PyTorch → ONNX FP32 (via optimum) → ONNX INT8 (dynamic quantization)


Evaluation (v1 baseline: 80% rank-1)

49 queries across 7 categories measured with InformationRetrievalEvaluator (NDCG@10):

| Category | Queries | Pass | Fail |
|---|---|---|---|
| Standard match | 18 | 18 | 0 |
| Synonyms | 5 | 3 | 2 (syn_backend, syn_devops) |
| Partial match | 5 | 5 | 0 |
| Overqualified | 2 | 1 | 1 (over_analyst → #3) |
| Single skill | 7 | 6 | 1 (single_python → #11) |
| Career changer | 4 | 0 | 4 (cc_to_devops #27, cc_to_llm #11, cc_to_cloud_sec #7, cc_to_ml #3) |
| Hard negatives | 8 | 7 | 1 (hard_cloudsec_not_devops → #5) |

Round 2 training data (jobs_train_hard_3_.jsonl) directly targets all 10 failing queries.
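A rank check of the kind behind the "#3", "#11", "#27" numbers above can be reproduced in a few lines of numpy. This is a generic illustration, not the InformationRetrievalEvaluator itself:

```python
import numpy as np

def rank_of_relevant(query_emb: np.ndarray, corpus_embs: np.ndarray, relevant_idx: int) -> int:
    """1-based rank of the known-relevant item when the corpus is sorted by similarity."""
    scores = corpus_embs @ query_emb.ravel()
    order = np.argsort(-scores)                    # indices sorted by descending score
    return int(np.where(order == relevant_idx)[0][0]) + 1
```

A query passes rank-1 when `rank_of_relevant(...) == 1`; a failure like "cc_to_ml #3" means the correct job landed at rank 3.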


Files

| File | Size | Description |
|---|---|---|
| model_quantized.onnx | ~34 MB | INT8 quantized; use this for inference |
| model.onnx | ~133 MB | FP32 ONNX; for debugging or higher precision |
| tokenizer.json | - | BGE tokenizer |
| tokenizer_config.json | - | Tokenizer config |
| vocab.txt | - | Vocabulary |
| jobs_train.jsonl | - | Round 1 base training triplets |
| jobs_train_hard.jsonl | - | Round 1 hard-negative triplets |
| jobs_train_hard_3_.jsonl | - | Round 2 training triplets (128 pairs, hard negatives) |

License

Apache 2.0; see LICENSE.
Base model (BAAI/bge-small-en-v1.5) is MIT licensed.
