bge-small-jobs-data-embedding

A fine-tuned and ONNX-quantized embedding model for job-to-candidate matching.
Built on BAAI/bge-small-en-v1.5, trained on a purpose-built dataset of ~1,850 job/candidate triplets covering 30+ tech domains.

The quantized INT8 ONNX model is ~4× smaller and ~2× faster on CPU than the original FP32 PyTorch model, with no measurable loss in retrieval quality (max cosine diff < 0.001).


What it does

Given a job posting and a candidate profile, both represented as short skill strings, the model embeds them into a 384-dimensional space where good matches are close and mismatches are far.

Job  → "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS"
User → "User Skills: Python, FastAPI, Postgres, docker-compose, AWS"

The model correctly ranks this user above a Go engineer, a DevOps engineer, or a data analyst, even when the user writes skill aliases such as Postgres instead of PostgreSQL or K8s instead of Kubernetes.


Quick start

Install

pip install onnxruntime transformers numpy

Load and embed

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('upply-org/bge-small-jobs-data-embedding')
session   = ort.InferenceSession('model_quantized.onnx', providers=['CPUExecutionProvider'])

def embed(texts: list[str]) -> np.ndarray:
    enc = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors='np')
    # Build the feed from the session's declared inputs, so token_type_ids is
    # included automatically if the ONNX export expects it.
    feed = {inp.name: enc[inp.name] for inp in session.get_inputs()}
    out = session.run(None, feed)
    cls = out[0][:, 0, :]   # CLS-token pooling, matching the BGE default
    return cls / np.linalg.norm(cls, axis=1, keepdims=True)

Rank candidates for a job

job = "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS"

candidates = [
    "User Skills: Python, FastAPI, Postgres, docker-compose, AWS",   # ✅ strong match
    "User Skills: Python, Django, MySQL, Redis",                      # ✅ partial match
    "User Skills: React, TypeScript, Next.js, Tailwind CSS",          # ❌ frontend
    "User Skills: Kubernetes, Terraform, Helm, CI/CD, Ansible",       # ❌ DevOps
]

job_emb  = embed([job])
cand_emb = embed(candidates)
scores   = (job_emb @ cand_emb.T).flatten()

for score, cand in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {cand[:70]}")
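In practice you usually want a shortlist rather than raw scores. The helper below is an illustrative sketch, not part of the model card's API: it assumes unit-normalized embeddings (as produced by embed above), and the 0.5 cutoff is an arbitrary starting point to tune on your own data.

```python
import numpy as np

def top_matches(job_emb, cand_embs, candidates, k=3, min_score=0.5):
    """Return up to k (candidate, score) pairs above a similarity cutoff."""
    # Embeddings are unit-normalized, so cosine similarity is a plain dot product.
    scores = cand_embs @ job_emb.ravel()
    order = np.argsort(scores)[::-1][:k]
    return [(candidates[i], float(scores[i])) for i in order if scores[i] >= min_score]
```

With the embeddings from the example above, `top_matches(job_emb, cand_emb, candidates)` returns the backend candidates and drops the frontend/DevOps profiles once their scores fall below the cutoff.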

Input format

| Field | Format |
|---|---|
| Job (anchor) | "Job Title: <title>. Required Skills: <s1>, <s2>, ..." |
| Candidate | "User Skills: <s1>, <s2>, ..." |

Skills are comma-separated. Order does not matter. The model handles common aliases (Postgres / PostgreSQL, K8s / Kubernetes, py / Python, golang / Go, etc.).
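If your data is structured (e.g. parsed job postings), small formatting helpers keep inputs consistent with this format. The two functions below are hypothetical conveniences, not part of the released code:

```python
def format_job(title: str, skills: list[str]) -> str:
    # Anchor format: "Job Title: <title>. Required Skills: <s1>, <s2>, ..."
    return f"Job Title: {title}. Required Skills: {', '.join(skills)}"

def format_candidate(skills: list[str]) -> str:
    # Candidate format: "User Skills: <s1>, <s2>, ..."
    return f"User Skills: {', '.join(skills)}"
```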


Training

Base model

BAAI/bge-small-en-v1.5: 33M parameters, 384-dim embeddings, 512-token context.

Dataset: v2 (current)

~1,850 triplets (anchor, positive, negative) across two training rounds:

Round 1 (~1,600 triplets), generated via OpenRouter (Llama 3.3 70B) across 30 batches:

| Category | Description |
|---|---|
| Core domains | Python/Java/Go backend, React/Vue frontend, DevOps/SRE, ML/NLP/MLOps, iOS/Android |
| Missing roles | C#/.NET, LLM Engineer, QA/SDET, Cloud Security, SOC Analyst, BI Developer |
| Synonym pairs | Postgres, K8s, sklearn, golang, dotnet, langchain, ... |
| Partial matches | User covers 50–65% of required skills |
| Overqualified | User lists 15–22 skills, job needs 4–6 |
| Career changers | 40% target-domain + 60% adjacent-domain skills |
| Hard negatives | Maximally confusable pairs (ML vs Data Science, DevOps vs Cloud Security, ...) |

Round 2: jobs_train_hard_3_.jsonl (128 new triplets):

Targeted the 10 queries that failed the v1 evaluation (80% rank-1 accuracy baseline). New domains and patterns added:

| Category | New data highlights |
|---|---|
| Career changer fixes | cc_to_devops, cc_to_llm, cc_to_cloud_sec hard pairs |
| Hard negative fixes | hard_cloudsec_not_devops (Cloud Security vs Platform/DevOps) |
| Synonym fixes | syn_backend (alias confusion), syn_devops (platform overlap) |
| New domains | React Native, Rust/WASM, Robotics/ROS, Game Dev (Unity/Unreal), Blockchain/Solidity, UX Design, Incident Response, Pentesting, DevSecOps, Embedded Systems |
| Java sub-domain clarity | Junior / Mid / Lead / Senior Java roles with distinct negative pairs |
| SRE vs DevOps vs Platform | Finer-grained hard negatives across infra roles |
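For reference, a triplet in the training JSONL files presumably looks like the sketch below. The field names (anchor/positive/negative) are assumed from the "(anchor, positive, negative)" description above, not verified against the files:

```json
{"anchor": "Job Title: Senior Python Engineer. Required Skills: Python, FastAPI, PostgreSQL, Docker, AWS",
 "positive": "User Skills: Python, FastAPI, Postgres, docker-compose, AWS",
 "negative": "User Skills: Kubernetes, Terraform, Helm, CI/CD, Ansible"}
```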

Fine-tuning

Loss      : MultipleNegativesRankingLoss
Epochs    : 3
Batch     : 16
Warmup    : 10 steps
Device    : CPU
Framework : sentence-transformers 5.x
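MultipleNegativesRankingLoss treats every other positive in the batch as a negative for a given anchor: it applies softmax cross-entropy over the in-batch cosine-similarity matrix, where row i's correct column is i. The numpy sketch below illustrates the computation only (it is not the training code); the scale of 20 matches the sentence-transformers default:

```python
import numpy as np

def mnr_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch softmax cross-entropy over cosine similarities (row i's target is column i)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sim = (a @ p.T) * scale                        # (batch, batch) similarity logits
    sim = sim - sim.max(axis=1, keepdims=True)     # shift for numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

The loss is near zero when each anchor is closest to its own positive, and grows when an anchor is pulled toward another pair's positive, which is exactly what the hard-negative triplets above exploit.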

Export

FP32 PyTorch → ONNX FP32 (via optimum) → ONNX INT8 (dynamic quantization)


Evaluation (v1 baseline: 80% rank-1)

49 queries across 7 categories measured with InformationRetrievalEvaluator (NDCG@10):

| Category | Queries | Pass | Fail |
|---|---|---|---|
| Standard match | 18 | 18 | 0 |
| Synonyms | 5 | 3 | 2 (syn_backend, syn_devops) |
| Partial match | 5 | 5 | 0 |
| Overqualified | 2 | 1 | 1 (over_analyst → #3) |
| Single skill | 7 | 6 | 1 (single_python → #11) |
| Career changer | 4 | 0 | 4 (cc_to_devops #27, cc_to_llm #11, cc_to_cloud_sec #7, cc_to_ml #3) |
| Hard negatives | 8 | 7 | 1 (hard_cloudsec_not_devops → #5) |

Round 2 training data (jobs_train_hard_3_.jsonl) directly targets all 10 failing queries.
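A rank check of the kind behind the "#3", "#11", "#27" numbers above can be reproduced in a few lines of numpy. This is a generic illustration, not the InformationRetrievalEvaluator itself:

```python
import numpy as np

def rank_of_relevant(query_emb: np.ndarray, corpus_embs: np.ndarray, relevant_idx: int) -> int:
    """1-based rank of the known-relevant item when the corpus is sorted by similarity."""
    scores = corpus_embs @ query_emb.ravel()
    order = np.argsort(-scores)                    # indices sorted by descending score
    return int(np.where(order == relevant_idx)[0][0]) + 1
```

A query passes rank-1 when `rank_of_relevant(...) == 1`; a failure like "cc_to_ml #3" means the correct job landed at rank 3.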


Files

| File | Size | Description |
|---|---|---|
| model_quantized.onnx | ~34 MB | INT8 quantized; use this for inference |
| model.onnx | ~133 MB | FP32 ONNX; for debugging or higher precision |
| tokenizer.json | - | BGE tokenizer |
| tokenizer_config.json | - | Tokenizer config |
| vocab.txt | - | Vocabulary |
| jobs_train.jsonl | - | Round 1 base training triplets |
| jobs_train_hard.jsonl | - | Round 1 hard-negative triplets |
| jobs_train_hard_3_.jsonl | - | Round 2 training triplets (128 pairs, hard negatives) |

License

Apache 2.0; see LICENSE.
Base model (BAAI/bge-small-en-v1.5) is MIT licensed.
