sutd-bge-large-ft67

A finetuned version of BAAI/bge-large-en-v1.5 for job-description-to-module retrieval in the SUTD Course Recommendation Chatbot (MLOps Group 9).

Given a job description, this model retrieves relevant SUTD elective modules. It is used as the dense retrieval backbone in the RAG and Hybrid pipelines.

Model Details

Property	Value
Base model	`BAAI/bge-large-en-v1.5`
Embedding dimension	1024
Max sequence length	512
Similarity function	Cosine
Loss	`MultipleNegativesRankingLoss`
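
For readers unfamiliar with the loss listed in the table, here is a minimal pure-Python sketch of the idea behind `MultipleNegativesRankingLoss`: each anchor is scored against every positive in the batch with scaled cosine similarity, and cross-entropy rewards the matching pair. The scale of 20 mirrors the Sentence Transformers default; the function names and toy vectors are illustrative assumptions, not taken from the project's training code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss.

    anchors[i] should match positives[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # cross-entropy with target index i
    return total / len(anchors)

# Toy 2-d "embeddings" (hypothetical values, for illustration only).
job_vectors = [[1.0, 0.0], [0.0, 1.0]]
aligned_modules = [[0.9, 0.1], [0.1, 0.9]]
swapped_modules = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = mnr_loss(job_vectors, aligned_modules)
loss_swapped = mnr_loss(job_vectors, swapped_modules)
print(loss_aligned < loss_swapped)
```

The loss is small when each job vector sits closest to its own module vector and large when the pairing is swapped, which is the behaviour the finetuning relies on.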

Training Data

Trained on 67 hand-annotated (job description, relevant SUTD module) pairs spanning four pillars: ASD, EPD, ESD, and ISTD/CSD. Each job description is matched to one or more relevant modules. After the train/validation split and hard-negative expansion by the Sentence Transformers trainer, this yields 601 training samples and 66 validation samples.
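
As a rough illustration of the data preparation described above, the sketch below flattens one-to-many annotations into (anchor, positive) rows, the shape `MultipleNegativesRankingLoss` consumes, and splits by job description so that no job appears in both splits. The helper name, the 10% validation fraction, the seed, the column names, and the placeholder annotations are all assumptions; it does not attempt to reproduce the 601/66 counts or the hard-negative step.

```python
import random

def build_pairs(annotations, val_fraction=0.1, seed=42):
    """Flatten {job description: [modules]} into (anchor, positive) rows.

    The split is made per job description, so every row for a given job
    lands in exactly one of the two splits.
    """
    jobs = sorted(annotations)
    random.Random(seed).shuffle(jobs)
    n_val = max(1, round(len(jobs) * val_fraction))
    val_jobs = set(jobs[:n_val])

    train, val = [], []
    for job, modules in annotations.items():
        rows = [{"anchor": job, "positive": module} for module in modules]
        (val if job in val_jobs else train).extend(rows)
    return train, val

# Placeholder annotations (hypothetical, not the real dataset).
annotations = {
    "Job A": ["Module 1", "Module 2"],
    "Job B": ["Module 2"],
    "Job C": ["Module 3", "Module 4", "Module 5"],
    "Job D": ["Module 1"],
}
train_rows, val_rows = build_pairs(annotations)
print(len(train_rows), len(val_rows))
```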

A version trained on the augmented 98-pair dataset is available at henreads/sutd-bge-large-aug98.

Training Setup

Hardware: Modal A10G (24 GB VRAM)
Epochs: up to 10 with early stopping (patience 4); converged at epoch 3
Effective batch size: 16 (per-device batch 4, gradient accumulation 4)
Learning rate: 2e-5
Tracking: Weights & Biases (sutd-mlops-bge-finetune)
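
The batch-size arithmetic and the early-stopping rule in the list above can be sketched in a few lines. This assumes that "converged at epoch 3" means the best validation score occurred at epoch 3 and that validation ran once per epoch; the validation curve below is hypothetical and is not taken from the actual training logs.

```python
def run_with_early_stopping(val_scores, patience):
    """Return (best_epoch, last_epoch_run) for per-epoch validation scores.

    Epochs are 1-indexed and higher scores are better. Training stops once
    `patience` consecutive epochs fail to beat the best score so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    stale_epochs = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, stale_epochs = score, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_scores)

per_device_batch = 4
grad_accum_steps = 4
effective_batch = per_device_batch * grad_accum_steps  # 16, as listed above

# Hypothetical validation curve that peaks at epoch 3 (not real training logs).
hypothetical_scores = [0.60, 0.66, 0.70, 0.69, 0.68, 0.69, 0.67, 0.66, 0.65, 0.64]
best_epoch, last_epoch = run_with_early_stopping(hypothetical_scores, patience=4)
print(effective_batch, best_epoch, last_epoch)
```

Under these assumptions, a run whose best epoch is 3 would stop after epoch 7 rather than using all 10 epochs.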

Evaluation

Evaluated on a held-out retrieval set of 10 job descriptions, kept completely separate from the training data. Finetuning raises NDCG@10 from 0.679 (base BGE-large) to 0.747 (+0.068 absolute); the aug98 variant reaches 0.770.

Model	NDCG@10
BGE-large-en-v1.5 (base)	0.679
sutd-bge-large-ft67 (this model)	0.747
sutd-bge-large-aug98	0.770
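
For reference, NDCG@10 for a single query can be computed as in the sketch below, and the reported figure would then be the mean over the held-out jobs. This uses the linear-gain form of DCG with a log2 rank discount; the exact variant and relevance grading used in the project's evaluation are not stated here, so treat those details as assumptions. The module ids are placeholders.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_ids, relevant, k=10):
    """NDCG@k for one query.

    ranked_ids: module ids in the order the retriever returned them.
    relevant:   dict mapping module id -> relevance grade (missing ids count as 0).
    """
    gains = [relevant.get(module_id, 0) for module_id in ranked_ids]
    ideal = sorted(relevant.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Placeholder ids: two relevant modules, one irrelevant module ranked between them.
relevant = {"m1": 1, "m3": 1}
perfect = ndcg_at_k(["m1", "m3", "m2"], relevant)
imperfect = ndcg_at_k(["m1", "m2", "m3"], relevant)
print(perfect, imperfect)
```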

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("henreads/sutd-bge-large-ft67")

job_description = "Data Scientist at GovTech. Build ML models with Python..."
module_passage = "50.007 Machine Learning — Topics: supervised learning, neural networks..."

# With normalized embeddings, the dot product equals cosine similarity.
embeddings = model.encode([job_description, module_passage], normalize_embeddings=True)
similarity = embeddings[0] @ embeddings[1]
print(similarity)
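
To go from a single similarity score to retrieval, the same idea can be applied to a list of module passages and sorted. The sketch below takes the encoder as a parameter so it runs offline with a stand-in; `rank_modules`, `toy_encode`, and the toy vectors are hypothetical names and values introduced for illustration only.

```python
def rank_modules(encode, job_description, module_passages, top_k=3):
    """Return the top_k (score, passage) pairs by dot product.

    `encode` maps a list of strings to unit-length vectors, so the dot
    product equals cosine similarity.
    """
    vectors = encode([job_description] + list(module_passages))
    query, passages = vectors[0], vectors[1:]
    scored = [(float(sum(q * p for q, p in zip(query, vec))), text)
              for vec, text in zip(passages, module_passages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

# Stand-in encoder with made-up unit vectors so the sketch runs offline.
toy_vectors = {
    "data scientist": [1.0, 0.0],
    "machine learning module": [0.8, 0.6],
    "architecture studio module": [0.0, 1.0],
}

def toy_encode(texts):
    return [toy_vectors[text] for text in texts]

results = rank_modules(
    toy_encode,
    "data scientist",
    ["architecture studio module", "machine learning module"],
    top_k=1,
)
print(results[0][1])
```

With the real model, the encoder argument would be something like `lambda texts: model.encode(texts, normalize_embeddings=True)`, using the `model` loaded in the usage snippet.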

Project

Part of the SUTD Course Recommendation Chatbot — MLOps Group 9.
Code: github.com/henreads/sutd-mlops-group9

Model size: 0.3B params (Safetensors, tensor type F32)