Resume Job Fit Classifier

A cross-encoder model for predicting whether a resume is a fit for a job description.

Model Description

This model is BAAI/bge-m3 fine-tuned as a cross-encoder classifier on resume/job-description pairs. It takes a resume and a job description as input and predicts one of three classes: Good Fit, No Fit, or Potential Fit.

The input is structured as:

[CLS] resume_text [SEP] job_description_text [SEP]

The transformer's attention mechanism lets every resume token attend to every JD token in the same forward pass, making this a true comparison model rather than a bi-encoder that embeds each text independently.
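As a concrete check, the Hugging Face tokenizer builds this pair encoding automatically when both texts are passed together. A small sketch (note: bge-m3 uses an XLM-RoBERTa-style tokenizer, so the literal special tokens are `<s>` and `</s>` rather than the strings `[CLS]` and `[SEP]`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# Passing two texts produces the cross-encoder pair encoding:
# <cls> resume_tokens <sep> jd_tokens <sep>
enc = tokenizer("6 years of ML engineering experience.",
                "Requires 5+ years of ML experience.")

print(tokenizer.decode(enc["input_ids"]))
```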

Datasets

Two datasets were used for training:

  1. cnamuangtoun/resume-job-description-fit

    • Train: 5,616 pairs
    • Test: 1,759 pairs (used as evaluation benchmark)
    • Labels: Good Fit, No Fit, Potential Fit
  2. kens1ang/resume-job-fit-augmented

    • Train: 31,205 pairs
    • Labels: Good Fit, No Fit, Potential Fit

Combined training set: ~36,800 pairs

Label distribution (combined):

  • No Fit: 50.4%
  • Good Fit: 24.7%
  • Potential Fit: 24.9%

Training Details

  • Base model: BAAI/bge-m3 (570M parameters, supports up to 8192 tokens)
  • Max sequence length: 8192 tokens (resume: 4096, JD: 4000)
  • Optimizer: AdamW with layer-wise learning rates
    • Bottom layers: LR / 10
    • Top layers: full LR
    • Classifier head: full LR
  • Learning rate: 8e-6 with cosine scheduler
  • Warmup ratio: 15%
  • Batch size: 1 per device, gradient accumulation steps: 32 (effective batch: 32)
  • Epochs: 40 max with early stopping patience 6
  • Loss: Weighted CrossEntropyLoss to handle class imbalance (No Fit = 50%)
  • Sampling: WeightedRandomSampler to oversample minority classes
  • Good Fit weight boost: 2x to prioritize finding the best candidates
  • Label smoothing: 0.1
  • Dropout: 0.3 classifier, 0.15 hidden layers
  • Precision: fp16 mixed precision
  • Gradient checkpointing: enabled
  • Hardware: NVIDIA RTX 4090 (24GB VRAM)
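The class weighting, oversampling, and layer-wise learning rates above can be sketched as follows. This is a reconstruction under stated assumptions, not the exact training script: the label order (0 = Good Fit, 1 = No Fit, 2 = Potential Fit), the half-and-half bottom/top layer split, and the parameter-name matching are all assumptions.

```python
import torch
from torch import nn
from torch.utils.data import WeightedRandomSampler

# Class weights from the combined label distribution
# (Good Fit 24.7%, No Fit 50.4%, Potential Fit 24.9%),
# with the 2x Good Fit boost described above.
dist = torch.tensor([0.247, 0.504, 0.249])
class_weights = 1.0 / dist
class_weights[0] *= 2.0            # prioritize Good Fit
class_weights /= class_weights.sum()

loss_fn = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)

# WeightedRandomSampler: draw each example with probability inversely
# proportional to its class frequency, oversampling minority classes.
labels = [1, 1, 0, 2, 1, 0]        # toy label list
sample_weights = [float(1.0 / dist[y]) for y in labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))

# Layer-wise learning rates: bottom half of the encoder layers at LR/10,
# top half and the classifier head at the full LR (assumed split).
def param_groups(model, base_lr=8e-6, n_layers=24):
    bottom, top, head = [], [], []
    for name, p in model.named_parameters():
        if "classifier" in name:
            head.append(p)
        elif any(f"layer.{i}." in name for i in range(n_layers // 2)):
            bottom.append(p)
        else:
            top.append(p)
    return [
        {"params": bottom, "lr": base_lr / 10},
        {"params": top, "lr": base_lr},
        {"params": head, "lr": base_lr},
    ]

# Usage (once a model is loaded):
# optimizer = torch.optim.AdamW(param_groups(model))
```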

Results

Metric             Eval      Test
Accuracy           97.06%    54.80%
Macro F1           96.96%    52.13%
F1 (Good Fit)      97.21%    42.46%
F1 (No Fit)        97.38%    67.43%
F1 (Potential Fit) 96.30%    46.50%

Known Limitations & Open Problem

There is a significant gap between eval (97%) and test (52%) performance. After extensive experimentation, this appears to be caused by label inconsistency between the two training datasets: the augmented dataset uses different labeling criteria than the original, while the test set follows the original dataset's labeling logic. The model learns contradictory rules and fails to generalize.

Things that were tried:

  • Full fine-tuning vs frozen layers
  • 2-class (Fit/No Fit) vs 3-class classification (2 classes gave 69% test F1)
  • Layer-wise learning rates
  • Weighted loss + weighted sampling
  • Various dropout, weight decay, label smoothing values
  • Training on the original dataset only (best test F1: 69%, 2 classes)
  • Training on the combined datasets (test F1 dropped to 52%)

If you have ideas on how to overcome this gap, contributions and suggestions are welcome. Possible directions:

  • A cleaner dataset labeled consistently by human recruiters
  • A base model pretrained specifically on recruitment text (e.g. JobBERT)
  • A better data mixing strategy to handle label inconsistency between datasets
  • Confidence thresholding at inference time
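On the last point, confidence thresholding could route low-confidence predictions to a human reviewer instead of committing to a label. A hypothetical sketch (the 0.6 threshold is illustrative, not tuned on this model):

```python
import numpy as np

ID2LABEL = {0: "Good Fit", 1: "No Fit", 2: "Potential Fit"}

def predict_with_threshold(probs, threshold=0.6):
    """Return a label only when the top softmax probability clears
    the threshold; otherwise abstain and defer to a human."""
    probs = np.asarray(probs)
    top = int(probs.argmax())
    if probs[top] < threshold:
        return "Uncertain (needs human review)"
    return ID2LABEL[top]

print(predict_with_threshold([0.72, 0.18, 0.10]))  # confident: "Good Fit"
print(predict_with_threshold([0.40, 0.35, 0.25]))  # abstains
```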

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

model = AutoModelForSequenceClassification.from_pretrained("med2425/bge-resume-fit")
tokenizer = AutoTokenizer.from_pretrained("med2425/bge-resume-fit")

model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

resume = """
John Smith | Senior ML Engineer
6 years experience building production ML systems.
Skills: Python, PyTorch, TensorFlow, NLP, AWS, Docker.
Built NLP pipelines processing 10M documents/day at TechCorp (2020-Present).
Fine-tuned BERT models achieving 94% accuracy on document classification.
B.Sc. Computer Science, State University 2018.
"""

jd = """
Senior Machine Learning Engineer
Requirements: 5+ years ML experience, strong Python,
PyTorch or TensorFlow, NLP experience, production deployment on AWS/GCP/Azure,
Bachelor in Computer Science or related field.
"""

# Encode as a cross-encoder pair (resume first, then JD), truncated to 8192 tokens
inputs = tokenizer(resume, jd, return_tensors="pt", truncation=True, max_length=8192).to(device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze().tolist()

id2label = {0: "Good Fit", 1: "No Fit", 2: "Potential Fit"}
for i, p in enumerate(probs):
    print(f"{id2label[i]}: {p:.2%}")
print(f"Prediction: {id2label[np.argmax(probs)]}")

Note: Use full-length realistic resumes and job descriptions for best results. The model was trained on resumes averaging 700 words and JDs averaging 400 words. Very short inputs may produce unreliable predictions.
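Because the tokenizer call in the usage example truncates the concatenated pair as a whole, a very long resume can crowd the JD out of the window. A sketch of per-segment truncation that roughly mirrors the training budgets (resume 4096 tokens, JD 4000); re-decoding truncated ids back to text is a convenience assumption, not part of the published pipeline, and the base tokenizer is assumed to match the fine-tuned model's:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def truncate_pair(tokenizer, resume, jd, resume_max=4096, jd_max=4000):
    """Truncate each segment to its own token budget before pairing,
    so neither text can push the other out of the 8192-token window."""
    resume_ids = tokenizer(resume, add_special_tokens=False,
                           truncation=True, max_length=resume_max)["input_ids"]
    jd_ids = tokenizer(jd, add_special_tokens=False,
                       truncation=True, max_length=jd_max)["input_ids"]
    return tokenizer.decode(resume_ids), tokenizer.decode(jd_ids)

# Demo with deliberately oversized inputs
resume_t, jd_t = truncate_pair(tokenizer, "resume text " * 5000, "jd text " * 5000)
inputs = tokenizer(resume_t, jd_t, return_tensors="pt",
                   truncation=True, max_length=8192)
```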
