Resume Job Fit Classifier

A cross-encoder model for predicting whether a resume is a fit for a job description.

Model Description

This model is BAAI/bge-m3 fine-tuned as a cross-encoder classifier on resume/job-description pairs. It takes a resume and a job description as input and predicts one of three classes: Good Fit, No Fit, or Potential Fit.

The input is structured as:

[CLS] resume_text [SEP] job_description_text [SEP]

The transformer's attention mechanism lets every resume token attend to every JD token in the same forward pass, making this a true comparison model rather than a bi-encoder that embeds each text independently.
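As a concrete check, the Hugging Face tokenizer builds this pair encoding automatically when both texts are passed together. A small sketch (note: bge-m3 uses an XLM-RoBERTa-style tokenizer, so the literal special tokens are `<s>` and `</s>` rather than the strings `[CLS]` and `[SEP]`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# Passing two texts produces the cross-encoder pair encoding:
# <cls> resume_tokens <sep> jd_tokens <sep>
enc = tokenizer("6 years of ML engineering experience.",
                "Requires 5+ years of ML experience.")

print(tokenizer.decode(enc["input_ids"]))
```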

Datasets

Two datasets were used for training:

  1. cnamuangtoun/resume-job-description-fit

    • Train: 5,616 pairs
    • Test: 1,759 pairs (used as evaluation benchmark)
    • Labels: Good Fit, No Fit, Potential Fit
  2. kens1ang/resume-job-fit-augmented

    • Train: 31,205 pairs
    • Labels: Good Fit, No Fit, Potential Fit

Combined training set: ~36,800 pairs

Label distribution (combined):

  • No Fit: 50.4%
  • Good Fit: 24.7%
  • Potential Fit: 24.9%

Training Details

  • Base model: BAAI/bge-m3 (570M parameters, supports up to 8192 tokens)
  • Max sequence length: 8192 tokens (resume: 4096, JD: 4000)
  • Optimizer: AdamW with layer-wise learning rates
    • Bottom layers: LR / 10
    • Top layers: full LR
    • Classifier head: full LR
  • Learning rate: 8e-6 with cosine scheduler
  • Warmup ratio: 15%
  • Batch size: 1 per device, gradient accumulation steps: 32 (effective batch: 32)
  • Epochs: 40 max with early stopping patience 6
  • Loss: Weighted CrossEntropyLoss to handle class imbalance (No Fit = 50%)
  • Sampling: WeightedRandomSampler to oversample minority classes
  • Good Fit weight boost: 2x to prioritize finding the best candidates
  • Label smoothing: 0.1
  • Dropout: 0.3 classifier, 0.15 hidden layers
  • Precision: fp16 mixed precision
  • Gradient checkpointing: enabled
  • Hardware: NVIDIA RTX 4090 (24GB VRAM)
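The class weighting, oversampling, and layer-wise learning rates above can be sketched as follows. This is a reconstruction under stated assumptions, not the exact training script: the label order (0 = Good Fit, 1 = No Fit, 2 = Potential Fit), the half-and-half bottom/top layer split, and the parameter-name matching are all assumptions.

```python
import torch
from torch import nn
from torch.utils.data import WeightedRandomSampler

# Class weights from the combined label distribution
# (Good Fit 24.7%, No Fit 50.4%, Potential Fit 24.9%),
# with the 2x Good Fit boost described above.
dist = torch.tensor([0.247, 0.504, 0.249])
class_weights = 1.0 / dist
class_weights[0] *= 2.0            # prioritize Good Fit
class_weights /= class_weights.sum()

loss_fn = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)

# WeightedRandomSampler: draw each example with probability inversely
# proportional to its class frequency, oversampling minority classes.
labels = [1, 1, 0, 2, 1, 0]        # toy label list
sample_weights = [float(1.0 / dist[y]) for y in labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels))

# Layer-wise learning rates: bottom half of the encoder layers at LR/10,
# top half and the classifier head at the full LR (assumed split).
def param_groups(model, base_lr=8e-6, n_layers=24):
    bottom, top, head = [], [], []
    for name, p in model.named_parameters():
        if "classifier" in name:
            head.append(p)
        elif any(f"layer.{i}." in name for i in range(n_layers // 2)):
            bottom.append(p)
        else:
            top.append(p)
    return [
        {"params": bottom, "lr": base_lr / 10},
        {"params": top, "lr": base_lr},
        {"params": head, "lr": base_lr},
    ]

# Usage (once a model is loaded):
# optimizer = torch.optim.AdamW(param_groups(model))
```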

Results

Metric             Eval      Test
Accuracy           97.06%    54.80%
Macro F1           96.96%    52.13%
F1 (Good Fit)      97.21%    42.46%
F1 (No Fit)        97.38%    67.43%
F1 (Potential Fit) 96.30%    46.50%

Known Limitations & Open Problem

There is a significant gap between eval (97%) and test (52%) performance. After extensive experimentation, this appears to be caused by label inconsistency between the two training datasets: the augmented dataset uses different labeling criteria than the original, while the test set follows the original dataset's labeling logic. The model learns contradictory rules and fails to generalize.

Things that were tried:

  • Full fine-tuning vs frozen layers
  • 2-class (Fit/No Fit) vs 3-class classification (2 classes gave 69% test F1)
  • Layer-wise learning rates
  • Weighted loss + weighted sampling
  • Various dropout, weight decay, label smoothing values
  • Training on the original dataset only (best test F1: 69%, 2 classes)
  • Training on the combined datasets (test F1 dropped to 52%)

If you have ideas on how to overcome this gap, contributions and suggestions are welcome. Possible directions:

  • A cleaner dataset labeled consistently by human recruiters
  • A base model pretrained specifically on recruitment text (e.g. JobBERT)
  • A better data mixing strategy to handle label inconsistency between datasets
  • Confidence thresholding at inference time
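On the last point, confidence thresholding could route low-confidence predictions to a human reviewer instead of committing to a label. A hypothetical sketch (the 0.6 threshold is illustrative, not tuned on this model):

```python
import numpy as np

ID2LABEL = {0: "Good Fit", 1: "No Fit", 2: "Potential Fit"}

def predict_with_threshold(probs, threshold=0.6):
    """Return a label only when the top softmax probability clears
    the threshold; otherwise abstain and defer to a human."""
    probs = np.asarray(probs)
    top = int(probs.argmax())
    if probs[top] < threshold:
        return "Uncertain (needs human review)"
    return ID2LABEL[top]

print(predict_with_threshold([0.72, 0.18, 0.10]))  # confident: "Good Fit"
print(predict_with_threshold([0.40, 0.35, 0.25]))  # abstains
```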

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

model = AutoModelForSequenceClassification.from_pretrained("med2425/bge-resume-fit")
tokenizer = AutoTokenizer.from_pretrained("med2425/bge-resume-fit")

model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

resume = """
John Smith | Senior ML Engineer
6 years experience building production ML systems.
Skills: Python, PyTorch, TensorFlow, NLP, AWS, Docker.
Built NLP pipelines processing 10M documents/day at TechCorp (2020-Present).
Fine-tuned BERT models achieving 94% accuracy on document classification.
B.Sc. Computer Science, State University 2018.
"""

jd = """
Senior Machine Learning Engineer
Requirements: 5+ years ML experience, strong Python,
PyTorch or TensorFlow, NLP experience, production deployment on AWS/GCP/Azure,
Bachelor in Computer Science or related field.
"""

# Encode as a cross-encoder pair (resume first, then JD), truncated to 8192 tokens
inputs = tokenizer(resume, jd, return_tensors="pt", truncation=True, max_length=8192).to(device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze().tolist()

id2label = {0: "Good Fit", 1: "No Fit", 2: "Potential Fit"}
for i, p in enumerate(probs):
    print(f"{id2label[i]}: {p:.2%}")
print(f"Prediction: {id2label[np.argmax(probs)]}")

Note: Use full-length realistic resumes and job descriptions for best results. The model was trained on resumes averaging 700 words and JDs averaging 400 words. Very short inputs may produce unreliable predictions.
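Because the tokenizer call in the usage example truncates the concatenated pair as a whole, a very long resume can crowd the JD out of the window. A sketch of per-segment truncation that roughly mirrors the training budgets (resume 4096 tokens, JD 4000); re-decoding truncated ids back to text is a convenience assumption, not part of the published pipeline, and the base tokenizer is assumed to match the fine-tuned model's:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def truncate_pair(tokenizer, resume, jd, resume_max=4096, jd_max=4000):
    """Truncate each segment to its own token budget before pairing,
    so neither text can push the other out of the 8192-token window."""
    resume_ids = tokenizer(resume, add_special_tokens=False,
                           truncation=True, max_length=resume_max)["input_ids"]
    jd_ids = tokenizer(jd, add_special_tokens=False,
                       truncation=True, max_length=jd_max)["input_ids"]
    return tokenizer.decode(resume_ids), tokenizer.decode(jd_ids)

# Demo with deliberately oversized inputs
resume_t, jd_t = truncate_pair(tokenizer, "resume text " * 5000, "jd text " * 5000)
inputs = tokenizer(resume_t, jd_t, return_tensors="pt",
                   truncation=True, max_length=8192)
```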
