metadata
license: mit
library_name: lightgbm
tags:
  - learning-to-rank
  - lightgbm
  - lambdarank
  - recruitment
  - candidate-ranking

Intelligent Candidate Ranker (LightGBM LambdaRank)

The ranking model from the Redrob Intelligent Candidate Discovery and Ranking Challenge submission.

Given a 22-feature vector describing a candidate's fit against a job description, this model outputs a relevance score. It is trained on labels generated by 2,500 pairwise judgments from a local LLM (Gemma3) rather than hand-coded heuristics, specifically to avoid label circularity.

Model Training Architecture

Full pipeline (retrieval, feature engineering, consistency scoring, reasoning generation) lives in the GitHub repo: https://github.com/Pranjal1342/Intelligent-Candidate-Discovery-Ranking-System


This model's role in the pipeline

This model is one stage inside a larger offline candidate-ranking pipeline. It does not do retrieval, does not compute the input features itself, and does not produce the final rank on its own.

raw_score = this_model.predict(feature_vector)
final_score = raw_score * consistency_score   # applied by the host pipeline, not this model

consistency_score is a separate, multiplicative honeypot/data-integrity check computed by the host application; it is not part of this model's output.
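The multiplication above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the host pipeline's code: the `rank_candidates` helper and the candidate ids are hypothetical.

```python
def rank_candidates(candidate_ids, raw_scores, consistency_scores):
    """Return (candidate_id, final_score) pairs sorted best-first.

    final_score = raw_score * consistency_score, as described above.
    """
    if not (len(candidate_ids) == len(raw_scores) == len(consistency_scores)):
        raise ValueError("all inputs must have the same length")
    finals = [
        (cid, raw * cons)
        for cid, raw, cons in zip(candidate_ids, raw_scores, consistency_scores)
    ]
    return sorted(finals, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores: "c2" has the highest raw score but a low consistency score.
ranked = rank_candidates(["c1", "c2", "c3"], [0.5, 2.0, 1.0], [1.0, 0.1, 0.8])
```

Here "c2" drops to last place once its consistency score is applied. One caveat with a purely multiplicative penalty: if a raw score is negative, multiplying by a small consistency_score moves it toward zero rather than lowering it. The source does not say how the host pipeline handles that case.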


How to load

from huggingface_hub import hf_hub_download
import lightgbm as lgb

model_path = hf_hub_download(
    repo_id="<your-username>/intelligent-candidate-ranker",
    filename="lgbm_model.txt"
)
model = lgb.Booster(model_file=model_path)

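# Booster.predict expects a 2-D input of shape (n_candidates, 22), so reshape a single vector first
raw_score = model.predict(feature_vector.reshape(1, -1))[0]  # feature_vector: 22-dim float32 NumPy array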

Input / Output

  • Input: a 22-feature float32 vector per candidate (exact feature order below)
  • Output: a single raw relevance score (higher = more relevant). Not yet penalized for data-integrity issues; combine with a consistency_score downstream before final ranking.

Feature vector (in order)

| # | Feature | Formula / Source |
|---|---------|------------------|
| 1 | bm25_score | Stage 1 BM25 retrieval score (normalised) |
| 2 | yoe | profile.years_of_experience |
| 3 | Param_A_Systems_Depth | Fraction of career months in roles whose descriptions contain retrieval, search, or ranking keywords |
| 4 | Param_B_Availability | (recruiter_response_rate + exp(-days_inactive / 90)) / 2 |
| 5 | Param_C_Tenure | min(avg_tenure_months, 48) / 48, rewards 3+ year tenures |
| 6 | Param_D_Notice_Exp | exp(-max(0, days-30) / 30): 30d → 1.0, 60d → 0.37, 90d → 0.14, 150d → 0.018 |
| 7 | Param_E_Credibility | advanced_claimed_count / max(1, assessed_count), higher means less credible |
| 8 | Param_F_Consulting | Fraction of career at IT-services consulting firms (industry == "IT Services" AND size == "10001+") |
| 9 | Param_G_Location | Noida/Pune = 1.0, other India = 0.7, outside and willing to relocate = 0.3, outside and unwilling = 0.0 |
| 10 | Param_H_GitHub | github_activity_score / 100; 0.3 imputed when the field equals -1 (absent) |
| 11 | title_ai_fraction | Career-weighted fraction in AI, ML, or data roles via a static title taxonomy |
| 12 | prod_signal_log | Log-compressed production keyword count, -1.0 if academic-only |
| 13 | consistency_score | Multiplicative honeypot penalty, c1 × c2 × c3 × c4 × c5 (included as a training feature; also reapplied post-inference, see below) |
| 14 | hard_req_coverage | Fraction of JD hard requirements satisfied by the candidate's skill list |
| 15 | flag_consulting_only | consulting_fraction > 0.95 |
| 16 | flag_title_chaser | avg_tenure < 18 months across 3+ jobs |
| 17 | flag_langchain_dabbler | LLM-era months > 12 and pre-LLM months == 0 |
| 18 | flag_cv_specialist | CV/speech months > 24 and NLP/IR months == 0 |
| 19 | flag_title_desc_mismatch | Domain-category mismatch fraction across career history |
| 20 | flag_template_desc | Max SequenceMatcher ratio against the template registry |
| 21 | interaction_req_x_consistency | hard_req_coverage * consistency_score |
| 22 | interaction_yoe_x_prod | yoe * prod_signal_log |
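Several of the closed-form features in the table can be written directly from their formulas. The sketch below is illustrative only: the function names are hypothetical, the argument names follow the table, and the actual implementation in the host pipeline's src/features.py may differ.

```python
import math


def param_b_availability(recruiter_response_rate, days_inactive):
    # (recruiter_response_rate + exp(-days_inactive / 90)) / 2
    return (recruiter_response_rate + math.exp(-days_inactive / 90)) / 2


def param_c_tenure(avg_tenure_months):
    # min(avg_tenure_months, 48) / 48: saturates at 1.0 from 48 months upward
    return min(avg_tenure_months, 48) / 48


def param_d_notice_exp(notice_days):
    # exp(-max(0, days - 30) / 30): no penalty up to 30 days, exponential decay after
    return math.exp(-max(0, notice_days - 30) / 30)


def param_e_credibility(advanced_claimed_count, assessed_count):
    # advanced_claimed_count / max(1, assessed_count); higher means less credible
    return advanced_claimed_count / max(1, assessed_count)


def param_h_github(github_activity_score):
    # -1 marks an absent field and is imputed as 0.3
    return 0.3 if github_activity_score == -1 else github_activity_score / 100


# Print the notice-period decay at a few notice lengths (in days).
for days in (30, 60, 90, 150):
    print(days, round(param_d_notice_exp(days), 3))
```

The loop at the end makes the decay curve of Param_D_Notice_Exp easy to check against the formula: each additional 30 days of notice multiplies the value by exp(-1).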

Training

Model configuration:

  • objective: lambdarank
  • eval_at: [5, 10, 50] (the cutoffs at which NDCG is evaluated), with tuning focused on top-5 quality
  • Early stopping monitors NDCG@5, patience 30
  • Up to 200 boosting rounds (early stopping may end training sooner)
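The configuration above maps onto LightGBM's Python API roughly as follows. This is a sketch under stated assumptions, not the training script from the repo: `train_ranker` and its arguments are hypothetical, `metric: ndcg` is assumed, and every parameter not listed above is left at its LightGBM default.

```python
# Values taken from the configuration list above.
PARAMS = {
    "objective": "lambdarank",
    "metric": "ndcg",        # assumption: NDCG is the monitored metric
    "eval_at": [5, 10, 50],  # cutoffs for NDCG@k
}
NUM_BOOST_ROUND = 200
EARLY_STOPPING_ROUNDS = 30


def train_ranker(X_train, y_train, group_train, X_valid, y_valid, group_valid):
    """Train a LambdaRank booster; group_* hold the number of rows per query."""
    import lightgbm as lgb  # imported lazily so the constants can be inspected without it

    train_set = lgb.Dataset(X_train, label=y_train, group=group_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, group=group_valid, reference=train_set)
    return lgb.train(
        PARAMS,
        train_set,
        num_boost_round=NUM_BOOST_ROUND,
        valid_sets=[valid_set],
        callbacks=[
            # first_metric_only=True restricts early stopping to the first reported
            # metric, assumed here to be NDCG@5 (the first eval_at cutoff)
            lgb.early_stopping(stopping_rounds=EARLY_STOPPING_ROUNDS, first_metric_only=True),
        ],
    )
```

How the original training run split candidates into train and validation groups is not stated in this card, so the group arguments are left to the caller.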

Training labels: Gemma3 pairwise annotation (the key differentiator)

Training labels come from 2,500 pairwise LLM comparisons rather than a purely heuristic label. The comparisons use Gemma3:4b-it-q4_K_M running locally on Ollama, with zero external API calls, so the labelling is reproducible offline. A stratified sample of 500 candidates is drawn across three strata (top-100, boundary 101–300, and a broader pool with guaranteed low-consistency coverage), and each candidate is paired against roughly five random opponents (500 × 5 = 2,500 comparisons).
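The matchup step can be sketched as below. This covers only the pairing, not the stratified sampling, and makes several assumptions that the card does not confirm: opponents are drawn without replacement per candidate, the A/B slot is randomised to reduce position bias, and a fixed seed is used. The `generate_matchups` helper and the ids are hypothetical.

```python
import random


def generate_matchups(candidate_ids, matchups_per_candidate=5, seed=0):
    """Pair every candidate with randomly drawn opponents.

    Returns a list of (candidate_a, candidate_b) tuples with a != b.
    Assumes candidate_ids contains no duplicates.
    """
    if len(candidate_ids) < 2:
        raise ValueError("need at least two candidates")
    rng = random.Random(seed)  # fixed seed keeps the pairing reproducible
    pairs = []
    for anchor in candidate_ids:
        opponents = [c for c in candidate_ids if c != anchor]
        k = min(matchups_per_candidate, len(opponents))
        for opponent in rng.sample(opponents, k):
            # Assumption: randomise which slot the anchor occupies
            pair = (anchor, opponent) if rng.random() < 0.5 else (opponent, anchor)
            pairs.append(pair)
    return pairs


sample_ids = [f"cand_{i}" for i in range(500)]  # hypothetical ids for the 500-candidate sample
matchups = generate_matchups(sample_ids)
```

With 500 candidates and five draws each, this yields 2,500 pairs. Each candidate then appears in about ten comparisons on average, five as the anchor and roughly five as someone else's opponent.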

For each pair, Gemma3 reads both candidates' full structured profiles alongside the JD requirements and disqualifiers, then produces a single verdict: CANDIDATE_A, CANDIDATE_B, or TIE. Win and loss tallies convert to Elo ratings via Laplace-smoothed win rates:

win_rate = (wins + 0.5) / (total + 1)
elo = 400 * log10(win_rate / (1 - win_rate)) + 1500
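A runnable version of the two formulas, together with a tally step, might look like this. The formulas are taken from above; the `tally` helper and its treatment of TIE verdicts (counted in both totals, no win awarded) are assumptions, since the card does not say how ties are scored.

```python
import math
from collections import Counter


def tally(verdicts):
    """Count wins and matchups per candidate.

    verdicts: iterable of (candidate_a, candidate_b, verdict), where verdict is
    "CANDIDATE_A", "CANDIDATE_B", or "TIE".
    Assumption: a TIE counts toward both candidates' totals but adds no win.
    """
    wins, totals = Counter(), Counter()
    for a, b, verdict in verdicts:
        totals[a] += 1
        totals[b] += 1
        if verdict == "CANDIDATE_A":
            wins[a] += 1
        elif verdict == "CANDIDATE_B":
            wins[b] += 1
    return wins, totals


def elo_from_record(wins, total):
    # Laplace smoothing keeps win_rate strictly inside (0, 1), so the log is defined
    win_rate = (wins + 0.5) / (total + 1)
    return 400 * math.log10(win_rate / (1 - win_rate)) + 1500
```

A candidate with no recorded matchups gets a win rate of 0.5 and therefore a rating of exactly 1500. Because the rating is a strictly increasing function of the smoothed win rate, the ordering of candidates by Elo is the same as their ordering by win rate.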

Elo ratings are thresholded to 0–3 relevance labels by quartile, producing a balanced training set with roughly 125 candidates per label.
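One simple way to get balanced quartile labels is to split by rank rather than by rating value. The sketch below assumes that approach; the card does not state whether the original thresholds were rank-based or value-based, and `quartile_labels` is a hypothetical helper.

```python
def quartile_labels(elo_ratings):
    """Map Elo ratings to relevance labels 0 to 3 by rank quartile (3 = top quartile)."""
    n = len(elo_ratings)
    # Ascending order; sorted() is stable, so tied ratings keep their input order
    order = sorted(range(n), key=lambda i: elo_ratings[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * 4 // n
    return labels
```

With 500 ratings this assigns exactly 125 candidates to each label. With only a handful of matchups per candidate, many ratings will be tied, and a rank-based split breaks those ties arbitrarily; a value-based threshold would keep tied candidates together at the cost of less balanced classes.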

Why this breaks circularity: Gemma had no access to the 22 engineered features, the BM25 scores, or the penalty weights. Judging from the full profiles alone, it ranked IR-specific skills (FAISS, BM25, Qdrant, Sentence Transformers) above generic ML skills, and production-company backgrounds above consulting-only careers. LightGBM then learns how the 22 features correlate with these independent judgments, surfacing interactions that were never explicitly encoded.


Model Comparison: Heuristic vs. Gemma-Trained

The competition provides no ground-truth relevance labels, so a standard NDCG@10 ablation against a labeled holdout set cannot be computed honestly. What is available, and what is reported here, is a direct head-to-head comparison between a LightGBM model trained on the original heuristic weak label and this model (trained on Gemma3 pairwise labels), run on the same candidate pool with the same feature vectors.

Method: both trained models score the full ~8,500-candidate retrieval pool. The same post-inference consistency multiplier is applied to both before ranking, so the comparison isolates the effect of the training label, not the honeypot suppression layer.

| Metric | Result |
|--------|--------|
| Top-10 overlap between the two models | 0 of 10 candidates in common |
| Spearman rank correlation (top-100) | 0.001 (effectively uncorrelated rankings) |
| Honeypot leakage, heuristic-trained model | Required a hand-coded post-processing suppression list to keep keyword-stuffed non-technical profiles out of the top 100 |
| Honeypot leakage, Gemma-trained model (this model) | 0 of 100 candidates with consistency_score < 0.25, achieved with no post-processing suppression list |
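The three kinds of metric reported above can be computed with a few lines of plain Python. This is an illustrative sketch, not the evaluation script: the helper names are hypothetical, rankings are assumed to be best-first lists of unique candidate ids, and the Spearman helper uses the no-ties formula on a shared candidate set. The card does not state which candidate set the top-100 correlation was computed over.

```python
def top_k_overlap(ranking_a, ranking_b, k=10):
    """Number of candidates that appear in the top k of both rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))


def spearman_rho(ranking_a, ranking_b):
    """Spearman correlation between two rankings of the same candidates (no ties)."""
    if set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must contain the same candidates")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two candidates")
    pos_b = {cid: rank for rank, cid in enumerate(ranking_b)}
    d_squared = sum((rank_a - pos_b[cid]) ** 2 for rank_a, cid in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def honeypot_leakage(ranking, consistency_by_id, k=100, threshold=0.25):
    """Count top-k candidates whose consistency_score is below the threshold."""
    return sum(1 for cid in ranking[:k] if consistency_by_id[cid] < threshold)
```

Identical rankings give a Spearman value of 1.0 and exactly reversed rankings give -1.0, so a value near zero means that position in one ranking says almost nothing about position in the other.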

Qualitative before/after: prior to the Gemma retrain, the heuristic-trained model's unsuppressed top-10 surfaced profiles such as Content Writer, Project Manager, and Sales Executive. Each listed AI-sounding skills but had no underlying technical career history, because the heuristic label rewarded keyword coverage directly. After the Gemma retrain, the same pool's top-10 surfaced candidates with FAISS, BM25, Qdrant, Sentence Transformers, and Hugging Face Transformers in their skill history. The labels behind this ranking came from a model (Gemma3) that never saw bm25_score or hard_req_coverage during label generation and discovered the IR-relevance ordering independently from reading full candidate profiles.

The two models disagreeing almost completely (Spearman 0.001) is itself evidence of non-circularity: a model trained on labels derived from the same 22 features it predicts on would be expected to correlate strongly with a heuristic built from those same features, not diverge from it entirely.


Intended Use & Limitations

  • Built for a hackathon submission (Redrob Intelligent Candidate Discovery and Ranking Challenge); not validated for production hiring decisions.
  • Expects the exact 22-feature schema above, computed by the host pipeline's src/features.py. Feeding hand-built or differently-ordered features will produce meaningless scores.
  • Raw model output is not the final ranking score; it must be multiplied by a separately computed consistency_score before use.
  • Trained on a synthetic/competition candidate dataset; label distribution and feature semantics may not generalize to other candidate pools without retraining.
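Because a mis-ordered feature vector silently produces meaningless scores, a small guard that builds each row from named values can help. The sketch below is an assumption-laden illustration: the names are copied from the feature table in this card, `to_feature_row` is a hypothetical helper, and whether the saved model file stores these exact feature names is not confirmed here.

```python
# Feature names in the order given by the feature table in this card.
FEATURE_ORDER = [
    "bm25_score", "yoe", "Param_A_Systems_Depth", "Param_B_Availability",
    "Param_C_Tenure", "Param_D_Notice_Exp", "Param_E_Credibility",
    "Param_F_Consulting", "Param_G_Location", "Param_H_GitHub",
    "title_ai_fraction", "prod_signal_log", "consistency_score",
    "hard_req_coverage", "flag_consulting_only", "flag_title_chaser",
    "flag_langchain_dabbler", "flag_cv_specialist", "flag_title_desc_mismatch",
    "flag_template_desc", "interaction_req_x_consistency", "interaction_yoe_x_prod",
]


def to_feature_row(features):
    """Turn a name -> value mapping into a 22-element list in schema order."""
    missing = [name for name in FEATURE_ORDER if name not in features]
    extra = [name for name in features if name not in FEATURE_ORDER]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return [float(features[name]) for name in FEATURE_ORDER]
```

Rows built this way can be stacked into a 2-D float32 array (one row per candidate) before being passed to the model's predict call.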

AI Tool Disclosure

Gemma3:4b-it-q4_K_M (Google DeepMind, running locally via Ollama) was used offline to generate 2,500 pairwise relevance judgments on a stratified sample of 500 candidates. These judgments served as independent, non-circular training labels for this model. No candidate data was transmitted to any external service at any point.