	team:
	name: "Ctrl Coffee Repeat"
	primary_contact_name: "Pranjal H Dohare"
	primary_contact_email: "pranjaldohare8@gmail.com"
	primary_contact_phone: "+919320480095"
	github_repository_url: "https://github.com/Pranjal1342/Intelligent-Candidate-Discovery-Ranking-System"
	sandbox_demo_url: "https://ctrl-coffee-repeat.streamlit.app/"
	members:
	- name: "Pranjal H Dohare"
	role: "Lead Developer"
	- name: "Priyanka Tiwari"
	role: "Architecture and System Design"

	submission:
	version: "1.0.0"
	timestamp: "2026-07-01"
	output_file: "CTRL_COFFEE_REPEAT.csv"

	system:
	pipeline_type: "Offline-Indexed Lexical Retrieval + LightGBM LambdaRank"
	hardware: "CPU-only, ≤16GB RAM"
	runtime_seconds: 4
	network_calls_during_ranking: 0

	methodology_summary: |
	This system uses a deterministic, CPU-only pipeline optimized for NDCG@10 and P@5.

	Stage 1 (Retrieval): A precomputed NumPy CSR BM25 matrix (built offline, ~40 MB) is queried
	at runtime in under 0.1 seconds via a dual-pass query: Pass A expands job description (JD)
	requirements using a skill alias taxonomy (skill_aliases.json), and Pass B targets
	production-context keywords (deployed, scale, serving, latency). A rare-term safety net
	retrieves candidates with niche skills (pinecone, lambdarank) that might otherwise be missed.
	This produces a Stage 1 pool of roughly 8,500 candidates in approximately 0.03 seconds.
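
	The dual-pass lookup above can be sketched roughly as follows. This is a minimal illustration
	using only the Python standard library: a dictionary of postings stands in for the CSR matrix,
	and the toy corpus, the ALIASES map, and the function names are assumptions, not the
	project's actual code. The rare-term safety net is not shown.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (assumption); the real pool is the offline-indexed candidate set.
DOCS = {
    "c1": "deployed pytorch model serving at scale with low latency",
    "c2": "torch research prototype for image classification",
    "c3": "sales manager with crm experience",
}
# Hypothetical stand-in for skill_aliases.json.
ALIASES = {"pytorch": ["torch"]}
# Production-context keywords named in the summary.
PRODUCTION_TERMS = ["deployed", "scale", "serving", "latency"]


def build_index(docs, k1=1.5, b=0.75):
    """Offline step: precompute BM25 weights as term -> {doc_id: weight}."""
    tokens = {doc_id: text.split() for doc_id, text in docs.items()}
    avg_len = sum(len(t) for t in tokens.values()) / len(tokens)
    df = Counter(term for t in tokens.values() for term in set(t))
    index = defaultdict(dict)
    for doc_id, toks in tokens.items():
        for term, tf in Counter(toks).items():
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            index[term][doc_id] = idf * tf * (k1 + 1) / norm
    return index


def query(index, terms, top_k):
    """Runtime step: sum the precomputed weights of the query terms."""
    scores = Counter()
    for term in terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return [doc_id for doc_id, _ in scores.most_common(top_k)]


def dual_pass(index, jd_terms, top_k=2):
    """Pass A: alias-expanded JD terms. Pass B: production keywords."""
    expanded = list(jd_terms) + [a for t in jd_terms for a in ALIASES.get(t, [])]
    pass_a = query(index, expanded, top_k)
    pass_b = query(index, PRODUCTION_TERMS, top_k)
    # Union of both passes, keeping first-seen order.
    return list(dict.fromkeys(pass_a + pass_b))


INDEX = build_index(DOCS)
pool = dual_pass(INDEX, ["pytorch"])
```

	In this toy run, the alias expansion lets the "torch" profile into the pool even though the
	query only names "pytorch", while the unrelated sales profile is never retrieved.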

	Stage 2 (Features): A 22-feature, schema-grounded matrix extracts signals from every candidate
	record. It includes five adversarial detection functions: domain-category mismatch, synthetic
	template detection, production signal log-compression, LangChain dabbler detection, and
	CV/speech specialist detection.

	Stage 3 (Consistency): A consistency composite (c1×c2×c3×c4×c5) zeros out scores for timeline
	impossibilities, signup anomalies, salary inversions, assessment contradictions, and
	engagement mismatches.
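
	A minimal sketch of the consistency composite, assuming each factor is a number in [0, 1]
	where 0.0 marks a failed check. The check names and function names here are illustrative,
	not taken from the project.

```python
import math


def consistency_score(checks):
    """Multiply the per-check factors; a single 0.0 zeros the composite."""
    return math.prod(checks.values())


def final_score(raw_score, checks):
    """Scale the ranker's raw score by the consistency composite."""
    return raw_score * consistency_score(checks)


# Hypothetical factor names for c1..c5.
clean = {"timeline": 1.0, "signup": 1.0, "salary": 1.0,
         "assessment": 1.0, "engagement": 1.0}
flagged = dict(clean, timeline=0.0)  # one timeline impossibility
```

	Because the composite is a product, one failed check is enough to suppress a candidate,
	however strong the other signals are.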

	Stage 4 (Ranking): LightGBM with objective=lambdarank trains on relevance labels generated
	via 2,500 pairwise LLM comparisons using Gemma3:4b-it-q4_K_M (running offline and locally
	via Ollama, with zero external API calls). This explicitly breaks circularity: the LLM judges
	profiles without knowledge of the 22 features or BM25 scores, and the resulting Elo ratings
	are then converted to 0-3 relevance labels by quartile thresholding. Candidates with data
	integrity violations are suppressed post-inference via a consistency multiplier
	(final_score = raw_score × consistency_score).
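
	The label-generation step can be illustrated with a small sketch that turns pairwise outcomes
	into Elo ratings and then into 0-3 labels by quartile. The K-factor, the starting rating, and
	the function names are assumptions; the summary does not state the values actually used.

```python
import statistics


def elo_ratings(comparisons, k=32.0, start=1000.0):
    """Update Elo ratings from (winner, loser) pairs, in order."""
    ratings = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, start)
        rl = ratings.setdefault(loser, start)
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


def quartile_labels(ratings):
    """Map each rating to 0-3 by counting the quartile cut points it exceeds."""
    cuts = statistics.quantiles(ratings.values(), n=4)
    return {cand: sum(r > q for q in cuts) for cand, r in ratings.items()}


# Toy judgments (assumption): "a" beats everyone, "d" loses to everyone.
pairs = [("a", "b"), ("a", "c"), ("a", "d"),
         ("b", "c"), ("b", "d"), ("c", "d")]
labels = quartile_labels(elo_ratings(pairs))
```

	Labels of this form, grouped per query, are the kind of graded relevance target that a
	LightGBM ranker with objective=lambdarank expects.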

	Stage 5 (Reasoning): A deterministic grammar engine generates fact-grounded reasoning with a
	numeric regex audit (all cited numbers must exist in the candidate JSON), n-gram collision
	avoidance (difflib.SequenceMatcher), and priority-ranked concern surfacing. Pre-submission
	blocking audits enforce diversity (max 25% archetype concentration, max 30% employer
	concentration) and honeypot detection (assert low_consistency_in_top100 < 10).
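
	The audits above could look roughly like the sketch below. The function names, the sample
	record, and the 0.9 similarity threshold are assumptions made for illustration; only the use
	of a regex, difflib.SequenceMatcher, and concentration caps comes from the summary.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(value):
    """Collect every numeric token found anywhere in a JSON-like structure."""
    if isinstance(value, dict):
        return set().union(*(numbers_in(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(numbers_in(v) for v in value))
    return set(NUMBER.findall(str(value)))


def uncited_numbers(reasoning, candidate):
    """Numbers cited in the reasoning that do not appear in the record."""
    return sorted(set(NUMBER.findall(reasoning)) - numbers_in(candidate))


def too_similar(a, b, threshold=0.9):
    """Flag two reasoning strings that are near-duplicates of each other."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def max_share(values):
    """Largest fraction of the list taken by any single value."""
    return max(Counter(values).values()) / len(values)


# Hypothetical candidate record.
candidate = {"years_experience": 7,
             "skills": ["python", "lightgbm"],
             "projects": [{"users": 1200}]}
ok = uncited_numbers("7 years of experience; served 1200 users", candidate)
bad = uncited_numbers("9 years of experience", candidate)
share = max_share(["ml", "ml", "data", "web"])
```

	A blocking audit would then fail the submission when, for example, max_share over the
	top-ranked archetypes exceeds 0.25. Note that this sketch compares numbers as strings, so
	"7" and "7.0" would not match.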

	Model comparison evidence: The heuristic-trained model required a hand-coded suppression list
	to keep non-technical profiles out of the top 100. The Gemma-trained model achieved 0 honeypot
	leakage with no suppression list, and the two models show a Spearman correlation of 0.001 on
	the top-100 ranking, indicating that the LLM labels are genuinely independent of the
	engineered features.
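
	For reference, a Spearman correlation between two rankings of the same items can be computed
	as in the sketch below. It assumes both models rank an identical candidate set with no ties;
	how the two top-100 lists were actually aligned is not stated in this summary.

```python
def spearman_from_ranks(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items."""
    n = len(rank_a)
    d2 = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy rankings (assumption): position 1 is best.
model_x = {"a": 1, "b": 2, "c": 3, "d": 4}
model_y = {"a": 4, "b": 3, "c": 2, "d": 1}
```

	A value near 0 means the two orderings are essentially unrelated, while 1.0 and -1.0 mean
	identical and exactly reversed orderings.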

	ai_tools_used:
	- tool: "Google DeepMind Antigravity"
	usage: "Code scaffolding, module structure, latency diagnostics, iterative debugging"
	human_review: true
	- tool: "Gemma3:4b-it-q4_K_M via Ollama (local, offline)"
	usage: >
	Offline pairwise candidate annotation: 2,500 comparisons on a stratified sample of
	500 Stage 1 candidates to generate non-circular LightGBM training labels.
	No candidate data is transmitted to any external service. The annotation runs in
	experiments/pairwise_llm_check/annotate_and_retrain.py, entirely separate from
	the ranking pipeline, and is exempt from the 5-minute/zero-network ranking budget.
	human_review: true

	reference_date: "2026-01-01"