team:
  name: "Ctrl Coffee Repeat"
  primary_contact_name: "Pranjal H Dohare"
  primary_contact_email: "pranjaldohare8@gmail.com"
  primary_contact_phone: "+919320480095"
  github_repository_url: "https://github.com/Pranjal1342/Intelligent-Candidate-Discovery-Ranking-System"
  sandbox_demo_url: "https://ctrl-coffee-repeat.streamlit.app/"
  members:
    - name: "Pranjal H Dohare"
      role: "Lead Developer"
    - name: "Priyanka Tiwari"
      role: "Architecture and System Design"

submission:
  version: "1.0.0"
  timestamp: "2026-07-01"
  output_file: "CTRL_COFFEE_REPEAT.csv"

system:
  pipeline_type: "Offline-Indexed Lexical Retrieval + LightGBM LambdaRank"
  hardware: "CPU-only, ≤16GB RAM"
  runtime_seconds: 4
  network_calls_during_ranking: 0

methodology_summary: |
  This system uses a deterministic, CPU-only pipeline optimized for NDCG@10 and P@5.

  Stage 1 (Retrieval): A precomputed NumPy CSR BM25 matrix (built offline, ~40 MB)
  is queried at runtime in under 0.1 seconds using two passes. Pass A expands JD
  requirements using a skill alias taxonomy (skill_aliases.json). Pass B targets
  production-context keywords (deployed, scale, serving, latency). A rare-term
  safety net retrieves candidates with niche skills (pinecone, lambdarank) that
  might otherwise be missed. Together these produce a Stage 1 pool of ~8,500
  candidates in approximately 0.03 seconds.

  Stage 2 (Features): A 22-feature, schema-grounded matrix extracts signals from
  every candidate record. It includes 5 adversarial detection functions:
  domain-category mismatch, synthetic template detection, production signal
  log-compression, LangChain dabbler detection, and CV/speech specialist detection.

  Stage 3 (Consistency): A consistency composite (c1×c2×c3×c4×c5) zeros out scores
  for timeline impossibilities, signup anomalies, salary inversions, assessment
  contradictions, and engagement mismatches.

  Stage 4 (Ranking): LightGBM with objective=lambdarank trains on relevance labels
  generated from 2,500 pairwise LLM comparisons using Gemma3:4b-it-q4_K_M (running
  offline and locally via Ollama, with zero external API calls). This explicitly
  breaks circularity: the LLM judges profiles directly, without knowledge of the
  22 features or BM25 scores, and the resulting Elo ratings are converted to 0-3
  relevance labels by quartile thresholding. Candidates with data integrity
  violations are suppressed post-inference via a consistency multiplier
  (final_score = raw_score × consistency_score).

  Stage 5 (Reasoning): A deterministic grammar engine generates fact-grounded
  reasoning with a numeric regex audit (all cited numbers must exist in the
  candidate JSON), n-gram collision avoidance (difflib.SequenceMatcher), and
  priority-ranked concern surfacing.

  Pre-submission blocking audits enforce diversity (max 25% archetype
  concentration, max 30% employer concentration) and honeypot detection
  (assert low_consistency_in_top100 < 10).

  Model comparison evidence: the heuristic-trained model required a hand-coded
  suppression list to keep non-technical profiles out of the top 100. The
  Gemma-trained model achieved 0 honeypot leakage with no suppression list. The
  two models show a Spearman correlation of 0.001 on the top-100 ranking, which is
  consistent with the LLM labels being independent of the engineered features.

ai_tools_used:
  - tool: "Google DeepMind Antigravity"
    usage: "Code scaffolding, module structure, latency diagnostics, iterative debugging"
    human_review: true
  - tool: "Gemma3:4b-it-q4_K_M via Ollama (local, offline)"
    usage: >
      Offline pairwise candidate annotation: 2,500 comparisons on a stratified
      sample of 500 Stage 1 candidates to generate non-circular LightGBM training
      labels. No candidate data is transmitted to any external service. Runs in
      experiments/pairwise_llm_check/annotate_and_retrain.py, entirely separate
      from the ranking pipeline. Exempt from the 5-minute/zero-network ranking
      budget.
    human_review: true

reference_date: "2026-01-01"