
Phase 6: LightGBM Reranker - Complete Handoff Document

Date: 2026-04-29 (integration complete) | 2026-05-02 (documentation finalized) | 2026-05-03 (6.1+6.2+6.3 shipped)
Status: Integration COMPLETE ✅ | 6.1+6.2 Wiring COMPLETE ✅ | 6.3 Health Endpoint COMPLETE ✅ | Tests PASSING ✅
Contributors:

  • ML Intern (Siddh via Claude Opus 4.6 on HuggingFace): Model training pipeline (scripts, data engineering, LightGBM training)
  • Antigravity (integration agent): Integration into ResearchIT app β€” reranker.py rewrite, tests, documentation

Table of Contents

  1. Executive Summary
  2. Model Provenance: Who Built What
  3. Where to Find the Model
  4. The 37-Feature Schema
  5. Model Performance
  6. How It Works (End to End)
  7. File Inventory
  8. Test Results
  9. How to Reproduce Everything
  10. Deployment Checklist
  11. Credentials & Infrastructure
  12. Known Limitations & Future Work
  13. Glossary

1. Executive Summary

Before Phase 6: Recommendations were scored by a hand-tuned heuristic with 5 features:

score = 0.40×lt_sim + 0.25×st_sim + 0.15×recency + 0.10×rrf_conf - 0.15×neg_penalty

After Phase 6: A LightGBM LambdaRank model with 37 features scores candidates. The heuristic is kept as a permanent fallback.

| Metric | Heuristic | LightGBM | Improvement |
|---|---|---|---|
| nDCG@5 | 0.182 | 0.825 | +354% |
| nDCG@10 | 0.264 | 0.879 | +233% |
| Recall@10 | 0.438 | 0.983 | +124% |
| MRR | 0.291 | 0.880 | +203% |
| Latency | - | 0.143 ms / 100 candidates | ✅ <1 ms |

Important caveat: These metrics are computed on citation pseudo-labels (cited = relevant), not real user saves. The heuristic baseline is also weakened because the EWMA features (20–22), which supply its lt_sim, st_sim, and neg_penalty terms, are zero in the training and evaluation data. Real-world improvement is expected to be smaller, although the model still has access to 37 features versus 5.
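
The pre-Phase 6 heuristic is a plain weighted sum, so it can be sketched in a few lines. This is an illustrative sketch, not the code in reranker.py: the function name heuristic_score_example and the assumption that every input is already scaled to [0, 1] are hypothetical; only the weights come from the formula above.

```python
def heuristic_score_example(lt_sim, st_sim, recency, rrf_conf, neg_penalty):
    """Weighted sum using the five weights quoted in this document.

    Assumption (not confirmed by the source): all inputs lie in [0, 1].
    """
    return (
        0.40 * lt_sim
        + 0.25 * st_sim
        + 0.15 * recency
        + 0.10 * rrf_conf
        - 0.15 * neg_penalty
    )


# Best case: every positive signal at 1.0, no negative penalty.
best = heuristic_score_example(1.0, 1.0, 1.0, 1.0, 0.0)

# Training-like case: the three profile similarities are all zero.
no_profile = heuristic_score_example(0.0, 0.0, 1.0, 1.0, 0.0)
```

Under the [0, 1] assumption, a candidate with no profile similarities can score at most 0.25, which illustrates why the baseline is handicapped on data where the EWMA features are zero.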


2. Model Provenance: Who Built What

ML Intern (Siddh, via Claude Opus 4.6 on HuggingFace)

Role: Data pipeline + model training
Platform: HuggingFace Chat (Claude Opus 4.6 sandbox)
Conversation logs: docs/ML Intern docs/ (5 files preserving the full conversation)

| Deliverable | Description |
|---|---|
| scripts/01_fetch_citation_edges.py | Semantic Scholar Batch API scraper → citations.parquet (242K edges) |
| scripts/02_generate_training_triples.py | ANN search + Turso metadata → 37-feature training data with pseudo-labels |
| scripts/03_train_lightgbm.py | LambdaRank training + evaluation + latency benchmark |
| reranker_v1.txt | The trained production model (974 KB, 141 trees) |
| Evaluation artifacts | eval_metrics.json, baseline_comparison.json, feature_importance.csv, feature_schema.json |
| HuggingFace repo | siddhm11/researchit-reranker-phase6 |

Training data summary:

  • Sampled 50,000 papers from the 1.6M corpus
  • 242,179 citation edges (in-corpus only; both papers must be in Qdrant)
  • 90,993 training triples + 7,007 eval triples (temporal split: train < 2023, eval ≥ 2023)
  • Label scheme: 2 = directly cited, 1 = co-cited, 0 = ANN-retrieved but not cited
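
The label scheme and temporal split in the list above can be sketched as follows. This is a minimal illustration, not the logic of 02_generate_training_triples.py: the function names and the "year" key on each row are hypothetical, and the source does not say which paper's year drives the split.

```python
def pseudo_label(directly_cited: bool, co_cited: bool) -> int:
    """Map citation evidence to a graded relevance label (2 / 1 / 0)."""
    if directly_cited:
        return 2
    if co_cited:
        return 1
    return 0  # retrieved by ANN search but not cited


def temporal_split(rows, cutoff_year=2023):
    """Split rows so that training data precedes evaluation data in time.

    Each row is a dict with a hypothetical "year" key.
    """
    train = [r for r in rows if r["year"] < cutoff_year]
    evaluation = [r for r in rows if r["year"] >= cutoff_year]
    return train, evaluation


rows = [{"id": "a", "year": 2021}, {"id": "b", "year": 2023}, {"id": "c", "year": 2024}]
train_rows, eval_rows = temporal_split(rows)
```

Splitting by time rather than at random keeps the evaluation set strictly later than the training set, which avoids leaking future citations into training.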

Antigravity (Integration Agent)

Role: Wire model into ResearchIT production code

| Deliverable | Description |
|---|---|
| app/recommend/reranker.py rewrite | 5 features → 37 features, LightGBM loading with heuristic fallback |
| requirements.txt update | Added lightgbm>=4.0,<5.0 |
| tests/test_reranker_integration.py | 7-test integration suite |
| tests/demo_reranker.py | Interactive demo with 20 realistic papers |
| tests/test_reranker_diversity.py fixes | Updated 3 tests from 5-feature → 37-feature schema |
| scripts/fix_model_crlf.py | Utility to fix Windows CRLF corruption in model file |
| scripts/export_arxiv_ids.py | Exports 1.6M arXiv IDs from Turso for the ML Intern |

3. Where to Find the Model

Primary location (HuggingFace)

URL: https://huggingface.co/siddhm11/researchit-reranker-phase6
Model file: production_model/reranker_v1.txt
Direct link: https://huggingface.co/siddhm11/researchit-reranker-phase6/blob/main/production_model/reranker_v1.txt

Local clone (in this repo)

Path: models/reranker-phase6/production_model/reranker_v1.txt

This directory was cloned from the HF repo and contains:

models/reranker-phase6/
├── README.md                    # Full model documentation
├── INTEGRATION_GUIDE.md         # Step-by-step integration code
├── CHANGELOG.md                 # Version history
├── load_model.py                # Quick-start loading snippet
├── production_model/
│   ├── reranker_v1.txt          ← THE MODEL (974 KB, 141 trees, 37 features)
│   ├── eval_metrics.json        # nDCG, recall, MRR, latency benchmarks
│   ├── baseline_comparison.json # LightGBM vs heuristic head-to-head
│   ├── feature_importance.csv   # All 37 features ranked by split gain
│   └── feature_schema.json      # Exact feature column order (MUST match code)
├── scripts/                     # Training pipeline (3 scripts)
│   ├── 01_fetch_citation_edges.py
│   ├── 02_generate_training_triples.py
│   └── 03_train_lightgbm.py
├── synthetic_model/             # Old proof-of-concept (ignore)
└── tests/
    └── test_full_pipeline.py

How to load the model

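import lightgbm as lgb
import numpy as np

model = lgb.Booster(model_file="models/reranker-phase6/production_model/reranker_v1.txt")
features = np.zeros((5, 37), dtype=np.float32)  # placeholder rows; real ones come from compute_features()
scores = model.predict(features)  # (N, 37) numpy array → (N,) relevance scores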

Model file properties

| Property | Value |
|---|---|
| Format | LightGBM v4 text model (plain text, no pickle) |
| Objective | lambdarank (optimizes nDCG directly) |
| Trees | 141 (early stopped from 500) |
| Leaves per tree | 63 |
| Learning rate | 0.05 |
| Features | 37 (must match feature_schema.json exactly) |
| File size | 974 KB |
| Best iteration | 141 |

4. The 37-Feature Schema

The model expects features in this exact order (defined in feature_schema.json and FEATURE_NAMES in reranker.py):

Content/Retrieval Features (0–19)

| # | Name | Source | Notes |
|---|---|---|---|
| 0 | qdrant_cosine_score | Qdrant ANN search | Raw embedding similarity |
| 1 | candidate_position | ANN rank order | 0-indexed |
| 2 | candidate_citation_count | Turso papers table | Raw count |
| 3 | candidate_log_citations | Derived | log(citation_count + 1) |
| 4 | candidate_influential_citations | Turso papers table | From Semantic Scholar |
| 5 | candidate_age_days | Turso update_date | Days since publication |
| 6 | candidate_recency_score | Derived | exp(-0.002 × age_days) |
| 7 | query_citation_count | N/A in prod | 0 (no seed paper) |
| 8 | query_age_days | N/A in prod | 0 (no seed paper) |
| 9 | year_diff | Derived | abs(current_year - paper_year) |
| 10 | same_primary_category | N/A in prod | 0 (no seed paper) |
| 11 | co_citation_count | N/A in prod | 0 (no citation graph) |
| 12 | shared_author_count | N/A in prod | 0 (no seed paper) |
| 13 | candidate_is_newer | Derived | 1 if paper_year >= current_year |
| 14 | query_log_citations | N/A in prod | 0 |
| 15 | citation_count_ratio | Derived | cand_citations / (query_citations + 1) |
| 16 | age_ratio | Derived | cand_age / (query_age + 1) |
| 17 | candidate_citations_per_year | Derived | citations / max(age_years, 0.5) |
| 18 | query_num_references | N/A in prod | 0 |
| 19 | candidate_num_cited_by | N/A in prod | 0 |
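
The derived columns in the table above follow directly from their formulas. The sketch below is illustrative only, not an excerpt from compute_features(): the function name is hypothetical, and converting days to years with 365.25 is an assumption, since the source does not say how age_years is obtained.

```python
import math


def derived_candidate_features(citation_count, age_days,
                               query_citations=0, query_age_days=0):
    """Compute the 'Derived' features 3, 6, 15, 16 and 17 for one candidate.

    The query_* arguments default to 0, matching the 'N/A in prod' rows.
    """
    age_years = age_days / 365.25  # assumption: conversion factor not given in the source
    return {
        "candidate_log_citations": math.log(citation_count + 1),
        "candidate_recency_score": math.exp(-0.002 * age_days),
        "citation_count_ratio": citation_count / (query_citations + 1),
        "age_ratio": age_days / (query_age_days + 1),
        "candidate_citations_per_year": citation_count / max(age_years, 0.5),
    }


# A brand-new paper with 99 citations.
fresh = derived_candidate_features(citation_count=99, age_days=0)

# Age at which the recency score has decayed to one half.
half_life_days = math.log(2) / 0.002
```

With a decay constant of 0.002 per day, the recency score halves roughly every 347 days, and the max(age_years, 0.5) floor keeps very new papers from producing extreme citations-per-year values.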

User Behavior Features (20–30)

| # | Name | Source | Status |
|---|---|---|---|
| 20 | ewma_longterm_similarity | profiles.load_profile("long_term") | ✅ Active |
| 21 | ewma_shortterm_similarity | profiles.load_profile("short_term") | ✅ Active |
| 22 | ewma_negative_similarity | profiles.load_profile("negative") | ✅ Active |
| 23 | cluster_importance | Ward clustering | ✅ Active when passed |
| 24 | cluster_distance_to_medoid | Ward clustering | ✅ Active when passed |
| 25 | is_suppressed_category | db.get_suppressed_categories() | ✅ Active when passed |
| 26 | onboarding_category_match | Phase 5 onboarding | Zero until wired |
| 27 | user_total_saves | interactions table | Zero until wired |
| 28 | user_total_dismissals | interactions table | Zero until wired |
| 29 | user_days_since_last_save | interactions table | Zero until wired |
| 30 | user_session_save_count | Session state | Zero until wired |

Cross Features (31–36), auto-computed

| # | Name | Formula |
|---|---|---|
| 31 | cosine_x_recency | feat[0] × feat[6] |
| 32 | cosine_x_citations | feat[0] × feat[3] |
| 33 | category_x_recency | feat[10] × feat[6] |
| 34 | cosine_x_cocitation | feat[0] × log(feat[11] + 1) |
| 35 | position_inverse | 1 / (feat[1] + 1) |
| 36 | citations_x_recency | feat[3] × feat[6] |

Key insight: Features 20–30 were ALL zero during training (no real users), so the model learned to rank without them. Because a constant feature offers nothing to split on, the current trees ignore these columns; they will only "light up" and contribute user-specific ranking signals once the model is retrained with real user data.
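
The six cross features are simple products and transforms of earlier columns, so they can be derived from a partially filled row. The following is a sketch under the formulas listed in this section; the helper name add_cross_features_example is hypothetical and the real implementation in reranker.py may be vectorized differently.

```python
import math


def add_cross_features_example(feat):
    """Return a copy of a 37-element feature row with indices 31-36 filled in.

    Indices 0-30 are expected to be populated already.
    """
    if len(feat) != 37:
        raise ValueError("expected a 37-element feature row")
    out = list(feat)
    out[31] = out[0] * out[6]                    # cosine_x_recency
    out[32] = out[0] * out[3]                    # cosine_x_citations
    out[33] = out[10] * out[6]                   # category_x_recency
    out[34] = out[0] * math.log(out[11] + 1)     # cosine_x_cocitation
    out[35] = 1.0 / (out[1] + 1)                 # position_inverse
    out[36] = out[3] * out[6]                    # citations_x_recency
    return out


row = [0.0] * 37
row[0] = 0.8   # qdrant_cosine_score
row[1] = 3     # candidate_position (0-indexed)
row[3] = 2.0   # candidate_log_citations
row[6] = 0.5   # candidate_recency_score
row[10] = 1    # same_primary_category
crossed = add_cross_features_example(row)
```

Since features 10 and 11 are zero in production, features 33 and 34 are also zero there; the other four cross features remain informative.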


5. Model Performance

Feature Importance (Top 10 by split gain)

| Rank | Feature | Importance | % of Total |
|---|---|---|---|
| 1 | candidate_num_cited_by | 75,203 | 65.2% |
| 2 | age_ratio | 7,597 | 6.6% |
| 3 | candidate_position | 6,765 | 5.9% |
| 4 | cosine_x_citations | 2,383 | 2.1% |
| 5 | qdrant_cosine_score | 2,353 | 2.0% |
| 6 | candidate_citation_count | 2,042 | 1.8% |
| 7 | citation_count_ratio | 2,001 | 1.7% |
| 8 | query_age_days | 1,749 | 1.5% |
| 9 | query_num_references | 1,726 | 1.5% |
| 10 | candidate_citations_per_year | 1,633 | 1.4% |

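Interpretation: The model promotes highly cited, recent papers over position-biased ANN ordering. Features 20–30 (user behavior) have zero importance because they were zero-filled during training; this is expected and will change after retraining with real data. Note that the top feature, candidate_num_cited_by (feature 19), is listed in Section 4 as zero-filled in production, so its training-time importance may not carry over to serving.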


6. How It Works (End to End)

At Module Import Time

reranker.py loads → tries import lightgbm
  → searches for model file in 4 locations:
      1. RERANKER_MODEL_PATH env var
      2. models/reranker-phase6/production_model/reranker_v1.txt (relative)
      3. production_model/reranker_v1.txt (relative)
      4. Absolute path computed from __file__
  → if found: loads lgb.Booster, sets _USE_LGB = True
  → if not found: prints warning, _USE_LGB = False (heuristic fallback)
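
The search order above can be sketched with the standard library alone. This is not the code in reranker.py: the function name find_model_path_example and its repo_root parameter are hypothetical, and how the real module derives the absolute path from __file__ is not stated in the source.

```python
import os
from pathlib import Path

REL_MODEL = Path("models/reranker-phase6/production_model/reranker_v1.txt")


def find_model_path_example(repo_root, env=None):
    """Return the first existing model file, or None to signal heuristic fallback.

    repo_root stands in for the directory the real module computes from __file__.
    """
    env = os.environ if env is None else env
    candidates = []
    override = env.get("RERANKER_MODEL_PATH")
    if override:
        candidates.append(Path(override))                          # 1. env var override
    candidates.append(REL_MODEL)                                   # 2. relative to the cwd
    candidates.append(Path("production_model/reranker_v1.txt"))    # 3. relative to the cwd
    candidates.append(Path(repo_root).resolve() / REL_MODEL)       # 4. absolute path
    for path in candidates:
        if path.is_file():
            return path
    return None
```

A caller would treat a None result as the signal to leave _USE_LGB false and log a warning, so a missing file degrades to the heuristic instead of raising at import time.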

At Recommendation Time

recommendations.py calls rerank_candidates(ids, embeddings, metadata, ...)
  → compute_features() builds (N, 37) feature matrix
    → Batch cosine similarities (vectorized NumPy, fast)
    → Per-candidate metadata features (citations, age, category)
    → User behavior features (EWMA, cluster, interaction counts)
    → Cross features (auto-computed from above)
  → if _USE_LGB: scores = model.predict(features)
    else: scores = heuristic_score(features)
  → Sort by scores descending
  → Return (sorted_ids, sorted_scores, sorted_embeddings)
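
The final two steps of the flow above (sort by score, return three aligned sequences) might look like the sketch below. It uses plain lists to stay dependency-free; the function name is hypothetical, and the real rerank_candidates presumably works on NumPy arrays.

```python
def sort_by_score_example(ids, scores, embeddings):
    """Reorder three parallel sequences by descending score.

    sorted() is stable, so candidates with equal scores keep their
    incoming (ANN) order.
    """
    if not (len(ids) == len(scores) == len(embeddings)):
        raise ValueError("ids, scores and embeddings must have equal length")
    order = sorted(range(len(ids)), key=lambda i: scores[i], reverse=True)
    return (
        [ids[i] for i in order],
        [scores[i] for i in order],
        [embeddings[i] for i in order],
    )


sorted_ids, sorted_scores, sorted_embs = sort_by_score_example(
    ["p1", "p2", "p3"],
    [0.2, 0.9, 0.2],
    [[1.0], [2.0], [3.0]],
)
```

Keeping ids, scores, and embeddings aligned through a single index permutation avoids the class of bug where one of the three is sorted independently of the others.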

Backward Compatibility

The existing caller in recommendations.py (line 305) does NOT need changes:

rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
)

All Phase 6 parameters are keyword-only with safe defaults. The model zero-fills missing features.
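
The backward-compatibility pattern described above (new parameters are keyword-only and default to values that zero-fill their features) can be illustrated as follows. This is a hypothetical signature, not the one in reranker.py: the parameter names qdrant_scores, cluster_importance, and suppressed_categories are taken from this document, while the "primary_category" metadata key and the returned dictionary are assumptions made for the example.

```python
def rerank_signature_example(candidate_ids, candidate_embeddings, candidate_metadata,
                             long_term_vec=None, short_term_vec=None, negative_vec=None,
                             *, qdrant_scores=None, cluster_importance=0.0,
                             suppressed_categories=None):
    """Show how omitted keyword-only arguments turn into zero-filled columns."""
    n = len(candidate_ids)
    cosine = [0.0] * n if qdrant_scores is None else list(qdrant_scores)
    suppressed = set(suppressed_categories or ())
    return {
        "qdrant_cosine_score": cosine,
        "cluster_importance": [float(cluster_importance)] * n,
        "is_suppressed_category": [
            1.0 if meta.get("primary_category") in suppressed else 0.0
            for meta in candidate_metadata
        ],
    }


meta = [{"primary_category": "cs.LG"}, {"primary_category": "cs.CV"}]

# Old-style call: only the original six arguments, everything else zero-filled.
legacy = rerank_signature_example(["a", "b"], [[0.1], [0.2]], meta, None, None, None)

# New-style call: richer context passed by keyword.
rich = rerank_signature_example(["a", "b"], [[0.1], [0.2]], meta,
                                qdrant_scores=[0.7, 0.6],
                                suppressed_categories=["cs.CV"])
```

The bare `*` in the signature means the new arguments cannot be passed positionally, so an existing six-argument call cannot bind a value to the wrong parameter by accident.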


7. File Inventory

Files modified by Phase 6

| File | Change |
|---|---|
| app/recommend/reranker.py | Complete rewrite: 181 → 473 lines, 5 → 37 features, LightGBM + heuristic |
| requirements.txt | Added lightgbm>=4.0,<5.0 |
| tests/test_reranker_diversity.py | Updated 3 tests from 5-feature → 37-feature expectations |

Files created by Phase 6

| File | Purpose |
|---|---|
| models/reranker-phase6/ | Complete model repo clone from HuggingFace |
| tests/test_reranker_integration.py | 7-test integration suite (smoke, features, E2E, latency, compat) |
| tests/demo_reranker.py | Interactive demo with 20 realistic papers |
| scripts/fix_model_crlf.py | Utility to fix Windows line-ending corruption |
| scripts/export_arxiv_ids.py | Exports 1.6M arXiv IDs from Turso |
| docs/PHASE6-HANDOFF.md | This document |
| docs/ML Intern docs/ | ML Intern conversation logs (5 files) |

8. Test Results

Integration Test Suite (7/7 PASSED)

$ python tests/test_reranker_integration.py

1. Smoke Test          ✅  141 trees, 37 features loaded
2. Feature Computation ✅  (N, 37) matrix, values verified
3. Heuristic Fallback  ✅  Scores [0.39, 0.83]
4. E2E Pipeline        ✅  50 candidates reranked via LightGBM
5. Latency Benchmark   ✅  0.143ms / 100 candidates (target: <1ms)
6. Backward Compat     ✅  Old 6-arg call works
7. LGB vs Heuristic    ✅  Top-5 overlap 1/5, Kendall τ = -0.07
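
The two agreement measures reported in test 7 can be computed as in the sketch below. It is a plain-Python illustration, not the code in test_reranker_integration.py: it assumes both rankings contain the same items with no ties and at least two items, and it implements the simple tau-a variant.

```python
from itertools import combinations


def top_k_overlap(rank_a, rank_b, k=5):
    """Number of items that appear in the top k of both rankings."""
    return len(set(rank_a[:k]) & set(rank_b[:k]))


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two orderings of the same items (no ties)."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    # combinations() yields pairs (x, y) with x ranked before y in rank_a.
    for x, y in combinations(rank_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


lgb_order = ["a", "b", "c", "d", "e", "f"]
heuristic_order = ["f", "e", "d", "c", "b", "a"]
tau_reversed = kendall_tau(lgb_order, heuristic_order)
overlap_top2 = top_k_overlap(lgb_order, heuristic_order, k=2)
```

A τ of 1 means identical orderings, -1 means fully reversed, and values near 0 (such as the -0.07 reported above) mean the two scorers order candidates almost independently of each other.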

Full Test Suite (121/121 PASSED)

$ python -m pytest tests/ -v
121 passed, 0 failed

All existing Phase 1–5 tests continue to pass with zero regressions.

How to run tests

cd ResearchIT-Final

# Set encoding for Windows emoji support
$env:PYTHONIOENCODING='utf-8'

# Run Phase 6 integration tests
python tests/test_reranker_integration.py

# Run interactive demo (20 realistic papers)
python tests/demo_reranker.py

# Run full test suite
python -m pytest tests/ -v

9. How to Reproduce Everything

Step 0: Export arXiv IDs (already done; arxiv_ids.txt exists)

python scripts/export_arxiv_ids.py
# Output: arxiv_ids.txt (1.6M lines, 18.5 MB)

Step 1: Fetch citation edges (~2 hours)

cd models/reranker-phase6/scripts
S2_API_KEY=<your_key> python 01_fetch_citation_edges.py \
    --corpus-file ../../../arxiv_ids.txt \
    --max-papers 50000
# Output: citations.parquet (242K edges)

Step 2: Generate training triples (~30 min)

python 02_generate_training_triples.py
# Requires: Qdrant + Turso access via env vars
# Output: ltr_dataset/train.parquet + eval.parquet

Step 3: Train model (~7 min)

python 03_train_lightgbm.py
# Output: production_model/reranker_v1.txt + eval_metrics.json

Step 4: Fix line endings on Windows (if needed)

# Run from the repository root (Steps 1-3 run inside models/reranker-phase6/scripts)
python scripts/fix_model_crlf.py

Note: The intermediate data files (citations.parquet, train.parquet, eval.parquet) were in the ML Intern's HuggingFace sandbox which has expired. They are fully reproducible by re-running Steps 1–3.
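
For readers unfamiliar with the line-ending issue in Step 4, the sketch below shows one way such a fix could work. It is not the contents of scripts/fix_model_crlf.py, which are not reproduced in this document; the function name is hypothetical, and the approach (byte-level replacement of CRLF with LF) is an assumption about what the utility does.

```python
from pathlib import Path


def normalize_line_endings_example(path):
    """Rewrite a file in place with LF line endings.

    Returns the number of CRLF sequences that were replaced.
    Works on bytes so no text decoding can alter the content.
    """
    p = Path(path)
    data = p.read_bytes()
    crlf_count = data.count(b"\r\n")
    if crlf_count:
        p.write_bytes(data.replace(b"\r\n", b"\n"))
    return crlf_count
```

Operating on raw bytes matters here: opening the file in text mode on Windows can silently translate line endings again on write, which would reintroduce the corruption.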


10. Deployment Checklist

  • Rewrite reranker.py with 37-feature schema
  • Add lightgbm>=4.0,<5.0 to requirements.txt
  • Integration tests passing (7/7)
  • Full test suite passing (121/121)
  • Schema alignment verified (code = JSON = model)
  • Latency verified (0.143ms < 1ms target)
  • Backward compatibility verified
  • Documentation complete
  • Commit Phase 6 changes to Git
  • Push to GitHub
  • Push model file to HF Spaces (or set RERANKER_MODEL_PATH)
  • Add lightgbm>=4.0,<5.0 to Docker image
  • Verify model loads in production: [reranker] ✅ LightGBM model loaded

11. Credentials & Infrastructure

| Credential | Env Var | Status | Used By |
|---|---|---|---|
| Qdrant Cloud | QDRANT_URL, QDRANT_API_KEY | ✅ In .env + HF | Embedding search |
| Zilliz Cloud | ZILLIZ_URI, ZILLIZ_TOKEN | ✅ In .env + HF | Sparse search |
| Turso (libSQL) | TURSO_URL, TURSO_DB_TOKEN | ✅ In .env + HF | Paper metadata |
| Groq | GROQ_API_KEY | ✅ In .env + HF | Query rewriting |
| Semantic Scholar | S2_API_KEY | ✅ In .env | Script 1 only (not needed in prod) |
| Model path | RERANKER_MODEL_PATH | Optional | Override model file location |

12. Known Limitations & Future Work

Current limitations

  1. Citation pseudo-labels ≠ real user preferences: The model was trained on "what would a researcher cite?" not "what would a user save?" These correlate but aren't identical.
  2. Features 20–30 are zero: User behavior features had no signal during training. The model works without them but will improve significantly when retrained with real data.
  3. candidate_num_cited_by dominates (65% importance): This is because citation data is the strongest signal available. With real user data, expect EWMA and interaction features to gain importance.
  4. Recommendations router still uses old call signature: The caller at recommendations.py:305 passes only the old 6 args. Phase 6 params (qdrant_scores, cluster_importance, suppressed_categories) are available but not wired yet.

Optional enhancement: Wire rich features

Update recommendations.py line 305 to pass additional context:

reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    cluster_importance=clusters[0].importance if clusters else 0.0,
    cluster_medoid=clusters[0].medoid_embedding if clusters else None,
    suppressed_categories=suppressed,
)

Future: Retraining with real user data

When you have 500+ user interactions:

  1. Export: SELECT user_id, arxiv_id, action, created_at FROM interactions
  2. Relabel: save=2, click=1, dismiss=0
  3. Re-run Script 2 with real labels → new training data
  4. Re-run Script 3 → new model
  5. Features 20–30 will gain significant importance
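
The relabel step in the list above can be sketched as a small aggregation over the exported rows. This is an illustration only: the function name is hypothetical, and keeping the strongest action per (user, paper) pair is an assumption, since the source does not say how repeated interactions should be merged.

```python
ACTION_LABELS = {"save": 2, "click": 1, "dismiss": 0}


def relabel_interactions_example(rows):
    """Turn (user_id, arxiv_id, action, created_at) rows into graded labels.

    Unknown actions are skipped. When a user interacted with the same
    paper several times, the highest label wins (assumption).
    """
    labels = {}
    for user_id, arxiv_id, action, _created_at in rows:
        if action not in ACTION_LABELS:
            continue
        key = (user_id, arxiv_id)
        labels[key] = max(labels.get(key, 0), ACTION_LABELS[action])
    return labels


exported = [
    ("u1", "2301.00001", "click", "2026-05-01"),
    ("u1", "2301.00001", "save", "2026-05-02"),
    ("u1", "2301.00002", "dismiss", "2026-05-02"),
    ("u2", "2301.00001", "view", "2026-05-03"),
]
real_labels = relabel_interactions_example(exported)
```

The resulting 2/1/0 grades line up with the citation pseudo-label scale used for the current model, so Script 3 could consume them without changing the objective.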

13. Glossary

| Term | Definition |
|---|---|
| LambdaRank | Learning-to-rank objective that optimizes nDCG by weighting each pairwise ordering error by the nDCG change from swapping the pair |
| nDCG@K | Normalized Discounted Cumulative Gain at K. 1.0 = ideal ordering, 0.0 = no relevant item in the top K |
| EWMA | Exponentially Weighted Moving Average. User profile vectors with temporal decay |
| Pseudo-labels | Using citation data as proxy for relevance (cited = relevant) |
| Cold-start | User behavior features are zero because no real users exist yet |
| Heuristic fallback | Hand-tuned scoring formula that runs when LightGBM is unavailable |
| Feature schema | The exact 37-feature order. Must match between training and inference |
| Booster | LightGBM's model class. Loaded from plain text, no pickle needed |
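
To make the nDCG@K entry concrete, here is a minimal sketch of the metric for one ranked list of graded labels. It uses the exponential gain 2^label - 1, which matches LightGBM's default label gains for lambdarank; whether 03_train_lightgbm.py evaluates with the same gain is an assumption.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain over the first k labels (rank 1 is index 0)."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels[:k]))


def ndcg_at_k(labels_in_ranked_order, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(labels_in_ranked_order, k) / ideal


perfect = ndcg_at_k([2, 1, 0], k=3)    # already in ideal order
inverted = ndcg_at_k([0, 1, 2], k=3)   # best item ranked last
```

The logarithmic discount is what makes the metric position-sensitive: moving the label-2 item from rank 1 to rank 3 halves its contribution, which is why the inverted list scores well below the perfect one.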

Phase Timeline

Phase 1   ✅  Zero-ML Recommender (Qdrant + HTMX)
Phase 2a  ✅  EWMA Profile Embeddings
Phase 2b  ✅  Ward Clustering + Multi-Interest
Phase 2c  ✅  Heuristic Re-ranking + MMR
Phase 3   ✅  Hybrid Semantic Search
Phase 3.5 ✅  Turso Metadata DB
Phase 4   ✅  Quota Fusion + Hungarian + Suppression
Phase 4.5 ✅  Instrumentation Foundation
Phase 5   ✅  Cold-Start Onboarding + UI Redesign
Phase 6   ✅  LightGBM Reranker ← COMPLETE
Phase 7   📋  Evaluation Framework (NOT STARTED)
Phase 8   📋  LLM Summaries + Distilled Reranker
Phase 9   📋  Exploration + Collaborative Filtering