ResearchIT Phase 6: LightGBM Reranker
Status: ✅ Production model trained and evaluated on real citation data.
Parent project: siddhm11/ResearchIT
Replaces: Hand-tuned heuristic scorer in app/recommend/reranker.py
Architecture position: LightGBM-1 in the Doc 07 multi-stage pipeline
🎯 TL;DR
A LightGBM lambdarank model that reranks arXiv paper recommendations. Trained on 242K real citation edges from Semantic Scholar across 1.6M arXiv papers in the ResearchIT corpus.
| Metric | Heuristic | LightGBM | Improvement |
|---|---|---|---|
| nDCG@5 | 0.1819 | 0.8250 | +353.6% |
| nDCG@10 | 0.2641 | 0.8791 | +232.8% |
| nDCG@20 | 0.3296 | 0.8857 | +168.7% |
| Recall@10 | 0.4384 | 0.9825 | +124.1% |
| HR@10 | 0.6638 | 1.0000 | +50.6% |
| MRR | 0.2906 | 0.8795 | +202.7% |
Latency: 0.371ms per 100 candidates (budget: <1ms) ✅
Model size: 948 KB
Verdict: ✅ DEPLOY. LightGBM improves on the heuristic for every metric in the table above.
📦 Repository Contents
researchit-reranker-phase6/
│
├── production_model/                    ← PRODUCTION ARTIFACTS
│   ├── reranker_v1.txt                  ← THE MODEL (948 KB, LightGBM text format)
│   ├── eval_metrics.json                ← Full benchmark results + training metadata
│   ├── baseline_comparison.json         ← LightGBM vs heuristic comparison
│   ├── feature_importance.csv           ← All 37 features ranked by gain
│   └── feature_schema.json              ← 37-feature schema definition (ordered)
│
├── scripts/                             ← REPRODUCIBLE PIPELINE (3 scripts)
│   ├── 01_fetch_citation_edges.py       ← S2 API → citations.parquet
│   ├── 02_generate_training_triples.py  ← Qdrant ANN + Turso → train/eval.parquet
│   └── 03_train_lightgbm.py             ← LightGBM lambdarank training + eval
│
├── synthetic_model/                     ← PROOF OF CONCEPT (synthetic data)
│   ├── reranker_v1_synthetic.txt        ← Model trained on synthetic data (286 KB)
│   └── test_results.json                ← Synthetic eval results
│
├── tests/
│   └── test_full_pipeline.py            ← Comprehensive test suite (6 categories)
│
├── INTEGRATION_GUIDE.md                 ← Step-by-step integration into ResearchIT
├── CHANGELOG.md                         ← Version history
└── README.md                            ← This file
🧠 How It Works
The Problem
ResearchIT recommends arXiv papers using a multi-stage pipeline:
Qdrant ANN retrieval → Quota fusion → Reranking → MMR diversity → Feed
The reranking step uses a hand-tuned heuristic scorer with 5 features and fixed weights:
score = 0.40 × cos(paper, long_term_profile)
      + 0.25 × cos(paper, short_term_profile)
      + 0.15 × recency_decay
      + 0.10 × retrieval_rank_confidence
      - 0.15 × cos(paper, negative_profile)
This heuristic cannot use citation counts, co-citation networks, category match, or feature interactions, and its weights were set by hand rather than learned from data.
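As a rough illustration, the fixed-weight formula above could be written as the sketch below. The function name `heuristic_score_sketch`, its signature, and the `cosine` helper are assumptions made for this example; they are not taken from app/recommend/reranker.py.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def heuristic_score_sketch(paper, long_term, short_term, negative,
                           recency_decay, rank_confidence):
    """Hypothetical re-statement of the five-term heuristic with its fixed weights."""
    return (0.40 * cosine(paper, long_term)
            + 0.25 * cosine(paper, short_term)
            + 0.15 * recency_decay
            + 0.10 * rank_confidence
            - 0.15 * cosine(paper, negative))
```

Because every weight is a constant, the score cannot adapt to signals outside these five terms, which is the gap the learned model is meant to close.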
The Solution
Train a LightGBM lambdarank model on citation-graph pseudo-labels:
- Each arXiv paper acts as a "pseudo-user": its bibliography simulates what that researcher would "save"
- Direct citations → label 2 (strong positive: this paper was important enough to cite)
- Co-citations → label 1 (weak positive: papers sharing community context)
- ANN-retrieved but not cited → label 0 (negative: topically related but not worth citing)
The model learns: given 37 features about a (user, paper) pair, which papers should rank higher?
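The labeling rule can be sketched as below. This is a minimal illustration, not the code in scripts/02_generate_training_triples.py: the `assign_label` name, the two dictionary inputs, and the choice to treat a single shared citing paper as a co-citation are all assumptions.

```python
def assign_label(query_id, candidate_id, references, citers):
    """Return a graded relevance label for a (query, candidate) pair.

    references: dict mapping a paper ID to the set of papers it cites.
    citers:     dict mapping a paper ID to the set of papers that cite it.
    """
    # Label 2: the query paper cites the candidate directly.
    if candidate_id in references.get(query_id, set()):
        return 2
    # Label 1: at least one paper cites both (assumed co-citation threshold).
    if citers.get(query_id, set()) & citers.get(candidate_id, set()):
        return 1
    # Label 0: retrieved by ANN but with no citation relationship.
    return 0
```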
Where It Fits in Doc 07
This is LightGBM-1 in the multi-stage architecture:
Qdrant ANN (Phase 2) → LightGBM-1 (THIS MODEL) → [TinyBERT → LightGBM-2] (Phase 8b, future)
Production Results
Training Data (Real Citation Graph)
| Metric | Value |
|---|---|
| Corpus size | 1,597,097 arXiv papers |
| Papers sampled for S2 API | 50,000 |
| In-corpus citation edges | 242,179 |
| Training rows | 90,993 (1,857 queries, pre-2023) |
| Eval rows | 7,007 (143 queries, 2023+) |
| Label distribution | 4.6% direct citation, 0.2% co-citation, 95.1% negative |
| Time split | Train: pre-2023, Eval: 2023+ (verified: no temporal leakage) |
Model Training
| Parameter | Value |
|---|---|
| Objective | lambdarank |
| Num boost rounds | 500 (early stopped at 141) |
| Learning rate | 0.05 |
| Num leaves | 63 |
| Min data in leaf | 50 |
| Feature fraction | 0.8 |
| Bagging fraction | 0.8 |
| Training time | ~7 minutes |
Evaluation: LightGBM vs Heuristic Baseline
The heuristic baseline uses qdrant_cosine_score as a proxy for ewma_longterm_similarity, since EWMA profiles do not exist for pseudo-users. The comparison is like-for-like in one respect: both models see the same zero-filled user features.
| Metric | Heuristic | LightGBM | Delta | % Improvement |
|---|---|---|---|---|
| nDCG@5 | 0.1819 | 0.8250 | +0.6432 | +353.6% |
| nDCG@10 | 0.2641 | 0.8791 | +0.6150 | +232.8% |
| nDCG@20 | 0.3296 | 0.8857 | +0.5561 | +168.7% |
| Recall@10 | 0.4384 | 0.9825 | +0.5442 | +124.1% |
| Recall@50 | 1.0000 | 1.0000 | 0.0000 | 0.0% |
| HR@10 | 0.6638 | 1.0000 | +0.3362 | +50.6% |
| MRR | 0.2906 | 0.8795 | +0.5889 | +202.7% |
Production Readiness
| Check | Result | Target | Status |
|---|---|---|---|
| Latency (100 candidates) | 0.371ms | <1ms | ✅ 2.7× under budget |
| Model size | 948 KB | <2 MB | ✅ |
| Model reload | Identical predictions | - | ✅ |
| Handles NaN input | Graceful | - | ✅ |
| Handles extreme values | No crash | - | ✅ |
| Best iteration | 141/500 | - | ✅ Early stopping healthy |
Feature Importance (Top 15)
| Rank | Feature | Importance (gain) | Description |
|---|---|---|---|
| 1 | candidate_num_cited_by | 75,203 | How many corpus papers cite this candidate |
| 2 | age_ratio | 7,597 | candidate_age / (query_age + 1) |
| 3 | candidate_position | 6,765 | Rank position in ANN results |
| 4 | cosine_x_citations | 2,383 | cosine × log(citations) interaction |
| 5 | qdrant_cosine_score | 2,353 | BGE-M3 cosine similarity |
| 6 | candidate_citation_count | 2,042 | Raw citation count |
| 7 | citation_count_ratio | 2,001 | candidate/query citation ratio |
| 8 | query_age_days | 1,749 | Age of the query paper |
| 9 | query_num_references | 1,726 | How many papers the query cites |
| 10 | candidate_citations_per_year | 1,633 | Citation velocity |
| 11 | candidate_influential_citations | 1,564 | S2 influential citation count |
| 12 | query_citation_count | 1,290 | Query paper's citation count |
| 13 | category_x_recency | 1,188 | category_match × recency interaction |
| 14 | citations_x_recency | 1,143 | log_citations × recency interaction |
| 15 | position_inverse | 1,108 | 1 / (position + 1) |
Key insight: candidate_num_cited_by (how many corpus papers cite this candidate) is the dominant signal, with roughly 10× the importance of the next feature (75,203 vs. 7,597 for age_ratio). It is a corpus-wide popularity signal that the heuristic cannot access.
User behavior features (20-30): All 11 have zero importance, as expected, because they are zero-filled for pseudo-labels. Once real user data is available (500+ interactions), retraining should let these features contribute.
🔬 The 37-Feature Schema
Content/Retrieval Features (0-19): active in pseudo-label training
| # | Feature | Description | Source |
|---|---|---|---|
| 0 | qdrant_cosine_score | BGE-M3 cosine similarity from ANN search | Qdrant |
| 1 | candidate_position | Rank position in ANN results (0-indexed) | Qdrant |
| 2 | candidate_citation_count | Total citation count | Turso |
| 3 | candidate_log_citations | log(citation_count + 1) | Computed |
| 4 | candidate_influential_citations | Influential citation count (S2) | Turso |
| 5 | candidate_age_days | Days since publication | Turso |
| 6 | candidate_recency_score | exp(-0.002 × age_days), matches heuristic | Computed |
| 7 | query_citation_count | Citation count of the query/user paper | Turso |
| 8 | query_age_days | Days since query paper published | Turso |
| 9 | year_diff | query_year - candidate_year | |
| 10 | same_primary_category | 1 if same primary arXiv category | Turso |
| 11 | co_citation_count | Papers citing BOTH query and candidate | Citation graph |
| 12 | shared_author_count | Shared authors (case-insensitive) | Turso |
| 13 | candidate_is_newer | 1 if candidate published after query | Computed |
| 14 | query_log_citations | log(query_citation_count + 1) | Computed |
| 15 | citation_count_ratio | candidate_citations / (query_citations + 1) | Computed |
| 16 | age_ratio | candidate_age / (query_age + 1) | Computed |
| 17 | candidate_citations_per_year | citation_count / max(age_years, 0.5) | Computed |
| 18 | query_num_references | Papers the query cites (in-corpus) | Citation graph |
| 19 | candidate_num_cited_by | Corpus papers that cite the candidate | Citation graph |
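The "Computed" rows in the table above follow directly from their formulas. The sketch below is one way to derive them; the function name, the natural-log base, and the 365.25 days-per-year conversion are assumptions, since the table does not specify them.

```python
import math


def computed_content_features(cand_citations, cand_age_days,
                              query_citations, query_age_days):
    """Derive the formula-based content features for one (query, candidate) pair."""
    # Assumption: age in years is age in days divided by 365.25.
    age_years = cand_age_days / 365.25
    return {
        "candidate_log_citations": math.log(cand_citations + 1),
        "candidate_recency_score": math.exp(-0.002 * cand_age_days),
        "query_log_citations": math.log(query_citations + 1),
        "citation_count_ratio": cand_citations / (query_citations + 1),
        "age_ratio": cand_age_days / (query_age_days + 1),
        # The 0.5-year floor keeps very new papers from getting huge velocities.
        "candidate_citations_per_year": cand_citations / max(age_years, 0.5),
    }
```

The `+ 1` terms in the denominators and logarithms keep every feature finite when a paper has zero citations or was published today.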
User Behavior Features (20-30): zero-filled for pseudo-labels, active for real users
| # | Feature | Description | Source in ResearchIT |
|---|---|---|---|
| 20 | ewma_longterm_similarity | cos(candidate, long-term EWMA profile) | profiles.py, α=0.03 |
| 21 | ewma_shortterm_similarity | cos(candidate, short-term EWMA profile) | profiles.py, α=0.40 |
| 22 | ewma_negative_similarity | cos(candidate, negative EWMA profile) | profiles.py, α=0.15 |
| 23 | cluster_importance | Importance weight of serving cluster | clustering.py |
| 24 | cluster_distance_to_medoid | cos(candidate, cluster medoid) | clustering.py |
| 25 | is_suppressed_category | 1 if category suppressed (≥3 dismissals in 14d) | db.py |
| 26 | onboarding_category_match | 1 if matches onboarding selections | db.py |
| 27 | user_total_saves | Total papers saved | interactions table |
| 28 | user_total_dismissals | Total papers dismissed | interactions table |
| 29 | user_days_since_last_save | Days since last save | interactions table |
| 30 | user_session_save_count | Saves in current session | In-memory state |
Cross Features (31-36): interaction terms
| # | Feature | Formula |
|---|---|---|
| 31 | cosine_x_recency | qdrant_cosine_score × candidate_recency_score |
| 32 | cosine_x_citations | qdrant_cosine_score × candidate_log_citations |
| 33 | category_x_recency | same_primary_category × candidate_recency_score |
| 34 | cosine_x_cocitation | qdrant_cosine_score × log(co_citation_count + 1) |
| 35 | position_inverse | 1 / (candidate_position + 1) |
| 36 | citations_x_recency | candidate_log_citations × candidate_recency_score |
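The six interaction terms can be computed from the base features as in the sketch below. The `cross_features` function and the dictionary-based row format are assumptions for illustration; only the formulas come from the table.

```python
import math


def cross_features(row):
    """Compute features 31-36 from a dict holding the base features of one pair."""
    cos = row["qdrant_cosine_score"]
    recency = row["candidate_recency_score"]
    log_cit = row["candidate_log_citations"]
    return {
        "cosine_x_recency": cos * recency,
        "cosine_x_citations": cos * log_cit,
        "category_x_recency": row["same_primary_category"] * recency,
        "cosine_x_cocitation": cos * math.log(row["co_citation_count"] + 1),
        # candidate_position is 0-indexed, so the top result maps to 1.0.
        "position_inverse": 1.0 / (row["candidate_position"] + 1),
        "citations_x_recency": log_cit * recency,
    }
```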
Reproducing the Pipeline
Prerequisites
pip install httpx pyarrow tqdm numpy qdrant-client lightgbm
Step 1: Export Corpus IDs
Export arXiv IDs from Turso:
SELECT arxiv_id FROM papers;
Save as arxiv_ids.txt (one ID per line). Our corpus: 1,597,097 IDs.
Step 2: Fetch Citation Edges (~30 min for 50K papers)
python scripts/01_fetch_citation_edges.py \
--corpus-file arxiv_ids.txt \
--output citations.parquet \
--max-papers 50000 # sample for rate limits; remove for full corpus
- Supports checkpoint/resume (safe to interrupt)
- S2 API key optional but recommended (faster rate limit)
- Filters to in-corpus edges (both papers must be in Qdrant)
- Our run: 50K papers → 242,179 in-corpus edges
Step 3: Generate Training Triples (~80 min)
python scripts/02_generate_training_triples.py \
--citations citations.parquet \
--corpus-file arxiv_ids.txt \
--qdrant-url "$QDRANT_URL" \
--qdrant-api-key "$QDRANT_API_KEY" \
--qdrant-collection arxiv_bgem3_dense \
--turso-url "$TURSO_URL" \
--turso-token "$TURSO_DB_TOKEN" \
--output-dir ./ltr_dataset \
--num-queries 2000 \
--candidates-per-query 50
- Enforces time-split: train on pre-2023, eval on 2023+
- Asserts no temporal leakage: max(train.year) < min(eval.year)
- Uses scroll() + query_points() for qdrant-client 1.17+
- Our run: 90,993 train rows + 7,007 eval rows
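The time split and its leakage check can be sketched as below. This is an illustration only: the `time_split` function, the list-of-dicts row format, and the `year` key are hypothetical, and the source does not say whether the year refers to the query paper or the candidate.

```python
def time_split(rows, cutoff_year=2023):
    """Split rows into train (before cutoff_year) and eval (cutoff_year onward)."""
    train = [r for r in rows if r["year"] < cutoff_year]
    eval_rows = [r for r in rows if r["year"] >= cutoff_year]
    # Leakage check: every training year must precede every eval year.
    if train and eval_rows:
        assert max(r["year"] for r in train) < min(r["year"] for r in eval_rows)
    return train, eval_rows
```

Splitting by a single cutoff makes the assertion hold by construction; it still serves as a guard if the split logic is later changed.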
Step 4: Train Model (~5 min)
python scripts/03_train_lightgbm.py \
--train-file ltr_dataset/train.parquet \
--eval-file ltr_dataset/eval.parquet \
--output-dir ./model_output \
--num-boost-round 500 \
--learning-rate 0.05
- Evaluates nDCG@5/10/20, Recall@10/50, HR@10, MRR
- Compares LightGBM vs exact heuristic baseline
- Reports feature importance, latency benchmark, per-query win rates
- Our run: early stopped at iteration 141, 948 KB model
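For readers unfamiliar with the ranking metrics, the sketch below shows one common way to compute nDCG@k and the reciprocal rank for a single query. It assumes the exponential gain 2^label - 1 and treats any label above 0 as relevant; the exact formulation in scripts/03_train_lightgbm.py may differ.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain over the first k labels, with gain 2^label - 1."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(labels[:k]))


def ndcg_at_k(labels_in_ranked_order, k):
    """nDCG@k for one query; labels are listed in the order the model ranked them."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    return dcg_at_k(labels_in_ranked_order, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(labels_in_ranked_order):
    """1 / rank of the first relevant item (label > 0), or 0.0 if there is none."""
    for i, rel in enumerate(labels_in_ranked_order):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0
```

Each function scores a single query; the reported figures would be averages over all eval queries, with MRR being the mean of `reciprocal_rank`.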
Integration Into ResearchIT
See INTEGRATION_GUIDE.md for the complete step-by-step guide.
Quick summary: add the following to app/recommend/reranker.py:
import lightgbm as lgb

# Load once at startup
_lgb_model = None
try:
    _lgb_model = lgb.Booster(model_file="production_model/reranker_v1.txt")
    print("[reranker] LightGBM model loaded (948 KB)")
except Exception:
    print("[reranker] LightGBM unavailable; using heuristic fallback")

# In rerank_candidates():
if _lgb_model is not None:
    features = compute_features_v2(user, candidates)  # 37-dim feature vector
    scores = _lgb_model.predict(features)
else:
    scores = heuristic_score(candidates)  # existing fallback
The heuristic scorer remains as a permanent fallback. If the model file is missing or fails to load, the system logs a message and uses the heuristic instead, with no user-facing impact.
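Because LightGBM consumes features by position, the vector passed to predict() has to follow the schema order exactly. The sketch below hard-codes the 37 names from the schema tables and zero-fills anything missing, which mirrors how the user behavior features are treated for pseudo-labels. The `to_feature_vector` helper is hypothetical; a real integration would more likely read the order from feature_schema.json.

```python
# Feature names in schema order (indices 0-36), copied from the schema tables.
FEATURE_ORDER = [
    "qdrant_cosine_score", "candidate_position", "candidate_citation_count",
    "candidate_log_citations", "candidate_influential_citations",
    "candidate_age_days", "candidate_recency_score", "query_citation_count",
    "query_age_days", "year_diff", "same_primary_category",
    "co_citation_count", "shared_author_count", "candidate_is_newer",
    "query_log_citations", "citation_count_ratio", "age_ratio",
    "candidate_citations_per_year", "query_num_references",
    "candidate_num_cited_by",
    "ewma_longterm_similarity", "ewma_shortterm_similarity",
    "ewma_negative_similarity", "cluster_importance",
    "cluster_distance_to_medoid", "is_suppressed_category",
    "onboarding_category_match", "user_total_saves", "user_total_dismissals",
    "user_days_since_last_save", "user_session_save_count",
    "cosine_x_recency", "cosine_x_citations", "category_x_recency",
    "cosine_x_cocitation", "position_inverse", "citations_x_recency",
]


def to_feature_vector(features):
    """Return a 37-element list in schema order; missing features become 0.0."""
    unknown = set(features) - set(FEATURE_ORDER)
    if unknown:
        # Fail loudly on typos rather than silently dropping a feature.
        raise KeyError(f"unknown feature names: {sorted(unknown)}")
    return [float(features.get(name, 0.0)) for name in FEATURE_ORDER]
```

Rejecting unknown names is a deliberate choice here: a misspelled key would otherwise be zero-filled and degrade rankings without any error.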
⚠️ Known Limitations
Citation ≠ User Interest
Citation pseudo-labels ("cited in bibliography") are not the same as real user signals ("saved in feed"). A foundational paper like "Attention Is All You Need" gets label=2 in citation data but might be dismissed by users who have already read it.
Mitigation: The candidate_log_citations and candidate_citations_per_year features help the model learn a popularity curve. Once 500+ real user interactions accumulate, retrain on actual save/dismiss data so that the 11 user behavior features (20-30) activate and the model learns real preferences.
Sampled Corpus (50K of 1.6M)
We sampled 50K papers for S2 API calls because of rate limiting, yielding 242K in-corpus edges. With a valid API key or the S2 bulk download, the full corpus would produce an estimated 8-10M edges. More edges mean more training data and, most likely, a better model.
S2 API Key
The provided API key returned 403 Forbidden. We ran unauthenticated at ~1.5s delay per request. A working key or the S2 bulk dataset download would be significantly faster.
Pseudo-Label Heuristic Baseline
The heuristic baseline uses qdrant_cosine_score as a proxy for ewma_longterm_similarity (feature 20), since real EWMA profiles do not exist for pseudo-users. This keeps the comparison fair, but it means the heuristic baseline's nDCG@10 (0.264) is probably lower than what the real heuristic achieves in production with actual user profiles.
🗺️ Roadmap
| Step | Description | Status |
|---|---|---|
| 1. | S2 API scraping | ✅ Done (242K edges) |
| 2. | Qdrant ANN + Turso → labeled data | ✅ Done (98K rows) |
| 3. | lambdarank + eval | ✅ Done (nDCG@10: 0.879) |
| 4. | Test suite on synthetic data | ✅ Done (6 categories pass) |
| 5. compute_features() expansion | 5→37 features in reranker.py | Next (Opus) |
| 6. Model loading + fallback | Wire lgb.Booster into reranker | Next (Opus) |
| 7. requirements.txt update | Add lightgbm>=4.0 | Next (Opus) |
| 8. Integration testing + deploy | End-to-end verification | Next (Opus) |
| 9. Real user data retrain | 500+ interactions → retrain with features 20-30 | Future |
| 10. Phase 8b: TinyBERT + LightGBM-2 | Cross-encoder reranker stage | Future |
References
- ResearchIT Doc 06 §3.1: LightGBM lambdarank architecture decision
- ResearchIT Doc 07 §A6: Time-split evaluation protocol
- ResearchIT Doc 07 §B.4: Multi-stage reranker architecture (LightGBM-1 → TinyBERT → LightGBM-2)
- PinnerSage (Pal et al., KDD 2020): Ward clustering + importance-weighted retrieval
- Taobao ULIM (Meng et al., RecSys 2025): Quota allocation, +5.54% CTR
- YouTube DNN (Xia et al., 2023): 3× gain from negative signals in reranking
- RRF Analysis (Bruch et al., SIGIR 2022): RRF optimizes Recall not nDCG
🛠️ Dependencies
lightgbm>=4.0
httpx>=0.24
pyarrow>=12.0
numpy>=1.24
qdrant-client>=1.17
tqdm>=4.65
License
This model and pipeline are part of the ResearchIT project by @siddhm11.
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
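This repository holds a LightGBM model in text format, not a transformers checkpoint, so it is loaded with lightgbm rather than AutoModelForCausalLM. The snippet assumes the file layout listed under Repository Contents and needs huggingface_hub, which is not in the dependency list above:
from huggingface_hub import hf_hub_download
import lightgbm as lgb

model_id = 'siddhm11/researchit-reranker-phase6'
# Download the model file from the Hub, then load it as a LightGBM Booster
model_path = hf_hub_download(repo_id=model_id, filename='production_model/reranker_v1.txt')
model = lgb.Booster(model_file=model_path)
To score candidates, pass model.predict() an array of shape (n_candidates, 37) with columns ordered as in feature_schema.json.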