# Phase 6: LightGBM Reranker - Complete Handoff Document

> **Date**: 2026-04-29 (integration complete) | 2026-05-02 (documentation finalized) | 2026-05-03 (6.1+6.2+6.3 shipped)  
> **Status**: Integration COMPLETE ✅ | 6.1+6.2 Wiring COMPLETE ✅ | 6.3 Health Endpoint COMPLETE ✅ | Tests PASSING ✅  
> **Contributors**:
> - **ML Intern** (Siddh via Claude Opus 4.6 on HuggingFace): Model training pipeline (scripts, data engineering, LightGBM training)
> - **Antigravity** (integration agent): Integration into ResearchIT app (reranker.py rewrite, tests, documentation)

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Model Provenance: Who Built What](#2-model-provenance)
3. [Where to Find the Model](#3-where-to-find-the-model)
4. [The 37-Feature Schema](#4-the-37-feature-schema)
5. [Model Performance](#5-model-performance)
6. [How It Works (End to End)](#6-how-it-works)
7. [File Inventory](#7-file-inventory)
8. [Test Results](#8-test-results)
9. [How to Reproduce Everything](#9-how-to-reproduce)
10. [Deployment Checklist](#10-deployment-checklist)
11. [Credentials & Infrastructure](#11-credentials)
12. [Known Limitations & Future Work](#12-limitations)
13. [Glossary](#13-glossary)

---

## 1. Executive Summary

**Before Phase 6**: Recommendations were scored by a hand-tuned heuristic with 5 features:
```
score = 0.40×lt_sim + 0.25×st_sim + 0.15×recency + 0.10×rrf_conf - 0.15×neg_penalty
```
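
As a minimal sketch, the heuristic above can be written as a plain function. The name `heuristic_score` and its argument names are illustrative and only mirror the terms of the formula; the production implementation in `reranker.py` may be structured differently.

```python
def heuristic_score(lt_sim, st_sim, recency, rrf_conf, neg_penalty):
    """Hand-tuned 5-feature score, using the weights from the formula above."""
    return (
        0.40 * lt_sim
        + 0.25 * st_sim
        + 0.15 * recency
        + 0.10 * rrf_conf
        - 0.15 * neg_penalty
    )


# Hypothetical candidate: strong long-term match, mild negative similarity.
score = heuristic_score(lt_sim=0.8, st_sim=0.6, recency=0.5, rrf_conf=0.4, neg_penalty=0.2)
```

The negative-profile term is the only one subtracted, so a candidate that resembles dismissed papers is pushed down regardless of its other scores.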

**After Phase 6**: A LightGBM LambdaRank model with 37 features scores candidates. The heuristic is kept as a permanent fallback.

| Metric | Heuristic | LightGBM | Improvement |
|--------|-----------|----------|-------------|
| nDCG@5 | 0.182 | 0.825 | **+354%** |
| nDCG@10 | 0.264 | 0.879 | **+233%** |
| Recall@10 | 0.438 | 0.983 | **+124%** |
| MRR | 0.291 | 0.880 | **+203%** |
| Latency | n/a | 0.143ms/100 candidates | ✅ <1ms |

> **Important caveat**: These metrics are computed on citation pseudo-labels (cited=relevant), not real user saves. The heuristic baseline is also weakened because EWMA features (20–22) are zero during training. Real-world improvement is expected to be smaller, though likely still substantial, since the model uses 37 features versus 5.
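
For readers unfamiliar with the ranking metrics in the table, the sketch below shows one common way to compute them for a single query. It assumes the exponential gain `2^label - 1` and treats any label above 0 as relevant; the evaluation code in `03_train_lightgbm.py` may use different conventions, and the reported numbers are averages over many queries.

```python
import math


def dcg_at_k(labels, k):
    """DCG over the top-k graded labels, with exponential gain 2^label - 1."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels, k):
    """labels: relevance grades in the order the ranker returned them."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def mrr(labels):
    """Reciprocal rank of the first relevant (label > 0) item, else 0."""
    for rank, rel in enumerate(labels, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(labels, k):
    """Share of all relevant items that appear in the top k."""
    total = sum(1 for rel in labels if rel > 0)
    return sum(1 for rel in labels[:k] if rel > 0) / total if total else 0.0


ranked = [0, 2, 1, 0, 0]  # hypothetical labels for one query, in ranked order
```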

---

## 2. Model Provenance: Who Built What

### ML Intern (Siddh, via Claude Opus 4.6 on HuggingFace)

**Role**: Data pipeline + model training  
**Platform**: HuggingFace Chat (Claude Opus 4.6 sandbox)  
**Conversation logs**: `docs/ML Intern docs/` (5 files preserving the full conversation)

| Deliverable | Description |
|-------------|-------------|
| `scripts/01_fetch_citation_edges.py` | Semantic Scholar Batch API scraper → `citations.parquet` (242K edges) |
| `scripts/02_generate_training_triples.py` | ANN search + Turso metadata → 37-feature training data with pseudo-labels |
| `scripts/03_train_lightgbm.py` | LambdaRank training + evaluation + latency benchmark |
| `reranker_v1.txt` | The trained production model (974 KB, 141 trees) |
| Evaluation artifacts | `eval_metrics.json`, `baseline_comparison.json`, `feature_importance.csv`, `feature_schema.json` |
| HuggingFace repo | [siddhm11/researchit-reranker-phase6](https://huggingface.co/siddhm11/researchit-reranker-phase6) |

**Training data summary**:
- Sampled 50,000 papers from the 1.6M corpus
- 242,179 citation edges (in-corpus only: both papers must be in Qdrant)
- 90,993 training triples + 7,007 eval triples (temporal split: train < 2023, eval ≥ 2023)
- Label scheme: `2` = directly cited, `1` = co-cited, `0` = ANN-retrieved but not cited
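
The label scheme and temporal split can be sketched as follows. This is an illustration, not the code in `02_generate_training_triples.py`: the function names are hypothetical, the sets of cited and co-cited IDs are assumed to be precomputed, and splitting on the query paper's year is an assumption about how the 2023 cutoff is applied.

```python
def assign_label(candidate_id, cited_ids, co_cited_ids):
    """Return 2 for a direct citation, 1 for a co-citation, else 0."""
    if candidate_id in cited_ids:
        return 2
    if candidate_id in co_cited_ids:
        return 1
    return 0


def temporal_split(query_year, cutoff_year=2023):
    """Queries before the cutoff go to train; the rest go to eval."""
    return "train" if query_year < cutoff_year else "eval"


# Hypothetical ANN candidates for one query paper.
labels = [assign_label(c, cited_ids={"a"}, co_cited_ids={"b"}) for c in ["a", "b", "c"]]
```

Checking direct citation first means a paper that is both cited and co-cited gets the higher label.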

### Antigravity (Integration Agent)

**Role**: Wire model into ResearchIT production code

| Deliverable | Description |
|-------------|-------------|
| `app/recommend/reranker.py` rewrite | 5 features → 37 features, LightGBM loading with heuristic fallback |
| `requirements.txt` update | Added `lightgbm>=4.0,<5.0` |
| `tests/test_reranker_integration.py` | 7-test integration suite |
| `tests/demo_reranker.py` | Interactive demo with 20 realistic papers |
| `tests/test_reranker_diversity.py` fixes | Updated 3 tests from 5-feature → 37-feature schema |
| `scripts/fix_model_crlf.py` | Utility to fix Windows CRLF corruption in model file |
| `scripts/export_arxiv_ids.py` | Exports 1.6M arXiv IDs from Turso for the ML Intern |

---

## 3. Where to Find the Model

### Primary location (HuggingFace)

**URL**: https://huggingface.co/siddhm11/researchit-reranker-phase6  
**Model file**: `production_model/reranker_v1.txt`  
**Direct link**: https://huggingface.co/siddhm11/researchit-reranker-phase6/blob/main/production_model/reranker_v1.txt

### Local clone (in this repo)

**Path**: `models/reranker-phase6/production_model/reranker_v1.txt`

This directory was cloned from the HF repo and contains:
```
models/reranker-phase6/
├── README.md                    # Full model documentation
├── INTEGRATION_GUIDE.md         # Step-by-step integration code
├── CHANGELOG.md                 # Version history
├── load_model.py                # Quick-start loading snippet
├── production_model/
│   ├── reranker_v1.txt          ← THE MODEL (974 KB, 141 trees, 37 features)
│   ├── eval_metrics.json        # nDCG, recall, MRR, latency benchmarks
│   ├── baseline_comparison.json # LightGBM vs heuristic head-to-head
│   ├── feature_importance.csv   # All 37 features ranked by split gain
│   └── feature_schema.json      # Exact feature column order (MUST match code)
├── scripts/                     # Training pipeline (3 scripts)
│   ├── 01_fetch_citation_edges.py
│   ├── 02_generate_training_triples.py
│   └── 03_train_lightgbm.py
├── synthetic_model/             # Old proof-of-concept (ignore)
└── tests/
    └── test_full_pipeline.py
```

### How to load the model

```python
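import lightgbm as lgb
import numpy as np

model = lgb.Booster(model_file="models/reranker-phase6/production_model/reranker_v1.txt")
features = np.zeros((3, 37))  # placeholder: replace with real rows in schema order
scores = model.predict(features)  # (N, 37) numpy array → (N,) relevance scores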
```

### Model file properties

| Property | Value |
|----------|-------|
| Format | LightGBM v4 text model (plain text, no pickle) |
| Objective | `lambdarank` (optimizes nDCG directly) |
| Trees | 141 (early stopped from 500) |
| Leaves per tree | 63 |
| Learning rate | 0.05 |
| Features | 37 (must match `feature_schema.json` exactly) |
| File size | 974 KB |
| Best iteration | 141 |

---

## 4. The 37-Feature Schema

The model expects features in **this exact order** (defined in `feature_schema.json` and `FEATURE_NAMES` in `reranker.py`):

### Content/Retrieval Features (0–19)

| # | Name | Source | Notes |
|---|------|--------|-------|
| 0 | `qdrant_cosine_score` | Qdrant ANN search | Raw embedding similarity |
| 1 | `candidate_position` | ANN rank order | 0-indexed |
| 2 | `candidate_citation_count` | Turso `papers` table | Raw count |
| 3 | `candidate_log_citations` | Derived | log(citation_count + 1) |
| 4 | `candidate_influential_citations` | Turso `papers` table | From Semantic Scholar |
| 5 | `candidate_age_days` | Turso `update_date` | Days since publication |
| 6 | `candidate_recency_score` | Derived | exp(-0.002 × age_days) |
| 7 | `query_citation_count` | N/A in prod | 0 (no seed paper) |
| 8 | `query_age_days` | N/A in prod | 0 (no seed paper) |
| 9 | `year_diff` | Derived | \|current_year - paper_year\| |
| 10 | `same_primary_category` | N/A in prod | 0 (no seed paper) |
| 11 | `co_citation_count` | N/A in prod | 0 (no citation graph) |
| 12 | `shared_author_count` | N/A in prod | 0 (no seed paper) |
| 13 | `candidate_is_newer` | Derived | 1 if paper_year >= current_year |
| 14 | `query_log_citations` | N/A in prod | 0 |
| 15 | `citation_count_ratio` | Derived | cand_citations / (query_citations + 1) |
| 16 | `age_ratio` | Derived | cand_age / (query_age + 1) |
| 17 | `candidate_citations_per_year` | Derived | citations / max(age_years, 0.5) |
| 18 | `query_num_references` | N/A in prod | 0 |
| 19 | `candidate_num_cited_by` | N/A in prod | 0 |
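
The "Derived" rows in this table can be sketched directly from their formulas. The function name is hypothetical, `log` is taken to be the natural logarithm, and `age_years = age_days / 365.25` is an assumption since the table does not define it. Query-side inputs default to 0, matching the "N/A in prod" rows.

```python
import math


def derived_content_features(citation_count, age_days, query_citations=0, query_age_days=0):
    """Compute the derived candidate features from raw metadata."""
    age_years = age_days / 365.25  # assumption: not defined in the table
    return {
        "candidate_log_citations": math.log(citation_count + 1),
        "candidate_recency_score": math.exp(-0.002 * age_days),
        "citation_count_ratio": citation_count / (query_citations + 1),
        "age_ratio": age_days / (query_age_days + 1),
        "candidate_citations_per_year": citation_count / max(age_years, 0.5),
    }


# Hypothetical one-year-old paper with 99 citations.
feats = derived_content_features(citation_count=99, age_days=365.25)
```

The `+ 1` terms and the `max(age_years, 0.5)` floor keep every ratio finite for brand-new or uncited papers.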

### User Behavior Features (20–30)

| # | Name | Source | Status |
|---|------|--------|--------|
| 20 | `ewma_longterm_similarity` | `profiles.load_profile("long_term")` | ✅ Active |
| 21 | `ewma_shortterm_similarity` | `profiles.load_profile("short_term")` | ✅ Active |
| 22 | `ewma_negative_similarity` | `profiles.load_profile("negative")` | ✅ Active |
| 23 | `cluster_importance` | Ward clustering | ✅ Active when passed |
| 24 | `cluster_distance_to_medoid` | Ward clustering | ✅ Active when passed |
| 25 | `is_suppressed_category` | `db.get_suppressed_categories()` | ✅ Active when passed |
| 26 | `onboarding_category_match` | Phase 5 onboarding | Zero until wired |
| 27 | `user_total_saves` | `interactions` table | Zero until wired |
| 28 | `user_total_dismissals` | `interactions` table | Zero until wired |
| 29 | `user_days_since_last_save` | `interactions` table | Zero until wired |
| 30 | `user_session_save_count` | Session state | Zero until wired |
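
A rough sketch of how features 20–22 might be computed, assuming each is the cosine similarity between a candidate embedding and the corresponding EWMA profile vector, with a zero fill when the profile is missing. The helper names are hypothetical, and the code is pure Python for readability; the production path is described as vectorized NumPy.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def profile_similarity(candidate_vec, profile_vec):
    """Similarity to a profile vector, or 0.0 when the user has no such profile."""
    if profile_vec is None:
        return 0.0
    return cosine_similarity(candidate_vec, profile_vec)
```

Returning 0.0 for a missing profile reproduces the training-time condition, where these columns were all zero.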

### Cross Features (31–36, Auto-computed)

| # | Name | Formula |
|---|------|---------|
| 31 | `cosine_x_recency` | feat[0] × feat[6] |
| 32 | `cosine_x_citations` | feat[0] × feat[3] |
| 33 | `category_x_recency` | feat[10] × feat[6] |
| 34 | `cosine_x_cocitation` | feat[0] × log(feat[11] + 1) |
| 35 | `position_inverse` | 1 / (feat[1] + 1) |
| 36 | `citations_x_recency` | feat[3] × feat[6] |
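
The cross-feature formulas translate line by line into code. The sketch below operates on a single 37-element row; the function name is hypothetical and `log` is assumed to be the natural logarithm.

```python
import math


def add_cross_features(feat):
    """feat: list of 37 floats with indices 0-30 filled; fills 31-36 in place."""
    feat[31] = feat[0] * feat[6]                 # cosine_x_recency
    feat[32] = feat[0] * feat[3]                 # cosine_x_citations
    feat[33] = feat[10] * feat[6]                # category_x_recency
    feat[34] = feat[0] * math.log(feat[11] + 1)  # cosine_x_cocitation
    feat[35] = 1.0 / (feat[1] + 1)               # position_inverse
    feat[36] = feat[3] * feat[6]                 # citations_x_recency
    return feat


# Hypothetical row: cosine 0.5, ANN position 3, log-citations 2.0, recency 0.5.
row = [0.0] * 37
row[0], row[1], row[3], row[6] = 0.5, 3.0, 2.0, 0.5
add_cross_features(row)
```

Features 33 and 34 come out as zero whenever features 10 and 11 are zero, which is the case in production according to the content-feature table.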

> **Key insight**: Features 20–30 were ALL zero during training (no real users). The model learned to work without them. When you retrain with real user data, these features will "light up" and the model will learn user-specific ranking signals.
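
Because the column order must match between training and inference, a startup check that compares the two name lists can catch drift early. The sketch below is a hypothetical helper that takes both lists as arguments; how the names are read from `feature_schema.json` depends on that file's layout, which is not shown in this document. LightGBM's `Booster.feature_name()` can supply the names stored in the model file.

```python
def first_schema_mismatch(code_names, schema_names):
    """Return None if both orderings match, else (index, code_name, schema_name).

    A length mismatch is reported as (shorter_length, None, None).
    """
    if len(code_names) != len(schema_names):
        return (min(len(code_names), len(schema_names)), None, None)
    for i, (code_name, schema_name) in enumerate(zip(code_names, schema_names)):
        if code_name != schema_name:
            return (i, code_name, schema_name)
    return None


# Hypothetical three-column schemas; the real lists have 37 entries.
ok = first_schema_mismatch(["a", "b", "c"], ["a", "b", "c"])
bad = first_schema_mismatch(["a", "b", "c"], ["a", "c", "b"])
```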

---

## 5. Model Performance

### Feature Importance (Top 10 by split gain)

| Rank | Feature | Importance | % of Total |
|------|---------|------------|-----------|
| 1 | `candidate_num_cited_by` | 75,203 | 65.2% |
| 2 | `age_ratio` | 7,597 | 6.6% |
| 3 | `candidate_position` | 6,765 | 5.9% |
| 4 | `cosine_x_citations` | 2,383 | 2.1% |
| 5 | `qdrant_cosine_score` | 2,353 | 2.0% |
| 6 | `candidate_citation_count` | 2,042 | 1.8% |
| 7 | `citation_count_ratio` | 2,001 | 1.7% |
| 8 | `query_age_days` | 1,749 | 1.5% |
| 9 | `query_num_references` | 1,726 | 1.5% |
| 10 | `candidate_citations_per_year` | 1,633 | 1.4% |

> **Interpretation**: The model promotes highly-cited, recent papers over position-biased ANN ordering. Features 20–30 (user behavior) have zero importance because they were zero-filled during training; this is expected and will change after retraining with real data.

---

## 6. How It Works (End to End)

### At Module Import Time

```
reranker.py loads → tries import lightgbm
  → searches for model file in 4 locations:
      1. RERANKER_MODEL_PATH env var
      2. models/reranker-phase6/production_model/reranker_v1.txt (relative)
      3. production_model/reranker_v1.txt (relative)
      4. Absolute path computed from __file__
  → if found: loads lgb.Booster, sets _USE_LGB = True
  → if not found: prints warning, _USE_LGB = False (heuristic fallback)
```
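
The lookup order above can be sketched as follows. This is an approximation rather than the code in `reranker.py`: the function names are hypothetical, and the absolute path assumes the module sits at `<repo>/app/recommend/reranker.py`, two directories below the repository root.

```python
import os
from pathlib import Path

REL_MODEL = Path("models/reranker-phase6/production_model/reranker_v1.txt")


def candidate_model_paths(module_file):
    """Yield the lookup locations in order (the env var only when it is set)."""
    env_path = os.environ.get("RERANKER_MODEL_PATH")
    if env_path:
        yield Path(env_path)
    yield REL_MODEL
    yield Path("production_model/reranker_v1.txt")
    # Assumption: module_file is <repo>/app/recommend/reranker.py
    yield Path(module_file).resolve().parents[2] / REL_MODEL


def find_model(module_file):
    """Return the first existing candidate path, or None to trigger the fallback."""
    for path in candidate_model_paths(module_file):
        if path.is_file():
            return path
    return None


def load_booster(module_file):
    """Return (booster, use_lgb); (None, False) means use the heuristic."""
    path = find_model(module_file)
    if path is None:
        return None, False
    try:
        import lightgbm as lgb  # optional dependency
        return lgb.Booster(model_file=str(path)), True
    except Exception:
        return None, False
```

Catching the import and load errors together means a missing package or a corrupt model file both end in the heuristic fallback instead of an import-time crash.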

### At Recommendation Time

```
recommendations.py calls rerank_candidates(ids, embeddings, metadata, ...)
  → compute_features() builds (N, 37) feature matrix
    → Batch cosine similarities (vectorized NumPy, fast)
    → Per-candidate metadata features (citations, age, category)
    → User behavior features (EWMA, cluster, interaction counts)
    → Cross features (auto-computed from above)
  → if _USE_LGB: scores = model.predict(features)
    else: scores = heuristic_score(features)
  → Sort by scores descending
  → Return (sorted_ids, sorted_scores, sorted_embeddings)
```
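
The final score-and-sort step of this flow can be sketched in a few lines. The function below is hypothetical and uses plain lists; `score_fn` stands in for either `model.predict` or the heuristic, and the feature matrix is assumed to be built already.

```python
def rerank(candidate_ids, candidate_embeddings, features, score_fn):
    """Score each feature row, then return ids, scores, and embeddings best-first."""
    scores = [float(s) for s in score_fn(features)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return (
        [candidate_ids[i] for i in order],
        [scores[i] for i in order],
        [candidate_embeddings[i] for i in order],
    )


# Hypothetical scorer that just reads the first feature column.
ids, scores, embs = rerank(
    ["a", "b", "c"],
    [[1.0], [2.0], [3.0]],
    [[0.1], [0.9], [0.5]],
    score_fn=lambda rows: [row[0] for row in rows],
)
```

Sorting an index list once and applying it to all three sequences keeps ids, scores, and embeddings aligned.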

### Backward Compatibility

The existing caller in `recommendations.py` (line 305) does NOT need changes:
```python
rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
)
```
All Phase 6 parameters are keyword-only with safe defaults. The model zero-fills missing features.

---

## 7. File Inventory

### Files modified by Phase 6

| File | Change |
|------|--------|
| `app/recommend/reranker.py` | Complete rewrite: 181 → 473 lines, 5 → 37 features, LightGBM + heuristic |
| `requirements.txt` | Added `lightgbm>=4.0,<5.0` |
| `tests/test_reranker_diversity.py` | Updated 3 tests from 5-feature → 37-feature expectations |

### Files created by Phase 6

| File | Purpose |
|------|---------|
| `models/reranker-phase6/` | Complete model repo clone from HuggingFace |
| `tests/test_reranker_integration.py` | 7-test integration suite (smoke, features, E2E, latency, compat) |
| `tests/demo_reranker.py` | Interactive demo with 20 realistic papers |
| `scripts/fix_model_crlf.py` | Utility to fix Windows line-ending corruption |
| `scripts/export_arxiv_ids.py` | Exports 1.6M arXiv IDs from Turso |
| `docs/PHASE6-HANDOFF.md` | This document |
| `docs/ML Intern docs/` | ML Intern conversation logs (5 files) |

---

## 8. Test Results

### Integration Test Suite (7/7 PASSED)

```
$ python tests/test_reranker_integration.py

1. Smoke Test          ✅  141 trees, 37 features loaded
2. Feature Computation ✅  (N, 37) matrix, values verified
3. Heuristic Fallback  ✅  Scores [0.39, 0.83]
4. E2E Pipeline        ✅  50 candidates reranked via LightGBM
5. Latency Benchmark   ✅  0.143ms / 100 candidates (target: <1ms)
6. Backward Compat     ✅  Old 6-arg call works
7. LGB vs Heuristic    ✅  Top-5 overlap 1/5, Kendall τ = -0.07
```
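
Test 7 compares the two rankings by top-5 overlap and Kendall τ. A minimal sketch of both measures follows, assuming both rankings contain the same ids with no ties; it computes tau-a, which equals tau-b in that case. The actual test may use a library routine instead.

```python
from itertools import combinations


def top_k_overlap(ranking_a, ranking_b, k=5):
    """Number of ids shared by the top-k of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))


def kendall_tau(ranking_a, ranking_b):
    """Kendall tau-a between two rankings of the same ids (no ties)."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):  # x precedes y in ranking_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0
```

A τ near zero, as reported above, means the two scorers order candidates almost independently of each other.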

### Full Test Suite (121/121 PASSED)

```
$ python -m pytest tests/ -v
121 passed, 0 failed
```

All existing Phase 1–5 tests continue to pass with zero regressions.

### How to run tests

```bash
cd ResearchIT-Final

# Set encoding for Windows emoji support (PowerShell syntax;
# in bash use: export PYTHONIOENCODING=utf-8)
$env:PYTHONIOENCODING='utf-8'

# Run Phase 6 integration tests
python tests/test_reranker_integration.py

# Run interactive demo (20 realistic papers)
python tests/demo_reranker.py

# Run full test suite
python -m pytest tests/ -v
```

---

## 9. How to Reproduce Everything

### Step 0: Export arXiv IDs (already done: `arxiv_ids.txt` exists)
```bash
python scripts/export_arxiv_ids.py
# Output: arxiv_ids.txt (1.6M lines, 18.5 MB)
```

### Step 1: Fetch citation edges (~2 hours)
```bash
cd models/reranker-phase6/scripts
S2_API_KEY=<your_key> python 01_fetch_citation_edges.py \
    --corpus-file ../../../arxiv_ids.txt \
    --max-papers 50000
# Output: citations.parquet (242K edges)
```

### Step 2: Generate training triples (~30 min)
```bash
python 02_generate_training_triples.py
# Requires: Qdrant + Turso access via env vars
# Output: ltr_dataset/train.parquet + eval.parquet
```

### Step 3: Train model (~7 min)
```bash
python 03_train_lightgbm.py
# Output: production_model/reranker_v1.txt + eval_metrics.json
```
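
For orientation, a training call consistent with the model properties in Section 3 might look like the sketch below. The objective, leaf count, learning rate, and 500-round cap come from that table; the metric settings, the early-stopping patience, the function name, and the mapping of "Leaves per tree" to `num_leaves` are assumptions, and `03_train_lightgbm.py` remains the authoritative version.

```python
PARAMS = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "eval_at": [5, 10],  # assumption: matches the reported nDCG@5 and nDCG@10
    "num_leaves": 63,
    "learning_rate": 0.05,
}
MAX_ROUNDS = 500  # the shipped model stopped early at 141 trees


def train_reranker(X_train, y_train, group_train, X_eval, y_eval, group_eval, patience=50):
    """Train a LambdaRank booster.

    group_*: number of candidates per query, in row order.
    patience: assumed early-stopping window, not taken from the source.
    """
    import lightgbm as lgb  # imported lazily so PARAMS loads without the package

    train_set = lgb.Dataset(X_train, label=y_train, group=group_train)
    eval_set = lgb.Dataset(X_eval, label=y_eval, group=group_eval, reference=train_set)
    return lgb.train(
        PARAMS,
        train_set,
        num_boost_round=MAX_ROUNDS,
        valid_sets=[eval_set],
        callbacks=[lgb.early_stopping(stopping_rounds=patience)],
    )
```

The returned booster can be written out with `booster.save_model(...)`, which produces the plain-text format described in Section 3.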

### Step 4: Fix line endings on Windows (if needed)
```bash
# Run from the repository root (Steps 1-3 run inside models/reranker-phase6/scripts)
python scripts/fix_model_crlf.py
```
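
The kind of repair this step performs can be illustrated with a short sketch. It is not the contents of `scripts/fix_model_crlf.py`; it only shows the general technique of normalizing CRLF to LF at the byte level, with a hypothetical function name.

```python
from pathlib import Path


def normalize_line_endings(path):
    """Rewrite the file with LF endings; return True if anything changed."""
    target = Path(path)
    raw = target.read_bytes()
    fixed = raw.replace(b"\r\n", b"\n")
    if fixed != raw:
        target.write_bytes(fixed)
        return True
    return False
```

Working in bytes avoids Python's own newline translation on Windows, which would otherwise reintroduce the problem on write.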

> **Note**: The intermediate data files (`citations.parquet`, `train.parquet`, `eval.parquet`) were in the ML Intern's HuggingFace sandbox, which has expired. They are fully reproducible by re-running Steps 1–3.

---

## 10. Deployment Checklist

- [x] Rewrite `reranker.py` with 37-feature schema
- [x] Add `lightgbm>=4.0,<5.0` to `requirements.txt`
- [x] Integration tests passing (7/7)
- [x] Full test suite passing (121/121)
- [x] Schema alignment verified (code = JSON = model)
- [x] Latency verified (0.143ms < 1ms target)
- [x] Backward compatibility verified
- [x] Documentation complete
- [ ] Commit Phase 6 changes to Git
- [ ] Push to GitHub
- [ ] Push model file to HF Spaces (or set `RERANKER_MODEL_PATH`)
- [ ] Add `lightgbm>=4.0,<5.0` to Docker image
- [ ] Verify model loads in production: `[reranker] ✅ LightGBM model loaded`

---

## 11. Credentials & Infrastructure

| Credential | Env Var | Status | Used By |
|-----------|---------|--------|---------|
| Qdrant Cloud | `QDRANT_URL`, `QDRANT_API_KEY` | ✅ In `.env` + HF | Embedding search |
| Zilliz Cloud | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | ✅ In `.env` + HF | Sparse search |
| Turso (libSQL) | `TURSO_URL`, `TURSO_DB_TOKEN` | ✅ In `.env` + HF | Paper metadata |
| Groq | `GROQ_API_KEY` | ✅ In `.env` + HF | Query rewriting |
| Semantic Scholar | `S2_API_KEY` | ✅ In `.env` | Script 1 only (not needed in prod) |
| Model path | `RERANKER_MODEL_PATH` | Optional | Override model file location |
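
A deployment could verify these variables at startup with a small check like the one below. The variable names come from the table; treating the first seven as required is an assumption (`S2_API_KEY` is listed as not needed in production and `RERANKER_MODEL_PATH` as optional), and the helper name is hypothetical.

```python
import os

REQUIRED_ENV_VARS = [
    "QDRANT_URL",
    "QDRANT_API_KEY",
    "ZILLIZ_URI",
    "ZILLIZ_TOKEN",
    "TURSO_URL",
    "TURSO_DB_TOKEN",
    "GROQ_API_KEY",
]


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```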

---

## 12. Known Limitations & Future Work

### Current limitations

1. **Citation pseudo-labels ≠ real user preferences**: The model was trained on "what would a researcher cite?" not "what would a user save?" These correlate but aren't identical.
2. **Features 20–30 are zero**: User behavior features had no signal during training. The model works without them but will improve significantly when retrained with real data.
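3. **`candidate_num_cited_by` dominates** (65% importance): Citation data is the strongest signal available in the pseudo-labeled training set. Section 4 lists this feature as zero in production, so that signal is absent at inference time. With real user data, expect EWMA and interaction features to gain importance.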
4. **Recommendations router still uses old call signature**: The caller at `recommendations.py:305` passes only the old 6 args. Phase 6 params (`qdrant_scores`, `cluster_importance`, `suppressed_categories`) are available but not wired yet.

### Optional enhancement: Wire rich features

Update `recommendations.py` line 305 to pass additional context:
```python
reranked_ids, reranked_scores, reranked_embs = rerank_candidates(
    candidate_ids=valid_ids,
    candidate_embeddings=valid_embs,
    candidate_metadata=valid_meta,
    long_term_vec=lt_vec,
    short_term_vec=st_vec,
    negative_vec=neg_vec,
    cluster_importance=clusters[0].importance if clusters else 0.0,
    cluster_medoid=clusters[0].medoid_embedding if clusters else None,
    suppressed_categories=suppressed,
)
```

### Future: Retraining with real user data

When you have 500+ user interactions:
1. Export: `SELECT user_id, arxiv_id, action, created_at FROM interactions`
2. Relabel: save=2, click=1, dismiss=0
3. Re-run Script 2 with real labels → new training data
4. Re-run Script 3 → new model
5. Features 20–30 will gain significant importance
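
The relabeling in step 2 can be sketched as a simple mapping over the exported rows. The helper is hypothetical; keeping the highest label when a user has several actions on the same paper, and skipping action types other than the three listed, are both assumptions.

```python
ACTION_LABELS = {"save": 2, "click": 1, "dismiss": 0}


def relabel(interactions):
    """interactions: iterable of (user_id, arxiv_id, action, created_at) rows.

    Returns {(user_id, arxiv_id): label}, keeping the highest label per pair.
    """
    labels = {}
    for user_id, arxiv_id, action, _created_at in interactions:
        if action not in ACTION_LABELS:
            continue  # skip action types outside the three listed above
        key = (user_id, arxiv_id)
        labels[key] = max(labels.get(key, 0), ACTION_LABELS[action])
    return labels


# Hypothetical export rows.
rows = [
    ("u1", "p1", "click", "2026-05-01"),
    ("u1", "p1", "save", "2026-05-02"),
    ("u1", "p2", "dismiss", "2026-05-02"),
    ("u1", "p3", "view", "2026-05-03"),
]
user_labels = relabel(rows)
```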

---

## 13. Glossary

| Term | Definition |
|------|-----------|
| **LambdaRank** | Learning-to-rank objective that optimizes nDCG through pairwise gradients, each weighted by the nDCG change from swapping that pair |
| **nDCG@K** | Normalized Discounted Cumulative Gain at K. 1.0 = ideal ordering, 0.0 = no relevant items in the top K |
| **EWMA** | Exponentially Weighted Moving Average. User profile vectors with temporal decay |
| **Pseudo-labels** | Using citation data as proxy for relevance (cited = relevant) |
| **Cold-start** | User behavior features are zero because no real users exist yet |
| **Heuristic fallback** | Hand-tuned scoring formula that runs when LightGBM is unavailable |
| **Feature schema** | The exact 37-feature order. Must match between training and inference |
| **Booster** | LightGBM's model class. Loaded from plain text, no pickle needed |

---

## Phase Timeline

```
Phase 1   ✅  Zero-ML Recommender (Qdrant + HTMX)
Phase 2a  ✅  EWMA Profile Embeddings
Phase 2b  ✅  Ward Clustering + Multi-Interest
Phase 2c  ✅  Heuristic Re-ranking + MMR
Phase 3   ✅  Hybrid Semantic Search
Phase 3.5 ✅  Turso Metadata DB
Phase 4   ✅  Quota Fusion + Hungarian + Suppression
Phase 4.5 ✅  Instrumentation Foundation
Phase 5   ✅  Cold-Start Onboarding + UI Redesign
Phase 6   ✅  LightGBM Reranker ← COMPLETE
Phase 7   📋  Evaluation Framework (NOT STARTED)
Phase 8   📋  LLM Summaries + Distilled Reranker
Phase 9   📋  Exploration + Collaborative Filtering
```