Commit b4d17db by siddhm11 · 1 Parent(s): c450206

docs: Add Phase 3.5 Turso ArXiv Metadata DB, update infra status

  1. docs/TASK-TRACKER.md +57 -17
docs/TASK-TRACKER.md CHANGED
@@ -1,8 +1,8 @@
1
  # ResearchIT — Master Task Tracker
2
 
3
  > **Purpose**: Single source of truth for all completed, in-progress, and upcoming work.
4
- > **Last updated**: 2026-04-19
5
- > **Current phase**: Phase 3 (Hybrid Semantic Search) — implementation complete, pending deployment
6
 
7
  ---
8
 
@@ -201,6 +201,44 @@
201
 
202
  ---
203
 
204
  ## Phase 4: Recommendation Pipeline Fixes 📋 NOT STARTED
205
 
206
  > *Fix the known architectural debt in the recommendation pipeline.*
@@ -217,11 +255,12 @@
217
  - Deduplicate across clusters (assign to highest-ranked)
218
  - MMR over merged union
219
 
220
- ### 4.2 — Pre-populate Metadata Store (Kaggle Bulk Load)
221
- - [ ] Download Kaggle arXiv metadata dataset (~4GB JSON)
222
- - [ ] Write bulk-insert script → SQLite `paper_metadata` table (1.6M rows)
 
223
  - [ ] arXiv API becomes fallback for genuinely new papers only
224
- - [ ] **Impact**: Metadata fetch drops from ~7,600ms to <5ms
225
 
226
  ### 4.3 — Hungarian Matching for Cluster Stability
227
  - [ ] Implement Hungarian matching in `clustering.py`
@@ -312,24 +351,25 @@
312
  | Component | Status | Details |
313
  |---|---|---|
314
  | **Qdrant Cloud** | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
315
- | **Zilliz Cloud** | ✅ Live (DB exists, not wired to code) | 1.6M papers, BGE-M3 sparse vectors, collection `arxiv_bgem3_sparse` |
316
- | **SQLite** | ✅ Live | interactions, paper_metadata, user_profiles, user_clusters |
317
- | **HF Spaces** | ✅ Deployment target | Docker SDK, free tier: 16GB RAM, 2 vCPUs, port 7860 |
 
318
  | **Render** | ⚠️ Previous target (512MB RAM too small for BGE-M3) | May still be used for non-ML services |
319
- | **arXiv API** | ✅ Live | Keyword search (placeholder) + metadata fetch |
320
- | **BGE-M3 Model** | ✅ Code written, loads at startup | `app/embed_svc.py` — singleton, LRU cache, CPU float32 |
321
  | **Groq API** | ✅ Code written, fallback-enabled | `app/groq_svc.py` — 2s timeout, academic heuristic skip |
322
- | **Kaggle Dataset** | ❌ Not downloaded | Phase 4 bulk-loads metadata |
323
- | **Notebooks** | ✅ Organized | `notebooks/` — 01-upload, 02-test, 03-search-benchmark (see `notebooks/README.md`) |
324
 
325
  ### Credentials Status
326
 
327
  | Credential | Status | Env Var | Notes |
328
  |---|---|---|---|
329
- | **Qdrant Cloud** | ✅ In `config.py` | `QDRANT_URL`, `QDRANT_API_KEY` | Already wired |
330
- | **Zilliz Cloud** | ✅ Confirmed (not yet in config.py) | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | Phase 3 adds to config |
331
- | **Groq** | ✅ Confirmed (not yet in config.py) | `GROQ_API_KEY` | Phase 3 adds to config |
332
- | **HF Spaces** | 📋 Not yet created | N/A | Create Space with Docker SDK when ready to deploy |
 
333
 
334
  ---
335
 
 
1
  # ResearchIT — Master Task Tracker
2
 
3
  > **Purpose**: Single source of truth for all completed, in-progress, and upcoming work.
4
+ > **Last updated**: 2026-04-20
5
+ > **Current phase**: Phase 3.5 (Turso Metadata DB) — complete, integration pending
6
 
7
  ---
8
 
 
201
 
202
  ---
203
 
204
+ ## Phase 3.5: Turso ArXiv Metadata DB ✅ COMPLETE
205
+
206
+ > *Bulk-loaded 1.23 GB of arXiv paper metadata + citation data to Turso (libSQL) cloud DB.*
207
+ > *Eliminates the unstable arXiv API dependency for metadata fetching (Phase 4.2 solved early).*
208
+ > *Created from Kaggle notebook — no code changes to ResearchIT codebase yet.*
209
+
210
+ ### Infrastructure
211
+ - [x] Turso cloud DB created: `arxiv-data` on `aws-ap-south-1`
212
+   - URL: `https://arxiv-data-siddhm11.aws-ap-south-1.turso.io`
213
+   - Auth: Platform token + DB auth token (minted via CLI)
214
+ - [x] Table: `papers` with columns:
215
+   - `arxiv_id` (TEXT, UNIQUE INDEX `idx_papers_arxiv_id`)
216
+   - `title` (TEXT)
217
+   - `authors` (TEXT)
218
+   - `categories` (TEXT)
219
+   - `primary_topic` (TEXT)
220
+   - `update_date` (TEXT)
221
+   - `abstract_preview` (TEXT, truncated to 500 chars)
222
+   - `citation_count` (INTEGER, default 0)
223
+   - `influential_citations` (INTEGER, default 0)
224
+ - [x] Data sources:
225
+   - `arxiv_comprehensive_papers.csv` (Kaggle: siddhm11/arxivdata)
226
+   - `arxiv_citations_summary.csv` (Kaggle: siddhm11/citation-data-letsgoo)
227
+   - Joined on `id` = `arxiv_id_clean`, deduplicated
228
+ - [x] Row count verified: local ↔ remote match
229
+ - [x] Unique index on `arxiv_id` for fast lookups
230
+
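The schema and load step listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Kaggle notebook's actual code: the standard-library `sqlite3` module stands in for libSQL (which is SQLite-compatible), the CSV column names other than `id` and `arxiv_id_clean` are hypothetical, and a left join is assumed so that papers without citation data keep the default count of 0.

```python
import sqlite3

# Mirrors the column list above; sqlite3 is a local stand-in for Turso/libSQL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id              TEXT,
    title                 TEXT,
    authors               TEXT,
    categories            TEXT,
    primary_topic         TEXT,
    update_date           TEXT,
    abstract_preview      TEXT,
    citation_count        INTEGER DEFAULT 0,
    influential_citations INTEGER DEFAULT 0
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_papers_arxiv_id ON papers (arxiv_id);
"""


def build_rows(papers, citations):
    """Join paper dicts to citation dicts on id = arxiv_id_clean, one row per id.

    Every key except `id` and `arxiv_id_clean` is a hypothetical column name.
    """
    cites = {c["arxiv_id_clean"]: c for c in citations}
    seen = set()
    for p in papers:
        pid = p["id"]
        if pid in seen:  # deduplicate: keep the first occurrence only
            continue
        seen.add(pid)
        c = cites.get(pid, {})  # assumed left join: missing citations become 0
        yield (
            pid,
            p["title"],
            p["authors"],
            p["categories"],
            p["primary_topic"],
            p["update_date"],
            p["abstract"][:500],  # abstract_preview is truncated to 500 chars
            int(c.get("citation_count", 0)),
            int(c.get("influential_citations", 0)),
        )


def load_papers(conn, papers, citations):
    """Create the table and index, insert the joined rows, return the row count."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO papers VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        build_rows(papers, citations),
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
```

In practice `papers` and `citations` could be fed from `csv.DictReader` over the two Kaggle files, and the returned count compared against the remote count, as in the row-count check above.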
231
+ ### Integration plan (not yet wired into code)
232
+ - [ ] Add `TURSO_URL` and `TURSO_DB_TOKEN` to `config.py` / `.env`
233
+ - [ ] Create `app/turso_svc.py` — metadata lookup service
234
+   - `fetch_metadata_batch(arxiv_ids)` → `{arxiv_id: paper_dict}`
235
+   - Uses `libsql-experimental` or `libsql-client` (HTTP)
236
+ - [ ] Replace `arxiv_svc.fetch_metadata_batch()` with `turso_svc.fetch_metadata_batch()` in `search.py`
237
+ - [ ] arXiv API becomes fallback for papers not in Turso DB
238
+ - [ ] **Impact**: Metadata fetch drops from ~7,600ms to <50ms
239
+
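One way the planned `fetch_metadata_batch` could look, as a sketch only. `app/turso_svc.py` does not exist yet, and the standard-library `sqlite3` module stands in here for a libSQL client, whose interface may differ. The `conn`, `fallback`, and `chunk_size` parameters are assumptions added for illustration; `fallback` represents the arXiv API path for papers missing from the Turso DB.

```python
import sqlite3

COLUMNS = (
    "arxiv_id", "title", "authors", "categories", "primary_topic",
    "update_date", "abstract_preview", "citation_count", "influential_citations",
)


def fetch_metadata_batch(conn, arxiv_ids, fallback=None, chunk_size=500):
    """Return {arxiv_id: paper_dict} for the requested ids.

    conn       -- DB-API style connection (sqlite3 here, as a stand-in)
    fallback   -- optional callable taking the missing ids and returning a
                  dict in the same shape (hypothetical arXiv API hook)
    chunk_size -- ids per query, kept below SQLite's bound-parameter limit
    """
    ids = list(dict.fromkeys(arxiv_ids))  # drop duplicates, keep order
    found = {}
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        sql = (
            f"SELECT {', '.join(COLUMNS)} FROM papers "
            f"WHERE arxiv_id IN ({placeholders})"
        )
        for row in conn.execute(sql, chunk):
            found[row[0]] = dict(zip(COLUMNS, row))

    # Only ids absent from the DB are sent to the slower fallback path.
    missing = [i for i in ids if i not in found]
    if missing and fallback is not None:
        found.update(fallback(missing))
    return found
```

Batching the ids into a single `IN (...)` query per chunk lets the unique index on `arxiv_id` serve every lookup, which is what makes the targeted latency drop plausible; the actual figure would need to be measured once the service is wired in.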
240
+ ---
241
+
242
  ## Phase 4: Recommendation Pipeline Fixes 📋 NOT STARTED
243
 
244
  > *Fix the known architectural debt in the recommendation pipeline.*
 
255
  - Deduplicate across clusters (assign to highest-ranked)
256
  - MMR over merged union
257
 
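The two steps above (cross-cluster deduplication, then MMR over the merged union) might be sketched as below. This is a plain-Python illustration, not code from the ResearchIT pipeline: the function names, the `(paper_id, score, vector)` tuple shape, and the `lam` trade-off weight are all hypothetical, and "assign to highest-ranked" is read here as "a paper stays with the highest-ranked cluster that returned it".

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_clusters(cluster_results):
    """Merge per-cluster hits, best-ranked cluster first.

    cluster_results is a list of lists of (paper_id, score, vector).
    A paper returned by several clusters keeps the entry from the first
    (highest-ranked) cluster in which it appears.
    """
    merged = {}
    for hits in cluster_results:
        for pid, score, vec in hits:
            merged.setdefault(pid, (score, vec))
    return merged


def mmr(merged, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over the merged candidates.

    Each step picks the candidate maximising
    lam * relevance - (1 - lam) * max similarity to already selected items.
    """
    selected = []
    remaining = dict(merged)
    while remaining and len(selected) < k:

        def mmr_score(pid):
            score, vec = remaining[pid]
            max_sim = max(
                (cosine(vec, merged[s][1]) for s in selected), default=0.0
            )
            return lam * score - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected
```

With `lam` close to 1 the ranking approaches pure relevance order; lower values push near-duplicate papers further down the list.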
258
+ ### 4.2 — Pre-populate Metadata Store ✅ DONE (via Turso)
259
+ - [x] Bulk-loaded arXiv metadata from Kaggle to Turso cloud DB (Phase 3.5)
260
+ - [x] 1.23 GB, includes citation counts from Semantic Scholar
261
+ - [ ] Wire Turso service into `search.py` to replace arXiv API calls
262
  - [ ] arXiv API becomes fallback for genuinely new papers only
263
+ - [ ] **Impact**: Metadata fetch drops from ~7,600ms to <50ms
264
 
265
  ### 4.3 — Hungarian Matching for Cluster Stability
266
  - [ ] Implement Hungarian matching in `clustering.py`
 
351
  | Component | Status | Details |
352
  |---|---|---|
353
  | **Qdrant Cloud** | ✅ Live | 1.6M papers, BGE-M3 1024-dim, BQ enabled, HNSW m=32 |
354
+ | **Zilliz Cloud** | ✅ Live | 1.6M papers, BGE-M3 sparse vectors, collection `arxiv_bgem3_sparse` |
355
+ | **Turso (libSQL)** | ✅ Live | 1.23 GB arXiv metadata + citations, `arxiv-data` DB, `papers` table, unique index on `arxiv_id` |
356
+ | **SQLite** | ✅ Live | interactions, paper_metadata (local cache), user_profiles, user_clusters |
357
+ | **HF Spaces** | ✅ Deployed | Docker SDK, free tier, port 7860 — https://siddhm11-researchit.hf.space |
358
  | **Render** | ⚠️ Previous target (512MB RAM too small for BGE-M3) | May still be used for non-ML services |
359
+ | **arXiv API** | ✅ Live | Keyword search fallback + metadata fetch (to be replaced by Turso) |
360
+ | **BGE-M3 Model** | ✅ Live | Pre-baked in Docker image, warm-up at startup |
361
  | **Groq API** | ✅ Code written, fallback-enabled | `app/groq_svc.py` — 2s timeout, academic heuristic skip |
362
+ | **Notebooks** | ✅ Organized | `notebooks/` — 01-upload, 02-test, 03-search-benchmark |
 
363
 
364
  ### Credentials Status
365
 
366
  | Credential | Status | Env Var | Notes |
367
  |---|---|---|---|
368
+ | **Qdrant Cloud** | ✅ In `.env` | `QDRANT_URL`, `QDRANT_API_KEY` | Already wired |
369
+ | **Zilliz Cloud** | ✅ In `.env` | `ZILLIZ_URI`, `ZILLIZ_TOKEN` | Phase 3, wired |
370
+ | **Turso (libSQL)** | ✅ Token minted | `TURSO_URL`, `TURSO_DB_TOKEN` | Phase 3.5, not yet in config.py |
371
+ | **Groq** | ✅ In `.env` | `GROQ_API_KEY` | Phase 3, wired |
372
+ | **HF Spaces** | ✅ Deployed | Secrets panel | Need to add all env vars |
373
 
374
  ---
375