# Phase 1 — ArXiv Recommender System ## What Was Built A fully working, zero-ML-inference personalized arXiv paper recommender web app. Users search arXiv, save papers they like, and get increasingly personalized recommendations driven by Qdrant's native Recommend API — without loading any embedding model at runtime. --- ## Architecture Overview ``` Browser │ HTMX requests (partial HTML swaps) ▼ FastAPI (Uvicorn ASGI) ├── GET / → home page (search bar + lazy-load recs) ├── GET /search?q= → arXiv search results ├── GET /saved → saved papers page ├── POST /api/papers/{id}/save → log save, update hot cache ├── POST /api/papers/{id}/not-interested → log dismiss, remove card └── GET /api/recommendations → Qdrant Recommend → arXiv metadata │ ├── arXiv API (export.arxiv.org) — search + metadata fetch ├── SQLite WAL (aiosqlite) — events, ID map, metadata cache └── Qdrant Cloud (BGE-M3 dense) — Recommend API (1.6M papers) ``` No ML model is loaded or executed at runtime in Phase 1. The Qdrant collection (`arxiv_bgem3_dense`) was pre-indexed with BGE-M3 embeddings. Recommendations are generated purely from the vector space: Qdrant's `BEST_SCORE` strategy finds papers near the user's saved papers and away from dismissed ones. --- ## File Structure ``` ResearchIT-Final/ ├── app/ │ ├── __init__.py │ ├── config.py # all settings + credentials │ ├── db.py # SQLite layer (3 tables) │ ├── arxiv_svc.py # arXiv API client + metadata cache │ ├── user_state.py # in-memory hot cache per user │ ├── qdrant_svc.py # Qdrant ID lookup + Recommend API │ ├── templates_env.py # shared Jinja2 env (custom filter) │ ├── main.py # FastAPI app + lifespan │ └── routers/ │ ├── search.py # GET /search │ ├── events.py # POST /api/papers/{id}/save|not-interested │ ├── recommendations.py # GET /api/recommendations │ └── saved.py # GET /saved ← added in Phase 1 completion ├── app/templates/ │ ├── base.html # DaisyUI + TailwindCSS CDN + HTMX CDN │ ├── index.html # home (search bar + recommendation section) │ ├── search.html # full search results page │ ├── saved.html # saved papers page ← added in Phase 1 completion │ └── partials/ │ ├── paper_card.html # single paper card │ ├── action_buttons.html # save / not-interested buttons │ ├── search_results.html # HTMX partial for search │ ├── recommendations.html # HTMX partial for recommendations (+ refresh btn) │ └── empty_recs.html # shown when not enough saves yet (+ check btn) ├── tests/ │ ├── test_user_state.py # 10 unit tests │ ├── test_db.py # 7 async integration tests │ ├── test_arxiv_svc.py # 11 tests (normalise, parse, live API, cache) │ ├── test_qdrant_svc.py # 5 tests (cache warm, live lookup, live recommend) │ ├── test_integration.py # 12 full HTTP tests via FastAPI TestClient │ └── test_saved.py # 10 tests for /saved page ← added in Phase 1 completion ├── run.py # uvicorn entry point ├── requirements.txt └── pytest.ini ``` --- ## How to Run ### Prerequisites ```bash pip install -r requirements.txt ``` ### Start the server ```bash python run.py ``` Open `http://localhost:8000`. ### Run the tests ```bash python -m pytest ``` Full suite: **55 tests** across 6 files. Some tests hit live services (arXiv API, Qdrant Cloud) and run by default. To skip them: ```bash python -m pytest -m "not live" ``` --- ## Core Modules ### `app/config.py` Single source of truth for all settings. Every credential and tunable is here; they can all be overridden via environment variables. | Setting | Default | Purpose | |---|---|---| | `QDRANT_URL` | Qdrant Cloud EU | BGE-M3 dense collection endpoint | | `QDRANT_COLLECTION` | `arxiv_bgem3_dense` | 1,596,587 integer-ID points | | `DB_PATH` | `interactions.db` | SQLite file path | | `ARXIV_API_URL` | `https://export.arxiv.org/api/query` | arXiv Atom feed | | `REC_LIMIT` | 10 | Papers shown per recommendation batch | | `REC_POSITIVE_LIMIT` | 20 | Max positive examples kept in memory per user | | `REC_MIN_POSITIVES` | 1 | Saves needed before showing recs | | `COOKIE_NAME` | `arxiv_user_id` | UUID4, 1-year cookie | --- ### `app/db.py` SQLite with WAL mode + `PRAGMA synchronous=NORMAL` for safe concurrent reads under asyncio. Three tables: **`interactions`** — append-only event log. Every save, dismiss, click and view lands here. Two indexes: `(user_id, timestamp DESC)` for history fetch, `(user_id, paper_id)` for deduplication. The `source` field tracks where the action came from (`"search"`, `"recommendation"`, or `"saved"`). **`paper_qdrant_map`** — maps `arxiv_id TEXT → qdrant_point_id INTEGER`. Populated lazily on first save. Once an ID is mapped it is reused forever — the Qdrant collection is static. **`paper_metadata`** — SQLite cache of arXiv API responses. Stores title, abstract, authors (JSON), category, published date. Prevents redundant API calls. There is no TTL enforcement in Phase 1 (metadata rarely changes). --- ### `app/arxiv_svc.py` Thin async client around `https://export.arxiv.org/api/query` (Atom XML feed). **ID normalisation** — the arXiv API returns IDs as full URLs with version suffixes, e.g. `http://arxiv.org/abs/1706.03762v5`. `_normalise_id()` strips the URL prefix and `v5` suffix so we always work with bare IDs like `1706.03762`. Old-format IDs (`math/0702129`) are also handled. **`search(query)`** — fetches up to 10 results, writes them all into the metadata cache, returns list of paper dicts. **`fetch_metadata_batch(ids)`** — checks SQLite first, then fetches missing IDs from arXiv in batches of 20 with a 0.35s gap between requests (respects the arXiv 3 req/s rate limit). --- ### `app/user_state.py` Pure in-memory dictionary of `UserState` dataclasses, one per `user_id`. Each state holds two `deque`s: - `positives` — maxlen `config.REC_POSITIVE_LIMIT` (20), most-recent first - `negatives` — maxlen 50, most-recent first **Mutual exclusion**: saving a paper removes it from negatives and vice versa. **Lazy hydration**: `ensure_loaded()` is called once per user per server process. It reads the last 70 interactions from SQLite and replays them into the deque. After that, all reads are O(1) dict lookups in memory. **`MAX_POSITIVES` is sourced from `config.REC_POSITIVE_LIMIT`** so the deque cap and the config are always in sync. Changing `REC_POSITIVE_LIMIT` in config automatically changes how many positives are kept in memory. --- ### `app/qdrant_svc.py` Two responsibilities: **`lookup_qdrant_ids(arxiv_ids)`** — translates arxiv string IDs to Qdrant integer point IDs. Checks `paper_qdrant_map` SQLite table first. For cache misses, calls `client.scroll()` with a `MatchAny` payload filter on the `arxiv_id` field (requires the keyword index created during setup). Persists new mappings back to SQLite. **`recommend(positive_ids, negative_ids, seen_ids)`** — translates both lists to integer IDs, then calls `client.query_points()` with: ```python RecommendQuery( recommend=RecommendInput( positive=pos_ids, negative=neg_ids, strategy=RecommendStrategy.BEST_SCORE, ) ) ``` Fetches `limit * 2` results so that already-seen papers can be filtered out in Python before returning the final `limit` results. **Why sync Qdrant client inside `run_in_executor`?** The official `qdrant-client` async client has known issues with some environments. Using the sync client in a thread pool is the recommended production pattern — it keeps the asyncio event loop unblocked. --- ### `app/routers/recommendations.py` Fetches `REC_LIMIT` candidates from Qdrant (already filtered for seen papers inside `qdrant_svc.recommend()`), then fetches their metadata and renders the cards. No year filtering — classic foundational papers (2015, 2017, etc.) are valid and valuable recommendations. --- ### `app/routers/saved.py` `GET /saved` loads the user's current `positive_list` from `user_state`, fetches metadata for all of them via `arxiv_svc.fetch_metadata_batch()`, and renders them using the same `paper_card.html` partial with `saved=True`. The Remove button on each card works identically to everywhere else — it POSTs to `not-interested` and HTMX removes the card. --- ### `app/templates_env.py` Shared Jinja2 `Environment` instance imported by all routers. Registers one custom filter: **`tojson_parse`** — converts a JSON string stored in SQLite (e.g. authors array) back to a Python list. Returns `[]` on any parse error. This prevents the template from crashing when the DB column contains malformed JSON. --- ## Frontend Design Zero build step. CSS is loaded from the TailwindCSS CDN and styled with DaisyUI components. JavaScript is provided entirely by HTMX — no custom JS written. **HTMX patterns used:** | Pattern | Where | Effect | |---|---|---| | `hx-get="/search" hx-trigger="input changed delay:300ms"` | Search bar | Live search as you type | | `hx-get="/api/recommendations" hx-trigger="load"` | Recs section | Lazy-load recs after page paint | | `hx-post=".../save" hx-target="#actions-{id}" hx-swap="innerHTML"` | Save button | Replace button group with "Saved" state in-place | | `hx-post=".../not-interested" hx-target="#paper-{id}" hx-swap="outerHTML swap:200ms"` | Dismiss button | Animate-remove the whole card | | `hx-get="/api/recommendations" hx-target="#rec-section"` | Refresh button | Reload recommendations after saving more papers | **Source tracking**: every action button carries a `source` field in `hx-vals` that is logged to the DB. Values: `"search"` (from search results), `"recommendation"` (from the recs section), `"saved"` (from the saved papers page). The `source` is forwarded back to the rendered partial after a save so subsequent actions from that partial carry the correct source. --- ## Tests ### `tests/test_user_state.py` — 10 unit tests Pure unit tests, no I/O, no fixtures needed. - `test_add_positive` — paper appears in `positive_list` - `test_add_negative` — paper appears in `negative_list` - `test_mutual_exclusion_pos_to_neg` — saving then dismissing the same paper moves it - `test_mutual_exclusion_neg_to_pos` — dismissing then saving moves it back - `test_no_duplicate_positives` — saving same paper twice only stores it once - `test_ordering_positives` — most recently saved paper is first - `test_maxlen_eviction_positives` — 21st save evicts the oldest - `test_has_enough_for_recs` — False at 0 saves, True at REC_MIN_POSITIVES - `test_all_seen` — union of positives and negatives ### `tests/test_db.py` — 7 async tests Each test uses a fresh `tmp_path` SQLite file via `monkeypatch.setattr(config, "DB_PATH", ...)`. DB state never bleeds between tests. - `test_init_creates_tables` — all 3 tables present after init - `test_log_and_retrieve_interaction` — round-trip save + fetch - `test_filter_by_event_type` — only `save` rows returned when filtered - `test_qdrant_id_roundtrip` — save and retrieve a point ID - `test_qdrant_ids_batch` — batch fetch returns correct dict - `test_metadata_cache_roundtrip` — single paper insert + fetch - `test_metadata_cache_batch` — multiple papers, batch fetch ### `tests/test_arxiv_svc.py` — 11 tests - 7 parametrised `_normalise_id` tests covering URL form, bare ID, `v` suffix, old-style slash IDs - `test_parse_entry` — parses XML entry string directly - `test_fetch_metadata_live` — real arXiv API call for `0704.0002` - `test_search_live` — real arXiv API search for "attention is all you need" - `test_fetch_metadata_cache_hit` — mocked HTTP to verify SQLite cache is used on second call ### `tests/test_qdrant_svc.py` — 5 tests - `test_lookup_cache_warm` — if SQLite already has the ID, Qdrant is never called - `test_lookup_cache_miss_fetches_and_persists` — missing ID triggers Qdrant scroll, result saved to SQLite - `test_recommend_empty_no_positives` — returns `[]` immediately without hitting Qdrant - `test_lookup_real_qdrant` — live lookup: `0704.0002` → point ID 0 - `test_recommend_real_qdrant` — live recommend: saves `0704.0002`, gets real recommendations back ### `tests/test_integration.py` — 12 full HTTP tests Uses FastAPI `TestClient` (Starlette's synchronous test client). Isolated SQLite per test via monkeypatching. - `test_home_returns_200` — GET / works - `test_home_sets_cookie` — user ID cookie is set - `test_search_empty_query` — no query = no results shown - `test_search_with_query_htmx` — HTMX header returns partial (no `` tag) - `test_search_real_query` — live arXiv search via TestClient - `test_save_paper_logs_interaction` — POST save → DB row created - `test_save_paper_returns_saved_state` — response HTML contains "Saved" - `test_not_interested_returns_empty` — POST dismiss → 200 empty body - `test_not_interested_updates_state` — state reflects dismiss - `test_recommendations_empty_for_new_user` — no saves = empty recs partial - `test_recommendations_after_save` — mocked Qdrant + arXiv returns recommendation cards (year ≥ 2022) - `test_full_pipeline_smoke` — search → save → dismiss → recs, all in sequence ### `tests/test_saved.py` — 10 tests - `test_saved_page_returns_200` — GET /saved works - `test_saved_page_sets_cookie` — cookie is set on fresh visit - `test_saved_page_empty_for_new_user` — shows empty-state message - `test_saved_page_shows_paper_after_save` — paper appears after saving - `test_saved_page_shows_correct_count` — badge shows correct count for 2 saves - `test_remove_paper_updates_state` — dismiss moves paper to negatives - `test_remove_returns_empty_response` — empty response body (HTMX removes card) - `test_save_source_is_logged` — source field persisted to DB - `test_dismiss_source_saved_is_logged` — dismiss from saved page logs correctly - `test_old_paper_filtered_from_recommendations` — 2017 paper excluded, 2023 paper included --- ## Design Decisions ### No model loading at runtime Phase 1 is deliberately zero-ML-inference. The BGE-M3 embeddings were pre-indexed into Qdrant by a notebook (`bme-arxiv-test.ipynb`). At request time we only need integer point IDs — no vectors, no tokeniser, no GPU. This makes: - Cold start instant (< 1 second) - Memory footprint tiny (< 100 MB) - The recommendation quality surprisingly good — Qdrant's `BEST_SCORE` strategy in a well-indexed 1024-dim space works well even without query encoding ### arXiv API + SQLite as the metadata layer Qdrant payloads contain only `arxiv_id`. Title, abstract, authors, and category all come from the arXiv API and are cached in SQLite. This was the only viable option given the payload structure, and it has a nice property: the cache warms up naturally as users search, so recommendation metadata is usually already cached by the time it is needed. ### Lazy arxiv_id → Qdrant point ID mapping We don't pre-populate the SQLite map for all 1.6M papers. Instead, when a user saves a paper a background asyncio task (`asyncio.create_task`) fires a Qdrant scroll filter to find that paper's point ID. This is a one-time cost per unique paper. Subsequent recommendations are instant since the ID is cached. ### Cookie-based user identity No login, no accounts. A UUID4 is generated on first visit and stored in a 1-year cookie. This is intentional for Phase 1 — simple to implement, good enough for a research tool, easy to replace with real auth in Phase 2. ### Separation of DB writes and in-memory reads Every save/dismiss writes to SQLite synchronously in the event handler. The `user_state` module maintains an in-memory deque as a read cache. Because asyncio is single-threaded, there are no race conditions. The cache is loaded lazily on first access and then kept live by direct calls to `record_positive` / `record_negative`. ### Source tracking on every action Every save and dismiss carries a `source` field (`"search"`, `"recommendation"`, `"saved"`) that is logged to the `interactions` table. This enables future analytics about which surface drives the most engagement. After a save, the `source` is forwarded to the rendered `action_buttons.html` partial so that any subsequent Remove action from the same card also carries the correct source. --- ## Bugs Found and Fixed During Implementation ### 1. arXiv API 301 redirect **Symptom**: `httpx` raised an HTTP error on all arXiv requests. **Cause**: `http://export.arxiv.org` returns 301 → HTTPS. `httpx` doesn't follow redirects by default. **Fix**: Changed `ARXIV_API_URL` to `https://export.arxiv.org/api/query` and added `follow_redirects=True` to all `httpx.AsyncClient` calls. ### 2. Jinja2 UndefinedError in `action_buttons.html` **Symptom**: `POST /api/papers/{id}/save` returned 500 when rendering the button partial. **Cause**: The template used `paper_id | default(paper.arxiv_id)`. Jinja2's `default()` filter eagerly evaluates *both sides* before choosing, so `paper.arxiv_id` was evaluated even when `paper` was not in the template context (the events router only passes `paper_id`). **Fix**: Changed to `{% set pid = paper_id if paper_id is defined else paper.arxiv_id %}` which short-circuits correctly. ### 3. `action_buttons.html` hardcoded `source: "search"` everywhere **Symptom**: Every action from the recommendations section or the saved page was logged with `source="search"` in the DB. **Cause**: `hx-vals='{"source": "search"}'` was hardcoded — the `source` context variable passed by the parent template (`"recommendation"`, `"saved"`) was never read. **Fix**: Added `{% set _source = source if source is defined else "search" %}` and used `{{ _source }}` in all `hx-vals`. Also fixed `events.py` to forward the received `source` form field back to the `action_buttons.html` template context after a save. ### 4. Qdrant `recommend()` deprecated **Symptom**: `DeprecationWarning` and incorrect results. **Cause**: `client.recommend()` is the old API. `PointIdsList` has a `points` field, not `positive`/`negative`. **Fix**: Switched to `client.query_points()` with `RecommendQuery(recommend=RecommendInput(...))` — the current recommended pattern. ### 5. Qdrant payload filter fails without index **Symptom**: Qdrant returned an error: *"Index required but not found for field arxiv_id"*. **Cause**: Filtering on a payload field requires a payload index. The collection was created without one. **Fix**: Created a keyword index on the `arxiv_id` field: ```python client.create_payload_index( collection_name=collection, field_name="arxiv_id", field_schema=PayloadSchemaType.KEYWORD, wait=False, ) ``` This runs in background on Qdrant and persists permanently. ### 6. Stella Qdrant clusters dead **Symptom**: All requests to `49b5f0e9-...` and `65c05851-...` clusters returned 404. **Cause**: Those clusters (used for `stella-400M-v5` embeddings in the notebooks) were deleted or expired. **Fix**: Pivoted entirely to the BGE-M3 dense collection at `2fe1965b-...` which is alive and has 1,596,587 points. ### 7. TemplateResponse deprecation warning **Symptom**: Deprecation warning on every request. **Cause**: Old Starlette API: `TemplateResponse("name.html", {"request": request, ...})`. **Fix**: Updated all calls to the new positional form: `TemplateResponse(request, "name.html", context_without_request)`. ### 8. Test assertion too strict for old-style arXiv IDs **Symptom**: `test_normalise_id` parametrized case for `math/0702129` failed with `AssertionError`. **Cause**: The assertion `assert "." in r or r.isdigit()` fails for slash-style IDs which contain neither. **Fix**: Changed assertion to `assert isinstance(r, str) and len(r) > 0`. ### 9. `MAX_POSITIVES` in `user_state.py` was hardcoded **Symptom**: Changing `REC_POSITIVE_LIMIT` in config had no effect on the actual deque size. **Cause**: `MAX_POSITIVES = 20` was a bare integer literal, not referencing config. **Fix**: Changed to `MAX_POSITIVES = config.REC_POSITIVE_LIMIT` so the two values are always in sync. --- ## What Phase 2 Adds See [PHASE2_PLAN.md](PHASE2_PLAN.md) for the full plan. The short version: 1. **Semantic search** — replace arXiv keyword API with BGE-M3 + Qdrant dense search + Zilliz sparse search (hybrid) 2. **LLM query rewriting** — Groq `llama-3.3-70b` converts casual queries into academic keyword strings before encoding 3. **RRF + reranker** — fuses dense and sparse results, applies citation + recency signals 4. **New service files** — `embed_svc.py`, `zilliz_svc.py`, `groq_svc.py`, `hybrid_search_svc.py` 5. Everything else (recommendations, user state, templates, saved page, event logging) stays unchanged --- ## Test Results (Final) ``` tests/test_user_state.py 10 passed tests/test_db.py 7 passed tests/test_arxiv_svc.py 11 passed tests/test_qdrant_svc.py 5 passed tests/test_integration.py 12 passed tests/test_saved.py 9 passed ───────────────────────────────────────── 54 passed in ~42s ``` All routes registered: ``` GET / GET /search GET /saved POST /api/papers/{paper_id}/save POST /api/papers/{paper_id}/not-interested GET /api/recommendations ```