ResearchIT / docs /phases /PHASE1-Zero-ML-Recommender.md
siddhm11
Phase 3 complete: Hybrid Semantic Search pipeline
d5a6f3e

Phase 1 β€” ArXiv Recommender System

What Was Built

A fully working, zero-ML-inference personalized arXiv paper recommender web app.

Users search arXiv, save papers they like, and get increasingly personalized recommendations driven by Qdrant's native Recommend API β€” without loading any embedding model at runtime.


Architecture Overview

Browser
  β”‚  HTMX requests (partial HTML swaps)
  β–Ό
FastAPI (Uvicorn ASGI)
  β”œβ”€β”€ GET  /                        β†’ home page (search bar + lazy-load recs)
  β”œβ”€β”€ GET  /search?q=               β†’ arXiv search results
  β”œβ”€β”€ GET  /saved                   β†’ saved papers page
  β”œβ”€β”€ POST /api/papers/{id}/save    β†’ log save, update hot cache
  β”œβ”€β”€ POST /api/papers/{id}/not-interested β†’ log dismiss, remove card
  └── GET  /api/recommendations     β†’ Qdrant Recommend β†’ arXiv metadata
         β”‚
         β”œβ”€β”€ arXiv API (export.arxiv.org)  β€” search + metadata fetch
         β”œβ”€β”€ SQLite WAL (aiosqlite)         β€” events, ID map, metadata cache
         └── Qdrant Cloud (BGE-M3 dense)   β€” Recommend API (1.6M papers)

No ML model is loaded or executed at runtime in Phase 1. The Qdrant collection (arxiv_bgem3_dense) was pre-indexed with BGE-M3 embeddings. Recommendations are generated purely from the vector space: Qdrant's BEST_SCORE strategy finds papers near the user's saved papers and away from dismissed ones.


File Structure

ResearchIT-Final/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ config.py           # all settings + credentials
β”‚   β”œβ”€β”€ db.py               # SQLite layer (3 tables)
β”‚   β”œβ”€β”€ arxiv_svc.py        # arXiv API client + metadata cache
β”‚   β”œβ”€β”€ user_state.py       # in-memory hot cache per user
β”‚   β”œβ”€β”€ qdrant_svc.py       # Qdrant ID lookup + Recommend API
β”‚   β”œβ”€β”€ templates_env.py    # shared Jinja2 env (custom filter)
β”‚   β”œβ”€β”€ main.py             # FastAPI app + lifespan
β”‚   └── routers/
β”‚       β”œβ”€β”€ search.py          # GET /search
β”‚       β”œβ”€β”€ events.py          # POST /api/papers/{id}/save|not-interested
β”‚       β”œβ”€β”€ recommendations.py # GET /api/recommendations
β”‚       └── saved.py           # GET /saved  ← added in Phase 1 completion
β”œβ”€β”€ app/templates/
β”‚   β”œβ”€β”€ base.html           # DaisyUI + TailwindCSS CDN + HTMX CDN
β”‚   β”œβ”€β”€ index.html          # home (search bar + recommendation section)
β”‚   β”œβ”€β”€ search.html         # full search results page
β”‚   β”œβ”€β”€ saved.html          # saved papers page  ← added in Phase 1 completion
β”‚   └── partials/
β”‚       β”œβ”€β”€ paper_card.html         # single paper card
β”‚       β”œβ”€β”€ action_buttons.html     # save / not-interested buttons
β”‚       β”œβ”€β”€ search_results.html     # HTMX partial for search
β”‚       β”œβ”€β”€ recommendations.html    # HTMX partial for recommendations (+ refresh btn)
β”‚       └── empty_recs.html         # shown when not enough saves yet (+ check btn)
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ test_user_state.py  # 10 unit tests
β”‚   β”œβ”€β”€ test_db.py          # 7 async integration tests
β”‚   β”œβ”€β”€ test_arxiv_svc.py   # 11 tests (normalise, parse, live API, cache)
β”‚   β”œβ”€β”€ test_qdrant_svc.py  # 5 tests (cache warm, live lookup, live recommend)
β”‚   β”œβ”€β”€ test_integration.py # 12 full HTTP tests via FastAPI TestClient
β”‚   └── test_saved.py       # 10 tests for /saved page  ← added in Phase 1 completion
β”œβ”€β”€ run.py                  # uvicorn entry point
β”œβ”€β”€ requirements.txt
└── pytest.ini

How to Run

Prerequisites

pip install -r requirements.txt

Start the server

python run.py

Open http://localhost:8000.

Run the tests

python -m pytest

Full suite: 55 tests across 6 files.

Some tests hit live services (arXiv API, Qdrant Cloud) and run by default. To skip them:

python -m pytest -m "not live"

Core Modules

app/config.py

Single source of truth for all settings. Every credential and tunable is here; they can all be overridden via environment variables.

Setting Default Purpose
QDRANT_URL Qdrant Cloud EU BGE-M3 dense collection endpoint
QDRANT_COLLECTION arxiv_bgem3_dense 1,596,587 integer-ID points
DB_PATH interactions.db SQLite file path
ARXIV_API_URL https://export.arxiv.org/api/query arXiv Atom feed
REC_LIMIT 10 Papers shown per recommendation batch
REC_POSITIVE_LIMIT 20 Max positive examples kept in memory per user
REC_MIN_POSITIVES 1 Saves needed before showing recs
COOKIE_NAME arxiv_user_id UUID4, 1-year cookie

app/db.py

SQLite with WAL mode + PRAGMA synchronous=NORMAL for safe concurrent reads under asyncio. Three tables:

interactions β€” append-only event log. Every save, dismiss, click and view lands here. Two indexes: (user_id, timestamp DESC) for history fetch, (user_id, paper_id) for deduplication. The source field tracks where the action came from ("search", "recommendation", or "saved").

paper_qdrant_map β€” maps arxiv_id TEXT β†’ qdrant_point_id INTEGER. Populated lazily on first save. Once an ID is mapped it is reused forever β€” the Qdrant collection is static.

paper_metadata β€” SQLite cache of arXiv API responses. Stores title, abstract, authors (JSON), category, published date. Prevents redundant API calls. There is no TTL enforcement in Phase 1 (metadata rarely changes).


app/arxiv_svc.py

Thin async client around https://export.arxiv.org/api/query (Atom XML feed).

ID normalisation β€” the arXiv API returns IDs as full URLs with version suffixes, e.g. http://arxiv.org/abs/1706.03762v5. _normalise_id() strips the URL prefix and v5 suffix so we always work with bare IDs like 1706.03762. Old-format IDs (math/0702129) are also handled.

search(query) β€” fetches up to 10 results, writes them all into the metadata cache, returns list of paper dicts.

fetch_metadata_batch(ids) β€” checks SQLite first, then fetches missing IDs from arXiv in batches of 20 with a 0.35s gap between requests (respects the arXiv 3 req/s rate limit).


app/user_state.py

Pure in-memory dictionary of UserState dataclasses, one per user_id. Each state holds two deques:

  • positives β€” maxlen config.REC_POSITIVE_LIMIT (20), most-recent first
  • negatives β€” maxlen 50, most-recent first

Mutual exclusion: saving a paper removes it from negatives and vice versa.

Lazy hydration: ensure_loaded() is called once per user per server process. It reads the last 70 interactions from SQLite and replays them into the deque. After that, all reads are O(1) dict lookups in memory.

MAX_POSITIVES is sourced from config.REC_POSITIVE_LIMIT so the deque cap and the config are always in sync. Changing REC_POSITIVE_LIMIT in config automatically changes how many positives are kept in memory.


app/qdrant_svc.py

Two responsibilities:

lookup_qdrant_ids(arxiv_ids) β€” translates arxiv string IDs to Qdrant integer point IDs. Checks paper_qdrant_map SQLite table first. For cache misses, calls client.scroll() with a MatchAny payload filter on the arxiv_id field (requires the keyword index created during setup). Persists new mappings back to SQLite.

recommend(positive_ids, negative_ids, seen_ids) β€” translates both lists to integer IDs, then calls client.query_points() with:

RecommendQuery(
    recommend=RecommendInput(
        positive=pos_ids,
        negative=neg_ids,
        strategy=RecommendStrategy.BEST_SCORE,
    )
)

Fetches limit * 2 results so that already-seen papers can be filtered out in Python before returning the final limit results.

Why sync Qdrant client inside run_in_executor? The official qdrant-client async client has known issues with some environments. Using the sync client in a thread pool is the recommended production pattern β€” it keeps the asyncio event loop unblocked.


app/routers/recommendations.py

Fetches REC_LIMIT candidates from Qdrant (already filtered for seen papers inside qdrant_svc.recommend()), then fetches their metadata and renders the cards. No year filtering β€” classic foundational papers (2015, 2017, etc.) are valid and valuable recommendations.


app/routers/saved.py

GET /saved loads the user's current positive_list from user_state, fetches metadata for all of them via arxiv_svc.fetch_metadata_batch(), and renders them using the same paper_card.html partial with saved=True. The Remove button on each card works identically to everywhere else β€” it POSTs to not-interested and HTMX removes the card.


app/templates_env.py

Shared Jinja2 Environment instance imported by all routers. Registers one custom filter:

tojson_parse β€” converts a JSON string stored in SQLite (e.g. authors array) back to a Python list. Returns [] on any parse error. This prevents the template from crashing when the DB column contains malformed JSON.


Frontend Design

Zero build step. CSS is loaded from the TailwindCSS CDN and styled with DaisyUI components. JavaScript is provided entirely by HTMX β€” no custom JS written.

HTMX patterns used:

Pattern Where Effect
hx-get="/search" hx-trigger="input changed delay:300ms" Search bar Live search as you type
hx-get="/api/recommendations" hx-trigger="load" Recs section Lazy-load recs after page paint
hx-post=".../save" hx-target="#actions-{id}" hx-swap="innerHTML" Save button Replace button group with "Saved" state in-place
hx-post=".../not-interested" hx-target="#paper-{id}" hx-swap="outerHTML swap:200ms" Dismiss button Animate-remove the whole card
hx-get="/api/recommendations" hx-target="#rec-section" Refresh button Reload recommendations after saving more papers

Source tracking: every action button carries a source field in hx-vals that is logged to the DB. Values: "search" (from search results), "recommendation" (from the recs section), "saved" (from the saved papers page). The source is forwarded back to the rendered partial after a save so subsequent actions from that partial carry the correct source.


Tests

tests/test_user_state.py β€” 10 unit tests

Pure unit tests, no I/O, no fixtures needed.

  • test_add_positive β€” paper appears in positive_list
  • test_add_negative β€” paper appears in negative_list
  • test_mutual_exclusion_pos_to_neg β€” saving then dismissing the same paper moves it
  • test_mutual_exclusion_neg_to_pos β€” dismissing then saving moves it back
  • test_no_duplicate_positives β€” saving same paper twice only stores it once
  • test_ordering_positives β€” most recently saved paper is first
  • test_maxlen_eviction_positives β€” 21st save evicts the oldest
  • test_has_enough_for_recs β€” False at 0 saves, True at REC_MIN_POSITIVES
  • test_all_seen β€” union of positives and negatives

tests/test_db.py β€” 7 async tests

Each test uses a fresh tmp_path SQLite file via monkeypatch.setattr(config, "DB_PATH", ...). DB state never bleeds between tests.

  • test_init_creates_tables β€” all 3 tables present after init
  • test_log_and_retrieve_interaction β€” round-trip save + fetch
  • test_filter_by_event_type β€” only save rows returned when filtered
  • test_qdrant_id_roundtrip β€” save and retrieve a point ID
  • test_qdrant_ids_batch β€” batch fetch returns correct dict
  • test_metadata_cache_roundtrip β€” single paper insert + fetch
  • test_metadata_cache_batch β€” multiple papers, batch fetch

tests/test_arxiv_svc.py β€” 11 tests

  • 7 parametrised _normalise_id tests covering URL form, bare ID, v suffix, old-style slash IDs
  • test_parse_entry β€” parses XML entry string directly
  • test_fetch_metadata_live β€” real arXiv API call for 0704.0002
  • test_search_live β€” real arXiv API search for "attention is all you need"
  • test_fetch_metadata_cache_hit β€” mocked HTTP to verify SQLite cache is used on second call

tests/test_qdrant_svc.py β€” 5 tests

  • test_lookup_cache_warm β€” if SQLite already has the ID, Qdrant is never called
  • test_lookup_cache_miss_fetches_and_persists β€” missing ID triggers Qdrant scroll, result saved to SQLite
  • test_recommend_empty_no_positives β€” returns [] immediately without hitting Qdrant
  • test_lookup_real_qdrant β€” live lookup: 0704.0002 β†’ point ID 0
  • test_recommend_real_qdrant β€” live recommend: saves 0704.0002, gets real recommendations back

tests/test_integration.py β€” 12 full HTTP tests

Uses FastAPI TestClient (Starlette's synchronous test client). Isolated SQLite per test via monkeypatching.

  • test_home_returns_200 β€” GET / works
  • test_home_sets_cookie β€” user ID cookie is set
  • test_search_empty_query β€” no query = no results shown
  • test_search_with_query_htmx β€” HTMX header returns partial (no <html> tag)
  • test_search_real_query β€” live arXiv search via TestClient
  • test_save_paper_logs_interaction β€” POST save β†’ DB row created
  • test_save_paper_returns_saved_state β€” response HTML contains "Saved"
  • test_not_interested_returns_empty β€” POST dismiss β†’ 200 empty body
  • test_not_interested_updates_state β€” state reflects dismiss
  • test_recommendations_empty_for_new_user β€” no saves = empty recs partial
  • test_recommendations_after_save β€” mocked Qdrant + arXiv returns recommendation cards (year β‰₯ 2022)
  • test_full_pipeline_smoke β€” search β†’ save β†’ dismiss β†’ recs, all in sequence

tests/test_saved.py β€” 10 tests

  • test_saved_page_returns_200 β€” GET /saved works
  • test_saved_page_sets_cookie β€” cookie is set on fresh visit
  • test_saved_page_empty_for_new_user β€” shows empty-state message
  • test_saved_page_shows_paper_after_save β€” paper appears after saving
  • test_saved_page_shows_correct_count β€” badge shows correct count for 2 saves
  • test_remove_paper_updates_state β€” dismiss moves paper to negatives
  • test_remove_returns_empty_response β€” empty response body (HTMX removes card)
  • test_save_source_is_logged β€” source field persisted to DB
  • test_dismiss_source_saved_is_logged β€” dismiss from saved page logs correctly
  • test_old_paper_filtered_from_recommendations β€” 2017 paper excluded, 2023 paper included

Design Decisions

No model loading at runtime

Phase 1 is deliberately zero-ML-inference. The BGE-M3 embeddings were pre-indexed into Qdrant by a notebook (bme-arxiv-test.ipynb). At request time we only need integer point IDs β€” no vectors, no tokeniser, no GPU.

This makes:

  • Cold start instant (< 1 second)
  • Memory footprint tiny (< 100 MB)
  • The recommendation quality surprisingly good β€” Qdrant's BEST_SCORE strategy in a well-indexed 1024-dim space works well even without query encoding

arXiv API + SQLite as the metadata layer

Qdrant payloads contain only arxiv_id. Title, abstract, authors, and category all come from the arXiv API and are cached in SQLite. This was the only viable option given the payload structure, and it has a nice property: the cache warms up naturally as users search, so recommendation metadata is usually already cached by the time it is needed.

Lazy arxiv_id β†’ Qdrant point ID mapping

We don't pre-populate the SQLite map for all 1.6M papers. Instead, when a user saves a paper a background asyncio task (asyncio.create_task) fires a Qdrant scroll filter to find that paper's point ID. This is a one-time cost per unique paper. Subsequent recommendations are instant since the ID is cached.

Cookie-based user identity

No login, no accounts. A UUID4 is generated on first visit and stored in a 1-year cookie. This is intentional for Phase 1 β€” simple to implement, good enough for a research tool, easy to replace with real auth in Phase 2.

Separation of DB writes and in-memory reads

Every save/dismiss writes to SQLite synchronously in the event handler. The user_state module maintains an in-memory deque as a read cache. Because asyncio is single-threaded, there are no race conditions. The cache is loaded lazily on first access and then kept live by direct calls to record_positive / record_negative.

Source tracking on every action

Every save and dismiss carries a source field ("search", "recommendation", "saved") that is logged to the interactions table. This enables future analytics about which surface drives the most engagement. After a save, the source is forwarded to the rendered action_buttons.html partial so that any subsequent Remove action from the same card also carries the correct source.


Bugs Found and Fixed During Implementation

1. arXiv API 301 redirect

Symptom: httpx raised an HTTP error on all arXiv requests. Cause: http://export.arxiv.org returns 301 β†’ HTTPS. httpx doesn't follow redirects by default. Fix: Changed ARXIV_API_URL to https://export.arxiv.org/api/query and added follow_redirects=True to all httpx.AsyncClient calls.

2. Jinja2 UndefinedError in action_buttons.html

Symptom: POST /api/papers/{id}/save returned 500 when rendering the button partial. Cause: The template used paper_id | default(paper.arxiv_id). Jinja2's default() filter eagerly evaluates both sides before choosing, so paper.arxiv_id was evaluated even when paper was not in the template context (the events router only passes paper_id). Fix: Changed to {% set pid = paper_id if paper_id is defined else paper.arxiv_id %} which short-circuits correctly.

3. action_buttons.html hardcoded source: "search" everywhere

Symptom: Every action from the recommendations section or the saved page was logged with source="search" in the DB. Cause: hx-vals='{"source": "search"}' was hardcoded β€” the source context variable passed by the parent template ("recommendation", "saved") was never read. Fix: Added {% set _source = source if source is defined else "search" %} and used {{ _source }} in all hx-vals. Also fixed events.py to forward the received source form field back to the action_buttons.html template context after a save.

4. Qdrant recommend() deprecated

Symptom: DeprecationWarning and incorrect results. Cause: client.recommend() is the old API. PointIdsList has a points field, not positive/negative. Fix: Switched to client.query_points() with RecommendQuery(recommend=RecommendInput(...)) β€” the current recommended pattern.

5. Qdrant payload filter fails without index

Symptom: Qdrant returned an error: "Index required but not found for field arxiv_id". Cause: Filtering on a payload field requires a payload index. The collection was created without one. Fix: Created a keyword index on the arxiv_id field:

client.create_payload_index(
    collection_name=collection,
    field_name="arxiv_id",
    field_schema=PayloadSchemaType.KEYWORD,
    wait=False,
)

This runs in background on Qdrant and persists permanently.

6. Stella Qdrant clusters dead

Symptom: All requests to 49b5f0e9-... and 65c05851-... clusters returned 404. Cause: Those clusters (used for stella-400M-v5 embeddings in the notebooks) were deleted or expired. Fix: Pivoted entirely to the BGE-M3 dense collection at 2fe1965b-... which is alive and has 1,596,587 points.

7. TemplateResponse deprecation warning

Symptom: Deprecation warning on every request. Cause: Old Starlette API: TemplateResponse("name.html", {"request": request, ...}). Fix: Updated all calls to the new positional form: TemplateResponse(request, "name.html", context_without_request).

8. Test assertion too strict for old-style arXiv IDs

Symptom: test_normalise_id parametrized case for math/0702129 failed with AssertionError. Cause: The assertion assert "." in r or r.isdigit() fails for slash-style IDs which contain neither. Fix: Changed assertion to assert isinstance(r, str) and len(r) > 0.

9. MAX_POSITIVES in user_state.py was hardcoded

Symptom: Changing REC_POSITIVE_LIMIT in config had no effect on the actual deque size. Cause: MAX_POSITIVES = 20 was a bare integer literal, not referencing config. Fix: Changed to MAX_POSITIVES = config.REC_POSITIVE_LIMIT so the two values are always in sync.


What Phase 2 Adds

See PHASE2_PLAN.md for the full plan. The short version:

  1. Semantic search β€” replace arXiv keyword API with BGE-M3 + Qdrant dense search + Zilliz sparse search (hybrid)
  2. LLM query rewriting β€” Groq llama-3.3-70b converts casual queries into academic keyword strings before encoding
  3. RRF + reranker β€” fuses dense and sparse results, applies citation + recency signals
  4. New service files β€” embed_svc.py, zilliz_svc.py, groq_svc.py, hybrid_search_svc.py
  5. Everything else (recommendations, user state, templates, saved page, event logging) stays unchanged

Test Results (Final)

tests/test_user_state.py      10 passed
tests/test_db.py               7 passed
tests/test_arxiv_svc.py       11 passed
tests/test_qdrant_svc.py       5 passed
tests/test_integration.py     12 passed
tests/test_saved.py            9 passed
─────────────────────────────────────────
54 passed in ~42s

All routes registered:

GET  /
GET  /search
GET  /saved
POST /api/papers/{paper_id}/save
POST /api/papers/{paper_id}/not-interested
GET  /api/recommendations