Spaces:

siddhm11
/

ResearchIT

Running

App Files Files Community

ResearchIT / docs /phases /PHASE1-Zero-ML-Recommender.md

siddhm11

Phase 3 complete: Hybrid Semantic Search pipeline

d5a6f3e about 1 month ago

preview code

raw

history blame contribute delete

21.9 kB

Phase 1 — ArXiv Recommender System

What Was Built

A fully working, zero-ML-inference personalized arXiv paper recommender web app.

Users search arXiv, save papers they like, and get increasingly personalized recommendations driven by Qdrant's native Recommend API — without loading any embedding model at runtime.

Architecture Overview

Browser
  │  HTMX requests (partial HTML swaps)
  ▼
FastAPI (Uvicorn ASGI)
  ├── GET  /                        → home page (search bar + lazy-load recs)
  ├── GET  /search?q=               → arXiv search results
  ├── GET  /saved                   → saved papers page
  ├── POST /api/papers/{id}/save    → log save, update hot cache
  ├── POST /api/papers/{id}/not-interested → log dismiss, remove card
  └── GET  /api/recommendations     → Qdrant Recommend → arXiv metadata
         │
         ├── arXiv API (export.arxiv.org)  — search + metadata fetch
         ├── SQLite WAL (aiosqlite)         — events, ID map, metadata cache
         └── Qdrant Cloud (BGE-M3 dense)   — Recommend API (1.6M papers)

No ML model is loaded or executed at runtime in Phase 1. The Qdrant collection (arxiv_bgem3_dense) was pre-indexed with BGE-M3 embeddings. Recommendations are generated purely from the vector space: Qdrant's BEST_SCORE strategy finds papers near the user's saved papers and away from dismissed ones.

File Structure

ResearchIT-Final/
├── app/
│   ├── __init__.py
│   ├── config.py           # all settings + credentials
│   ├── db.py               # SQLite layer (3 tables)
│   ├── arxiv_svc.py        # arXiv API client + metadata cache
│   ├── user_state.py       # in-memory hot cache per user
│   ├── qdrant_svc.py       # Qdrant ID lookup + Recommend API
│   ├── templates_env.py    # shared Jinja2 env (custom filter)
│   ├── main.py             # FastAPI app + lifespan
│   └── routers/
│       ├── search.py          # GET /search
│       ├── events.py          # POST /api/papers/{id}/save|not-interested
│       ├── recommendations.py # GET /api/recommendations
│       └── saved.py           # GET /saved  ← added in Phase 1 completion
├── app/templates/
│   ├── base.html           # DaisyUI + TailwindCSS CDN + HTMX CDN
│   ├── index.html          # home (search bar + recommendation section)
│   ├── search.html         # full search results page
│   ├── saved.html          # saved papers page  ← added in Phase 1 completion
│   └── partials/
│       ├── paper_card.html         # single paper card
│       ├── action_buttons.html     # save / not-interested buttons
│       ├── search_results.html     # HTMX partial for search
│       ├── recommendations.html    # HTMX partial for recommendations (+ refresh btn)
│       └── empty_recs.html         # shown when not enough saves yet (+ check btn)
├── tests/
│   ├── test_user_state.py  # 10 unit tests
│   ├── test_db.py          # 7 async integration tests
│   ├── test_arxiv_svc.py   # 11 tests (normalise, parse, live API, cache)
│   ├── test_qdrant_svc.py  # 5 tests (cache warm, live lookup, live recommend)
│   ├── test_integration.py # 12 full HTTP tests via FastAPI TestClient
│   └── test_saved.py       # 10 tests for /saved page  ← added in Phase 1 completion
├── run.py                  # uvicorn entry point
├── requirements.txt
└── pytest.ini

How to Run

Prerequisites

pip install -r requirements.txt

Start the server

python run.py

Open http://localhost:8000.

Run the tests

python -m pytest

Full suite: 55 tests across 6 files.

Some tests hit live services (arXiv API, Qdrant Cloud) and run by default. To skip them:

python -m pytest -m "not live"

Core Modules

`app/config.py`

Single source of truth for all settings. Every credential and tunable is here; they can all be overridden via environment variables.

Setting	Default	Purpose
`QDRANT_URL`	Qdrant Cloud EU	BGE-M3 dense collection endpoint
`QDRANT_COLLECTION`	`arxiv_bgem3_dense`	1,596,587 integer-ID points
`DB_PATH`	`interactions.db`	SQLite file path
`ARXIV_API_URL`	`https://export.arxiv.org/api/query`	arXiv Atom feed
`REC_LIMIT`	10	Papers shown per recommendation batch
`REC_POSITIVE_LIMIT`	20	Max positive examples kept in memory per user
`REC_MIN_POSITIVES`	1	Saves needed before showing recs
`COOKIE_NAME`	`arxiv_user_id`	UUID4, 1-year cookie

`app/db.py`

SQLite with WAL mode + PRAGMA synchronous=NORMAL for safe concurrent reads under asyncio. Three tables:

interactions — append-only event log. Every save, dismiss, click and view lands here. Two indexes: (user_id, timestamp DESC) for history fetch, (user_id, paper_id) for deduplication. The source field tracks where the action came from ("search", "recommendation", or "saved").

paper_qdrant_map — maps arxiv_id TEXT → qdrant_point_id INTEGER. Populated lazily on first save. Once an ID is mapped it is reused forever — the Qdrant collection is static.

paper_metadata — SQLite cache of arXiv API responses. Stores title, abstract, authors (JSON), category, published date. Prevents redundant API calls. There is no TTL enforcement in Phase 1 (metadata rarely changes).

`app/arxiv_svc.py`

Thin async client around https://export.arxiv.org/api/query (Atom XML feed).

ID normalisation — the arXiv API returns IDs as full URLs with version suffixes, e.g. http://arxiv.org/abs/1706.03762v5. _normalise_id() strips the URL prefix and v5 suffix so we always work with bare IDs like 1706.03762. Old-format IDs (math/0702129) are also handled.

search(query) — fetches up to 10 results, writes them all into the metadata cache, returns list of paper dicts.

fetch_metadata_batch(ids) — checks SQLite first, then fetches missing IDs from arXiv in batches of 20 with a 0.35s gap between requests (respects the arXiv 3 req/s rate limit).

`app/user_state.py`

Pure in-memory dictionary of UserState dataclasses, one per user_id. Each state holds two deques:

positives — maxlen config.REC_POSITIVE_LIMIT (20), most-recent first
negatives — maxlen 50, most-recent first

Mutual exclusion: saving a paper removes it from negatives and vice versa.

Lazy hydration: ensure_loaded() is called once per user per server process. It reads the last 70 interactions from SQLite and replays them into the deque. After that, all reads are O(1) dict lookups in memory.

MAX_POSITIVES is sourced from config.REC_POSITIVE_LIMIT so the deque cap and the config are always in sync. Changing REC_POSITIVE_LIMIT in config automatically changes how many positives are kept in memory.

`app/qdrant_svc.py`

Two responsibilities:

lookup_qdrant_ids(arxiv_ids) — translates arxiv string IDs to Qdrant integer point IDs. Checks paper_qdrant_map SQLite table first. For cache misses, calls client.scroll() with a MatchAny payload filter on the arxiv_id field (requires the keyword index created during setup). Persists new mappings back to SQLite.

recommend(positive_ids, negative_ids, seen_ids) — translates both lists to integer IDs, then calls client.query_points() with:

RecommendQuery(
    recommend=RecommendInput(
        positive=pos_ids,
        negative=neg_ids,
        strategy=RecommendStrategy.BEST_SCORE,
    )
)

Fetches limit * 2 results so that already-seen papers can be filtered out in Python before returning the final limit results.

Why sync Qdrant client inside run_in_executor? The official qdrant-client async client has known issues with some environments. Using the sync client in a thread pool is the recommended production pattern — it keeps the asyncio event loop unblocked.

`app/routers/recommendations.py`

Fetches REC_LIMIT candidates from Qdrant (already filtered for seen papers inside qdrant_svc.recommend()), then fetches their metadata and renders the cards. No year filtering — classic foundational papers (2015, 2017, etc.) are valid and valuable recommendations.

`app/routers/saved.py`

GET /saved loads the user's current positive_list from user_state, fetches metadata for all of them via arxiv_svc.fetch_metadata_batch(), and renders them using the same paper_card.html partial with saved=True. The Remove button on each card works identically to everywhere else — it POSTs to not-interested and HTMX removes the card.

`app/templates_env.py`

Shared Jinja2 Environment instance imported by all routers. Registers one custom filter:

tojson_parse — converts a JSON string stored in SQLite (e.g. authors array) back to a Python list. Returns [] on any parse error. This prevents the template from crashing when the DB column contains malformed JSON.

Frontend Design

Zero build step. CSS is loaded from the TailwindCSS CDN and styled with DaisyUI components. JavaScript is provided entirely by HTMX — no custom JS written.

HTMX patterns used:

Pattern	Where	Effect
`hx-get="/search" hx-trigger="input changed delay:300ms"`	Search bar	Live search as you type
`hx-get="/api/recommendations" hx-trigger="load"`	Recs section	Lazy-load recs after page paint
`hx-post=".../save" hx-target="#actions-{id}" hx-swap="innerHTML"`	Save button	Replace button group with "Saved" state in-place
`hx-post=".../not-interested" hx-target="#paper-{id}" hx-swap="outerHTML swap:200ms"`	Dismiss button	Animate-remove the whole card
`hx-get="/api/recommendations" hx-target="#rec-section"`	Refresh button	Reload recommendations after saving more papers

Source tracking: every action button carries a source field in hx-vals that is logged to the DB. Values: "search" (from search results), "recommendation" (from the recs section), "saved" (from the saved papers page). The source is forwarded back to the rendered partial after a save so subsequent actions from that partial carry the correct source.

Tests

`tests/test_user_state.py` — 10 unit tests

Pure unit tests, no I/O, no fixtures needed.

test_add_positive — paper appears in positive_list
test_add_negative — paper appears in negative_list
test_mutual_exclusion_pos_to_neg — saving then dismissing the same paper moves it
test_mutual_exclusion_neg_to_pos — dismissing then saving moves it back
test_no_duplicate_positives — saving same paper twice only stores it once
test_ordering_positives — most recently saved paper is first
test_maxlen_eviction_positives — 21st save evicts the oldest
test_has_enough_for_recs — False at 0 saves, True at REC_MIN_POSITIVES
test_all_seen — union of positives and negatives

`tests/test_db.py` — 7 async tests

Each test uses a fresh tmp_path SQLite file via monkeypatch.setattr(config, "DB_PATH", ...). DB state never bleeds between tests.

test_init_creates_tables — all 3 tables present after init
test_log_and_retrieve_interaction — round-trip save + fetch
test_filter_by_event_type — only save rows returned when filtered
test_qdrant_id_roundtrip — save and retrieve a point ID
test_qdrant_ids_batch — batch fetch returns correct dict
test_metadata_cache_roundtrip — single paper insert + fetch
test_metadata_cache_batch — multiple papers, batch fetch

`tests/test_arxiv_svc.py` — 11 tests

7 parametrised _normalise_id tests covering URL form, bare ID, v suffix, old-style slash IDs
test_parse_entry — parses XML entry string directly
test_fetch_metadata_live — real arXiv API call for 0704.0002
test_search_live — real arXiv API search for "attention is all you need"
test_fetch_metadata_cache_hit — mocked HTTP to verify SQLite cache is used on second call

`tests/test_qdrant_svc.py` — 5 tests

test_lookup_cache_warm — if SQLite already has the ID, Qdrant is never called
test_lookup_cache_miss_fetches_and_persists — missing ID triggers Qdrant scroll, result saved to SQLite
test_recommend_empty_no_positives — returns [] immediately without hitting Qdrant
test_lookup_real_qdrant — live lookup: 0704.0002 → point ID 0
test_recommend_real_qdrant — live recommend: saves 0704.0002, gets real recommendations back

`tests/test_integration.py` — 12 full HTTP tests

Uses FastAPI TestClient (Starlette's synchronous test client). Isolated SQLite per test via monkeypatching.

test_home_returns_200 — GET / works
test_home_sets_cookie — user ID cookie is set
test_search_empty_query — no query = no results shown
test_search_with_query_htmx — HTMX header returns partial (no <html> tag)
test_search_real_query — live arXiv search via TestClient
test_save_paper_logs_interaction — POST save → DB row created
test_save_paper_returns_saved_state — response HTML contains "Saved"
test_not_interested_returns_empty — POST dismiss → 200 empty body
test_not_interested_updates_state — state reflects dismiss
test_recommendations_empty_for_new_user — no saves = empty recs partial
test_recommendations_after_save — mocked Qdrant + arXiv returns recommendation cards (year ≥ 2022)
test_full_pipeline_smoke — search → save → dismiss → recs, all in sequence

`tests/test_saved.py` — 10 tests

test_saved_page_returns_200 — GET /saved works
test_saved_page_sets_cookie — cookie is set on fresh visit
test_saved_page_empty_for_new_user — shows empty-state message
test_saved_page_shows_paper_after_save — paper appears after saving
test_saved_page_shows_correct_count — badge shows correct count for 2 saves
test_remove_paper_updates_state — dismiss moves paper to negatives
test_remove_returns_empty_response — empty response body (HTMX removes card)
test_save_source_is_logged — source field persisted to DB
test_dismiss_source_saved_is_logged — dismiss from saved page logs correctly
test_old_paper_filtered_from_recommendations — 2017 paper excluded, 2023 paper included

Design Decisions

No model loading at runtime

Phase 1 is deliberately zero-ML-inference. The BGE-M3 embeddings were pre-indexed into Qdrant by a notebook (bme-arxiv-test.ipynb). At request time we only need integer point IDs — no vectors, no tokeniser, no GPU.

This makes:

Cold start instant (< 1 second)
Memory footprint tiny (< 100 MB)
The recommendation quality surprisingly good — Qdrant's BEST_SCORE strategy in a well-indexed 1024-dim space works well even without query encoding

arXiv API + SQLite as the metadata layer

Qdrant payloads contain only arxiv_id. Title, abstract, authors, and category all come from the arXiv API and are cached in SQLite. This was the only viable option given the payload structure, and it has a nice property: the cache warms up naturally as users search, so recommendation metadata is usually already cached by the time it is needed.

Lazy arxiv_id → Qdrant point ID mapping

We don't pre-populate the SQLite map for all 1.6M papers. Instead, when a user saves a paper a background asyncio task (asyncio.create_task) fires a Qdrant scroll filter to find that paper's point ID. This is a one-time cost per unique paper. Subsequent recommendations are instant since the ID is cached.

Cookie-based user identity

No login, no accounts. A UUID4 is generated on first visit and stored in a 1-year cookie. This is intentional for Phase 1 — simple to implement, good enough for a research tool, easy to replace with real auth in Phase 2.

Separation of DB writes and in-memory reads

Every save/dismiss writes to SQLite synchronously in the event handler. The user_state module maintains an in-memory deque as a read cache. Because asyncio is single-threaded, there are no race conditions. The cache is loaded lazily on first access and then kept live by direct calls to record_positive / record_negative.

Source tracking on every action

Every save and dismiss carries a source field ("search", "recommendation", "saved") that is logged to the interactions table. This enables future analytics about which surface drives the most engagement. After a save, the source is forwarded to the rendered action_buttons.html partial so that any subsequent Remove action from the same card also carries the correct source.

Bugs Found and Fixed During Implementation

1. arXiv API 301 redirect

Symptom: httpx raised an HTTP error on all arXiv requests. Cause: http://export.arxiv.org returns 301 → HTTPS. httpx doesn't follow redirects by default. Fix: Changed ARXIV_API_URL to https://export.arxiv.org/api/query and added follow_redirects=True to all httpx.AsyncClient calls.

2. Jinja2 UndefinedError in `action_buttons.html`

Symptom: POST /api/papers/{id}/save returned 500 when rendering the button partial. Cause: The template used paper_id | default(paper.arxiv_id). Jinja2's default() filter eagerly evaluates both sides before choosing, so paper.arxiv_id was evaluated even when paper was not in the template context (the events router only passes paper_id). Fix: Changed to {% set pid = paper_id if paper_id is defined else paper.arxiv_id %} which short-circuits correctly.

3. `action_buttons.html` hardcoded `source: "search"` everywhere

Symptom: Every action from the recommendations section or the saved page was logged with source="search" in the DB. Cause: hx-vals='{"source": "search"}' was hardcoded — the source context variable passed by the parent template ("recommendation", "saved") was never read. Fix: Added {% set _source = source if source is defined else "search" %} and used {{ _source }} in all hx-vals. Also fixed events.py to forward the received source form field back to the action_buttons.html template context after a save.

4. Qdrant `recommend()` deprecated

Symptom: DeprecationWarning and incorrect results. Cause: client.recommend() is the old API. PointIdsList has a points field, not positive/negative. Fix: Switched to client.query_points() with RecommendQuery(recommend=RecommendInput(...)) — the current recommended pattern.

5. Qdrant payload filter fails without index

Symptom: Qdrant returned an error: "Index required but not found for field arxiv_id". Cause: Filtering on a payload field requires a payload index. The collection was created without one. Fix: Created a keyword index on the arxiv_id field:

client.create_payload_index(
    collection_name=collection,
    field_name="arxiv_id",
    field_schema=PayloadSchemaType.KEYWORD,
    wait=False,
)

This runs in background on Qdrant and persists permanently.

6. Stella Qdrant clusters dead

Symptom: All requests to 49b5f0e9-... and 65c05851-... clusters returned 404. Cause: Those clusters (used for stella-400M-v5 embeddings in the notebooks) were deleted or expired. Fix: Pivoted entirely to the BGE-M3 dense collection at 2fe1965b-... which is alive and has 1,596,587 points.

7. TemplateResponse deprecation warning

Symptom: Deprecation warning on every request. Cause: Old Starlette API: TemplateResponse("name.html", {"request": request, ...}). Fix: Updated all calls to the new positional form: TemplateResponse(request, "name.html", context_without_request).

8. Test assertion too strict for old-style arXiv IDs

Symptom: test_normalise_id parametrized case for math/0702129 failed with AssertionError. Cause: The assertion assert "." in r or r.isdigit() fails for slash-style IDs which contain neither. Fix: Changed assertion to assert isinstance(r, str) and len(r) > 0.

9. `MAX_POSITIVES` in `user_state.py` was hardcoded

Symptom: Changing REC_POSITIVE_LIMIT in config had no effect on the actual deque size. Cause: MAX_POSITIVES = 20 was a bare integer literal, not referencing config. Fix: Changed to MAX_POSITIVES = config.REC_POSITIVE_LIMIT so the two values are always in sync.

What Phase 2 Adds

See PHASE2_PLAN.md for the full plan. The short version:

Semantic search — replace arXiv keyword API with BGE-M3 + Qdrant dense search + Zilliz sparse search (hybrid)
LLM query rewriting — Groq llama-3.3-70b converts casual queries into academic keyword strings before encoding
RRF + reranker — fuses dense and sparse results, applies citation + recency signals
New service files — embed_svc.py, zilliz_svc.py, groq_svc.py, hybrid_search_svc.py
Everything else (recommendations, user state, templates, saved page, event logging) stays unchanged

Test Results (Final)

tests/test_user_state.py      10 passed
tests/test_db.py               7 passed
tests/test_arxiv_svc.py       11 passed
tests/test_qdrant_svc.py       5 passed
tests/test_integration.py     12 passed
tests/test_saved.py            9 passed
─────────────────────────────────────────
54 passed in ~42s

All routes registered:

GET  /
GET  /search
GET  /saved
POST /api/papers/{paper_id}/save
POST /api/papers/{paper_id}/not-interested
GET  /api/recommendations