Phase 1: ArXiv Recommender System
What Was Built
A fully working, zero-ML-inference personalized arXiv paper recommender web app.
Users search arXiv, save papers they like, and get increasingly personalized recommendations driven by Qdrant's native Recommend API, without loading any embedding model at runtime.
Architecture Overview
Browser
  │  HTMX requests (partial HTML swaps)
  ▼
FastAPI (Uvicorn ASGI)
  ├── GET  /                                → home page (search bar + lazy-load recs)
  ├── GET  /search?q=                       → arXiv search results
  ├── GET  /saved                           → saved papers page
  ├── POST /api/papers/{id}/save            → log save, update hot cache
  ├── POST /api/papers/{id}/not-interested  → log dismiss, remove card
  └── GET  /api/recommendations             → Qdrant Recommend → arXiv metadata
        │
        ├── arXiv API (export.arxiv.org)    → search + metadata fetch
        ├── SQLite WAL (aiosqlite)          → events, ID map, metadata cache
        └── Qdrant Cloud (BGE-M3 dense)     → Recommend API (1.6M papers)
No ML model is loaded or executed at runtime in Phase 1. The Qdrant collection (arxiv_bgem3_dense) was pre-indexed with BGE-M3 embeddings. Recommendations are generated purely from the vector space: Qdrant's BEST_SCORE strategy finds papers near the user's saved papers and away from dismissed ones.
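To make "near the saved papers and away from dismissed ones" concrete, here is a rough sketch of the scoring idea behind BEST_SCORE as we understand it from Qdrant's documentation. Qdrant does this server-side; the app never runs anything like it. The function names and the use of cosine similarity are illustrative assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_score(candidate: Vector, positives: list, negatives: list) -> float:
    """Score a candidate against its closest positive and closest negative example."""
    best_pos = max(cosine(candidate, p) for p in positives)
    best_neg = max((cosine(candidate, n) for n in negatives), default=float("-inf"))
    if best_pos > best_neg:
        # Closer to a saved paper than to any dismissed one: keep the positive score.
        return best_pos
    # Closer to a dismissed paper: push the candidate down with a negative score.
    return -(best_neg * best_neg)
```

With one saved vector `[1, 0]` and one dismissed vector `[0, 1]`, a candidate such as `[0.9, 0.1]` gets a positive score and `[0.1, 0.9]` gets a negative one, so the first ranks well above the second.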
File Structure
ResearchIT-Final/
├── app/
│   ├── __init__.py
│   ├── config.py                  # all settings + credentials
│   ├── db.py                      # SQLite layer (3 tables)
│   ├── arxiv_svc.py               # arXiv API client + metadata cache
│   ├── user_state.py              # in-memory hot cache per user
│   ├── qdrant_svc.py              # Qdrant ID lookup + Recommend API
│   ├── templates_env.py           # shared Jinja2 env (custom filter)
│   ├── main.py                    # FastAPI app + lifespan
│   └── routers/
│       ├── search.py              # GET /search
│       ├── events.py              # POST /api/papers/{id}/save|not-interested
│       ├── recommendations.py     # GET /api/recommendations
│       └── saved.py               # GET /saved (added in Phase 1 completion)
├── app/templates/
│   ├── base.html                  # DaisyUI + TailwindCSS CDN + HTMX CDN
│   ├── index.html                 # home (search bar + recommendation section)
│   ├── search.html                # full search results page
│   ├── saved.html                 # saved papers page (added in Phase 1 completion)
│   └── partials/
│       ├── paper_card.html        # single paper card
│       ├── action_buttons.html    # save / not-interested buttons
│       ├── search_results.html    # HTMX partial for search
│       ├── recommendations.html   # HTMX partial for recommendations (+ refresh btn)
│       └── empty_recs.html        # shown when not enough saves yet (+ check btn)
├── tests/
│   ├── test_user_state.py         # 10 unit tests
│   ├── test_db.py                 # 7 async integration tests
│   ├── test_arxiv_svc.py          # 11 tests (normalise, parse, live API, cache)
│   ├── test_qdrant_svc.py         # 5 tests (cache warm, live lookup, live recommend)
│   ├── test_integration.py        # 12 full HTTP tests via FastAPI TestClient
│   └── test_saved.py              # 10 tests for /saved page (added in Phase 1 completion)
├── run.py                         # uvicorn entry point
├── requirements.txt
└── pytest.ini
How to Run
Prerequisites
pip install -r requirements.txt
Start the server
python run.py
Open http://localhost:8000.
Run the tests
python -m pytest
Full suite: 55 tests across 6 files.
Some tests hit live services (arXiv API, Qdrant Cloud) and run by default. To skip them:
python -m pytest -m "not live"
Core Modules
app/config.py
Single source of truth for all settings. Every credential and tunable is here; they can all be overridden via environment variables.
| Setting | Default | Purpose |
|---|---|---|
| QDRANT_URL | Qdrant Cloud EU | BGE-M3 dense collection endpoint |
| QDRANT_COLLECTION | arxiv_bgem3_dense | 1,596,587 integer-ID points |
| DB_PATH | interactions.db | SQLite file path |
| ARXIV_API_URL | https://export.arxiv.org/api/query | arXiv Atom feed |
| REC_LIMIT | 10 | Papers shown per recommendation batch |
| REC_POSITIVE_LIMIT | 20 | Max positive examples kept in memory per user |
| REC_MIN_POSITIVES | 1 | Saves needed before showing recs |
| COOKIE_NAME | arxiv_user_id | UUID4, 1-year cookie |
app/db.py
SQLite with WAL mode + PRAGMA synchronous=NORMAL for safe concurrent reads under asyncio. Three tables:
interactions: append-only event log. Every save, dismiss, click and view lands here. Two indexes: (user_id, timestamp DESC) for history fetch, (user_id, paper_id) for deduplication. The source field tracks where the action came from ("search", "recommendation", or "saved").
paper_qdrant_map: maps arxiv_id TEXT → qdrant_point_id INTEGER. Populated lazily on first save. Once an ID is mapped it is reused forever, because the Qdrant collection is static.
paper_metadata: SQLite cache of arXiv API responses. Stores title, abstract, authors (JSON), category, published date. Prevents redundant API calls. There is no TTL enforcement in Phase 1 (metadata rarely changes).
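As a minimal sketch of the pragmas and the ID-map table described above, the snippet below uses the standard-library sqlite3 module rather than aiosqlite so it runs without extra dependencies. The helper names and the exact column constraints are assumptions for illustration, not the project's actual schema.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    """Open the database with WAL journaling and create the ID-map table."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")      # readers do not block the writer
    conn.execute("PRAGMA synchronous=NORMAL")    # fewer fsyncs, still safe with WAL
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paper_qdrant_map ("
        " arxiv_id TEXT PRIMARY KEY,"
        " qdrant_point_id INTEGER NOT NULL)"
    )
    return conn


def save_point_id(conn: sqlite3.Connection, arxiv_id: str, point_id: int) -> None:
    """Persist one arxiv_id -> point ID mapping."""
    conn.execute(
        "INSERT OR REPLACE INTO paper_qdrant_map VALUES (?, ?)",
        (arxiv_id, point_id),
    )
    conn.commit()


def get_point_ids(conn: sqlite3.Connection, arxiv_ids: list) -> dict:
    """Return the known mappings for a batch of arXiv IDs (misses are omitted)."""
    if not arxiv_ids:
        return {}
    marks = ",".join("?" * len(arxiv_ids))
    rows = conn.execute(
        "SELECT arxiv_id, qdrant_point_id FROM paper_qdrant_map"
        f" WHERE arxiv_id IN ({marks})",
        list(arxiv_ids),
    ).fetchall()
    return dict(rows)
```

IDs missing from the returned dict are the cache misses that would then be looked up in Qdrant.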
app/arxiv_svc.py
Thin async client around https://export.arxiv.org/api/query (Atom XML feed).
ID normalisation: the arXiv API returns IDs as full URLs with version suffixes, e.g. http://arxiv.org/abs/1706.03762v5. _normalise_id() strips the URL prefix and the version suffix (here v5) so we always work with bare IDs like 1706.03762. Old-format IDs (math/0702129) are also handled.
search(query): fetches up to 10 results, writes them all into the metadata cache, and returns a list of paper dicts.
fetch_metadata_batch(ids): checks SQLite first, then fetches missing IDs from arXiv in batches of 20 with a 0.35 s gap between requests (just under 3 requests per second) to throttle load on the arXiv API.
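The normalisation and batching steps above can be sketched as follows. This is an illustrative version under stated assumptions, not the project's code: the regular expressions and the `batches` helper are hypothetical, and only the input and output formats come from the description.

```python
import re

# Assumed patterns: an http(s) arxiv.org/abs/ prefix and a trailing "v<digits>" version.
_URL_PREFIX = re.compile(r"^https?://arxiv\.org/abs/")
_VERSION_SUFFIX = re.compile(r"v\d+$")


def normalise_id(raw: str) -> str:
    """Reduce an arXiv URL or versioned ID to a bare ID."""
    bare = _URL_PREFIX.sub("", raw.strip())
    return _VERSION_SUFFIX.sub("", bare)


def batches(ids: list, size: int = 20) -> list:
    """Split a list of IDs into consecutive chunks of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]
```

A caller would sleep for the configured gap between chunks, for example with `await asyncio.sleep(0.35)` in an async client.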
app/user_state.py
Pure in-memory dictionary of UserState dataclasses, one per user_id. Each state holds two deques:
- positives: maxlen config.REC_POSITIVE_LIMIT (20), most-recent first
- negatives: maxlen 50, most-recent first
Mutual exclusion: saving a paper removes it from negatives and vice versa.
Lazy hydration: ensure_loaded() is called once per user per server process. It reads the last 70 interactions from SQLite and replays them into the two deques. After that, all reads are O(1) dict lookups in memory.
MAX_POSITIVES is sourced from config.REC_POSITIVE_LIMIT so the deque cap and the config are always in sync. Changing REC_POSITIVE_LIMIT in config automatically changes how many positives are kept in memory.
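A minimal sketch of the per-user state described above, assuming the behaviour listed (bounded deques, most-recent first, mutual exclusion, no duplicates). The constants stand in for the config values, and the `_move` helper is a hypothetical name.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_POSITIVES = 20   # stands in for config.REC_POSITIVE_LIMIT
MAX_NEGATIVES = 50


@dataclass
class UserState:
    positives: deque = field(default_factory=lambda: deque(maxlen=MAX_POSITIVES))
    negatives: deque = field(default_factory=lambda: deque(maxlen=MAX_NEGATIVES))

    def record_positive(self, paper_id: str) -> None:
        self._move(paper_id, src=self.negatives, dst=self.positives)

    def record_negative(self, paper_id: str) -> None:
        self._move(paper_id, src=self.positives, dst=self.negatives)

    @staticmethod
    def _move(paper_id: str, src: deque, dst: deque) -> None:
        if paper_id in src:
            src.remove(paper_id)   # mutual exclusion between the two lists
        if paper_id in dst:
            dst.remove(paper_id)   # no duplicates: re-insert at the front instead
        dst.appendleft(paper_id)   # a full deque drops its oldest (rightmost) item


# One state object per user_id, held in a plain dict.
users: dict = {}
```

Because both deques are created with a maxlen, eviction of the oldest entry happens automatically on appendleft and needs no extra bookkeeping.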
app/qdrant_svc.py
Two responsibilities:
lookup_qdrant_ids(arxiv_ids): translates arXiv string IDs to Qdrant integer point IDs. Checks the paper_qdrant_map SQLite table first. For cache misses, calls client.scroll() with a MatchAny payload filter on the arxiv_id field (requires the keyword index created during setup). Persists new mappings back to SQLite.
recommend(positive_ids, negative_ids, seen_ids): translates both lists to integer IDs, then calls client.query_points() with:
RecommendQuery(
    recommend=RecommendInput(
        positive=pos_ids,
        negative=neg_ids,
        strategy=RecommendStrategy.BEST_SCORE,
    )
)
Fetches limit * 2 results so that already-seen papers can be filtered out in Python before returning the final limit results.
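The over-fetch-then-filter step can be sketched in a few lines. The function name and signature are hypothetical; only the idea (request twice the limit, drop seen papers, keep the first `limit`) comes from the description above.

```python
def filter_seen(candidate_ids: list, seen_ids: set, limit: int) -> list:
    """Drop already-seen point IDs, preserving Qdrant's ranking order."""
    fresh = [pid for pid in candidate_ids if pid not in seen_ids]
    return fresh[:limit]
```

If more than half of the over-fetched candidates were already seen, this returns fewer than `limit` results, which the caller has to tolerate.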
Why the sync Qdrant client inside run_in_executor? The official qdrant-client async client has known issues in some environments. Running the sync client in a thread pool sidesteps them and still keeps the asyncio event loop unblocked.
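A small sketch of the thread-pool pattern, assuming a blocking function in place of the real Qdrant call. `blocking_recommend` is a hypothetical stand-in that just sleeps and returns dummy IDs.

```python
import asyncio
import time
from functools import partial


def blocking_recommend(positive_ids: list, limit: int) -> list:
    """Hypothetical stand-in for a synchronous Qdrant client call."""
    time.sleep(0.05)            # simulates network latency
    return list(range(limit))   # dummy point IDs


async def recommend_async(positive_ids: list, limit: int = 10) -> list:
    """Run the blocking call in the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    call = partial(blocking_recommend, positive_ids, limit)
    return await loop.run_in_executor(None, call)
```

While the worker thread waits on the network, the event loop can keep serving other requests.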
app/routers/recommendations.py
Fetches REC_LIMIT candidates from Qdrant (already filtered for seen papers inside qdrant_svc.recommend()), then fetches their metadata and renders the cards. No year filtering: classic foundational papers (2015, 2017, etc.) are valid and valuable recommendations.
app/routers/saved.py
GET /saved loads the user's current positive_list from user_state, fetches metadata for all of them via arxiv_svc.fetch_metadata_batch(), and renders them using the same paper_card.html partial with saved=True. The Remove button on each card works identically to everywhere else: it POSTs to not-interested and HTMX removes the card.
app/templates_env.py
Shared Jinja2 Environment instance imported by all routers. Registers one custom filter:
tojson_parse: converts a JSON string stored in SQLite (e.g. authors array) back to a Python list. Returns [] on any parse error. This prevents the template from crashing when the DB column contains malformed JSON.
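A filter with this behaviour might look like the sketch below. The extra check that the parsed value is a list is our assumption; the description only specifies the empty-list fallback on parse errors.

```python
import json


def tojson_parse(value):
    """Parse a JSON string into a list, falling back to [] on bad input."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):   # None, non-strings, or malformed JSON
        return []
    # Assumption: anything that is not a list is treated as unusable.
    return parsed if isinstance(parsed, list) else []
```

With Jinja2 it would be registered on the shared environment as `env.filters["tojson_parse"] = tojson_parse` and used in a template as `{{ paper.authors | tojson_parse }}`.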
Frontend Design
Zero build step. CSS is loaded from the TailwindCSS CDN and styled with DaisyUI components. JavaScript is provided entirely by HTMX; no custom JS was written.
HTMX patterns used:
| Pattern | Where | Effect |
|---|---|---|
| hx-get="/search" hx-trigger="input changed delay:300ms" | Search bar | Live search as you type |
| hx-get="/api/recommendations" hx-trigger="load" | Recs section | Lazy-load recs after page paint |
| hx-post=".../save" hx-target="#actions-{id}" hx-swap="innerHTML" | Save button | Replace button group with "Saved" state in-place |
| hx-post=".../not-interested" hx-target="#paper-{id}" hx-swap="outerHTML swap:200ms" | Dismiss button | Animate-remove the whole card |
| hx-get="/api/recommendations" hx-target="#rec-section" | Refresh button | Reload recommendations after saving more papers |
Source tracking: every action button carries a source field in hx-vals that is logged to the DB. Values: "search" (from search results), "recommendation" (from the recs section), "saved" (from the saved papers page). The source is forwarded back to the rendered partial after a save so subsequent actions from that partial carry the correct source.
Tests
tests/test_user_state.py: 10 unit tests
Pure unit tests, no I/O, no fixtures needed.
- test_add_positive: paper appears in positive_list
- test_add_negative: paper appears in negative_list
- test_mutual_exclusion_pos_to_neg: saving then dismissing the same paper moves it
- test_mutual_exclusion_neg_to_pos: dismissing then saving moves it back
- test_no_duplicate_positives: saving same paper twice only stores it once
- test_ordering_positives: most recently saved paper is first
- test_maxlen_eviction_positives: 21st save evicts the oldest
- test_has_enough_for_recs: False at 0 saves, True at REC_MIN_POSITIVES
- test_all_seen: union of positives and negatives
tests/test_db.py: 7 async tests
Each test uses a fresh tmp_path SQLite file via monkeypatch.setattr(config, "DB_PATH", ...). DB state never bleeds between tests.
- test_init_creates_tables: all 3 tables present after init
- test_log_and_retrieve_interaction: round-trip save + fetch
- test_filter_by_event_type: only save rows returned when filtered
- test_qdrant_id_roundtrip: save and retrieve a point ID
- test_qdrant_ids_batch: batch fetch returns correct dict
- test_metadata_cache_roundtrip: single paper insert + fetch
- test_metadata_cache_batch: multiple papers, batch fetch
tests/test_arxiv_svc.py: 11 tests
- 7 parametrised _normalise_id tests covering URL form, bare ID, v suffix, old-style slash IDs
- test_parse_entry: parses XML entry string directly
- test_fetch_metadata_live: real arXiv API call for 0704.0002
- test_search_live: real arXiv API search for "attention is all you need"
- test_fetch_metadata_cache_hit: mocked HTTP to verify SQLite cache is used on second call
tests/test_qdrant_svc.py: 5 tests
- test_lookup_cache_warm: if SQLite already has the ID, Qdrant is never called
- test_lookup_cache_miss_fetches_and_persists: missing ID triggers Qdrant scroll, result saved to SQLite
- test_recommend_empty_no_positives: returns [] immediately without hitting Qdrant
- test_lookup_real_qdrant: live lookup: 0704.0002 → point ID 0
- test_recommend_real_qdrant: live recommend: saves 0704.0002, gets real recommendations back
tests/test_integration.py: 12 full HTTP tests
Uses FastAPI TestClient (Starlette's synchronous test client). Isolated SQLite per test via monkeypatching.
- test_home_returns_200: GET / works
- test_home_sets_cookie: user ID cookie is set
- test_search_empty_query: no query = no results shown
- test_search_with_query_htmx: HTMX header returns partial (no <html> tag)
- test_search_real_query: live arXiv search via TestClient
- test_save_paper_logs_interaction: POST save → DB row created
- test_save_paper_returns_saved_state: response HTML contains "Saved"
- test_not_interested_returns_empty: POST dismiss → 200 empty body
- test_not_interested_updates_state: state reflects dismiss
- test_recommendations_empty_for_new_user: no saves = empty recs partial
- test_recommendations_after_save: mocked Qdrant + arXiv returns recommendation cards (year ≥ 2022)
- test_full_pipeline_smoke: search → save → dismiss → recs, all in sequence
tests/test_saved.py: 10 tests
- test_saved_page_returns_200: GET /saved works
- test_saved_page_sets_cookie: cookie is set on fresh visit
- test_saved_page_empty_for_new_user: shows empty-state message
- test_saved_page_shows_paper_after_save: paper appears after saving
- test_saved_page_shows_correct_count: badge shows correct count for 2 saves
- test_remove_paper_updates_state: dismiss moves paper to negatives
- test_remove_returns_empty_response: empty response body (HTMX removes card)
- test_save_source_is_logged: source field persisted to DB
- test_dismiss_source_saved_is_logged: dismiss from saved page logs correctly
- test_old_paper_filtered_from_recommendations: 2017 paper excluded, 2023 paper included
Design Decisions
No model loading at runtime
Phase 1 is deliberately zero-ML-inference. The BGE-M3 embeddings were pre-indexed into Qdrant by a notebook (bme-arxiv-test.ipynb). At request time we only need integer point IDs β no vectors, no tokeniser, no GPU.
This gives:
- Instant cold start (< 1 second)
- A tiny memory footprint (< 100 MB)
- Surprisingly good recommendation quality: Qdrant's BEST_SCORE strategy in a well-indexed 1024-dim space works well even without query encoding
arXiv API + SQLite as the metadata layer
Qdrant payloads contain only arxiv_id. Title, abstract, authors, and category all come from the arXiv API and are cached in SQLite. This was the only viable option given the payload structure, and it has a nice property: the cache warms up naturally as users search, so recommendation metadata is usually already cached by the time it is needed.
Lazy arxiv_id → Qdrant point ID mapping
We don't pre-populate the SQLite map for all 1.6M papers. Instead, when a user saves a paper a background asyncio task (asyncio.create_task) fires a Qdrant scroll filter to find that paper's point ID. This is a one-time cost per unique paper. Subsequent recommendations are instant since the ID is cached.
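The fire-and-forget lookup can be sketched with plain asyncio. Everything here is a hypothetical stand-in: a dict replaces the paper_qdrant_map table and a fake index replaces the Qdrant scroll call.

```python
import asyncio

# Hypothetical stand-ins for the SQLite map and the remote Qdrant index.
point_id_cache: dict = {}
FAKE_QDRANT_INDEX = {"0704.0002": 0}
_background_tasks: set = set()


async def resolve_point_id(arxiv_id: str) -> None:
    """Look up a point ID once and cache it; later calls return immediately."""
    if arxiv_id in point_id_cache:
        return
    await asyncio.sleep(0)   # placeholder for the network round trip
    point_id = FAKE_QDRANT_INDEX.get(arxiv_id)
    if point_id is not None:
        point_id_cache[arxiv_id] = point_id


def schedule_lookup(arxiv_id: str) -> asyncio.Task:
    """Start the lookup in the background without delaying the save response."""
    task = asyncio.create_task(resolve_point_id(arxiv_id))
    _background_tasks.add(task)   # keep a reference so the task is not garbage-collected
    task.add_done_callback(_background_tasks.discard)
    return task
```

A save handler would call `schedule_lookup(paper_id)` and return straight away; the mapping is then ready by the time recommendations are next requested.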
Cookie-based user identity
No login, no accounts. A UUID4 is generated on first visit and stored in a 1-year cookie. This is intentional for Phase 1: simple to implement, good enough for a research tool, easy to replace with real auth in Phase 2.
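The identity logic is small enough to sketch without a web framework. The helper name and return shape are hypothetical; the cookie name comes from the settings table and the max-age is simply one year in seconds.

```python
import uuid

COOKIE_NAME = "arxiv_user_id"
COOKIE_MAX_AGE = 365 * 24 * 60 * 60   # one year, in seconds


def resolve_user_id(cookies: dict) -> tuple:
    """Return (user_id, is_new); is_new means the cookie still has to be set."""
    existing = cookies.get(COOKIE_NAME)
    if existing:
        return existing, False
    return str(uuid.uuid4()), True
```

In a FastAPI handler, a new ID would then be written with `response.set_cookie(COOKIE_NAME, user_id, max_age=COOKIE_MAX_AGE)`.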
Separation of DB writes and in-memory reads
Every save/dismiss writes to SQLite synchronously in the event handler. The user_state module maintains in-memory deques as a read cache. Because the asyncio event loop runs handlers on a single thread, cache updates that contain no await points cannot interleave, so no locking is needed. The cache is loaded lazily on first access and then kept live by direct calls to record_positive / record_negative.
Source tracking on every action
Every save and dismiss carries a source field ("search", "recommendation", "saved") that is logged to the interactions table. This enables future analytics about which surface drives the most engagement. After a save, the source is forwarded to the rendered action_buttons.html partial so that any subsequent Remove action from the same card also carries the correct source.
Bugs Found and Fixed During Implementation
1. arXiv API 301 redirect
Symptom: httpx raised an HTTP error on all arXiv requests.
Cause: http://export.arxiv.org returns a 301 redirect to HTTPS. httpx doesn't follow redirects by default.
Fix: Changed ARXIV_API_URL to https://export.arxiv.org/api/query and added follow_redirects=True to all httpx.AsyncClient calls.
2. Jinja2 UndefinedError in action_buttons.html
Symptom: POST /api/papers/{id}/save returned 500 when rendering the button partial.
Cause: The template used paper_id | default(paper.arxiv_id). Jinja2's default() filter eagerly evaluates both sides before choosing, so paper.arxiv_id was evaluated even when paper was not in the template context (the events router only passes paper_id).
Fix: Changed to {% set pid = paper_id if paper_id is defined else paper.arxiv_id %} which short-circuits correctly.
3. action_buttons.html hardcoded source: "search" everywhere
Symptom: Every action from the recommendations section or the saved page was logged with source="search" in the DB.
Cause: hx-vals='{"source": "search"}' was hardcoded, so the source context variable passed by the parent template ("recommendation", "saved") was never read.
Fix: Added {% set _source = source if source is defined else "search" %} and used {{ _source }} in all hx-vals. Also fixed events.py to forward the received source form field back to the action_buttons.html template context after a save.
4. Qdrant recommend() deprecated
Symptom: DeprecationWarning and incorrect results.
Cause: client.recommend() is the old API. PointIdsList has a points field, not positive/negative.
Fix: Switched to client.query_points() with RecommendQuery(recommend=RecommendInput(...)), the current recommended pattern.
5. Qdrant payload filter fails without index
Symptom: Qdrant returned an error: "Index required but not found for field arxiv_id".
Cause: Filtering on a payload field requires a payload index. The collection was created without one.
Fix: Created a keyword index on the arxiv_id field:
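from qdrant_client.models import PayloadSchemaType

# Keyword index so that payload filters on arxiv_id are allowed
client.create_payload_index(
    collection_name=collection,
    field_name="arxiv_id",
    field_schema=PayloadSchemaType.KEYWORD,
    wait=False,  # return immediately; Qdrant builds the index in the background
)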
This runs in background on Qdrant and persists permanently.
6. Stella Qdrant clusters dead
Symptom: All requests to 49b5f0e9-... and 65c05851-... clusters returned 404.
Cause: Those clusters (used for stella-400M-v5 embeddings in the notebooks) were deleted or expired.
Fix: Pivoted entirely to the BGE-M3 dense collection at 2fe1965b-... which is alive and has 1,596,587 points.
7. TemplateResponse deprecation warning
Symptom: Deprecation warning on every request.
Cause: Old Starlette API: TemplateResponse("name.html", {"request": request, ...}).
Fix: Updated all calls to the new positional form: TemplateResponse(request, "name.html", context_without_request).
8. Test assertion too strict for old-style arXiv IDs
Symptom: test_normalise_id parametrized case for math/0702129 failed with AssertionError.
Cause: The assertion assert "." in r or r.isdigit() fails for slash-style IDs which contain neither.
Fix: Changed assertion to assert isinstance(r, str) and len(r) > 0.
9. MAX_POSITIVES in user_state.py was hardcoded
Symptom: Changing REC_POSITIVE_LIMIT in config had no effect on the actual deque size.
Cause: MAX_POSITIVES = 20 was a bare integer literal, not referencing config.
Fix: Changed to MAX_POSITIVES = config.REC_POSITIVE_LIMIT so the two values are always in sync.
What Phase 2 Adds
See PHASE2_PLAN.md for the full plan. The short version:
- Semantic search: replace arXiv keyword API with BGE-M3 + Qdrant dense search + Zilliz sparse search (hybrid)
- LLM query rewriting: Groq llama-3.3-70b converts casual queries into academic keyword strings before encoding
- RRF + reranker: fuses dense and sparse results, applies citation + recency signals
- New service files: embed_svc.py, zilliz_svc.py, groq_svc.py, hybrid_search_svc.py
- Everything else (recommendations, user state, templates, saved page, event logging) stays unchanged
Test Results (Final)
tests/test_user_state.py     10 passed
tests/test_db.py              7 passed
tests/test_arxiv_svc.py      11 passed
tests/test_qdrant_svc.py      5 passed
tests/test_integration.py    12 passed
tests/test_saved.py           9 passed
─────────────────────────────────────────
54 passed in ~42s
All routes registered:
GET /
GET /search
GET /saved
POST /api/papers/{paper_id}/save
POST /api/papers/{paper_id}/not-interested
GET /api/recommendations