# Code Tour: ArXiv Recommender (Phase 1)
A file-by-file walkthrough of every piece of the codebase: what it does, how it works, and why it was written the way it was.
---
## Entry Points
### `run.py`
```python
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",
        host="127.0.0.1",
        port=8000,
        reload=True,
        reload_dirs=["app"],
    )
```
Nothing special here. Starts Uvicorn pointing at the FastAPI `app` object. `reload=True` watches the `app/` directory and hot-reloads on file changes. Run with `python run.py`.
---
### `app/main.py`
```python
# Imports reconstructed from the names used below
import uuid
from contextlib import asynccontextmanager

from fastapi import Cookie, FastAPI, Request
from fastapi.responses import HTMLResponse

from app import db, user_state as us
from app.config import APP_TITLE, COOKIE_NAME
from app.routers import search, events, recommendations, saved
from app.templates_env import templates


@asynccontextmanager
async def lifespan(app: FastAPI):
    await db.init_db()  # create the tables once, at startup
    yield


app = FastAPI(title=APP_TITLE, lifespan=lifespan)
app.include_router(search.router)
app.include_router(events.router)
app.include_router(recommendations.router)
app.include_router(saved.router)


@app.get("/", response_class=HTMLResponse)
async def home(request: Request, user_id=Cookie(...)):
    user_id = user_id or str(uuid.uuid4())
    state = await us.ensure_loaded(user_id)
    resp = templates.TemplateResponse(request, "index.html", {
        "has_recs": state.has_enough_for_recs(),
        "save_count": len(state.positives),
    })
    resp.set_cookie(COOKIE_NAME, user_id, max_age=365*24*3600, httponly=True)
    return resp
```
**`lifespan`** is an async context manager that FastAPI runs around the application's lifetime. It calls `init_db()` once when the server starts, which creates the three SQLite tables if they don't exist, then yields control to the app.
**The home route** is the only one that lives in `main.py`. Everything else is in routers. It reads the user's cookie, loads their state from memory/DB, and renders `index.html` with two flags: `has_recs` (enough saves to show recommendations?) and `save_count` (how many papers saved so far).
**Cookie pattern**: every route that might be a user's first visit creates a UUID4 if no cookie exists, and re-sets the cookie with a fresh `max_age` on every response. The cookie therefore expires one year after the user's last visit, not one year after the first.
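The cookie pattern can be sketched with the standard library alone. `resolve_user_id` and `build_set_cookie` are hypothetical helper names for illustration, not functions from the codebase; the real routes rely on FastAPI's `Cookie` and `set_cookie`.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Optional

COOKIE_NAME = "arxiv_user_id"
COOKIE_MAX_AGE = 60 * 60 * 24 * 365  # one year, in seconds


def resolve_user_id(cookie_value: Optional[str]) -> str:
    """Reuse the visitor's ID if a cookie was sent, otherwise mint a UUID4."""
    return cookie_value or str(uuid.uuid4())


def build_set_cookie(user_id: str) -> str:
    """Build a Set-Cookie header value that restarts the one-year window."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = user_id
    cookie[COOKIE_NAME]["max-age"] = COOKIE_MAX_AGE
    cookie[COOKIE_NAME]["httponly"] = True
    return cookie[COOKIE_NAME].OutputString()


returning = resolve_user_id("abc")   # existing visitor keeps their ID
first_visit = resolve_user_id(None)  # new visitor gets a fresh UUID4
header = build_set_cookie(returning)
```

Because the header is rebuilt on every response, the expiry window slides forward with each visit.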
---
## Configuration
### `app/config.py`
```python
import os
QDRANT_URL = os.getenv("QDRANT_URL", "https://2fe1965b-...eu-west-2-0.aws...")
QDRANT_API_KEY = os.getenv("QDRANT_API_KEY", "eyJhbGci...")
QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "arxiv_bgem3_dense")
DB_PATH = os.getenv("DB_PATH", "interactions.db")
ARXIV_API_URL = "https://export.arxiv.org/api/query"
ARXIV_MAX_RESULTS = 10
METADATA_CACHE_TTL_DAYS = 30
REC_LIMIT = 10
REC_POSITIVE_LIMIT = 20
REC_MIN_POSITIVES = 1
APP_TITLE = "ArXiv Recommender"
COOKIE_NAME = "arxiv_user_id"
COOKIE_MAX_AGE = 60 * 60 * 24 * 365
```
Every credential and tunable lives here. The Qdrant settings and `DB_PATH` can be overridden with environment variables via `os.getenv("X", default)`; the remaining values are plain module constants. In production you'd set `QDRANT_API_KEY` as an env var and never commit it to git, which also means the hard-coded default should not be a real key.
**`REC_POSITIVE_LIMIT = 20`** controls how many saved papers are kept in the in-memory deque *and* how many are sent to Qdrant as positive examples. This is the only place you change it; `user_state.py` reads it directly.
---
## Database Layer
### `app/db.py`
Three tables. The schema runs once at startup via `init_db()`.
```python
_SCHEMA = """
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;
CREATE TABLE IF NOT EXISTS interactions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id TEXT NOT NULL,
paper_id TEXT NOT NULL,
event_type TEXT NOT NULL, -- save | not_interested
source TEXT, -- search | recommendation | saved
position INTEGER,
query_id TEXT,
timestamp TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_ui_user_ts ON interactions(user_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_ui_user_paper ON interactions(user_id, paper_id);
CREATE TABLE IF NOT EXISTS paper_qdrant_map (
arxiv_id TEXT PRIMARY KEY,
qdrant_point_id INTEGER NOT NULL,
mapped_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS paper_metadata (
arxiv_id TEXT PRIMARY KEY,
title TEXT,
abstract TEXT,
authors TEXT, -- JSON array string e.g. '["Vaswani", "Shazeer"]'
category TEXT,
published TEXT,
cached_at TEXT NOT NULL DEFAULT (datetime('now'))
);
"""
```
**WAL mode** (`journal_mode=WAL`) lets readers proceed while a single writer is active. That matters because FastAPI handles requests concurrently, and in SQLite's default rollback-journal mode a write blocks readers and open reads block the writer.
**`synchronous=NORMAL`** skips the fsync on every commit. In WAL mode the database file stays consistent after a crash, but the most recent commits can be lost on power failure or an OS crash. Faster than `FULL`, with acceptable durability for a research tool.
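A minimal sketch of how the two PRAGMAs behave, using the standard-library `sqlite3` module instead of `aiosqlite` and a throwaway file (the path is an assumption for the demo):

```python
import os
import sqlite3
import tempfile

# WAL needs a real file; an in-memory database reports "memory" instead.
db_file = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_file)

# The pragma returns the journal mode that is now in effect.
journal_mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]

conn.execute("PRAGMA synchronous=NORMAL;")
# synchronous is reported as an integer: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA
synchronous = conn.execute("PRAGMA synchronous;").fetchone()[0]
conn.close()
```

`journal_mode=WAL` is stored in the database file, whereas `synchronous` is a per-connection setting. If the PRAGMAs only run inside `init_db()`, later connections would fall back to SQLite's default for `synchronous`; this is an observation about SQLite, not a confirmed property of `db.py`.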
**Three tables, three jobs:**
| Table | Job |
|---|---|
| `interactions` | Append-only event log. Never updated, only inserted. Source of truth. |
| `paper_qdrant_map` | Cache translating arxiv_id strings to Qdrant integer point IDs |
| `paper_metadata` | Cache of arXiv API responses so we don't re-fetch titles/abstracts |
**Key functions:**
```python
# Write an event
await db.log_interaction(user_id, paper_id, "save", source="search", position=2)
# Read recent events for a user (used to hydrate the in-memory cache)
rows = await db.get_user_interactions(user_id, event_types=["save", "not_interested"], limit=70)
# Qdrant ID cache
await db.save_qdrant_id("1706.03762", 523419)
cached = await db.get_qdrant_ids_batch(["1706.03762", "0704.0002"])
# -> {"1706.03762": 523419} (only IDs that were in the cache)
# Metadata cache
await db.cache_metadata({"arxiv_id": "1706.03762", "title": "Attention...", ...})
batch = await db.get_cached_metadata_batch(["1706.03762", "0704.0002"])
# -> {"1706.03762": {...}}
```
All functions use `async with aiosqlite.connect(DB_PATH)`: each call opens and closes its own connection. This is safe with WAL mode and avoids connection pool complexity.
---
## arXiv Service
### `app/arxiv_svc.py`
Handles all communication with the arXiv Atom XML API and the SQLite metadata cache.
#### ID Normalisation
arXiv IDs come in several formats from the API:
```python
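import re

# Optional "arxiv:" or URL prefix, then the ID with an optional version suffix
_ID_RE = re.compile(r"(?:arxiv:|https?://arxiv\.org/abs/)?([^\s/v]+(?:v\d+)?)")


def _normalise_id(raw: str) -> str:
    m = _ID_RE.search(raw.strip())
    bare = m.group(1)
    return re.sub(r"v\d+$", "", bare)  # drop the trailing version, e.g. "v5"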
```
| Input | Output |
|---|---|
| `http://arxiv.org/abs/1706.03762v5` | `1706.03762` |
| `https://arxiv.org/abs/1706.03762` | `1706.03762` |
| `arxiv:1706.03762v2` | `1706.03762` |
| `1706.03762v3` | `1706.03762` |
| `0704.0002` | `0704.0002` |
The bare ID is what we store everywhere: in SQLite, in the user state cache, and in the Qdrant `arxiv_id` payload field.
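The table above can be checked directly. This sketch reuses the same pattern; the `None` guard and the public name `normalise_id` are additions for the example, not part of `arxiv_svc.py`.

```python
import re

_ID_RE = re.compile(r"(?:arxiv:|https?://arxiv\.org/abs/)?([^\s/v]+(?:v\d+)?)")


def normalise_id(raw: str) -> str:
    m = _ID_RE.search(raw.strip())
    if m is None:  # guard added for this sketch
        raise ValueError(f"not an arXiv ID: {raw!r}")
    return re.sub(r"v\d+$", "", m.group(1))


examples = [
    "http://arxiv.org/abs/1706.03762v5",
    "https://arxiv.org/abs/1706.03762",
    "arxiv:1706.03762v2",
    "1706.03762v3",
    "0704.0002",
]
results = {raw: normalise_id(raw) for raw in examples}
```

One limitation worth knowing: the capture group stops at `/` and at `v`, so old-style IDs such as `hep-th/9901001` come back truncated (here as `hep-th`).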
#### XML Parsing
The arXiv API returns Atom XML. One `<entry>` element per paper:
```python
_NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def _parse_entry(entry: ET.Element) -> dict:
    # text(tag) is a small helper, not shown in this excerpt, that returns
    # the text of a child element of `entry`.
    raw_id = text("atom:id")
    arxiv_id = _normalise_id(raw_id)
    authors = [a.findtext("atom:name", ...) for a in entry.findall("atom:author", _NS)]
    cat_el = entry.find("arxiv:primary_category", _NS)
    category = cat_el.attrib.get("term", "") if cat_el is not None else ""
    published = text("atom:published")[:10]  # YYYY-MM-DD only
    return {
        "arxiv_id": arxiv_id,
        "title": text("atom:title").replace("\n", " "),
        "abstract": text("atom:summary").replace("\n", " "),
        "authors": json.dumps(authors[:5]),  # stored as JSON string in SQLite
        "category": category,
        "published": published,
        "year": int(published[:4]),
    }
```
Authors are stored as a JSON string (`'["Vaswani", "Shazeer"]'`) because SQLite has no array type. The `tojson_parse` filter in the template converts it back to a Python list for display.
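To make the namespace handling concrete, here is a self-contained sketch that parses one hand-written entry with `xml.etree.ElementTree`. The sample XML is illustrative and trimmed to the fields used; it collapses whitespace with `split()`/`join()` rather than the `replace("\n", " ")` shown above, which is a choice made for this example.

```python
import json
import xml.etree.ElementTree as ET

_NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written sample entry (illustrative values only)
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v5</id>
    <title>Attention Is All
 You Need</title>
    <published>2017-06-12T00:00:00Z</published>
    <author><name>Vaswani</name></author>
    <author><name>Shazeer</name></author>
    <arxiv:primary_category term="cs.CL"/>
  </entry>
</feed>"""

entry = ET.fromstring(SAMPLE).find("atom:entry", _NS)

# Every lookup needs the namespace map, otherwise nothing matches.
title = " ".join(entry.findtext("atom:title", default="", namespaces=_NS).split())
authors = [a.findtext("atom:name", default="", namespaces=_NS)
           for a in entry.findall("atom:author", _NS)]
cat_el = entry.find("arxiv:primary_category", _NS)
category = cat_el.attrib.get("term", "") if cat_el is not None else ""
published = entry.findtext("atom:published", default="", namespaces=_NS)[:10]

authors_json = json.dumps(authors[:5])     # what goes into SQLite
authors_back = json.loads(authors_json)    # what the template filter recovers
```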
#### Search and Fetch
```python
async def search(query: str, max_results=10) -> list[dict]:
    params = {"search_query": f"all:{query}", "start": 0,
              "max_results": max_results, "sortBy": "relevance"}
    async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
        resp = await client.get(ARXIV_API_URL, params=params)
    papers = [_parse_entry(e) for e in ET.fromstring(resp.text).findall("atom:entry", _NS)]
    for paper in papers:
        await db.cache_metadata(paper)  # cache all results immediately
    return papers


async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
    result = await db.get_cached_metadata_batch(arxiv_ids)  # check SQLite first
    missing = [aid for aid in arxiv_ids if aid not in result]
    if missing:
        # Batch up to 20 IDs per request; the 0.35 s gap caps the rate at about 3 req/s
        for i in range(0, len(missing), 20):
            chunk = missing[i:i+20]
            params = {"id_list": ",".join(chunk), "max_results": len(chunk)}
            # ... fetch, parse, cache ...
            await asyncio.sleep(0.35)
    return result
```
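`follow_redirects=True` matters because httpx does not follow redirects by default, and the arXiv API's HTTP URL redirects to HTTPS. With the HTTPS `ARXIV_API_URL` configured above, it mainly acts as a safeguard.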
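The batching in `fetch_metadata_batch` comes down to slicing the missing IDs into groups of 20 and joining each group into one `id_list` value. A small sketch, where `chunked` is a hypothetical helper and the IDs are placeholders:

```python
from typing import Iterator


def chunked(ids: list[str], size: int = 20) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


missing = [f"id-{n}" for n in range(45)]        # placeholder IDs
batches = list(chunked(missing))
id_lists = [",".join(batch) for batch in batches]  # one id_list per request
batch_sizes = [len(batch) for batch in batches]
```

With 45 missing IDs this produces three requests, so the loop sleeps three times at 0.35 s each.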
---
## User State
### `app/user_state.py`
The in-memory hot cache. Zero DB reads on the hot path.
```python
from collections import deque
from dataclasses import dataclass, field

from app import db, config

MAX_POSITIVES = config.REC_POSITIVE_LIMIT  # = 20, kept in sync with config
MAX_NEGATIVES = 50


@dataclass
class UserState:
    positives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_POSITIVES))
    negatives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_NEGATIVES))
    loaded: bool = False

    def add_positive(self, paper_id: str) -> None:
        try:
            self.negatives.remove(paper_id)  # mutual exclusion
        except ValueError:
            pass
        if paper_id not in self.positives:
            self.positives.appendleft(paper_id)  # most recent first

    def add_negative(self, paper_id: str) -> None:
        try:
            self.positives.remove(paper_id)
        except ValueError:
            pass
        if paper_id not in self.negatives:
            self.negatives.appendleft(paper_id)

    def has_enough_for_recs(self) -> bool:
        return len(self.positives) >= config.REC_MIN_POSITIVES
```
**Mutual exclusion**: `add_positive` removes the paper from negatives before adding to positives, and vice versa. So a paper can never be in both lists simultaneously.
**`appendleft`**: deques are double-ended. `appendleft` inserts at index 0 (front). When `maxlen` is reached, the rightmost (oldest) element is silently dropped. So `positive_list[0]` is always the most recently saved paper.
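The eviction behaviour is easy to see with a tiny deque. This sketch uses `maxlen=3` and placeholder IDs instead of the real limit of 20:

```python
from collections import deque

recent = deque(maxlen=3)
for paper_id in ["a", "b", "c", "d"]:  # saved in this order, "a" first
    recent.appendleft(paper_id)

# "a" was the oldest entry, so it fell off the right-hand end.
snapshot = list(recent)
newest = recent[0]
```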
```python
_cache: dict[str, UserState] = {}  # global in-process dict


async def ensure_loaded(user_id: str) -> UserState:
    state = get_user_state(user_id)
    if state.loaded:
        return state  # O(1), hot path
    # Cold path: first request from this user in this server process
    rows = await db.get_user_interactions(
        user_id, event_types=["save", "not_interested"], limit=70)
    for row in reversed(rows):  # oldest first so appendleft puts newest at front
        if row["event_type"] == "save":
            state.add_positive(row["paper_id"])
        else:
            state.add_negative(row["paper_id"])
    state.loaded = True
    return state
```
**Why `reversed(rows)`**: `get_user_interactions` returns rows newest-first (`ORDER BY timestamp DESC`). We replay them in chronological order so that `appendleft` in `add_positive` leaves the newest paper at index 0. If we replayed newest-first, the oldest save would end up at the front.
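The effect of the replay order can be reproduced in a few lines. The rows and the `replay` function below are stand-ins for the DB result and for `ensure_loaded`, not code from the project:

```python
from collections import deque

# Rows as the DB returns them: newest first
rows = [
    {"paper_id": "p3", "event_type": "save"},  # newest
    {"paper_id": "p2", "event_type": "save"},
    {"paper_id": "p1", "event_type": "save"},  # oldest
]


def replay(ordered_rows) -> list:
    """Apply saves in the given order, newest insert landing at index 0."""
    positives = deque(maxlen=20)
    for row in ordered_rows:
        if row["paper_id"] not in positives:
            positives.appendleft(row["paper_id"])
    return list(positives)


chronological = replay(reversed(rows))  # what ensure_loaded does
newest_first = replay(rows)             # the mistake it avoids
```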
```python
def record_positive(user_id: str, paper_id: str) -> None:
    get_user_state(user_id).add_positive(paper_id)  # sync, no DB


def all_seen(user_id: str) -> set[str]:
    state = get_user_state(user_id)
    return set(state.positive_list) | set(state.negative_list)
```
`all_seen` feeds the recommendation engine: any paper currently held in the user's positive or negative deque (at most 20 saves and 50 dismissals) is excluded from the results.
---
## Qdrant Service
### `app/qdrant_svc.py`
Two jobs: translate arxiv_ids to integer point IDs, and call the Recommend API.
#### Client Setup
```python
@lru_cache(maxsize=1)
def _client() -> QdrantClient:
    return QdrantClient(
        url=config.QDRANT_URL,
        api_key=config.QDRANT_API_KEY,
        timeout=30,
        check_compatibility=False,
    )
```
`@lru_cache(maxsize=1)` makes this a singleton. The client is created once, reused on every request. The sync `QdrantClient` is used (not the async one) because every call is dispatched through `loop.run_in_executor`, which runs it on a worker thread and keeps the event loop free while the network call is in flight.
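A small sketch of that executor pattern, with `time.sleep` standing in for the blocking Qdrant call. `blocking_lookup` and the heartbeat counter are hypothetical; they exist only to show that the event loop keeps running while the worker thread is busy.

```python
import asyncio
import time


def blocking_lookup(arxiv_id: str) -> int:
    """Stand-in for a synchronous network call."""
    time.sleep(0.2)
    return len(arxiv_id)


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            ticks += 1  # only advances while the loop is free
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    # None selects the loop's default ThreadPoolExecutor.
    result = await loop.run_in_executor(None, blocking_lookup, "1706.03762")
    beat.cancel()
    return result, ticks


result, ticks = asyncio.run(main())
```

Calling `blocking_lookup` directly inside the coroutine would freeze the heartbeat for the full duration of the sleep.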
#### ID Lookup
```python
async def lookup_qdrant_ids(arxiv_ids: list[str]) -> dict[str, int]:
    cached = await db.get_qdrant_ids_batch(arxiv_ids)
    missing = [aid for aid in arxiv_ids if aid not in cached]
    if missing:
        loop = asyncio.get_running_loop()
        # Run the blocking client call in the default thread pool
        results = await loop.run_in_executor(None, _scroll_by_arxiv_ids, missing)
        for arxiv_id, point_id in results.items():
            await db.save_qdrant_id(arxiv_id, point_id)
            cached[arxiv_id] = point_id
    return cached


def _scroll_by_arxiv_ids(arxiv_ids: list[str]) -> dict[str, int]:
    pts, _ = _client().scroll(
        collection_name=QDRANT_COLLECTION,
        scroll_filter=Filter(must=[
            FieldCondition(key="arxiv_id", match=MatchAny(any=arxiv_ids))
        ]),
        limit=len(arxiv_ids),
        with_payload=True,
        with_vectors=False,
    )
    return {p.payload["arxiv_id"]: p.id for p in pts}
```
`MatchAny` is Qdrant's `IN (...)`: it filters points whose `arxiv_id` payload field matches any value in the list. It relies on the keyword payload index on the collection (created once, then persistent).
The result is `{arxiv_id: integer_point_id}`. Any ID not found in the collection is simply absent from the dict, meaning that paper hasn't been indexed yet.
#### Recommend
```python
async def recommend(positive_arxiv_ids, negative_arxiv_ids, seen_arxiv_ids, limit):
    all_ids = list(dict.fromkeys(positive_arxiv_ids + negative_arxiv_ids))
    id_map = await lookup_qdrant_ids(all_ids)
    pos_ids = [id_map[aid] for aid in positive_arxiv_ids if aid in id_map]
    neg_ids = [id_map[aid] for aid in negative_arxiv_ids if aid in id_map]
    if not pos_ids:
        return []
    loop = asyncio.get_running_loop()
    results = await loop.run_in_executor(None, _run_recommend, pos_ids, neg_ids, limit*2)
    filtered = [
        r.payload["arxiv_id"]
        for r in results
        if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in seen_arxiv_ids
    ]
    return filtered[:limit]


def _run_recommend(pos_ids, neg_ids, limit):
    result = _client().query_points(
        collection_name=QDRANT_COLLECTION,
        query=RecommendQuery(
            recommend=RecommendInput(
                positive=pos_ids,
                negative=neg_ids if neg_ids else [],
                strategy=RecommendStrategy.BEST_SCORE,
            )
        ),
        limit=limit,
        with_payload=True,
        with_vectors=False,
    )
    return result.points
```
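**`BEST_SCORE` strategy**: for each candidate paper, Qdrant takes its highest similarity to any positive example and its highest similarity to any negative example. If the candidate is closer to a positive than to every negative, it is ranked by that positive similarity; otherwise it receives a negative score based on how close it sits to the nearest negative. Papers near your saves and far from your dismissals bubble to the top.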
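The scoring rule can be sketched in plain Python. This follows the formula given in Qdrant's documentation for the best-score strategy; the exact expression may differ between Qdrant versions, so treat it as an illustration and not as the server's implementation. `pos_sims` and `neg_sims` are hypothetical names for one candidate's similarities to the positive and negative examples.

```python
def best_score(pos_sims: list[float], neg_sims: list[float]) -> float:
    """Score one candidate from its similarities to positives and negatives."""
    best_pos = max(pos_sims)
    best_neg = max(neg_sims) if neg_sims else float("-inf")
    if best_pos > best_neg:
        return best_pos            # closer to a save than to any dismissal
    return -(best_neg * best_neg)  # otherwise pushed below zero


near_a_save = best_score([0.9, 0.2], [0.5])
near_a_dismissal = best_score([0.3], [0.8])
no_negatives = best_score([0.4], [])
```

Note that only the single best positive and the single best negative matter; scores are not averaged over the whole list of saves.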
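**`limit * 2` over-fetch**: we fetch double the target count so that after filtering out `seen_arxiv_ids` in Python, there are usually still enough results to return `limit` papers. If more than half of the fetched results have already been seen, fewer than `limit` come back.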
**`dict.fromkeys(...)` deduplication**: if a paper appears in both positive and negative lists (shouldn't happen due to mutual exclusion in `user_state`, but defensive), it's deduplicated before the lookup.
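Both list operations in `recommend` are small enough to try in isolation. `merge_ids` and `filter_unseen` are hypothetical names for the two steps, and the IDs are placeholders:

```python
def merge_ids(positive: list[str], negative: list[str]) -> list[str]:
    """Order-preserving de-duplication of the two ID lists."""
    return list(dict.fromkeys(positive + negative))


def filter_unseen(candidates: list[str], seen: set[str], limit: int) -> list[str]:
    """Drop already-seen IDs, then trim to the requested count."""
    return [c for c in candidates if c not in seen][:limit]


merged = merge_ids(["a", "b"], ["b", "c"])

fetched = ["a", "x", "b", "y", "z", "w"]           # limit * 2 = 6 results
enough = filter_unseen(fetched, {"a", "b"}, 3)
short = filter_unseen(fetched, {"a", "x", "b", "y"}, 3)
```

`dict.fromkeys` keeps the first occurrence of each key in insertion order, which is why it works as an ordered de-duplication where a plain `set` would not.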
---
## Routers
### `app/routers/search.py`
```python
@router.get("/search", response_class=HTMLResponse)
async def search(request: Request, q: str = "", user_id=Cookie(...)):
    papers = []
    if q.strip():
        papers = await arxiv_svc.search(q.strip())
    state = await us.ensure_loaded(user_id)
    saved_ids = set(state.positive_list)
    dismissed_ids = set(state.negative_list)
    for p in papers:
        p["saved"] = p["arxiv_id"] in saved_ids
        p["dismissed"] = p["arxiv_id"] in dismissed_ids
    if request.headers.get("HX-Request"):
        return templates.TemplateResponse(request, "partials/search_results.html",
                                          {"papers": papers, "query": q})
    else:
        return templates.TemplateResponse(request, "search.html",
                                          {"papers": papers, "query": q,
                                           "has_recs": state.has_enough_for_recs()})
```
**HTMX detection**: if the request has an `HX-Request` header (set automatically by HTMX), return only the `search_results.html` partial: just the list of cards, no `<html>` wrapper. This is what gets swapped into `#search-results` on the page without a full reload.
**Annotating papers**: after fetching from arXiv, each paper dict gets `saved` and `dismissed` booleans added. The template uses these to show the correct button state (e.g. already-saved papers show "Saved" instead of "Save").
---
### `app/routers/events.py`
```python
@router.post("/{paper_id}/save", response_class=HTMLResponse)
async def save_paper(paper_id, request, source=Form("search"),
                     position=Form(0), query_id=Form(""), user_id=Cookie(...)):
    await db.log_interaction(user_id, paper_id, "save",
                             source=source, position=position or None)
    us.record_positive(user_id, paper_id)
    asyncio.create_task(qdrant_svc.lookup_qdrant_ids([paper_id]))  # background
    return templates.TemplateResponse(request, "partials/action_buttons.html",
        {"paper_id": paper_id, "saved": True, "dismissed": False, "source": source})


@router.post("/{paper_id}/not-interested", response_class=HTMLResponse)
async def not_interested(paper_id, request, source=Form("search"), ...):
    await db.log_interaction(user_id, paper_id, "not_interested", source=source)
    us.record_negative(user_id, paper_id)
    resp = HTMLResponse(content="")  # empty body: HTMX removes the card
    resp.set_cookie(...)
    return resp
```
**Three things happen on save, in order:**
1. `db.log_interaction()`: durable write to SQLite (awaited, so the handler waits for it to finish)
2. `us.record_positive()`: in-memory update (synchronous, no I/O)
3. `asyncio.create_task(...)`: schedules a background task to look up the Qdrant point ID. The call returns immediately, and the response is sent without waiting for the lookup.
**Why background for Qdrant lookup?** The user doesn't need the Qdrant point ID for the save response. They only need it when recommendations are requested. The background task means the save response is fast (~5ms), and by the time the user navigates to the home page to see recommendations, the ID is likely already cached.
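One caveat with fire-and-forget tasks: the Python documentation notes that the event loop keeps only weak references to tasks, so a task with no other reference can be garbage-collected before it finishes. The excerpt above does not show whether the codebase keeps a reference; the sketch below shows the commonly recommended pattern. `spawn` and `fake_lookup` are hypothetical names, and the sleep stands in for the Qdrant round trip.

```python
import asyncio

_background_tasks: set = set()


def spawn(coro) -> asyncio.Task:
    """Schedule a coroutine and hold a reference until it finishes."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task


async def fake_lookup(cache: dict, paper_id: str) -> None:
    await asyncio.sleep(0.01)  # stands in for the Qdrant round trip
    cache[paper_id] = 523419


async def main() -> dict:
    cache: dict = {}
    task = spawn(fake_lookup(cache, "1706.03762"))
    # At this point a real handler would already return its response.
    responded_before_lookup = "1706.03762" not in cache
    await task              # awaited here only so the demo can inspect the result
    await asyncio.sleep(0)  # give the done-callback a chance to run
    return {"early": responded_before_lookup, "cache": cache,
            "pending": len(_background_tasks)}


outcome = asyncio.run(main())
```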
**Empty response for dismiss**: the dismiss button targets `#paper-{id}` with `hx-swap="outerHTML swap:200ms"`. An empty response body tells HTMX to replace the entire card element with nothing, so the card fades out and disappears.
**`source` is forwarded to the response template**: after a save, the rendered `action_buttons.html` partial receives the same `source` value that came in. So the "Remove" button on the now-saved card will log `source="recommendation"` if the save happened from the recs section, not `"search"`.
---
### `app/routers/recommendations.py`
```python
@router.get("/recommendations", response_class=HTMLResponse)
async def get_recommendations(request, user_id=Cookie(...)):
    state = await us.ensure_loaded(user_id)
    if not state.has_enough_for_recs():
        return _empty_resp()  # shows "Save 1 paper to unlock recs"
    rec_arxiv_ids = await qdrant_svc.recommend(
        positive_arxiv_ids=state.positive_list,
        negative_arxiv_ids=state.negative_list,
        seen_arxiv_ids=us.all_seen(user_id),
        limit=REC_LIMIT,
    )
    if not rec_arxiv_ids:
        return _empty_resp()
    meta = await arxiv_svc.fetch_metadata_batch(rec_arxiv_ids)
    papers = [{**meta[aid], "saved": False, "dismissed": False}
              for aid in rec_arxiv_ids if aid in meta]
    return templates.TemplateResponse(request, "partials/recommendations.html",
                                      {"papers": papers})
```
Linear pipeline: load state, check threshold, Qdrant recommend, fetch metadata, render. If any step returns nothing, show the empty-state partial.
---
### `app/routers/saved.py`
```python
@router.get("/saved", response_class=HTMLResponse)
async def saved_papers(request, user_id=Cookie(...)):
    state = await us.ensure_loaded(user_id)
    saved_ids = state.positive_list  # most-recent first
    papers = []
    if saved_ids:
        meta = await arxiv_svc.fetch_metadata_batch(saved_ids)
        papers = [{**meta[aid], "saved": True, "dismissed": False}
                  for aid in saved_ids if aid in meta]
    return templates.TemplateResponse(request, "saved.html",
                                      {"papers": papers, "count": len(papers)})
```
The simplest router. `positive_list` is already the source of truth for what's saved. Fetch metadata for all of them, render. `saved=True` is hardcoded because every paper on this page is by definition saved, so the action button will show "Saved" + "Remove".
---
## Templates
### `app/templates_env.py`
```python
import json

from jinja2 import Environment
from fastapi.templating import Jinja2Templates


def _tojson_parse(value: str) -> list:
    """Parse a JSON-array string; return [] for anything else."""
    try:
        result = json.loads(value)
        return result if isinstance(result, list) else []
    except Exception:
        return []


templates = Jinja2Templates(directory="app/templates")
templates.env.filters["tojson_parse"] = _tojson_parse
```
One custom filter: `tojson_parse`. SQLite stores authors as a JSON string (`'["Vaswani", "Shazeer"]'`). In the template: `{{ paper.authors | tojson_parse | join(", ") }}`. The filter parses it back to a Python list. It returns `[]` on any error, so it never crashes the template.
All routers import `templates` from here. There is only one instance, shared everywhere.
---
### `app/templates/base.html`
```html
<head>
<link href="https://cdn.jsdelivr.net/npm/daisyui@4.12.10/dist/full.min.css" rel="stylesheet"/>
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://unpkg.com/htmx.org@1.9.12"></script>
<style>
.htmx-swapping { opacity: 0; transition: opacity 200ms ease-out; }
.htmx-request .htmx-indicator,
.htmx-request.htmx-indicator { display: inline-block !important; }
.htmx-indicator { display: none; }
</style>
</head>
<body>
<div class="navbar bg-base-100 shadow-sm px-4">
<a href="/" class="text-xl font-bold text-primary">ArXiv Rec</a>
<a href="/search" class="btn btn-ghost btn-sm">Search</a>
<a href="/saved" class="btn btn-ghost btn-sm">Saved</a>
</div>
<main class="container mx-auto px-4 py-6 max-w-4xl">
{% block content %}{% endblock %}
</main>
</body>
```
Zero build step. TailwindCSS + DaisyUI from CDN, HTMX from CDN.
**HTMX CSS hooks:**
- `.htmx-swapping`: HTMX adds this class to the target element during the swap delay, just before it's replaced. The `opacity: 0` + transition creates the fade-out animation on dismissed cards.
- `.htmx-indicator`: hidden by default, and made visible while a request is in flight. HTMX puts the `htmx-request` class on the element that issued the request, or on the element named by `hx-indicator` when that attribute is set. Used for the loading spinners next to buttons.
---
### `app/templates/index.html`
```html
<!-- Search bar with live-search -->
<form hx-get="/search"
hx-target="#search-results"
hx-push-url="true"
hx-indicator="#search-spinner">
<input type="text" name="q" placeholder="e.g. transformer attention" />
<button>Search <span id="search-spinner" class="htmx-indicator loading ..."></span></button>
</form>
<!-- Recommendations: loaded after page paint -->
<div id="rec-section"
hx-get="/api/recommendations"
hx-trigger="load"
hx-swap="innerHTML">
Loading...
</div>
<!-- Search results: swapped here by HTMX -->
<div id="search-results"></div>
```
**`hx-trigger="load"`**: the `#rec-section` div fires the HTMX request as soon as it loads. The page renders immediately with "Loading..." and the recs appear ~500ms later. This way the page never feels slow: you see content instantly, then recs fill in.
**`hx-push-url="true"`**: when a search fires, HTMX pushes `/search?q=...` to the browser history. So the back button works and the URL is shareable.
---
### `app/templates/partials/paper_card.html`
```html
<div class="card bg-base-100 shadow-sm border border-base-300 p-4 space-y-2"
id="paper-{{ paper.arxiv_id }}">
<div class="flex items-start justify-between gap-2">
<a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
target="_blank" class="font-semibold text-primary hover:underline">
{{ paper.title }}
</a>
<span class="badge badge-outline badge-sm">{{ paper.category }}</span>
</div>
<div class="text-xs text-base-content/50">
[{{ paper.arxiv_id }}]
{% if paper.published %} · {{ paper.published[:4] }}{% endif %}
{% if authors_list %} · {{ authors_list | join(", ") }}{% endif %}
</div>
<p class="text-sm line-clamp-3">{{ paper.abstract }}</p>
<div id="actions-{{ paper.arxiv_id }}">
{% include "partials/action_buttons.html" %}
</div>
</div>
```
Two IDs per card: `#paper-{id}` on the outer div (target for dismiss: the whole card is removed) and `#actions-{id}` on the buttons div (target for save: only the buttons are swapped to "Saved" state).
`line-clamp-3` is a Tailwind utility that truncates the abstract to 3 lines with an ellipsis.
---
### `app/templates/partials/action_buttons.html`
```html
{% set pid = paper_id if paper_id is defined else paper.arxiv_id %}
{% set is_saved = saved if saved is defined else (paper.saved | default(false)) %}
{% set _source = source if source is defined else "search" %}
{% if is_saved %}
<button class="btn btn-success btn-xs" disabled>Saved</button>
<button hx-post="/api/papers/{{ pid }}/not-interested"
hx-target="#paper-{{ pid }}"
hx-swap="outerHTML swap:200ms"
hx-vals='{"source": "{{ _source }}"}'>Remove</button>
{% else %}
<button hx-post="/api/papers/{{ pid }}/save"
hx-target="#actions-{{ pid }}"
hx-swap="innerHTML"
hx-vals='{"source": "{{ _source }}", "position": "{{ position | default(0) }}"}'>
Save
</button>
<button hx-post="/api/papers/{{ pid }}/not-interested"
hx-target="#paper-{{ pid }}"
hx-swap="outerHTML swap:200ms"
hx-vals='{"source": "{{ _source }}"}'>
Not interested
</button>
{% endif %}
```
This partial is used in two contexts:
1. **Inside `paper_card.html`**: `paper` is defined, `paper_id` is not
2. **As a direct response from `save_paper` in `events.py`**: `paper_id` is defined, `paper` is not
The `{% set pid = ... if ... is defined else ... %}` pattern handles both safely, because the inline `if` only evaluates the branch it selects. Jinja2's `default()` filter would fail here: its argument is evaluated before the filter runs, so `paper_id | default(paper.arxiv_id)` raises on the attribute lookup whenever `paper` is undefined, even if `paper_id` is set.
**`hx-vals`** sends additional form fields with the HTMX request. The `source` and `position` values ride along with every button click to be logged in the DB.
---
### `app/templates/partials/recommendations.html`
```html
{% if papers %}
<div class="space-y-3">
{% for paper in papers %}
{% set position = loop.index0 %}
{% set source = "recommendation" %}
{% include "partials/paper_card.html" %}
{% endfor %}
</div>
<div class="text-center pt-3">
<button hx-get="/api/recommendations"
hx-target="#rec-section"
hx-swap="innerHTML"
hx-indicator="#rec-refresh-spinner">
↻ Show different recommendations
<span id="rec-refresh-spinner" class="htmx-indicator loading ..."></span>
</button>
</div>
{% else %}
{% include "partials/empty_recs.html" %}
{% endif %}
```
`{% set source = "recommendation" %}` before the include ensures that every action button rendered from this partial carries `source="recommendation"` in its `hx-vals`. The events router will log that source to the DB.
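The refresh button re-triggers the same `/api/recommendations` endpoint. Qdrant answers with an approximate nearest-neighbour (ANN) search, so identical results across calls are not guaranteed, but with unchanged saves and dismissals the list will often be the same. The refresh is most useful after the user has saved or dismissed a few papers, because those actions change both the query and the set of filtered-out IDs.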
---
## Tests
### Test Isolation Pattern
Every test that touches the DB or in-memory cache uses this fixture:
```python
import asyncio

import pytest
from fastapi.testclient import TestClient


@pytest.fixture
def client(tmp_path, monkeypatch):
    import app.config as cfg
    import app.db as db_mod

    db_path = str(tmp_path / "test.db")
    # Point both the config and the db module at a fresh temp DB
    monkeypatch.setattr(cfg, "DB_PATH", db_path)
    monkeypatch.setattr(db_mod, "DB_PATH", db_path)

    # Clear in-memory caches so tests don't bleed into each other
    import app.user_state as us
    us._cache.clear()
    from app.qdrant_svc import _client
    _client.cache_clear()  # lru_cache singleton: clear between tests

    from app.main import app
    asyncio.run(db_mod.init_db())  # create the schema before the first request
    with TestClient(app, raise_server_exceptions=True) as c:
        yield c
```
`tmp_path` is a pytest built-in that gives each test its own temporary directory. Monkeypatching `DB_PATH` means every test gets a fresh, empty SQLite file. Clearing `us._cache` and calling `_client.cache_clear()` ensures no in-process state bleeds between tests.
### Mocking Pattern for Live Services
Tests that need recommendations mock both the Qdrant service and the arXiv metadata fetcher:
```python
def test_recommendations_after_save(client, monkeypatch):
    import app.qdrant_svc as qs
    import app.arxiv_svc as arxiv

    async def fake_recommend(positive_arxiv_ids, negative_arxiv_ids, seen_arxiv_ids, limit):
        return ["1706.03762"]
    monkeypatch.setattr(qs, "recommend", fake_recommend)

    async def fake_batch(ids):
        return {"1706.03762": {"arxiv_id": "1706.03762",
                               "title": "Attention Is All You Need", ...}}
    monkeypatch.setattr(arxiv, "fetch_metadata_batch", fake_batch)

    client.get("/")
    client.post("/api/papers/0704.0002/save", data={"source": "search"})
    resp = client.get("/api/recommendations")
    assert "Attention Is All You Need" in resp.text
```
`monkeypatch.setattr` replaces the real function for the duration of the test, then automatically restores it. This lets integration tests run without network access.
---
## Data Flow Summary
```
User types "transformer attention" in search bar
  |
  |  HTMX: GET /search?q=transformer+attention  (HX-Request: true)
  v
search.py: arxiv_svc.search("transformer attention")
  |  -> GET https://export.arxiv.org/api/query?search_query=all:transformer+attention
  |  <- Atom XML, 10 entries
  |  -> parse -> cache in paper_metadata table
  |  -> annotate with saved/dismissed from user_state
  v
returns partials/search_results.html -> HTMX swaps into #search-results
  |
User clicks Save on paper 1706.03762
  |
  |  HTMX: POST /api/papers/1706.03762/save  {source: "search", position: 3}
  v
events.py:
  1. db.log_interaction(user_id, "1706.03762", "save", source="search", position=3)
  2. us.record_positive(user_id, "1706.03762")
  3. asyncio.create_task(qdrant_svc.lookup_qdrant_ids(["1706.03762"]))  <- background
  v
returns partials/action_buttons.html (saved=True) -> HTMX swaps buttons in-place

[Background task]
qdrant_svc.lookup_qdrant_ids(["1706.03762"])
  -> db.get_qdrant_ids_batch: miss
  -> Qdrant scroll filter: arxiv_id = "1706.03762"
  <- point_id = 523419
  -> db.save_qdrant_id("1706.03762", 523419)

User navigates to home page /
  |
  |  HTMX: GET /api/recommendations  (hx-trigger="load")
  v
recommendations.py:
  1. us.ensure_loaded(user_id) -> positives = ["1706.03762"]
  2. qdrant_svc.recommend(positive=["1706.03762"], negative=[], seen={"1706.03762"})
       -> db.get_qdrant_ids_batch(["1706.03762"]) -> {"1706.03762": 523419}  (already cached)
       -> Qdrant query_points with RecommendQuery(positive=[523419])
       <- [point_612003, point_88341, ...]
       -> filter out seen papers in Python
       <- ["2302.13971", "2307.09288", ...]
  3. arxiv_svc.fetch_metadata_batch(["2302.13971", "2307.09288", ...])
       -> check paper_metadata cache: some hits, some misses
       -> arXiv API batch fetch for misses -> cache results
  v
returns partials/recommendations.html -> HTMX swaps into #rec-section
```
---
## File Count Summary
| File | Lines | Job |
|---|---|---|
| `app/config.py` | 36 | All settings |
| `app/db.py` | 185 | SQLite: 3 tables, 8 functions |
| `app/arxiv_svc.py` | 159 | arXiv API + metadata cache |
| `app/user_state.py` | 112 | In-memory deque cache per user |
| `app/qdrant_svc.py` | 166 | Qdrant ID lookup + Recommend |
| `app/templates_env.py` | ~20 | Shared Jinja2 env + tojson_parse |
| `app/main.py` | 54 | FastAPI app + home route |
| `app/routers/search.py` | 56 | GET /search |
| `app/routers/events.py` | 75 | POST save + not-interested |
| `app/routers/recommendations.py` | 62 | GET /api/recommendations |
| `app/routers/saved.py` | 47 | GET /saved |
| Templates | ~200 | All HTML |
| Tests | ~600 | 54 tests across 6 files |