# Code Tour: ArXiv Recommender (Phase 1)

A file-by-file walkthrough of every piece of the codebase: what it does, how it works, and why it was written the way it was.

---

## Entry Points

### `run.py`

```python
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",
        host="127.0.0.1",
        port=8000,
        reload=True,
        reload_dirs=["app"],
    )
```

Nothing special here. Starts Uvicorn pointing at the FastAPI `app` object. `reload=True` with `reload_dirs=["app"]` watches the `app/` directory and hot-reloads on file changes. Run with `python run.py`.

---

### `app/main.py`

```python
from app.routers import search, events, recommendations, saved

@asynccontextmanager
async def lifespan(app: FastAPI):
    await db.init_db()
    yield

app = FastAPI(title=APP_TITLE, lifespan=lifespan)

app.include_router(search.router)
app.include_router(events.router)
app.include_router(recommendations.router)
app.include_router(saved.router)

@app.get("/", response_class=HTMLResponse)
async def home(request, user_id=Cookie(...)):
    user_id = user_id or str(uuid.uuid4())
    state = await us.ensure_loaded(user_id)
    resp = templates.TemplateResponse(request, "index.html", {
        "has_recs": state.has_enough_for_recs(),
        "save_count": len(state.positives),
    })
    resp.set_cookie(COOKIE_NAME, user_id, max_age=365*24*3600, httponly=True)
    return resp
```

**`lifespan`** is a FastAPI context manager that runs `init_db()` once when the server starts: it creates the three SQLite tables if they don't exist, then yields control to the app.

**The home route** is the only one that lives in `main.py`. Everything else is in routers. It reads the user's cookie, loads their state from memory/DB, and renders `index.html` with two flags: `has_recs` (enough saves to show recommendations?) and `save_count` (how many papers saved so far).

**Cookie pattern**: every route that might be a user's first visit creates a UUID4 if no cookie exists, and refreshes the cookie's `max_age` on every response. This way the cookie always expires one year after the last visit.
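
The get-or-create step of the cookie pattern can be sketched in isolation. This is a minimal illustration, not the app's code; `resolve_user_id` is a hypothetical helper name.

```python
import uuid


def resolve_user_id(cookie_value):
    """Return (user_id, is_new).

    Hypothetical helper: reuse the cookie value when present,
    otherwise mint a fresh UUID4 string for a first-time visitor.
    """
    if cookie_value:
        return cookie_value, False
    return str(uuid.uuid4()), True


# A first visit has no cookie, so a new ID is minted.
new_id, is_new = resolve_user_id(None)

# A returning visitor keeps the ID already stored in the cookie.
same_id, is_new_again = resolve_user_id(new_id)
```

Either way the route would then call `set_cookie` with the resolved ID, which is what pushes the expiry forward on each visit.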

---

## Configuration

### `app/config.py`

```python
import os

QDRANT_URL        = os.getenv("QDRANT_URL", "https://2fe1965b-...eu-west-2-0.aws...")
QDRANT_API_KEY    = os.getenv("QDRANT_API_KEY", "eyJhbGci...")
QDRANT_COLLECTION = os.getenv("QDRANT_COLLECTION", "arxiv_bgem3_dense")

DB_PATH           = os.getenv("DB_PATH", "interactions.db")
ARXIV_API_URL     = "https://export.arxiv.org/api/query"
ARXIV_MAX_RESULTS = 10
METADATA_CACHE_TTL_DAYS = 30

REC_LIMIT          = 10
REC_POSITIVE_LIMIT = 20
REC_MIN_POSITIVES  = 1

APP_TITLE   = "ArXiv Recommender"
COOKIE_NAME = "arxiv_user_id"
COOKIE_MAX_AGE = 60 * 60 * 24 * 365
```

Every credential and tunable lives here. The settings read through `os.getenv("X", default)` (the Qdrant connection values and `DB_PATH`) can be overridden with environment variables; the rest are plain constants. In production you'd set `QDRANT_API_KEY` as an env var and never commit it to git.

**`REC_POSITIVE_LIMIT = 20`** controls how many saved papers are kept in the in-memory deque *and* how many are sent to Qdrant as positive examples. This is the only place you change it; `user_state.py` reads it directly.
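
As a small illustration of the `os.getenv("X", default)` pattern, the sketch below takes the environment as a plain mapping so it can be exercised without touching the real process environment. `read_setting` is a hypothetical helper, and the `/tmp/test.db` path is only an example value.

```python
import os


def read_setting(name, default, environ=os.environ):
    """Return the environment value for `name`, or `default` if unset.

    Hypothetical helper mirroring os.getenv(name, default).
    """
    return environ.get(name, default)


# No override present: the default from the config file wins.
default_path = read_setting("DB_PATH", "interactions.db", environ={})

# Override present: the environment value wins.
override_path = read_setting(
    "DB_PATH", "interactions.db", environ={"DB_PATH": "/tmp/test.db"}
)
```

Because `config.py` evaluates `os.getenv` at import time, an override has to be in the environment before the module is first imported.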

---

## Database Layer

### `app/db.py`

Three tables. The schema runs once at startup via `init_db()`.

```python
_SCHEMA = """
PRAGMA journal_mode=WAL;
PRAGMA synchronous=NORMAL;

CREATE TABLE IF NOT EXISTS interactions (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id    TEXT    NOT NULL,
    paper_id   TEXT    NOT NULL,
    event_type TEXT    NOT NULL,   -- save | not_interested
    source     TEXT,               -- search | recommendation | saved
    position   INTEGER,
    query_id   TEXT,
    timestamp  TEXT    NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX IF NOT EXISTS idx_ui_user_ts    ON interactions(user_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_ui_user_paper ON interactions(user_id, paper_id);

CREATE TABLE IF NOT EXISTS paper_qdrant_map (
    arxiv_id        TEXT PRIMARY KEY,
    qdrant_point_id INTEGER NOT NULL,
    mapped_at       TEXT    NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE IF NOT EXISTS paper_metadata (
    arxiv_id  TEXT PRIMARY KEY,
    title     TEXT,
    abstract  TEXT,
    authors   TEXT,   -- JSON array string e.g. '["Vaswani", "Shazeer"]'
    category  TEXT,
    published TEXT,
    cached_at TEXT    NOT NULL DEFAULT (datetime('now'))
);
"""
```

**WAL mode** (`journal_mode=WAL`) allows one writer and multiple concurrent readers without blocking each other. This matters because FastAPI handles requests concurrently, and in SQLite's default rollback-journal mode a write blocks readers and readers block the writer.

**`synchronous=NORMAL`**: SQLite no longer fsyncs on every commit. In WAL mode the database stays consistent after a crash, but the most recent commits can be lost on power failure or an OS crash. Faster than `FULL`, with acceptable durability for a research tool.
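
The two pragmas can be checked with the standard-library `sqlite3` module. This is a sketch on a throwaway database file, using the synchronous driver rather than `aiosqlite`; `open_db` and `demo.db` are illustrative names.

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open a connection, apply both pragmas, and report what SQLite settled on."""
    conn = sqlite3.connect(path)
    # PRAGMA journal_mode returns the mode that is now in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("PRAGMA synchronous=NORMAL")
    # PRAGMA synchronous reports an integer: 1 means NORMAL.
    sync = conn.execute("PRAGMA synchronous").fetchone()[0]
    return conn, mode, sync


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    conn, mode, sync = open_db(path)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()

    # A brand-new connection still sees WAL: the journal mode is stored in the file.
    conn2 = sqlite3.connect(path)
    mode_again = conn2.execute("PRAGMA journal_mode").fetchone()[0]
    conn2.close()
```

WAL applies to on-disk databases, which is why the sketch uses a temporary file instead of `:memory:`.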

**Three tables, three jobs:**

| Table | Job |
|---|---|
| `interactions` | Append-only event log. Never updated, only inserted. Source of truth. |
| `paper_qdrant_map` | Cache translating arxiv_id strings → Qdrant integer point IDs |
| `paper_metadata` | Cache of arXiv API responses so we don't re-fetch titles/abstracts |

**Key functions:**

```python
# Write an event
await db.log_interaction(user_id, paper_id, "save", source="search", position=2)

# Read recent events for a user (used to hydrate the in-memory cache)
rows = await db.get_user_interactions(user_id, event_types=["save", "not_interested"], limit=70)

# Qdrant ID cache
await db.save_qdrant_id("1706.03762", 523419)
cached = await db.get_qdrant_ids_batch(["1706.03762", "0704.0002"])
# → {"1706.03762": 523419}  (only IDs that were in the cache)

# Metadata cache
await db.cache_metadata({"arxiv_id": "1706.03762", "title": "Attention...", ...})
batch = await db.get_cached_metadata_batch(["1706.03762", "0704.0002"])
# → {"1706.03762": {...}}
```

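All functions use `async with aiosqlite.connect(DB_PATH)`: each call opens and closes its own connection. This is safe with WAL mode and avoids connection pool complexity. One caveat: `journal_mode=WAL` persists in the database file, whereas `synchronous` is a per-connection setting, so it only applies to connections that issue the pragma.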
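
A batch cache lookup of this kind usually comes down to one `IN (...)` query with a placeholder per ID. The sketch below uses the synchronous `sqlite3` module and an in-memory database as a stand-in for `aiosqlite`; `lookup_cached_ids` is a hypothetical name, not the function in `db.py`.

```python
import sqlite3


def lookup_cached_ids(conn, arxiv_ids):
    """Return {arxiv_id: qdrant_point_id} for the IDs present in the cache table."""
    if not arxiv_ids:
        return {}
    # One "?" per ID keeps the query parameterised, with no string interpolation of values.
    placeholders = ",".join("?" for _ in arxiv_ids)
    rows = conn.execute(
        "SELECT arxiv_id, qdrant_point_id FROM paper_qdrant_map "
        f"WHERE arxiv_id IN ({placeholders})",
        list(arxiv_ids),
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE paper_qdrant_map ("
    "arxiv_id TEXT PRIMARY KEY, qdrant_point_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO paper_qdrant_map VALUES (?, ?)", ("1706.03762", 523419))

# Only the cached ID comes back; the unknown one is simply absent.
found = lookup_cached_ids(conn, ["1706.03762", "0704.0002"])
```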

---

## arXiv Service

### `app/arxiv_svc.py`

Handles all communication with the arXiv Atom XML API and the SQLite metadata cache.

#### ID Normalisation

arXiv IDs come in several formats from the API:

```python
_ID_RE = re.compile(r"(?:arxiv:|https?://arxiv\.org/abs/)?([^\s/v]+(?:v\d+)?)")

def _normalise_id(raw: str) -> str:
    m = _ID_RE.search(raw.strip())
    bare = m.group(1)
    return re.sub(r"v\d+$", "", bare)
```

| Input | Output |
|---|---|
| `http://arxiv.org/abs/1706.03762v5` | `1706.03762` |
| `https://arxiv.org/abs/1706.03762` | `1706.03762` |
| `arxiv:1706.03762v2` | `1706.03762` |
| `1706.03762v3` | `1706.03762` |
| `0704.0002` | `0704.0002` |

The bare ID is what we store everywhere: in SQLite, in the user state cache, and in the Qdrant `arxiv_id` payload field.
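
The table can be verified with a standalone copy of the pattern. This sketch reuses the regex shown above; the `None` guard and the `ValueError` are additions for illustration and are not in the excerpt.

```python
import re

# Same pattern as in the excerpt above.
_ID_RE = re.compile(r"(?:arxiv:|https?://arxiv\.org/abs/)?([^\s/v]+(?:v\d+)?)")


def normalise_id(raw):
    """Strip any URL or `arxiv:` prefix and the trailing version suffix."""
    m = _ID_RE.search(raw.strip())
    if m is None:  # added guard: an empty string has no match
        raise ValueError(f"not an arXiv id: {raw!r}")
    return re.sub(r"v\d+$", "", m.group(1))


CASES = {
    "http://arxiv.org/abs/1706.03762v5": "1706.03762",
    "https://arxiv.org/abs/1706.03762": "1706.03762",
    "arxiv:1706.03762v2": "1706.03762",
    "1706.03762v3": "1706.03762",
    "0704.0002": "0704.0002",
}

results = {raw: normalise_id(raw) for raw in CASES}
```

One limitation worth knowing: the capture group stops at the first `/` or `v`, so an old-style identifier such as `hep-th/9901001` comes back as `hep-th` with this pattern.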

#### XML Parsing

The arXiv API returns Atom XML. One `<entry>` element per paper:

```python
_NS = {
    "atom":  "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

def _parse_entry(entry: ET.Element) -> dict:
    raw_id   = text("atom:id")
    arxiv_id = _normalise_id(raw_id)
    authors  = [a.findtext("atom:name", ...) for a in entry.findall("atom:author", _NS)]
    cat_el   = entry.find("arxiv:primary_category", _NS)
    category = cat_el.attrib.get("term", "")

    return {
        "arxiv_id": arxiv_id,
        "title":    text("atom:title").replace("\n", " "),
        "abstract": text("atom:summary").replace("\n", " "),
        "authors":  json.dumps(authors[:5]),   # stored as JSON string in SQLite
        "category": category,
        "published": text("atom:published")[:10],  # YYYY-MM-DD only
        "year":     int(text("atom:published")[:4]),  # first four characters are the year
    }
```

Authors are stored as a JSON string (`'["Vaswani", "Shazeer"]'`) because SQLite has no array type. The `tojson_parse` filter in the template converts it back to a Python list for display.
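
The namespace handling and the JSON round-trip for authors can be tried on a hand-written Atom fragment. The sample feed below is made up for illustration and only contains the fields the sketch reads; `parse_authors` is a hypothetical helper.

```python
import json
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written sample, not a real API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <author><name>Ashish Vaswani</name></author>
    <author><name>Noam Shazeer</name></author>
    <arxiv:primary_category term="cs.CL"/>
  </entry>
</feed>"""


def parse_authors(entry):
    """Collect up to five author names and serialise them as a JSON string."""
    names = [
        a.findtext("atom:name", default="", namespaces=NS)
        for a in entry.findall("atom:author", NS)
    ]
    return json.dumps(names[:5])


root = ET.fromstring(SAMPLE_FEED)
entry = root.find("atom:entry", NS)

stored = parse_authors(entry)      # what would go into the TEXT column
restored = json.loads(stored)      # what the template filter recovers
category = entry.find("arxiv:primary_category", NS).attrib.get("term", "")
```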

#### Search and Fetch

```python
async def search(query: str, max_results=10) -> list[dict]:
    params = {"search_query": f"all:{query}", "start": 0,
              "max_results": max_results, "sortBy": "relevance"}
    async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
        resp = await client.get(ARXIV_API_URL, params=params)
    papers = [_parse_entry(e) for e in ET.fromstring(resp.text).findall("atom:entry", _NS)]
    for paper in papers:
        await db.cache_metadata(paper)   # cache all results immediately
    return papers

async def fetch_metadata_batch(arxiv_ids: list[str]) -> dict[str, dict]:
    result  = await db.get_cached_metadata_batch(arxiv_ids)  # check SQLite first
    missing = [aid for aid in arxiv_ids if aid not in result]
    if missing:
        # Batch up to 20 IDs per request, 0.35s gap = ~3 req/s rate limit
        for i in range(0, len(missing), 20):
            chunk  = missing[i:i+20]
            params = {"id_list": ",".join(chunk), "max_results": len(chunk)}
            # ... fetch, parse, cache ...
            await asyncio.sleep(0.35)
    return result
```

`follow_redirects=True` is required: httpx does not follow redirects by default, and the arXiv API's HTTP URL redirects to HTTPS.
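
The chunking in `fetch_metadata_batch` can be shown without any network calls. A minimal sketch, with `chunked` as a hypothetical helper and placeholder IDs:

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 45 placeholder IDs split into request-sized groups of 20.
ids = [f"id-{n}" for n in range(45)]
chunks = list(chunked(ids, 20))
sizes = [len(chunk) for chunk in chunks]

# Each chunk becomes one comma-separated id_list parameter.
first_id_list = ",".join(chunks[0][:3])
```

With 45 IDs this gives three requests, so the sleep between chunks is paid twice in the middle of the loop and once after the last chunk.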

---

## User State

### `app/user_state.py`

The in-memory hot cache. Zero DB reads on the hot path.

```python
from app import db, config

MAX_POSITIVES = config.REC_POSITIVE_LIMIT   # = 20, kept in sync with config
MAX_NEGATIVES = 50

@dataclass
class UserState:
    positives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_POSITIVES))
    negatives: deque[str] = field(default_factory=lambda: deque(maxlen=MAX_NEGATIVES))
    loaded: bool = False

    def add_positive(self, paper_id: str) -> None:
        try:    self.negatives.remove(paper_id)   # mutual exclusion
        except ValueError: pass
        if paper_id not in self.positives:
            self.positives.appendleft(paper_id)   # most recent first

    def add_negative(self, paper_id: str) -> None:
        try:    self.positives.remove(paper_id)
        except ValueError: pass
        if paper_id not in self.negatives:
            self.negatives.appendleft(paper_id)

    def has_enough_for_recs(self) -> bool:
        return len(self.positives) >= config.REC_MIN_POSITIVES
```

**Mutual exclusion**: `add_positive` removes the paper from negatives before adding to positives, and vice versa. So a paper can never be in both lists simultaneously.

**`appendleft`**: deques are double-ended. `appendleft` inserts at index 0 (front). When `maxlen` is reached, the rightmost (oldest) element is silently dropped. So `positive_list[0]` is always the most recently saved paper.
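
Both behaviours are easy to see with a tiny bounded deque. This sketch uses `maxlen=3` instead of 20 so the eviction is visible, and `MiniState` is a cut-down stand-in for `UserState`.

```python
from collections import deque


class MiniState:
    """Cut-down stand-in for UserState with very small limits."""

    def __init__(self):
        self.positives = deque(maxlen=3)
        self.negatives = deque(maxlen=3)

    def add_positive(self, paper_id):
        if paper_id in self.negatives:
            self.negatives.remove(paper_id)   # mutual exclusion
        if paper_id not in self.positives:
            self.positives.appendleft(paper_id)

    def add_negative(self, paper_id):
        if paper_id in self.positives:
            self.positives.remove(paper_id)
        if paper_id not in self.negatives:
            self.negatives.appendleft(paper_id)


state = MiniState()
for paper_id in ["a", "b", "c", "d"]:
    state.add_positive(paper_id)

# "a" was the oldest save and fell off the right-hand end.
after_saves = list(state.positives)

# Dismissing "c" moves it from one list to the other.
state.add_negative("c")
after_dismiss = (list(state.positives), list(state.negatives))
```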

```python
_cache: dict[str, UserState] = {}   # global in-process dict

async def ensure_loaded(user_id: str) -> UserState:
    state = get_user_state(user_id)
    if state.loaded:
        return state                 # O(1): hot path

    # Cold path: first request from this user in this server process
    rows = await db.get_user_interactions(user_id,
               event_types=["save", "not_interested"], limit=70)
    for row in reversed(rows):      # oldest first so appendleft puts newest at front
        if row["event_type"] == "save":
            state.add_positive(row["paper_id"])
        else:
            state.add_negative(row["paper_id"])
    state.loaded = True
    return state
```

**Why `reversed(rows)`**: `get_user_interactions` returns rows newest-first (ORDER BY timestamp DESC). We want to replay them in chronological order so that `appendleft` in `add_positive` correctly ends up with the newest paper at `index 0`. If we replayed newest-first, the oldest save would end up at the front.
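
The effect of the replay order can be checked directly. A small sketch with three placeholder IDs; `replay` is a hypothetical helper:

```python
from collections import deque


def replay(rows_newest_first, reverse):
    """Replay saves into a deque with appendleft, optionally oldest-first."""
    d = deque(maxlen=20)
    rows = reversed(rows_newest_first) if reverse else rows_newest_first
    for paper_id in rows:
        d.appendleft(paper_id)
    return list(d)


rows = ["newest", "middle", "oldest"]   # as returned by ORDER BY timestamp DESC

with_reverse = replay(rows, reverse=True)
without_reverse = replay(rows, reverse=False)
```

Only the reversed replay leaves the most recent save at index 0.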

```python
def record_positive(user_id: str, paper_id: str) -> None:
    get_user_state(user_id).add_positive(paper_id)   # sync, no DB

def all_seen(user_id: str) -> set[str]:
    state = get_user_state(user_id)
    return set(state.positive_list) | set(state.negative_list)
```

`all_seen` feeds the recommendation engine: any paper currently in the user's saved or dismissed lists is excluded from the results.

---

## Qdrant Service

### `app/qdrant_svc.py`

Two jobs: translate arxiv_ids → integer point IDs, and call the Recommend API.

#### Client Setup

```python
@lru_cache(maxsize=1)
def _client() -> QdrantClient:
    return QdrantClient(
        url=config.QDRANT_URL,
        api_key=config.QDRANT_API_KEY,
        timeout=30,
        check_compatibility=False,
    )
```

`@lru_cache(maxsize=1)` makes this a singleton: the client is created once and reused on every request. The sync `QdrantClient` is used (not the async one) because its calls are dispatched through `loop.run_in_executor`, which runs them on a worker thread and keeps the event loop free while the network call is in flight.
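
The combination of a cached factory and `run_in_executor` can be sketched with a stand-in client. `FakeClient` and its `slow_call` method are hypothetical; they only simulate a blocking network call.

```python
import asyncio
import time
from functools import lru_cache


class FakeClient:
    """Hypothetical stand-in for a blocking, synchronous client."""

    def slow_call(self, x):
        time.sleep(0.01)   # simulates a blocking network round trip
        return x * 2


@lru_cache(maxsize=1)
def get_client():
    # The body runs once; later calls return the cached instance.
    return FakeClient()


async def call_in_thread(x):
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor.
    return await loop.run_in_executor(None, get_client().slow_call, x)


result = asyncio.run(call_in_thread(21))
same_instance = get_client() is get_client()
```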

#### ID Lookup

```python
async def lookup_qdrant_ids(arxiv_ids: list[str]) -> dict[str, int]:
    cached  = await db.get_qdrant_ids_batch(arxiv_ids)
    missing = [aid for aid in arxiv_ids if aid not in cached]

    if missing:
        loop    = asyncio.get_event_loop()
        results = await loop.run_in_executor(None, _scroll_by_arxiv_ids, missing)
        for arxiv_id, point_id in results.items():
            await db.save_qdrant_id(arxiv_id, point_id)
            cached[arxiv_id] = point_id

    return cached

def _scroll_by_arxiv_ids(arxiv_ids: list[str]) -> dict[str, int]:
    pts, _ = _client().scroll(
        collection_name=QDRANT_COLLECTION,
        scroll_filter=Filter(must=[
            FieldCondition(key="arxiv_id", match=MatchAny(any=arxiv_ids))
        ]),
        limit=len(arxiv_ids),
        with_payload=True,
        with_vectors=False,
    )
    return {p.payload["arxiv_id"]: p.id for p in pts}
```

`MatchAny` is Qdrant's `IN (...)`: it filters points whose `arxiv_id` payload field matches any value in the list. It relies on the keyword payload index on the collection (created once, then persistent).

The result is `{arxiv_id: integer_point_id}`. Any ID not found in the collection is simply absent from the dict, which means that paper hasn't been indexed yet.

#### Recommend

```python
async def recommend(positive_arxiv_ids, negative_arxiv_ids, seen_arxiv_ids, limit):
    all_ids = list(dict.fromkeys(positive_arxiv_ids + negative_arxiv_ids))
    id_map  = await lookup_qdrant_ids(all_ids)

    pos_ids = [id_map[aid] for aid in positive_arxiv_ids if aid in id_map]
    neg_ids = [id_map[aid] for aid in negative_arxiv_ids if aid in id_map]

    if not pos_ids:
        return []

    loop    = asyncio.get_running_loop()
    results = await loop.run_in_executor(None, _run_recommend, pos_ids, neg_ids, limit*2)

    filtered = [
        r.payload["arxiv_id"]
        for r in results
        if r.payload.get("arxiv_id") and r.payload["arxiv_id"] not in seen_arxiv_ids
    ]
    return filtered[:limit]

def _run_recommend(pos_ids, neg_ids, limit):
    result = _client().query_points(
        collection_name=QDRANT_COLLECTION,
        query=RecommendQuery(
            recommend=RecommendInput(
                positive=pos_ids,
                negative=neg_ids if neg_ids else [],
                strategy=RecommendStrategy.BEST_SCORE,
            )
        ),
        limit=limit,
        with_payload=True,
        with_vectors=False,
    )
    return result.points
```

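**`BEST_SCORE` strategy**: for each candidate paper, Qdrant takes its highest similarity to any positive example and its highest similarity to any negative example. If the positive score is higher, that becomes the candidate's score; otherwise the candidate is pushed down with a negative score derived from its closest negative. Papers near your saves and far from your dismissals bubble to the top.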
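
As a rough illustration of the scoring idea, here is a pure-Python sketch. It follows the formula as described in Qdrant's documentation to the best of my reading, which is an assumption worth checking against the version in use; the 2-D vectors and the dot-product similarity are toy choices.

```python
def dot(a, b):
    """Toy similarity: plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def best_score(candidate, positives, negatives, similarity=dot):
    """Score a candidate against its closest positive and closest negative.

    Assumed formula: the best positive similarity if it beats the best
    negative one, otherwise minus the squared best negative similarity.
    """
    best_pos = max((similarity(candidate, p) for p in positives), default=float("-inf"))
    best_neg = max((similarity(candidate, n) for n in negatives), default=float("-inf"))
    if best_pos > best_neg:
        return best_pos
    return -(best_neg * best_neg)


positives = [(1.0, 0.0)]   # a saved paper
negatives = [(0.0, 1.0)]   # a dismissed paper

near_save = best_score((1.0, 0.0), positives, negatives)
near_dismissal = best_score((0.0, 1.0), positives, negatives)
```

A candidate aligned with the save scores high, and one aligned with the dismissal gets a negative score, so it sorts to the bottom.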

**`limit * 2` over-fetch**: we fetch double the target count so that after filtering out `seen_arxiv_ids` in Python, we still have enough results to return `limit` papers.

**`dict.fromkeys(...)` deduplication**: if a paper appears in both positive and negative lists (shouldn't happen due to mutual exclusion in `user_state`, but defensive), it's deduplicated before the lookup.
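
The deduplication and the post-filter can both be sketched with plain lists. The helper names and paper IDs below are placeholders.

```python
def dedupe_keep_order(ids):
    """Drop repeats while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(ids))


def filter_seen(candidates, seen, limit):
    """Remove already-seen IDs, then trim to the requested size."""
    return [c for c in candidates if c not in seen][:limit]


unique_ids = dedupe_keep_order(["a", "b", "a", "c"])

# Ask for 2 results, over-fetch 4, and lose one to the seen set.
over_fetched = ["p1", "p2", "p3", "p4"]
recs = filter_seen(over_fetched, seen={"p2"}, limit=2)

# If too many candidates were already seen, fewer than `limit` come back.
short = filter_seen(over_fetched, seen={"p1", "p2", "p3"}, limit=2)
```

The last case shows that the over-fetch makes a full page likely rather than guaranteed.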

---

## Routers

### `app/routers/search.py`

```python
@router.get("/search", response_class=HTMLResponse)
async def search(request: Request, q: str = "", user_id=Cookie(...)):
    papers = []
    if q.strip():
        papers = await arxiv_svc.search(q.strip())

    state         = await us.ensure_loaded(user_id)
    saved_ids     = set(state.positive_list)
    dismissed_ids = set(state.negative_list)

    for p in papers:
        p["saved"]     = p["arxiv_id"] in saved_ids
        p["dismissed"] = p["arxiv_id"] in dismissed_ids

    if request.headers.get("HX-Request"):
        return templates.TemplateResponse(request, "partials/search_results.html",
                                          {"papers": papers, "query": q})
    else:
        return templates.TemplateResponse(request, "search.html",
                                          {"papers": papers, "query": q,
                                           "has_recs": state.has_enough_for_recs()})
```

**HTMX detection**: if the request has an `HX-Request` header (set automatically by HTMX), return only the `search_results.html` partial, which is just the list of cards with no `<html>` wrapper. This is what gets swapped into `#search-results` on the page without a full reload.

**Annotating papers**: after fetching from arXiv, each paper dict gets `saved` and `dismissed` booleans added. The template uses these to show the correct button state (e.g. already-saved papers show "✓ Saved" instead of "⭐ Save").
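
The `HX-Request` check boils down to a single branch on a header. A minimal sketch using a plain dict for the headers; `pick_template` is a hypothetical helper, and unlike Starlette's header object a plain dict is case-sensitive.

```python
def pick_template(headers):
    """Choose the partial for HTMX requests and the full page otherwise."""
    # HTMX sends the header with the literal value "true".
    if headers.get("HX-Request") == "true":
        return "partials/search_results.html"
    return "search.html"


htmx_choice = pick_template({"HX-Request": "true"})
browser_choice = pick_template({})
```

This is also what makes the pushed `/search?q=...` URL shareable: opening it directly sends no `HX-Request` header, so the full page is rendered.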

---

### `app/routers/events.py`

```python
@router.post("/{paper_id}/save", response_class=HTMLResponse)
async def save_paper(paper_id, request, source=Form("search"),
                     position=Form(0), query_id=Form(""), user_id=Cookie(...)):
    await db.log_interaction(user_id, paper_id, "save",
                             source=source, position=position or None)
    us.record_positive(user_id, paper_id)
    asyncio.create_task(qdrant_svc.lookup_qdrant_ids([paper_id]))  # background

    return templates.TemplateResponse(request, "partials/action_buttons.html",
        {"paper_id": paper_id, "saved": True, "dismissed": False, "source": source})


@router.post("/{paper_id}/not-interested", response_class=HTMLResponse)
async def not_interested(paper_id, request, source=Form("search"), ...):
    await db.log_interaction(user_id, paper_id, "not_interested", source=source)
    us.record_negative(user_id, paper_id)

    resp = HTMLResponse(content="")   # empty → HTMX removes the card
    resp.set_cookie(...)
    return resp
```

**Three things happen on save, in order:**
1. `db.log_interaction()`: durable write to SQLite (awaited, so it completes before the next step)
2. `us.record_positive()`: in-memory update (synchronous, no I/O)
3. `asyncio.create_task(...)`: background task to look up the Qdrant point ID. Returns immediately; the lookup happens in the background. The response is sent before this finishes.

**Why background for Qdrant lookup?** The user doesn't need the Qdrant point ID for the save response. They only need it when recommendations are requested. The background task means the save response is fast (~5ms), and by the time the user navigates to the home page to see recommendations, the ID is likely already cached.

**Empty response for dismiss**: the button targets `#paper-{id}` with `hx-swap="outerHTML swap:200ms"`. An empty response body tells HTMX to replace the entire card element with nothing, so the card fades out and disappears.

**`source` is forwarded to the response template**: after a save, the rendered `action_buttons.html` partial receives the same `source` value that came in. So the "Remove" button on the now-saved card will log `source="recommendation"` if the save happened from the recs section, not `"search"`.

---

### `app/routers/recommendations.py`

```python
@router.get("/recommendations", response_class=HTMLResponse)
async def get_recommendations(request, user_id=Cookie(...)):
    state = await us.ensure_loaded(user_id)

    if not state.has_enough_for_recs():
        return _empty_resp()           # shows "Save 1 paper to unlock recs"

    rec_arxiv_ids = await qdrant_svc.recommend(
        positive_arxiv_ids=state.positive_list,
        negative_arxiv_ids=state.negative_list,
        seen_arxiv_ids=us.all_seen(user_id),
        limit=REC_LIMIT,
    )

    if not rec_arxiv_ids:
        return _empty_resp()

    meta   = await arxiv_svc.fetch_metadata_batch(rec_arxiv_ids)
    papers = [{**meta[aid], "saved": False, "dismissed": False}
              for aid in rec_arxiv_ids if aid in meta]

    return templates.TemplateResponse(request, "partials/recommendations.html",
                                      {"papers": papers})
```

Linear pipeline: load state → check threshold → Qdrant recommend → fetch metadata → render. If any step returns nothing, show the empty state partial.

---

### `app/routers/saved.py`

```python
@router.get("/saved", response_class=HTMLResponse)
async def saved_papers(request, user_id=Cookie(...)):
    state    = await us.ensure_loaded(user_id)
    saved_ids = state.positive_list   # most-recent first

    papers = []
    if saved_ids:
        meta   = await arxiv_svc.fetch_metadata_batch(saved_ids)
        papers = [{**meta[aid], "saved": True, "dismissed": False}
                  for aid in saved_ids if aid in meta]

    return templates.TemplateResponse(request, "saved.html",
                                      {"papers": papers, "count": len(papers)})
```

The simplest router. `positive_list` already holds the user's current saves, most recent first. Fetch metadata for all of them, render. `saved=True` is hardcoded because every paper on this page is by definition saved, so the action buttons show "✓ Saved" + "Remove".

---

## Templates

### `app/templates_env.py`

```python
import json

from fastapi.templating import Jinja2Templates

def _tojson_parse(value: str) -> list:
    try:
        result = json.loads(value)
        return result if isinstance(result, list) else []
    except Exception:
        return []

templates = Jinja2Templates(directory="app/templates")
templates.env.filters["tojson_parse"] = _tojson_parse
```

One custom filter: `tojson_parse`. SQLite stores authors as a JSON string (`'["Vaswani", "Shazeer"]'`). In the template: `{{ paper.authors | tojson_parse | join(", ") }}`. The filter parses it back to a Python list and returns `[]` on any error, so it never crashes the template.

All routers import `templates` from here. There is only one instance, shared everywhere.

---

### `app/templates/base.html`

```html
<head>
  <link href="https://cdn.jsdelivr.net/npm/daisyui@4.12.10/dist/full.min.css" rel="stylesheet"/>
  <script src="https://cdn.tailwindcss.com"></script>
  <script src="https://unpkg.com/htmx.org@1.9.12"></script>
  <style>
    .htmx-swapping { opacity: 0; transition: opacity 200ms ease-out; }
    .htmx-request .htmx-indicator,
    .htmx-request.htmx-indicator { display: inline-block !important; }
    .htmx-indicator { display: none; }
  </style>
</head>
<body>
  <div class="navbar bg-base-100 shadow-sm px-4">
    <a href="/" class="text-xl font-bold text-primary">📄 ArXiv Rec</a>
    <a href="/search" class="btn btn-ghost btn-sm">Search</a>
    <a href="/saved"  class="btn btn-ghost btn-sm">Saved</a>
  </div>
  <main class="container mx-auto px-4 py-6 max-w-4xl">
    {% block content %}{% endblock %}
  </main>
</body>
```

Zero build step. TailwindCSS + DaisyUI from CDN, HTMX from CDN.

**HTMX CSS hooks:**
- `.htmx-swapping`: HTMX adds this class to an element just before it's replaced. The `opacity: 0` + transition creates the fade-out animation on dismissed cards.
- `.htmx-indicator`: hidden by default, and made visible while an HTMX request is in flight (HTMX marks the requesting element, or the element named by `hx-indicator`, with `htmx-request`). Used for the loading spinners next to buttons.

---

### `app/templates/index.html`

```html
<!-- Search bar with live-search -->
<form hx-get="/search"
      hx-target="#search-results"
      hx-push-url="true"
      hx-indicator="#search-spinner">
  <input type="text" name="q" placeholder="e.g. transformer attention" />
  <button>Search <span id="search-spinner" class="htmx-indicator loading ..."></span></button>
</form>

<!-- Recommendations: loaded after page paint -->
<div id="rec-section"
     hx-get="/api/recommendations"
     hx-trigger="load"
     hx-swap="innerHTML">
  Loading...
</div>

<!-- Search results: swapped here by HTMX -->
<div id="search-results"></div>
```

**`hx-trigger="load"`**: the `#rec-section` div fires the HTMX request as soon as it loads. The page renders immediately with "Loading..." and the recs appear ~500ms later. This way the page never feels slow: you see content instantly, then recs fill in.

**`hx-push-url="true"`**: when a search fires, HTMX pushes `/search?q=...` to the browser history. So the back button works and the URL is shareable.

---

### `app/templates/partials/paper_card.html`

```html
<div class="card bg-base-100 shadow-sm border border-base-300 p-4 space-y-2"
     id="paper-{{ paper.arxiv_id }}">

  <div class="flex items-start justify-between gap-2">
    <a href="https://arxiv.org/abs/{{ paper.arxiv_id }}"
       target="_blank" class="font-semibold text-primary hover:underline">
      {{ paper.title }}
    </a>
    <span class="badge badge-outline badge-sm">{{ paper.category }}</span>
  </div>

  <div class="text-xs text-base-content/50">
    [{{ paper.arxiv_id }}]
    {% if paper.published %} · {{ paper.published[:4] }}{% endif %}
    {% if authors_list %} · {{ authors_list | join(", ") }}{% endif %}
  </div>

  <p class="text-sm line-clamp-3">{{ paper.abstract }}</p>

  <div id="actions-{{ paper.arxiv_id }}">
    {% include "partials/action_buttons.html" %}
  </div>
</div>
```

Two IDs per card: `#paper-{id}` on the outer div (target for dismiss: the whole card is removed) and `#actions-{id}` on the buttons div (target for save: only the buttons are swapped to the "Saved" state).

`line-clamp-3` is a Tailwind utility that truncates the abstract to 3 lines with an ellipsis.

---

### `app/templates/partials/action_buttons.html`

```html
{% set pid     = paper_id if paper_id is defined else paper.arxiv_id %}
{% set is_saved = saved if saved is defined else (paper.saved | default(false)) %}
{% set _source  = source if source is defined else "search" %}

{% if is_saved %}
  <button class="btn btn-success btn-xs" disabled>βœ“ Saved</button>
  <button hx-post="/api/papers/{{ pid }}/not-interested"
          hx-target="#paper-{{ pid }}"
          hx-swap="outerHTML swap:200ms"
          hx-vals='{"source": "{{ _source }}"}'>Remove</button>
{% else %}
  <button hx-post="/api/papers/{{ pid }}/save"
          hx-target="#actions-{{ pid }}"
          hx-swap="innerHTML"
          hx-vals='{"source": "{{ _source }}", "position": "{{ position | default(0) }}"}'>
    ⭐ Save
  </button>
  <button hx-post="/api/papers/{{ pid }}/not-interested"
          hx-target="#paper-{{ pid }}"
          hx-swap="outerHTML swap:200ms"
          hx-vals='{"source": "{{ _source }}"}'>
    ✕ Not interested
  </button>
{% endif %}
```

This partial is used in two contexts:
1. **Inside `paper_card.html`**: `paper` is defined, `paper_id` is not
2. **As a direct response from `save_paper` in `events.py`**: `paper_id` is defined, `paper` is not

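The `{% set pid = ... if ... is defined else ... %}` pattern handles both safely, because a Jinja2 conditional expression evaluates only the branch it selects. The `default()` filter would fail here: its argument (`paper.arxiv_id`) is evaluated before the filter runs, so it raises an undefined-variable error whenever `paper` is missing.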
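The difference between the two forms can be illustrated outside Jinja2. The sketch below is a plain-Python analogy, not Jinja2 itself: `Missing` and `default` are hypothetical stand-ins for an undefined template variable and a filter-style helper. It shows that a function argument is computed before the call, while a conditional expression skips the branch it does not take.

```python
class Missing:
    """Stand-in for an undefined template variable: any attribute access fails."""

    def __getattr__(self, name):
        raise LookupError(f"undefined variable has no attribute {name!r}")


def default(value, fallback):
    """Filter-style helper: by the time this runs, `fallback` is already computed."""
    return fallback if isinstance(value, Missing) else value


paper_id = "1706.03762"   # defined, as in the save_paper response
paper = Missing()         # not defined in that context

# Conditional expression: the else-branch is never evaluated.
lazy_pid = paper_id if not isinstance(paper_id, Missing) else paper.arxiv_id

# Filter-style call: paper.arxiv_id is evaluated while building the arguments,
# so the error is raised before default() can pick the first value.
try:
    eager_pid = default(paper_id, paper.arxiv_id)
except LookupError as exc:
    eager_error = str(exc)
else:
    eager_error = None
```

Here `lazy_pid` is set without touching `paper`, while the `default(...)` call fails even though its first argument was perfectly usable.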

**`hx-vals`** sends additional form fields with the HTMX request. `source` rides along with every button click, and `position` with every Save click, so both can be logged in the DB.
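On the server side, these values arrive as ordinary form fields, since HTMX form-encodes POST parameters by default. The following is a minimal sketch of decoding them with the standard library only. `parse_action_form` and its fallback values are assumptions for illustration, not the project's actual handler.

```python
from urllib.parse import parse_qs, urlencode


def parse_action_form(body: str) -> dict:
    """Decode a form-encoded body into the fields a handler would log.

    Hypothetical helper: the name and the fallback values are assumptions.
    """
    fields = parse_qs(body)
    source = fields.get("source", ["search"])[0]
    # Form values are always strings, so position has to be converted.
    position = int(fields.get("position", ["0"])[0])
    return {"source": source, "position": position}


# Roughly what the Save button's hx-vals look like on the wire.
save_body = urlencode({"source": "recommendation", "position": "3"})
# The "Not interested" button only sends source.
dismiss_body = urlencode({"source": "search"})

save_fields = parse_action_form(save_body)
dismiss_fields = parse_action_form(dismiss_body)
```

The string-to-int conversion is the part worth noticing: the template renders `position` inside quotes, so the handler has to cast it before storing it as a number.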

---

### `app/templates/partials/recommendations.html`

```html
{% if papers %}
  <div class="space-y-3">
    {% for paper in papers %}
      {% set position = loop.index0 %}
      {% set source = "recommendation" %}
      {% include "partials/paper_card.html" %}
    {% endfor %}
  </div>
  <div class="text-center pt-3">
    <button hx-get="/api/recommendations"
            hx-target="#rec-section"
            hx-swap="innerHTML"
            hx-indicator="#rec-refresh-spinner">
      ↻ Show different recommendations
      <span id="rec-refresh-spinner" class="htmx-indicator loading ..."></span>
    </button>
  </div>
{% else %}
  {% include "partials/empty_recs.html" %}
{% endif %}
```

`{% set source = "recommendation" %}` before the include ensures that every action button rendered from this partial carries `source="recommendation"` in its `hx-vals`. The events router (`events.py`) then logs that source to the DB.

The refresh button re-triggers the same `/api/recommendations` endpoint. Because the Qdrant Recommend API runs an approximate nearest-neighbor (ANN) search, it is not guaranteed to return identical results on every call, so re-requesting can surface different papers from the same vector neighborhood.

---

## Tests

### Test Isolation Pattern

Every test that touches the DB or in-memory cache uses this fixture:

```python
import asyncio

import pytest
from fastapi.testclient import TestClient


@pytest.fixture
def client(tmp_path, monkeypatch):
    import app.config as cfg
    import app.db as db_mod
    db_path = str(tmp_path / "test.db")

    # Point both the config and the db module at a fresh temp DB
    monkeypatch.setattr(cfg, "DB_PATH", db_path)
    monkeypatch.setattr(db_mod, "DB_PATH", db_path)

    # Clear in-memory caches so tests don't bleed into each other
    import app.user_state as us
    us._cache.clear()

    from app.qdrant_svc import _client
    _client.cache_clear()    # lru_cache singleton: must be cleared between tests

    from app.main import app
    # asyncio.run() creates and closes its own loop; get_event_loop() is
    # deprecated when no loop is running
    asyncio.run(db_mod.init_db())

    with TestClient(app, raise_server_exceptions=True) as c:
        yield c
```

`tmp_path` is a pytest built-in that gives each test its own temporary directory. Monkeypatching `DB_PATH` means every test gets a fresh, empty SQLite file. Clearing `us._cache` and calling `_client.cache_clear()` ensures that no in-process state bleeds between tests.
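The `cache_clear()` call matters because an `lru_cache`-wrapped factory keeps returning the object it built on its first call, even after the settings it read have been patched. The sketch below reproduces that with stand-ins only: `cfg`, `FakeClient` and `get_client` are hypothetical names, not the project's code.

```python
from functools import lru_cache
from types import SimpleNamespace

# Stand-in for a settings module whose attribute gets monkeypatched.
cfg = SimpleNamespace(DB_PATH="prod.db")


class FakeClient:
    def __init__(self, path: str) -> None:
        self.path = path


@lru_cache(maxsize=1)
def get_client() -> FakeClient:
    # Built once; later calls return the same cached object.
    return FakeClient(cfg.DB_PATH)


first = get_client()

# A later test points the setting somewhere else...
cfg.DB_PATH = "test.db"
stale = get_client()          # ...but still receives the old object

get_client.cache_clear()      # what the fixture does between tests
fresh = get_client()          # rebuilt with the current setting
```

Without the clear, `stale` is the same object as `first` and still points at the old path. After the clear, `fresh` picks up the patched value.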

### Mocking Pattern for Live Services

Tests that need recommendations mock both the Qdrant service and the arXiv metadata fetcher:

```python
def test_recommendations_after_save(client, monkeypatch):
    import app.qdrant_svc as qs
    import app.arxiv_svc as arxiv

    async def fake_recommend(positive_arxiv_ids, negative_arxiv_ids, seen_arxiv_ids, limit):
        return ["1706.03762"]
    monkeypatch.setattr(qs, "recommend", fake_recommend)

    async def fake_batch(ids):
        return {"1706.03762": {"arxiv_id": "1706.03762",
                               "title": "Attention Is All You Need", ...}}
    monkeypatch.setattr(arxiv, "fetch_metadata_batch", fake_batch)

    client.get("/")
    client.post("/api/papers/0704.0002/save", data={"source": "search"})
    resp = client.get("/api/recommendations")
    assert "Attention Is All You Need" in resp.text
```

`monkeypatch.setattr` replaces the real function for the duration of the test, then automatically restores it. This lets integration tests run without network access.
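The same replace-then-restore behaviour can be tried without pytest. The sketch below swaps in `unittest.mock.patch.object` from the standard library as a stand-in for `monkeypatch.setattr`. `svc`, `real_recommend`, `fake_recommend` and `handler` are hypothetical names used only for illustration.

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import patch


async def real_recommend(positive_ids, limit):
    raise RuntimeError("would hit the network")


# Stand-in for a service module (assumption: callers access it as svc.recommend).
svc = SimpleNamespace(recommend=real_recommend)


async def fake_recommend(positive_ids, limit):
    return ["1706.03762"][:limit]


async def handler():
    # Looks up svc.recommend at call time, so it sees the replacement.
    return await svc.recommend(["0704.0002"], limit=5)


# Inside the block the attribute points at the fake; on exit it is restored.
with patch.object(svc, "recommend", fake_recommend):
    patched_result = asyncio.run(handler())

restored = svc.recommend is real_recommend
```

One design point this makes visible: patching a module attribute only affects callers that look the function up through the module at call time. Code that did `from module import recommend` would keep its own reference and would need to be patched where it is used.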

---

## Data Flow Summary

```
User types "transformer attention" in search bar
  │
  │  HTMX: GET /search?q=transformer+attention  (HX-Request: true)
  ▼
search.py: arxiv_svc.search("transformer attention")
  │  → GET https://export.arxiv.org/api/query?search_query=all:transformer+attention
  │  ← Atom XML, 10 entries
  │  → parse → cache in paper_metadata table
  │  → annotate with saved/dismissed from user_state
  ▼
returns partials/search_results.html → HTMX swaps into #search-results
  │
User clicks ⭐ Save on paper 1706.03762
  │
  │  HTMX: POST /api/papers/1706.03762/save  {source: "search", position: 3}
  ▼
events.py:
  1. db.log_interaction(user_id, "1706.03762", "save", source="search", position=3)
  2. us.record_positive(user_id, "1706.03762")
  3. asyncio.create_task(qdrant_svc.lookup_qdrant_ids(["1706.03762"]))  ← background
  ▼
returns partials/action_buttons.html (saved=True) → HTMX swaps buttons in-place

  [Background task]
  qdrant_svc.lookup_qdrant_ids(["1706.03762"])
    → db.get_qdrant_ids_batch: miss
    → Qdrant scroll filter: arxiv_id = "1706.03762"
    ← point_id = 523419
    → db.save_qdrant_id("1706.03762", 523419)

User navigates to home page /
  │
  │  HTMX: GET /api/recommendations  (hx-trigger="load")
  ▼
recommendations.py:
  1. us.ensure_loaded(user_id) → positives = ["1706.03762"]
  2. qdrant_svc.recommend(positive=["1706.03762"], negative=[], seen={"1706.03762"})
       → db.get_qdrant_ids_batch(["1706.03762"]) → {523419}  (already cached)
       → Qdrant query_points with RecommendQuery(positive=[523419])
       ← [point_612003, point_88341, ...]
       → filter out seen papers in Python
       ← ["2302.13971", "2307.09288", ...]
  3. arxiv_svc.fetch_metadata_batch(["2302.13971", "2307.09288", ...])
       → check paper_metadata cache: some hits, some misses
       → arXiv API batch fetch for misses → cache results
  ▼
returns partials/recommendations.html → HTMX swaps into #rec-section
```
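The save step in this flow combines two ideas: respond immediately while the ID lookup runs in the background, and consult the local cache before going to the remote service. A minimal sketch of that shape follows, using an in-memory dict in place of SQLite and a fake coroutine in place of Qdrant. All names here (`id_cache`, `fake_remote_lookup`, `lookup_ids`, `save_paper`) are assumptions for illustration, not the project's functions.

```python
import asyncio

id_cache: dict[str, int] = {}     # stand-in for the SQLite ID table
remote_calls: list[str] = []      # records which IDs reached the "remote"


async def fake_remote_lookup(arxiv_id: str) -> int:
    """Hypothetical stand-in for the Qdrant scroll-by-filter call."""
    remote_calls.append(arxiv_id)
    await asyncio.sleep(0)        # yield control, as a network call would
    return 523419


async def lookup_ids(arxiv_ids: list[str]) -> dict[str, int]:
    """Cache-first lookup: only misses go to the remote service."""
    found = {a: id_cache[a] for a in arxiv_ids if a in id_cache}
    for arxiv_id in arxiv_ids:
        if arxiv_id not in found:
            found[arxiv_id] = await fake_remote_lookup(arxiv_id)
            id_cache[arxiv_id] = found[arxiv_id]   # persist for next time
    return found


async def save_paper(arxiv_id: str) -> tuple[str, asyncio.Task]:
    """Respond immediately; resolve the point ID in the background."""
    task = asyncio.create_task(lookup_ids([arxiv_id]))
    # Return the task so the caller keeps a reference to it.
    return "saved", task


async def main() -> dict[str, int]:
    status, task = await save_paper("1706.03762")
    assert status == "saved" and not task.done()   # response did not wait
    await task                                     # background work finishes
    return await lookup_ids(["1706.03762"])        # second lookup: cache hit


result = asyncio.run(main())
```

By the time the recommendations request arrives, the second lookup is served from the cache and the remote service has been contacted only once. Holding a reference to the task is deliberate: the event loop keeps only weak references to tasks, so a fire-and-forget task with no other reference can be garbage-collected before it completes.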

---

## File Count Summary

| File | Lines | Job |
|---|---|---|
| `app/config.py` | 36 | All settings |
| `app/db.py` | 185 | SQLite: 3 tables, 8 functions |
| `app/arxiv_svc.py` | 159 | arXiv API + metadata cache |
| `app/user_state.py` | 112 | In-memory deque cache per user |
| `app/qdrant_svc.py` | 166 | Qdrant ID lookup + Recommend |
| `app/templates_env.py` | ~20 | Shared Jinja2 env + tojson_parse |
| `app/main.py` | 54 | FastAPI app + home route |
| `app/routers/search.py` | 56 | GET /search |
| `app/routers/events.py` | 75 | POST save + not-interested |
| `app/routers/recommendations.py` | 62 | GET /api/recommendations |
| `app/routers/saved.py` | 47 | GET /saved |
| Templates | ~200 | All HTML |
| Tests | ~600 | 54 tests across 6 files |