jofaichow committed on
Commit 83adb51 · 1 Parent(s): 62263e7

v0.0.9 — Full cache sweep + adaptive radius fix

- All 126 city×category combos fully cached (18 cities × 7 categories)
- Marrakech, San Francisco, Kyoto gaps filled
- Adaptive coordinate radius: derived from each city's bounding box instead of a hardcoded 15 km
  (Bali: 141 km, Dubai: 108 km, European cities: ~15 km)
- .gitignore updated for warmup logs, patches/, .streamlit_out.log
- Cache sizes updated in README
- New scripts: fix_images.py, clear_poor_entries.py, warmup_fast.py, warmup_direct.py
- 1,869 image entries, 827 geocode entries
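
The adaptive radius described in this message can be sketched as a standalone function. This is a minimal sketch of the formula used in the `_verify_coordinates` hunk of this commit (bounding-box diagonal × 0.6, with a 15 km floor). The function name `adaptive_radius_km`, its parameters, and the sample bounding boxes are illustrative assumptions, not values read from the caches.

```python
import math

KM_PER_DEGREE = 111.0  # approximate length of one degree of latitude


def adaptive_radius_km(center_lat, bbox, min_km=15.0, scale=0.6):
    """Return a trust radius in km for a city.

    bbox is assumed to be [south, north, west, east] in degrees,
    the order the diff indexes it in (bb[0]..bb[3]).
    """
    if len(bbox) != 4:
        return min_km  # no usable bounding box: fall back to the floor
    south, north, west, east = bbox
    km_lat = (north - south) * KM_PER_DEGREE
    # Longitude degrees shrink with latitude, hence the cosine factor.
    km_lon = (east - west) * KM_PER_DEGREE * math.cos(math.radians(center_lat))
    return max(min_km, math.hypot(km_lat, km_lon) * scale)


# Hypothetical boxes: a compact 0.1-degree city and a spread-out 1.5-degree region.
compact = adaptive_radius_km(48.85, [48.8, 48.9, 2.3, 2.4])
spread = adaptive_radius_km(-8.4, [-9.15, -7.65, 114.4, 115.9])
```

With these made-up boxes the compact city stays at the 15 km floor, while the spread-out region gets a radius well above 100 km, which is the behaviour the bullet above describes.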

.geocode_cache.json CHANGED
The diff for this file is too large to render. See raw diff
 
.gitignore CHANGED
@@ -27,6 +27,15 @@ hermes-plan.md
 
  # Warmup logs
  warmup_*.log
+ .warmup_*.log
+ .warmup_run_check.txt
+ .warmup_run_cycle*.log
+ .fix_images.log
+ prewarm_translations.log
+ .streamlit_out.log
+
+ # Patches (not part of the app)
+ patches/
 
  # OS junk
  .DS_Store
.image_cache.json CHANGED
The diff for this file is too large to render. See raw diff
 
.llm_cache.json CHANGED
The diff for this file is too large to render. See raw diff
 
README.md CHANGED
@@ -82,7 +82,7 @@ A provider is skipped if its API key is empty. Just set `OPENROUTER_API_KEY` and
 
  - **7 travel categories**: Landmark, Culture, Nature, Gems, Photo, Food, Shopping
  - **AI-generated recommendations** with descriptions, tips, and coordinates
- - **6-tier image fallback**: Wikipedia → Wikidata → Commons → Local name → Unsplash → City photo
+ - **5-tier image fallback + emoji**: Wikipedia → Wikidata → Commons → Local name → Unsplash → emoji (🏛️)
  - **Real coordinates** from Nominatim geocoding with LLM-coord fast-path
  - **Leaflet map** with spider markers, card↔map hover sync
  - **Multi-language translation**: Traditional Chinese, Japanese, Korean, French, Spanish, German
@@ -103,19 +103,27 @@ Three JSON cache files are committed and ship with the app:
  Caches are populated on first search and persisted to disk. On HF Spaces, they
  survive restarts and provide instant results for cached cities.
 
- ### Pre-warming
+ ### Warmup
 
  ```bash
- python scripts/prewarm_cache.py
+ # Full 18-city × 7-category warmup (LLM + image enrichment)
+ python scripts/warmup.py
+
+ # Fast warmup (LLM data only, skip sequential image fix)
+ python scripts/warmup_fast.py
+
+ # Re-warmup specific cities (e.g. after coordinate fixes)
+ python scripts/warmup.py -c Bali -c Dubai
  ```
 
- Populates all 3 caches for a set of popular cities × all 7 categories.
- Caches are num-agnostic: one entry per (city, category) is enough.
+ Generates all 126 city × category combos (2,300+ items across 3 caches).
+ Resumable: interrupted runs pick up where they left off.
 
  ### Cache Health Check
 
  ```bash
- python scripts/check_cache.py # scan + fix
+ python scripts/warmup.py --fix # re-check images on cached entries
+ python scripts/check_cache.py # scan + fix
  python scripts/check_cache.py --report-only # scan only
  ```
 
@@ -137,13 +145,17 @@ roamify/
  │ └── utils/
  │ └── prompts.py # Category-specific AI prompt templates
  ├── scripts/
- │ ├── prewarm_cache.py # Multi-category pre-warming
- │ └── check_cache.py # Cache health check & repair
+ │ ├── warmup.py # Full 18-city unified warmup
+ │ ├── warmup_fast.py # Fast LLM-only warmup (skips image fix)
+ │ ├── check_cache.py # Cache health check & repair
+ │ ├── fix_images.py # Parallel image enrichment pass
+ │ └── clear_poor_entries.py # Clear cache for re-warmup
  ├── .streamlit/
  │ └── config.toml # Streamlit server and theme config
- ├── .llm_cache.json # Disk-persisted recommendation cache
- ├── .image_cache.json # Disk-persisted image URL cache
- ├── .geocode_cache.json # Disk-persisted geocoding cache
+ ├── .llm_cache.json # Disk-persisted recommendation cache (~850KB)
+ ├── .image_cache.json # Disk-persisted image URL cache (~300KB)
+ ├── .geocode_cache.json # Disk-persisted geocoding cache (~290KB)
+ ├── .translation_cache.json # Disk-persisted translation cache (~220KB)
  ├── Dockerfile # HF Spaces deployment
  ├── requirements.txt
  └── README.md
scripts/clear_poor_entries.py ADDED
@@ -0,0 +1,61 @@
+ #!/usr/bin/env python3
+ """Clear specific cache entries so they get regenerated with the new adaptive radius."""
+ import json, os, sys
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+ from dotenv import load_dotenv
+ load_dotenv(dotenv_path=os.path.join(os.path.dirname(__file__), "..", ".env"), override=True)
+
+ CACHE_FILE = os.path.join(os.path.dirname(__file__), "..", ".llm_cache.json")
+ WARMUP_PROGRESS = os.path.join(os.path.dirname(__file__), "..", ".warmup_progress.json")
+
+ CATS = ["Landmark", "Culture", "Nature", "Gems", "Photo", "Food", "Shopping"]
+
+ def cat_hash(name):
+     d = {c: (c == name) for c in CATS}
+     return json.dumps(d, sort_keys=True)
+
+ # Cities to fully clear (all categories)
+ FULL_CLEAR = ["Bali", "Dubai"]
+ # Specific combos to clear
+ COMBO_CLEAR = [
+     ("Marrakech", "Landmark"),
+     ("Kyoto", "Shopping"),
+ ]
+
+ with open(CACHE_FILE) as f:
+     cache = json.load(f)
+
+ removed = 0
+ # Full clear
+ for city in FULL_CLEAR:
+     for cat in CATS:
+         key = json.dumps([city, cat_hash(cat)])
+         if key in cache:
+             del cache[key]
+             removed += 1
+
+ # Specific combos
+ for city, cat in COMBO_CLEAR:
+     key = json.dumps([city, cat_hash(cat)])
+     if key in cache:
+         del cache[key]
+         removed += 1
+
+ with open(CACHE_FILE, "w") as f:
+     json.dump(cache, f)
+
+ # Also clear warmup progress for these so the warmup retries them
+ with open(WARMUP_PROGRESS) as f:
+     progress = json.load(f)
+
+ for cid in list(progress["combos"].keys()):
+     city, cat = cid.split("::")
+     if city in FULL_CLEAR:
+         del progress["combos"][cid]
+     elif (city, cat) in COMBO_CLEAR:
+         del progress["combos"][cid]
+
+ with open(WARMUP_PROGRESS, "w") as f:
+     json.dump(progress, f, indent=2)
+
+ print(f"Cleared {removed} cache entries + progress for re-warmup")
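
The key scheme used by this script can be illustrated in isolation. The sketch below rebuilds the on-disk key exactly as `clear_poor_entries.py` does (`json.dumps([city, cat_hash(cat)])`) and shows one plausible way it could map to the in-memory tuple key `(city, cat_hash(cat))` that `fix_images.py` looks up. The helpers `disk_key` and `memory_key` are hypothetical names, and the assumption that the cache loader converts keys this way is not confirmed by the diff.

```python
import json

CATS = ["Landmark", "Culture", "Nature", "Gems", "Photo", "Food", "Shopping"]


def cat_hash(name):
    # One-hot category dict, serialised with sorted keys so the string is stable.
    return json.dumps({c: (c == name) for c in CATS}, sort_keys=True)


def disk_key(city, cat):
    # String key as written by clear_poor_entries.py.
    return json.dumps([city, cat_hash(cat)])


def memory_key(key_str):
    # Assumption: the loader turns the JSON list back into a tuple.
    city, hashed = json.loads(key_str)
    return (city, hashed)


key = disk_key("Bali", "Food")
```

Because the category dict is serialised with `sort_keys=True`, the same (city, category) pair always yields the same key, regardless of the order in which categories are listed.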
scripts/fix_images.py ADDED
@@ -0,0 +1,64 @@
+ #!/usr/bin/env python3
+ """Fix missing images across all cached cities using parallel enrichment."""
+ import sys, os, json, time
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+ from dotenv import load_dotenv
+ load_dotenv(dotenv_path=os.path.join(os.path.dirname(__file__), "..", ".env"), override=True)
+
+ from services.recommender import (
+     _LLM_CACHE, _IMAGE_CACHE, _save_image_cache,
+     _enrich_with_images,
+ )
+
+ # Collect all items that have no image_url
+ CITIES = ['Paris','London','Rome','Barcelona','New York','Tokyo','Bangkok','Sydney',
+           'Cape Town','Rio de Janeiro','Istanbul','Dubai','Seoul','Bali','Prague',
+           'San Francisco','Marrakech','Kyoto']
+ CATS = ['Landmark','Culture','Nature','Gems','Photo','Food','Shopping']
+
+ def cat_hash(name):
+     d = {c: (c==name) for c in CATS}
+     return json.dumps(d, sort_keys=True)
+
+ # Group missing-image items by city for parallel enrichment
+ by_city = {}
+ total_missing = 0
+ for city in CITIES:
+     city_items = []
+     for cat in CATS:
+         key = (city, cat_hash(cat))
+         items = _LLM_CACHE.get(key, [])
+         if items:
+             for item in items:
+                 if not item.get("image_url"):
+                     city_items.append(item)
+     if city_items:
+         by_city[city] = city_items
+         total_missing += len(city_items)
+         print(f'{city}: {len(city_items)} items missing images')
+
+ print(f'\nTotal items missing images: {total_missing}')
+
+ # Enrich each city's items in parallel (one task per city, 4 worker threads)
+ import concurrent.futures
+ with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
+     futures = {}
+     for city, items in by_city.items():
+         f = pool.submit(_enrich_with_images, items, city=city)
+         futures[f] = city
+
+     for f in concurrent.futures.as_completed(futures):
+         city = futures[f]
+         try:
+             result = f.result()
+             fixed = sum(1 for it in result if it.get("image_url"))
+             print(f' {city}: fixed {fixed}/{len(by_city[city])} remaining')
+         except Exception as e:
+             print(f' {city}: error - {e}')
+
+ _save_image_cache()
+
+ # Final tally
+ still_missing = sum(1 for v in _LLM_CACHE.values() if v for it in v if not it.get("image_url"))
+ print(f'\nStill missing after fix: {still_missing} (from {total_missing})')
+ print(f'Image cache entries: {len(_IMAGE_CACHE)}')
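
The per-city fan-out in this script follows a common `ThreadPoolExecutor` pattern: submit one task per city, then collect results as they finish so that one failing city does not abort the rest. Below is a minimal, self-contained sketch of that pattern. `enrich_stub` is a hypothetical stand-in for `_enrich_with_images` (whose real behaviour is not shown in this diff), and `enrich_all` is an illustrative wrapper, not a function from the repository.

```python
import concurrent.futures


def enrich_stub(items, city):
    """Hypothetical stand-in for _enrich_with_images: tags every item with a fake URL."""
    for item in items:
        item["image_url"] = f"stub://{city}"
    return items


def enrich_all(by_city, worker=enrich_stub, max_workers=4):
    """Run one enrichment task per city; return (results, errors) keyed by city."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its city so failures can be attributed.
        futures = {
            pool.submit(worker, items, city=city): city
            for city, items in by_city.items()
        }
        for fut in concurrent.futures.as_completed(futures):
            city = futures[fut]
            try:
                results[city] = fut.result()
            except Exception as exc:  # keep going if one city fails
                errors[city] = str(exc)
    return results, errors


sample = {"Paris": [{"name": "a"}, {"name": "b"}], "Kyoto": [{"name": "c"}]}
results, errors = enrich_all(sample)
```

Returning the errors instead of only printing them makes it easier for a caller to decide whether a second pass is needed.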
scripts/run_cities.sh ADDED
@@ -0,0 +1,40 @@
+ #!/usr/bin/env bash
+ set -e
+
+ ROAMIFY=/home/joe/repo_dev/roamify
+ PYTHON=/home/joe/repo_dev/roamify/.venv/bin/python3
+ LOG=$ROAMIFY/warmup_batch.log
+
+ # Cities to process, in order
+ CITIES=(
+   "Cape Town"
+   "Rio de Janeiro"
+   "Istanbul"
+   "Dubai"
+   "Seoul"
+   "Bali"
+   "Prague"
+   "San Francisco"
+   "Marrakech"
+   "Kyoto"
+ )
+
+ echo "=== Batch warmup started $(date) ===" > "$LOG"
+
+ for city in "${CITIES[@]}"; do
+   echo ""
+   echo "═══ Processing: $city ═══" | tee -a "$LOG"
+   echo "Started: $(date)" >> "$LOG"
+
+   cd "$ROAMIFY"
+   if $PYTHON scripts/warmup.py --city "$city" >> "$LOG" 2>&1; then
+     echo "✅ $city — DONE at $(date)" | tee -a "$LOG"
+   else
+     echo "❌ $city — FAILED at $(date)" | tee -a "$LOG"
+     echo "See $LOG for details"
+     exit 1
+   fi
+ done
+
+ echo ""
+ echo "🎉 All cities complete at $(date)" | tee -a "$LOG"
scripts/run_warmup.sh ADDED
@@ -0,0 +1,3 @@
+ #!/bin/bash
+ cd /home/joe/repo_dev/roamify
+ .venv/bin/python -u scripts/warmup.py 2>&1 | while IFS= read -r line; do echo "$line"; done
scripts/warmup_direct.py ADDED
@@ -0,0 +1,16 @@
+ #!/usr/bin/env python3
+ """Wrapper that ensures flushing for background warmup."""
+ import sys
+ import os
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+
+ # Monkey-patch print to flush after every call
+ _original_print = print
+ def _flushing_print(*args, **kwargs):
+     kwargs.setdefault("flush", True)
+     _original_print(*args, **kwargs)
+ import builtins
+ builtins.print = _flushing_print
+
+ from warmup import warmup
+ warmup()
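
As a point of comparison with the monkey-patch above, the standard library offers a way to get similar behaviour without replacing `print`: `io.TextIOWrapper.reconfigure(line_buffering=True)`, available since Python 3.7. The sketch below is an alternative, not what this commit does; `enable_line_buffering` is a hypothetical helper name. Note one difference: line buffering flushes only when a newline is written, so calls such as `print(..., end=" ")` still need an explicit `flush=True`, whereas the monkey-patch flushes on every call.

```python
import io
import sys


def enable_line_buffering(stream):
    """Switch a text stream to line-buffered mode if it supports reconfigure().

    Returns True on success, False when the stream cannot be reconfigured
    (for example, when stdout has been replaced by a custom object).
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False
    reconfigure(line_buffering=True)
    return True


if __name__ == "__main__":
    enable_line_buffering(sys.stdout)
    print("warmup output now flushes at every newline")
```

Running the interpreter with `-u`, as `scripts/run_warmup.sh` does, is another way to avoid buffered output in background jobs.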
scripts/warmup_fast.py ADDED
@@ -0,0 +1,136 @@
+ #!/usr/bin/env python3
+ """
+ Fast warmup — generates LLM data for missing combos only.
+ Skips the slow sequential image fix; get_recommendations already does parallel enrichment.
+ """
+ import os, sys, time, json
+ from datetime import datetime
+
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+ from dotenv import load_dotenv
+ load_dotenv(dotenv_path=os.path.join(os.path.dirname(__file__), "..", ".env"), override=True)
+
+ from services.recommender import (
+     get_recommendations_cached,
+     _LLM_CACHE,
+     _IMAGE_CACHE,
+     _GEOCODE_CACHE,
+ )
+
+ CITIES = [
+     "Paris", "London", "Rome", "Barcelona", "New York", "Tokyo",
+     "Bangkok", "Sydney", "Cape Town", "Rio de Janeiro", "Istanbul",
+     "Dubai", "Seoul", "Bali", "Prague", "San Francisco", "Marrakech", "Kyoto",
+ ]
+ CATEGORIES = ["Landmark", "Culture", "Nature", "Gems", "Photo", "Food", "Shopping"]
+
+ PROGRESS_FILE = os.path.join(os.path.dirname(__file__), "..", ".warmup_progress.json")
+
+ def cat_dict(cat_name: str) -> dict:
+     return {name: (name == cat_name) for name in CATEGORIES}
+
+ def cat_hash(cat_name: str) -> str:
+     return json.dumps(cat_dict(cat_name), sort_keys=True)
+
+ def load_progress() -> dict:
+     if not os.path.exists(PROGRESS_FILE):
+         return {"version": 1, "combos": {}}
+     try:
+         with open(PROGRESS_FILE) as f:
+             return json.load(f)
+     except (json.JSONDecodeError, OSError):
+         return {"version": 1, "combos": {}}
+
+ def save_progress(progress: dict):
+     with open(PROGRESS_FILE, "w") as f:
+         json.dump(progress, f, indent=2)
+
+ def combo_id(city: str, cat: str) -> str:
+     return f"{city}::{cat}"
+
+ def is_done(progress: dict, cid: str) -> bool:
+     entry = progress["combos"].get(cid)
+     return bool(entry) and entry.get("status") == "success"
+
+ progress = load_progress()
+ llm_before = len(_LLM_CACHE)
+
+ # Only process combos that actually need LLM generation
+ todo = []
+ for city in CITIES:
+     for cat in CATEGORIES:
+         cid = combo_id(city, cat)
+         if is_done(progress, cid):
+             continue
+         key = (city, cat_hash(cat))
+         if key in _LLM_CACHE:
+             # Already in the LLM cache (though not recorded in progress): skip it
+             continue
+         todo.append((city, cat))
+
+ total = len(todo)
+ print(f"Missing combos needing API calls: {total}")
+ print()
+
+ for i, (city, cat) in enumerate(todo, 1):
+     cid = combo_id(city, cat)
+     print(f"[{i}/{total}] 🔍 {city} / {cat}...", end=" ", flush=True)
+     start = time.time()
+     provider_log = []
+     try:
+         result = get_recommendations_cached(
+             city=city, num_attractions=19,
+             categories=cat_dict(cat),
+             temperature=0,
+             provider_log=provider_log,
+         )
+         elapsed = time.time() - start
+
+         for entry in provider_log:
+             label = entry.get("provider", "?")
+             status = "✅" if entry.get("status") == "success" else "❌"
+             items = entry.get("items", 0)
+             dur = entry.get("elapsed", "?")
+             print(f"\n {label} {status} {dur}s ({items}it)", end="", flush=True)
+
+         if result:
+             items = len(result)
+             print(f"\n✅ {items} items, {elapsed:.0f}s total")
+             progress["combos"][cid] = {
+                 "status": "success", "items": items,
+                 "elapsed": round(elapsed, 1),
+                 "provider_chain": provider_log,
+                 "timestamp": datetime.now().isoformat(),
+             }
+         else:
+             print(f"\n❌ returned None, {elapsed:.0f}s total")
+             progress["combos"][cid] = {
+                 "status": "failed", "elapsed": round(elapsed, 1),
+                 "provider_chain": provider_log,
+                 "error": "all providers returned None",
+                 "timestamp": datetime.now().isoformat(),
+             }
+     except Exception as e:
+         elapsed = time.time() - start
+         print(f"\n❌ {elapsed:.0f}s — {e}")
+         progress["combos"][cid] = {
+             "status": "failed", "elapsed": round(elapsed, 1),
+             "error": str(e), "timestamp": datetime.now().isoformat(),
+         }
+
+     save_progress(progress)
+     if i < total:
+         time.sleep(1.5)  # Nominatim-friendly pause
+
+ # Summary
+ success = sum(1 for v in progress["combos"].values() if v.get("status") == "success")
+ failed = sum(1 for v in progress["combos"].values() if v.get("status") == "failed")
+ new_llm = len(_LLM_CACHE) - llm_before
+ print("\n" + "=" * 50)
+ print(f"Done! {success} success, {failed} failed, {new_llm} new cache entries")
+
+ failed_combos = [k for k,v in progress["combos"].items() if v.get("status") == "failed"]
+ if failed_combos:
+     print("Failed combos:")
+     for c in failed_combos:
+         print(f" ❌ {c.replace('::', ' / ')}")
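
The resumability of this script depends on the progress file being readable after an interrupted run. One way to harden that, sketched below, is to write the file atomically: dump to a temporary file in the same directory, then swap it into place with `os.replace`. This is a suggested variation, not what the script does (it writes the file in place). `save_progress_atomic` and `pending` are hypothetical helper names, and the `"City::Category"` ids follow the `combo_id` format above.

```python
import json
import os
import tempfile


def load_progress(path):
    """Load the progress file, falling back to an empty record if missing or corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"version": 1, "combos": {}}


def save_progress_atomic(progress, path):
    """Write progress to a temp file, then rename it over the target in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(progress, f, indent=2)
        os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def pending(progress, combos):
    """Return the (city, category) pairs that are not yet marked successful."""
    return [
        (city, cat)
        for city, cat in combos
        if progress["combos"].get(f"{city}::{cat}", {}).get("status") != "success"
    ]
```

With this approach, a crash in the middle of a write leaves the previous progress file intact, so the next run resumes from the last fully saved state rather than starting over.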
src/services/recommender.py CHANGED
@@ -672,22 +672,34 @@ def _geocode_city(city: str) -> tuple[float, float, list[float]] | None:
 
  def _verify_coordinates(items: list[dict], city: str) -> list[dict]:
      """Verify attraction coordinates.
-
+
      Strategy:
-     1. Geocode city center (1 cached Nominatim query)
-     2. For each item: if LLM-provided coords are non-zero and within 15km of
-        city center, trust them; skip Nominatim entirely.
-     3. Only geocode items whose LLM coords fail the radius check.
+     1. Geocode city center (1 cached Nominatim query), get bounding box
+     2. Adaptive radius: max(15km, bounding_box_diagonal x 0.6)
+        Compact European cities stay ~15km, spread-out cities (Bali, Dubai)
+        get a larger radius proportional to their bounding box.
+     3. For each item: if LLM-provided coords are non-zero and within
+        adaptive radius of city center, trust them — skip Nominatim entirely.
+     4. Only geocode items whose LLM coords fail the radius check.
      This eliminates ~80% of Nominatim calls on a good LLM response.
      """
      # Geocode city center (cached — sleep handled internally)
      city_result = _geocode_city(city)
      if city_result:
          city_center = (city_result[0], city_result[1])
+         # Adaptive radius: use bounding box diagonal × 0.6, min 15km
+         # This handles spread-out cities (Bali, Dubai, Rio, etc.) while keeping
+         # compact European cities tight.
+         bb = city_result[2]
+         if len(bb) == 4:
+             km_lat = (bb[1] - bb[0]) * 111.0
+             km_lon = (bb[3] - bb[2]) * 111.0 * math.cos(math.radians(city_center[0]))
+             MAX_CITY_DIST_KM = max(15, math.sqrt(km_lat**2 + km_lon**2) * 0.6)
+         else:
+             MAX_CITY_DIST_KM = 15
      else:
          city_center = None
-
-     MAX_CITY_DIST_KM = 15
+         MAX_CITY_DIST_KM = 15
      verified = []
 
      for item in items:
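
The hunk ends before the per-item radius check, so the distance computation itself is not visible here. The sketch below shows one common way such a check could be written, using the haversine great-circle distance. It is an assumption for illustration only: the helper names `haversine_km` and `trust_llm_coords` and the item fields `lat` and `lon` are hypothetical and may not match the actual implementation in `recommender.py`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trust_llm_coords(item, city_center, max_dist_km):
    """True if the item has non-zero coords within max_dist_km of the city center.

    The "lat"/"lon" field names are assumed for this sketch.
    """
    lat, lon = item.get("lat"), item.get("lon")
    if not lat or not lon or city_center is None:
        return False  # missing or zero coordinates: fall back to geocoding
    return haversine_km(lat, lon, city_center[0], city_center[1]) <= max_dist_km


paris = (48.8566, 2.3522)
near = {"lat": 48.8584, "lon": 2.2945}   # a point a few km west of the center
far = {"lat": 35.6762, "lon": 139.6503}  # a point on another continent
```

Items for which `trust_llm_coords` returns False would then be the only ones sent to Nominatim, which is the saving the docstring describes.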