Spaces: jofaichow/roamify

jofaichow committed 13 days ago

Commit

83adb51

1 Parent(s): 62263e7

v0.0.9 — Full cache sweep + adaptive radius fix

- All 126 city×category combos fully cached (18 cities × 7 categories)
- Marrakech, San Francisco, Kyoto gaps filled
- Adaptive coordinate radius: dynamic bounding-box-based instead of 15km hardcoded
(Bali: 141km, Dubai: 108km, European cities: ~15km)
- .gitignore updated for warmup logs, patches/, .streamlit_out.log
- Cache sizes updated in README
- New scripts: fix_images.py, clear_poor_entries.py, warmup_fast.py, warmup_direct.py
- 1,869 image entries, 827 geocode entries

Files changed (12)

.geocode_cache.json +0 -0
.gitignore +9 -0
.image_cache.json +0 -0
.llm_cache.json +0 -0
README.md +23 -11
scripts/clear_poor_entries.py +61 -0
scripts/fix_images.py +64 -0
scripts/run_cities.sh +40 -0
scripts/run_warmup.sh +3 -0
scripts/warmup_direct.py +16 -0
scripts/warmup_fast.py +136 -0
src/services/recommender.py +19 -7

.geocode_cache.json CHANGED

The diff for this file is too large to render. See raw diff

.gitignore CHANGED

@@ -27,6 +27,15 @@ hermes-plan.md
 # Warmup logs
 warmup_*.log
 # OS junk
 .DS_Store

 # Warmup logs
 warmup_*.log
+.warmup_*.log
+.warmup_run_check.txt
+.warmup_run_cycle*.log
+.fix_images.log
+prewarm_translations.log
+.streamlit_out.log
+# Patches (not part of the app)
+patches/
 # OS junk
 .DS_Store

.image_cache.json CHANGED

The diff for this file is too large to render. See raw diff

.llm_cache.json CHANGED

The diff for this file is too large to render. See raw diff

README.md CHANGED

@@ -82,7 +82,7 @@ A provider is skipped if its API key is empty. Just set `OPENROUTER_API_KEY` and
 - **7 travel categories**: Landmark, Culture, Nature, Gems, Photo, Food, Shopping
 - **AI-generated recommendations** with descriptions, tips, and coordinates
-- **6-tier image fallback**: Wikipedia → Wikidata → Commons → Local name → Unsplash → City photo
 - **Real coordinates** from Nominatim geocoding with LLM-coord fast-path
 - **Leaflet map** with spider markers, card↔map hover sync
 - **Multi-language translation**: Traditional Chinese, Japanese, Korean, French, Spanish, German
@@ -103,19 +103,27 @@ Three JSON cache files are committed and ship with the app:
 Caches are populated on first search and persisted to disk. On HF Spaces, they
 survive restarts and provide instant results for cached cities.
-### Pre-warming
 ```bash
-python scripts/prewarm_cache.py
 ```
-Populates all 3 caches for a set of popular cities × all 7 categories.
-Caches are num-agnostic — one entry per (city, category) is enough.
 ### Cache Health Check
 ```bash
-python scripts/check_cache.py           # scan + fix
 python scripts/check_cache.py --report-only  # scan only
 ```
@@ -137,13 +145,17 @@ roamify/
 │   └── utils/
 │       └── prompts.py           # Category-specific AI prompt templates
 ├── scripts/
-│   ├── prewarm_cache.py         # Multi-category pre-warming
-│   └── check_cache.py           # Cache health check & repair
 ├── .streamlit/
 │   └── config.toml              # Streamlit server and theme config
-├── .llm_cache.json              # Disk-persisted recommendation cache
-├── .image_cache.json            # Disk-persisted image URL cache
-├── .geocode_cache.json          # Disk-persisted geocoding cache
 ├── Dockerfile                   # HF Spaces deployment
 ├── requirements.txt
 └── README.md

 - **7 travel categories**: Landmark, Culture, Nature, Gems, Photo, Food, Shopping
 - **AI-generated recommendations** with descriptions, tips, and coordinates
+- **5-tier image fallback + emoji**: Wikipedia → Wikidata → Commons → Local name → Unsplash → emoji (🏛️)
 - **Real coordinates** from Nominatim geocoding with LLM-coord fast-path
 - **Leaflet map** with spider markers, card↔map hover sync
 - **Multi-language translation**: Traditional Chinese, Japanese, Korean, French, Spanish, German
 Caches are populated on first search and persisted to disk. On HF Spaces, they
 survive restarts and provide instant results for cached cities.
+### Warmup
 ```bash
+# Full 18-city × 7-category warmup (LLM + image enrichment)
+python scripts/warmup.py
+# Fast warmup (LLM data only, skip sequential image fix)
+python scripts/warmup_fast.py
+# Re-warmup specific cities (e.g. after coordinate fixes)
+python scripts/warmup.py -c Bali -c Dubai
 ```
+Generates all 126 city × category combos (2,300+ items across 3 caches).
+Resumable — interrupted runs pick up where they left off.
 ### Cache Health Check
 ```bash
+python scripts/warmup.py --fix        # re-check images on cached entries
+python scripts/check_cache.py          # scan + fix
 python scripts/check_cache.py --report-only  # scan only
 ```
 │   └── utils/
 │       └── prompts.py           # Category-specific AI prompt templates
 ├── scripts/
+│   ├── warmup.py                  # Full 18-city unified warmup
+│   ├── warmup_fast.py             # Fast LLM-only warmup (skips image fix)
+│   ├── check_cache.py             # Cache health check & repair
+│   ├── fix_images.py              # Parallel image enrichment pass
+│   └── clear_poor_entries.py      # Clear cache for re-warmup
 ├── .streamlit/
 │   └── config.toml              # Streamlit server and theme config
+├── .llm_cache.json              # Disk-persisted recommendation cache (~850KB)
+├── .image_cache.json            # Disk-persisted image URL cache (~300KB)
+├── .geocode_cache.json          # Disk-persisted geocoding cache (~290KB)
+├── .translation_cache.json      # Disk-persisted translation cache (~220KB)
 ├── Dockerfile                   # HF Spaces deployment
 ├── requirements.txt
 └── README.md
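
The fallback chain named in the updated README bullet can be pictured as trying resolvers in order and stopping at the first hit. The sketch below is illustrative only: `resolve_image`, the `Resolver` type, the stub resolvers, and `EMOJI_FALLBACK` are hypothetical names, not functions from `recommender.py`, and the stubs stand in for the real Wikipedia/Wikidata/Commons/Unsplash lookups.

```python
from typing import Callable, Optional

# Hypothetical placeholder used when every image source fails.
EMOJI_FALLBACK = "🏛️"

Resolver = Callable[[str], Optional[str]]

def resolve_image(name: str, resolvers: list[tuple[str, Resolver]]) -> tuple[str, str]:
    """Return (tier_label, image_url_or_emoji) from the first resolver that succeeds."""
    for label, resolver in resolvers:
        try:
            url = resolver(name)
        except Exception:
            # A failing tier should not abort the chain; move on to the next one.
            url = None
        if url:
            return label, url
    return "emoji", EMOJI_FALLBACK

# Stub resolvers (assumptions for illustration; real tiers would call external APIs).
def wikipedia_stub(name: str) -> Optional[str]:
    return None  # pretend this tier has no image

def commons_stub(name: str) -> Optional[str]:
    return f"https://example.org/commons/{name.replace(' ', '_')}.jpg"

chain = [("wikipedia", wikipedia_stub), ("commons", commons_stub)]
tier, image = resolve_image("Eiffel Tower", chain)
no_tier, no_image = resolve_image("Nowhere", [("wikipedia", wikipedia_stub)])
```

Keeping the tiers as an ordered list of `(label, callable)` pairs means dropping or reordering a tier is a one-line change, and the emoji stays a guaranteed last resort.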

scripts/clear_poor_entries.py ADDED

	@@ -0,0 +1,61 @@

+#!/usr/bin/env python3
+"""Clear specific cache entries so they get regenerated with the new adaptive radius."""
+import json, os, sys
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+from dotenv import load_dotenv
+load_dotenv(dotenv_path=os.path.join(os.path.dirname(__file__), "..", ".env"), override=True)
+CACHE_FILE = os.path.join(os.path.dirname(__file__), "..", ".llm_cache.json")
+WARMUP_PROGRESS = os.path.join(os.path.dirname(__file__), "..", ".warmup_progress.json")
+CATS = ["Landmark", "Culture", "Nature", "Gems", "Photo", "Food", "Shopping"]
+def cat_hash(name):
+    d = {c: (c == name) for c in CATS}
+    return json.dumps(d, sort_keys=True)
+# Cities to fully clear (all categories)
+FULL_CLEAR = ["Bali", "Dubai"]
+# Specific combos to clear
+COMBO_CLEAR = [
+    ("Marrakech", "Landmark"),
+    ("Kyoto", "Shopping"),
+]
+with open(CACHE_FILE) as f:
+    cache = json.load(f)
+removed = 0
+# Full clear
+for city in FULL_CLEAR:
+    for cat in CATS:
+        key = json.dumps([city, cat_hash(cat)])
+        if key in cache:
+            del cache[key]
+            removed += 1
+# Specific combos
+for city, cat in COMBO_CLEAR:
+    key = json.dumps([city, cat_hash(cat)])
+    if key in cache:
+        del cache[key]
+        removed += 1
+with open(CACHE_FILE, "w") as f:
+    json.dump(cache, f)
+# Also clear warmup progress for these so the warmup retries them.
+# The progress file may not exist yet (e.g. before the first warmup run).
+progress = {"version": 1, "combos": {}}
+if os.path.exists(WARMUP_PROGRESS):
+    with open(WARMUP_PROGRESS) as f:
+        progress = json.load(f)
+for cid in list(progress["combos"].keys()):
+    city, cat = cid.split("::")
+    if city in FULL_CLEAR or (city, cat) in COMBO_CLEAR:
+        del progress["combos"][cid]
+with open(WARMUP_PROGRESS, "w") as f:
+    json.dump(progress, f, indent=2)
+print(f"Cleared {removed} cache entries + progress for re-warmup")
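
To make the key scheme concrete: each cache key pairs a city with a one-hot category dictionary serialized as sorted JSON. The sketch below reuses `cat_hash` as defined in the script; `disk_key` is a hypothetical helper name, and treating the on-disk key as a JSON-encoded `[city, cat_hash]` list is inferred from how this script builds its lookups, not confirmed against the cache loader.

```python
import json

CATS = ["Landmark", "Culture", "Nature", "Gems", "Photo", "Food", "Shopping"]

def cat_hash(name: str) -> str:
    """One-hot category dict, serialized with sorted keys so the string is stable."""
    return json.dumps({c: (c == name) for c in CATS}, sort_keys=True)

def disk_key(city: str, cat: str) -> str:
    """Hypothetical helper: the string key this script looks up in .llm_cache.json."""
    return json.dumps([city, cat_hash(cat)])

key = disk_key("Bali", "Food")
city, selection_json = json.loads(key)   # the key round-trips back to its parts
selection = json.loads(selection_json)
selected = [c for c, on in selection.items() if on]
```

Because `sort_keys=True` fixes the ordering, the same (city, category) pair always yields the same key string, which is what lets a separate script delete entries without importing the recommender.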

scripts/fix_images.py ADDED

	@@ -0,0 +1,64 @@

+#!/usr/bin/env python3
+"""Fix missing images across all cached cities using parallel enrichment."""
+import sys, os, json, time
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+from dotenv import load_dotenv
+load_dotenv(dotenv_path=os.path.join(os.path.dirname(__file__), "..", ".env"), override=True)
+from services.recommender import (
+    _LLM_CACHE, _IMAGE_CACHE, _save_image_cache,
+    _enrich_with_images,
+)
+# Collect all items that have no image_url
+CITIES = ['Paris','London','Rome','Barcelona','New York','Tokyo','Bangkok','Sydney',
+          'Cape Town','Rio de Janeiro','Istanbul','Dubai','Seoul','Bali','Prague',
+          'San Francisco','Marrakech','Kyoto']
+CATS = ['Landmark','Culture','Nature','Gems','Photo','Food','Shopping']
+def cat_hash(name):
+    d = {c: (c==name) for c in CATS}
+    return json.dumps(d, sort_keys=True)
+# Group missing-image items by city for parallel enrichment
+by_city = {}
+total_missing = 0
+for city in CITIES:
+    city_items = []
+    for cat in CATS:
+        key = (city, cat_hash(cat))
+        items = _LLM_CACHE.get(key, [])
+        if items:
+            for item in items:
+                if not item.get("image_url"):
+                    city_items.append(item)
+    if city_items:
+        by_city[city] = city_items
+        total_missing += len(city_items)
+        print(f'{city}: {len(city_items)} items missing images')
+print(f'\nTotal items missing images: {total_missing}')
+# Enrich each city's items in parallel (6 workers per batch); up to 4 cities run concurrently
+import concurrent.futures
+with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
+    futures = {}
+    for city, items in by_city.items():
+        f = pool.submit(_enrich_with_images, items, city=city)
+        futures[f] = city
+    for f in concurrent.futures.as_completed(futures):
+        city = futures[f]
+        try:
+            result = f.result()
+            fixed = sum(1 for it in result if it.get("image_url"))
+            print(f'  {city}: fixed {fixed} of {len(by_city[city])} missing')
+        except Exception as e:
+            print(f'  {city}: error - {e}')
+_save_image_cache()
+# Final tally
+still_missing = sum(1 for v in _LLM_CACHE.values() if v for it in v if not it.get("image_url"))
+print(f'\nStill missing after fix: {still_missing} (from {total_missing})')
+print(f'Image cache entries: {len(_IMAGE_CACHE)}')
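
The submit / `as_completed` pattern used above, reduced to a self-contained sketch. `fake_enrich` is a stand-in for `_enrich_with_images` (whose real behavior is not shown in this diff), and the sample data is invented for illustration.

```python
import concurrent.futures

def fake_enrich(items: list[dict], city: str) -> list[dict]:
    """Stand-in enrichment: fills in a placeholder URL for every item lacking one."""
    for item in items:
        if not item.get("image_url"):
            item["image_url"] = f"https://example.org/{city}/{item['name']}.jpg"
    return items

by_city = {
    "Paris": [{"name": "louvre"}, {"name": "orsay"}],
    "Kyoto": [{"name": "kinkakuji"}],
}

fixed_per_city: dict[str, int] = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    # Map each future back to its city so results can be reported as they finish.
    futures = {
        pool.submit(fake_enrich, items, city=city): city
        for city, items in by_city.items()
    }
    for future in concurrent.futures.as_completed(futures):
        city = futures[future]
        try:
            result = future.result()
            fixed_per_city[city] = sum(1 for it in result if it.get("image_url"))
        except Exception as exc:
            # One city failing should not stop the others.
            print(f"{city}: error - {exc}")
```

Calling `future.result()` inside the `try` is what surfaces an exception raised in a worker thread; without it, a failed city would pass silently.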

scripts/run_cities.sh ADDED

	@@ -0,0 +1,40 @@

+#!/usr/bin/env bash
+set -e
+ROAMIFY=/home/joe/repo_dev/roamify
+PYTHON=/home/joe/repo_dev/roamify/.venv/bin/python3
+LOG=$ROAMIFY/warmup_batch.log
+# Cities to process, in order
+CITIES=(
+  "Cape Town"
+  "Rio de Janeiro"
+  "Istanbul"
+  "Dubai"
+  "Seoul"
+  "Bali"
+  "Prague"
+  "San Francisco"
+  "Marrakech"
+  "Kyoto"
+)
+echo "=== Batch warmup started $(date) ===" > "$LOG"
+for city in "${CITIES[@]}"; do
+  echo ""
+  echo "═══ Processing: $city ═══" | tee -a "$LOG"
+  echo "Started: $(date)" >> "$LOG"
+  cd "$ROAMIFY"
+  if $PYTHON scripts/warmup.py --city "$city" >> "$LOG" 2>&1; then
+    echo "✅ $city — DONE at $(date)" | tee -a "$LOG"
+  else
+    echo "❌ $city — FAILED at $(date)" | tee -a "$LOG"
+    echo "See $LOG for details"
+    exit 1
+  fi
+done
+echo ""
+echo "🎉 All cities complete at $(date)" | tee -a "$LOG"

scripts/run_warmup.sh ADDED

	@@ -0,0 +1,3 @@

+#!/bin/bash
+cd /home/joe/repo_dev/roamify
+.venv/bin/python -u scripts/warmup.py 2>&1 | while IFS= read -r line; do echo "$line"; done

scripts/warmup_direct.py ADDED

	@@ -0,0 +1,16 @@

+#!/usr/bin/env python3
+"""Wrapper that ensures flushing for background warmup."""
+import sys
+import os
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+# Monkey-patch print to flush after every call
+_original_print = print
+def _flushing_print(*args, **kwargs):
+    kwargs.setdefault("flush", True)
+    _original_print(*args, **kwargs)
+import builtins
+builtins.print = _flushing_print
+from warmup import warmup
+warmup()
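
An alternative to replacing `builtins.print` is to bind `flush=True` with `functools.partial`, which keeps the change local to the module that uses it. The sketch below is not the commit's approach; `CountingStream` is a hypothetical test double used only to make the flush calls observable.

```python
import functools
import io

class CountingStream(io.StringIO):
    """In-memory stream that counts how often flush() is called."""

    def __init__(self) -> None:
        super().__init__()
        self.flushes = 0

    def flush(self) -> None:
        self.flushes += 1
        super().flush()

stream = CountingStream()

# Same default as the monkey-patch (flush=True unless the caller overrides it),
# but scoped to this name instead of every print() in the process.
flushing_print = functools.partial(print, flush=True, file=stream)

flushing_print("warming Paris / Food")
flushing_print("warming Paris / Nature")
output = stream.getvalue()
```

The trade-off is reach: the monkey-patch also affects `print` calls inside the imported `warmup` module, while the `partial` only affects code that calls `flushing_print` explicitly.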

scripts/warmup_fast.py ADDED

	@@ -0,0 +1,136 @@

+#!/usr/bin/env python3
+"""
+Fast warmup — generates LLM data for missing combos only.
+Skips the slow sequential image fix; get_recommendations already does parallel enrichment.
+"""
+import os, sys, time, json
+from datetime import datetime
+sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "src"))
+from dotenv import load_dotenv
+load_dotenv(dotenv_path=os.path.join(os.path.dirname(__file__), "..", ".env"), override=True)
+from services.recommender import (
+    get_recommendations_cached,
+    _LLM_CACHE,
+    _IMAGE_CACHE,
+    _GEOCODE_CACHE,
+)
+CITIES = [
+    "Paris", "London", "Rome", "Barcelona", "New York", "Tokyo",
+    "Bangkok", "Sydney", "Cape Town", "Rio de Janeiro", "Istanbul",
+    "Dubai", "Seoul", "Bali", "Prague", "San Francisco", "Marrakech", "Kyoto",
+]
+CATEGORIES = ["Landmark", "Culture", "Nature", "Gems", "Photo", "Food", "Shopping"]
+PROGRESS_FILE = os.path.join(os.path.dirname(__file__), "..", ".warmup_progress.json")
+def cat_dict(cat_name: str) -> dict:
+    return {name: (name == cat_name) for name in CATEGORIES}
+def cat_hash(cat_name: str) -> str:
+    return json.dumps(cat_dict(cat_name), sort_keys=True)
+def load_progress() -> dict:
+    if not os.path.exists(PROGRESS_FILE):
+        return {"version": 1, "combos": {}}
+    try:
+        with open(PROGRESS_FILE) as f:
+            return json.load(f)
+    except (json.JSONDecodeError, OSError):
+        return {"version": 1, "combos": {}}
+def save_progress(progress: dict):
+    with open(PROGRESS_FILE, "w") as f:
+        json.dump(progress, f, indent=2)
+def combo_id(city: str, cat: str) -> str:
+    return f"{city}::{cat}"
+def is_done(progress: dict, cid: str) -> bool:
+    entry = progress["combos"].get(cid)
+    return bool(entry) and entry.get("status") == "success"
+progress = load_progress()
+llm_before = len(_LLM_CACHE)
+# Only process combos that actually need LLM generation
+todo = []
+for city in CITIES:
+    for cat in CATEGORIES:
+        cid = combo_id(city, cat)
+        if is_done(progress, cid):
+            continue
+        key = (city, cat_hash(cat))
+        if key in _LLM_CACHE:
+            # Already in the LLM cache (just not recorded in progress): skip, no API call needed
+            continue
+        todo.append((city, cat))
+total = len(todo)
+print(f"Missing combos needing API calls: {total}")
+print()
+for i, (city, cat) in enumerate(todo, 1):
+    cid = combo_id(city, cat)
+    print(f"[{i}/{total}] 🔍 {city} / {cat}...", end=" ", flush=True)
+    start = time.time()
+    provider_log = []
+    try:
+        result = get_recommendations_cached(
+            city=city, num_attractions=19,
+            categories=cat_dict(cat),
+            temperature=0,
+            provider_log=provider_log,
+        )
+        elapsed = time.time() - start
+        for entry in provider_log:
+            label = entry.get("provider", "?")
+            status = "✅" if entry.get("status") == "success" else "❌"
+            items = entry.get("items", 0)
+            dur = entry.get("elapsed", "?")
+            print(f"\n  {label} {status} {dur}s ({items}it)", end="", flush=True)
+        if result:
+            items = len(result)
+            print(f"\n✅ {items} items, {elapsed:.0f}s total")
+            progress["combos"][cid] = {
+                "status": "success", "items": items,
+                "elapsed": round(elapsed, 1),
+                "provider_chain": provider_log,
+                "timestamp": datetime.now().isoformat(),
+            }
+        else:
+            print(f"\n❌ returned None, {elapsed:.0f}s total")
+            progress["combos"][cid] = {
+                "status": "failed", "elapsed": round(elapsed, 1),
+                "provider_chain": provider_log,
+                "error": "all providers returned None",
+                "timestamp": datetime.now().isoformat(),
+            }
+    except Exception as e:
+        elapsed = time.time() - start
+        print(f"\n❌ {elapsed:.0f}s — {e}")
+        progress["combos"][cid] = {
+            "status": "failed", "elapsed": round(elapsed, 1),
+            "error": str(e), "timestamp": datetime.now().isoformat(),
+        }
+    save_progress(progress)
+    if i < total:
+        time.sleep(1.5)  # Nominatim-friendly pause
+# Summary
+success = sum(1 for v in progress["combos"].values() if v.get("status") == "success")
+failed = sum(1 for v in progress["combos"].values() if v.get("status") == "failed")
+new_llm = len(_LLM_CACHE) - llm_before
+print("\n" + "=" * 50)
+print(f"Done! {success} success, {failed} failed, {new_llm} new cache entries")
+failed_combos = [k for k,v in progress["combos"].items() if v.get("status") == "failed"]
+if failed_combos:
+    print("Failed combos:")
+    for c in failed_combos:
+        print(f"  ❌ {c.replace('::', ' / ')}")

src/services/recommender.py CHANGED

@@ -672,22 +672,34 @@ def _geocode_city(city: str) -> tuple[float, float, list[float]] | None:
 def _verify_coordinates(items: list[dict], city: str) -> list[dict]:
     """Verify attraction coordinates.
     Strategy:
-    1. Geocode city center (1 cached Nominatim query)
-    2. For each item: if LLM-provided coords are non-zero and within 15km of
-       city center, trust them — skip Nominatim entirely.
-    3. Only geocode items whose LLM coords fail the radius check.
     This eliminates ~80% of Nominatim calls on a good LLM response.
     """
     # Geocode city center (cached — sleep handled internally)
     city_result = _geocode_city(city)
     if city_result:
         city_center = (city_result[0], city_result[1])
     else:
         city_center = None
-    MAX_CITY_DIST_KM = 15
     verified = []
     for item in items:

 def _verify_coordinates(items: list[dict], city: str) -> list[dict]:
     """Verify attraction coordinates.
     Strategy:
+    1. Geocode city center (1 cached Nominatim query), get bounding box
+    2. Adaptive radius: max(15km, bounding_box_diagonal × 0.6)
+       Compact European cities stay ~15km, spread-out cities (Bali, Dubai)
+       get a larger radius proportional to their bounding box.
+    3. For each item: if LLM-provided coords are non-zero and within
+       adaptive radius of city center, trust them — skip Nominatim entirely.
+    4. Only geocode items whose LLM coords fail the radius check.
     This eliminates ~80% of Nominatim calls on a good LLM response.
     """
     # Geocode city center (cached — sleep handled internally)
     city_result = _geocode_city(city)
     if city_result:
         city_center = (city_result[0], city_result[1])
+        # Adaptive radius: use bounding box diagonal × 0.6, min 15km
+        # This handles spread-out cities (Bali, Dubai, Rio, etc.) while keeping
+        # compact European cities tight.
+        bb = city_result[2]
+        if len(bb) == 4:
+            km_lat = (bb[1] - bb[0]) * 111.0
+            km_lon = (bb[3] - bb[2]) * 111.0 * math.cos(math.radians(city_center[0]))
+            MAX_CITY_DIST_KM = max(15, math.sqrt(km_lat**2 + km_lon**2) * 0.6)
+        else:
+            MAX_CITY_DIST_KM = 15
     else:
         city_center = None
+        MAX_CITY_DIST_KM = 15
     verified = []
     for item in items:
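
The radius calculation from the hunk above, pulled out as a standalone sketch. `adaptive_radius_km` and the two constants are hypothetical names; the `[south, north, west, east]` ordering is inferred from the index arithmetic in the diff, and the bounding boxes are synthetic, so the results will not match the per-city figures quoted in the commit message.

```python
import math

MIN_RADIUS_KM = 15.0
KM_PER_DEGREE = 111.0  # rough length of one degree of latitude

def adaptive_radius_km(bbox: list[float], center_lat: float) -> float:
    """Return max(15 km, 0.6 x bounding-box diagonal).

    bbox is assumed to be [south, north, west, east] in degrees.
    """
    if len(bbox) != 4:
        return MIN_RADIUS_KM
    km_lat = (bbox[1] - bbox[0]) * KM_PER_DEGREE
    # Longitude degrees shrink with latitude, hence the cosine correction.
    km_lon = (bbox[3] - bbox[2]) * KM_PER_DEGREE * math.cos(math.radians(center_lat))
    return max(MIN_RADIUS_KM, math.hypot(km_lat, km_lon) * 0.6)

# Synthetic boxes on the equator, where the cosine correction is 1.
compact = adaptive_radius_km([0.0, 0.1, 0.0, 0.1], center_lat=0.0)    # small box: floor applies
sprawling = adaptive_radius_km([0.0, 1.0, 0.0, 1.0], center_lat=0.0)  # large box: scales up
malformed = adaptive_radius_km([], center_lat=0.0)                    # missing box: floor applies
```

The 15 km floor keeps compact cities at the previous behavior, so only cities whose bounding box diagonal exceeds 25 km (15 / 0.6) see a wider radius.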