	# Lars Urlaubs-Deals: Technical Documentation

	## Overview
	This project is an AI-powered vacation deal finder specializing in dog-friendly accommodations across multiple platforms (Airbnb, Booking.com).

	It is designed for:
	- resilient scraping under rate limiting,
	- unified scoring and filtering,
	- transparent run diagnostics via observability and health data,
	- practical dashboard usability for day-to-day deal checks.

	## Recent Improvements (2026-02)
	The latest enhancement cycle added:
	- Run-level observability with run IDs, KPI counters, and structured events.
	- Central deal validation before ranking.
	- Robust price alerts (dedupe, cooldown, per-deal threshold overrides).
	- CLI scheduler mode for periodic background searches.
	- Dashboard UX upgrades (client-side filters, better empty/error states, price trend chips).
	- Currency normalization to EUR for fair cross-market ranking.

	## Core Features

	### 1. Smart Scrapers (Multi-Strategy)
	Both Airbnb and Booking.com scrapers follow a tiered strategy:
	- Strategy 1: Local Curl/HTTP (fastest, cheapest)
	- Strategy 2: Firecrawl Cloud (reliable rendered fallback)
	- Strategy 3: Static fallback data (keeps UI functional if everything else fails)

	Each strategy attempt is instrumented per source and per strategy, recording its duration and whether it succeeded.
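
The tiered fallback can be pictured with the following minimal sketch. It is not the project's actual scraper code: `run_strategies`, the `(name, callable)` pairs, and the metric field names are hypothetical, and an attempt is assumed to count as successful only if it returns at least one deal.

```python
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical type: a strategy takes a search URL and returns a list of raw deals.
Strategy = Callable[[str], list]


def run_strategies(
    url: str,
    strategies: List[Tuple[str, Strategy]],
    metrics: list,
) -> Tuple[Optional[str], list]:
    """Try each strategy in order and return (strategy_name, deals) for the first success."""
    for name, strategy in strategies:
        started = time.perf_counter()
        try:
            deals = strategy(url)
            ok = bool(deals)  # assumption: an empty result counts as a failed attempt
        except Exception:
            deals, ok = [], False
        # Record one metric entry per attempt, successful or not.
        metrics.append(
            {
                "strategy": name,
                "success": ok,
                "duration_s": time.perf_counter() - started,
            }
        )
        if ok:
            return name, deals
    return None, []
```

Passing the strategies as an ordered list keeps the cheapest option first and makes the static fallback just another entry at the end of the list.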

	### 2. Rate Limit Bypass
	- User-Agent rotation
	- Adaptive delays that increase under pressure
	- Exponential backoff for repeated throttling
	- Optional session warming for more realistic request patterns
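
As an illustration of how adaptive delays and exponential backoff can fit together, here is a small sketch. The class name `AdaptiveDelay`, its parameters, the choice of status codes, and the placeholder User-Agent strings are all assumptions made for this example and do not come from the project.

```python
import random

# Placeholder User-Agent strings for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
]


def pick_user_agent(rng=random) -> str:
    """Rotate User-Agents by picking one at random for each request."""
    return rng.choice(USER_AGENTS)


class AdaptiveDelay:
    """Delay that doubles with each consecutive throttled response and resets on success."""

    def __init__(self, base=1.0, factor=2.0, max_delay=60.0, jitter=0.1):
        self.base = base
        self.factor = factor
        self.max_delay = max_delay
        self.jitter = jitter
        self.throttled = 0  # consecutive throttled responses seen so far

    def record(self, status_code: int) -> None:
        # Assumption: HTTP 429 and 503 are treated as throttling signals.
        if status_code in (429, 503):
            self.throttled += 1
        else:
            self.throttled = 0

    def next_delay(self, rng=random) -> float:
        """Return the number of seconds to wait before the next request."""
        delay = min(self.max_delay, self.base * self.factor ** self.throttled)
        # A small random jitter is added on top so requests do not fall into lockstep.
        return delay + rng.uniform(0, delay * self.jitter)
```

A caller would invoke `record(response.status_code)` after each request and sleep for `next_delay()` seconds before the next one.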

	### 3. Central Validation Pipeline
	All raw deals are validated before ranking:
	- required fields (name, location, source, url)
	- numeric sanity checks (price/rating/reviews)
	- budget boundaries
	- pet-friendly enforcement when pets are requested

	Validation output is returned in API/agent results (`valid_count`, `rejected_count`, reasons).
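
The checks listed above could be arranged roughly as follows. This is a sketch under stated assumptions: only `valid_count` and `rejected_count` are names taken from this document, while `validate_deals`, the deal field names (`price`, `rating`, `reviews`, `pet_friendly`), and the reason strings are hypothetical.

```python
REQUIRED_FIELDS = ("name", "location", "source", "url")


def _reject_reason(deal, max_price, pets_requested):
    """Return a short reason string if the deal is invalid, else None."""
    for field in REQUIRED_FIELDS:
        if not deal.get(field):
            return f"missing_{field}"
    price = deal.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        return "invalid_price"
    for field in ("rating", "reviews"):
        value = deal.get(field, 0)
        if not isinstance(value, (int, float)) or value < 0:
            return f"invalid_{field}"
    if max_price is not None and price > max_price:
        return "over_budget"
    if pets_requested and not deal.get("pet_friendly", False):
        return "not_pet_friendly"
    return None


def validate_deals(raw_deals, max_price=None, pets_requested=False):
    """Split raw deals into valid ones and per-reason rejection counts."""
    valid, reasons = [], {}
    for deal in raw_deals:
        reason = _reject_reason(deal, max_price, pets_requested)
        if reason is None:
            valid.append(deal)
        else:
            reasons[reason] = reasons.get(reason, 0) + 1
    return {
        "deals": valid,
        "valid_count": len(valid),
        "rejected_count": sum(reasons.values()),
        "reasons": reasons,
    }
```

Returning the first failing reason per deal keeps the counters easy to read: every rejected deal contributes to exactly one reason.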

	### 4. Observability & KPI Tracking
	A lightweight observability layer tracks each search run:
	- unique run ID
	- lifecycle events (`run_started`, source cache hits/misses, errors, `run_finished`)
	- per-run counters (cache hits, misses, valid deals, triggered alerts, etc.)
	- run summaries retained for health diagnostics

	`/health` includes an observability snapshot with active/recent runs.
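
A run tracker along these lines might look like the sketch below. Only the event names `run_started` and `run_finished` come from this document; the class `RunTracker`, its methods, and the shape of the snapshot are hypothetical and kept in memory for simplicity.

```python
import time
import uuid
from collections import Counter, deque


class RunTracker:
    """Tracks active runs and keeps a bounded list of recent run summaries."""

    def __init__(self, max_summaries=20):
        self.active = {}
        self.recent = deque(maxlen=max_summaries)  # oldest summaries drop off automatically

    def start_run(self) -> str:
        run_id = uuid.uuid4().hex
        self.active[run_id] = {
            "started_at": time.time(),
            "events": [],
            "counters": Counter(),
        }
        self.event(run_id, "run_started")
        return run_id

    def event(self, run_id, name, **fields) -> None:
        """Append a structured event to the run."""
        self.active[run_id]["events"].append({"event": name, "ts": time.time(), **fields})

    def incr(self, run_id, counter, amount=1) -> None:
        """Increase a per-run KPI counter."""
        self.active[run_id]["counters"][counter] += amount

    def finish_run(self, run_id) -> dict:
        self.event(run_id, "run_finished")
        run = self.active.pop(run_id)
        summary = {
            "run_id": run_id,
            "duration_s": time.time() - run["started_at"],
            "counters": dict(run["counters"]),
            "event_count": len(run["events"]),
        }
        self.recent.append(summary)
        return summary

    def snapshot(self) -> dict:
        """Data a health endpoint could expose."""
        return {"active_runs": list(self.active), "recent_runs": list(self.recent)}
```

A health handler could then return `tracker.snapshot()` alongside its other status fields.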

	### 5. Price Alert System
	Price alerts are persisted and now include robustness controls:
	- configurable drop threshold (global + per-deal override)
	- dedupe window for repeated identical updates
	- cooldown window to suppress duplicate alerts at the same price
	- capped history size per property
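
To make the interaction of these controls concrete, here is an in-memory sketch. It leaves out persistence, and the class `PriceAlerts`, its parameter names, and the default values are assumptions for illustration rather than the project's actual settings.

```python
class PriceAlerts:
    """Decides whether a new price observation should trigger an alert."""

    def __init__(self, threshold_pct=10.0, dedupe_s=300, cooldown_s=3600, max_history=50):
        self.threshold_pct = threshold_pct  # global drop threshold in percent
        self.dedupe_s = dedupe_s
        self.cooldown_s = cooldown_s
        self.max_history = max_history
        self.overrides = {}   # deal_id -> per-deal threshold in percent
        self.history = {}     # deal_id -> list of (timestamp, price)
        self.last_alert = {}  # deal_id -> (timestamp, price) of the last alert

    def record(self, deal_id, price, now) -> bool:
        """Store a price observation and return True if an alert should fire."""
        hist = self.history.setdefault(deal_id, [])
        if hist:
            last_ts, last_price = hist[-1]
            # Dedupe: ignore an identical update that arrives inside the window.
            if price == last_price and now - last_ts < self.dedupe_s:
                return False
        previous = hist[-1][1] if hist else None
        hist.append((now, price))
        del hist[:-self.max_history]  # cap the history size per property
        if previous is None or previous <= 0:
            return False
        drop_pct = (previous - price) / previous * 100
        if drop_pct < self.overrides.get(deal_id, self.threshold_pct):
            return False
        last = self.last_alert.get(deal_id)
        # Cooldown: do not alert again at the same price within the window.
        if last and last[1] == price and now - last[0] < self.cooldown_s:
            return False
        self.last_alert[deal_id] = (now, price)
        return True
```

Taking `now` as a parameter instead of reading the clock inside the method makes the dedupe and cooldown windows straightforward to exercise in tests.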

	### 6. Intelligent Caching
	- Local JSON cache (`.search_cache.json`) with TTL
	- repeated searches with the same parameters are served from the cache and return quickly
	- cache metrics included in observability
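
A TTL-based JSON cache of this kind can be sketched as follows. The file name `.search_cache.json` comes from this document; the class `SearchCache`, the hashed key scheme, and the stored entry layout are assumptions for the example.

```python
import hashlib
import json
import time
from pathlib import Path


class SearchCache:
    """Stores search results in one JSON file, keyed by a hash of the search parameters."""

    def __init__(self, path=".search_cache.json", ttl_s=3600):
        self.path = Path(path)
        self.ttl_s = ttl_s

    @staticmethod
    def key(params: dict) -> str:
        # sort_keys makes the key independent of parameter order.
        return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()

    def _load(self) -> dict:
        try:
            return json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return {}  # a missing or corrupt cache file is treated as empty

    def get(self, params: dict, now=None):
        """Return cached deals, or None on a miss or an expired entry."""
        now = time.time() if now is None else now
        entry = self._load().get(self.key(params))
        if entry is None or now - entry["stored_at"] > self.ttl_s:
            return None
        return entry["deals"]

    def set(self, params: dict, deals: list, now=None) -> None:
        now = time.time() if now is None else now
        data = self._load()
        data[self.key(params)] = {"stored_at": now, "deals": deals}
        self.path.write_text(json.dumps(data))
```

A caller would check `get(params)` first and only fall through to the scrapers when it returns `None`, which is also a natural place to count cache hits and misses.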

	### 7. Deal Ranking and Currency Normalization
	Deals are scored on price, rating, and review count, with multipliers for pet-friendliness and weather.

	All ranking prices are normalized to EUR using built-in FX rates (or optional custom per-deal `fx_rate_to_eur`), while preserving original currency/price in output metadata.
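
The normalization step might look like the sketch below. Only the `fx_rate_to_eur` field name is taken from this document. The function name, the other field names, and the rates in `FX_TO_EUR` are assumptions, and the rates are placeholder values rather than real exchange rates.

```python
# Placeholder rates for illustration only; these are not real exchange rates.
FX_TO_EUR = {"EUR": 1.0, "USD": 0.9, "GBP": 1.2}


def normalize_to_eur(deal: dict) -> dict:
    """Return a copy of the deal with an EUR price added and the original price preserved."""
    currency = deal.get("currency", "EUR")
    # A per-deal custom rate takes precedence over the built-in table.
    rate = deal.get("fx_rate_to_eur")
    if rate is None:
        rate = FX_TO_EUR.get(currency)
    if rate is None:
        raise ValueError(f"no FX rate available for {currency}")
    normalized = dict(deal)
    normalized["original_price"] = deal["price"]
    normalized["original_currency"] = currency
    normalized["price_eur"] = round(deal["price"] * rate, 2)
    return normalized
```

With every deal carrying a `price_eur` value, a ranking step can compare prices directly, for example with `sorted(deals, key=lambda d: d["price_eur"])`.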

	### 8. Dashboard UX
	The web dashboard includes:
	- source tabs and sorting modes,
	- client-side filters (minimum rating, max EUR/night, pet-only),
	- explicit empty-state messaging ("no source results" vs "filtered out"),
	- improved fetch error handling,
	- price trend badges when previous price context is available.

	## Scheduler Mode (CLI)
	You can run periodic searches from the CLI:

	```bash
	python main.py \
	--cities "Amsterdam,Rotterdam" \
	--checkin 2026-03-01 \
	--checkout 2026-03-05 \
	--schedule-minutes 30 \
	--max-runs 6
	```

	- `--schedule-minutes 0` keeps one-shot behavior (default).
	- `--max-runs 0` means unlimited scheduled cycles.
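
The flag semantics above can be sketched with `argparse` and a simple loop. This is not the contents of `main.py`: the function names are hypothetical, the default of `0` for `--max-runs` is an assumption, and `search` stands in for whatever performs one search cycle.

```python
import argparse
import time


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Vacation deal search (sketch)")
    parser.add_argument("--cities", required=True, help="comma-separated city names")
    parser.add_argument("--checkin", required=True, help="check-in date, YYYY-MM-DD")
    parser.add_argument("--checkout", required=True, help="check-out date, YYYY-MM-DD")
    parser.add_argument("--schedule-minutes", type=int, default=0,
                        help="interval between runs; 0 means run once")
    parser.add_argument("--max-runs", type=int, default=0,
                        help="number of scheduled runs; 0 means unlimited")
    return parser


def run_scheduled(search, schedule_minutes: int, max_runs: int, sleep=time.sleep) -> int:
    """Run `search` once, or repeatedly every `schedule_minutes`; return the run count."""
    runs = 0
    while True:
        search()
        runs += 1
        if schedule_minutes <= 0:
            break  # one-shot behavior
        if max_runs > 0 and runs >= max_runs:
            break  # reached the requested number of cycles
        sleep(schedule_minutes * 60)
    return runs
```

Injecting `sleep` as a parameter lets a test replace it with a recorder, so the loop can be checked without waiting.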

	## Tech Stack
	- Backend: Python, FastAPI, Uvicorn
	- Scraping: httpx, BeautifulSoup4, Firecrawl API
	- Frontend: Responsive HTML/JS dashboard (Tailwind-style utility classes)
	- Persistence: Local JSON files for cache and alerts

	## Testing
	Primary regression coverage for the new features includes:
	- `tests/test_price_alerts.py` (dedupe/cooldown/override + agent integration)
	- `tests/test_agent_validation.py` (pet filter + validation counters)
	- `tests/test_currency_normalization.py` (EUR normalization + custom FX override)
	- `tests/test_scheduler_cli.py` (scheduler CLI argument parsing)
	- `tests/test_caching.py` (cache behavior still valid)

	Example run:

	```bash
	PYTHONPATH=. pytest -q \
	tests/test_price_alerts.py \
	tests/test_agent_validation.py \
	tests/test_currency_normalization.py \
	tests/test_scheduler_cli.py
	```

	## Deployment Notes
	- Local web mode: `uvicorn api:app --reload`
	- Health check endpoint: `/health`
	- Search endpoint: `/search`

	## Repository / Distribution
	- Main source repository and deployment references remain unchanged.