ar0s commited on
Commit
2b8ca65
·
1 Parent(s): 588344b

first commit

Browse files
Files changed (14) hide show
  1. .gitignore +7 -0
  2. ARCHITECTURE.md +264 -0
  3. README.md +54 -6
  4. app.py +153 -0
  5. requirements.txt +13 -0
  6. sources.json +61 -0
  7. src/__init__.py +0 -0
  8. src/fetcher.py +115 -0
  9. src/interested.py +67 -0
  10. src/models.py +245 -0
  11. src/org_colors.py +26 -0
  12. src/ui_log.py +44 -0
  13. tests/__init__.py +0 -0
  14. tests/test_golden.py +46 -0
.gitignore ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ .venv/
2
+ __pycache__/
3
+ *.pyc
4
+ .data/
5
+ /data/
6
+ *.log
7
+ .env
ARCHITECTURE.md ADDED
@@ -0,0 +1,264 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Architecture / Code Flow (with ASCII maps)
2
+
3
+ This repo is intentionally small:
4
+
5
+ - `app.py` = UI + HTTP API (Gradio mounted into FastAPI)
6
+ - `src/fetcher.py` = crawling + LLM extraction + validation + caching
7
+ - `sources.json` = list of org sources to crawl
8
+ - `.data/events.json` (or `/data/events.json`) = cache / golden output format
9
+
10
+ ---
11
+
12
+ ## 1) High-level module map
13
+
14
+ ```
15
+ +--------------------+ imports / calls +---------------------+
16
+ | app.py | ---------------------------> | src/fetcher.py |
17
+ | | | |
18
+ | SeminarsWebApp | | SeminarFetcher |
19
+ | - Gradio UI | | + refresh_all* |
20
+ | - /refresh API | | + models + cache |
21
+ +--------------------+ +---------------------+
22
+ |
23
+ | reads
24
+ v
25
+ +--------------------+
26
+ | sources.json |
27
+ | list[OrgSource] |
28
+ +--------------------+
29
+
30
+ Cache file (written/read by fetcher):
31
+
32
+ +--------------------+
33
+ | .data/events.json | (or /data/events.json)
34
+ | meta + results |
35
+ +--------------------+
36
+ ```
37
+
38
+ ---
39
+
40
+ ## 2) Runtime flow (what happens when you open the UI)
41
+
42
+ ### 2.1 Gradio initial load
43
+
44
+ ```
45
+ Browser
46
+ |
47
+ v
48
+ Gradio page load
49
+ |
50
+ v
51
+ app.py: demo.load(load_initial)
52
+ |
53
+ v
54
+ SeminarsWebApp._stream_refresh(force=False) [generator]
55
+ |
56
+ v
57
+ src/fetcher.py: refresh_all_stream(... force=False)
58
+ |
59
+ +--> if cache usable -> yield logs + cached results -> done
60
+ |
61
+ \--> else -> crawl + LLM -> write cache -> done
62
+ ```
63
+
64
+ ### 2.2 Manual refresh button
65
+
66
+ ```
67
+ User clicks "Refresh now"
68
+ |
69
+ v
70
+ app.py: refresh_btn.click(refresh_click)
71
+ |
72
+ v
73
+ SeminarsWebApp._stream_refresh(force=True)
74
+ |
75
+ v
76
+ src/fetcher.py: refresh_all_stream(... force=True)
77
+ |
78
+ v
79
+ Always crawls + LLM, then writes cache
80
+ ```
81
+
82
+ ---
83
+
84
+ ## 3) Cache flow (explicit in the app)
85
+
86
+ The app explicitly checks cache before crawling.
87
+
88
+ ```
89
+ SeminarsWebApp.stream_refresh(force=False)
90
+ |
91
+ v
92
+ cache = CacheStore(config=..., ttl_hours=...)
93
+ |
94
+ +--> if cache.is_usable(): cache.load() -> UI updates
95
+ |
96
+ \--> else: crawl + LLM -> cache.write(results)
97
+ ```
98
+
99
+ ---
100
+
101
+ ## 4) Detailed fetcher flow (one org)
102
+
103
+ The fetcher is designed around a *stream* of events:
104
+
105
+ - log event: `("log", level, message)`
106
+ - result event: `("result", EventResult)`
107
+
108
+ ### 4.1 One-org pipeline
109
+
110
+ ```
111
+ SeminarFetcher.fetch_next_event_for_org_stream(org)
112
+ |
113
+ v
114
+ for hop in 1..max_hops:
115
+ |
116
+ +--> fetch_html(url)
117
+ | - httpx GET
118
+ | - if 403: optional curl fallback
119
+ |
120
+ +--> llm_extract(...)
121
+ | - text_and_links(html)
122
+ | - LiteLLM completion(...)
123
+ | - safe_json() + normalize_llm_payload()
124
+ | - Pydantic validation -> LlmHopResult
125
+ |
126
+ +--> validate_events(hop, now)
127
+ | - parse_dt(start_time)
128
+ | - filter to future events only
129
+ | - ensure evidence + http(s) URL
130
+ |
131
+ +--> yield ("result", EventResult)
132
+ |
133
+ \--> (optional) follow hop.next_url_to_check if provided
134
+ ```
135
+
136
+ ### 4.2 Key idea: strictness retry
137
+
138
+ If the LLM returns something that is not valid JSON or doesn’t validate, the code retries once in “strict” mode.
139
+
140
+ ```
141
+ llm_extract(strict=False)
142
+ |
143
+ +--> (fails JSON / schema) => retry
144
+ v
145
+ llm_extract(strict=True)
146
+ ```
147
+
148
+ ---
149
+
150
+ ## 5) Data model / JSON shapes
151
+
152
+ ### 5.1 Source input (`sources.json`)
153
+
154
+ ```
155
+ [
156
+ {
157
+ "id": "utoronto",
158
+ "name": "UofT Robotics",
159
+ "url": "https://...",
160
+ "tags": ["canada", "university"]
161
+ },
162
+ ...
163
+ ]
164
+ ```
165
+
166
+ Validated into `OrgSource`.
167
+
168
+ ### 5.2 Per-org output (`EventResult`)
169
+
170
+ ```
171
+ EventResult:
172
+ org_id, org_name, source_url
173
+ status: "ok" | "no_upcoming" | "error"
174
+ events: [LlmEvent, ...]
175
+ checked_at
176
+ hops
177
+ visited_urls
178
+ error (optional)
179
+ ```
180
+
181
+ ### 5.3 Cache file (`.data/events.json`)
182
+
183
+ ```
184
+ {
185
+ "meta": {
186
+ "model": "...",
187
+ "schema_version": 3,
188
+ "cached_at": "...",
189
+ "ttl_hours": 12
190
+ },
191
+ "results": [ EventResult, ... ]
192
+ }
193
+ ```
194
+
195
+ Cache is considered usable when:
196
+ - file exists
197
+ - file age < ttl
198
+ - `meta.schema_version == 3`
199
+ - `meta.model == current LLM model`
200
+
201
+ ---
202
+
203
+ ## 6) Where logs come from
204
+
205
+ Logs are generated in two layers:
206
+
207
+ 1) Fetcher (per hop / per org)
208
+
209
+ ```
210
+ "{org}: hop i/j — HTTP GET start: ..."
211
+ "{org}: hop i/j — HTTP GET done (...)"
212
+ "{org}: hop i/j — LLM call start (model=...)"
213
+ "{org}: hop i/j — LLM call done (...)"
214
+ "{org}: hop i/j — validating extracted event(s)"
215
+ "{org}: success (...)" OR "no upcoming events" OR "error (...)"
216
+ ```
217
+
218
+ 2) App wrapper (per org result summary)
219
+
220
+ ```
221
+ "{org}: ok (k event(s))"
222
+ "{org}: no upcoming events found"
223
+ "{org}: <error message>"
224
+ ```
225
+
226
+ ---
227
+
228
+ ## 7) Environment variables (practical cheat-sheet)
229
+
230
+ ### App / paths
231
+
232
+ - `SOURCES_PATH` (default `sources.json`)
233
+ - `DATA_DIR` (default `.data`, or `/data` if that directory exists)
234
+ - `CACHE_TTL_HOURS` (default `12`)
235
+ - `PORT` (default `7860`)
236
+
237
+ ### `/refresh` auth
238
+
239
+ - `REFRESH_TOKEN` (required to use `/refresh`)
240
+
241
+ ### LLM (LiteLLM)
242
+
243
+ - `LITELLM_MODEL` (or `GEMINI_MODEL` fallback)
244
+ - `LITELLM_API_KEY` (or `GEMINI_API_KEY` fallback)
245
+ - `LITELLM_API_BASE` (optional)
246
+
247
+ Optional knobs:
248
+ - `LLM_TEMPERATURE` (default `0`)
249
+ - `LLM_SEED` (optional)
250
+ - `LLM_MIN_INTERVAL_SECONDS` (optional throttling)
251
+ - `NOW_ISO` (optional override of “current time” for deterministic runs)
252
+
253
+ ---
254
+
255
+ ## 8) Quick “read order” (if you’re new)
256
+
257
+ 1) `app.py`:
258
+ - `SeminarsWebApp._stream_refresh()` to see end-to-end UI flow
259
+ - `build_fastapi()` for `/refresh`
260
+
261
+ 2) `src/fetcher.py`:
262
+ - `refresh_all_stream()` to see caching vs crawling
263
+ - `SeminarFetcher.fetch_next_event_for_org_stream()` for the main pipeline
264
+ - `llm_extract()` + `validate_events()` for correctness guarantees
README.md CHANGED
@@ -1,12 +1,60 @@
1
  ---
2
- title: Robotic Seminars
3
- emoji: 📊
4
- colorFrom: indigo
5
- colorTo: pink
6
  sdk: gradio
7
- sdk_version: 6.3.0
8
  app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: Robotic seminars
3
+ emoji: "🤖"
4
+ colorFrom: blue
5
+ colorTo: purple
6
  sdk: gradio
7
+ sdk_version: "5.12.0"
8
  app_file: app.py
9
  pinned: false
10
  ---
11
 
12
+ # Robotic seminars (HF Space)
13
+
14
+ This Hugging Face Space aggregates the **next upcoming event** per organization using an LLM via **LiteLLM**. It fetches each page as HTML, extracts a compact text context + candidate links, and asks the model to return structured JSON (up to 3 hops).
15
+
16
+ ## Files
17
+
18
+ - [sources.json](sources.json): list of orgs + starting URLs
19
+ - [app.py](app.py): Gradio UI
20
+ - [src/fetcher.py](src/fetcher.py): LiteLLM hop loop, validation, caching
21
+
22
+ ## Environment variables (HF Space “Secrets”)
23
+
24
+ - `LITELLM_MODEL` (recommended): LiteLLM model string, e.g. `gemini/gemini-2.0-flash`, `openai/gpt-4o-mini`, `anthropic/claude-3-5-sonnet-20241022`
25
+ - `LITELLM_FALLBACK_MODELS` (optional): comma-separated fallback models to try if the primary hits a rate limit
26
+ - `LLM_MIN_INTERVAL_SECONDS` (optional): minimum delay between LLM calls (useful for very low RPM limits)
27
+ - `LITELLM_API_KEY` (optional): explicit API key to pass to LiteLLM
28
+ - `LITELLM_API_BASE` (optional): custom base URL (useful for proxies/self-hosted endpoints)
29
+
30
+ Backwards-compatible (still accepted):
31
+ - `GEMINI_API_KEY` (optional): used as a fallback for `LITELLM_API_KEY`
32
+ - `GEMINI_MODEL` (optional): used as a fallback for `LITELLM_MODEL`
33
+
34
+ Provider-specific env vars also work (recommended):
35
+ - `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc.
36
+
37
+ - `CACHE_TTL_HOURS` (optional): defaults to `12`
38
+
39
+ ## Cache
40
+
41
+ The app writes its cache to `/data/robotic_seminars/events.json`.
42
+
43
+ - On Hugging Face Spaces: enable **Persistent Storage** so `/data` exists and is writable.
44
+ - Locally: create `/data/robotic_seminars` and ensure it’s writable by your user.
45
+
46
+ ## Local run
47
+
48
+ ```bash
49
+ python -m venv .venv
50
+ source .venv/bin/activate
51
+ pip install -r requirements.txt
52
+
53
+ # Create a .env from the template and set your keys/model:
54
+ cp .env.example .env
55
+
56
+ # Or export env vars manually if you prefer.
57
+ python app.py
58
+ ```
59
+
60
+ Open http://localhost:7860
app.py ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import os
5
+ from datetime import datetime, timezone
6
+ from pathlib import Path
7
+
8
+ import gradio as gr
9
+ from dotenv import load_dotenv
10
+ from fastapi import FastAPI
11
+
12
+ from src.fetcher import CacheStore, LlmConfig, OrgSource, SeminarFetcher
13
+ from src.interested import COLS, inc_interested, results_table
14
+ from src.models import parse_dt_utc
15
+ from src.org_colors import org_colors
16
+ from src.ui_log import bind as bind_log, error, info
17
+
18
+ # import debugpy
19
+ # print("Waiting for debugger attach...")
20
+ # debugpy.listen(5678)
21
+ # debugpy.wait_for_client()
22
+ # print("Debugger attached.")
23
+
24
+ load_dotenv(Path(__file__).with_name(".env"))
25
+
26
+
27
def ts() -> str:
    """Current UTC time as an ISO-8601 string, truncated to whole seconds."""
    now = datetime.now(timezone.utc)
    return now.isoformat(timespec="seconds")
29
+
30
+ SOURCES_PATH = os.environ.get("SOURCES_PATH", "sources.json")
31
+ TTL_HOURS = float(os.environ.get("CACHE_TTL_HOURS", "12"))
32
+ LLM = LlmConfig(model=os.environ.get("LITELLM_MODEL", "gemini/gemini-2.0-flash"), api_key=os.environ.get("LITELLM_API_KEY"))
33
+
34
+ SOURCES_RAW = json.loads(Path(SOURCES_PATH).read_text(encoding="utf-8"))
35
+ COLORS = org_colors(SOURCES_RAW)
36
+
37
+ SOURCES_MD = "\n".join(
38
+ f"- [{s['name']}]({s['url']})"
39
+ for s in SOURCES_RAW
40
+ )
41
+
42
+
43
def stream_refresh(force: bool):
    """Generator backing both the initial page load and the "Refresh now" button.

    Yields ``(status_md, dataframe, logs_md, results, row_map)`` tuples so Gradio
    can update the UI incrementally. With ``force=False`` a fresh, schema/model
    matching cache short-circuits the crawl; otherwise every source is crawled
    and the cache is rewritten.
    """
    logs: list[str] = []
    results: list[dict] = []  # EventResult dicts, accumulated as orgs complete
    colors = COLORS

    def emit(status: str):
        # Re-render the table from current results and package one full UI update.
        df, row_map = results_table(results, colors)
        return status, df, "<br>\n".join(logs), results, row_map

    with bind_log(logs):  # route info()/error() output into the logs list
        info(f"refresh(force={force}, ttl_hours={TTL_HOURS})")
        info(f"model={LLM.model}")

        try:
            done = 0
            sources = [OrgSource.model_validate(s) for s in SOURCES_RAW]

            started = ts()
            cache = CacheStore(config=LLM, ttl_hours=TTL_HOURS)
            # Fast path: serve the cached results without any crawling/LLM calls.
            if not force and cache.is_usable():
                results.extend(json.loads(cache.path.read_text(encoding="utf-8"))["results"])
                info(f"used cache: {cache.path}")
                yield emit(f"Refreshed: {ts()} (started {started})")
                return
            info("cache miss, crawling")

            # Even a stale/mismatched cache is still read: its events seed the LLM
            # prompt (prev_by_org) and its interested counts are carried forward.
            old_results = json.loads(cache.path.read_text(encoding="utf-8"))["results"] if cache.path.exists() else []
            prev_by_org = {r["org_id"]: r["events"] for r in old_results}

            # Index the previous run's interested counts so they survive a refresh:
            #   old_by_exact: (org, start minute, url) -> count   (exact match)
            #   old_by_dt:    (org, start minute) -> first count seen for that slot
            #   old_by_dt_n:  (org, start minute) -> events sharing that slot
            # The datetime-only fallback is applied only when the slot was unique.
            old_by_exact: dict[tuple[str, str, str], int] = {}
            old_by_dt: dict[tuple[str, str], int] = {}
            old_by_dt_n: dict[tuple[str, str], int] = {}
            for r in old_results:
                for ev in r["events"]:
                    dt = parse_dt_utc(ev["start_time"]).isoformat(timespec="minutes")
                    url = ev["event_url"] or ""
                    old_by_exact[(r["org_id"], dt, url)] = ev["interested_count"]
                    k2 = (r["org_id"], dt)
                    try:
                        old_by_dt_n[k2] += 1
                    except KeyError:
                        old_by_dt_n[k2] = 1
                    if k2 not in old_by_dt:
                        old_by_dt[k2] = ev["interested_count"]

            fetcher = SeminarFetcher(config=LLM, now=None, max_hops=3, max_events=3, previous_events_by_org=prev_by_org)
            event_results = []
            for org in sources:
                # Progress update before each org so the UI shows the crawl advancing.
                yield emit(f"Refreshing… {done}/{len(sources)} (started {started})")
                done += 1

                for r in fetcher.fetch_next_event_for_org_stream(org):
                    for ev in r.events:
                        dt = parse_dt_utc(ev.start_time).isoformat(timespec="minutes")
                        url = ev.event_url or ""
                        k1 = (r.org_id, dt, url)
                        if k1 in old_by_exact:
                            ev.interested_count = old_by_exact[k1]
                            continue
                        # Fallback: match by datetime alone, but only if unambiguous
                        # (the event URL may have changed between crawls).
                        k2 = (r.org_id, dt)
                        if k2 in old_by_dt and old_by_dt_n[k2] == 1:
                            ev.interested_count = old_by_dt[k2]
                    event_results.append(r)
                    results.append(r.model_dump())

            cache.write(results=event_results)
            info(f"wrote cache: {cache.path}")
            yield emit(f"Refreshed: {ts()} (started {started})")
        except Exception:
            # Top-level boundary: surface the traceback in the Logs accordion
            # instead of killing the Gradio event stream.
            import traceback

            error(f"Unhandled exception:\n{traceback.format_exc()}")
            yield emit(f"Error: {ts()}")
            return
117
+
118
+
119
with gr.Blocks() as demo:
    # Page layout: header, status line, events table, then collapsible sources/logs.
    gr.Markdown("# Robotic seminars\nNext upcoming event per org.")
    status = gr.Markdown("")
    table = gr.Dataframe(
        headers=COLS,
        # Title and Organization columns carry markdown (links / colored tags).
        datatype=["str", "markdown", "str", "markdown", "str"],
        interactive=True,
        wrap=True,
    )
    # Server-side state carried between events: raw result dicts plus the
    # table-row -> (result index, event index) mapping produced by results_table.
    results_state = gr.State([])
    row_map_state = gr.State([])
    refresh_btn = gr.Button("Refresh now")
    with gr.Accordion("Sources", open=False):
        gr.Markdown(SOURCES_MD)
    with gr.Accordion("Logs", open=False):
        logs_box = gr.Markdown()

    def on_select(results: list[dict], row_map: list[tuple[int, int]], evt: gr.SelectData):
        # Cell-click handler: bumps the clicked event's "Interested" counter
        # (no-op for clicks outside that column; see src/interested.py).
        return inc_interested(evt, results, row_map, colors=COLORS, llm=LLM, ttl_hours=TTL_HOURS)

    # Initial load may serve from cache (force=False); the button always re-crawls.
    demo.load(stream_refresh, inputs=[gr.State(False)], outputs=[status, table, logs_box, results_state, row_map_state])
    refresh_btn.click(stream_refresh, inputs=[gr.State(True)], outputs=[status, table, logs_box, results_state, row_map_state])
    table.select(
        on_select,
        inputs=[results_state, row_map_state],
        outputs=[table, results_state, row_map_state],
    )
146
+
147
# HF Spaces runs Gradio apps itself; avoid mounting + running Uvicorn here.
# app = gr.mount_gradio_app(FastAPI(), demo, path="/")
if __name__ == "__main__":
    # import uvicorn
    # uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "7860")))

    # Local dev entry point: listen on all interfaces, port from PORT (default 7860).
    demo.launch(server_name="0.0.0.0", server_port=int(os.environ.get("PORT", "7860")))
requirements.txt ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ gradio==5.12.0
2
+ fastapi==0.115.6
3
+ uvicorn==0.34.0
4
+ pydantic==2.10.5
5
+ python-dateutil==2.9.0.post0
6
+ pandas==2.2.3
7
+ httpx==0.28.1
8
+ h2>=4.1.0
9
+ litellm>=1.0.0
10
+ python-dotenv>=1.0.0
11
+ beautifulsoup4>=4.12.2
12
+ pytest>=8.0.0
13
+ curl_cffi>=0.14.0
sources.json ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "id": "cmu-ri-seminar",
4
+ "name": "Carnegie Mellon University Robotics Institute Seminar Series",
5
+ "url": "https://www.ri.cmu.edu/events/",
6
+ "tags": ["robotics"]
7
+ },
8
+ {
9
+ "id": "rig",
10
+ "name": "Robotics Institute Germany (RIG) Lecture Series",
11
+ "url": "https://robotics-institute-germany.de/rig-lecture-series-weekly-online-lectures-on-robotics/",
12
+ "tags": ["robotics"]
13
+ },
14
+ {
15
+ "id": "stanford-engr319",
16
+ "name": "Stanford Robotics & Autonomous Systems Seminar (ENGR319)",
17
+ "url": "https://stanfordasl.github.io/robotics_seminar/",
18
+ "tags": ["robotics"]
19
+ },
20
+ {
21
+ "id": "utoronto-ri",
22
+ "name": "University of Toronto Robotics Institute Seminar Series",
23
+ "url": "https://robotics.utoronto.ca/seminar-series/",
24
+ "tags": ["robotics"]
25
+ },
26
+ {
27
+ "id": "eth-rvc-talks",
28
+ "name": "ETH Zürich Robotics, Vision, and Controls Talks",
29
+ "url": "https://robotics-talks.com/",
30
+ "tags": ["robotics"]
31
+ },
32
+ {
33
+ "id": "umd-mrc-seminars",
34
+ "name": "Maryland Robotics Center (UMD) Robotics Seminar Series",
35
+ "url": "https://robotics.umd.edu/events/mrc-seminars",
36
+ "tags": ["robotics"]
37
+ },
38
+ {
39
+ "id": "imperial-rl-seminar",
40
+ "name": "Imperial College London Robot Learning Seminar Series",
41
+ "url": "https://www.robot-learning.uk/seminar-series",
42
+ "tags": ["robotics"]
43
+ },
44
+ {
45
+ "id": "gatech-irim-seminar",
46
+ "name": "Georgia Tech IRIM Seminar Series",
47
+ "url": "https://research.gatech.edu/robotics/irim-seminar-series",
48
+ "tags": ["robotics"]
49
+ },
50
+ { "id": "robot-talk",
51
+ "name": "Robot Talk",
52
+ "url": "https://www.robottalk.org/latest-episodes/",
53
+ "tags": ["robotics"]
54
+ },
55
+ {
56
+ "id": "montreal-robotics",
57
+ "name": "Montréal Robotics / Mila Robot Learning Seminar",
58
+ "url": "https://montrealrobotics.ca/robotlearningseries/",
59
+ "tags": ["robotics"]
60
+ }
61
+ ]
src/__init__.py ADDED
File without changes
src/fetcher.py ADDED
@@ -0,0 +1,115 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import os
4
+ from datetime import datetime, timezone
5
+ from typing import Iterator
6
+
7
+ from urllib.parse import urlparse
8
+ from curl_cffi import requests as r # type: ignore
9
+
10
+ from dateutil import parser as dtparser
11
+
12
+ from .models import (
13
+ CacheStore,
14
+ EventResult,
15
+ LlmConfig,
16
+ LlmEvent,
17
+ OrgSource,
18
+ USER_AGENT,
19
+ llm_extract,
20
+ parse_dt_utc,
21
+ )
22
+ from .ui_log import error, info, warn
23
+
24
+
25
class SeminarFetcher:
    """Crawls each org's events page (following at most ``max_hops`` links) and
    extracts upcoming events via an LLM (``llm_extract``).

    ``previous_events_by_org`` maps org id -> prior events; they are passed to the
    LLM so previously extracted events are copied verbatim (see SYSTEM_PROMPT).
    """

    def __init__(
        self,
        config: LlmConfig,
        now: datetime | None = None,
        max_hops: int = 3,
        max_events: int = 3,
        previous_events_by_org: dict[str, list[dict]] | None = None,
    ):
        self.config = config
        self.max_hops = max_hops
        self.max_events = max_events
        self.previous_events_by_org = previous_events_by_org or {}

        # Resolve "now": explicit argument > NOW_ISO env (deterministic test runs)
        # > wall clock. Naive datetimes are assumed UTC, whatever their source.
        if now is None:
            raw = (os.environ.get("NOW_ISO") or "").strip()
            now = dtparser.isoparse(raw) if raw else datetime.now(timezone.utc)
        if now.tzinfo is None:
            now = now.replace(tzinfo=timezone.utc)
        self.now = now.astimezone(timezone.utc)

    def fetch_html(self, url: str) -> str:
        """GET ``url`` impersonating a Chrome browser and return the HTML body.

        Raises on non-2xx responses (``raise_for_status``).
        """
        p = urlparse(url)
        resp = r.get(
            url,
            timeout=20,
            allow_redirects=True,
            impersonate="chrome120",  # some event sites 403 plain HTTP clients
            headers={"User-Agent": USER_AGENT, "Accept": "text/html", "Referer": f"{p.scheme}://{p.netloc}/"},
        )
        resp.raise_for_status()
        return resp.text

    def fetch_next_event_for_org_stream(self, org: OrgSource) -> Iterator[EventResult]:
        """Yield exactly one ``EventResult`` for ``org``.

        Follows the LLM-suggested next URL for up to ``max_hops`` pages; keeps only
        strictly-future events (earliest first), capped at ``max_events``.
        """
        checked_at = datetime.now(timezone.utc).isoformat()
        now_iso = self.now.isoformat()
        url = str(org.url)
        visited: list[str] = []
        info(f"Inspecting {org.name}…")
        # Pessimistic default; overwritten below unless every hop redirects onward.
        result = EventResult(org_id=org.id, org_name=org.name, source_url=str(org.url), status="no_upcoming",
            events=[], checked_at=checked_at, hops=self.max_hops, visited_urls=visited, error="Max hops reached.",
        )

        for hop_i in range(1, self.max_hops + 1):
            visited.append(url)
            info(f"Inspecting {org.name}: fetching {url}")

            previous_events = self.previous_events_by_org.get(org.id, [])
            hop = llm_extract(
                config=self.config,
                org=org,
                url=url,
                page_html=self.fetch_html(url),
                now_iso=now_iso,
                previous_events=previous_events,
            )
            if hop.status != "ok":
                # Follow the suggested next page, but never revisit a URL;
                # otherwise record the failure and stop.
                next_url = hop.next_url_to_check
                if next_url and next_url not in visited:
                    info(f"Inspecting {org.name}: following {next_url}")
                    url = next_url
                    continue
                (error if hop.status == "error" else warn)(f"Inspecting {org.name}: {hop.status}: {hop.error}")
                result.status = hop.status
                result.hops = hop_i
                result.error = hop.error
                break

            # Keep only strictly-future events, sorted ascending, up to max_events.
            events: list[LlmEvent] = []
            parsed = [(parse_dt_utc(e.start_time), e) for e in hop.events]
            parsed.sort(key=lambda x: x[0])
            for dt, e in parsed:
                if dt <= self.now:
                    continue
                if e.event_url is None:
                    # No per-event page: fall back to the page we extracted from.
                    e = e.model_copy(update={"event_url": url})
                events.append(e)
                if len(events) >= self.max_events:
                    break

            result.status = "ok" if events else "no_upcoming"
            result.events = events
            result.hops = hop_i
            result.error = hop.error
            if events:
                info(f"Found {len(events)} upcoming event(s) for {org.name}")
            else:
                warn(f"No upcoming events found for {org.name}")
            break

        yield result

    def fetch_next_event_for_org(self, org: OrgSource) -> EventResult:
        """Convenience wrapper: return the single result from the streaming variant."""
        return next(self.fetch_next_event_for_org_stream(org))
src/interested.py ADDED
@@ -0,0 +1,67 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import gradio as gr
4
+ import pandas as pd
5
+
6
+ from .fetcher import CacheStore, EventResult, LlmConfig
7
+ from .org_colors import org_tag
8
+
9
+ COLS = ["Date/Time (UTC)", "Title", "Speaker", "Organization", "Interested"]
10
+
11
+
12
def results_table(results: list[dict], colors: dict[str, str]) -> tuple[pd.DataFrame, list[tuple[int, int]]]:
    """Flatten per-org results into a display DataFrame plus a row-index map.

    Only results with status "ok" contribute rows. The returned row map gives,
    for each displayed row, the (result index, event index) into ``results`` so
    click handlers can locate the underlying event after sorting.
    """
    records: list[dict[str, object]] = []
    for res_idx, res in enumerate(results):
        if res["status"] != "ok":
            continue  # skip no_upcoming / error orgs entirely
        for evt_idx, event in enumerate(res["events"]):
            name = event["speaker"] or ""
            inst = event["affiliation"] or ""
            if name and inst:
                who = f"{name} ({inst})"
            else:
                who = name or inst
            title_md = event["title"]
            if event["event_url"]:
                title_md = f"[{event['title']}]({event['event_url']})"
            records.append(
                {
                    "Date/Time (UTC)": event["start_time"],
                    "Title": title_md,
                    "Speaker": who,
                    "Organization": org_tag(res["org_name"], colors),
                    "Interested": f"{event['interested_count']} (+)",
                    # Hidden helper columns, dropped before display.
                    "_org": res["org_name"],
                    "_r": res_idx,
                    "_e": evt_idx,
                }
            )
    if not records:
        return pd.DataFrame(columns=COLS), []
    frame = pd.DataFrame(records)
    # Sort chronologically (unparseable dates last), tie-break by org name.
    frame["_sort"] = pd.to_datetime(frame["Date/Time (UTC)"], utc=True, errors="coerce")
    frame = frame.sort_values(by=["_sort", "_org"], na_position="last").reset_index(drop=True)
    mapping = list(zip(frame["_r"].astype(int).tolist(), frame["_e"].astype(int).tolist()))
    return frame.drop(columns=["_sort", "_org", "_r", "_e"])[COLS], mapping
40
+
41
+
42
def inc_interested(
    evt: gr.SelectData | None,
    results: list[dict],
    row_map: list[tuple[int, int]],
    *,
    colors: dict[str, str],
    llm: LlmConfig,
    ttl_hours: float,
):
    """Handle a table cell click: bump an event's interested_count when the
    "Interested" cell was hit, then re-render.

    Always returns the (df, results, row_map) triple; clicks with no selection
    data or outside the "Interested" column just re-render without mutating.
    (Previously the render-and-return tail was duplicated in three places.)
    """
    hit = (
        evt is not None
        and evt.index is not None
        and int(evt.index[1]) == COLS.index("Interested")
    )
    if hit:
        # Map the displayed (sorted) row back to the underlying result/event.
        r_i, ev_i = row_map[int(evt.index[0])]
        results[r_i]["events"][ev_i]["interested_count"] += 1
        # Persist immediately so counts survive restarts / cache reloads.
        cache = CacheStore(config=llm, ttl_hours=ttl_hours)
        cache.write(results=[EventResult.model_validate(x) for x in results])

    df, row_map = results_table(results, colors)
    return df, results, row_map
src/models.py ADDED
@@ -0,0 +1,245 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ from dataclasses import dataclass
5
+ from datetime import datetime, timezone
6
+ from pathlib import Path
7
+ from typing import Literal
8
+
9
+ import html
10
+ from urllib.parse import urljoin, urlparse
11
+
12
+ from pydantic import BaseModel, Field, HttpUrl
13
+
14
+ Status = Literal["ok", "no_upcoming", "error"]
15
+
16
+ CACHE_SCHEMA_VERSION = 4
17
+ USER_AGENT = (
18
+ "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
19
+ "(KHTML, like Gecko) Chrome/120 Safari/537.36"
20
+ )
21
+
22
+ # SYSTEM_PROMPT = (
23
+ # "You are given the page text content and a list of links from the page. "
24
+ # "Find up to the NEXT 3 upcoming talks/events AFTER now, sorted by time ascending. "
25
+ # "If the page lists only a date (no time), output start_time as an ISO-8601 DATE like '2026-01-15'. "
26
+ # "If events are listed as lines with a date (e.g. 'Jan 15, 2026 – ...'), treat each such line as an event. "
27
+ # "If there is no dedicated per-event URL, set event_url to the Source URL. "
28
+ # "If the schedule is not on this page, set next_url_to_check to the single best link to follow (one of LINKS). "
29
+ # "Do NOT invent placeholder titles like 'TBA' unless the page text explicitly contains 'TBA'. "
30
+ # "IMPORTANT: Keep fields clean and separated: "
31
+ # "- events[].title MUST be the talk/event title ONLY (no speaker name, no affiliation, no date/time). "
32
+ # "- Put the speaker/person name in events[].speaker when present in the text (e.g. 'Speaker: …', 'by …', 'Presenter: …'). "
33
+ # "- If the speaker affiliation/institution is present (e.g. 'UC Davis', 'MIT', 'Google DeepMind'), put it in events[].affiliation (do not mix it into title). "
34
+ # "- If a line contains both speaker and title (e.g. 'Jane Doe — Learning Robots' or 'Learning Robots — Jane Doe'), split them correctly. "
35
+ # "Choose the talk title that is best-supported by the page text. Give priority to explicit cues like 'Title:', 'Talk title:', 'Topic:', and text near 'Abstract:'/'Summary:'. "
36
+ # "If a header includes a series label plus a person name/affiliation (e.g. 'Seminar: Jane Doe (MIT)'), treat that as speaker/affiliation (not title) and keep searching the body for the real title. "
37
+ # "e.g. this Robotics Institute Seminar: Mahdi Tavakoli (University of Alberta) is not a title"
38
+ # "Keep person name and affiliation seprate. Each should be put in its own field. "
39
+ # "Never guess; quote evidence from the provided text."
40
+ # "Never reply with any text other than the JSON object. If you don't find any events, still reply with a JSON object containing the no_upcoming status. "
41
+ # "REMEMBER: START TIME IS ALWAYS REQUIRED FOR EACH EVENT. IF THERE IS NO START TIME, DO NOT INCLUDE THE EVENT. "
42
+
43
+ # "Respond in JSON with the following schema: "
44
+
45
+ # "status: ok | no_upcoming | error"
46
+ # "events: array of up to 3 objects (required if ok)"
47
+ # "events[].title: string"
48
+ # "events[].start_time: ISO-8601 string; if only date is known, use YYYY-MM-DD (always required)"
49
+ # "events[].event_url: string URL"
50
+ # "events[].speaker: string "
51
+ # "events[].affiliation: string (if prersent)"
52
+ # "events[].evidence: short snippet from provided text (always required)"
53
+ # "error: short error string (required if error)"
54
+ # "next_url_to_check: string URL (optional - must be one of LINKS if provided)"
55
+ # )
56
+
57
+ SYSTEM_PROMPT = """
58
+ You are a JSON extraction engine. You do NOT write code.
59
+
60
+ CRITICAL OUTPUT CONSTRAINTS (HARD):
61
+ - Your entire reply MUST be valid JSON (RFC 8259).
62
+ - Reply with exactly ONE JSON object.
63
+ - The first non-whitespace character MUST be "{" and the last MUST be "}".
64
+ - Use double quotes for all JSON strings. Never use single quotes.
65
+ - Do NOT include markdown fences (```), explanations, pseudocode, or Python.
66
+
67
+ If you cannot follow these constraints, reply exactly:
68
+ {"status":"error","error":"non_json_or_invalid_schema"}
69
+
70
+ Event extraction rules (HARD):
71
+ - Return up to 3 upcoming events after "now", sorted by start_time ascending.
72
+ - Every event MUST include start_time. If you cannot find a date/time in PAGE_TEXT for an event, DO NOT include that event.
73
+ - If you find zero events with a date/time, return {"status":"no_upcoming"}.
74
+
75
+ Title rule (HARD):
76
+ - title MUST be copied verbatim from PAGE_TEXT (no paraphrasing).
77
+ - title MUST come from the same local event block as the date/time:
78
+ - it must appear within 300 characters of the date/time text you used for start_time.
79
+ - Do NOT use site/series/page headings or navigation as title.
80
+ Examples of INVALID titles: "Seminar Series", "Events", "Robotics", "University of Toronto", page header text.
81
+ - If you cannot find a specific talk/topic title near the date/time, DO NOT include the event.
82
+
83
+ If PREVIOUS_EVENTS is provided:
84
+ - Use PREVIOUS_EVENTS as a strict copy source.
85
+ - If PAGE_TEXT contains an event that matches one in PREVIOUS_EVENTS, you MUST include that event in your output and you MUST copy the entire event object exactly from PREVIOUS_EVENTS.
86
+ - Do NOT omit a matched event. Do NOT say it is "already known". Do NOT reduce the number of returned events because PREVIOUS_EVENTS were provided.
87
+ - If PREVIOUS_EVENTS contains events that are NOT present in PAGE_TEXT, ignore them.
88
+
89
+ Final self-check (HARD, perform before replying):
90
+ - Your reply must be valid JSON only.
91
+ - For each event: verify start_time exists and is a non-empty string.
92
+ - If an event has missing/empty start_time, REMOVE that event.
93
+ - If no events remain, output {"status":"no_upcoming"}.
94
+
95
+ Schema:
96
+ { "status": "ok"|"no_upcoming"|"error",
97
+ "events": [{"title": "...", "start_time": "...", "event_url": "...", "speaker": "", "affiliation": null, "interested_count": 0}],
98
+ "error": "...",
99
+ "next_url_to_check": "..." }
100
+
101
+ """
102
+
103
+
104
class OrgSource(BaseModel):
    """One entry of sources.json: an organization whose events page is crawled."""

    id: str  # stable slug; used as the cache / carry-over join key
    name: str
    url: HttpUrl  # starting page for the crawl
    tags: list[str] = Field(default_factory=list)
109
+
110
+
111
class LlmEvent(BaseModel):
    """A single seminar/talk extracted by the LLM (mirrors the schema in SYSTEM_PROMPT)."""

    title: str  # talk title, copied verbatim from the page text per the prompt's title rule
    start_time: str  # date/time string as found on the page; required — events without it are dropped
    event_url: str | None = None  # link to the event's detail page, if any
    speaker: str  # speaker name ("" when unknown, per the prompt schema)
    affiliation: str | None = None  # speaker's institution, if stated on the page
    interested_count: int = 0  # count of "interested" marks (starts at 0)
118
+
119
+
120
class LlmHopResult(BaseModel):
    """Parsed LLM reply for one crawl hop: a status, extracted events, and an optional next URL."""

    status: Status  # "ok" | "no_upcoming" | "error" (see Schema in SYSTEM_PROMPT)
    events: list[LlmEvent] = Field(default_factory=list)  # events found on this page
    error: str | None = None  # error detail when status is "error"
    next_url_to_check: str | None = None  # same-site URL the LLM suggests crawling next
125
+
126
+
127
class EventResult(BaseModel):
    """Final per-organization crawl outcome; serialized into the cache file by CacheStore."""

    org_id: str  # OrgSource.id of the crawled organization
    org_name: str  # OrgSource.name, for display
    source_url: str  # starting URL that was crawled
    status: Status  # overall outcome for this org
    events: list[LlmEvent] = Field(default_factory=list)  # extracted events kept for display
    checked_at: str  # timestamp of the crawl (volatile — stripped by the golden-test normalizer)
    hops: int = 0  # number of crawl hops performed
    visited_urls: list[str] = Field(default_factory=list)  # every URL fetched during the crawl
    error: str | None = None  # failure detail when status is "error"
137
+
138
+
139
@dataclass
class LlmConfig:
    """LLM connection settings; forwarded to litellm's completion() by llm_extract."""

    model: str  # litellm model identifier (also recorded in cache meta for compatibility checks)
    api_key: str | None = None  # optional API key override passed to completion()
    api_base: str | None = None  # optional custom API base URL passed to completion()
144
+
145
+
146
class CacheStore:
    """Disk cache of crawl results with a freshness TTL and schema/model compatibility checks."""

    def __init__(self, config: LlmConfig, ttl_hours: float, path: Path = Path("/data/robotic_seminars/events.json")):
        self.config = config  # model identity participates in cache validity
        self.ttl_hours = ttl_hours  # freshness window in hours
        self.path = path  # cache file location

    def is_fresh(self) -> bool:
        """Return True if the cache file exists and its mtime is within the TTL window."""
        if not self.path.exists():
            return False
        mtime = datetime.fromtimestamp(self.path.stat().st_mtime, tz=timezone.utc)
        return (datetime.now(timezone.utc) - mtime).total_seconds() < self.ttl_hours * 3600

    def is_usable(self) -> bool:
        """Return True if the cache is fresh AND was written with this schema version and model.

        A corrupt/truncated cache file, an unreadable file, or one missing its
        meta keys is treated as unusable rather than raising (previously this
        crashed with JSONDecodeError/KeyError).
        """
        if not self.is_fresh():
            return False
        try:
            meta = json.loads(self.path.read_text(encoding="utf-8")).get("meta") or {}
        except (OSError, json.JSONDecodeError):
            return False
        return meta.get("schema_version") == CACHE_SCHEMA_VERSION and meta.get("model") == self.config.model

    def write(self, *, results: list[EventResult]) -> None:
        """Persist *results* plus cache metadata (model, schema version, timestamp, TTL) as JSON."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        payload = {
            "meta": {
                "model": self.config.model,
                "schema_version": CACHE_SCHEMA_VERSION,
                "cached_at": datetime.now(timezone.utc).isoformat(),
                "ttl_hours": self.ttl_hours,
            },
            "results": [r.model_dump() for r in results],
        }
        self.path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
176
+
177
+
178
def parse_dt_utc(value: str) -> datetime:
    """Parse a free-form date/time string into an aware UTC datetime.

    Naive inputs (no timezone in the text) are assumed to already be in UTC.
    """
    from dateutil import parser as dtparser

    parsed = dtparser.parse(value)
    if parsed.tzinfo is not None:
        return parsed.astimezone(timezone.utc)
    return parsed.replace(tzinfo=timezone.utc)
185
+
186
+
187
def text_and_links(page_html: str, *, base_url: str, limit: int = 40) -> tuple[str, list[str]]:
    """Extract readable text and up to *limit* unique same-domain links from an HTML page.

    Returns (text, links): text is the page's visible text (scripts/styles/noscript
    removed, blank lines dropped, truncated to 24k chars); links are absolute
    http(s) URLs on the same host as *base_url*, deduplicated in document order.
    """
    from bs4 import BeautifulSoup

    # NOTE(review): unescaping before parsing can turn escaped markup in text
    # (e.g. "&lt;script&gt;") into real tags; kept for compatibility — confirm intent.
    soup = BeautifulSoup(html.unescape(page_html), "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = "\n".join(ln.strip() for ln in soup.get_text("\n").splitlines() if ln.strip())[:24000]

    base_dom = urlparse(base_url).netloc.lower()
    links: list[str] = []
    seen: set[str] = set()  # dedupe: repeated nav/menu links must not exhaust the limit
    for a in soup.find_all("a", href=True):
        u = urljoin(base_url, str(a["href"]).strip())
        p = urlparse(u)
        if p.scheme in {"http", "https"} and p.netloc.lower() == base_dom and u not in seen:
            seen.add(u)
            links.append(u)
            if len(links) >= limit:
                break

    return text, links
206
+
207
+
208
def llm_extract(*, config: LlmConfig, org: OrgSource, url: str, page_html: str, now_iso: str, previous_events: list[dict]) -> LlmHopResult:
    """Run one LLM extraction hop over *page_html* and return the validated result.

    Builds the prompt (org context, same-site links, page text, PREVIOUS_EVENTS),
    calls litellm, extracts the outermost JSON object from the reply, and
    validates it against LlmHopResult.

    Raises:
        ValueError: if the reply contains no JSON object, invalid JSON, or JSON
            that does not match the LlmHopResult schema.
    """
    from litellm import completion  # type: ignore

    page_text, links = text_and_links(page_html, base_url=url)
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": json.dumps({"org": org.name, "now": now_iso, "source_url": url, "PREVIOUS_EVENTS": previous_events})},
        {"role": "user", "content": "LINKS:\n" + "\n".join(links)},
        {"role": "user", "content": "PAGE_TEXT_BEGIN\n" + page_text + "\nPAGE_TEXT_END"},
        {"role": "user", "content": "Return ONLY one JSON object (no markdown, no code). "
                                    "Must start with '{' and end with '}'. Use double quotes only. "
                                    "Before returning, delete any event missing/empty start_time. "
                                    "Title must be copied verbatim from PAGE_TEXT near the date/time. "
                                    "IMPORTANT: PREVIOUS_EVENTS are NOT a reason to omit events. If PAGE_TEXT contains an event that matches PREVIOUS_EVENTS, you MUST re-output it by copying the entire event object exactly from PREVIOUS_EVENTS. "
                                    "If none remain, return {\"status\":\"no_upcoming\"}."}
    ]

    kwargs: dict[str, object] = {"model": config.model, "temperature": 0}
    if config.api_key:
        kwargs["api_key"] = config.api_key
    if config.api_base:
        kwargs["api_base"] = config.api_base

    content = completion(messages=messages, **kwargs)["choices"][0]["message"]["content"]

    # Extract the outermost {...} span. This tolerates markdown code fences
    # without mangling the payload (the previous `.replace('json', '')` deleted
    # the substring "json" anywhere in the reply, corrupting titles and URLs).
    start = content.find("{")
    end = content.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"LLM reply contains no JSON object:\n{content}")
    raw = content[start : end + 1]

    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM did not return valid JSON:\n{raw}") from e

    # Validate exactly once (was validated twice before) and chain the cause.
    try:
        return LlmHopResult.model_validate(data)
    except Exception as e:
        raise ValueError(f"LLM returned JSON that does not match schema: {e}\n{data}") from e
src/org_colors.py ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from html import escape
4
+
5
# Fixed palette of visually distinct hex colors, assigned to orgs round-robin.
PALETTE = [
    "#e6194b",
    "#3cb44b",
    "#4363d8",
    "#f58231",
    "#911eb4",
    "#46f0f0",
    "#f032e6",
    "#bcf60c",
    "#fabebe",
    "#008080",
    "#e6beff",
    "#9a6324",
]


def org_colors(sources: list[dict]) -> dict[str, str]:
    """Map each source's display name to a stable color, cycling through PALETTE."""
    assignment: dict[str, str] = {}
    for index, source in enumerate(sources):
        assignment[source["name"]] = PALETTE[index % len(PALETTE)]
    return assignment
23
+
24
+
25
def org_tag(org: str, colors: dict[str, str]) -> str:
    """Render an org name as a colored HTML span; the name is HTML-escaped."""
    color = colors[org]
    safe_name = escape(org)
    return f'<span style="color:{color}">{safe_name}</span>'
src/ui_log.py ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from contextlib import contextmanager
4
+ from datetime import datetime, timezone
5
+
6
+ _ACTIVE_LINES: list[str] = []
7
+
8
+
9
+ def ts() -> str:
10
+ return datetime.now(timezone.utc).isoformat(timespec="seconds")
11
+
12
+
13
+ def log(level: str, msg: str) -> None:
14
+ html = {
15
+ "INFO": "dodgerblue",
16
+ "WARN": "goldenrod",
17
+ "ERR": "crimson",
18
+ }
19
+ c = html.get(level)
20
+ tag = f"<span style=\"color:{c}\">{level}</span>" if c else level
21
+ _ACTIVE_LINES.append(f"[{ts()}] {tag}: {msg}")
22
+
23
+
24
def info(msg: str) -> None:
    """Append *msg* to the active log buffer at INFO level."""
    log("INFO", msg)


def warn(msg: str) -> None:
    """Append *msg* to the active log buffer at WARN level."""
    log("WARN", msg)


def error(msg: str) -> None:
    """Append *msg* to the active log buffer at ERR level."""
    log("ERR", msg)
34
+
35
+
36
@contextmanager
def bind(lines: list[str]):
    """Temporarily route module logging into *lines*; the previous buffer is restored on exit."""
    global _ACTIVE_LINES
    saved = _ACTIVE_LINES
    _ACTIVE_LINES = lines
    try:
        yield
    finally:
        # Restore even if the body raised, so sessions never leak into each other.
        _ACTIVE_LINES = saved
tests/__init__.py ADDED
File without changes
tests/test_golden.py ADDED
@@ -0,0 +1,46 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json, os, shutil, tempfile
2
+ from pathlib import Path
3
+
4
ROOT = Path(__file__).resolve().parents[1]  # repository root (tests/ is one level down)
GOLDEN = ROOT / ".data" / "events.json"  # committed golden snapshot of crawl results
6
+
7
+
8
+ def _load(p: Path) -> dict: return json.loads(p.read_text(encoding="utf-8"))
9
+
10
+
11
+ def _norm(d: dict) -> dict:
12
+ d = json.loads(json.dumps(d)); (d.get("meta") or {}).pop("cached_at", None)
13
+ for r in d.get("results") or []:
14
+ if isinstance(r, dict): r.pop("checked_at", None)
15
+ return d
16
+
17
+
18
def test_cached_results_match_golden():
    """The committed golden cache must be usable by CacheStore and round-trip intact."""
    golden = _load(GOLDEN)
    model = (golden.get("meta") or {}).get("model")
    assert isinstance(model, str)
    with tempfile.TemporaryDirectory() as td:
        data_dir = Path(td)
        shutil.copy2(GOLDEN, data_dir / "events.json")
        from src.fetcher import CacheStore, LlmConfig

        cache = CacheStore(config=LlmConfig(model=model), ttl_hours=9999.0, path=data_dir / "events.json")
        assert cache.is_usable()
        got = _load(cache.path)
        assert got["results"] == golden["results"]
27
+
28
+
29
def test_live_crawl_matches_golden_snapshot():
    """Opt-in live test: re-crawl every source and compare against the golden snapshot."""
    if os.environ.get("RUN_LIVE_TESTS") != "1":
        return  # requires network + LLM credentials; skipped by default
    golden = _load(GOLDEN)
    model = (golden.get("meta") or {}).get("model")
    assert isinstance(model, str)
    with tempfile.TemporaryDirectory() as td:
        data_dir = Path(td)
        # Pin the LLM determinism knobs so the crawl is reproducible.
        os.environ.update({"LLM_TEMPERATURE": "0", "LLM_SEED": "1", "LITELLM_FALLBACK_MODELS": ""})
        cached_at = (golden.get("meta") or {}).get("cached_at")
        if isinstance(cached_at, str) and cached_at.strip():
            os.environ["NOW_ISO"] = cached_at.strip()
        from src.fetcher import CacheStore, LlmConfig, OrgSource, SeminarFetcher

        cfg = LlmConfig(model=model)
        raw_sources = json.loads((ROOT / "sources.json").read_text(encoding="utf-8"))
        sources = [OrgSource.model_validate(s) for s in raw_sources]
        fetcher = SeminarFetcher(config=cfg, now=None, max_hops=3, max_events=3)
        results = [fetcher.fetch_next_event_for_org(o) for o in sources]
        ttl = float((golden.get("meta") or {}).get("ttl_hours") or 12.0)
        cache = CacheStore(config=cfg, ttl_hours=ttl, path=data_dir / "events.json")
        cache.write(results=results)
        produced = _load(cache.path)
        assert _norm(produced) == _norm(golden)