agrim12345 committed on
Commit 2db58d0 · 1 Parent(s): b903d41

Deploy deployed-meet Gradio app

.dockerignore ADDED
@@ -0,0 +1,12 @@
+ .git
+ .gitignore
+ .venv
+ __pycache__
+ *.pyc
+ *.pyo
+ *.pyd
+ *.log
+ out*
+ tmp
+ runs
+
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ pipelines/models/*.pt filter=lfs diff=lfs merge=lfs -text
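After the weights are committed, it is worth confirming that they are actually stored through LFS rather than as raw git blobs. A quick check from the repo root (standard Git LFS subcommands; the `.pt` pattern is the rule added above):

```powershell
git lfs track        # lists tracked patterns; should include pipelines/models/*.pt
git lfs ls-files     # lists files actually stored via LFS once they are committed
```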
.gitignore ADDED
@@ -0,0 +1,7 @@
+ __pycache__/
+ *.pyc
+ .env
+ .venv/
+ out*/
+ tmp/
+
Dockerfile ADDED
@@ -0,0 +1,30 @@
+ FROM python:3.10-slim
+
+ ENV DEBIAN_FRONTEND=noninteractive \
+     PIP_NO_CACHE_DIR=1 \
+     PYTHONUNBUFFERED=1 \
+     PYTHONIOENCODING=utf-8 \
+     PIPELINE_WORKDIR=/data/deployed-meet-runs
+
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     git \
+     ffmpeg \
+     curl \
+     libgl1 \
+     libglib2.0-0 \
+     libgomp1 \
+     && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+ COPY . /app
+
+ RUN pip install --upgrade pip setuptools wheel && \
+     pip install -r requirements.txt && \
+     pip install --no-build-isolation "git+https://github.com/openai/CLIP.git@dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1"
+
+ RUN mkdir -p /data/deployed-meet-runs
+
+ EXPOSE 7860
+
+ CMD ["python", "-m", "uvicorn", "api.index:app", "--host", "0.0.0.0", "--port", "7860"]
+
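Before pushing, the image can be exercised locally. A minimal build-and-run sketch; the image tag `deployed-meet` and the key values are placeholders:

```powershell
docker build -t deployed-meet .
docker run --rm -p 7860:7860 `
  -e GEMINI_API_KEY=your-gemini-key `
  -e DEEPGRAM_API_KEY=your-deepgram-key `
  deployed-meet
```

The container serves the legacy FastAPI app (`api/index.py`) via uvicorn on port 7860, matching the `EXPOSE` and `CMD` lines above.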
HF_SPACE_DEPLOY.md ADDED
@@ -0,0 +1,65 @@
+ # Deploy to Hugging Face Spaces (Gradio SDK)
+
+ This package is now Gradio-native (`app.py`) and does not require Docker on Spaces.
+ The `demo-code` variant is configured for demo-only Gemini:
+ - Gemini + YOLO run only on `demo` frames.
+ - `slides`/`code`/`none` frames are built from OCR + transcript output.
+
+ ## 1) Create the Space
+ 1. Go to `https://huggingface.co/new-space`.
+ 2. Choose:
+    - SDK: `Gradio`
+    - Space name: your choice (for example `deployed-meet`)
+    - Visibility: your choice
+ 3. Click **Create Space**.
+
+ ## 2) Clone the Space repo
+ ```powershell
+ git clone https://huggingface.co/spaces/<YOUR_USER>/<YOUR_SPACE_NAME> hf-space-deployed-meet
+ cd hf-space-deployed-meet
+ ```
+
+ ## 3) Copy this folder into the Space repo
+ Copy everything from the local `deployed-meet/` folder into the root of the cloned Space repo.
+
+ Required root files after the copy:
+ - `app.py`
+ - `run_manager.py`
+ - `requirements.txt`
+ - `README.md`
+ - `pipelines/...`
+
+ ## 4) Track model weights with Git LFS
+ ```powershell
+ git lfs install
+ git lfs track "pipelines/models/*.pt"
+ git add .gitattributes
+ ```
+
+ ## 5) Add secrets in Space Settings
+ In **Settings -> Variables and secrets**, add:
+ - `GEMINI_API_KEY`
+ - `DEEPGRAM_API_KEY`
+
+ Optional (see the sketch below for how these are read at runtime):
+ - `PIPELINE_WORKDIR=/data/deployed-meet-runs`
+ - `YOLO_DEVICE=cpu` (if your Space has no GPU)
+ - `OCR_GPU=false` (if your Space has no GPU)
+
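These settings reach the app as plain environment variables at runtime. A minimal sketch of reading them from Python, assuming the names above (the fallback values shown are illustrative, not necessarily what the pipeline uses internally):

```python
import os

# Required secrets: fail fast with a KeyError if they are missing.
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]

# Optional knobs with CPU-safe fallbacks (defaults here are assumptions).
PIPELINE_WORKDIR = os.getenv("PIPELINE_WORKDIR", "/data/deployed-meet-runs")
YOLO_DEVICE = os.getenv("YOLO_DEVICE", "cpu")
OCR_GPU = os.getenv("OCR_GPU", "false").lower() == "true"
```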
+ ## 6) Commit and push
+ ```powershell
+ git add .
+ git commit -m "Deploy deployed-meet Gradio app"
+ git push
+ ```
+
+ Wait for the build to complete.
+
+ ## 7) Open the app and run
+ - App URL: `https://<YOUR_USER>-<YOUR_SPACE_NAME>.hf.space`
+ - Start from the **Start Run** tab, then monitor from the **Track Run** tab.
+
+ ## ZeroGPU note
+ - ZeroGPU only works with Gradio Spaces, which this repo now uses.
+ - This pipeline is long-running and model-heavy, so ZeroGPU sessions may be unstable for long videos.
+ - For reliable long jobs, upgraded CPU hardware or a dedicated GPU Space is recommended.
README.md CHANGED
@@ -1,12 +1,50 @@
  ---
- title: Deployed Meet
- emoji: 🚀
- colorFrom: red
- colorTo: yellow
+ title: deployed-meet
  sdk: gradio
- sdk_version: 6.5.1
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # deployed-meet
+
+ Gradio-based deployment package for the meeting pipeline.
+
+ ## Pipeline variants
+ - `full`:
+   - Gemini is called for all keyframe types (`slides`, `code`, `demo`, `none` as applicable).
+ - `demo-code`:
+   - Gemini is called only for `demo` keyframes.
+   - `slides`/`code`/`none` are built from OCR + transcript.
+   - `smart_keyframes_and_classify.py` runs with `--no-yolo-for-non-demo` in this variant.
+
+ ## Run locally (Gradio)
+ ```powershell
+ cd deployed-meet
+ C:/meet-agent/.venv/Scripts/activate
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ Open: `http://127.0.0.1:7860`
+
+ ## How to use the UI
+ 1. Go to **Start Run**.
+ 2. Select a variant (`full` or `demo-code`).
+ 3. Choose an input mode (`Upload File` or `Video URL`).
+ 4. Click **Start Pipeline** and copy the generated `run_id`.
+ 5. Go to **Track Run**, paste the `run_id`, then use:
+    - **Refresh Status + Logs**
+    - **Watch Live**
+    - **Fetch Final Output**
+    - **Fetch Condensed Output**
+
+ ## Required environment variables
+ Set these before starting:
+ - `GEMINI_API_KEY`
+ - `DEEPGRAM_API_KEY`
+
+ Optional:
+ - `PIPELINE_WORKDIR` (defaults to a temp directory)
+
+ ## Legacy FastAPI
+ The original FastAPI code is still in `api/index.py`, but Hugging Face Gradio Spaces will run `app.py`.
api/index.py ADDED
@@ -0,0 +1,672 @@
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ import subprocess
+ import sys
+ import tempfile
+ import threading
+ import time
+ import uuid
+ from html import unescape
+ from pathlib import Path
+ from typing import Any, Dict, Optional
+ from urllib.parse import parse_qs, urljoin, urlparse
+
+ import httpx
+ from fastapi import FastAPI, HTTPException
+ from fastapi.responses import JSONResponse, PlainTextResponse
+ from pydantic import BaseModel, Field, HttpUrl
+
+
+ BASE_DIR = Path(__file__).resolve().parents[1]
+ PIPELINES_DIR = BASE_DIR / "pipelines"
+ DEFAULT_WORKDIR = Path(os.getenv("PIPELINE_WORKDIR", tempfile.gettempdir())) / "deployed-meet-runs"
+ DEFAULT_WORKDIR.mkdir(parents=True, exist_ok=True)
+ RUNS_DIR = DEFAULT_WORKDIR / "runs"
+ RUNS_DIR.mkdir(parents=True, exist_ok=True)
+
+
+ class PipelineRequest(BaseModel):
+     video_path: Optional[str] = Field(default=None, description="Absolute or server-local path to input video.")
+     video_url: Optional[HttpUrl] = Field(default=None, description="Optional URL to download input video from.")
+     out_dir: Optional[str] = Field(default=None, description="Optional output directory. Defaults to /tmp run folder.")
+
+     deepgram_model: str = "nova-3"
+     deepgram_language: Optional[str] = None
+     deepgram_request_timeout_sec: float = 1200.0
+     deepgram_connect_timeout_sec: float = 30.0
+     deepgram_retries: int = 3
+     deepgram_retry_backoff_sec: float = 2.0
+     force_deepgram: bool = False
+
+     force_keyframes: bool = False
+     pre_roll_sec: float = 3.0
+     gemini_model: str = "gemini-2.5-flash"
+     similarity_threshold: float = 0.82
+     temperature: float = 0.2
+     python_bin: Optional[str] = Field(
+         default=None,
+         description="Optional Python executable path for running pipeline subprocesses.",
+     )
+     log_heartbeat_sec: float = Field(
+         default=10.0,
+         description="Seconds between heartbeat progress lines written to run logs.",
+     )
+
+
+ app = FastAPI(title="deployed-meet", version="1.0.0")
+
+
+ def _tail(text: str, max_lines: int = 220) -> str:
+     lines = (text or "").splitlines()
+     if len(lines) <= max_lines:
+         return "\n".join(lines)
+     return "\n".join(lines[-max_lines:])
+
+
+ def _run_dir(run_id: str) -> Path:
+     return RUNS_DIR / run_id
+
+
+ def _meta_path(run_id: str) -> Path:
+     return _run_dir(run_id) / "run_meta.json"
+
+
+ def _logs_path(run_id: str) -> Path:
+     return _run_dir(run_id) / "pipeline.log"
+
+
+ def _write_json(path: Path, data: Dict[str, Any]) -> None:
+     path.parent.mkdir(parents=True, exist_ok=True)
+     tmp = path.with_suffix(path.suffix + ".tmp")
+     tmp.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
+     tmp.replace(path)
+
+
+ def _read_json(path: Path) -> Dict[str, Any]:
+     return json.loads(path.read_text(encoding="utf-8"))
+
+
+ def _get_meta_or_404(run_id: str) -> Dict[str, Any]:
+     p = _meta_path(run_id)
+     if not p.exists():
+         raise HTTPException(status_code=404, detail=f"Unknown run_id: {run_id}")
+     try:
+         return _read_json(p)
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Failed to read run metadata: {type(e).__name__}: {e}") from e
+
+
+ def _resolve_video_input(req: PipelineRequest, run_id: str, run_dir: Path) -> Path:
+     if req.video_path:
+         p = Path(req.video_path).expanduser().resolve()
+         if not p.exists():
+             raise HTTPException(status_code=400, detail=f"video_path does not exist: {p}")
+         return p
+
+     if req.video_url:
+         suffix = Path(str(req.video_url)).suffix or ".mp4"
+         local = run_dir / f"input_{run_id}{suffix}"
+         try:
+             url = str(req.video_url)
+             if _extract_gdrive_file_id(url):
+                 _download_google_drive(url, local)
+             else:
+                 with httpx.stream("GET", url, timeout=120.0, follow_redirects=True) as r:
+                     r.raise_for_status()
+                     with open(local, "wb") as f:
+                         for chunk in r.iter_bytes():
+                             f.write(chunk)
+         except HTTPException:
+             raise
+         except Exception as e:
+             raise HTTPException(status_code=400, detail=f"Failed to download video_url: {type(e).__name__}: {e}") from e
+         return local
+
+     raise HTTPException(status_code=400, detail="Provide one of: video_path or video_url.")
+
+
+ def _extract_gdrive_file_id(url: str) -> Optional[str]:
+     parsed = urlparse(url)
+     host = (parsed.netloc or "").lower()
+     if "drive.google.com" not in host:
+         return None
+
+     m = re.search(r"/file/d/([a-zA-Z0-9_-]+)", parsed.path or "")
+     if m:
+         return m.group(1)
+
+     qs = parse_qs(parsed.query or "")
+     if "id" in qs and qs["id"]:
+         return qs["id"][0]
+
+     return None
+
+
+ def _download_google_drive(url: str, out_path: Path) -> None:
+     file_id = _extract_gdrive_file_id(url)
+     if not file_id:
+         raise HTTPException(status_code=400, detail="Could not parse Google Drive file id from video_url.")
+
+     direct_url = f"https://drive.google.com/uc?export=download&id={file_id}"
+
+     def _is_html_response(resp: httpx.Response) -> bool:
+         ctype = (resp.headers.get("content-type") or "").lower()
+         if "html" in ctype or "text/plain" in ctype:
+             return True
+         head = (resp.content[:256] or b"").lower()
+         return b"<html" in head or b"<!doctype html" in head
+
+     def _write_if_file(resp: httpx.Response) -> bool:
+         if _is_html_response(resp):
+             return False
+         if not resp.content or len(resp.content) < 1024:
+             return False
+         out_path.write_bytes(resp.content)
+         return True
+
+     try:
+         with httpx.Client(timeout=120.0, follow_redirects=True) as client:
+             # Try a couple of direct download endpoints first.
+             candidates = [
+                 direct_url,
+                 f"https://drive.usercontent.google.com/download?id={file_id}&export=download&confirm=t",
+             ]
+
+             for c in candidates:
+                 rr = client.get(c)
+                 rr.raise_for_status()
+                 if _write_if_file(rr):
+                     return
+
+             # Parse Drive HTML interstitial page and submit download form if present.
+             page = client.get(f"https://drive.google.com/file/d/{file_id}/view")
+             page.raise_for_status()
+             html = page.text or ""
+
+             # Pattern A: explicit download form.
+             form_action_match = re.search(r'id="download-form"[^>]*action="([^"]+)"', html)
+             if form_action_match:
+                 action = unescape(form_action_match.group(1))
+                 action_url = urljoin("https://drive.google.com", action)
+                 params = {k: v for k, v in re.findall(r'<input[^>]+name="([^"]+)"[^>]+value="([^"]*)"', html)}
+                 form_resp = client.get(action_url, params=params)
+                 form_resp.raise_for_status()
+                 if _write_if_file(form_resp):
+                     return
+
+             # Pattern B: direct download link in page HTML.
+             link_match = re.search(r'href="(/uc\?export=download[^"]+)"', html)
+             if link_match:
+                 href = unescape(link_match.group(1)).replace("&amp;", "&")
+                 link_url = urljoin("https://drive.google.com", href)
+                 link_resp = client.get(link_url)
+                 link_resp.raise_for_status()
+                 if _write_if_file(link_resp):
+                     return
+
+             # Pattern C: download_warning cookie + confirm token flow.
+             cookie_confirm = None
+             for k, v in page.cookies.items():
+                 if str(k).startswith("download_warning"):
+                     cookie_confirm = v
+                     break
+             if cookie_confirm:
+                 confirm_url = f"https://drive.google.com/uc?export=download&confirm={cookie_confirm}&id={file_id}"
+                 confirm_resp = client.get(confirm_url)
+                 confirm_resp.raise_for_status()
+                 if _write_if_file(confirm_resp):
+                     return
+
+             msg = "Google Drive link did not provide a downloadable file."
+             low = html.lower()
+             if "you need access" in low or "request access" in low:
+                 msg += " File is not publicly accessible."
+             elif "quota exceeded" in low or "too many users have viewed or downloaded" in low:
+                 msg += " File appears to be quota-limited by Google Drive."
+             else:
+                 msg += " Use a publicly accessible direct file link or local video_path."
+             raise HTTPException(status_code=400, detail=msg)
+     except HTTPException:
+         raise
+     except Exception as e:
+         raise HTTPException(status_code=400, detail=f"Failed to download Google Drive file: {type(e).__name__}: {e}") from e
+
+
+ def _validate_video_file(path: Path) -> None:
+     if not path.exists() or not path.is_file():
+         raise HTTPException(status_code=400, detail=f"Input video file not found: {path}")
+
+     size = path.stat().st_size
+     if size < 1024:
+         raise HTTPException(status_code=400, detail=f"Input file is too small to be valid media: {path} ({size} bytes)")
+
+     # Common case for bad video_url: downloaded HTML/JSON page saved as .mp4.
+     try:
+         # Read only the first 4 KiB; reading the whole file would pull large videos into memory.
+         with open(path, "rb") as fh:
+             head = fh.read(4096).lower()
+         if b"<html" in head or b"<!doctype html" in head or b"{\"error\"" in head:
+             raise HTTPException(
+                 status_code=400,
+                 detail=(
+                     "Downloaded input is not a media file (looks like HTML/JSON response). "
+                     "Use a direct video file URL or provide video_path."
+                 ),
+             )
+     except HTTPException:
+         raise
+     except Exception:
+         pass
+
+     # Lightweight decode check.
+     try:
+         import cv2  # local import to avoid import cost at startup
+
+         cap = cv2.VideoCapture(str(path))
+         ok = cap.isOpened()
+         frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
+         cap.release()
+         if (not ok) or frame_count <= 0:
+             raise HTTPException(
+                 status_code=400,
+                 detail=(
+                     "Input file is not a decodable video for this runtime. "
+                     "Provide a valid MP4 (H.264/AAC recommended) or use a direct media URL."
+                 ),
+             )
+     except HTTPException:
+         raise
+     except Exception:
+         # If cv2 probing fails unexpectedly, let the pipeline attempt processing rather than hard-fail.
+         pass
+
+
+ def _resolve_python_executable(req: PipelineRequest) -> str:
+     if req.python_bin:
+         p = Path(req.python_bin).expanduser()
+         if not p.exists():
+             raise HTTPException(status_code=400, detail=f"python_bin does not exist: {p}")
+         return str(p.resolve())
+
+     # Prefer project virtualenv if available.
+     candidates = [
+         BASE_DIR.parent / ".venv" / "Scripts" / "python.exe",  # Windows, repo root venv
+         BASE_DIR / ".venv" / "Scripts" / "python.exe",  # Windows, deployed-meet local venv
+         BASE_DIR.parent / ".venv" / "bin" / "python",  # Unix, repo root venv
+         BASE_DIR / ".venv" / "bin" / "python",  # Unix, deployed-meet local venv
+     ]
+     for c in candidates:
+         if c.exists():
+             return str(c.resolve())
+
+     # Fallback to currently running interpreter.
+     return sys.executable or os.getenv("PYTHON_BIN") or "python"
+
+
+ def _resolve_out_dir(req: PipelineRequest, run_id: str) -> Path:
+     if req.out_dir:
+         p = Path(req.out_dir)
+         if not p.is_absolute():
+             p = DEFAULT_WORKDIR / p
+     else:
+         p = DEFAULT_WORKDIR / f"run_{run_id}"
+     p.mkdir(parents=True, exist_ok=True)
+     return p.resolve()
+
+
+ def _build_common_args(req: PipelineRequest, video_path: Path, out_dir: Path) -> list[str]:
+     args = [
+         "--video",
+         str(video_path),
+         "--out",
+         str(out_dir),
+         "--deepgram-model",
+         req.deepgram_model,
+         "--deepgram-request-timeout-sec",
+         str(req.deepgram_request_timeout_sec),
+         "--deepgram-connect-timeout-sec",
+         str(req.deepgram_connect_timeout_sec),
+         "--deepgram-retries",
+         str(req.deepgram_retries),
+         "--deepgram-retry-backoff-sec",
+         str(req.deepgram_retry_backoff_sec),
+         "--pre-roll-sec",
+         str(req.pre_roll_sec),
+         "--gemini-model",
+         req.gemini_model,
+         "--similarity-threshold",
+         str(req.similarity_threshold),
+         "--temperature",
+         str(req.temperature),
+     ]
+     if req.deepgram_language:
+         args.extend(["--deepgram-language", req.deepgram_language])
+     if req.force_deepgram:
+         args.append("--force-deepgram")
+     if req.force_keyframes:
+         args.append("--force-keyframes")
+     return args
+
+
+ def _build_output_files(out_dir: Path, variant: str) -> Dict[str, str]:
+     return {
+         "utterances": str(out_dir / "utterances.json"),
+         "keyframes_parsed": str(out_dir / "keyframes_parsed.json"),
+         "keyframes_with_utterances": str(out_dir / "keyframes_with_utterances.json"),
+         "final_output": str(
+             out_dir / ("final_output.json" if variant == "full" else "final_output_demo_code.json")
+         ),
+         "final_output_condensed": str(
+             out_dir / ("final_output_condensed.json" if variant == "full" else "final_output_demo_code_condensed.json")
+         ),
+     }
+
+
+ def _artifact_state(output_files: Dict[str, str]) -> Dict[str, Dict[str, Any]]:
+     state: Dict[str, Dict[str, Any]] = {}
+     for key, p in output_files.items():
+         path = Path(p)
+         if path.exists():
+             try:
+                 st = path.stat()
+                 state[key] = {
+                     "size_bytes": int(st.st_size),
+                     "mtime": float(st.st_mtime),
+                 }
+             except Exception:
+                 state[key] = {"size_bytes": -1, "mtime": -1.0}
+     return state
+
+
+ def _format_artifact_compact(state: Dict[str, Dict[str, Any]]) -> str:
+     if not state:
+         return "none"
+     parts = []
+     for k in sorted(state.keys()):
+         sz = float(state[k].get("size_bytes", 0))
+         parts.append(f"{k}:{sz/1024.0:.1f}KB")
+     return ", ".join(parts)
+
+
+ def _watch_run(
+     run_id: str,
+     proc: subprocess.Popen,
+     started_at: float,
+     log_fh,
+     heartbeat_sec: float,
+ ) -> None:
+     heartbeat_sec = max(2.0, float(heartbeat_sec))
+     last_hb = 0.0
+     last_artifact_change = started_at
+     last_state: Dict[str, Dict[str, Any]] = {}
+
+     # Emit periodic progress so logs are not "stuck" during long calls.
+     while True:
+         now = time.time()
+         rc = proc.poll()
+
+         if (now - last_hb) >= heartbeat_sec:
+             try:
+                 meta_file = _meta_path(run_id)
+                 meta = _read_json(meta_file) if meta_file.exists() else {"run_id": run_id}
+                 out_files = meta.get("output_files", {}) or {}
+                 cur_state = _artifact_state(out_files)
+                 changed = cur_state != last_state
+                 if changed:
+                     last_artifact_change = now
+                 unchanged_for = now - last_artifact_change
+                 elapsed = now - started_at
+
+                 log_fh.write(
+                     "[runner] heartbeat "
+                     f"elapsed={elapsed:.1f}s pid={proc.pid} "
+                     f"artifacts={len(cur_state)}/{len(out_files)} "
+                     f"changed={'yes' if changed else 'no'} "
+                     f"unchanged_for={unchanged_for:.1f}s "
+                     f"[{_format_artifact_compact(cur_state)}]\n"
+                 )
+                 log_fh.flush()
+
+                 meta["last_heartbeat_epoch"] = now
+                 meta["last_heartbeat_elapsed_sec"] = round(elapsed, 3)
+                 meta["artifacts_ready_count"] = len(cur_state)
+                 meta["artifacts_total_count"] = len(out_files)
+                 meta["artifacts_unchanged_for_sec"] = round(unchanged_for, 3)
+                 _write_json(meta_file, meta)
+                 last_state = cur_state
+             except Exception as e:
+                 try:
+                     log_fh.write(f"[runner] heartbeat_error: {type(e).__name__}: {e}\n")
+                     log_fh.flush()
+                 except Exception:
+                     pass
+             last_hb = now
+
+         if rc is not None:
+             return_code = int(rc)
+             break
+
+         time.sleep(1.0)
+
+     finished_at = time.time()
+     try:
+         meta_file = _meta_path(run_id)
+         meta = _read_json(meta_file) if meta_file.exists() else {"run_id": run_id}
+         meta["status"] = "succeeded" if return_code == 0 else "failed"
+         meta["exit_code"] = int(return_code)
+         meta["finished_at_epoch"] = finished_at
+         meta["duration_sec"] = round(finished_at - started_at, 3)
+         _write_json(meta_file, meta)
+     except Exception as e:
+         try:
+             log_fh.write(f"\n[runner] failed to update metadata: {type(e).__name__}: {e}\n")
+             log_fh.flush()
+         except Exception:
+             pass
+
+     try:
+         log_fh.write(f"\n[runner] process finished with exit_code={return_code}\n")
+         log_fh.flush()
+     except Exception:
+         pass
+     finally:
+         try:
+             log_fh.close()
+         except Exception:
+             pass
+
+
+ def _start_pipeline(pipeline_script: Path, req: PipelineRequest, variant: str) -> Dict[str, Any]:
+     if not pipeline_script.exists():
+         raise HTTPException(status_code=500, detail=f"Missing pipeline script: {pipeline_script}")
+
+     run_id = uuid.uuid4().hex[:12]
+     run_dir = _run_dir(run_id)
+     run_dir.mkdir(parents=True, exist_ok=True)
+
+     video_path = _resolve_video_input(req, run_id, run_dir)
+     _validate_video_file(video_path)
+     out_dir = _resolve_out_dir(req, run_id)
+     python_exe = _resolve_python_executable(req)
+
+     cmd = [
+         python_exe,
+         "-u",
+         str(pipeline_script),
+         "--python",
+         python_exe,
+         *_build_common_args(req, video_path, out_dir),
+     ]
+
+     started = time.time()
+     logs_path = _logs_path(run_id)
+     log_fh = open(logs_path, "a", encoding="utf-8", buffering=1)
+     log_fh.write(
+         f"[runner] run_id={run_id} variant={variant} started_at_epoch={started}\n"
+         f"[runner] command={' '.join(cmd)}\n"
+         f"[runner] cwd={PIPELINES_DIR}\n\n"
+         f"[runner] heartbeat_interval_sec={req.log_heartbeat_sec}\n"
+         f"[runner] python_unbuffered=1\n\n"
+     )
+     log_fh.flush()
+
+     child_env = os.environ.copy()
+     child_env["PYTHONUNBUFFERED"] = "1"
+     child_env.setdefault("PYTHONIOENCODING", "utf-8")
+
+     proc = subprocess.Popen(
+         cmd,
+         cwd=str(PIPELINES_DIR),
+         stdout=log_fh,
+         stderr=subprocess.STDOUT,
+         text=True,
+         env=child_env,
+     )
+
+     meta = {
+         "variant": variant,
+         "run_id": run_id,
+         "python_executable": python_exe,
+         "command": cmd,
+         "status": "running",
+         "exit_code": None,
+         "pid": proc.pid,
+         "started_at_epoch": started,
+         "finished_at_epoch": None,
+         "duration_sec": None,
+         "out_dir": str(out_dir),
+         "logs_path": str(logs_path),
+         "heartbeat_interval_sec": float(req.log_heartbeat_sec),
+         "output_files": _build_output_files(out_dir, variant),
+     }
+     _write_json(_meta_path(run_id), meta)
+
+     watcher = threading.Thread(
+         target=_watch_run,
+         args=(run_id, proc, started, log_fh, float(req.log_heartbeat_sec)),
+         daemon=True,
+     )
+     watcher.start()
+
+     return {
+         "run_id": run_id,
+         "variant": variant,
+         "status": "running",
+         "python_executable": python_exe,
+         "status_path": f"/runs/{run_id}",
+         "logs_path": f"/runs/{run_id}/logs",
+         "final_output_path": f"/runs/{run_id}/final-output",
+         "final_output_condensed_path": f"/runs/{run_id}/final-output/condensed",
+         "out_dir": str(out_dir),
+     }
+
+
+ @app.get("/health")
+ def health() -> Dict[str, str]:
+     return {"status": "ok"}
+
+
+ @app.get("/")
+ def root() -> Dict[str, Any]:
+     return {
+         "service": "deployed-meet",
+         "status": "ok",
+         "docs": "/docs",
+         "routes": [
+             "/pipeline/full",
+             "/pipeline/demo-code",
+             "/runs/{run_id}",
+             "/runs/{run_id}/logs",
+             "/runs/{run_id}/final-output",
+             "/runs/{run_id}/final-output/condensed",
+         ],
+     }
+
+
+ @app.post("/pipeline/full")
+ def pipeline_full(req: PipelineRequest) -> Dict[str, Any]:
+     return _start_pipeline(PIPELINES_DIR / "run_pipeline_all.py", req, variant="full")
+
+
+ @app.post("/pipeline/demo-code")
+ def pipeline_demo_code(req: PipelineRequest) -> Dict[str, Any]:
+     return _start_pipeline(PIPELINES_DIR / "run_pipeline_demo_code.py", req, variant="demo_code")
+
+
+ @app.get("/runs/{run_id}")
+ def run_status(run_id: str) -> Dict[str, Any]:
+     return _get_meta_or_404(run_id)
+
+
+ @app.get("/runs/{run_id}/logs")
+ def run_logs(run_id: str, tail_lines: int = 300) -> PlainTextResponse:
+     meta = _get_meta_or_404(run_id)
+     p = Path(meta.get("logs_path", ""))
+     if not p.exists():
+         return PlainTextResponse("")
+     txt = p.read_text(encoding="utf-8", errors="replace")
+     limit = max(1, min(int(tail_lines), 5000))
+     return PlainTextResponse(_tail(txt, max_lines=limit))
+
+
+ @app.get("/runs/{run_id}/final-output")
+ def run_final_output(run_id: str) -> Any:
+     meta = _get_meta_or_404(run_id)
+     status = meta.get("status")
+     out_file = Path(meta["output_files"]["final_output"])
+
+     if status == "running":
+         return JSONResponse(
+             status_code=202,
+             content={
+                 "run_id": run_id,
+                 "status": status,
+                 "message": "Pipeline is still running. Check /runs/{run_id}/logs for live progress.",
+                 "logs_path": f"/runs/{run_id}/logs",
+             },
+         )
+     if status == "failed":
+         raise HTTPException(
+             status_code=409,
+             detail={
+                 "run_id": run_id,
+                 "status": status,
+                 "message": "Pipeline failed. Check logs for details.",
+                 "logs_path": f"/runs/{run_id}/logs",
+             },
+         )
+     if not out_file.exists():
+         raise HTTPException(status_code=404, detail=f"Final output not found: {out_file}")
+     return _read_json(out_file)
+
+
+ @app.get("/runs/{run_id}/final-output/condensed")
+ def run_final_output_condensed(run_id: str) -> Any:
+     meta = _get_meta_or_404(run_id)
+     status = meta.get("status")
+     out_file = Path(meta["output_files"]["final_output_condensed"])
+
+     if status == "running":
+         return JSONResponse(
+             status_code=202,
+             content={
+                 "run_id": run_id,
+                 "status": status,
+                 "message": "Pipeline is still running. Check /runs/{run_id}/logs for live progress.",
+                 "logs_path": f"/runs/{run_id}/logs",
+             },
+         )
+     if status == "failed":
+         raise HTTPException(
+             status_code=409,
+             detail={
+                 "run_id": run_id,
+                 "status": status,
+                 "message": "Pipeline failed. Check logs for details.",
+                 "logs_path": f"/runs/{run_id}/logs",
+             },
+         )
+     if not out_file.exists():
+         raise HTTPException(status_code=404, detail=f"Condensed final output not found: {out_file}")
+     return _read_json(out_file)
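The run lifecycle above (start, poll status, tail logs, fetch output) can be driven end to end with a few requests. A minimal client sketch against a locally served instance; the base URL and the video URL are placeholders:

```python
import time

import httpx

BASE = "http://127.0.0.1:7860"  # assumes uvicorn is serving api.index:app locally

# Start the demo-code variant from a direct video URL (placeholder link).
start = httpx.post(
    f"{BASE}/pipeline/demo-code",
    json={"video_url": "https://example.com/meeting.mp4"},
    timeout=600.0,  # the input download happens inside this request
).json()
run_id = start["run_id"]

# Poll until the watcher thread marks the run finished.
while True:
    meta = httpx.get(f"{BASE}/runs/{run_id}").json()
    if meta["status"] in {"succeeded", "failed"}:
        break
    time.sleep(10)

if meta["status"] == "succeeded":
    final = httpx.get(f"{BASE}/runs/{run_id}/final-output").json()
    print(sorted(final))
else:
    # On failure the final-output route returns 409, so read the logs instead.
    print(httpx.get(f"{BASE}/runs/{run_id}/logs", params={"tail_lines": 100}).text)
```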
app.py ADDED
@@ -0,0 +1,279 @@
+ from __future__ import annotations
+
+ import os
+ import time
+ from typing import Any, Dict, Optional, Tuple
+
+ import gradio as gr
+
+ from run_manager import get_final_output, get_logs, get_status, start_run
+
+
+ def _clean_optional(value: Optional[str]) -> Optional[str]:
+     if value is None:
+         return None
+     text = str(value).strip()
+     return text or None
+
+
+ def _err_payload(message: str) -> Dict[str, Any]:
+     return {"status": "error", "message": message}
+
+
+ def start_pipeline(
+     variant: str,
+     input_mode: str,
+     video_file_path: Optional[str],
+     video_url: Optional[str],
+     out_dir: Optional[str],
+     python_bin: Optional[str],
+     deepgram_model: str,
+     deepgram_language: Optional[str],
+     deepgram_request_timeout_sec: float,
+     deepgram_connect_timeout_sec: float,
+     deepgram_retries: int,
+     deepgram_retry_backoff_sec: float,
+     force_deepgram: bool,
+     force_keyframes: bool,
+     pre_roll_sec: float,
+     gemini_model: str,
+     similarity_threshold: float,
+     temperature: float,
+     log_heartbeat_sec: float,
+ ) -> Tuple[str, Dict[str, Any], str, str]:
+     try:
+         chosen_video_file = None
+         chosen_video_url = None
+         mode = (input_mode or "").strip().lower()
+
+         if mode == "upload file":
+             chosen_video_file = _clean_optional(video_file_path)
+             if not chosen_video_file:
+                 raise ValueError("Select a video file for Upload File mode.")
+         elif mode == "video url":
+             chosen_video_url = _clean_optional(video_url)
+             if not chosen_video_url:
+                 raise ValueError("Provide video_url for Video URL mode.")
+         else:
+             raise ValueError("Invalid input mode.")
+
+         result = start_run(
+             variant=variant,
+             video_file_path=chosen_video_file,
+             video_url=chosen_video_url,
+             out_dir=_clean_optional(out_dir),
+             python_bin=_clean_optional(python_bin),
+             deepgram_model=deepgram_model,
+             deepgram_language=_clean_optional(deepgram_language),
+             deepgram_request_timeout_sec=float(deepgram_request_timeout_sec),
+             deepgram_connect_timeout_sec=float(deepgram_connect_timeout_sec),
+             deepgram_retries=int(deepgram_retries),
+             deepgram_retry_backoff_sec=float(deepgram_retry_backoff_sec),
+             force_deepgram=bool(force_deepgram),
+             force_keyframes=bool(force_keyframes),
+             pre_roll_sec=float(pre_roll_sec),
+             gemini_model=gemini_model,
+             similarity_threshold=float(similarity_threshold),
+             temperature=float(temperature),
+             log_heartbeat_sec=float(log_heartbeat_sec),
+         )
+         run_id = str(result["run_id"])
+         logs = get_logs(run_id, tail_lines=120)
+         return run_id, result, logs, run_id
+     except Exception as e:
+         msg = f"{type(e).__name__}: {e}"
+         return "", _err_payload(msg), msg, ""
+
+
+ def refresh_status_logs(run_id: str, tail_lines: int) -> Tuple[Dict[str, Any], str]:
+     rid = _clean_optional(run_id)
+     if not rid:
+         return _err_payload("Enter a run_id."), ""
+     try:
+         status = get_status(rid)
+         logs = get_logs(rid, tail_lines=int(tail_lines))
+         return status, logs
+     except Exception as e:
+         return _err_payload(f"{type(e).__name__}: {e}"), ""
+
+
+ def fetch_output(run_id: str, condensed: bool) -> Dict[str, Any]:
+     rid = _clean_optional(run_id)
+     if not rid:
+         return _err_payload("Enter a run_id.")
+     try:
+         return get_final_output(rid, condensed=condensed)
+     except Exception as e:
+         return _err_payload(f"{type(e).__name__}: {e}")
+
+
+ def watch_run(
+     run_id: str,
+     tail_lines: int,
+     poll_sec: float,
+ ):
+     rid = _clean_optional(run_id)
+     if not rid:
+         yield _err_payload("Enter a run_id."), "", None, None
+         return
+
+     sleep_sec = max(1.0, float(poll_sec))
+     max_tail = max(10, min(int(tail_lines), 5000))
+
+     while True:
+         try:
+             status = get_status(rid)
+             logs = get_logs(rid, tail_lines=max_tail)
+         except Exception as e:
+             yield _err_payload(f"{type(e).__name__}: {e}"), "", None, None
+             return
+
+         state = str(status.get("status", "unknown")).lower()
+         if state in {"succeeded", "failed"}:
+             full_payload = None
+             condensed_payload = None
+             if state == "succeeded":
+                 try:
+                     full_payload = get_final_output(rid, condensed=False)
+                 except Exception as e:
+                     full_payload = _err_payload(f"{type(e).__name__}: {e}")
+                 try:
+                     condensed_payload = get_final_output(rid, condensed=True)
+                 except Exception as e:
+                     condensed_payload = _err_payload(f"{type(e).__name__}: {e}")
+             yield status, logs, full_payload, condensed_payload
+             return
+
+         yield status, logs, None, None
+         time.sleep(sleep_sec)
+
+
+ with gr.Blocks(title="deployed-meet") as demo:
+     gr.Markdown(
+         """
+         # deployed-meet (Gradio)
+         Start either pipeline variant, then monitor logs and fetch final outputs by `run_id`.
+         - `full`: Gemini on all keyframe types.
+         - `demo-code`: Gemini only on demo keyframes, slides+code are OCR/transcript based.
+         """
+     )
+
+     with gr.Tab("Start Run"):
+         variant = gr.Dropdown(
+             choices=[
+                 ("Full pipeline (Gemini on slides/code/demo)", "full"),
+                 ("Demo-only Gemini pipeline (slides+code OCR)", "demo-code"),
+             ],
+             value="demo-code",
+             label="Pipeline Variant",
+         )
+         input_mode = gr.Radio(
+             choices=["Upload File", "Video URL"],
+             value="Upload File",
+             label="Input Mode",
+         )
+         video_file = gr.File(label="Video File", type="filepath")
+         video_url = gr.Textbox(label="Video URL", placeholder="https://.../meeting.mp4")
+
+         out_dir = gr.Textbox(
+             label="Output Directory (optional)",
+             placeholder="run_001",
+         )
+         python_bin = gr.Textbox(
+             label="Python Executable (optional)",
+             placeholder="Leave blank to auto-resolve",
+         )
+
+         with gr.Accordion("Advanced Settings", open=False):
+             deepgram_model = gr.Textbox(label="Deepgram Model", value="nova-3")
+             deepgram_language = gr.Textbox(label="Deepgram Language (optional)", value="")
+             deepgram_request_timeout_sec = gr.Number(label="Deepgram Request Timeout (sec)", value=1200.0)
+             deepgram_connect_timeout_sec = gr.Number(label="Deepgram Connect Timeout (sec)", value=30.0)
+             deepgram_retries = gr.Number(label="Deepgram Retries", value=3, precision=0)
+             deepgram_retry_backoff_sec = gr.Number(label="Deepgram Retry Backoff (sec)", value=2.0)
+             force_deepgram = gr.Checkbox(label="Force Deepgram Re-run", value=False)
+             force_keyframes = gr.Checkbox(label="Force Keyframe Re-run", value=False)
+             pre_roll_sec = gr.Number(label="Pre-roll Seconds", value=3.0)
+             gemini_model = gr.Textbox(label="Gemini Model", value="gemini-2.5-flash")
+             similarity_threshold = gr.Number(label="Similarity Threshold", value=0.82)
+             temperature = gr.Number(label="Temperature", value=0.2)
+             log_heartbeat_sec = gr.Number(label="Heartbeat Log Interval (sec)", value=10.0)
+
+         start_btn = gr.Button("Start Pipeline", variant="primary")
+         start_run_id = gr.Textbox(label="Run ID", interactive=False)
+         start_status = gr.JSON(label="Start Response / Error")
+         start_logs = gr.Textbox(label="Initial Logs", lines=14)
+
+     with gr.Tab("Track Run"):
+         track_run_id = gr.Textbox(label="Run ID", placeholder="Paste run_id from Start tab")
+         tail_lines = gr.Slider(label="Log Tail Lines", minimum=50, maximum=3000, value=300, step=50)
+         poll_sec = gr.Slider(label="Live Poll Interval (sec)", minimum=1, maximum=20, value=3, step=1)
+
+         with gr.Row():
+             refresh_btn = gr.Button("Refresh Status + Logs")
+             watch_btn = gr.Button("Watch Live")
+             full_btn = gr.Button("Fetch Final Output")
+             condensed_btn = gr.Button("Fetch Condensed Output")
+
+         track_status = gr.JSON(label="Run Status")
+         track_logs = gr.Textbox(label="Run Logs", lines=22)
+         track_full_output = gr.JSON(label="Final Output")
+         track_condensed_output = gr.JSON(label="Condensed Final Output")
+
+     start_btn.click(
+         fn=start_pipeline,
+         inputs=[
+             variant,
+             input_mode,
+             video_file,
+             video_url,
+             out_dir,
+             python_bin,
+             deepgram_model,
+             deepgram_language,
+             deepgram_request_timeout_sec,
+             deepgram_connect_timeout_sec,
+             deepgram_retries,
+             deepgram_retry_backoff_sec,
+             force_deepgram,
+             force_keyframes,
+             pre_roll_sec,
+             gemini_model,
+             similarity_threshold,
+             temperature,
+             log_heartbeat_sec,
+         ],
+         outputs=[start_run_id, start_status, start_logs, track_run_id],
+     )
+
+     refresh_btn.click(
+         fn=refresh_status_logs,
+         inputs=[track_run_id, tail_lines],
+         outputs=[track_status, track_logs],
+     )
+
+     watch_btn.click(
+         fn=watch_run,
+         inputs=[track_run_id, tail_lines, poll_sec],
+         outputs=[track_status, track_logs, track_full_output, track_condensed_output],
+     )
+
+     full_btn.click(
+         fn=lambda rid: fetch_output(rid, False),
+         inputs=[track_run_id],
+         outputs=[track_full_output],
+     )
+
+     condensed_btn.click(
+         fn=lambda rid: fetch_output(rid, True),
+         inputs=[track_run_id],
+         outputs=[track_condensed_output],
+     )
+
+
+ if __name__ == "__main__":
+     demo.queue(default_concurrency_limit=2).launch(
+         server_name="0.0.0.0",
+         server_port=int(os.getenv("PORT", "7860")),
+         show_error=True,
+     )
pipelines/assign_utterances_to_keyframes.py ADDED
@@ -0,0 +1,249 @@
+ import json
+ import argparse
+ from typing import Any, Dict, List, Optional, Tuple
+
+
+ def safe_str(x: Any) -> str:
+     return "" if x is None else str(x)
+
+
+ def extract_list(data: Any) -> List[Dict[str, Any]]:
+     # Accept either a list of items, or a dict that contains a list under common keys.
+     if isinstance(data, list):
+         return [x for x in data if isinstance(x, dict)]
+     if isinstance(data, dict):
+         for k in ["utterances", "items", "segments", "results", "data"]:
+             if k in data and isinstance(data[k], list):
+                 return [x for x in data[k] if isinstance(x, dict)]
+     return []
+
+
+ def extract_keyframes(data: Any) -> List[Dict[str, Any]]:
+     # Accept either a list of keyframes, or a dict that contains a list under common keys.
+     if isinstance(data, list):
+         return [x for x in data if isinstance(x, dict)]
+     if isinstance(data, dict):
+         for k in ["keyframes", "items", "results", "data"]:
+             if k in data and isinstance(data[k], list):
+                 return [x for x in data[k] if isinstance(x, dict)]
+     return []
+
+
+ def get_time_field(d: Dict[str, Any], keys: List[str]) -> Optional[float]:
+     for k in keys:
+         if k in d:
+             try:
+                 v = d[k]
+                 if v is None:
+                     continue
+                 return float(v)
+             except Exception:
+                 continue
+     return None
+
+
+ def get_utterance_times(u: Dict[str, Any]) -> Tuple[Optional[float], Optional[float]]:
+     # Try common fields for start/end times
+     start = get_time_field(u, ["start_sec", "start_s", "start", "start_time", "t_start", "begin", "from"])
+     end = get_time_field(u, ["end_sec", "end_s", "end", "end_time", "t_end", "finish", "to"])
+
+     # If only one is present, treat utterance as a point-in-time
+     if start is not None and end is None:
+         end = start
+     if end is not None and start is None:
+         start = end
+
+     return start, end
+
+
+ def get_utterance_text(u: Dict[str, Any]) -> str:
+     for k in ["text", "utterance", "content", "transcript", "sentence"]:
+         if k in u and safe_str(u[k]).strip():
+             return safe_str(u[k]).strip()
+
+     # Some formats store words list
+     if "words" in u and isinstance(u["words"], list):
+         parts = []
+         for w in u["words"]:
+             if isinstance(w, dict):
+                 t = w.get("word") or w.get("text")
+                 if t:
+                     parts.append(str(t))
+             elif isinstance(w, str):
+                 parts.append(w)
+         if parts:
+             return " ".join(parts).strip()
+
+     return ""
+
+
+ def overlaps(a0: float, a1: float, b0: float, b1: float) -> bool:
+     # Closed-open overlap check: [a0, a1) overlaps [b0, b1) iff max(starts) < min(ends)
+     return max(a0, b0) < min(a1, b1)
+
+
+ def main():
+     ap = argparse.ArgumentParser()
+     ap.add_argument("keyframes_json", help="Path to keyframes JSON (e.g. keyframes_parsed.json)")
+     ap.add_argument("utterances_json", help="Path to utterances.json")
+     ap.add_argument("-o", "--out", default="keyframes_with_utterances.json", help="Output JSON path")
+     ap.add_argument(
+         "--pre-roll-sec",
+         type=float,
+         default=3.0,
+         help="Seconds before each keyframe start that should also belong to that keyframe.",
+     )
+     args = ap.parse_args()
+
+     # Load keyframes
+     with open(args.keyframes_json, "r", encoding="utf-8") as f:
+         kf_raw = json.load(f)
+
+     keyframes_list = extract_keyframes(kf_raw)
+     if not keyframes_list:
+         raise ValueError(
+             "No keyframes found. Expected a list, or an object containing keyframes under one of: "
+             "keyframes/items/results/data."
+         )
+
+     # Sort keyframes by time
+     keyframes = sorted(
+         keyframes_list,
+         key=lambda k: (
+             float(k.get("t_sec", 0.0) or 0.0),
+             int(k.get("keyframe_idx", 0) or 0),
+         ),
+     )
+     if not keyframes:
+         raise ValueError("No keyframes found in keyframes JSON")
+
+     pre_roll_sec = max(0.0, float(args.pre_roll_sec))
+
+     # Precompute keyframe times and windows.
+     # window i:
+     #   - first keyframe: [t_0, t_1)
+     #   - others: [max(t_i - pre_roll_sec, t_{i-1}), t_{i+1})
+     # This makes [t_i - pre_roll_sec, t_i) belong to BOTH keyframe i and keyframe i-1.
+     t = [float(kf.get("t_sec", 0.0) or 0.0) for kf in keyframes]
+     n = len(t)
+     windows: List[Tuple[float, float]] = []
+     for i in range(n):
+         if i == 0:
+             start = t[i]
+         else:
+             start = max(t[i] - pre_roll_sec, t[i - 1])
+         end = t[i + 1] if i < n - 1 else float("inf")
+         windows.append((start, end))
+
+     # Prepare output keyframes (copy + add assigned_utterances)
+     out_keyframes: List[Dict[str, Any]] = []
+     for kf in keyframes:
+         kf_out = dict(kf)
+         kf_out["assigned_utterances"] = []
+         out_keyframes.append(kf_out)
+
+     # Load utterances
+     with open(args.utterances_json, "r", encoding="utf-8") as f:
+         u_raw = json.load(f)
+
+     utterances = extract_list(u_raw)
+     if not utterances:
+         raise ValueError(
+             "No utterances found. Expected utterances.json to be a list, or a dict containing a list under "
+             "one of: utterances/items/segments/results/data."
+         )
+
+     unassigned = []
+     multi_assigned = 0
+     assigned_total = 0
+
+     for u in utterances:
+         text = get_utterance_text(u).strip()
+         u_start, u_end = get_utterance_times(u)
+
+         if u_start is None or u_end is None or not text:
+             unassigned.append({"reason": "missing_text_or_time", "utterance": u})
+             continue
+
+         u_start = float(u_start)
+         u_end = float(u_end)
+         if u_end < u_start:
+             u_start, u_end = u_end, u_start
+
+         # Make point-in-time utterances half-open with tiny duration
+         if u_end == u_start:
+             u_end = u_start + 1e-6
+
+         matched_indexes = []
+         for i, (w0, w1) in enumerate(windows):
+             if overlaps(u_start, u_end, w0, w1):
+                 matched_indexes.append(i)
+
+         if not matched_indexes:
+             # Fallback for degenerate boundary conditions.
+             for i, (w0, w1) in enumerate(windows):
+                 eps = 1e-9
+                 if overlaps(u_start - eps, u_end + eps, w0, w1):
+                     matched_indexes.append(i)
+
+         if not matched_indexes:
+             unassigned.append({"reason": "no_overlapping_keyframe_window", "utterance": u})
+             continue
+
+         # Keep indexes sorted and unique.
+         matched_indexes = sorted(set(matched_indexes))
+
+         if len(matched_indexes) > 1:
+             multi_assigned += 1
+
+         payload = dict(u)
+         payload["_text"] = text
+         payload["_start_sec"] = u_start
+         payload["_end_sec"] = u_end
+         payload["_overlaps_sorted_indexes"] = matched_indexes
+
+         for idx in matched_indexes:
+             payload2 = dict(payload)
+             payload2["_assigned_sorted_index"] = idx
+             payload2["_assigned_keyframe_idx"] = out_keyframes[idx].get("keyframe_idx")
+             payload2["_assigned_t_sec"] = out_keyframes[idx].get("t_sec")
+             out_keyframes[idx]["assigned_utterances"].append(payload2)
+             assigned_total += 1
+
+     # Sort utterances inside each keyframe by start time
+     for kf in out_keyframes:
+         kf["assigned_utterances"].sort(key=lambda x: float(x.get("_start_sec", 0.0) or 0.0))
+
+     out = {
+         "meta": {
+             "keyframes_file": args.keyframes_json,
+             "utterances_file": args.utterances_json,
+             "keyframes_count": len(out_keyframes),
+             "utterances_count": len(utterances),
+             "assigned_total": assigned_total,  # counts duplicates if an utterance overlaps multiple keyframes
+             "multi_assigned_utterances": multi_assigned,
+             "unassigned_count": len(unassigned),
+             "pre_roll_sec": pre_roll_sec,
+             "window_strategy": (
+                 "pre-roll overlap windows: "
+                 "first [t_0, t_1), others [max(t_i-pre_roll_sec, t_{i-1}), t_{i+1}), "
+                 "last ends at +inf"
+             ),
+         },
+         "keyframes": out_keyframes,
+         "unassigned_utterances": unassigned,
+     }
+
+     with open(args.out, "w", encoding="utf-8") as f:
+         json.dump(out, f, ensure_ascii=False, indent=2)
+
+     print(f"Done. Wrote: {args.out}")
+     print(f"Keyframes: {len(out_keyframes)}")
+     print(f"Utterances: {len(utterances)}")
+     print(f"Assigned total (including duplicates): {assigned_total}")
+     print(f"Utterances that overlapped multiple keyframes: {multi_assigned}")
+     print(f"Unassigned utterances: {len(unassigned)}")
+
+
+ if __name__ == "__main__":
+     main()
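The pre-roll window construction in `main()` is easiest to sanity-check with a tiny worked example. A sketch with three keyframes at t = 0, 10, and 20 seconds and the default `pre_roll_sec` of 3 (values are illustrative):

```python
# Mirrors the window logic in main() above.
pre_roll_sec = 3.0
t = [0.0, 10.0, 20.0]

windows = []
for i in range(len(t)):
    start = t[i] if i == 0 else max(t[i] - pre_roll_sec, t[i - 1])
    end = t[i + 1] if i < len(t) - 1 else float("inf")
    windows.append((start, end))

print(windows)
# [(0.0, 10.0), (7.0, 20.0), (17.0, inf)]
# An utterance spanning 8.0-9.0 s overlaps both (0.0, 10.0) and (7.0, 20.0),
# so it is assigned to keyframes 0 and 1 and counted in multi_assigned.
```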
pipelines/build_final_output.py ADDED
@@ -0,0 +1,758 @@
1
+ # build_final_output.py
2
+ # Usage:
3
+ # pip install google-genai pydantic python-dotenv
4
+ # set GEMINI_API_KEY=...
5
+ # python build_final_output.py ^
6
+ # --keyframes "C:\meet-agent\out_folder\keyframes_with_utterances.json" ^
7
+ # --out "C:\meet-agent\out_folder\final_output.json" ^
8
+ # --model "gemini-2.5-flash"
9
+
10
+ import argparse
11
+ import json
12
+ import os
13
+ import re
14
+ import time
15
+ from dataclasses import dataclass
16
+ from typing import Any, Dict, List, Optional, Tuple
17
+
18
+ from dotenv import load_dotenv
19
+ from pydantic import BaseModel, Field
20
+ from google import genai
21
+ from google.genai import types
22
+
23
+
24
+ # -----------------------------
25
+ # Helpers
26
+ # -----------------------------
27
+ def log(msg: str) -> None:
28
+ print(msg, flush=True)
29
+
30
+
31
+ def load_json(path: str) -> Any:
32
+ with open(path, "r", encoding="utf-8") as f:
33
+ return json.load(f)
34
+
35
+
36
+ def save_json(path: str, obj: Any) -> None:
37
+ out_dir = os.path.dirname(path)
38
+ if out_dir:
39
+ os.makedirs(out_dir, exist_ok=True)
40
+ with open(path, "w", encoding="utf-8") as f:
41
+ json.dump(obj, f, ensure_ascii=False, indent=2)
42
+
43
+
44
+ def sec_to_hhmmss(t: float) -> str:
45
+ t = max(0.0, float(t))
46
+ hh = int(t // 3600)
47
+ mm = int((t % 3600) // 60)
48
+ ss = int(t % 60)
49
+ return f"{hh:02d}:{mm:02d}:{ss:02d}"
50
+
51
+
52
+ def tokenize(s: str) -> List[str]:
53
+ s = s.lower()
54
+ s = re.sub(r"[^a-z0-9_]+", " ", s)
55
+ toks = [t for t in s.split() if t]
56
+ return toks
57
+
58
+
59
+ def jaccard_similarity(a: str, b: str) -> float:
60
+ sa, sb = set(tokenize(a)), set(tokenize(b))
61
+ if not sa and not sb:
62
+ return 1.0
63
+ if not sa or not sb:
64
+ return 0.0
65
+ return len(sa & sb) / max(1, len(sa | sb))
66
+
67
+
68
+ def safe_join_text(lines: List[str], max_chars: int = 8000) -> str:
69
+ """Join lines but prevent prompt bloat."""
70
+ out = []
71
+ total = 0
72
+ for ln in lines:
73
+ if total + len(ln) + 1 > max_chars:
74
+ break
75
+ out.append(ln)
76
+ total += len(ln) + 1
77
+ return "\n".join(out)
78
+
79
+
80
+ def frame_signature(frame: Optional[Dict[str, Any]]) -> str:
81
+ """Build a signature string for similarity comparison to previous keyframe."""
82
+ if not frame:
83
+ return ""
84
+ on_screen = frame.get("on_screen_text") or []
85
+ screen_parse = frame.get("screen_parse") or {}
86
+ screen_parse_text = summarize_screen_parse(screen_parse, max_regions=3, max_region_lines=6, max_ocr_lines=30, max_chars=2500)
87
+ on_screen_small = safe_join_text(on_screen[:80], max_chars=2500)
88
+ return f"{on_screen_small}\n{screen_parse_text}"
89
+
90
+
91
+ def diff_lists(prev: List[str], cur: List[str], max_items: int = 25) -> Tuple[List[str], List[str]]:
92
+ prev_set, cur_set = set(prev), set(cur)
93
+ added = [x for x in cur if x not in prev_set][:max_items]
94
+ removed = [x for x in prev if x not in cur_set][:max_items]
95
+ return added, removed
96
+
97
+
98
+ def summarize_screen_parse(
99
+ screen_parse: Optional[Dict[str, Any]],
100
+ max_regions: int = 8,
101
+ max_region_lines: int = 12,
102
+ max_ocr_lines: int = 120,
103
+ max_chars: int = 9000,
104
+ ) -> str:
105
+ if not isinstance(screen_parse, dict) or not screen_parse:
106
+ return "unknown"
107
+
108
+ parts: List[str] = []
109
+ frame_w = screen_parse.get("frame_w")
110
+ frame_h = screen_parse.get("frame_h")
111
+ if frame_w is not None and frame_h is not None:
112
+ parts.append(f"frame_size: {frame_w}x{frame_h}")
113
+
114
+ regions = screen_parse.get("layout_regions") or []
115
+ if regions:
116
+ region_lines: List[str] = []
117
+ for i, region in enumerate(regions[:max_regions]):
118
+ label = region.get("label", "unknown")
119
+ conf = region.get("conf", "unknown")
120
+ box = region.get("box", [])
121
+ text_lines = region.get("text_lines") or []
122
+ text_lines_clean = [str(x).strip() for x in text_lines if str(x).strip()][:max_region_lines]
123
+ text_preview = " | ".join(text_lines_clean)
124
+ region_lines.append(
125
+ f"region[{i}] label={label}, conf={conf}, box={box}, text_lines={text_preview}"
126
+ )
127
+ parts.append("layout_regions:\n" + "\n".join(region_lines))
128
+
129
+ ocr_lines = screen_parse.get("ocr_lines") or []
130
+ if ocr_lines:
131
+ ocr_text: List[str] = []
132
+ for item in ocr_lines[:max_ocr_lines]:
133
+ txt = str(item.get("text", "")).strip()
134
+ if txt:
135
+ ocr_text.append(txt)
136
+ if ocr_text:
137
+ parts.append("ocr_lines:\n" + safe_join_text(ocr_text, max_chars=max_chars))
138
+
139
+ merged = "\n\n".join(parts).strip()
140
+ if not merged:
141
+ return "unknown"
142
+ return merged[:max_chars]
143
+
144
+
145
+ def split_sentences(text: str) -> List[str]:
+     if not text:
+         return []
+     parts = re.split(r"(?<=[.!?])\s+", str(text).strip())
+     out = []
+     for p in parts:
+         p = p.strip()
+         if p:
+             out.append(p)
+     return out
+
+
+ def build_content_change_summary(
+     prev_content_summary: Optional[str],
+     cur_content_summary: Optional[str],
+     max_items: int = 6,
+ ) -> str:
+     prev = (prev_content_summary or "").strip()
+     cur = (cur_content_summary or "").strip()
+     if not prev:
+         return "Initial keyframe in sequence; no previous content summary to diff against."
+     if not cur:
+         return "Current content summary is empty or unknown; unable to compute precise content diff."
+     if prev == cur:
+         return "No material content-summary change from the previous keyframe."
+
+     prev_sentences = split_sentences(prev)
+     cur_sentences = split_sentences(cur)
+     prev_set = set(prev_sentences)
+     cur_set = set(cur_sentences)
+
+     added = [s for s in cur_sentences if s not in prev_set][:max_items]
+     removed = [s for s in prev_sentences if s not in cur_set][:max_items]
+
+     # If sentence-level diff fails (e.g., heavy rewrites), use token-level fallback.
+     if not added and not removed:
+         prev_tokens = set(tokenize(prev))
+         cur_tokens = set(tokenize(cur))
+         added_tokens = sorted(list(cur_tokens - prev_tokens))[:12]
+         removed_tokens = sorted(list(prev_tokens - cur_tokens))[:12]
+         if not added_tokens and not removed_tokens:
+             return "Content summary wording changed but underlying content differences are unclear."
+         out = []
+         if added_tokens:
+             out.append("Added/updated terms: " + ", ".join(added_tokens))
+         if removed_tokens:
+             out.append("Removed/de-emphasized terms: " + ", ".join(removed_tokens))
+         return " ".join(out)
+
+     chunks = []
+     if added:
+         chunks.append(
+             "Added/updated in current content summary: "
+             + " ; ".join(a[:240] for a in added)
+         )
+     if removed:
+         chunks.append(
+             "Removed/de-emphasized vs previous content summary: "
+             + " ; ".join(r[:240] for r in removed)
+         )
+     return " ".join(chunks).strip()
+
+
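A quick worked example of the diff behavior (inputs invented):

```python
prev = "We load the data. We train the model."
cur = "We load the data. We evaluate the model."
build_content_change_summary(prev, cur)
# -> "Added/updated in current content summary: We evaluate the model. "
#    "Removed/de-emphasized vs previous content summary: We train the model."
# When every sentence changes slightly, the sentence diff comes back empty and
# the token-level fallback reports added/removed vocabulary instead.
```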
208
+ def extract_speakers_from_utterances(utterances: List[Dict[str, Any]]) -> List[str]:
+     """Unique speakers in order of first appearance."""
+     seen = set()
+     out = []
+     for u in utterances or []:
+         spk = str(u.get("speaker", "")).strip()
+         if not spk:
+             spk = "unknown"
+         if spk not in seen:
+             seen.add(spk)
+             out.append(spk)
+     return out
+
+
+ # -----------------------------
+ # Pydantic schema for Gemini
+ # -----------------------------
+ class FrameChange(BaseModel):
+     changed_summary: str = Field(
+         ...,
+         description="Only the content-summary diff from previous keyframe to current keyframe.",
+     )
+     possible_reason: str = Field(
+         ...,
+         description="Why it could have happened (grounded in utterances/on-screen info; if unknown say unknown).",
+     )
+     added_elements: List[str] = Field(
+         default_factory=list,
+         description="Notable on-screen text elements that appeared (from diff).",
+     )
+     removed_elements: List[str] = Field(
+         default_factory=list,
+         description="Notable on-screen text elements that disappeared (from diff).",
+     )
+
+
+ class FrameSummary(BaseModel):
+     keyframe_idx: int
+     frame_type: str
+     t_sec: float
+     timestamp: str
+     image_path: str
+
+     on_screen_text: List[str] = Field(default_factory=list)
+
+     # NEW: all speakers present in this keyframe's utterances
+     speakers: List[str] = Field(
+         default_factory=list,
+         description="Unique list of speakers who spoke during this keyframe (from assigned utterances).",
+     )
+
+     utterance_time_start: Optional[str] = None
+     utterance_time_end: Optional[str] = None
+
+     # UPDATED requirements: must explicitly mention speakers
+     utterance_summary: str = Field(
+         ...,
+         description="Summary of utterances during this keyframe; must explicitly attribute statements to speakers.",
+     )
+
+     # More detailed
+     content_summary: str = Field(
+         ...,
+         description="Detailed frame content summary grounded in frame_type, timestamp, on_screen_text, and screen_parse.",
+     )
+
+     # Combined synthesis
+     combined_summary: str = Field(
+         ...,
+         description="Summary that combines utterance_summary and content_summary.",
+     )
+
+     # NEW: change summary for every keyframe transition (prev -> current). null for first keyframe.
+     frame_change: Optional[FrameChange] = None
+
+     similarity_to_prev: float = 0.0
+     reused_prev_content: bool = False
+     notes: List[str] = Field(default_factory=list)
+
+
+ class FinalOutput(BaseModel):
+     meta: Dict[str, Any]
+     keyframes: List[FrameSummary]
+
+
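As a sanity check on the schema, a minimal valid instance (all values invented) can be built like this; fields with defaults may be omitted:

```python
fs = FrameSummary(
    keyframe_idx=0,
    frame_type="slides",
    t_sec=12.0,
    timestamp="00:00:12",
    image_path="frames/kf_000.jpg",
    utterance_summary="Speaker 0 introduces the agenda.",
    content_summary="Title slide listing three agenda items.",
    combined_summary="Speaker 0 walks through the agenda shown on the title slide.",
)
# frame_change defaults to None (first keyframe), similarity_to_prev to 0.0,
# and on_screen_text/speakers/notes to empty lists.
```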
293
+ # -----------------------------
+ # History manager (diminishing returns)
+ # -----------------------------
+ @dataclass
+ class HistoryState:
+     recent_frames: List[Dict[str, Any]]
+     long_memory: str
+     long_memory_max_chars: int = 4500
+
+     def __init__(self):
+         self.recent_frames = []
+         self.long_memory = ""
+
+     def add_frame(self, frame_summary_obj: Dict[str, Any], keep_recent: int = 4):
+         self.recent_frames.append(frame_summary_obj)
+         if len(self.recent_frames) > keep_recent:
+             to_compress = self.recent_frames[:-keep_recent]
+             self.recent_frames = self.recent_frames[-keep_recent:]
+             return to_compress
+         return []
+
+     def build_history_context(self) -> str:
+         parts = []
+         if self.long_memory.strip():
+             parts.append("LONG_MEMORY (old history, low weight):\n" + self.long_memory.strip())
+
+         if self.recent_frames:
+             parts.append("RECENT_HISTORY (high weight, most recent first):")
+             for fr in reversed(self.recent_frames):
+                 parts.append(
+                     f"- [{fr.get('timestamp','??')}] {fr.get('frame_type','?').upper()} "
+                     f"combined_summary: {fr.get('combined_summary','')[:900]}"
+                 )
+         return "\n".join(parts).strip()
+
+
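The overflow contract here is easy to miss: `add_frame` returns the frames that just fell out of the recent window (an empty list otherwise), and the caller is expected to fold those into `long_memory`. A small sketch, with the frame dicts invented:

```python
hs = HistoryState()
for i in range(5):
    overflow = hs.add_frame(
        {"timestamp": f"00:00:{i:02d}", "frame_type": "slides", "combined_summary": f"frame {i}"},
        keep_recent=4,
    )
# On the 5th call, overflow == [frame 0] and hs.recent_frames holds frames 1-4;
# the pipeline passes overflow to compress_into_long_memory (defined below).
```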
329
+ # -----------------------------
+ # Gemini calls
+ # -----------------------------
+ def gemini_client() -> genai.Client:
+     load_dotenv()
+
+     api_key = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
+     if not api_key:
+         raise ValueError("Missing GEMINI_API_KEY in environment (.env not loaded or key not set).")
+
+     return genai.Client(api_key=api_key)
+
+
+ def call_gemini_structured(
+     client: genai.Client,
+     model: str,
+     system_instruction: str,
+     user_prompt: str,
+     schema_model: Any,
+     temperature: float = 0.2,
+     max_retries: int = 3,
+ ) -> Any:
+     last_err = None
+     for attempt in range(1, max_retries + 1):
+         try:
+             resp = client.models.generate_content(
+                 model=model,
+                 contents=user_prompt,
+                 config=types.GenerateContentConfig(
+                     system_instruction=system_instruction,
+                     response_mime_type="application/json",
+                     response_schema=schema_model,
+                     temperature=temperature,
+                 ),
+             )
+             if getattr(resp, "parsed", None) is not None:
+                 return resp.parsed
+
+             txt = getattr(resp, "text", None)
+             if not txt:
+                 raise ValueError("Gemini returned no text/parsed output.")
+             return json.loads(txt)
+         except Exception as e:
+             last_err = e
+             time.sleep(0.7 * attempt)
+
+     raise RuntimeError(f"Gemini structured call failed after retries: {last_err}")
+
+
+ def compress_into_long_memory(
+     client: genai.Client,
+     model: str,
+     existing_long_memory: str,
+     frames_to_compress: List[Dict[str, Any]],
+     max_chars: int,
+ ) -> str:
+     if not frames_to_compress:
+         return existing_long_memory
+
+     bullets = []
+     for fr in frames_to_compress:
+         bullets.append(
+             f"[{fr.get('timestamp','??')}][{fr.get('frame_type','?')}] "
+             f"{fr.get('combined_summary','')[:500]}"
+         )
+     chunk = "\n".join(bullets)
+
+     system = (
+         "You compress meeting history. Output must be short, factual, and useful.\n"
+         "Do not invent details. Prefer concrete technical points and transitions.\n"
+         "Keep it under the requested character budget."
+     )
+     prompt = (
+         f"Existing LONG_MEMORY (may be empty):\n{existing_long_memory}\n\n"
+         f"New older frames to merge (older history):\n{chunk}\n\n"
+         f"Task:\n"
+         f"1) Merge them into LONG_MEMORY.\n"
+         f"2) Keep the result <= {max_chars} characters.\n"
+         f"3) Use bullet points.\n"
+         f"Return ONLY plain text."
+     )
+
+     resp = client.models.generate_content(
+         model=model,
+         contents=prompt,
+         config=types.GenerateContentConfig(
+             system_instruction=system,
+             temperature=0.2,
+             max_output_tokens=800,
+         ),
+     )
+     text = (getattr(resp, "text", "") or "").strip()
+     if not text:
+         merged = (existing_long_memory + "\n" + chunk).strip()
+         return merged[:max_chars]
+     return text[:max_chars]
+
+
+ # -----------------------------
+ # Core processing logic
+ # -----------------------------
+ def build_prompt_for_frame(
+     frame: Dict[str, Any],
+     history_context: str,
+     prev_frame: Optional[Dict[str, Any]],
+     prev_content_summary: Optional[str],
+     similarity_to_prev: float,
+     is_similar: bool,
+     transition_diff: Optional[Dict[str, Any]],
+ ) -> Tuple[str, str]:
+     frame_type = (frame.get("frame_type") or "").lower()
+     timestamp = frame.get("timestamp") or sec_to_hhmmss(frame.get("t_sec", 0.0))
+     t_sec = float(frame.get("t_sec", 0.0))
+
+     on_screen_text = frame.get("on_screen_text") or []
+     screen_parse_summary = summarize_screen_parse(
+         frame.get("screen_parse") or {},
+         max_regions=8,
+         max_region_lines=14,
+         max_ocr_lines=140,
+         max_chars=12000,
+     )
+     assigned_utterances = frame.get("assigned_utterances") or []
+     speakers = extract_speakers_from_utterances(assigned_utterances)
+
+     u_start_ts = None
+     u_end_ts = None
+     if assigned_utterances:
+         u_start = min(float(u.get("_start_sec", u.get("start", t_sec))) for u in assigned_utterances)
+         u_end = max(float(u.get("_end_sec", u.get("end", t_sec))) for u in assigned_utterances)
+         u_start_ts = sec_to_hhmmss(u_start)
+         u_end_ts = sec_to_hhmmss(u_end)
+
+     utt_lines = []
+     for u in assigned_utterances[:60]:
+         s = float(u.get("_start_sec", u.get("start", 0.0)))
+         e = float(u.get("_end_sec", u.get("end", 0.0)))
+         spk = str(u.get("speaker", "unknown")).strip() or "unknown"
+         txt = (u.get("text", "") or "").strip()
+         utt_lines.append(f"[{sec_to_hhmmss(s)}-{sec_to_hhmmss(e)}][{spk}] {txt}")
+     utterances_block = safe_join_text(utt_lines, max_chars=12000)
+
+     reuse_instruction = ""
+     if is_similar:
+         reuse_instruction = (
+             "IMPORTANT: This frame content is very similar to the previous keyframe.\n"
+             "Do NOT repeat the entire explanation.\n"
+             "Reuse prior context and focus on what is new.\n"
+             "frame_change must still be filled if a previous keyframe exists.\n"
+         )
+
+     prev_block = ""
+     prev_content_summary_block = "PREVIOUS_KEYFRAME_CONTENT_SUMMARY:\nnone\n\n"
+     if prev_frame is not None:
+         prev_idx = prev_frame.get("keyframe_idx", -1)
+         prev_ts = prev_frame.get("timestamp") or sec_to_hhmmss(prev_frame.get("t_sec", 0.0))
+         prev_type = (prev_frame.get("frame_type") or "unknown").lower()
+         prev_block = (
+             "PREVIOUS_KEYFRAME:\n"
+             f"- keyframe_idx: {prev_idx}\n"
+             f"- frame_type: {prev_type}\n"
+             f"- timestamp: {prev_ts}\n\n"
+         )
+         prev_content_summary_block = (
+             "PREVIOUS_KEYFRAME_CONTENT_SUMMARY:\n"
+             f"{(prev_content_summary or 'unknown').strip()}\n\n"
+         )
+
+     transition_diff_block = ""
+     if transition_diff is not None:
+         transition_diff_block = (
+             "KEYFRAME_TRANSITION_DIFF (computed from on_screen_text):\n"
+             f"added_elements: {transition_diff.get('added_elements', [])}\n"
+             f"removed_elements: {transition_diff.get('removed_elements', [])}\n\n"
+         )
+
+     system_instruction = (
+         "You are generating time-aware meeting notes per keyframe.\n"
+         "You must follow the provided schema exactly and return JSON only.\n"
+         "Do not invent facts not present in the inputs.\n"
+         "If something is unknown, say unknown.\n"
+         "History has diminishing importance: RECENT_HISTORY is high weight, LONG_MEMORY is low weight.\n"
+         "Speaker attribution is required for utterance summary.\n"
+     )
+
+     on_screen_capped = on_screen_text[:350]
+
+     if frame_type == "slides":
+         content_task = (
+             "For slides:\n"
+             "- content_summary must use frame_type + timestamp + on_screen_text + screen_parse.\n"
+             "  Cover headings, bullets, numbers, claims, and relationships visible on screen.\n"
+             "- combined_summary must combine utterance_summary + content_summary.\n"
+         )
+     elif frame_type == "code":
+         content_task = (
+             "For code:\n"
+             "- content_summary must use frame_type + timestamp + on_screen_text + screen_parse.\n"
+             "  Cover files/modules, functions/classes, logic, inputs/outputs, and config if visible.\n"
+             "- combined_summary must combine utterance_summary + content_summary.\n"
+         )
+     else:
+         content_task = (
+             "For demo:\n"
+             "- content_summary must use frame_type + timestamp + on_screen_text + screen_parse.\n"
+             "  Cover screens, controls, state transitions, and resulting behavior.\n"
+             "- combined_summary must combine utterance_summary + content_summary.\n"
+         )
+
+     output_rules = (
+         "OUTPUT_RULES (must follow exactly):\n"
+         "- Always populate: on_screen_text, speakers, utterance_summary, content_summary, combined_summary.\n"
+         "- utterance_summary must use utterance timestamps + speaker + text provided.\n"
+         "- content_summary must be grounded in frame_type + timestamp + on_screen_text + screen_parse.\n"
+         "- combined_summary must summarize utterance_summary and content_summary.\n"
+         "- If previous keyframe exists, frame_change must be present.\n"
+         "  - changed_summary must be only the difference between previous and current content_summary.\n"
+         "  - possible_reason remains grounded in utterances/on-screen evidence; else unknown.\n"
+         "  - added_elements and removed_elements must use provided diff lists.\n"
+         "- If no previous keyframe exists, frame_change must be null.\n"
+     )
+
+     user_prompt = (
+         f"{prev_block}"
+         f"CURRENT_KEYFRAME:\n"
+         f"- keyframe_idx: {frame.get('keyframe_idx')}\n"
+         f"- frame_type: {frame_type}\n"
+         f"- t_sec: {t_sec}\n"
+         f"- timestamp: {timestamp}\n"
+         f"- image_path: {frame.get('image_path')}\n"
+         f"- similarity_to_prev: {similarity_to_prev:.3f}\n"
+         f"- detected_speakers: {speakers}\n"
+         f"- utterance_time_range: {u_start_ts}-{u_end_ts}\n\n"
+         f"ON_SCREEN_TEXT (list):\n{on_screen_capped}\n\n"
+         f"SCREEN_PARSE (structured parse of current frame):\n{screen_parse_summary}\n\n"
+         f"ASSIGNED_UTTERANCES (time-stamped, includes speaker):\n{utterances_block}\n\n"
+         f"{transition_diff_block}"
+         f"{prev_content_summary_block}"
+         f"HISTORY_CONTEXT:\n{history_context}\n\n"
+         f"{output_rules}\n\n"
+         f"{reuse_instruction}\n"
+         f"{content_task}\n"
+         f"Now produce the JSON output for this keyframe following the schema."
+     )
+
+     return system_instruction, user_prompt
+
+
+ def keyframe_items(keyframes_data: Any) -> List[Dict[str, Any]]:
+     if isinstance(keyframes_data, dict):
+         return keyframes_data.get("keyframes", []) or []
+     if isinstance(keyframes_data, list):
+         return keyframes_data
+     return []
+
+
+ def main():
+     ap = argparse.ArgumentParser()
+     ap.add_argument("--keyframes", required=True, help="Path to keyframes_with_utterances.json")
+     ap.add_argument("--out", required=True, help="Output path for final JSON")
+     ap.add_argument("--model", default="gemini-2.5-flash", help="Gemini model id")
+     ap.add_argument("--similarity_threshold", type=float, default=0.82, help="Similarity threshold for 'reuse prev content'")
+     ap.add_argument("--temperature", type=float, default=0.2)
+     args = ap.parse_args()
+
+     log("Starting build_final_output.py ...")
+     log(f"Keyframes file: {args.keyframes}")
+     log(f"Output file: {args.out}")
+     log(f"Model: {args.model}")
+
+     keyframes_data = load_json(args.keyframes)
+     keyframes_list = keyframe_items(keyframes_data)
+     if not keyframes_list:
+         raise ValueError("No keyframes found in input keyframes file.")
+
+     # Process keyframes in chronological order.
+     keyframes_list = sorted(
+         keyframes_list,
+         key=lambda x: (
+             float(x.get("t_sec", 0.0)),
+             int(x.get("keyframe_idx", 0)),
+         ),
+     )
+
+     log(f"Loaded keyframes: {len(keyframes_list)}")
+
+     log("Initializing Gemini client (loading .env + API key)...")
+     client = gemini_client()
+     log("Gemini client ready.")
+
+     output = {
+         "meta": {
+             "keyframes_file": args.keyframes,
+             "model": args.model,
+             "generated_at_epoch": time.time(),
+             "rules": {
+                 "process_order": "keyframes in chronological order",
+                 "history": "recent detailed + long_memory compressed (diminishing returns)",
+                 "similarity_threshold": args.similarity_threshold,
+                 "transition_change_each_keyframe": True,
+                 "speakers_per_keyframe": True,
+                 "utterance_summary_requires_speaker_attribution": True,
+                 "content_summary_uses_screen_parse": True,
+                 "combined_summary_synthesizes_utterance_and_content": True,
+                 "change_summary_is_content_diff": True,
+             },
+         },
+         "keyframes": [],
+     }
+
+     history_state = HistoryState()
+
+     prev_frame_obj: Optional[Dict[str, Any]] = None
+     prev_frame_summary: Optional[Dict[str, Any]] = None
+
+     global_kf_done = 0
+     global_kf_total = len(keyframes_list)
+     log(f"Total keyframes to process: {global_kf_total}")
+
+     for frame in keyframes_list:
+         global_kf_done += 1
+         kf_idx = frame.get("keyframe_idx")
+         kf_ts = frame.get("timestamp") or sec_to_hhmmss(frame.get("t_sec", 0.0))
+         kf_type = (frame.get("frame_type") or "unknown").lower()
+         utt_count = len(frame.get("assigned_utterances") or [])
+         log(f"[{global_kf_done}/{global_kf_total}] Keyframe {kf_idx} @ {kf_ts} | type={kf_type} | utterances={utt_count}")
+
+         sig_cur = frame_signature(frame)
+         sig_prev = frame_signature(prev_frame_obj)
+         sim = jaccard_similarity(sig_prev, sig_cur) if prev_frame_obj else 0.0
+         is_similar = (prev_frame_obj is not None) and (sim >= args.similarity_threshold)
+         log(f"  similarity_to_prev={sim:.3f} | reused_prev_content={is_similar}")
+
+         transition_diff = None
+         if prev_frame_obj is not None:
+             prev_text = (prev_frame_obj.get("on_screen_text") or [])
+             cur_text = (frame.get("on_screen_text") or [])
+             added, removed = diff_lists(prev_text, cur_text, max_items=40)
+             transition_diff = {"added_elements": added, "removed_elements": removed}
+
+         history_context = history_state.build_history_context()
+
+         system_instruction, user_prompt = build_prompt_for_frame(
+             frame=frame,
+             history_context=history_context,
+             prev_frame=prev_frame_obj,
+             prev_content_summary=(prev_frame_summary or {}).get("content_summary"),
+             similarity_to_prev=sim,
+             is_similar=is_similar,
+             transition_diff=transition_diff,
+         )
+
+         log("  -> Calling Gemini ...")
+         t_call = time.time()
+         parsed = call_gemini_structured(
+             client=client,
+             model=args.model,
+             system_instruction=system_instruction,
+             user_prompt=user_prompt,
+             schema_model=FrameSummary,
+             temperature=args.temperature,
+             max_retries=3,
+         )
+         log(f"  <- Gemini done in {time.time() - t_call:.1f}s")
+
+         if isinstance(parsed, BaseModel):
+             parsed_dict = parsed.model_dump()
+         else:
+             parsed_dict = dict(parsed)
+
+         parsed_dict["similarity_to_prev"] = float(sim)
+         parsed_dict["reused_prev_content"] = bool(is_similar)
+         if "notes" not in parsed_dict:
+             parsed_dict["notes"] = []
+         if is_similar:
+             parsed_dict["notes"].append("High similarity to previous keyframe; instructed incremental update.")
+         if prev_frame_summary is not None:
+             parsed_dict["notes"].append("Keyframe-to-keyframe transition diff computed and provided (frame_change required).")
+
+         # Enforce change summary as strict diff of previous vs current content_summary.
+         if prev_frame_summary is None:
+             parsed_dict["frame_change"] = None
+         else:
+             prev_content_summary = (prev_frame_summary or {}).get("content_summary")
+             current_content_summary = parsed_dict.get("content_summary")
+             existing_change = parsed_dict.get("frame_change") or {}
+             if not isinstance(existing_change, dict):
+                 existing_change = {}
+             existing_change["changed_summary"] = build_content_change_summary(
+                 prev_content_summary=prev_content_summary,
+                 cur_content_summary=current_content_summary,
+             )
+             existing_change["possible_reason"] = str(existing_change.get("possible_reason", "")).strip() or "unknown"
+             existing_change["added_elements"] = (transition_diff or {}).get("added_elements", [])
+             existing_change["removed_elements"] = (transition_diff or {}).get("removed_elements", [])
+             parsed_dict["frame_change"] = existing_change
+
+         output["keyframes"].append(parsed_dict)
+
+         to_compress = history_state.add_frame(
+             frame_summary_obj={
+                 "timestamp": parsed_dict.get("timestamp"),
+                 "frame_type": parsed_dict.get("frame_type"),
+                 "combined_summary": parsed_dict.get("combined_summary", ""),
+             },
+             keep_recent=4,
+         )
+         if to_compress:
+             log(f"  -> Compressing {len(to_compress)} older frame(s) into LONG_MEMORY ...")
+             history_state.long_memory = compress_into_long_memory(
+                 client=client,
+                 model=args.model,
+                 existing_long_memory=history_state.long_memory,
+                 frames_to_compress=to_compress,
+                 max_chars=history_state.long_memory_max_chars,
+             )
+             log("  <- LONG_MEMORY updated.")
+
+         prev_frame_obj = frame
+         prev_frame_summary = parsed_dict
+
+     log("\nAll keyframes processed. Writing output JSON ...")
+     save_json(args.out, output)
+     log(f"Done. Wrote: {args.out}")
+
+
+ if __name__ == "__main__":
+     main()
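Grounded in the argparse flags above, a typical invocation of this stage (paths invented) is:

```
python pipelines/build_final_output.py --keyframes out_folder/keyframes_with_utterances.json --out out_folder/final_output.json --model gemini-2.5-flash --similarity_threshold 0.82 --temperature 0.2
```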
pipelines/build_final_output_demo_code.py ADDED
@@ -0,0 +1,549 @@
+ #!/usr/bin/env python3
+ """
+ Demo-only Gemini build stage (kept in demo-code route for compatibility).
+
+ Behavior:
+ - `demo` keyframes: summarized with Gemini.
+ - `slides`, `code`, and `none` keyframes: NO Gemini call; output is built from OCR + utterances.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import re
+ import time
+ from typing import Any, Dict, List, Optional, Tuple
+
+ from dotenv import load_dotenv
+ from pydantic import BaseModel, Field
+ from google import genai
+ from google.genai import types
+
+
+ def log(msg: str) -> None:
+     print(msg, flush=True)
+
+
+ def load_json(path: str) -> Any:
+     with open(path, "r", encoding="utf-8") as f:
+         return json.load(f)
+
+
+ def save_json(path: str, obj: Any) -> None:
+     out_dir = os.path.dirname(path)
+     if out_dir:
+         os.makedirs(out_dir, exist_ok=True)
+     with open(path, "w", encoding="utf-8") as f:
+         json.dump(obj, f, ensure_ascii=False, indent=2)
+
+
+ def sec_to_hhmmss(t: float) -> str:
+     t = max(0.0, float(t))
+     hh = int(t // 3600)
+     mm = int((t % 3600) // 60)
+     ss = int(t % 60)
+     return f"{hh:02d}:{mm:02d}:{ss:02d}"
+
+
+ def tokenize(s: str) -> List[str]:
+     s = s.lower()
+     s = re.sub(r"[^a-z0-9_]+", " ", s)
+     return [t for t in s.split() if t]
+
+
+ def jaccard_similarity(a: str, b: str) -> float:
+     sa, sb = set(tokenize(a)), set(tokenize(b))
+     if not sa and not sb:
+         return 1.0
+     if not sa or not sb:
+         return 0.0
+     return len(sa & sb) / max(1, len(sa | sb))
+
+
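A quick worked example of the similarity measure (strings invented):

```python
tokenize("Login page: error 403!")  # -> ["login", "page", "error", "403"]
jaccard_similarity("login page error", "login page loaded")
# intersection {login, page} = 2, union {login, page, error, loaded} = 4 -> 0.5
```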
65
+ def safe_join_text(lines: List[str], max_chars: int = 8000) -> str:
+     out = []
+     total = 0
+     for ln in lines:
+         if total + len(ln) + 1 > max_chars:
+             break
+         out.append(ln)
+         total += len(ln) + 1
+     return "\n".join(out)
+
+
+ def split_sentences(text: str) -> List[str]:
+     if not text:
+         return []
+     parts = re.split(r"(?<=[.!?])\s+", str(text).strip())
+     return [p.strip() for p in parts if p.strip()]
+
+
+ def build_content_change_summary(
+     prev_content_summary: Optional[str],
+     cur_content_summary: Optional[str],
+     max_items: int = 6,
+ ) -> str:
+     prev = (prev_content_summary or "").strip()
+     cur = (cur_content_summary or "").strip()
+     if not prev:
+         return "Initial keyframe in sequence; no previous content summary to diff against."
+     if not cur:
+         return "Current content summary is empty or unknown; unable to compute precise content diff."
+     if prev == cur:
+         return "No material content-summary change from the previous keyframe."
+
+     prev_sentences = split_sentences(prev)
+     cur_sentences = split_sentences(cur)
+     prev_set = set(prev_sentences)
+     cur_set = set(cur_sentences)
+
+     added = [s for s in cur_sentences if s not in prev_set][:max_items]
+     removed = [s for s in prev_sentences if s not in cur_set][:max_items]
+
+     if not added and not removed:
+         prev_tokens = set(tokenize(prev))
+         cur_tokens = set(tokenize(cur))
+         added_tokens = sorted(list(cur_tokens - prev_tokens))[:12]
+         removed_tokens = sorted(list(prev_tokens - cur_tokens))[:12]
+         if not added_tokens and not removed_tokens:
+             return "Content summary wording changed but underlying content differences are unclear."
+         out = []
+         if added_tokens:
+             out.append("Added/updated terms: " + ", ".join(added_tokens))
+         if removed_tokens:
+             out.append("Removed/de-emphasized terms: " + ", ".join(removed_tokens))
+         return " ".join(out)
+
+     chunks = []
+     if added:
+         chunks.append(
+             "Added/updated in current content summary: "
+             + " ; ".join(a[:240] for a in added)
+         )
+     if removed:
+         chunks.append(
+             "Removed/de-emphasized vs previous content summary: "
+             + " ; ".join(r[:240] for r in removed)
+         )
+     return " ".join(chunks).strip()
+
+
+ def frame_signature(frame: Optional[Dict[str, Any]]) -> str:
+     if not frame:
+         return ""
+     on_screen = frame.get("on_screen_text") or []
+     return safe_join_text([str(x) for x in on_screen[:120]], max_chars=3000)
+
+
+ def diff_lists(prev: List[str], cur: List[str], max_items: int = 25) -> Tuple[List[str], List[str]]:
+     prev_set, cur_set = set(prev), set(cur)
+     added = [x for x in cur if x not in prev_set][:max_items]
+     removed = [x for x in prev if x not in cur_set][:max_items]
+     return added, removed
+
+
+ def summarize_screen_parse(
+     screen_parse: Optional[Dict[str, Any]],
+     max_regions: int = 8,
+     max_region_lines: int = 12,
+     max_ocr_lines: int = 120,
+     max_chars: int = 9000,
+ ) -> str:
+     if not isinstance(screen_parse, dict) or not screen_parse:
+         return "unknown"
+
+     parts: List[str] = []
+     frame_w = screen_parse.get("frame_w")
+     frame_h = screen_parse.get("frame_h")
+     if frame_w is not None and frame_h is not None:
+         parts.append(f"frame_size: {frame_w}x{frame_h}")
+
+     regions = screen_parse.get("layout_regions") or []
+     if regions:
+         region_lines: List[str] = []
+         for i, region in enumerate(regions[:max_regions]):
+             label = region.get("label", "unknown")
+             conf = region.get("conf", "unknown")
+             box = region.get("box", [])
+             text_lines = region.get("text_lines") or []
+             text_lines_clean = [str(x).strip() for x in text_lines if str(x).strip()][:max_region_lines]
+             text_preview = " | ".join(text_lines_clean)
+             region_lines.append(
+                 f"region[{i}] label={label}, conf={conf}, box={box}, text_lines={text_preview}"
+             )
+         parts.append("layout_regions:\n" + "\n".join(region_lines))
+
+     ocr_lines = screen_parse.get("ocr_lines") or []
+     if ocr_lines:
+         ocr_text: List[str] = []
+         for item in ocr_lines[:max_ocr_lines]:
+             txt = str(item.get("text", "")).strip()
+             if txt:
+                 ocr_text.append(txt)
+         if ocr_text:
+             parts.append("ocr_lines:\n" + safe_join_text(ocr_text, max_chars=max_chars))
+
+     merged = "\n\n".join(parts).strip()
+     if not merged:
+         return "unknown"
+     return merged[:max_chars]
+
+
+ def extract_speakers_from_utterances(utterances: List[Dict[str, Any]]) -> List[str]:
+     seen = set()
+     out = []
+     for u in utterances or []:
+         spk = str(u.get("speaker", "")).strip() or "unknown"
+         if spk not in seen:
+             seen.add(spk)
+             out.append(spk)
+     return out
+
+
+ def utterance_time_bounds(utterances: List[Dict[str, Any]], default_t: float) -> Tuple[Optional[str], Optional[str]]:
+     if not utterances:
+         return None, None
+     starts = []
+     ends = []
+     for u in utterances:
+         try:
+             starts.append(float(u.get("_start_sec", u.get("start", default_t))))
+             ends.append(float(u.get("_end_sec", u.get("end", default_t))))
+         except Exception:
+             continue
+     if not starts or not ends:
+         return None, None
+     return sec_to_hhmmss(min(starts)), sec_to_hhmmss(max(ends))
+
+
+ def build_utterance_lines(utterances: List[Dict[str, Any]], max_lines: int = 80) -> List[str]:
+     lines: List[str] = []
+     for u in utterances[:max_lines]:
+         try:
+             s = float(u.get("_start_sec", u.get("start", 0.0)))
+             e = float(u.get("_end_sec", u.get("end", 0.0)))
+         except Exception:
+             s, e = 0.0, 0.0
+         spk = str(u.get("speaker", "unknown")).strip() or "unknown"
+         txt = (u.get("text", "") or "").strip()
+         if not txt:
+             continue
+         lines.append(f"[{sec_to_hhmmss(s)}-{sec_to_hhmmss(e)}][{spk}] {txt}")
+     return lines
+
+
+ def local_summary_for_non_demo(frame: Dict[str, Any]) -> Dict[str, str]:
+     frame_type = str(frame.get("frame_type", "unknown")).lower()
+     ocr_lines = [str(x).strip() for x in (frame.get("on_screen_text") or []) if str(x).strip()]
+     utter_lines = build_utterance_lines(frame.get("assigned_utterances") or [], max_lines=20)
+
+     if utter_lines:
+         utterance_summary = " | ".join(utter_lines[:8])
+     else:
+         utterance_summary = "No assigned utterances for this keyframe."
+
+     if ocr_lines:
+         content_summary = (
+             f"{frame_type.upper()} keyframe. OCR extracted on-screen text (top lines): "
+             + " | ".join(ocr_lines[:25])
+         )
+     else:
+         content_summary = f"{frame_type.upper()} keyframe. OCR text not available."
+
+     combined_summary = (
+         f"Local (no Gemini) summary for {frame_type} frame. "
+         f"Utterances: {utterance_summary} "
+         f"Content: {content_summary}"
+     )
+
+     return {
+         "utterance_summary": utterance_summary,
+         "content_summary": content_summary,
+         "combined_summary": combined_summary,
+     }
+
+
+ class DemoGeminiSummary(BaseModel):
+     utterance_summary: str = Field(
+         ...,
+         description="Summary of utterances for this frame with explicit speaker attribution where available.",
+     )
+     content_summary: str = Field(
+         ...,
+         description="Detailed description of what changed or is shown in this demo frame.",
+     )
+     combined_summary: str = Field(
+         ...,
+         description="Combined summary merging utterances and visual content.",
+     )
+
+
+ def gemini_client() -> genai.Client:
+     load_dotenv()
+     api_key = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
+     if not api_key:
+         raise ValueError("Missing GEMINI_API_KEY in environment (.env not loaded or key not set).")
+     return genai.Client(api_key=api_key)
+
+
+ def call_gemini_structured(
+     client: genai.Client,
+     model: str,
+     system_instruction: str,
+     user_prompt: str,
+     schema_model: Any,
+     temperature: float = 0.2,
+     max_retries: int = 3,
+ ) -> Any:
+     last_err = None
+     for attempt in range(1, max_retries + 1):
+         try:
+             resp = client.models.generate_content(
+                 model=model,
+                 contents=user_prompt,
+                 config=types.GenerateContentConfig(
+                     system_instruction=system_instruction,
+                     response_mime_type="application/json",
+                     response_schema=schema_model,
+                     temperature=temperature,
+                 ),
+             )
+             if getattr(resp, "parsed", None) is not None:
+                 return resp.parsed
+             txt = getattr(resp, "text", None)
+             if not txt:
+                 raise ValueError("Gemini returned no text/parsed output.")
+             return json.loads(txt)
+         except Exception as e:
+             last_err = e
+             time.sleep(0.7 * attempt)
+     raise RuntimeError(f"Gemini structured call failed after retries: {last_err}")
+
+
+ def build_demo_prompt(
+     frame: Dict[str, Any],
+     prev_content_summary: Optional[str],
+     similarity_to_prev: float,
+     is_similar: bool,
+ ) -> Tuple[str, str]:
+     frame_type = str(frame.get("frame_type", "unknown")).lower()
+     timestamp = frame.get("timestamp") or sec_to_hhmmss(frame.get("t_sec", 0.0))
+     t_sec = float(frame.get("t_sec", 0.0))
+     on_screen_text = frame.get("on_screen_text") or []
+     screen_parse_summary = summarize_screen_parse(frame.get("screen_parse") or {})
+     utterances_block = safe_join_text(
+         build_utterance_lines(frame.get("assigned_utterances") or [], max_lines=80),
+         max_chars=12000,
+     )
+     reuse_instruction = ""
+     if is_similar:
+         reuse_instruction = (
+             "Frame is highly similar to previous keyframe. Reuse context and focus on what changed.\n"
+         )
+
+     prev_block = "PREVIOUS_KEYFRAME_CONTENT_SUMMARY:\nnone\n"
+     if prev_content_summary:
+         prev_block = f"PREVIOUS_KEYFRAME_CONTENT_SUMMARY:\n{prev_content_summary}\n"
+
+     system_instruction = (
+         "You generate keyframe-level meeting notes for demo screens only.\n"
+         "Ground all claims in provided utterances and OCR/screen parse.\n"
+         "Do not invent facts.\n"
+         "Return strict JSON only following schema."
+     )
+
+     user_prompt = (
+         f"CURRENT_KEYFRAME:\n"
+         f"- frame_type: {frame_type}\n"
+         f"- keyframe_idx: {frame.get('keyframe_idx')}\n"
+         f"- t_sec: {t_sec}\n"
+         f"- timestamp: {timestamp}\n"
+         f"- image_path: {frame.get('image_path')}\n"
+         f"- similarity_to_prev: {similarity_to_prev:.3f}\n\n"
+         f"ON_SCREEN_TEXT:\n{on_screen_text[:350]}\n\n"
+         f"SCREEN_PARSE:\n{screen_parse_summary}\n\n"
+         f"ASSIGNED_UTTERANCES:\n{utterances_block}\n\n"
+         f"{prev_block}\n"
+         f"{reuse_instruction}\n"
+         f"Requirements:\n"
+         f"- utterance_summary: attribute statements to speakers when present.\n"
+         f"- content_summary: describe what is visible/changed in this frame.\n"
+         f"- combined_summary: merge utterance + visual context.\n"
+     )
+     return system_instruction, user_prompt
+
+
+ def keyframe_items(keyframes_data: Any) -> List[Dict[str, Any]]:
+     if isinstance(keyframes_data, dict):
+         return keyframes_data.get("keyframes", []) or []
+     if isinstance(keyframes_data, list):
+         return keyframes_data
+     return []
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser()
+     ap.add_argument("--keyframes", required=True, help="Path to keyframes_with_utterances.json")
+     ap.add_argument("--out", required=True, help="Output path for final JSON")
+     ap.add_argument("--model", default="gemini-2.5-flash", help="Gemini model id")
+     ap.add_argument("--similarity-threshold", type=float, default=0.82)
+     ap.add_argument("--temperature", type=float, default=0.2)
+     args = ap.parse_args()
+
+     keyframes_data = load_json(args.keyframes)
+     keyframes_list = keyframe_items(keyframes_data)
+     if not keyframes_list:
+         raise ValueError("No keyframes found in input keyframes file.")
+
+     keyframes_list = sorted(
+         keyframes_list,
+         key=lambda x: (float(x.get("t_sec", 0.0)), int(x.get("keyframe_idx", 0))),
+     )
+
+     demo_count = sum(1 for kf in keyframes_list if str(kf.get("frame_type", "")).lower() == "demo")
+     code_count = sum(1 for kf in keyframes_list if str(kf.get("frame_type", "")).lower() == "code")
+     gemini_target_count = demo_count
+     local_only_count = len(keyframes_list) - gemini_target_count
+     log(
+         f"Loaded keyframes: total={len(keyframes_list)} demo={demo_count} "
+         f"code={code_count} local_only={local_only_count}"
+     )
+
+     client: Optional[genai.Client] = None
+     if gemini_target_count > 0:
+         log("Initializing Gemini client (demo frames only)...")
+         client = gemini_client()
+         log("Gemini client ready.")
+
+     output: Dict[str, Any] = {
+         "meta": {
+             "keyframes_file": args.keyframes,
+             "model": args.model,
+             "generated_at_epoch": time.time(),
+             "rules": {
+                 "demo_frames_use_gemini": True,
+                 "slides_code_none_use_local_ocr_only": True,
+                 "similarity_threshold": args.similarity_threshold,
+                 "frame_change_is_deterministic_content_diff": True,
+             },
+             "counts": {
+                 "total_keyframes": len(keyframes_list),
+                 "demo_keyframes": demo_count,
+                 "code_keyframes": code_count,
+                 "gemini_keyframes": gemini_target_count,
+                 "local_only_keyframes": local_only_count,
+                 "gemini_calls": 0,
+             },
+         },
+         "keyframes": [],
+     }
+
+     prev_frame_obj: Optional[Dict[str, Any]] = None
+     prev_content_summary: Optional[str] = None
+
+     for idx, frame in enumerate(keyframes_list, start=1):
+         frame_type = str(frame.get("frame_type", "unknown")).lower()
+         t_sec = float(frame.get("t_sec", 0.0))
+         timestamp = frame.get("timestamp") or sec_to_hhmmss(t_sec)
+         on_screen_text = [str(x).strip() for x in (frame.get("on_screen_text") or []) if str(x).strip()]
+         assigned_utterances = frame.get("assigned_utterances") or []
+         speakers = extract_speakers_from_utterances(assigned_utterances)
+         utt_start_ts, utt_end_ts = utterance_time_bounds(assigned_utterances, default_t=t_sec)
+
+         sim = 0.0
+         is_similar = False
+         if prev_frame_obj is not None:
+             sim = jaccard_similarity(frame_signature(prev_frame_obj), frame_signature(frame))
+             is_similar = sim >= float(args.similarity_threshold)
+
+         log(
+             f"[{idx}/{len(keyframes_list)}] keyframe={frame.get('keyframe_idx')} "
+             f"type={frame_type} time={timestamp} similarity={sim:.3f}"
+         )
+
+         if frame_type == "demo":
+             if client is None:
+                 raise RuntimeError("Internal error: demo frame encountered but Gemini client is not initialized.")
+             system_instruction, user_prompt = build_demo_prompt(
+                 frame=frame,
+                 prev_content_summary=prev_content_summary,
+                 similarity_to_prev=sim,
+                 is_similar=is_similar,
+             )
+             t0 = time.time()
+             parsed = call_gemini_structured(
+                 client=client,
+                 model=args.model,
+                 system_instruction=system_instruction,
+                 user_prompt=user_prompt,
+                 schema_model=DemoGeminiSummary,
+                 temperature=args.temperature,
+                 max_retries=3,
+             )
+             log(f"  Gemini done in {time.time() - t0:.1f}s")
+             output["meta"]["counts"]["gemini_calls"] += 1
+             if isinstance(parsed, BaseModel):
+                 summary_payload = parsed.model_dump()
+             else:
+                 summary_payload = dict(parsed)
+             summary_source = "gemini_demo_only"
+         else:
+             summary_payload = local_summary_for_non_demo(frame)
+             summary_source = "local_ocr_only"
+
+         transition_diff = {"added_elements": [], "removed_elements": []}
+         if prev_frame_obj is not None:
+             prev_text = [str(x).strip() for x in (prev_frame_obj.get("on_screen_text") or []) if str(x).strip()]
+             cur_text = on_screen_text
+             added, removed = diff_lists(prev_text, cur_text, max_items=40)
+             transition_diff = {"added_elements": added, "removed_elements": removed}
+
+         frame_change = None
+         if prev_content_summary is not None:
+             frame_change = {
+                 "changed_summary": build_content_change_summary(
+                     prev_content_summary=prev_content_summary,
+                     cur_content_summary=summary_payload.get("content_summary"),
+                 ),
+                 "possible_reason": (
+                     "Computed from keyframe OCR and utterance differences; no transition LLM call used."
+                 ),
+                 "added_elements": transition_diff["added_elements"],
+                 "removed_elements": transition_diff["removed_elements"],
+             }
+
+         out_frame = {
+             "keyframe_idx": int(frame.get("keyframe_idx", idx - 1)),
+             "frame_type": frame_type,
+             "t_sec": t_sec,
+             "timestamp": timestamp,
+             "image_path": str(frame.get("image_path", "")),
+             "on_screen_text": on_screen_text[:400],
+             "speakers": speakers,
+             "utterance_time_start": utt_start_ts,
+             "utterance_time_end": utt_end_ts,
+             "utterance_summary": str(summary_payload.get("utterance_summary", "")).strip(),
+             "content_summary": str(summary_payload.get("content_summary", "")).strip(),
+             "combined_summary": str(summary_payload.get("combined_summary", "")).strip(),
+             "frame_change": frame_change,
+             "similarity_to_prev": float(sim),
+             "reused_prev_content": bool(is_similar and frame_type == "demo"),
+             "notes": [
+                 f"summary_source={summary_source}",
+                 "Only demo keyframes are sent to Gemini in this pipeline.",
+             ],
+         }
+
+         output["keyframes"].append(out_frame)
+         prev_frame_obj = frame
+         prev_content_summary = out_frame.get("content_summary")
+
+     save_json(args.out, output)
+     log(f"Done. Wrote: {args.out}")
+     log(f"Gemini calls made: {output['meta']['counts']['gemini_calls']}")
+
+
+ if __name__ == "__main__":
+     main()
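Mirroring the argparse block above, a typical invocation of the demo-only variant (paths invented) is:

```
python pipelines/build_final_output_demo_code.py --keyframes out_folder/keyframes_with_utterances.json --out out_folder/final_output.json --model gemini-2.5-flash --similarity-threshold 0.82
```

Note that this variant spells the threshold flag `--similarity-threshold`, while `build_final_output.py` above uses `--similarity_threshold`.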
pipelines/condense_final_output.py ADDED
@@ -0,0 +1,145 @@
+ # condense_final_output.py
+ # Usage:
+ #   python condense_final_output.py --in "C:\meet-agent\out_folder\final_output.json" --out "C:\meet-agent\out_folder\final_output_condensed.json"
+ #
+ # What it does:
+ # - Reads the "final_output.json" produced by your build script
+ # - Produces a condensed version with only:
+ #     - keyframe (idx, timestamp, type, t_sec, image_path)
+ #     - combined_summary
+ #     - changed_summary (from transition_change/frame_change/demo_change if present)
+ # - Supports both input schemas:
+ #     1) new: {"meta": ..., "keyframes": [...]}
+ #     2) old: {"meta": ..., "topics": [{"keyframes": [...]}]}
+
+ import argparse
+ import json
+ import os
+ from typing import Any, Dict, Optional
+
+
+ def load_json(path: str) -> Any:
+     with open(path, "r", encoding="utf-8") as f:
+         return json.load(f)
+
+
+ def save_json(path: str, obj: Any) -> None:
+     out_dir = os.path.dirname(path)
+     if out_dir:
+         os.makedirs(out_dir, exist_ok=True)
+     with open(path, "w", encoding="utf-8") as f:
+         json.dump(obj, f, ensure_ascii=False, indent=2)
+
+
+ def pick_changed_summary(kf: Dict[str, Any]) -> Optional[str]:
+     """
+     Tries multiple locations, because your schema may store change summaries under different keys
+     depending on how you implemented transitions.
+
+     Priority order:
+       1) transition_change.changed_summary
+       2) frame_change.changed_summary
+       3) demo_change.changed_summary
+       4) changed_summary at root (fallback)
+     """
+     for container_key in ("transition_change", "frame_change", "demo_change"):
+         container = kf.get(container_key)
+         if isinstance(container, dict):
+             cs = container.get("changed_summary")
+             if isinstance(cs, str) and cs.strip():
+                 return cs.strip()
+
+     cs_root = kf.get("changed_summary")
+     if isinstance(cs_root, str) and cs_root.strip():
+         return cs_root.strip()
+
+     return None
+
+
+ def condense_keyframe(kf: Dict[str, Any]) -> Dict[str, Any]:
+     return {
+         "keyframe": {
+             "keyframe_idx": kf.get("keyframe_idx"),
+             "timestamp": kf.get("timestamp"),
+             "frame_type": kf.get("frame_type"),
+             "t_sec": kf.get("t_sec"),
+             "image_path": kf.get("image_path"),
+         },
+         "combined_summary": kf.get("combined_summary"),
+         "changed_summary": pick_changed_summary(kf),
+     }
+
+
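Each condensed entry therefore looks roughly like this (values invented):

```python
# One element of the condensed "keyframes" list:
{
    "keyframe": {
        "keyframe_idx": 3,
        "timestamp": "00:04:10",
        "frame_type": "demo",
        "t_sec": 250.0,
        "image_path": "frames/kf_003.jpg",
    },
    "combined_summary": "Speaker 1 demonstrates the new login flow.",
    "changed_summary": "Added/updated in current content summary: Login form now shows an error banner.",
}
```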
74
+ out_meta: Dict[str, Any] = {
75
+ "source": final_obj.get("meta", {}),
76
+ "notes": "Condensed output: keyframe + combined_summary + changed_summary",
77
+ }
78
+
79
+ # New schema: root keyframes list
80
+ root_keyframes = final_obj.get("keyframes", [])
81
+ if isinstance(root_keyframes, list):
82
+ out: Dict[str, Any] = {
83
+ "meta": {**out_meta, "input_schema": "root_keyframes"},
84
+ "keyframes": [],
85
+ }
86
+ for kf in root_keyframes:
87
+ if not isinstance(kf, dict):
88
+ continue
89
+ out["keyframes"].append(condense_keyframe(kf))
90
+ return out
91
+
92
+ # Old schema: topics[] with keyframes[]
93
+ out = {
94
+ "meta": {**out_meta, "input_schema": "topics"},
95
+ "topics": [],
96
+ }
97
+
98
+ topics = final_obj.get("topics", [])
99
+ if not isinstance(topics, list):
100
+ topics = []
101
+
102
+ for t in topics:
103
+ if not isinstance(t, dict):
104
+ continue
105
+
106
+ topic_out = {
107
+ "topic": t.get("topic"),
108
+ "start": t.get("start"),
109
+ "end": t.get("end"),
110
+ "start_ts": t.get("start_ts"),
111
+ "end_ts": t.get("end_ts"),
112
+ "keyframes": [],
113
+ }
114
+
115
+ keyframes = t.get("keyframes", [])
116
+ if not isinstance(keyframes, list):
117
+ keyframes = []
118
+
119
+ for kf in keyframes:
120
+ if not isinstance(kf, dict):
121
+ continue
122
+ topic_out["keyframes"].append(condense_keyframe(kf))
123
+
124
+ out["topics"].append(topic_out)
125
+
126
+ return out
127
+
128
+
129
+ def main() -> None:
130
+ ap = argparse.ArgumentParser()
131
+ ap.add_argument("--in", dest="inp", required=True, help="Path to final_output.json")
132
+ ap.add_argument("--out", dest="out", required=True, help="Path to write condensed JSON")
133
+ args = ap.parse_args()
134
+
135
+ final_obj = load_json(args.inp)
136
+ if not isinstance(final_obj, dict):
137
+ raise ValueError("Input JSON root must be an object/dict (expected FinalOutput-like structure).")
138
+
139
+ condensed = condense(final_obj)
140
+ save_json(args.out, condensed)
141
+ print(f"Wrote condensed JSON: {args.out}")
142
+
143
+
144
+ if __name__ == "__main__":
145
+ main()
pipelines/deepgram_extract_utterances.py ADDED
@@ -0,0 +1,208 @@
+ #!/usr/bin/env python3
+ """
+ deepgram_extract_utterances.py
+
+ Extract speaker-attributed utterances (start, end, speaker, text)
+ from a meeting MP4 using Deepgram.
+ """
+
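Based on the argparse options defined in `main()` below, a typical run (filenames invented) is:

```
python pipelines/deepgram_extract_utterances.py meeting.mp4 -o utterances.json --raw deepgram_raw.json --model nova-3
```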
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import mimetypes
+ import os
+ import sys
+ import time
+ from typing import Any, Dict, List, Optional
+
+ import httpx
+ from dotenv import load_dotenv
+ from deepgram import DeepgramClient, PrerecordedOptions, FileSource
+
+
+ # load .env at startup
+ load_dotenv()
+
+
+ def _die(msg: str, code: int = 1) -> None:
+     print(f"Error: {msg}", file=sys.stderr)
+     sys.exit(code)
+
+
+ def _load_file_source(path: str):
+     if not os.path.isfile(path):
+         _die(f"File not found: {path}")
+
+     with open(path, "rb") as f:
+         data = f.read()
+
+     mime, _ = mimetypes.guess_type(path)
+     if not mime:
+         mime = "application/octet-stream"
+
+     # IMPORTANT: return a dict, NOT FileSource()
+     return {
+         "buffer": data,
+         "mimetype": mime,
+     }
+
+
+ def _extract_utterances(result: Dict[str, Any]) -> List[Dict[str, Any]]:
+     utterances = result.get("results", {}).get("utterances", [])
+     out: List[Dict[str, Any]] = []
+
+     for u in utterances:
+         out.append(
+             {
+                 "start": float(u.get("start", 0.0)),
+                 "end": float(u.get("end", 0.0)),
+                 "speaker": u.get("speaker"),
+                 "text": (u.get("transcript") or "").strip(),
+             }
+         )
+
+     return out
+
+
+ def _is_non_retryable_error(exc: Exception) -> bool:
+     code = getattr(exc, "status_code", None)
+     if isinstance(code, int) and 400 <= code < 500:
+         return True
+     status = getattr(exc, "status", None)
+     if isinstance(status, int) and 400 <= status < 500:
+         return True
+     msg = str(exc).lower()
+     # Deepgram SDK exceptions often encode status in message text.
+     if "status: 4" in msg or "bad request" in msg or "unsupported data" in msg:
+         return True
+     return False
+
+
+ def transcribe_and_extract(
+     path: str,
+     model: str = "nova-3",
+     language: Optional[str] = None,
+     request_timeout_sec: float = 1200.0,
+     connect_timeout_sec: float = 30.0,
+     retries: int = 3,
+     retry_backoff_sec: float = 2.0,
+ ) -> tuple[Dict[str, Any], Dict[str, Any]]:
+     api_key = os.getenv("DEEPGRAM_API_KEY")
+     if not api_key:
+         _die("DEEPGRAM_API_KEY not found in environment or .env")
+
+     client = DeepgramClient(api_key=api_key)
+
+     source = _load_file_source(path)
+
+     options_kwargs: Dict[str, Any] = {
+         "model": model,
+         "smart_format": True,
+         "punctuate": True,
+         "utterances": True,
+         "diarize": True,
+     }
+     if language:
+         options_kwargs["language"] = language
+
+     options = PrerecordedOptions(**options_kwargs)
+
+     # Deepgram SDK default HTTP timeout is 30s; long recordings often exceed that.
+     timeout = httpx.Timeout(float(request_timeout_sec), connect=float(connect_timeout_sec))
+     retries = max(1, int(retries))
+
+     last_err: Optional[Exception] = None
+     response = None
+     for attempt in range(1, retries + 1):
+         try:
+             response = client.listen.rest.v("1").transcribe_file(
+                 source,
+                 options,
+                 timeout=timeout,
+             )
+             break
+         except Exception as e:
+             last_err = e
+             if _is_non_retryable_error(e):
+                 # Client/input errors won't succeed on retry.
+                 raise
+             if attempt >= retries:
+                 raise
+             wait_sec = float(retry_backoff_sec) * attempt
+             print(
+                 f"Deepgram request failed (attempt {attempt}/{retries}): {type(e).__name__}: {e}. "
+                 f"Retrying in {wait_sec:.1f}s..."
+             )
+             time.sleep(wait_sec)
+
+     if response is None:
+         raise RuntimeError(f"Deepgram transcription failed after {retries} attempts: {last_err}")
+
+     result_dict = response.to_dict() if hasattr(response, "to_dict") else dict(response)
+
+     return {
+         "input_file": os.path.abspath(path),
+         "model": model,
+         "utterances": _extract_utterances(result_dict),
+     }, result_dict
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser()
+     parser.add_argument("input", help="Path to meeting file (.mp4, .wav, .mp3)")
+     parser.add_argument("-o", "--output", default="utterances.json")
+     parser.add_argument("--raw", help="Optional raw Deepgram response JSON")
+     parser.add_argument("--model", default="nova-3")
+     parser.add_argument("--language", help="Optional language code (e.g. en, en-US)")
+     parser.add_argument(
+         "--request-timeout-sec",
+         type=float,
+         default=1200.0,
+         help="HTTP request timeout for Deepgram API call (default: 1200s).",
+     )
+     parser.add_argument(
+         "--connect-timeout-sec",
+         type=float,
+         default=30.0,
+         help="HTTP connect timeout for Deepgram API call (default: 30s).",
+     )
+     parser.add_argument(
+         "--retries",
+         type=int,
+         default=3,
+         help="Number of retry attempts for Deepgram call (default: 3).",
+     )
+     parser.add_argument(
+         "--retry-backoff-sec",
+         type=float,
+         default=2.0,
+         help="Base retry backoff seconds; actual sleep is base * attempt (default: 2.0).",
+     )
+     args = parser.parse_args()
+
+     extracted, raw = transcribe_and_extract(
+         args.input,
+         model=args.model,
+         language=args.language,
+         request_timeout_sec=float(args.request_timeout_sec),
+         connect_timeout_sec=float(args.connect_timeout_sec),
+         retries=int(args.retries),
+         retry_backoff_sec=float(args.retry_backoff_sec),
+     )
+
+     with open(args.output, "w", encoding="utf-8") as f:
+         json.dump(extracted, f, ensure_ascii=False, indent=2)
+
+     if args.raw:
+         with open(args.raw, "w", encoding="utf-8") as f:
+             json.dump(raw, f, ensure_ascii=False, indent=2)
+
+     print(f"Saved utterances to {args.output}")
+     if args.raw:
+         print(f"Saved raw response to {args.raw}")
+
+
+ if __name__ == "__main__":
+     main()
pipelines/models/yolov8x-doclaynet.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7fd403628e5377fc08105df49489fc4a8997d1376589470865d874f1ee918317
+ size 136821929
pipelines/run_pipeline_all.py ADDED
@@ -0,0 +1,238 @@
+ #!/usr/bin/env python3
+ """
+ Pipeline orchestrator.
+
+ Runs:
+   1) deepgram_extract_utterances.py (parallel)
+   2) smart_keyframes_and_classify.py (parallel)
+   3) assign_utterances_to_keyframes.py (after 1+2)
+   4) build_final_output.py (after 3)
+   5) condense_final_output.py (after 4)
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import subprocess
+ import sys
+ import time
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ from pathlib import Path
+ from typing import Dict, List, Sequence, Tuple
+
+
+ def run_command(name: str, cmd: Sequence[str], cwd: Path) -> None:
+     start = time.perf_counter()
+     print(f"\n[{name}] START")
+     print(f"[{name}] CMD: {' '.join(cmd)}")
+     result = subprocess.run(cmd, cwd=str(cwd))
+     dur = time.perf_counter() - start
+     if result.returncode != 0:
+         raise RuntimeError(f"[{name}] failed with exit code {result.returncode}")
+     print(f"[{name}] DONE in {dur:.2f}s")
+
+
+ def run_parallel(commands: List[Tuple[str, List[str]]], cwd: Path) -> None:
+     if not commands:
+         return
+     with ThreadPoolExecutor(max_workers=len(commands)) as ex:
+         futures = {
+             ex.submit(run_command, name, cmd, cwd): name
+             for name, cmd in commands
+         }
+         for fut in as_completed(futures):
+             fut.result()
+
+
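For context, `run_parallel` fans commands out across threads and re-raises the first failure via `fut.result()`. A minimal sketch of how it gets used (the command lists here, including the `smart_keyframes_and_classify.py` flags, are invented placeholders):

```python
run_parallel(
    [
        ("deepgram", ["python", "deepgram_extract_utterances.py", "meeting.mp4", "-o", "utterances.json"]),
        ("keyframes", ["python", "smart_keyframes_and_classify.py", "--video", "meeting.mp4"]),
    ],
    cwd=Path("pipelines"),
)
# Both subprocesses run concurrently; a non-zero exit in either raises RuntimeError.
```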
47
+ def require_file(path: Path, step_name: str) -> None:
+     if not path.exists():
+         raise FileNotFoundError(f"[{step_name}] expected output not found: {path}")
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser(description="Run full meeting summarization pipeline.")
+     ap.add_argument("--video", required=True, help="Path to meeting video/audio input.")
+     ap.add_argument("--out", required=True, help="Output directory for pipeline artifacts.")
+
+     ap.add_argument("--python", default=sys.executable, help="Python executable to use.")
+
+     ap.add_argument("--deepgram-model", default="nova-3", help="Deepgram model.")
+     ap.add_argument("--deepgram-language", default=None, help="Deepgram language (optional).")
+     ap.add_argument(
+         "--deepgram-raw-out",
+         default=None,
+         help="Optional path for raw Deepgram response JSON.",
+     )
+     ap.add_argument(
+         "--deepgram-request-timeout-sec",
+         type=float,
+         default=1200.0,
+         help="HTTP request timeout for Deepgram call.",
+     )
+     ap.add_argument(
+         "--deepgram-connect-timeout-sec",
+         type=float,
+         default=30.0,
+         help="HTTP connect timeout for Deepgram call.",
+     )
+     ap.add_argument(
+         "--deepgram-retries",
+         type=int,
+         default=3,
+         help="Retry attempts for Deepgram call.",
+     )
+     ap.add_argument(
+         "--deepgram-retry-backoff-sec",
+         type=float,
+         default=2.0,
+         help="Base retry backoff seconds for Deepgram call.",
+     )
+     ap.add_argument(
+         "--force-deepgram",
+         action="store_true",
+         help="Re-run Deepgram even if utterances.json already exists.",
+     )
+
+     ap.add_argument("--force-keyframes", action="store_true", help="Pass --force to smart keyframe script.")
+     ap.add_argument("--pre-roll-sec", type=float, default=3.0, help="Pre-roll seconds for utterance assignment.")
+
+     ap.add_argument("--gemini-model", default="gemini-2.5-flash", help="Gemini model id.")
+     ap.add_argument("--similarity-threshold", type=float, default=0.82, help="Similarity threshold for build step.")
+     ap.add_argument("--temperature", type=float, default=0.2, help="Gemini temperature for build step.")
+     args = ap.parse_args()
+
+     repo_dir = Path(__file__).resolve().parent
+     out_dir = Path(args.out).resolve()
+     out_dir.mkdir(parents=True, exist_ok=True)
+
+     video_path = Path(args.video).resolve()
+     if not video_path.exists():
+         raise FileNotFoundError(f"Input video not found: {video_path}")
+
+     deepgram_script = repo_dir / "deepgram_extract_utterances.py"
+     smart_kf_script = repo_dir / "smart_keyframes_and_classify.py"
+     assign_script = repo_dir / "assign_utterances_to_keyframes.py"
+     build_script = repo_dir / "build_final_output.py"
+     condense_script = repo_dir / "condense_final_output.py"
+
+     for s in [deepgram_script, smart_kf_script, assign_script, build_script, condense_script]:
+         if not s.exists():
+             raise FileNotFoundError(f"Script not found: {s}")
+
+     utterances_json = out_dir / "utterances.json"
+     keyframes_parsed_json = out_dir / "keyframes_parsed.json"
+     keyframes_with_utterances_json = out_dir / "keyframes_with_utterances.json"
+     final_output_json = out_dir / "final_output.json"
+     final_output_condensed_json = out_dir / "final_output_condensed.json"
+     deepgram_raw_json = Path(args.deepgram_raw_out).resolve() if args.deepgram_raw_out else None
+
+     python_exe = str(Path(args.python))
+
+     # 1 + 2 in parallel
+     deepgram_cmd = [
+         python_exe,
+         str(deepgram_script),
+         str(video_path),
+         "-o",
+         str(utterances_json),
+         "--model",
+         str(args.deepgram_model),
+         "--request-timeout-sec",
+         str(args.deepgram_request_timeout_sec),
+         "--connect-timeout-sec",
+         str(args.deepgram_connect_timeout_sec),
+         "--retries",
+         str(args.deepgram_retries),
+         "--retry-backoff-sec",
+         str(args.deepgram_retry_backoff_sec),
149
+ if args.deepgram_language:
150
+ deepgram_cmd.extend(["--language", str(args.deepgram_language)])
151
+ if deepgram_raw_json is not None:
152
+ deepgram_cmd.extend(["--raw", str(deepgram_raw_json)])
153
+
154
+ smart_kf_cmd = [
155
+ python_exe,
156
+ str(smart_kf_script),
157
+ "--video",
158
+ str(video_path),
159
+ "--out",
160
+ str(out_dir),
161
+ ]
162
+ if args.force_keyframes:
163
+ smart_kf_cmd.append("--force")
164
+
165
+ parallel_commands: List[Tuple[str, List[str]]] = []
166
+ if args.force_deepgram or (not utterances_json.exists()):
167
+ parallel_commands.append(("deepgram_extract_utterances", deepgram_cmd))
168
+ else:
169
+ print(f"[deepgram_extract_utterances] SKIP (exists): {utterances_json}")
170
+
171
+ if args.force_keyframes or (not keyframes_parsed_json.exists()):
172
+ parallel_commands.append(("smart_keyframes_and_classify", smart_kf_cmd))
173
+ else:
174
+ print(f"[smart_keyframes_and_classify] SKIP (exists): {keyframes_parsed_json}")
175
+
176
+ if parallel_commands:
177
+ print("Running Step 1+2 in parallel...")
178
+ run_parallel(parallel_commands, cwd=repo_dir)
179
+ else:
180
+ print("Skipping Step 1+2 (all required artifacts already exist).")
181
+
182
+ require_file(utterances_json, "deepgram_extract_utterances")
183
+ require_file(keyframes_parsed_json, "smart_keyframes_and_classify")
184
+
185
+ # 3 assign
186
+ assign_cmd = [
187
+ python_exe,
188
+ str(assign_script),
189
+ str(keyframes_parsed_json),
190
+ str(utterances_json),
191
+ "-o",
192
+ str(keyframes_with_utterances_json),
193
+ "--pre-roll-sec",
194
+ str(args.pre_roll_sec),
195
+ ]
196
+ run_command("assign_utterances_to_keyframes", assign_cmd, cwd=repo_dir)
197
+ require_file(keyframes_with_utterances_json, "assign_utterances_to_keyframes")
198
+
199
+ # 4 build
200
+ build_cmd = [
201
+ python_exe,
202
+ str(build_script),
203
+ "--keyframes",
204
+ str(keyframes_with_utterances_json),
205
+ "--out",
206
+ str(final_output_json),
207
+ "--model",
208
+ str(args.gemini_model),
209
+ "--similarity_threshold",
210
+ str(args.similarity_threshold),
211
+ "--temperature",
212
+ str(args.temperature),
213
+ ]
214
+ run_command("build_final_output", build_cmd, cwd=repo_dir)
215
+ require_file(final_output_json, "build_final_output")
216
+
217
+ # 5 condense
218
+ condense_cmd = [
219
+ python_exe,
220
+ str(condense_script),
221
+ "--in",
222
+ str(final_output_json),
223
+ "--out",
224
+ str(final_output_condensed_json),
225
+ ]
226
+ run_command("condense_final_output", condense_cmd, cwd=repo_dir)
227
+ require_file(final_output_condensed_json, "condense_final_output")
228
+
229
+ print("\nPipeline completed successfully.")
230
+ print(f"Utterances: {utterances_json}")
231
+ print(f"Keyframes parsed: {keyframes_parsed_json}")
232
+ print(f"Keyframes+utterances: {keyframes_with_utterances_json}")
233
+ print(f"Final output: {final_output_json}")
234
+ print(f"Condensed output: {final_output_condensed_json}")
235
+
236
+
237
+ if __name__ == "__main__":
238
+ main()
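
A minimal invocation sketch for this orchestrator, using the same `subprocess` pattern as its own `run_command`; the video and output paths are hypothetical:

```python
# Minimal sketch: run the full pipeline end to end from another script.
# "meeting.mp4" and "out_run1" are hypothetical paths.
import subprocess
import sys

subprocess.run(
    [
        sys.executable,
        "pipelines/run_pipeline_all.py",
        "--video", "meeting.mp4",
        "--out", "out_run1",
    ],
    check=True,  # raise CalledProcessError if any pipeline step fails
)
# On success, out_run1/ holds utterances.json, keyframes_parsed.json,
# keyframes_with_utterances.json, final_output.json, and final_output_condensed.json.
```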
pipelines/run_pipeline_demo_code.py ADDED
@@ -0,0 +1,239 @@
+ #!/usr/bin/env python3
+ """
+ Demo-only Gemini pipeline orchestrator (kept in demo-code route for compatibility).
+
+ Pipeline steps:
+ 1) deepgram_extract_utterances.py (parallel)
+ 2) smart_keyframes_and_classify.py (parallel)
+ 3) assign_utterances_to_keyframes.py
+ 4) build_final_output_demo_code.py (Gemini for demo only; slides+code local OCR/transcript)
+ 5) condense_final_output.py
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import subprocess
+ import sys
+ import time
+ from concurrent.futures import ThreadPoolExecutor, as_completed
+ from pathlib import Path
+ from typing import List, Sequence, Tuple
+
+
+ def run_command(name: str, cmd: Sequence[str], cwd: Path) -> None:
+     start = time.perf_counter()
+     print(f"\n[{name}] START")
+     print(f"[{name}] CMD: {' '.join(cmd)}")
+     result = subprocess.run(cmd, cwd=str(cwd))
+     dur = time.perf_counter() - start
+     if result.returncode != 0:
+         raise RuntimeError(f"[{name}] failed with exit code {result.returncode}")
+     print(f"[{name}] DONE in {dur:.2f}s")
+
+
+ def run_parallel(commands: List[Tuple[str, List[str]]], cwd: Path) -> None:
+     if not commands:
+         return
+     with ThreadPoolExecutor(max_workers=len(commands)) as ex:
+         futures = {ex.submit(run_command, name, cmd, cwd): name for name, cmd in commands}
+         for fut in as_completed(futures):
+             fut.result()
+
+
+ def require_file(path: Path, step_name: str) -> None:
+     if not path.exists():
+         raise FileNotFoundError(f"[{step_name}] expected output not found: {path}")
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser(description="Run demo-only Gemini meeting pipeline (demo-code route alias).")
+     ap.add_argument("--video", required=True, help="Path to meeting video/audio input.")
+     ap.add_argument("--out", required=True, help="Output directory for pipeline artifacts.")
+
+     ap.add_argument("--python", default=sys.executable, help="Python executable to use.")
+
+     ap.add_argument("--deepgram-model", default="nova-3", help="Deepgram model.")
+     ap.add_argument("--deepgram-language", default=None, help="Deepgram language (optional).")
+     ap.add_argument(
+         "--deepgram-raw-out",
+         default=None,
+         help="Optional path for raw Deepgram response JSON.",
+     )
+     ap.add_argument(
+         "--deepgram-request-timeout-sec",
+         type=float,
+         default=1200.0,
+         help="HTTP request timeout for Deepgram call.",
+     )
+     ap.add_argument(
+         "--deepgram-connect-timeout-sec",
+         type=float,
+         default=30.0,
+         help="HTTP connect timeout for Deepgram call.",
+     )
+     ap.add_argument(
+         "--deepgram-retries",
+         type=int,
+         default=3,
+         help="Retry attempts for Deepgram call.",
+     )
+     ap.add_argument(
+         "--deepgram-retry-backoff-sec",
+         type=float,
+         default=2.0,
+         help="Base retry backoff seconds for Deepgram call.",
+     )
+     ap.add_argument(
+         "--force-deepgram",
+         action="store_true",
+         help="Re-run Deepgram even if utterances.json already exists.",
+     )
+
+     ap.add_argument("--force-keyframes", action="store_true", help="Pass --force to smart keyframe script.")
+     ap.add_argument("--pre-roll-sec", type=float, default=3.0, help="Pre-roll seconds for utterance assignment.")
+
+     ap.add_argument("--gemini-model", default="gemini-2.5-flash", help="Gemini model id.")
+     ap.add_argument(
+         "--similarity-threshold",
+         type=float,
+         default=0.82,
+         help="Similarity threshold for demo prompt reuse logic.",
+     )
+     ap.add_argument("--temperature", type=float, default=0.2, help="Gemini temperature for demo keyframes.")
+     args = ap.parse_args()
+
+     pipeline_dir = Path(__file__).resolve().parent
+     repo_dir = pipeline_dir
+
+     out_dir = Path(args.out).resolve()
+     out_dir.mkdir(parents=True, exist_ok=True)
+
+     video_path = Path(args.video).resolve()
+     if not video_path.exists():
+         raise FileNotFoundError(f"Input video not found: {video_path}")
+
+     deepgram_script = repo_dir / "deepgram_extract_utterances.py"
+     smart_kf_script = repo_dir / "smart_keyframes_and_classify.py"
+     assign_script = repo_dir / "assign_utterances_to_keyframes.py"
+     build_demo_script = pipeline_dir / "build_final_output_demo_code.py"
+     condense_script = repo_dir / "condense_final_output.py"
+
+     for s in [deepgram_script, smart_kf_script, assign_script, build_demo_script, condense_script]:
+         if not s.exists():
+             raise FileNotFoundError(f"Script not found: {s}")
+
+     utterances_json = out_dir / "utterances.json"
+     keyframes_parsed_json = out_dir / "keyframes_parsed.json"
+     keyframes_with_utterances_json = out_dir / "keyframes_with_utterances.json"
+     final_output_json = out_dir / "final_output_demo_code.json"
+     final_output_condensed_json = out_dir / "final_output_demo_code_condensed.json"
+     deepgram_raw_json = Path(args.deepgram_raw_out).resolve() if args.deepgram_raw_out else None
+
+     python_exe = str(Path(args.python))
+
+     deepgram_cmd = [
+         python_exe,
+         str(deepgram_script),
+         str(video_path),
+         "-o",
+         str(utterances_json),
+         "--model",
+         str(args.deepgram_model),
+         "--request-timeout-sec",
+         str(args.deepgram_request_timeout_sec),
+         "--connect-timeout-sec",
+         str(args.deepgram_connect_timeout_sec),
+         "--retries",
+         str(args.deepgram_retries),
+         "--retry-backoff-sec",
+         str(args.deepgram_retry_backoff_sec),
+     ]
+     if args.deepgram_language:
+         deepgram_cmd.extend(["--language", str(args.deepgram_language)])
+     if deepgram_raw_json is not None:
+         deepgram_cmd.extend(["--raw", str(deepgram_raw_json)])
+
+     smart_kf_cmd = [
+         python_exe,
+         str(smart_kf_script),
+         "--video",
+         str(video_path),
+         "--out",
+         str(out_dir),
+         "--no-yolo-for-non-demo",
+     ]
+     if args.force_keyframes:
+         smart_kf_cmd.append("--force")
+
+     parallel_commands: List[Tuple[str, List[str]]] = []
+     if args.force_deepgram or (not utterances_json.exists()):
+         parallel_commands.append(("deepgram_extract_utterances", deepgram_cmd))
+     else:
+         print(f"[deepgram_extract_utterances] SKIP (exists): {utterances_json}")
+
+     if args.force_keyframes or (not keyframes_parsed_json.exists()):
+         parallel_commands.append(("smart_keyframes_and_classify", smart_kf_cmd))
+     else:
+         print(f"[smart_keyframes_and_classify] SKIP (exists): {keyframes_parsed_json}")
+
+     if parallel_commands:
+         print("Running Step 1+2 in parallel...")
+         run_parallel(parallel_commands, cwd=repo_dir)
+     else:
+         print("Skipping Step 1+2 (all required artifacts already exist).")
+
+     require_file(utterances_json, "deepgram_extract_utterances")
+     require_file(keyframes_parsed_json, "smart_keyframes_and_classify")
+
+     assign_cmd = [
+         python_exe,
+         str(assign_script),
+         str(keyframes_parsed_json),
+         str(utterances_json),
+         "-o",
+         str(keyframes_with_utterances_json),
+         "--pre-roll-sec",
+         str(args.pre_roll_sec),
+     ]
+     run_command("assign_utterances_to_keyframes", assign_cmd, cwd=repo_dir)
+     require_file(keyframes_with_utterances_json, "assign_utterances_to_keyframes")
+
+     build_cmd = [
+         python_exe,
+         str(build_demo_script),
+         "--keyframes",
+         str(keyframes_with_utterances_json),
+         "--out",
+         str(final_output_json),
+         "--model",
+         str(args.gemini_model),
+         "--similarity-threshold",
+         str(args.similarity_threshold),
+         "--temperature",
+         str(args.temperature),
+     ]
+     run_command("build_final_output_demo_code", build_cmd, cwd=repo_dir)
+     require_file(final_output_json, "build_final_output_demo_code")
+
+     condense_cmd = [
+         python_exe,
+         str(condense_script),
+         "--in",
+         str(final_output_json),
+         "--out",
+         str(final_output_condensed_json),
+     ]
+     run_command("condense_final_output", condense_cmd, cwd=repo_dir)
+     require_file(final_output_condensed_json, "condense_final_output")
+
+     print("\nDemo-only Gemini pipeline completed successfully.")
+     print(f"Utterances: {utterances_json}")
+     print(f"Keyframes parsed: {keyframes_parsed_json}")
+     print(f"Keyframes+utterances: {keyframes_with_utterances_json}")
+     print(f"Final output (demo-only Gemini): {final_output_json}")
+     print(f"Condensed output (demo-only Gemini): {final_output_condensed_json}")
+
+
+ if __name__ == "__main__":
+     main()
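
The demo-code variant writes differently named final artifacts (`final_output_demo_code.json` and `final_output_demo_code_condensed.json`). A short sketch of loading the condensed result after a run; the output directory is hypothetical, and the JSON schema is whatever `condense_final_output.py` emits, so it is inspected rather than assumed:

```python
# Minimal sketch: load the condensed artifact produced by the demo-code route.
import json
from pathlib import Path

out_dir = Path("out_run1")  # hypothetical --out directory
condensed_path = out_dir / "final_output_demo_code_condensed.json"

with condensed_path.open(encoding="utf-8") as f:
    condensed = json.load(f)

# Inspect the top-level structure before relying on specific keys.
print(type(condensed).__name__)
```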
pipelines/smart_keyframes_and_classify.py ADDED
@@ -0,0 +1,1443 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # smart_keyframes_and_classify.py
2
+ import argparse
3
+ import json
4
+ import os
5
+ import time
6
+ from dataclasses import dataclass
7
+ from pathlib import Path
8
+ from typing import Any, Dict, List, Optional, Tuple
9
+ import re
10
+ import concurrent.futures as cf
11
+
12
+ import cv2
13
+ import numpy as np
14
+ from dotenv import load_dotenv
15
+
16
+ try:
17
+ import clip
18
+ import torch
19
+ from PIL import Image
20
+ except Exception:
21
+ clip = None
22
+ torch = None
23
+ Image = None
24
+
25
+ # Local models (layout + OCR)
26
+ # pip install ultralytics paddleocr paddlepaddle opencv-python numpy python-dotenv
27
+ from ultralytics import YOLO
28
+
29
+ # Avoid oneDNN fused-conv issues seen in some Paddle/PaddleOCR builds on CPU.
30
+ # Use hard overrides (not setdefault) so shell/.env values cannot re-enable it.
31
+ os.environ["FLAGS_use_mkldnn"] = "0"
32
+ os.environ["FLAGS_enable_mkldnn"] = "0"
33
+ os.environ["FLAGS_use_onednn"] = "0"
34
+
35
+ # Compatibility patch for NumPy>=2 with imgaug (transitive dep of PaddleOCR).
36
+ # imgaug expects np.sctypes, removed in NumPy 2.0.
37
+ if not hasattr(np, "sctypes"):
38
+ def _np_type(name: str, default):
39
+ return getattr(np, name, default)
40
+
41
+ np.sctypes = {
42
+ "int": [_np_type("int8", int), _np_type("int16", int), _np_type("int32", int), _np_type("int64", int)],
43
+ "uint": [_np_type("uint8", int), _np_type("uint16", int), _np_type("uint32", int), _np_type("uint64", int)],
44
+ "float": [_np_type("float16", float), _np_type("float32", float), _np_type("float64", float)],
45
+ "complex": [_np_type("complex64", complex), _np_type("complex128", complex)],
46
+ "others": [_np_type("bool_", bool), _np_type("object_", object), _np_type("str_", str), _np_type("bytes_", bytes)],
47
+ }
48
+
49
+ from paddleocr import PaddleOCR
50
+
51
+
52
+ # ============================================================
53
+ # EDIT THESE IN CODE (no tuning args needed in the command)
54
+ # ============================================================
55
+
56
+
57
+ def _env_bool(name: str, default: bool) -> bool:
58
+ raw = os.getenv(name)
59
+ if raw is None:
60
+ return bool(default)
61
+ return str(raw).strip().lower() in {"1", "true", "yes", "y", "on"}
62
+
63
+
64
+ def _auto_has_cuda() -> bool:
65
+ try:
66
+ return bool(torch is not None and torch.cuda.is_available())
67
+ except Exception:
68
+ return False
69
+
70
+ # Candidate sampling (local, no API)
71
+ SAMPLE_FPS = 1.0
72
+ RESIZE_W = 360
73
+ CANDIDATE_PERCENTILE = 70.0
74
+ MAX_CANDIDATES = 180
75
+
76
+ # Final cap
77
+ MAX_FRAMES = 150
78
+
79
+ # Fast/parse resize for local inference (CLIP)
80
+ FAST_FRAME_MAX_W = 720
81
+
82
+ # Parallelism removed (no LLM calls)
83
+ BASE_SLEEP_SEC = 0.0
84
+
85
+ # Local screen parsing (required)
86
+ ENABLE_LOCAL_SCREEN_PARSE = True
87
+
88
+ # Layout detector weights (DocLayNet-style YOLO weights recommended)
89
+ # Example: models/yolov8n-doclaynet.pt
90
+ LAYOUT_YOLO_WEIGHTS = os.getenv("LAYOUT_YOLO_WEIGHTS", "models/yolov8x-doclaynet.pt")
91
+ LAYOUT_CONF = float(os.getenv("LAYOUT_CONF", "0.25"))
92
+ LAYOUT_IOU = float(os.getenv("LAYOUT_IOU", "0.45"))
93
+
94
+ # YOLO runtime settings
95
+ # Defaults are deployment-safe (CPU on non-GPU hosts), but can be overridden via env.
96
+ YOLO_DEVICE = os.getenv("YOLO_DEVICE", "0" if _auto_has_cuda() else "cpu")
97
+ YOLO_IMGSZ = int(os.getenv("YOLO_IMGSZ", "640")) # try 512 for more speed if acceptable
98
+
99
+ # OCR
100
+ OCR_LANG = os.getenv("OCR_LANG", "en")
101
+ OCR_MIN_CONF = float(os.getenv("OCR_MIN_CONF", "0.45"))
102
+
103
+ # OCR runtime settings (GPU + crop-only OCR)
104
+ USE_GPU = _env_bool("OCR_GPU", _auto_has_cuda())
105
+ OCR_CROP_MAX_REGIONS = int(os.getenv("OCR_CROP_MAX_REGIONS", "10"))
106
+
107
+ # Downscale OCR crops by frame type (slides/demo faster; code keeps max)
108
+ OCR_CROP_SCALE_BY_TYPE = {
109
+ "slides": float(os.getenv("OCR_CROP_SCALE_SLIDES", "0.80")),
110
+ "demo": float(os.getenv("OCR_CROP_SCALE_DEMO", "0.75")),
111
+ "code": float(os.getenv("OCR_CROP_SCALE_CODE", "1.00")),
112
+ "none": float(os.getenv("OCR_CROP_SCALE_NONE", "0.75")),
113
+ }
114
+
115
+ # Resize input frame BEFORE YOLO+OCR in step 3 (slides/demo smaller; code max)
116
+ PARSE_MAX_W_BY_TYPE = {
117
+ "slides": int(os.getenv("PARSE_MAX_W_SLIDES", "1280")),
118
+ "demo": int(os.getenv("PARSE_MAX_W_DEMO", "1280")),
119
+ "none": int(os.getenv("PARSE_MAX_W_NONE", "1280")),
120
+ "code": int(os.getenv("PARSE_MAX_W_CODE", "99999")), # effectively "no resize"
121
+ }
122
+
123
+ # CLIP frame type classifier
124
+ # -----------------------------
125
+ # CLIP setup (more robust, fewer “code” false-positives)
126
+ # Strategy:
127
+ # 1) Use multiple POS prompts per class (ensembling)
128
+ # 2) Add NEG prompts per class (especially for "code") and score = mean(pos) - mean(neg)
129
+ # This makes "slides with code screenshots" stay as slides, and prevents "demo with code words" -> code.
130
+ # -----------------------------
131
+
132
+ CLIP_MODEL_NAME = os.getenv("CLIP_MODEL_NAME", "ViT-B/32")
133
+
134
+ # class labels (keep as-is)
135
+ CLIP_CLASS_LABELS = ["slides", "code", "demo", "none"]
136
+
137
+ # scoring mode used by your classifier code (implement if you haven't):
138
+ # score(class) = mean(sim(image, pos_prompts)) - mean(sim(image, neg_prompts))
139
+ CLIP_SCORE_MODE = os.getenv("CLIP_SCORE_MODE", "pos_minus_neg")
140
+
141
+ # If your pipeline supports a minimum margin between top-1 and top-2 to accept the prediction:
142
+ # (helps when frames are ambiguous)
143
+ CLIP_MIN_MARGIN = float(os.getenv("CLIP_MIN_MARGIN", "0.03"))
144
+
145
+ # Prompt bank: POS and NEG per class
146
+ CLIP_PROMPT_BANK = {
147
+ "slides": {
148
+ "pos": [
149
+ "a screenshot of a presentation slide (PowerPoint or Google Slides)",
150
+ "a slide with a large title at the top and bullet points below",
151
+ "a slide canvas with wide margins and centered content",
152
+ "a lecture slide with sections, headings, and bullet lists",
153
+ "a slide that may include a small embedded screenshot (code or UI) but is still a slide",
154
+ "a shared slide deck page in a video meeting (16:9 slide layout)",
155
+ ],
156
+ "neg": [
157
+ "a full screen web application dashboard with navigation sidebar",
158
+ "a desktop application interface with many clickable controls",
159
+ "a full screen code editor filling the screen",
160
+ "a terminal window filling the screen",
161
+ "a webcam grid of meeting participants",
162
+ ],
163
+ },
164
+
165
+ "code": {
166
+ "pos": [
167
+ "a full screen code editor filling most of the screen with many lines of code",
168
+ "an IDE with syntax highlighting and line numbers, code dominates the screen",
169
+ "a programming editor with file tree sidebar and editor pane, not inside a slide",
170
+ "a terminal and code editor side by side with readable code dominating",
171
+ ],
172
+ "neg": [
173
+ "a presentation slide that contains a screenshot of code",
174
+ "a slide with a code snippet as part of a slide deck",
175
+ "a slide with a code image and slide title and bullets",
176
+ "a demo UI screen that contains a small code panel",
177
+ ],
178
+ },
179
+
180
+ "demo": {
181
+ "pos": [
182
+ "a web application dashboard with a left navigation sidebar and multiple panels",
183
+ "a product user interface with buttons, menus, input fields, and toolbars",
184
+ "a browser-based app with tabs, filters, tables, charts, and navigation",
185
+ "a desktop software UI with controls, forms, and interactive elements",
186
+ "a product demo screen where the interface fills the screen (not a slide canvas)",
187
+ ],
188
+ "neg": [
189
+ "a PowerPoint or Google Slides presentation slide",
190
+ "a slide with title at top and bullet points",
191
+ "a slide deck page with large margins and a single canvas",
192
+ "a slide with an embedded screenshot of a UI",
193
+ "a slide with a cursor hovering over a tab",
194
+ "a slide with a code snippet or code screenshot",
195
+ ],
196
+ },
197
+
198
+ "none": {
199
+ "pos": [
200
+ "a video call gallery view with participants and no shared screen",
201
+ "a mostly blank screen or black screen",
202
+ "a blurred transition frame with no readable content",
203
+ "a loading screen with minimal content",
204
+ ],
205
+ "neg": [
206
+ "a presentation slide",
207
+ "a web application dashboard",
208
+ "a full screen code editor",
209
+ ],
210
+ },
211
+ }
212
+
213
+ CLIP_CLASS_PROMPTS = [CLIP_PROMPT_BANK[c]["pos"] for c in CLIP_CLASS_LABELS]
214
+ CLIP_CLASS_NEG_PROMPTS = [CLIP_PROMPT_BANK[c]["neg"] for c in CLIP_CLASS_LABELS]
215
+
216
+ # Caps for JSON size
217
+ MAX_OCR_LINES = 300
218
+
219
+ # ---- NEW: hard global time gap between kept keyframes ----
220
+ MIN_KEYFRAME_GAP_SEC = 3.0
221
+
222
+ # Sensitivity rules (VISUAL ONLY)
223
+ SENS = {
224
+ "slides": {"min_gap_sec": 1.2, "diff_mult": 1.60},
225
+ "code": {"min_gap_sec": 0.8, "diff_mult": 0.70},
226
+ "demo": {"min_gap_sec": 0.45, "diff_mult": 0.60},
227
+ "none": {"min_gap_sec": 0.55, "diff_mult": 0.95},
228
+ }
229
+
230
+ # Concurrent parsing workers (YOLO + OCR) for KEPT keyframes
231
+ PARSE_WORKERS = int(os.getenv("PARSE_WORKERS", "2"))
232
+
233
+
234
+ # ----------------------------
235
+ # Data structures
236
+ # ----------------------------
237
+
238
+ @dataclass
239
+ class CandidateFrame:
240
+ t_sec: float
241
+ frame_idx: int
242
+ diff_score: float # diff vs previous sampled frame (local)
243
+
244
+
245
+ # ----------------------------
246
+ # Utils
247
+ # ----------------------------
248
+
249
+ def fmt_hhmmss(sec: float) -> str:
250
+ sec = max(0.0, float(sec))
251
+ h = int(sec // 3600)
252
+ m = int((sec % 3600) // 60)
253
+ s = int(sec % 60)
254
+ return f"{h:02d}:{m:02d}:{s:02d}"
255
+
256
+
257
+ def safe_read_json(path: Path) -> Any:
258
+ return json.loads(path.read_text(encoding="utf-8"))
259
+
260
+
261
+ def safe_write_json(path: Path, obj: Any) -> None:
262
+ path.parent.mkdir(parents=True, exist_ok=True)
263
+ path.write_text(json.dumps(obj, indent=2, ensure_ascii=False), encoding="utf-8")
264
+
265
+
266
+ def _probe_video(video_path: Path) -> Tuple[float, float, int]:
267
+ cap = cv2.VideoCapture(str(video_path))
268
+ if not cap.isOpened():
269
+ raise RuntimeError(f"Could not open video: {video_path}")
270
+ fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
271
+ frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
272
+ duration = float(frames / fps) if frames else 0.0
273
+ cap.release()
274
+ return float(fps), float(duration), int(frames)
275
+
276
+
277
+ def _mad_diff(a: np.ndarray, b: np.ndarray) -> float:
278
+ return float(np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16))))
279
+
280
+
281
+ def _downscale_gray(frame_bgr: np.ndarray, resize_w: int) -> np.ndarray:
282
+ h, w = frame_bgr.shape[:2]
283
+ new_w = int(resize_w)
284
+ new_h = int(h * (new_w / max(1, w)))
285
+ small = cv2.resize(frame_bgr, (new_w, new_h), interpolation=cv2.INTER_AREA)
286
+ return cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
287
+
288
+
289
+ def _resize_frame_max_w(frame_bgr: np.ndarray, max_w: int) -> np.ndarray:
290
+ h, w = frame_bgr.shape[:2]
291
+ if w <= max_w:
292
+ return frame_bgr
293
+ new_w = int(max_w)
294
+ new_h = int(h * (new_w / w))
295
+ return cv2.resize(frame_bgr, (new_w, new_h), interpolation=cv2.INTER_AREA)
296
+
297
+
298
+ def _single_line(s: str, max_len: int = 220) -> str:
299
+ if s is None:
300
+ return ""
301
+ s = str(s).replace("\r", " ").replace("\n", " ")
302
+ s = re.sub(r"\s+", " ", s).strip()
303
+ if len(s) > max_len:
304
+ s = s[: max(0, max_len - 1)].rstrip() + "…"
305
+ return s
306
+
307
+
308
+ # ----------------------------
309
+ # Video frame reader (single capture)
310
+ # ----------------------------
311
+
312
+ class VideoReader:
313
+ def __init__(self, video_path: Path):
314
+ self.cap = cv2.VideoCapture(str(video_path))
315
+ if not self.cap.isOpened():
316
+ raise RuntimeError(f"Could not open video: {video_path}")
317
+
318
+ def read_at_frame(self, frame_idx: int) -> Optional[np.ndarray]:
319
+ self.cap.set(cv2.CAP_PROP_POS_FRAMES, int(frame_idx))
320
+ ret, frame = self.cap.read()
321
+ if not ret:
322
+ return None
323
+ return frame
324
+
325
+ def close(self) -> None:
326
+ try:
327
+ self.cap.release()
328
+ except Exception:
329
+ pass
330
+
331
+
332
+ # ----------------------------
333
+ # Local screen parse helpers (YOLO layout + PaddleOCR)
334
+ # ----------------------------
335
+
336
+ def _xyxy_to_int(xyxy):
337
+ x1, y1, x2, y2 = xyxy
338
+ return [int(round(x1)), int(round(y1)), int(round(x2)), int(round(y2))]
339
+
340
+
341
+ def _clip_box(box, w, h):
342
+ x1, y1, x2, y2 = box
343
+ x1 = max(0, min(x1, w - 1))
344
+ y1 = max(0, min(y1, h - 1))
345
+ x2 = max(0, min(x2, w - 1))
346
+ y2 = max(0, min(y2, h - 1))
347
+ if x2 < x1:
348
+ x1, x2 = x2, x1
349
+ if y2 < y1:
350
+ y1, y2 = y2, y1
351
+ return [x1, y1, x2, y2]
352
+
353
+
354
+ def _box_center(box):
355
+ x1, y1, x2, y2 = box
356
+ return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
357
+
358
+
359
+ def _zone_for_box(box, W, H):
360
+ cx, cy = _box_center(box)
361
+ if cy < 0.18 * H:
362
+ return "top"
363
+ if cy > 0.85 * H:
364
+ return "bottom"
365
+ if cx < 0.33 * W:
366
+ return "left"
367
+ if cx > 0.67 * W:
368
+ return "right"
369
+ return "center"
370
+
371
+
372
+ def _sort_reading_order(items):
373
+ return sorted(items, key=lambda it: (it["box"][1], it["box"][0]))
374
+
375
+
376
+ def run_layout_yolo(layout_model: YOLO, frame_bgr: np.ndarray) -> List[dict]:
377
+ H, W = frame_bgr.shape[:2]
378
+ res = layout_model.predict(
379
+ source=frame_bgr,
380
+ conf=LAYOUT_CONF,
381
+ iou=LAYOUT_IOU,
382
+ imgsz=YOLO_IMGSZ,
383
+ device=YOLO_DEVICE,
384
+ verbose=False
385
+ )[0]
386
+
387
+ regions = []
388
+ names = res.names
389
+ if res.boxes is None:
390
+ return regions
391
+
392
+ for b in res.boxes:
393
+ cls_id = int(b.cls.item())
394
+ conf = float(b.conf.item())
395
+ label = str(names.get(cls_id, f"class_{cls_id}"))
396
+ box = _xyxy_to_int(b.xyxy[0].tolist())
397
+ box = _clip_box(box, W, H)
398
+ regions.append({"label": label, "conf": conf, "box": box})
399
+
400
+ return _sort_reading_order(regions)
401
+
402
+
403
+ def run_paddle_ocr(ocr: PaddleOCR, frame_bgr: np.ndarray) -> List[dict]:
404
+ # Full-frame OCR fallback (kept for safety), with angle cls OFF (cls=False)
405
+ H, W = frame_bgr.shape[:2]
406
+ out = []
407
+
408
+ rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
409
+ result = ocr.ocr(rgb, cls=False)
410
+ if not result:
411
+ return out
412
+
413
+ lines = result[0] if isinstance(result, list) and len(result) > 0 else []
414
+ if lines is None:
415
+ return out
416
+ if not isinstance(lines, list):
417
+ return out
418
+
419
+ for line in lines:
420
+ if line is None or not isinstance(line, (list, tuple)) or len(line) < 2:
421
+ continue
422
+ quad = line[0]
423
+ pair = line[1]
424
+ if quad is None or pair is None:
425
+ continue
426
+ if not isinstance(pair, (list, tuple)) or len(pair) < 2:
427
+ continue
428
+ text, conf = pair[0], pair[1]
429
+ conf = float(conf)
430
+ if conf < OCR_MIN_CONF:
431
+ continue
432
+
433
+ if not isinstance(quad, (list, tuple)) or len(quad) == 0:
434
+ continue
435
+ xs = [p[0] for p in quad]
436
+ ys = [p[1] for p in quad]
437
+ x1, y1, x2, y2 = int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))
438
+ box = _clip_box([x1, y1, x2, y2], W, H)
439
+
440
+ txt = _single_line(text, max_len=220)
441
+ if not txt:
442
+ continue
443
+
444
+ out.append({
445
+ "text": txt,
446
+ "conf": conf,
447
+ "quad": [[float(p[0]), float(p[1])] for p in quad],
448
+ "box": box,
449
+ })
450
+
451
+ if len(out) >= int(MAX_OCR_LINES):
452
+ break
453
+
454
+ return _sort_reading_order(out)
455
+
456
+
457
+ def _is_text_heavy_label(label: str) -> bool:
458
+ lab = (label or "").lower()
459
+ keys = ["title", "text", "list", "table", "header", "heading"]
460
+ return any(k in lab for k in keys)
461
+
462
+
463
+ def _crop_and_scale(frame_bgr: np.ndarray, box: List[int], scale: float) -> Optional[np.ndarray]:
464
+ x1, y1, x2, y2 = box
465
+ crop = frame_bgr[y1:y2, x1:x2]
466
+ if crop is None or crop.size == 0:
467
+ return None
468
+ if scale is None or float(scale) >= 0.999:
469
+ return crop
470
+ return cv2.resize(crop, (0, 0), fx=float(scale), fy=float(scale), interpolation=cv2.INTER_AREA)
471
+
472
+
473
+ def run_paddle_ocr_on_text_regions(
474
+ ocr: PaddleOCR,
475
+ frame_bgr: np.ndarray,
476
+ regions: List[dict],
477
+ frame_type: str,
478
+ max_regions: int = 10,
479
+ ) -> List[dict]:
480
+ """
481
+ OCR ONLY on YOLO text-heavy regions (title/text/list/table/header).
482
+ Angle classifier is OFF via cls=False.
483
+ Crops are optionally downscaled by frame_type (slides/demo faster, code max).
484
+ """
485
+ H, W = frame_bgr.shape[:2]
486
+ out: List[dict] = []
487
+
488
+ scale = float(OCR_CROP_SCALE_BY_TYPE.get(str(frame_type), 0.80))
489
+
490
+ text_regions = [r for r in regions if _is_text_heavy_label(r.get("label", ""))]
491
+ text_regions = text_regions[: int(max_regions)]
492
+
493
+ # If YOLO didn't detect any text region, fallback to full-frame OCR
494
+ if not text_regions:
495
+ return run_paddle_ocr(ocr, frame_bgr)
496
+
497
+ for r in text_regions:
498
+ box = r["box"]
499
+ x1, y1, x2, y2 = box
500
+
501
+ crop = _crop_and_scale(frame_bgr, box, scale=scale)
502
+ if crop is None or crop.size == 0:
503
+ continue
504
+
505
+ rgb = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)
506
+ res = ocr.ocr(rgb, cls=False) # cls OFF (angle cls OFF)
507
+ lines = res[0] if res else []
508
+ if lines is None or not isinstance(lines, list):
509
+ continue
510
+ if not lines:
511
+ continue
512
+
513
+ inv_scale = (1.0 / scale) if scale and scale > 0 else 1.0
514
+
515
+ for line in lines:
516
+ if line is None or not isinstance(line, (list, tuple)) or len(line) < 2:
517
+ continue
518
+ quad = line[0]
519
+ pair = line[1]
520
+ if quad is None or pair is None:
521
+ continue
522
+ if not isinstance(pair, (list, tuple)) or len(pair) < 2:
523
+ continue
524
+ text, conf = pair[0], pair[1]
525
+ conf = float(conf)
526
+ if conf < OCR_MIN_CONF:
527
+ continue
528
+
529
+ if not isinstance(quad, (list, tuple)) or len(quad) == 0:
530
+ continue
531
+ quad_global = []
532
+ for p in quad:
533
+ gx = float(p[0]) * inv_scale + float(x1)
534
+ gy = float(p[1]) * inv_scale + float(y1)
535
+ quad_global.append([gx, gy])
536
+
537
+ xs = [p[0] for p in quad_global]
538
+ ys = [p[1] for p in quad_global]
539
+ gx1, gy1, gx2, gy2 = int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))
540
+ gbox = _clip_box([gx1, gy1, gx2, gy2], W, H)
541
+
542
+ txt = _single_line(text, max_len=220)
543
+ if not txt:
544
+ continue
545
+
546
+ out.append({
547
+ "text": txt,
548
+ "conf": conf,
549
+ "quad": quad_global,
550
+ "box": gbox,
551
+ "from_region_label": r.get("label", ""),
552
+ "from_region_box": box,
553
+ "crop_scale": float(scale),
554
+ })
555
+
556
+ if len(out) >= int(MAX_OCR_LINES):
557
+ break
558
+
559
+ if len(out) >= int(MAX_OCR_LINES):
560
+ break
561
+
562
+ return _sort_reading_order(out)
563
+
564
+
565
+ def attach_zones(regions: List[dict], W: int, H: int) -> Dict[str, List[dict]]:
566
+ zones = {"top": [], "left": [], "center": [], "right": [], "bottom": []}
567
+ for r in regions:
568
+ z = _zone_for_box(r["box"], W, H)
569
+ zones[z].append(r)
570
+ for z in zones:
571
+ zones[z] = _sort_reading_order(zones[z])
572
+ return zones
573
+
574
+
575
+ def guess_title(regions: List[dict], ocr_lines: List[dict]) -> str:
576
+ title_boxes = []
577
+ for r in regions:
578
+ lab = r.get("label", "").lower()
579
+ if ("title" in lab) or (lab == "title") or ("header" in lab and "page" not in lab):
580
+ title_boxes.append(r["box"])
581
+
582
+ def inside(line_box, region_box) -> bool:
583
+ x1, y1, x2, y2 = line_box
584
+ rx1, ry1, rx2, ry2 = region_box
585
+ return (x1 >= rx1 - 3 and y1 >= ry1 - 3 and x2 <= rx2 + 3 and y2 <= ry2 + 3)
586
+
587
+ if title_boxes:
588
+ lines = []
589
+ for ob in ocr_lines:
590
+ for tb in title_boxes:
591
+ if inside(ob["box"], tb):
592
+ lines.append(ob["text"])
593
+ break
594
+ lines = [x for x in lines if x]
595
+ if lines:
596
+ return " ".join(lines[:3]).strip()
597
+
598
+ if ocr_lines:
599
+ return ocr_lines[0]["text"]
600
+ return ""
601
+
602
+
603
+ def attach_ocr_to_regions(regions: List[dict], ocr_lines: List[dict], pad: int = 3) -> List[dict]:
604
+ def inside(line_box, region_box) -> bool:
605
+ x1, y1, x2, y2 = line_box
606
+ rx1, ry1, rx2, ry2 = region_box
607
+ return (x1 >= rx1 - pad and y1 >= ry1 - pad and x2 <= rx2 + pad and y2 <= ry2 + pad)
608
+
609
+ out = []
610
+ for r in regions:
611
+ rb = r.get("box")
612
+ if not rb:
613
+ out.append(r)
614
+ continue
615
+
616
+ texts = []
617
+ lines_in = []
618
+ for ln in ocr_lines:
619
+ lb = ln.get("box")
620
+ if lb and inside(lb, rb):
621
+ t = ln.get("text", "")
622
+ if t:
623
+ texts.append(t)
624
+ lines_in.append(ln)
625
+
626
+ rr = dict(r)
627
+ rr["text_lines"] = texts
628
+ rr["text"] = " ".join(texts).strip()
629
+ rr["ocr_line_count"] = len(lines_in)
630
+ out.append(rr)
631
+
632
+ return out
633
+
634
+
635
+ # ----------------------------
636
+ # CLIP frame type classifier (no LLM)
637
+ # ----------------------------
638
+
639
+ def init_clip_classifier() -> Tuple[Any, Any, Dict[str, Any], str]:
640
+ """
641
+ Builds a robust CLIP classifier with:
642
+ - POS prompt ensembling per class
643
+ - NEG prompt ensembling per class
644
+ - score = mean(sim to POS) - mean(sim to NEG)
645
+ Returns:
646
+ clip_model, preprocess, pack, device
647
+ where pack contains text features and metadata.
648
+ """
649
+ if clip is None or torch is None or Image is None:
650
+ raise RuntimeError(
651
+ "CLIP dependencies missing. Install torch and CLIP "
652
+ "(e.g. pip install torch and pip install git+https://github.com/openai/CLIP.git)."
653
+ )
654
+
655
+ device = "cuda" if torch.cuda.is_available() else "cpu"
656
+ try:
657
+ model, preprocess = clip.load(CLIP_MODEL_NAME, device=device)
658
+ model.eval()
659
+ except Exception as e:
660
+ raise RuntimeError(f"CLIP init failed for model '{CLIP_MODEL_NAME}': {type(e).__name__}: {e}") from e
661
+
662
+ if len(CLIP_CLASS_PROMPTS) != len(CLIP_CLASS_LABELS):
663
+ raise ValueError("CLIP_CLASS_PROMPTS must align with CLIP_CLASS_LABELS (same length).")
664
+
665
+ if "CLIP_CLASS_NEG_PROMPTS" not in globals():
666
+ raise ValueError("CLIP_CLASS_NEG_PROMPTS is missing. Define it (aligned with CLIP_CLASS_LABELS).")
667
+
668
+ if len(CLIP_CLASS_NEG_PROMPTS) != len(CLIP_CLASS_LABELS):
669
+ raise ValueError("CLIP_CLASS_NEG_PROMPTS must align with CLIP_CLASS_LABELS (same length).")
670
+
671
+ flat_pos: List[str] = []
672
+ pos_slices: List[Tuple[int, int]] = []
673
+ idx = 0
674
+ for prompts in CLIP_CLASS_PROMPTS:
675
+ if not isinstance(prompts, list) or len(prompts) == 0:
676
+ raise ValueError("Each entry in CLIP_CLASS_PROMPTS must be a non-empty list[str].")
677
+ s = idx
678
+ for p in prompts:
679
+ if not isinstance(p, str):
680
+ raise ValueError("All POS prompts must be strings.")
681
+ flat_pos.append(p)
682
+ idx += 1
683
+ pos_slices.append((s, idx))
684
+
685
+ flat_neg: List[str] = []
686
+ neg_slices: List[Tuple[int, int]] = []
687
+ idx = 0
688
+ for prompts in CLIP_CLASS_NEG_PROMPTS:
689
+ if not isinstance(prompts, list) or len(prompts) == 0:
690
+ raise ValueError("Each entry in CLIP_CLASS_NEG_PROMPTS must be a non-empty list[str].")
691
+ s = idx
692
+ for p in prompts:
693
+ if not isinstance(p, str):
694
+ raise ValueError("All NEG prompts must be strings.")
695
+ flat_neg.append(p)
696
+ idx += 1
697
+ neg_slices.append((s, idx))
698
+
699
+ with torch.no_grad():
700
+ pos_tokens = clip.tokenize(flat_pos).to(device)
701
+ pos_feats_all = model.encode_text(pos_tokens)
702
+ pos_feats_all = pos_feats_all / pos_feats_all.norm(dim=-1, keepdim=True)
703
+
704
+ neg_tokens = clip.tokenize(flat_neg).to(device)
705
+ neg_feats_all = model.encode_text(neg_tokens)
706
+ neg_feats_all = neg_feats_all / neg_feats_all.norm(dim=-1, keepdim=True)
707
+
708
+ pos_class_feats: List[torch.Tensor] = []
709
+ neg_class_feats: List[torch.Tensor] = []
710
+
711
+ for (s, e) in pos_slices:
712
+ pos_class_feats.append(pos_feats_all[s:e])
713
+ for (s, e) in neg_slices:
714
+ neg_class_feats.append(neg_feats_all[s:e])
715
+
716
+ pack = {
717
+ "labels": CLIP_CLASS_LABELS,
718
+ "pos_class_feats": pos_class_feats,
719
+ "neg_class_feats": neg_class_feats,
720
+ "score_mode": str(CLIP_SCORE_MODE),
721
+ "min_margin": float(CLIP_MIN_MARGIN),
722
+ }
723
+ return model, preprocess, pack, device
724
+
725
+
726
+ def classify_frame_clip(
727
+ *,
728
+ frame_bgr: np.ndarray,
729
+ clip_model: Any,
730
+ clip_preprocess: Any,
731
+ clip_text_features: Any,
732
+ clip_device: str,
733
+ ) -> Tuple[str, Dict[str, float]]:
734
+ pack = clip_text_features
735
+ labels: List[str] = pack["labels"]
736
+ pos_class_feats: List[Any] = pack["pos_class_feats"]
737
+ neg_class_feats: List[Any] = pack["neg_class_feats"]
738
+
739
+ none_margin: float = float(pack.get("none_margin", 0.02))
740
+ weak_thr: float = float(pack.get("weak_thr", 0.00))
741
+ slide_close: float = float(pack.get("slide_close", 0.03))
742
+
743
+ rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
744
+ img = Image.fromarray(rgb)
745
+ image = clip_preprocess(img).unsqueeze(0).to(clip_device)
746
+
747
+ with torch.no_grad():
748
+ img_feat = clip_model.encode_image(image)
749
+ img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
750
+
751
+ scores: List[float] = []
752
+ for i in range(len(labels)):
753
+ pos_feats = pos_class_feats[i].to(clip_device)
754
+ neg_feats = neg_class_feats[i].to(clip_device)
755
+
756
+ pos_sims = (img_feat @ pos_feats.T).squeeze(0)
757
+ neg_sims = (img_feat @ neg_feats.T).squeeze(0)
758
+
759
+ score = float(pos_sims.mean().item() - neg_sims.mean().item())
760
+ scores.append(score)
761
+
762
+ scores_np = np.array(scores, dtype=np.float32)
763
+ score_map: Dict[str, float] = {labels[i]: float(scores_np[i]) for i in range(len(labels))}
764
+
765
+ if "none" not in labels:
766
+ best_idx = int(np.argmax(scores_np))
767
+ pred = labels[best_idx]
768
+ if ("slides" in labels) and (pred != "slides"):
769
+ winner_score = float(score_map[pred])
770
+ slides_score = float(score_map["slides"])
771
+ if (winner_score - slides_score) < float(slide_close):
772
+ pred = "slides"
773
+ score_map["_slide_close"] = float(slide_close)
774
+ return pred, score_map
775
+
776
+ none_idx = int(labels.index("none"))
777
+ none_score = float(scores_np[none_idx])
778
+
779
+ non_none_idxs = [i for i, lab in enumerate(labels) if lab != "none"]
780
+ best_non_none_idx = int(max(non_none_idxs, key=lambda i: float(scores_np[i])))
781
+ best_non_none_label = labels[best_non_none_idx]
782
+ best_non_none_score = float(scores_np[best_non_none_idx])
783
+
784
+ if (none_score >= best_non_none_score + none_margin) or (best_non_none_score < weak_thr):
785
+ pred = "none"
786
+ else:
787
+ pred = best_non_none_label
788
+
789
+ if pred != "none" and ("slides" in labels) and (pred != "slides"):
790
+ slides_score = float(score_map["slides"])
791
+ winner_score = float(score_map[pred])
792
+ if (winner_score - slides_score) < float(slide_close):
793
+ pred = "slides"
794
+
795
+ score_map["_best_non_none_score"] = float(best_non_none_score)
796
+ score_map["_none_score"] = float(none_score)
797
+ score_map["_none_margin"] = float(none_margin)
798
+ score_map["_weak_thr"] = float(weak_thr)
799
+ score_map["_best_non_none_idx"] = float(best_non_none_idx)
800
+ score_map["_none_idx"] = float(none_idx)
801
+ score_map["_slide_close"] = float(slide_close)
802
+ if "slides" in labels:
803
+ score_map["_slides_score"] = float(score_map["slides"])
804
+
805
+ return pred, score_map
806
+
807
+
808
+ # ----------------------------
809
+ # Candidate detection (cheap, local)
810
+ # ----------------------------
811
+
812
+ def find_candidates_diff(
813
+ video_path: Path,
814
+ sample_fps: float,
815
+ resize_w: int,
816
+ candidate_percentile: float,
817
+ max_candidates: int,
818
+ ) -> Tuple[List[CandidateFrame], float]:
819
+ fps, duration, total_frames = _probe_video(video_path)
820
+ if duration <= 0 or total_frames <= 0:
821
+ raise RuntimeError("Could not determine video duration/frames.")
822
+
823
+ cap = cv2.VideoCapture(str(video_path))
824
+ if not cap.isOpened():
825
+ raise RuntimeError(f"Could not open video: {video_path}")
826
+
827
+ sample_fps = float(sample_fps)
828
+ if sample_fps <= 0:
829
+ raise ValueError("sample_fps must be > 0")
830
+
831
+ step_frames = max(1, int(round(fps / sample_fps)))
832
+
833
+ print(f" [step1] video_fps={fps:.3f} duration_sec={duration:.2f} total_frames={total_frames}")
834
+ print(f" [step1] SAMPLE_FPS={sample_fps} -> step_frames={step_frames} (~{1.0/sample_fps:.2f}s per sample)")
835
+ print(f" [step1] RESIZE_W={resize_w} CANDIDATE_PERCENTILE={candidate_percentile} MAX_CANDIDATES={max_candidates}")
836
+
837
+ candidates: List[CandidateFrame] = []
838
+ diffs: List[float] = []
839
+
840
+ prev_gray = None
841
+ sampled = 0
842
+
843
+ max_k = int((total_frames - 1) // step_frames) if total_frames > 0 else 0
844
+
845
+ for k in range(max_k + 1):
846
+ frame_idx = int(k * step_frames)
847
+ cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
848
+ ret, frame = cap.read()
849
+ if not ret or frame is None:
850
+ break
851
+
852
+ sampled += 1
853
+ t_sec = frame_idx / fps
854
+
855
+ gray = _downscale_gray(frame, resize_w=resize_w)
856
+ d = 999.0 if prev_gray is None else _mad_diff(gray, prev_gray)
857
+
858
+ candidates.append(CandidateFrame(t_sec=float(t_sec), frame_idx=int(frame_idx), diff_score=float(d)))
859
+ diffs.append(float(d))
860
+ prev_gray = gray
861
+
862
+ if sampled % 300 == 0:
863
+ print(f" [step1] sampled={sampled} last_t={fmt_hhmmss(t_sec)} last_diff={d:.2f}")
864
+
865
+ cap.release()
866
+
867
+ if not candidates:
868
+ print(" [step1] no candidates produced (empty video?)")
869
+ return [], 0.0
870
+
871
+ diffs_np = np.array(diffs, dtype=np.float32)
872
+ diffs_for_thr = diffs_np[1:] if len(diffs_np) > 1 else diffs_np
873
+ base_thr = float(np.percentile(diffs_for_thr, float(candidate_percentile)))
874
+ base_thr = max(4.0, base_thr)
875
+
876
+ order = np.argsort(diffs_np)[::-1]
877
+ picked = set()
878
+ out: List[CandidateFrame] = []
879
+
880
+ out.append(candidates[0])
881
+ picked.add(0)
882
+
883
+ for idx in order:
884
+ if len(out) >= int(max_candidates):
885
+ break
886
+ ii = int(idx)
887
+ if ii in picked:
888
+ continue
889
+ out.append(candidates[ii])
890
+ picked.add(ii)
891
+
892
+ out.sort(key=lambda x: x.t_sec)
893
+
894
+ print(f" [step1] sampled_frames={sampled} raw_candidates={len(candidates)} selected_candidates={len(out)} base_thr={base_thr:.2f}")
895
+ return out, base_thr
896
+
897
+
898
+ # ----------------------------
899
+ # Keyframe keep rule (visual only)
900
+ # ----------------------------
901
+
902
+ def should_keep_visual_only(
903
+ *,
904
+ frame_type: str,
905
+ t_sec: float,
906
+ diff_to_last_keep: float,
907
+ base_thr: float,
908
+ last_kept_t: float,
909
+ ) -> Tuple[bool, Dict[str, float]]:
910
+ cfg = SENS.get(frame_type, {"min_gap_sec": 1.0, "diff_mult": 1.0})
911
+ diff_mult = float(cfg.get("diff_mult", 1.0))
912
+
913
+ min_gap = float(MIN_KEYFRAME_GAP_SEC)
914
+
915
+ ok_gap = True if last_kept_t <= -1e8 else ((t_sec - last_kept_t) >= min_gap)
916
+ thr_eff = float(base_thr * diff_mult)
917
+ ok_visual = diff_to_last_keep >= thr_eff
918
+
919
+ debug = {
920
+ "diff_to_last_keep": float(diff_to_last_keep),
921
+ "thr_effective": float(thr_eff),
922
+ "ok_gap": 1.0 if ok_gap else 0.0,
923
+ "ok_visual": 1.0 if ok_visual else 0.0,
924
+ "min_gap_sec_used": float(min_gap),
925
+ }
926
+ return (ok_gap and ok_visual), debug
927
+
928
+
929
+ # ----------------------------
930
+ # Concurrent parsing worker (YOLO + OCR) for kept keyframes
931
+ # ----------------------------
932
+
933
+ _WORKER_LAYOUT_MODEL = None
934
+ _WORKER_OCR_MODEL = None
935
+
936
+ def _worker_init(layout_weights: str, ocr_lang: str, enable_yolo: bool = True):
937
+ global _WORKER_LAYOUT_MODEL, _WORKER_OCR_MODEL
938
+ _WORKER_LAYOUT_MODEL = YOLO(layout_weights) if enable_yolo else None
939
+
940
+ # IMPORTANT:
941
+ # - use_angle_cls=False: turn off angle classifier
942
+ # - use_gpu=USE_GPU: attempts GPU (requires paddlepaddle-gpu)
943
+ _WORKER_OCR_MODEL = PaddleOCR(
944
+ use_angle_cls=False,
945
+ lang=ocr_lang,
946
+ use_gpu=USE_GPU,
947
+ show_log=False,
948
+ enable_mkldnn=False,
949
+ ir_optim=False,
950
+ )
951
+
952
+ def _parse_one_keyframe(job: dict) -> dict:
953
+ global _WORKER_LAYOUT_MODEL, _WORKER_OCR_MODEL
954
+ kidx = int(job["keyframe_idx"])
955
+ img_path = job["image_path"]
956
+ frame_type = str(job.get("frame_type", "none"))
957
+ parse_mode = str(job.get("parse_mode", "yolo_ocr"))
958
+
959
+ frame = cv2.imread(str(img_path))
960
+ if frame is None:
961
+ return {"keyframe_idx": kidx, "error": f"Could not read image: {img_path}"}
962
+
963
+ # Resize for slides/demo/none to speed up YOLO+OCR; keep code max
964
+ max_w = int(PARSE_MAX_W_BY_TYPE.get(frame_type, 1280))
965
+ frame_for_parse = _resize_frame_max_w(frame, max_w=max_w) if max_w < 99999 else frame
966
+
967
+ H, W = frame_for_parse.shape[:2]
968
+
969
+ regions: List[dict] = []
970
+ t_yolo_ms = 0.0
971
+ if parse_mode == "yolo_ocr":
972
+ if _WORKER_LAYOUT_MODEL is None:
973
+ return {"keyframe_idx": kidx, "error": "YOLO model is not initialized for yolo_ocr parse mode."}
974
+ t0 = time.perf_counter()
975
+ regions = run_layout_yolo(_WORKER_LAYOUT_MODEL, frame_for_parse)
976
+ t_yolo_ms = (time.perf_counter() - t0) * 1000.0
977
+
978
+ t0 = time.perf_counter()
979
+ if parse_mode == "yolo_ocr":
980
+ ocr_lines = run_paddle_ocr_on_text_regions(
981
+ _WORKER_OCR_MODEL,
982
+ frame_for_parse,
983
+ regions,
984
+ frame_type=frame_type,
985
+ max_regions=OCR_CROP_MAX_REGIONS,
986
+ )
987
+ else:
988
+ # OCR-only mode: no layout detection, run full-frame OCR.
989
+ ocr_lines = run_paddle_ocr(_WORKER_OCR_MODEL, frame_for_parse)
990
+ t_ocr_ms = (time.perf_counter() - t0) * 1000.0
991
+
992
+ t0 = time.perf_counter()
993
+ regions_with_text = attach_ocr_to_regions(regions, ocr_lines) if regions else []
994
+ zones = attach_zones(regions_with_text, W=W, H=H) if regions_with_text else {"top": [], "left": [], "center": [], "right": [], "bottom": []}
995
+ title_guess_val = guess_title(regions_with_text, ocr_lines)
996
+ t_attach_ms = (time.perf_counter() - t0) * 1000.0
997
+
998
+ text_lines = [x["text"] for x in ocr_lines if x.get("text")][:MAX_OCR_LINES]
999
+
1000
+ screen_parse = {
1001
+ "frame_w": int(W),
1002
+ "frame_h": int(H),
1003
+ "layout_regions": regions_with_text,
1004
+ "ocr_lines": ocr_lines,
1005
+ "zones": zones,
1006
+ "title_guess": title_guess_val,
1007
+ "layout_model": str(LAYOUT_YOLO_WEIGHTS),
1008
+ "ocr_lang": str(OCR_LANG),
1009
+ "layout_conf": float(LAYOUT_CONF),
1010
+ "layout_iou": float(LAYOUT_IOU),
1011
+ "ocr_min_conf": float(OCR_MIN_CONF),
1012
+ "parse_input_frame_type": str(frame_type),
1013
+ "yolo_device": str(YOLO_DEVICE),
1014
+ "yolo_imgsz": int(YOLO_IMGSZ),
1015
+ "ocr_use_gpu": bool(USE_GPU),
1016
+ "ocr_angle_cls": False,
1017
+ "ocr_crop_max_regions": int(OCR_CROP_MAX_REGIONS),
1018
+ "ocr_crop_scale_used": float(OCR_CROP_SCALE_BY_TYPE.get(frame_type, 0.80)),
1019
+ "parse_max_w_used": int(max_w),
1020
+ "parse_mode": str(parse_mode),
1021
+ }
1022
+
1023
+ return {
1024
+ "keyframe_idx": kidx,
1025
+ "on_screen_text": text_lines,
1026
+ "screen_parse": screen_parse,
1027
+ "parse_timings_ms": {
1028
+ "full_yolo_ms": float(t_yolo_ms),
1029
+ "full_ocr_ms": float(t_ocr_ms),
1030
+ "attach_text_ms": float(t_attach_ms),
1031
+ }
1032
+ }
1033
+
1034
+
1035
+ # ----------------------------
1036
+ # Main
1037
+ # ----------------------------
1038
+
1039
+ def main():
1040
+ load_dotenv()
1041
+
1042
+ ap = argparse.ArgumentParser()
1043
+ ap.add_argument("--video", required=True, help="Path to meeting.mp4")
1044
+ ap.add_argument("--out", required=True, help="Output folder")
1045
+ ap.add_argument("--force", action="store_true")
1046
+ ap.add_argument(
1047
+ "--no-yolo-for-non-demo",
1048
+ action="store_true",
1049
+ help="Use OCR-only parsing for non-demo frames (slides/code/none).",
1050
+ )
1051
+ args = ap.parse_args()
1052
+
1053
+ if not ENABLE_LOCAL_SCREEN_PARSE:
1054
+ raise RuntimeError("ENABLE_LOCAL_SCREEN_PARSE must be True. YOLO and PaddleOCR are required.")
1055
+
1056
+ if not Path(LAYOUT_YOLO_WEIGHTS).exists():
1057
+ raise FileNotFoundError(f"Layout YOLO weights not found at: {LAYOUT_YOLO_WEIGHTS}")
1058
+
1059
+ try:
1060
+ _ = YOLO(LAYOUT_YOLO_WEIGHTS)
1061
+ except Exception as e:
1062
+ raise RuntimeError(f"YOLO init failed: {type(e).__name__}: {e}") from e
1063
+
1064
+ # NOTE: this tries GPU; if your Paddle is CPU-only, this may error.
1065
+ # In that case install paddlepaddle-gpu, or set USE_GPU=False.
1066
+ try:
1067
+ _ = PaddleOCR(
1068
+ use_angle_cls=False,
1069
+ lang=OCR_LANG,
1070
+ use_gpu=USE_GPU,
1071
+ show_log=False,
1072
+ enable_mkldnn=False,
1073
+ ir_optim=False,
1074
+ )
1075
+ except Exception as e:
1076
+ raise RuntimeError(f"PaddleOCR init failed: {type(e).__name__}: {e}") from e
1077
+
1078
+ try:
1079
+ clip_model, clip_preprocess, clip_text_features, clip_device = init_clip_classifier()
1080
+ except Exception as e:
1081
+ raise RuntimeError(f"CLIP classifier init failed: {type(e).__name__}: {e}") from e
1082
+
1083
+ video_path = Path(args.video).resolve()
1084
+ out_dir = Path(args.out).resolve()
1085
+ out_dir.mkdir(parents=True, exist_ok=True)
1086
+
1087
+ frames_dir = out_dir / "frames_selected"
1088
+ frames_dir.mkdir(parents=True, exist_ok=True)
1089
+
1090
+ enriched_json = out_dir / "keyframes_parsed.json"
1091
+ timing_json = out_dir / "timing_summary.json"
1092
+ classified_dir = out_dir / "classified"
1093
+ classified_dir.mkdir(parents=True, exist_ok=True)
1094
+
1095
+ out_paths = {
1096
+ "slides": classified_dir / "slides_keyframes.json",
1097
+ "code": classified_dir / "code_keyframes.json",
1098
+ "demo": classified_dir / "demo_keyframes.json",
1099
+ "none": classified_dir / "none_keyframes.json",
1100
+ }
1101
+
1102
+ t_total0 = time.perf_counter()
1103
+ timing_totals = {
1104
+ "candidate_detection_ms": 0.0,
1105
+ "candidate_loop_ms": 0.0,
1106
+ "read_frame_ms": 0.0,
1107
+ "gray_diff_ms": 0.0,
1108
+ "clip_ms": 0.0,
1109
+ "keep_logic_ms": 0.0,
1110
+ "save_frame_ms": 0.0,
1111
+ "parse_concurrent_ms": 0.0,
1112
+ "json_write_ms": 0.0,
1113
+ }
1114
+
1115
+ all_selected: List[dict] = []
1116
+ processed_times: set = set()
1117
+
1118
+ last_kept_t = -1e9
1119
+ last_kept_gray: Optional[np.ndarray] = None
1120
+
1121
+ if (not args.force) and enriched_json.exists():
1122
+ try:
1123
+ all_selected = safe_read_json(enriched_json)
1124
+ if isinstance(all_selected, list) and all_selected:
1125
+ processed_times = {round(float(x.get("t_sec", -1.0)), 2) for x in all_selected if "t_sec" in x}
1126
+ last = all_selected[-1]
1127
+ last_kept_t = float(last.get("t_sec", last_kept_t))
1128
+
1129
+ last_img = Path(last.get("image_path", ""))
1130
+ if last_img.exists():
1131
+ img = cv2.imread(str(last_img))
1132
+ if img is not None:
1133
+ last_kept_gray = _downscale_gray(img, RESIZE_W)
1134
+
1135
+ print(f"Resuming: already selected {len(all_selected)} keyframes (last at {fmt_hhmmss(last_kept_t)}).")
1136
+ except Exception:
1137
+ all_selected = []
1138
+ processed_times = set()
1139
+ last_kept_t = -1e9
1140
+ last_kept_gray = None
1141
+
1142
+ if args.force:
1143
+ all_selected = []
1144
+ processed_times = set()
1145
+ last_kept_t = -1e9
1146
+ last_kept_gray = None
1147
+
1148
+ print("1) Finding candidate change points locally (no API)...")
1149
+ print(" [step1] starting... (this can take time on long videos)")
1150
+ t0 = time.perf_counter()
1151
+ candidates, base_thr = find_candidates_diff(
1152
+ video_path=video_path,
1153
+ sample_fps=SAMPLE_FPS,
1154
+ resize_w=RESIZE_W,
1155
+ candidate_percentile=CANDIDATE_PERCENTILE,
1156
+ max_candidates=MAX_CANDIDATES,
1157
+ )
1158
+ t1_ms = (time.perf_counter() - t0) * 1000.0
1159
+ timing_totals["candidate_detection_ms"] += t1_ms
1160
+ print(f" [step1] done in {t1_ms/1000.0:.2f}s")
1161
+
1162
+ print(f"Candidates: {len(candidates)}, base diff threshold ~ {base_thr:.2f}")
1163
+ print("Sensitivity config (edit in code):", SENS)
1164
+ print("Layout model:", LAYOUT_YOLO_WEIGHTS)
1165
+ print("YOLO device:", YOLO_DEVICE, "| imgsz:", YOLO_IMGSZ)
1166
+ print("OCR lang:", OCR_LANG, "| OCR_MIN_CONF:", OCR_MIN_CONF, "| OCR GPU:", USE_GPU, "| angle_cls:", False)
1167
+ print("CLIP model:", CLIP_MODEL_NAME, "| device:", clip_device)
1168
+ print("Parse workers:", PARSE_WORKERS)
1169
+ print(f"Global min gap override (seconds since last keyframe): {MIN_KEYFRAME_GAP_SEC:.2f}s")
1170
+
1171
+ kept_count = len(all_selected)
1172
+ reader = VideoReader(video_path)
1173
+
1174
+ try:
1175
+ print("2) Selecting keyframes (VISUAL ONLY: time gap + diff; no OCR in loop)...")
1176
+ t_loop0 = time.perf_counter()
1177
+
1178
+ for i, cand in enumerate(candidates, start=1):
1179
+ if kept_count >= int(MAX_FRAMES):
1180
+ break
1181
+
1182
+ t_key = round(float(cand.t_sec), 2)
1183
+ if t_key in processed_times:
1184
+ continue
1185
+ if cand.t_sec <= (last_kept_t + 1e-6) and last_kept_t > -1e8:
1186
+ continue
1187
+
1188
+ gap = float(cand.t_sec - last_kept_t) if last_kept_t > -1e8 else 9999.0
1189
+
1190
+ if last_kept_t > -1e8 and gap < float(MIN_KEYFRAME_GAP_SEC):
1191
+ continue
1192
+
1193
+ t0 = time.perf_counter()
1194
+ frame = reader.read_at_frame(cand.frame_idx)
1195
+ timing_totals["read_frame_ms"] += (time.perf_counter() - t0) * 1000.0
1196
+ if frame is None:
1197
+ continue
1198
+
1199
+ t0 = time.perf_counter()
1200
+ gray_now = _downscale_gray(frame, RESIZE_W)
1201
+ diff_to_last_keep = 999.0 if last_kept_gray is None else _mad_diff(gray_now, last_kept_gray)
1202
+ timing_totals["gray_diff_ms"] += (time.perf_counter() - t0) * 1000.0
1203
+
1204
+ print(
1205
+ f"[{i}/{len(candidates)}] t={fmt_hhmmss(cand.t_sec)} "
1206
+ f"gap_since_last_keep={gap:.2f}s cand_diff={cand.diff_score:.2f} keep_diff={diff_to_last_keep:.2f} ..."
1207
+ )
1208
+
1209
+ frame_fast = _resize_frame_max_w(frame, FAST_FRAME_MAX_W)
1210
+
1211
+ t0 = time.perf_counter()
1212
+ frame_type, clip_probs = classify_frame_clip(
1213
+ frame_bgr=frame_fast,
1214
+ clip_model=clip_model,
1215
+ clip_preprocess=clip_preprocess,
1216
+ clip_text_features=clip_text_features,
1217
+ clip_device=clip_device,
1218
+ )
1219
+ t_clip_ms = (time.perf_counter() - t0) * 1000.0
1220
+ timing_totals["clip_ms"] += t_clip_ms
1221
+
1222
+ t0 = time.perf_counter()
1223
+ keep, dbg = should_keep_visual_only(
1224
+ frame_type=frame_type,
1225
+ t_sec=float(cand.t_sec),
1226
+ diff_to_last_keep=float(diff_to_last_keep),
1227
+ base_thr=float(base_thr),
1228
+ last_kept_t=float(last_kept_t),
1229
+ )
1230
+ t_keep_ms = (time.perf_counter() - t0) * 1000.0
1231
+ timing_totals["keep_logic_ms"] += t_keep_ms
1232
+
1233
+ print(
1234
+ f" timings: clip={t_clip_ms:.0f}ms keep_logic={t_keep_ms:.0f}ms "
1235
+ f"| type={frame_type} keep={keep} | diff={diff_to_last_keep:.2f} thr_eff={dbg['thr_effective']:.2f} "
1236
+ f"| min_gap_used={dbg.get('min_gap_sec_used', MIN_KEYFRAME_GAP_SEC):.2f}s"
1237
+ )
1238
+
1239
+ if not keep:
1240
+ if BASE_SLEEP_SEC > 0:
1241
+ time.sleep(BASE_SLEEP_SEC)
1242
+ continue
1243
+
1244
+ t0 = time.perf_counter()
1245
+ out_img = frames_dir / f"frame_{kept_count:04d}_{cand.t_sec:.2f}s_{frame_type}.jpg"
1246
+ cv2.imwrite(str(out_img), frame)
1247
+ t_save_ms = (time.perf_counter() - t0) * 1000.0
1248
+ timing_totals["save_frame_ms"] += t_save_ms
1249
+
1250
+ item = {
1251
+ "keyframe_idx": int(kept_count),
1252
+ "t_sec": float(cand.t_sec),
1253
+ "timestamp": fmt_hhmmss(cand.t_sec),
1254
+ "image_path": str(out_img),
1255
+
1256
+ "frame_type": frame_type,
1257
+ "on_screen_text": [],
1258
+ "screen_parse": None,
1259
+
1260
+ "candidate_diff_score": float(cand.diff_score),
1261
+ "diff_to_last_keep": float(diff_to_last_keep),
1262
+ "base_diff_threshold": float(base_thr),
1263
+ "thr_effective": float(dbg.get("thr_effective", 0.0)),
1264
+ "gap_since_last_keep_sec": float(gap),
1265
+
1266
+ "clip_probs": {k: float(v) for k, v in clip_probs.items()},
1267
+ "clip_prompt_map": dict(zip(CLIP_CLASS_LABELS, CLIP_CLASS_PROMPTS)),
1268
+ "clip_model_name": str(CLIP_MODEL_NAME),
1269
+
1270
+ "timings_ms": {
1271
+ "clip_ms": float(t_clip_ms),
1272
+ "keep_logic_ms": float(t_keep_ms),
1273
+ "save_frame_ms": float(t_save_ms),
1274
+ },
1275
+ }
1276
+
1277
+ all_selected.append(item)
1278
+ processed_times.add(t_key)
1279
+ kept_count += 1
1280
+
1281
+ last_kept_t = float(cand.t_sec)
1282
+ last_kept_gray = gray_now
1283
+
1284
+ t0 = time.perf_counter()
1285
+ safe_write_json(enriched_json, all_selected)
1286
+ timing_totals["json_write_ms"] += (time.perf_counter() - t0) * 1000.0
1287
+
1288
+ if BASE_SLEEP_SEC > 0:
1289
+ time.sleep(BASE_SLEEP_SEC)
1290
+
1291
+ timing_totals["candidate_loop_ms"] += (time.perf_counter() - t_loop0) * 1000.0
1292
+
1293
+ finally:
1294
+ reader.close()
1295
+
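
`_downscale_gray` and `_mad_diff` are defined earlier in this file. As a reading aid, here is a behavior-level sketch of what the diff gate plausibly computes, assuming a mean absolute difference over downscaled grayscale frames; it is not the exact implementation:

```python
# Sketch of the assumed frame-diff gate (names mirror the helpers above).
import cv2
import numpy as np

def downscale_gray(frame_bgr: np.ndarray, resize_w: int) -> np.ndarray:
    h, w = frame_bgr.shape[:2]
    new_h = max(1, int(h * (resize_w / float(w))))
    small = cv2.resize(frame_bgr, (resize_w, new_h))
    return cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)

def mad_diff(a: np.ndarray, b: np.ndarray) -> float:
    # Mean absolute pixel difference in [0, 255]; larger = more visual change.
    return float(np.mean(cv2.absdiff(a, b)))
```
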
1296
+ # Phase 3: YOLO + OCR concurrently on keyframes that need parsing
1297
+ to_parse = []
1298
+ for it in all_selected:
1299
+ if (not args.force) and isinstance(it.get("screen_parse"), dict) and it.get("on_screen_text"):
1300
+ continue
1301
+ if it.get("image_path"):
1302
+ frame_type = str(it.get("frame_type", "none"))
1303
+ parse_mode = "yolo_ocr"
1304
+ if args.no_yolo_for_non_demo and frame_type != "demo":
1305
+ parse_mode = "ocr_only"
1306
+ to_parse.append({
1307
+ "keyframe_idx": int(it["keyframe_idx"]),
1308
+ "t_sec": float(it["t_sec"]),
1309
+ "frame_type": frame_type,
1310
+ "image_path": str(it["image_path"]),
1311
+ "parse_mode": parse_mode,
1312
+ })
1313
+
1314
+ print(f"3) Parsing kept keyframes with YOLO+OCR concurrently... to_parse={len(to_parse)}")
1315
+
1316
+ if to_parse:
1317
+ yolo_jobs = sum(1 for j in to_parse if j.get("parse_mode") == "yolo_ocr")
1318
+ ocr_only_jobs = len(to_parse) - yolo_jobs
1319
+ enable_yolo = yolo_jobs > 0
1320
+
1321
+ print(" [step3] starting ProcessPoolExecutor...")
1322
+ print(f" [step3] PARSE_WORKERS={PARSE_WORKERS} (each worker loads YOLO + PaddleOCR once)")
1323
+ print(f" [step3] YOLO_DEVICE={YOLO_DEVICE} YOLO_IMGSZ={YOLO_IMGSZ} | OCR_GPU={USE_GPU} angle_cls=False")
1324
+ print(f" [step3] OCR crops: max_regions={OCR_CROP_MAX_REGIONS} scale_by_type={OCR_CROP_SCALE_BY_TYPE}")
1325
+ print(f" [step3] Parse resize max_w_by_type={PARSE_MAX_W_BY_TYPE}")
1326
+ print(f" [step3] parse_mode split: yolo_ocr={yolo_jobs}, ocr_only={ocr_only_jobs}")
1327
+
1328
+ t0 = time.perf_counter()
1329
+
1330
+ with cf.ProcessPoolExecutor(
1331
+ max_workers=max(1, PARSE_WORKERS),
1332
+ initializer=_worker_init,
1333
+ initargs=(str(LAYOUT_YOLO_WEIGHTS), str(OCR_LANG), bool(enable_yolo)),
1334
+ ) as ex:
1335
+ fut_to_job = {ex.submit(_parse_one_keyframe, job): job for job in to_parse}
1336
+
1337
+ done_count = 0
1338
+ err_count = 0
1339
+ t_last_report = time.perf_counter()
1340
+
1341
+ for fut in cf.as_completed(fut_to_job):
1342
+ job = fut_to_job[fut]
1343
+ job_kidx = int(job.get("keyframe_idx", -1))
1344
+ done_count += 1
1345
+
1346
+ try:
1347
+ res = fut.result()
1348
+ except Exception as e:
1349
+ err_count += 1
1350
+ if 0 <= job_kidx < len(all_selected):
1351
+ all_selected[job_kidx]["screen_parse_error"] = f"worker_exception: {type(e).__name__}: {e}"
1352
+ now = time.perf_counter()
1353
+ if (now - t_last_report) >= 1.0 or done_count == len(fut_to_job):
1354
+ print(f" [step3] progress {done_count}/{len(fut_to_job)} parsed (errors={err_count})")
1355
+ t_last_report = now
1356
+ continue
1357
+
1358
+ kidx = int(res.get("keyframe_idx", job_kidx))
1359
+ if kidx < 0 or kidx >= len(all_selected):
1360
+ now = time.perf_counter()
1361
+ if (now - t_last_report) >= 1.0 or done_count == len(fut_to_job):
1362
+ print(f" [step3] progress {done_count}/{len(fut_to_job)} parsed (errors={err_count})")
1363
+ t_last_report = now
1364
+ continue
1365
+
1366
+ # Track explicit worker-level error payloads.
1367
+ if "error" in res:
1368
+ err_count += 1
1369
+ all_selected[kidx]["screen_parse_error"] = res["error"]
1370
+ else:
1373
+ all_selected[kidx]["on_screen_text"] = res.get("on_screen_text", [])[:MAX_OCR_LINES]
1374
+ all_selected[kidx]["screen_parse"] = res.get("screen_parse")
1375
+ tm = all_selected[kidx].get("timings_ms", {}) or {}
1376
+ tm.update(res.get("parse_timings_ms", {}) or {})
1377
+ all_selected[kidx]["timings_ms"] = tm
1378
+
1379
+ now = time.perf_counter()
1380
+ if (now - t_last_report) >= 1.0 or done_count == len(fut_to_job):
1381
+ print(f" [step3] progress {done_count}/{len(fut_to_job)} parsed (errors={err_count})")
1382
+ t_last_report = now
1383
+
1384
+ t3_ms = (time.perf_counter() - t0) * 1000.0
1385
+ timing_totals["parse_concurrent_ms"] += t3_ms
1386
+ print(f" [step3] done in {t3_ms/1000.0:.2f}s (errors={err_count})")
1387
+
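
`_worker_init` is defined earlier in the file; per the log line above, each pool worker loads YOLO and PaddleOCR once at startup. A sketch of that initializer pattern (the globals and exact body are assumptions):

```python
# Sketch of the per-process initializer used with ProcessPoolExecutor:
# heavy models load once per worker and are reused across submitted jobs.
_WORKER_YOLO = None
_WORKER_OCR = None

def worker_init(weights_path: str, ocr_lang: str, enable_yolo: bool) -> None:
    global _WORKER_YOLO, _WORKER_OCR
    if enable_yolo:
        from ultralytics import YOLO
        _WORKER_YOLO = YOLO(weights_path)
    from paddleocr import PaddleOCR
    _WORKER_OCR = PaddleOCR(use_angle_cls=False, lang=ocr_lang, show_log=False)
```
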
1388
+ # Rebuild buckets from final frame_type
1389
+ buckets: Dict[str, List[dict]] = {k: [] for k in out_paths.keys()}
1390
+ for it in all_selected:
1391
+ ft = it.get("frame_type", "none")
1392
+ if ft not in buckets:
1393
+ ft = "none"
1394
+ it["frame_type"] = "none"
1395
+ buckets[ft].append(it)
1396
+
1397
+ # Final writes
1398
+ t0 = time.perf_counter()
1399
+ safe_write_json(enriched_json, all_selected)
1400
+ for ft, p in out_paths.items():
1401
+ safe_write_json(p, buckets[ft])
1402
+ timing_totals["json_write_ms"] += (time.perf_counter() - t0) * 1000.0
1403
+
1404
+ total_ms = (time.perf_counter() - t_total0) * 1000.0
1405
+
1406
+ timing_summary = {
1407
+ "timing_totals_ms": {k: float(v) for k, v in timing_totals.items()},
1408
+ "total_ms": float(total_ms),
1409
+ "candidates": int(len(candidates)),
1410
+ "selected_frames": int(len(all_selected)),
1411
+ "parsed_frames": int(sum(1 for x in all_selected if isinstance(x.get("screen_parse"), dict))),
1412
+ "parse_workers": int(PARSE_WORKERS),
1413
+ "min_keyframe_gap_sec": float(MIN_KEYFRAME_GAP_SEC),
1414
+ "yolo_device": str(YOLO_DEVICE),
1415
+ "yolo_imgsz": int(YOLO_IMGSZ),
1416
+ "ocr_use_gpu": bool(USE_GPU),
1417
+ "ocr_angle_cls": False,
1418
+ "ocr_crop_max_regions": int(OCR_CROP_MAX_REGIONS),
1419
+ "ocr_crop_scale_by_type": dict(OCR_CROP_SCALE_BY_TYPE),
1420
+ "parse_max_w_by_type": dict(PARSE_MAX_W_BY_TYPE),
1421
+ }
1422
+ safe_write_json(timing_json, timing_summary)
1423
+
1424
+ print("\nDone.")
1425
+ print("Selected frames:", len(all_selected))
1426
+ print("Frames folder:", frames_dir)
1427
+ print("Parsed JSON:", enriched_json)
1428
+ print("Timing JSON:", timing_json)
1429
+ for ft, p in out_paths.items():
1430
+ print(ft, "->", p)
1431
+
1432
+ print("\nTiming summary (ms):")
1433
+ for k, v in timing_totals.items():
1434
+ print(f" {k}: {v:.0f}")
1435
+ print(f" total_ms: {total_ms:.0f}")
1436
+
1437
+
1438
+ if __name__ == "__main__":
1439
+ try:
1440
+ main()
1441
+ except Exception as e:
1442
+ print(f"[ERROR] {type(e).__name__}: {e}")
1443
+ raise
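
For debugging, the stage can be invoked standalone with the flags registered in `main()`; a sketch, with the script filename under `pipelines/` assumed:

```python
# Hypothetical standalone invocation; run_manager normally drives this stage
# through the pipeline wrapper scripts.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "-u", "pipelines/keyframes_parse.py",  # filename assumed
        "--video", "meeting.mp4",
        "--out", "out_run",
        "--no-yolo-for-non-demo",
    ],
    check=True,
)
```
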
requirements.txt ADDED
@@ -0,0 +1,17 @@
1
+ gradio>=5.0.0
2
+ fastapi==0.116.1
3
+ uvicorn==0.34.3
4
+ python-multipart==0.0.20
5
+ setuptools==70.0.0
6
+ wheel==0.45.1
7
+ python-dotenv==1.2.1
8
+ deepgram-sdk==4.8.0
9
+ httpx==0.28.1
10
+ google-genai==1.60.0
11
+ pydantic==2.12.5
12
+ opencv-python-headless==4.11.0.86
13
+ numpy==1.26.4
14
+ ultralytics==8.4.12
15
+ paddleocr==2.7.3
16
+ paddlepaddle==2.6.2
17
+ torch==2.5.1
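
A quick way to confirm the pinned stack resolved in a fresh environment (a hedged sanity check; not every package exposes `__version__`):

```python
# Import sanity check for the pinned dependencies above.
import importlib

for mod in ("gradio", "fastapi", "cv2", "numpy", "ultralytics", "paddleocr", "torch"):
    m = importlib.import_module(mod)
    print(mod, getattr(m, "__version__", "(no __version__)"))
```
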
run_manager.py ADDED
@@ -0,0 +1,581 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import os
5
+ import re
6
+ import shutil
7
+ import subprocess
8
+ import sys
9
+ import tempfile
10
+ import threading
11
+ import time
12
+ import uuid
13
+ from html import unescape
14
+ from pathlib import Path
15
+ from typing import Any, Dict, Optional
16
+ from urllib.parse import parse_qs, urljoin, urlparse
17
+
18
+ import httpx
19
+
20
+
21
+ BASE_DIR = Path(__file__).resolve().parent
22
+ PIPELINES_DIR = BASE_DIR / "pipelines"
23
+ DEFAULT_WORKDIR = Path(os.getenv("PIPELINE_WORKDIR", tempfile.gettempdir())) / "deployed-meet-runs"
24
+ DEFAULT_WORKDIR.mkdir(parents=True, exist_ok=True)
25
+ RUNS_DIR = DEFAULT_WORKDIR / "runs"
26
+ RUNS_DIR.mkdir(parents=True, exist_ok=True)
27
+
28
+
29
+ def _tail(text: str, max_lines: int = 220) -> str:
30
+ lines = (text or "").splitlines()
31
+ if len(lines) <= max_lines:
32
+ return "\n".join(lines)
33
+ return "\n".join(lines[-max_lines:])
34
+
35
+
36
+ def _run_dir(run_id: str) -> Path:
37
+ return RUNS_DIR / run_id
38
+
39
+
40
+ def _meta_path(run_id: str) -> Path:
41
+ return _run_dir(run_id) / "run_meta.json"
42
+
43
+
44
+ def _logs_path(run_id: str) -> Path:
45
+ return _run_dir(run_id) / "pipeline.log"
46
+
47
+
48
+ def _write_json(path: Path, data: Dict[str, Any]) -> None:
49
+ path.parent.mkdir(parents=True, exist_ok=True)
50
+ tmp = path.with_suffix(path.suffix + ".tmp")
51
+ tmp.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
52
+ tmp.replace(path)
53
+
54
+
55
+ def _read_json(path: Path) -> Dict[str, Any]:
56
+ return json.loads(path.read_text(encoding="utf-8"))
57
+
58
+
59
+ def _extract_gdrive_file_id(url: str) -> Optional[str]:
60
+ parsed = urlparse(url)
61
+ host = (parsed.netloc or "").lower()
62
+ if "drive.google.com" not in host:
63
+ return None
64
+
65
+ m = re.search(r"/file/d/([a-zA-Z0-9_-]+)", parsed.path or "")
66
+ if m:
67
+ return m.group(1)
68
+
69
+ qs = parse_qs(parsed.query or "")
70
+ if "id" in qs and qs["id"]:
71
+ return qs["id"][0]
72
+
73
+ return None
74
+
75
+
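
The two Google Drive URL shapes this parser handles, with an illustrative file id:

```python
from run_manager import _extract_gdrive_file_id

# Share-link form: /file/d/<id>/...
assert _extract_gdrive_file_id(
    "https://drive.google.com/file/d/abc123_-XYZ/view?usp=sharing"
) == "abc123_-XYZ"
# Direct-download form: ?id=<id>
assert _extract_gdrive_file_id(
    "https://drive.google.com/uc?export=download&id=abc123_-XYZ"
) == "abc123_-XYZ"
# Non-Drive URLs fall through to the generic downloader.
assert _extract_gdrive_file_id("https://example.com/talk.mp4") is None
```
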
76
+ def _download_google_drive(url: str, out_path: Path) -> None:
77
+ file_id = _extract_gdrive_file_id(url)
78
+ if not file_id:
79
+ raise ValueError("Could not parse Google Drive file id from video_url.")
80
+
81
+ direct_url = f"https://drive.google.com/uc?export=download&id={file_id}"
82
+
83
+ def _is_html_response(resp: httpx.Response) -> bool:
84
+ ctype = (resp.headers.get("content-type") or "").lower()
85
+ if "html" in ctype or "text/plain" in ctype:
86
+ return True
87
+ head = (resp.content[:256] or b"").lower()
88
+ return b"<html" in head or b"<!doctype html" in head
89
+
90
+ def _write_if_file(resp: httpx.Response) -> bool:
91
+ if _is_html_response(resp):
92
+ return False
93
+ if not resp.content or len(resp.content) < 1024:
94
+ return False
95
+ out_path.write_bytes(resp.content)
96
+ return True
97
+
98
+ with httpx.Client(timeout=120.0, follow_redirects=True) as client:
99
+ candidates = [
100
+ direct_url,
101
+ f"https://drive.usercontent.google.com/download?id={file_id}&export=download&confirm=t",
102
+ ]
103
+ for c in candidates:
104
+ rr = client.get(c)
105
+ rr.raise_for_status()
106
+ if _write_if_file(rr):
107
+ return
108
+
109
+ page = client.get(f"https://drive.google.com/file/d/{file_id}/view")
110
+ page.raise_for_status()
111
+ html = page.text or ""
112
+
113
+ form_action_match = re.search(r'id="download-form"[^>]*action="([^"]+)"', html)
114
+ if form_action_match:
115
+ action = unescape(form_action_match.group(1))
116
+ action_url = urljoin("https://drive.google.com", action)
117
+ params = {k: v for k, v in re.findall(r'<input[^>]+name="([^"]+)"[^>]+value="([^"]*)"', html)}
118
+ form_resp = client.get(action_url, params=params)
119
+ form_resp.raise_for_status()
120
+ if _write_if_file(form_resp):
121
+ return
122
+
123
+ link_match = re.search(r'href="(/uc\?export=download[^"]+)"', html)
124
+ if link_match:
125
+ href = unescape(link_match.group(1)).replace("&amp;", "&")
126
+ link_url = urljoin("https://drive.google.com", href)
127
+ link_resp = client.get(link_url)
128
+ link_resp.raise_for_status()
129
+ if _write_if_file(link_resp):
130
+ return
131
+
132
+ cookie_confirm = None
133
+ for k, v in page.cookies.items():
134
+ if str(k).startswith("download_warning"):
135
+ cookie_confirm = v
136
+ break
137
+ if cookie_confirm:
138
+ confirm_url = f"https://drive.google.com/uc?export=download&confirm={cookie_confirm}&id={file_id}"
139
+ confirm_resp = client.get(confirm_url)
140
+ confirm_resp.raise_for_status()
141
+ if _write_if_file(confirm_resp):
142
+ return
143
+
144
+ msg = "Google Drive link did not provide a downloadable file."
145
+ low = html.lower()
146
+ if "you need access" in low or "request access" in low:
147
+ msg += " File is not publicly accessible."
148
+ elif "quota exceeded" in low or "too many users have viewed or downloaded" in low:
149
+ msg += " File appears to be quota-limited by Google Drive."
150
+ else:
151
+ msg += " Use a publicly accessible direct file link or local video file upload."
152
+ raise ValueError(msg)
153
+
154
+
155
+ def _validate_video_file(path: Path) -> None:
156
+ if not path.exists() or not path.is_file():
157
+ raise ValueError(f"Input video file not found: {path}")
158
+
159
+ size = path.stat().st_size
160
+ if size < 1024:
161
+ raise ValueError(f"Input file is too small to be valid media: {path} ({size} bytes)")
162
+
163
+ try:
164
+ # Read only the first 4 KiB; read_bytes() would load the whole video into memory.
+ with path.open("rb") as fh:
+ head = fh.read(4096).lower()
165
+ if b"<html" in head or b"<!doctype html" in head or b"{\"error\"" in head:
166
+ raise ValueError(
167
+ "Downloaded input is not a media file (looks like HTML/JSON response). "
168
+ "Use a direct video URL or upload a file."
169
+ )
170
+ except ValueError:
171
+ raise
172
+ except Exception:
173
+ pass
174
+
175
+ try:
176
+ import cv2
177
+
178
+ cap = cv2.VideoCapture(str(path))
179
+ ok = cap.isOpened()
180
+ frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
181
+ cap.release()
182
+ if (not ok) or frame_count <= 0:
183
+ raise ValueError(
184
+ "Input file is not a decodable video for this runtime. "
185
+ "Provide a valid MP4 (H.264/AAC recommended)."
186
+ )
187
+ except ValueError:
188
+ raise
189
+ except Exception:
190
+ pass
191
+
192
+
193
+ def _resolve_python_executable(python_bin: Optional[str]) -> str:
194
+ if python_bin:
195
+ p = Path(python_bin).expanduser()
196
+ if not p.exists():
197
+ raise ValueError(f"python_bin does not exist: {p}")
198
+ return str(p.resolve())
199
+
200
+ candidates = [
201
+ BASE_DIR.parent / ".venv" / "Scripts" / "python.exe",
202
+ BASE_DIR / ".venv" / "Scripts" / "python.exe",
203
+ BASE_DIR.parent / ".venv" / "bin" / "python",
204
+ BASE_DIR / ".venv" / "bin" / "python",
205
+ ]
206
+ for c in candidates:
207
+ if c.exists():
208
+ return str(c.resolve())
209
+
210
+ return sys.executable or os.getenv("PYTHON_BIN") or "python"
211
+
212
+
213
+ def _resolve_out_dir(out_dir: Optional[str], run_id: str) -> Path:
214
+ if out_dir:
215
+ p = Path(out_dir)
216
+ if not p.is_absolute():
217
+ p = DEFAULT_WORKDIR / p
218
+ else:
219
+ p = DEFAULT_WORKDIR / f"run_{run_id}"
220
+ p.mkdir(parents=True, exist_ok=True)
221
+ return p.resolve()
222
+
223
+
224
+ def _build_common_args(
225
+ *,
226
+ video_path: Path,
227
+ out_dir: Path,
228
+ deepgram_model: str,
229
+ deepgram_language: Optional[str],
230
+ deepgram_request_timeout_sec: float,
231
+ deepgram_connect_timeout_sec: float,
232
+ deepgram_retries: int,
233
+ deepgram_retry_backoff_sec: float,
234
+ force_deepgram: bool,
235
+ force_keyframes: bool,
236
+ pre_roll_sec: float,
237
+ gemini_model: str,
238
+ similarity_threshold: float,
239
+ temperature: float,
240
+ ) -> list[str]:
241
+ args = [
242
+ "--video",
243
+ str(video_path),
244
+ "--out",
245
+ str(out_dir),
246
+ "--deepgram-model",
247
+ deepgram_model,
248
+ "--deepgram-request-timeout-sec",
249
+ str(deepgram_request_timeout_sec),
250
+ "--deepgram-connect-timeout-sec",
251
+ str(deepgram_connect_timeout_sec),
252
+ "--deepgram-retries",
253
+ str(deepgram_retries),
254
+ "--deepgram-retry-backoff-sec",
255
+ str(deepgram_retry_backoff_sec),
256
+ "--pre-roll-sec",
257
+ str(pre_roll_sec),
258
+ "--gemini-model",
259
+ gemini_model,
260
+ "--similarity-threshold",
261
+ str(similarity_threshold),
262
+ "--temperature",
263
+ str(temperature),
264
+ ]
265
+ if deepgram_language:
266
+ args.extend(["--deepgram-language", deepgram_language])
267
+ if force_deepgram:
268
+ args.append("--force-deepgram")
269
+ if force_keyframes:
270
+ args.append("--force-keyframes")
271
+ return args
272
+
273
+
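
For reference, a sketch of the flag list this assembles for one hypothetical configuration (all values below are examples, not defaults):

```python
from pathlib import Path
from run_manager import _build_common_args

args = _build_common_args(
    video_path=Path("/data/run/input.mp4"),
    out_dir=Path("/data/run/out"),
    deepgram_model="nova-2",              # example value
    deepgram_language="en",
    deepgram_request_timeout_sec=600.0,
    deepgram_connect_timeout_sec=30.0,
    deepgram_retries=2,
    deepgram_retry_backoff_sec=5.0,
    force_deepgram=False,
    force_keyframes=True,
    pre_roll_sec=2.0,
    gemini_model="gemini-2.0-flash",      # example value
    similarity_threshold=0.55,
    temperature=0.2,
)
# args -> ["--video", "/data/run/input.mp4", "--out", "/data/run/out", ...,
#          "--deepgram-language", "en", "--force-keyframes"]
```
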
274
+ def _build_output_files(out_dir: Path, variant: str) -> Dict[str, str]:
275
+ return {
276
+ "utterances": str(out_dir / "utterances.json"),
277
+ "keyframes_parsed": str(out_dir / "keyframes_parsed.json"),
278
+ "keyframes_with_utterances": str(out_dir / "keyframes_with_utterances.json"),
279
+ "final_output": str(
280
+ out_dir / ("final_output.json" if variant == "full" else "final_output_demo_code.json")
281
+ ),
282
+ "final_output_condensed": str(
283
+ out_dir / ("final_output_condensed.json" if variant == "full" else "final_output_demo_code_condensed.json")
284
+ ),
285
+ }
286
+
287
+
288
+ def _artifact_state(output_files: Dict[str, str]) -> Dict[str, Dict[str, Any]]:
289
+ state: Dict[str, Dict[str, Any]] = {}
290
+ for key, p in output_files.items():
291
+ path = Path(p)
292
+ if path.exists():
293
+ try:
294
+ st = path.stat()
295
+ state[key] = {
296
+ "size_bytes": int(st.st_size),
297
+ "mtime": float(st.st_mtime),
298
+ }
299
+ except Exception:
300
+ state[key] = {"size_bytes": -1, "mtime": -1.0}
301
+ return state
302
+
303
+
304
+ def _format_artifact_compact(state: Dict[str, Dict[str, Any]]) -> str:
305
+ if not state:
306
+ return "none"
307
+ parts = []
308
+ for k in sorted(state.keys()):
309
+ sz = float(state[k].get("size_bytes", 0))
310
+ parts.append(f"{k}:{sz/1024.0:.1f}KB")
311
+ return ", ".join(parts)
312
+
313
+
314
+ def _watch_run(run_id: str, proc: subprocess.Popen, started_at: float, log_fh, heartbeat_sec: float) -> None:
315
+ heartbeat_sec = max(2.0, float(heartbeat_sec))
316
+ last_hb = 0.0
317
+ last_artifact_change = started_at
318
+ last_state: Dict[str, Dict[str, Any]] = {}
319
+
320
+ while True:
321
+ now = time.time()
322
+ rc = proc.poll()
323
+
324
+ if (now - last_hb) >= heartbeat_sec:
325
+ try:
326
+ meta_file = _meta_path(run_id)
327
+ meta = _read_json(meta_file) if meta_file.exists() else {"run_id": run_id}
328
+ out_files = meta.get("output_files", {}) or {}
329
+ cur_state = _artifact_state(out_files)
330
+ changed = cur_state != last_state
331
+ if changed:
332
+ last_artifact_change = now
333
+ unchanged_for = now - last_artifact_change
334
+ elapsed = now - started_at
335
+
336
+ log_fh.write(
337
+ "[runner] heartbeat "
338
+ f"elapsed={elapsed:.1f}s pid={proc.pid} "
339
+ f"artifacts={len(cur_state)}/{len(out_files)} "
340
+ f"changed={'yes' if changed else 'no'} "
341
+ f"unchanged_for={unchanged_for:.1f}s "
342
+ f"[{_format_artifact_compact(cur_state)}]\n"
343
+ )
344
+ log_fh.flush()
345
+
346
+ meta["last_heartbeat_epoch"] = now
347
+ meta["last_heartbeat_elapsed_sec"] = round(elapsed, 3)
348
+ meta["artifacts_ready_count"] = len(cur_state)
349
+ meta["artifacts_total_count"] = len(out_files)
350
+ meta["artifacts_unchanged_for_sec"] = round(unchanged_for, 3)
351
+ _write_json(meta_file, meta)
352
+ last_state = cur_state
353
+ except Exception as e:
354
+ try:
355
+ log_fh.write(f"[runner] heartbeat_error: {type(e).__name__}: {e}\n")
356
+ log_fh.flush()
357
+ except Exception:
358
+ pass
359
+ last_hb = now
360
+
361
+ if rc is not None:
362
+ return_code = int(rc)
363
+ break
364
+
365
+ time.sleep(1.0)
366
+
367
+ finished_at = time.time()
368
+ try:
369
+ meta_file = _meta_path(run_id)
370
+ meta = _read_json(meta_file) if meta_file.exists() else {"run_id": run_id}
371
+ meta["status"] = "succeeded" if return_code == 0 else "failed"
372
+ meta["exit_code"] = int(return_code)
373
+ meta["finished_at_epoch"] = finished_at
374
+ meta["duration_sec"] = round(finished_at - started_at, 3)
375
+ _write_json(meta_file, meta)
376
+ except Exception as e:
377
+ try:
378
+ log_fh.write(f"\n[runner] failed to update metadata: {type(e).__name__}: {e}\n")
379
+ log_fh.flush()
380
+ except Exception:
381
+ pass
382
+
383
+ try:
384
+ log_fh.write(f"\n[runner] process finished with exit_code={return_code}\n")
385
+ log_fh.flush()
386
+ except Exception:
387
+ pass
388
+ finally:
389
+ try:
390
+ log_fh.close()
391
+ except Exception:
392
+ pass
393
+
394
+
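
One way a caller might use the heartbeat fields `_watch_run` maintains in `run_meta.json` to detect a stalled run (the ten-minute threshold is an assumption, not a pipeline default):

```python
# Caller-side stall check built on the heartbeat metadata written above.
import json
from pathlib import Path

def looks_stalled(meta_path: Path, max_quiet_sec: float = 600.0) -> bool:
    """True if a running pipeline's artifacts have not changed recently."""
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    if meta.get("status") != "running":
        return False
    return float(meta.get("artifacts_unchanged_for_sec", 0.0)) > max_quiet_sec
```
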
395
+ def start_run(
396
+ *,
397
+ variant: str,
398
+ video_file_path: Optional[str],
399
+ video_url: Optional[str],
400
+ out_dir: Optional[str],
401
+ python_bin: Optional[str],
402
+ deepgram_model: str,
403
+ deepgram_language: Optional[str],
404
+ deepgram_request_timeout_sec: float,
405
+ deepgram_connect_timeout_sec: float,
406
+ deepgram_retries: int,
407
+ deepgram_retry_backoff_sec: float,
408
+ force_deepgram: bool,
409
+ force_keyframes: bool,
410
+ pre_roll_sec: float,
411
+ gemini_model: str,
412
+ similarity_threshold: float,
413
+ temperature: float,
414
+ log_heartbeat_sec: float = 10.0,
415
+ ) -> Dict[str, Any]:
416
+ script_name = {
417
+ "full": "run_pipeline_all.py",
418
+ "demo-code": "run_pipeline_demo_code.py",
419
+ }.get(variant)
420
+ if not script_name:
421
+ raise ValueError("variant must be one of: full, demo-code")
422
+
423
+ pipeline_script = PIPELINES_DIR / script_name
424
+ if not pipeline_script.exists():
425
+ raise FileNotFoundError(f"Missing pipeline script: {pipeline_script}")
426
+
427
+ run_id = uuid.uuid4().hex[:12]
428
+ run_dir = _run_dir(run_id)
429
+ run_dir.mkdir(parents=True, exist_ok=True)
430
+
431
+ if video_file_path:
432
+ src = Path(video_file_path).expanduser().resolve()
433
+ if not src.exists():
434
+ raise ValueError(f"Uploaded/local video file not found: {src}")
435
+ dst = run_dir / f"input_{run_id}{src.suffix or '.mp4'}"
436
+ shutil.copy2(src, dst)
437
+ video_path = dst
438
+ elif video_url:
439
+ # Take the suffix from the URL path so query strings don't leak into the filename.
+ suffix = Path(urlparse(video_url).path).suffix or ".mp4"
440
+ video_path = run_dir / f"input_{run_id}{suffix}"
441
+ if _extract_gdrive_file_id(video_url):
442
+ _download_google_drive(video_url, video_path)
443
+ else:
444
+ with httpx.stream("GET", video_url, timeout=120.0, follow_redirects=True) as r:
445
+ r.raise_for_status()
446
+ with open(video_path, "wb") as f:
447
+ for chunk in r.iter_bytes():
448
+ f.write(chunk)
449
+ else:
450
+ raise ValueError("Provide one of: video_file_path or video_url")
451
+
452
+ _validate_video_file(video_path)
453
+ out_path = _resolve_out_dir(out_dir, run_id)
454
+ python_exe = _resolve_python_executable(python_bin)
455
+
456
+ cmd = [
457
+ python_exe,
458
+ "-u",
459
+ str(pipeline_script),
460
+ "--python",
461
+ python_exe,
462
+ *_build_common_args(
463
+ video_path=video_path,
464
+ out_dir=out_path,
465
+ deepgram_model=deepgram_model,
466
+ deepgram_language=deepgram_language,
467
+ deepgram_request_timeout_sec=deepgram_request_timeout_sec,
468
+ deepgram_connect_timeout_sec=deepgram_connect_timeout_sec,
469
+ deepgram_retries=deepgram_retries,
470
+ deepgram_retry_backoff_sec=deepgram_retry_backoff_sec,
471
+ force_deepgram=force_deepgram,
472
+ force_keyframes=force_keyframes,
473
+ pre_roll_sec=pre_roll_sec,
474
+ gemini_model=gemini_model,
475
+ similarity_threshold=similarity_threshold,
476
+ temperature=temperature,
477
+ ),
478
+ ]
479
+
480
+ started = time.time()
481
+ logs_path = _logs_path(run_id)
482
+ log_fh = open(logs_path, "a", encoding="utf-8", buffering=1)
483
+ log_fh.write(
484
+ f"[runner] run_id={run_id} variant={variant} started_at_epoch={started}\n"
485
+ f"[runner] command={' '.join(cmd)}\n"
486
+ f"[runner] cwd={PIPELINES_DIR}\n\n"
487
+ f"[runner] heartbeat_interval_sec={log_heartbeat_sec}\n"
488
+ f"[runner] python_unbuffered=1\n\n"
489
+ )
490
+ log_fh.flush()
491
+
492
+ child_env = os.environ.copy()
493
+ child_env["PYTHONUNBUFFERED"] = "1"
494
+ child_env.setdefault("PYTHONIOENCODING", "utf-8")
495
+
496
+ proc = subprocess.Popen(
497
+ cmd,
498
+ cwd=str(PIPELINES_DIR),
499
+ stdout=log_fh,
500
+ stderr=subprocess.STDOUT,
501
+ text=True,
502
+ env=child_env,
503
+ )
504
+
505
+ meta = {
506
+ "variant": variant,
507
+ "run_id": run_id,
508
+ "python_executable": python_exe,
509
+ "command": cmd,
510
+ "status": "running",
511
+ "exit_code": None,
512
+ "pid": proc.pid,
513
+ "started_at_epoch": started,
514
+ "finished_at_epoch": None,
515
+ "duration_sec": None,
516
+ "out_dir": str(out_path),
517
+ "logs_path": str(logs_path),
518
+ "heartbeat_interval_sec": float(log_heartbeat_sec),
519
+ "output_files": _build_output_files(out_path, variant),
520
+ }
521
+ _write_json(_meta_path(run_id), meta)
522
+
523
+ watcher = threading.Thread(
524
+ target=_watch_run,
525
+ args=(run_id, proc, started, log_fh, float(log_heartbeat_sec)),
526
+ daemon=True,
527
+ )
528
+ watcher.start()
529
+
530
+ return {
531
+ "run_id": run_id,
532
+ "variant": variant,
533
+ "status": "running",
534
+ "python_executable": python_exe,
535
+ "status_path": f"runs/{run_id}",
536
+ "logs_path": f"runs/{run_id}/logs",
537
+ "final_output_path": f"runs/{run_id}/final-output",
538
+ "final_output_condensed_path": f"runs/{run_id}/final-output/condensed",
539
+ "out_dir": str(out_path),
540
+ }
541
+
542
+
543
+ def get_status(run_id: str) -> Dict[str, Any]:
544
+ p = _meta_path(run_id)
545
+ if not p.exists():
546
+ raise FileNotFoundError(f"Unknown run_id: {run_id}")
547
+ return _read_json(p)
548
+
549
+
550
+ def get_logs(run_id: str, tail_lines: int = 300) -> str:
551
+ meta = get_status(run_id)
552
+ p = Path(meta.get("logs_path", ""))
553
+ if not p.exists():
554
+ return ""
555
+ txt = p.read_text(encoding="utf-8", errors="replace")
556
+ limit = max(1, min(int(tail_lines), 5000))
557
+ return _tail(txt, max_lines=limit)
558
+
559
+
560
+ def get_final_output(run_id: str, condensed: bool = False) -> Dict[str, Any]:
561
+ meta = get_status(run_id)
562
+ status = meta.get("status")
563
+ key = "final_output_condensed" if condensed else "final_output"
564
+ out_file = Path(meta["output_files"][key])
565
+
566
+ if status == "running":
567
+ return {
568
+ "run_id": run_id,
569
+ "status": status,
570
+ "message": "Pipeline is still running. Check logs.",
571
+ }
572
+ if status == "failed":
573
+ return {
574
+ "run_id": run_id,
575
+ "status": status,
576
+ "message": "Pipeline failed. Check logs.",
577
+ }
578
+ if not out_file.exists():
579
+ raise FileNotFoundError(f"Output not found: {out_file}")
580
+ return _read_json(out_file)
581
+
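
Putting the module together, a minimal polling client; the model names and numeric values are illustrative:

```python
# Minimal sketch of driving run_manager end to end.
import time
from run_manager import start_run, get_status, get_logs, get_final_output

info = start_run(
    variant="demo-code",
    video_file_path="meeting.mp4",
    video_url=None,
    out_dir=None,
    python_bin=None,
    deepgram_model="nova-2",          # example value
    deepgram_language=None,
    deepgram_request_timeout_sec=600.0,
    deepgram_connect_timeout_sec=30.0,
    deepgram_retries=2,
    deepgram_retry_backoff_sec=5.0,
    force_deepgram=False,
    force_keyframes=False,
    pre_roll_sec=2.0,
    gemini_model="gemini-2.0-flash",  # example value
    similarity_threshold=0.55,
    temperature=0.2,
)
run_id = info["run_id"]
while get_status(run_id)["status"] == "running":
    time.sleep(10)
print(get_logs(run_id, tail_lines=50))
result = get_final_output(run_id, condensed=True)
```
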
vercel.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "version": 2,
3
+ "rewrites": [
4
+ { "source": "/(.*)", "destination": "/api/index" }
5
+ ],
6
+ "functions": {
7
+ "api/index.py": {
8
+ "maxDuration": 900
9
+ }
10
+ }
11
+ }
12
+