Space: nexusbert/DSN

nexusbert committed 11 days ago

Commit d47b370 (1 parent: ff602f5)

Refactor agent workflow and update documentation for Gemini-first implementation

- Updated AGENT_WORKFLOW.md to reflect changes in API, architecture, and environment variables.
- Revised README.md to clarify the use of Gemini API for Task A and Task B, and removed references to local LLM inference.
- Refactored recommendation_pipeline.py to streamline ranking logic and remove local LLM dependencies.
- Simplified user_modeling.py by enforcing Gemini usage for generation and removing local LLM code.
- Cleaned up shared_models.py by removing unused local LLM functions and optimizing embedder initialization.

Files changed (5)

AGENT_WORKFLOW.md +67 -62
README.md +5 -5
app/recommendation_pipeline.py +52 -320
app/shared_models.py +1 -62
app/user_modeling.py +21 -97

AGENT_WORKFLOW.md CHANGED

@@ -1,6 +1,6 @@
 # Agent workflow (Task A & Task B)
-This document matches the API `agent_steps`, the code layout, and environment variables. Use it as the backbone for the **solution paper** (architecture, experiments, limitations).
 ## Architecture overview
@@ -10,14 +10,14 @@ flowchart TB
     P[STARTUP_PREWARM=all]
     S[shared_models.warm_shared_weights]
     A[UserModelingService.warm - RAG index]
-    B[RecommendationService.warm - catalog]
     P --> S --> A
     P --> B
   end
   subgraph taskA [POST /user-modeling]
     R1[yelp_rag_retrieve]
-    G1[gemini_generate or local_hf]
     P1[parse_stars_review]
     R1 --> G1 --> P1
   end
@@ -25,8 +25,8 @@ flowchart TB
   subgraph taskB [POST /recommendation]
     E2[embedded_persona_context]
     V2[vector_retrieval_top_K]
-    L2[gemini_rank or local_hf_rank]
-    E2 --> V2 --> L2
   end
   S -.-> E2
@@ -35,11 +35,12 @@ flowchart TB
 | Layer | Role |
 |--------|------|
-| **`app/gemini_client.py`** | **Gemini API** text generation when `GEMINI_API_KEY` is set (`GENERATION_BACKEND=gemini` or `auto`). |
-| **`app/shared_models.py`** | Shared **SentenceTransformer**; optional local causal LM only when `GENERATION_BACKEND=local`. |
-| **`app/main.py`** | FastAPI routes; **`asyncio.to_thread`** for CPU/API work. |
-| **`scripts/docker_build_assets.py`** | Image build: embedder snapshot + JSONL indexes; skips Qwen download when `SKIP_LOCAL_LLM_HUB_DOWNLOAD=1`. |
-| **Runtime** | Embeddings from baked `/models/huggingface`; generation via **Gemini** by default. |
 ---
@@ -47,34 +48,33 @@ flowchart TB
 | Variable | Purpose |
 |----------|---------|
-| `GENERATION_BACKEND` | `gemini`, `local`, or `auto` (default: Gemini if `GEMINI_API_KEY` is set, else local Qwen). |
-| `GEMINI_API_KEY` / `GOOGLE_API_KEY` | Google AI Studio API key for Task A + Task B generation. |
 | `GEMINI_MODEL` | Model id (default `gemini-2.0-flash`). |
-| `SKIP_LOCAL_LLM_HUB_DOWNLOAD` | `1` at Docker build: do not bake Qwen weights (use with Gemini). |
-| `LOCAL_EMBEDDING_MODEL` | Shared embedder (default `all-MiniLM-L6-v2`) for Task A RAG queries and Task B retrieval. |
-| `LOCAL_LLM_MODEL` | Local causal LM when `GENERATION_BACKEND=local` (default `Qwen/Qwen2.5-1.5B-Instruct`). |
-| `TASK_A_REVIEWS_EMBEDDED` | Path to embedded review snippets JSONL for RAG. |
-| `TASK_A_RAG_TOP_K` | Snippets passed into the Task A prompt (default `5`). |
-| `TASK_A_MAX_TOKENS` / `TASK_A_TEMPERATURE` | Generation limits for Task A. |
 | `TASK_B_EMBEDDED_CATALOG` | Path to embedded business catalog JSONL. |
-| `TASK_B_LLM_CANDIDATE_CAP` | Max candidates sent to the LLM reranker (default `6`). |
-| `TASK_B_RANK_MODE` | `llm` (default) or `retrieval` (embedding order only, <1s). `TASK_B_FAST_RANK=1` forces retrieval. |
-| `STARTUP_PREWARM` | `all` (default): load shared weights + RAG + catalog before traffic. |
-| `SKIP_STARTUP_PREWARM` | Set to `1` to skip startup load (not recommended on Spaces). |
-| `HF_TOKEN` | Optional; pass into **Docker build** for Hub rate limits when downloading models. |
-Optional overrides (only if tasks use different models): `TASK_A_EMBEDDING_MODEL`, `TASK_A_LOCAL_LLM_MODEL`, `TASK_B_LOCAL_EMBEDDING_MODEL`, `TASK_B_LOCAL_LLM_MODEL`.
 ---
 ## Shared behaviour
-- **Generation:** **Gemini API** by default; **local Qwen** if `GENERATION_BACKEND=local`. Embeddings always **local** MiniLM.
-- **Nigerian English (competition bonus):** Task A reviews and Task B rationales are prompted for **natural Nigerian English** (see `user_modeling_prompt.py`, `recommendation_pipeline.py`).
-- **Task A point of view:** Reviews must be **first person** (`I` / `my`) as the user who visited — not third-person narration about the user.
-- **Failure handling:** Task A retries once with a **format + POV** nudge if parsing fails. Task B uses **popularity fallback** if reranker JSON is invalid.
-- **Deployment (2 vCPU / 16 GB):** Use **`uvicorn --workers 1`**. With Gemini, startup only loads the embedder (fast, low RAM). With local Qwen, one generation at a time via **`inference_lock`**.
-- **Swagger:** With **Gemini**, Task B usually finishes in a few seconds. Local CPU rerank may still exceed browser timeouts.
 ---
@@ -82,19 +82,19 @@ Optional overrides (only if tasks use different models): `TASK_A_EMBEDDING_MODEL
 **Endpoint:** `POST /user-modeling` (aliases `/task-1`, `/task_a`)
-**Input:** `persona` (multiline snapshot; optional `user_id: …` for RAG boost), `product` (business facts), `include_raw`.
-**Output:** `stars`, `review`, `parse_ok`, `rag_snippets_used`, `agent_steps`.
 | Step | `agent_steps` | Code | What happens |
-|------|---------------|------|----------------|
-| 1 | `yelp_rag_retrieve` *(if index exists)* | `UserModelingService._retrieve_examples` → `TaskAReviewRagIndex.retrieve` | Embed `persona + product` with `LOCAL_EMBEDDING_MODEL`; top‑`TASK_A_RAG_TOP_K` snippets; prefer same `user_id` when present in persona. |
-| 2 | `gemini_generate` or `local_hf_causal_lm` | `_generate` / `_generate_fix` | Gemini API or shared local Qwen; Nigerian voice + **first-person** rules; RAG + snapshot + business. |
-| 3 | `parse_stars_review` | `parse_model_output` | Parse `Stars:` and `Review:`; one retry if missing. |
-If `TASK_A_REVIEWS_EMBEDDED` is missing, step 1 is skipped (`rag_snippets_used: 0`).
-**Build index:** `scripts/build_task_a_review_rag.py` (Yelp `review.json` + `business.json`).
 ---
@@ -102,47 +102,52 @@ If `TASK_A_REVIEWS_EMBEDDED` is missing, step 1 is skipped (`rag_snippets_used:
 **Endpoint:** `POST /recommendation` (aliases `/task-2`, `/task_b`)
-**Input:** `persona`, optional `city` / `state`, `chat_history`, `top_k_retrieval` (default **20**), `top_n_final` (default **3**).
-**Output:** `recommendations[]` with `business_id`, `rank`, `rationale`, plus `candidates_considered`, `agent_steps`.
 | Step | `agent_steps` | Code | What happens |
-|------|---------------|------|----------------|
-| 1 | `embedded_persona_context` | `_embed_persona_local` → `get_embedder()` | Persona + recent chat encoded with `LOCAL_EMBEDDING_MODEL`. |
-| 2 | `vector_retrieval_top_{K}` | `CatalogIndex.retrieve` | Cosine search on `TASK_B_EMBEDDED_CATALOG`; optional city/state filter. |
-| 3 | `gemini_reason_and_rank`, `local_hf_llm_reason_and_rank`, or `retrieval_score_rank` | `chat_rank_gemini`, `chat_rank_local_hf`, or `retrieval_rank` | LLM rerank ≤6 candidates; retrieval mode skips API. |
-**Build index:** `scripts/build_business_catalog.py` → `scripts/embed_catalog.py` (same embedding model as runtime).
 ---
 ## Docker build vs container start
 | Phase | What runs | What you get |
-|-------|-----------|----------------|
-| **`docker build`** | `snapshot_download` for embedder + LLM weights; stub or Yelp JSONL; `DOCKER_BUILD_SKIP_LLM_WARM=1` by default | Model **files** on disk in the image; no second full Qwen load during build (prevents exit 137 OOM). |
-| **Container start** | `STARTUP_PREWARM=all` → `warm_shared_weights()` + RAG + catalog load | Models loaded **once** into RAM (~1–2 min on CPU); then requests are much faster. |
-| **Each request** | `asyncio.to_thread` + shared models | One inference at a time under `inference_lock`; `/health` and `/docs` stay responsive. |
 ---
 ## Reproducibility checklist
-1. `cp env.example .env` — set `HF_TOKEN` if needed for builds.
-2. Build indexes locally (optional if using Docker stubs):
-   - `python scripts/build_task_a_review_rag.py …`
-   - `python scripts/build_business_catalog.py …` then `python scripts/embed_catalog.py …`
-3. `docker build` / Space rebuild with build-time `HF_TOKEN` if Hub rate-limits.
-4. `docker compose up` or `uvicorn app.main:app --host 0.0.0.0 --port 8080`
-5. Wait for logs: `Startup prewarm complete.`
-6. Smoke test: `GET /health`, then `POST /user-modeling`, then `POST /recommendation` with `top_k_retrieval: 15`, `top_n_final: 5`.
 ---
 ## Paper pointers
-- **Why RAG for Task A:** Calibrate stars and voice from real Yelp snippets; `user_id` match surfaces same-user style when available.
-- **Why two-stage Task B:** Vector retrieval scales; LLM reranking adds persona-conditioned explanations.
-- **Why shared models:** One embedder + one LM for both tasks — fits 16 GB RAM with a single worker.
-- **Nigerian English + first-person Task A:** Prompt-level design; ablate with/without locale or POV blocks.
-- **Limitations:** CPU latency, single-worker queue, stub catalog when Yelp JSON is not baked into the image, small LM vs fine-tuned Azure baseline.

 # Agent workflow (Task A & Task B)
+This document matches the current API `agent_steps`, application implementation, and environment variables. Use it as the backbone for architecture notes, evaluation, and deployment.
 ## Architecture overview
     P[STARTUP_PREWARM=all]
     S[shared_models.warm_shared_weights]
     A[UserModelingService.warm - RAG index]
+    B[RecommendationService.ensure_catalog]
     P --> S --> A
     P --> B
   end
   subgraph taskA [POST /user-modeling]
     R1[yelp_rag_retrieve]
+    G1[gemini_generate]
     P1[parse_stars_review]
     R1 --> G1 --> P1
   end
   subgraph taskB [POST /recommendation]
     E2[embedded_persona_context]
     V2[vector_retrieval_top_K]
+    G2[gemini_reason_and_rank]
+    E2 --> V2 --> G2
   end
   S -.-> E2
 | Layer | Role |
 |--------|------|
+| **`app/gemini_client.py`** | Gemini API generation for Task A and Task B. |
+| **`app/shared_models.py`** | Shared **SentenceTransformer** embedder for RAG and retrieval. |
+| **`app/main.py`** | FastAPI routes and lifecycle with `asyncio.to_thread`. |
+| **`app/user_modeling.py`** | Task A user simulation: RAG + Gemini generation + output parsing. |
+| **`app/recommendation_pipeline.py`** | Task B recommendation: persona retrieval + Gemini reranking. |
+| **`scripts/docker_build_assets.py`** | Builds embedding snapshots and JSONL indexes; honors `SKIP_LOCAL_LLM_HUB_DOWNLOAD` for Gemini-first deploys. |
 ---
 | Variable | Purpose |
 |----------|---------|
+| `GENERATION_BACKEND` | `gemini` or `auto` (auto picks Gemini when `GEMINI_API_KEY` / `GOOGLE_API_KEY` is set). |
+| `GEMINI_API_KEY` / `GOOGLE_API_KEY` | Required for Gemini API generation. |
 | `GEMINI_MODEL` | Model id (default `gemini-2.0-flash`). |
+| `SKIP_LOCAL_LLM_HUB_DOWNLOAD` | `1` at Docker build: skip local Qwen download and keep the image Gemini-focused. |
+| `LOCAL_EMBEDDING_MODEL` | Shared embedder (default `all-MiniLM-L6-v2`) for Task A RAG and Task B retrieval. |
+| `TASK_A_REVIEWS_EMBEDDED` | Path to Task A RAG snippet JSONL. |
+| `TASK_A_RAG_TOP_K` | Number of snippets retrieved for Task A (default `5`). |
+| `TASK_A_MAX_TOKENS` | Token limit for Task A generation. |
+| `TASK_A_TEMPERATURE` | Temperature for Task A Gemini generation. |
 | `TASK_B_EMBEDDED_CATALOG` | Path to embedded business catalog JSONL. |
+| `TASK_B_MAX_OUTPUT_TOKENS` | Max tokens for Task B Gemini rank output. |
+| `STARTUP_PREWARM` | `all` (default): warm shared embedder and indexes at startup. |
+| `SKIP_STARTUP_PREWARM` | Set to `1` to skip startup prewarm. |
+| `HF_TOKEN` | Optional token for Docker build-time Hub downloads. |
+> Note: current app code is Gemini-first. Local causal LLM inference is not used by the active Task A/B code paths.
 ---
 ## Shared behaviour
+- **Generation:** Gemini API only in active Task A/B code paths.
+- **Embeddings:** Local SentenceTransformer embeddings are used for Task A RAG and Task B retrieval.
+- **Nigerian English:** Prompt design enforces natural Nigerian English in Task A reviews and Task B rationales.
+- **Task A POV:** Reviews must be first-person from the user’s perspective (`I`, `my`, `me`).
+- **Failure handling:** Task A retries once with a stricter format/POV fix prompt if parsing fails. Task B falls back to retrieval order with safe rationale text if Gemini output cannot be parsed.
+- **Startup preload:** `STARTUP_PREWARM=all` warms the shared embedder and loads indexes before traffic.
 ---
 **Endpoint:** `POST /user-modeling` (aliases `/task-1`, `/task_a`)
+**Input:** `persona`, `product`, `include_raw`.
+**Output:** `task`, `agent_steps`, `rag_snippets_used`, `stars`, `review`, `parse_ok`, `raw` (optional).
 | Step | `agent_steps` | Code | What happens |
+|------|---------------|------|-------------|
+| 1 | `yelp_rag_retrieve` *(if index exists)* | `UserModelingService._retrieve_examples` → `TaskAReviewRagIndex.retrieve` | Embed `persona + product` with `LOCAL_EMBEDDING_MODEL`; retrieve top `TASK_A_RAG_TOP_K` snippets for style calibration. |
+| 2 | `gemini_generate` | `UserModelingService._generate` | Gemini generates a Nigerian English first-person review and star rating. |
+| 3 | `parse_stars_review` | `parse_model_output` | Extract `Stars:` and `Review:`; retry once if parse is incomplete. |
+If the RAG index file is missing, step 1 is skipped and `rag_snippets_used` is `0`.
+**Build index:** `scripts/build_task_a_review_rag.py`.
 ---
 **Endpoint:** `POST /recommendation` (aliases `/task-2`, `/task_b`)
+**Input:** `persona`, optional `city` / `state`, `chat_history`, `top_k_retrieval` (default `20`), `top_n_final` (default `5`).
+**Output:** `task`, `agent_steps`, `candidates_considered`, `recommendations[]`.
 | Step | `agent_steps` | Code | What happens |
+|------|---------------|------|-------------|
+| 1 | `embedded_persona_context` | `build_query_text` | Build a combined persona + recent chat query string. |
+| 2 | `vector_retrieval_top_{K}` | `CatalogIndex.retrieve` | Cosine similarity retrieval from `TASK_B_EMBEDDED_CATALOG` using local embeddings; optional city/state filter. |
+| 3 | `gemini_reason_and_rank` | `chat_rank_gemini` | Gemini ranks selected candidates and returns conversational rationales. |
+**Build index:** `scripts/build_business_catalog.py` → `scripts/embed_catalog.py`.
 ---
 ## Docker build vs container start
 | Phase | What runs | What you get |
+|-------|-----------|-------------|
+| **`docker build`** | Embedder snapshot and JSONL index creation; `SKIP_LOCAL_LLM_HUB_DOWNLOAD=1` keeps the build Gemini-first. | Model files and indexes ready in the image. |
+| **Container start** | `STARTUP_PREWARM=all` → warm shared embedder, Task A RAG index, and Task B catalog index. | Startup load happens once, then requests are faster. |
+| **Each request** | `asyncio.to_thread` + shared embeddings + Gemini API | Task A/B remain responsive; no local LLM load. |
 ---
 ## Reproducibility checklist
+1. `cp env.example .env` and set `GEMINI_API_KEY` or `GOOGLE_API_KEY`.
+2. Build indexes locally if needed:
+   - `python scripts/build_task_a_review_rag.py`
+   - `python scripts/build_business_catalog.py`
+   - `python scripts/embed_catalog.py`
+3. Run the app:
+   - `docker compose up`
+   - or `uvicorn app.main:app --host 0.0.0.0 --port 8080`
+4. Confirm startup logs: `Startup prewarm complete.`
+5. Smoke test:
+   - `GET /health`
+   - `POST /user-modeling`
+   - `POST /recommendation` with `top_k_retrieval: 15`, `top_n_final: 5`.
 ---
 ## Paper pointers
+- **Gemini-first pipeline:** active code paths call Gemini for both Task A and Task B.
+- **RAG for Task A:** retrieve real review snippets for style calibration and rating behavior.
+- **Two-stage recommendation:** local retrieval scales, followed by persona-conditioned Gemini reranking.
+- **Nigerian English:** prompt-level design enforces tone and first-person review voice.
+- **Limitations:** API dependency, embedding latency, single worker, and potential query/catalog model mismatch.
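
The updated workflow says Task A parses `Stars:` and `Review:` from the Gemini output and retries once with a stricter fix prompt when parsing fails. The sketch below illustrates that parse-then-retry pattern. It is not the repository's `parse_model_output`: the names `parse_stars_review_sketch` and `generate_with_one_retry`, the regular expressions, and the assumed output layout (`Stars: <1-5>` on one line, `Review: <text>` after it) are illustrative assumptions.

```python
import re
from typing import Callable, Optional

# Assumed output layout (not confirmed by the commit):
#   Stars: 4
#   Review: I enjoyed the food well well.
_STARS_RE = re.compile(r"^\s*Stars:\s*([1-5])", re.IGNORECASE | re.MULTILINE)
_REVIEW_RE = re.compile(r"^\s*Review:\s*(.+)", re.IGNORECASE | re.MULTILINE | re.DOTALL)


def parse_stars_review_sketch(raw: str) -> dict:
    """Extract an integer star rating and the review text from raw model output."""
    stars_match = _STARS_RE.search(raw)
    review_match = _REVIEW_RE.search(raw)
    stars: Optional[int] = int(stars_match.group(1)) if stars_match else None
    review: Optional[str] = review_match.group(1).strip() if review_match else None
    return {
        "stars": stars,
        "review": review,
        # parse_ok is true only when both fields were found and the review is non-empty.
        "parse_ok": stars is not None and bool(review),
    }


def generate_with_one_retry(
    generate: Callable[[str], str],  # hypothetical stand-in for the Gemini call
    prompt: str,
    fix_prompt: str,
) -> dict:
    """Call the generator once; if parsing fails, call it once more with a fix prompt."""
    parsed = parse_stars_review_sketch(generate(prompt))
    if not parsed["parse_ok"]:
        parsed = parse_stars_review_sketch(generate(fix_prompt))
    return parsed
```

Passing the generator in as a callable keeps the sketch testable without an API key. If the second attempt also fails, the caller still receives a result with `parse_ok` set to `False`.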

README.md CHANGED

@@ -13,7 +13,7 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 This Space is configured as **`sdk: docker`**. The image builds from `Dockerfile` (CPU-only PyTorch so CUDA wheels don’t OOM the builder). During **`docker build`**, models are **`snapshot_download`**’d into `/models/huggingface` **without loading the full LLM into RAM**; **`SentenceTransformer`** embeds a **stub** or Yelp-derived catalog plus **`data/task_a_reviews_embedded.jsonl`** (review RAG for Task A). See `scripts/docker_build_assets.py`.
-Task **A**: persona + product → rating/review via **Gemini API** (default) or optional **local** Qwen, plus **retrieved Yelp review snippets** from the baked JSONL. Task **B**: **local** sentence-transformer retrieval over businesses plus **Gemini** (or local) reranking.
 **Secrets (Hugging Face Space):** **`GEMINI_API_KEY`** (or `GOOGLE_API_KEY`) — required for generation when `GENERATION_BACKEND=gemini`. Optional **`HF_TOKEN`** for **Docker build** only (embedder download). Never commit keys in the repo.
@@ -67,9 +67,9 @@ python scripts/build_task_a_review_rag.py \
 Use the same `TASK_B_LOCAL_EMBEDDING_MODEL` (or `TASK_A_EMBEDDING_MODEL`) at build and runtime. Omit the file only for quick tests (generation runs without RAG).
-**Generation:** set `GEMINI_API_KEY` in `.env` (see `env.example`). With `GENERATION_BACKEND=gemini` (default), Task A and Task B use **`GEMINI_MODEL`** (default `gemini-2.0-flash`). Set `GENERATION_BACKEND=local` to use on-device Qwen instead.
-**Task B** reranking uses Gemini when configured; embeddings stay local (`LOCAL_EMBEDDING_MODEL`).
 **Recommendation index** (needs Yelp `business.json` on your machine, e.g. `../yelp_dataset/extracted/` from a parent workspace):
@@ -107,7 +107,7 @@ Default compose maps **`7860:7860`**. The image bakes **`/code/data/business_cat
 The Docker image sets **`HF_HUB_OFFLINE=1`** and **`TRANSFORMERS_OFFLINE=1`** so the running container does not call the Hugging Face Hub. During **`docker build`**, **`snapshot_download`** copies model **files** into `/models/huggingface` (and stub JSONL is embedded). Loading weights **into RAM** during build was disabled by default (**`DOCKER_BUILD_SKIP_LLM_WARM=1`**) because HF build VMs often **OOM (exit 137)** when loading Qwen; that RAM would not stay in the final image anyway.
-At **container start**, **`STARTUP_PREWARM=all`** (default) loads **one shared** embedding model and **one shared** causal LM (`app/shared_models.py`), then Task A RAG + Task B catalog — so **`/task-2`** does not pay a second full Qwen load. Expect **~1–2 minutes** on CPU after deploy while logs show `Loading shared …`; then both endpoints stay fast. Disable with **`SKIP_STARTUP_PREWARM=1`** (not recommended on Spaces).
 ### Smoke checks
@@ -119,7 +119,7 @@ OpenAPI: `http://localhost:7860/docs` when using Docker (port **7860**). Local `
 |------|------|
 | `app/main.py` | FastAPI routes |
 | [`AGENT_WORKFLOW.md`](AGENT_WORKFLOW.md) | Agent steps, reproducibility, paper hooks (Nigerian English, fallbacks) |
-| `app/user_modeling.py`, `app/user_modeling_prompt.py`, `app/task_a_rag.py` | Task 1 local LLM + Yelp review RAG |
 | `app/recommendation_pipeline.py` | Task 2 retrieval + rerank |
 | `scripts/build_business_catalog.py` | Yelp → catalog JSONL |
 | `scripts/embed_catalog.py` | Embed catalog (local sentence-transformers) |

 This Space is configured as **`sdk: docker`**. The image builds from `Dockerfile` (CPU-only PyTorch so CUDA wheels don’t OOM the builder). During **`docker build`**, models are **`snapshot_download`**’d into `/models/huggingface` **without loading the full LLM into RAM**; **`SentenceTransformer`** embeds a **stub** or Yelp-derived catalog plus **`data/task_a_reviews_embedded.jsonl`** (review RAG for Task A). See `scripts/docker_build_assets.py`.
+Task **A**: persona + product → rating/review via **Gemini API** and retrieved Yelp review snippets from the baked JSONL. Task **B**: local sentence-transformer retrieval over businesses plus **Gemini** reranking.
 **Secrets (Hugging Face Space):** **`GEMINI_API_KEY`** (or `GOOGLE_API_KEY`) — required for generation when `GENERATION_BACKEND=gemini`. Optional **`HF_TOKEN`** for **Docker build** only (embedder download). Never commit keys in the repo.
 Use the same `TASK_B_LOCAL_EMBEDDING_MODEL` (or `TASK_A_EMBEDDING_MODEL`) at build and runtime. Omit the file only for quick tests (generation runs without RAG).
+**Generation:** set `GEMINI_API_KEY` in `.env` (see `env.example`). With `GENERATION_BACKEND=gemini` or `auto` (default), Task A and Task B both use **Gemini**. Local causal LLM inference is not used by current runtime code.
+**Task B** reranking uses Gemini; embeddings stay local (`LOCAL_EMBEDDING_MODEL`).
 **Recommendation index** (needs Yelp `business.json` on your machine, e.g. `../yelp_dataset/extracted/` from a parent workspace):
 The Docker image sets **`HF_HUB_OFFLINE=1`** and **`TRANSFORMERS_OFFLINE=1`** so the running container does not call the Hugging Face Hub. During **`docker build`**, **`snapshot_download`** copies model **files** into `/models/huggingface` (and stub JSONL is embedded). Loading weights **into RAM** during build was disabled by default (**`DOCKER_BUILD_SKIP_LLM_WARM=1`**) because HF build VMs often **OOM (exit 137)** when loading Qwen; that RAM would not stay in the final image anyway.
+At **container start**, **`STARTUP_PREWARM=all`** (default) loads the shared embedding model and preloads Task A RAG + Task B catalog indexes. Expect **~1–2 minutes** on CPU after deploy while logs show `Loading shared …`; then both endpoints stay fast. Disable with **`SKIP_STARTUP_PREWARM=1`** (not recommended on Spaces).
 ### Smoke checks
 |------|------|
 | `app/main.py` | FastAPI routes |
 | [`AGENT_WORKFLOW.md`](AGENT_WORKFLOW.md) | Agent steps, reproducibility, paper hooks (Nigerian English, fallbacks) |
+| `app/user_modeling.py`, `app/user_modeling_prompt.py`, `app/task_a_rag.py` | Task 1 Gemini generation + Yelp review RAG |
 | `app/recommendation_pipeline.py` | Task 2 retrieval + rerank |
 | `scripts/build_business_catalog.py` | Yelp → catalog JSONL |
 | `scripts/embed_catalog.py` | Embed catalog (local sentence-transformers) |
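
The README changes describe `GENERATION_BACKEND` as `gemini` or `auto`, with the key read from `GEMINI_API_KEY` or `GOOGLE_API_KEY`. A minimal sketch of how such a configuration could be resolved follows. It is not the code in `app/gemini_client.py`: the function names, the precedence of `GEMINI_API_KEY` over `GOOGLE_API_KEY`, and the choice to raise on an unsupported backend or a missing key are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


def resolve_gemini_key(env: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty key; checking GEMINI_API_KEY first is an assumption."""
    for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"):
        value = (env.get(name) or "").strip()
        if value:
            return value
    return None


def resolve_generation_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical resolver: accept only 'gemini' or 'auto' and require an API key."""
    env = os.environ if env is None else env
    backend = (env.get("GENERATION_BACKEND") or "auto").strip().lower()
    if backend not in ("gemini", "auto"):
        # Assumed behaviour: fail fast, since local LLM inference is no longer active.
        raise ValueError(f"Unsupported GENERATION_BACKEND: {backend!r}")
    if resolve_gemini_key(env) is None:
        raise RuntimeError("Set GEMINI_API_KEY or GOOGLE_API_KEY for Gemini generation.")
    return "gemini"
```

Accepting a mapping instead of reading `os.environ` directly lets the resolver be exercised with a plain dict, for example `resolve_generation_backend({"GOOGLE_API_KEY": "..."})`.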

app/recommendation_pipeline.py CHANGED

@@ -2,30 +2,19 @@ from __future__ import annotations
 import json
 import logging
-import math
 import os
 import re
-import threading
 import time
 from pathlib import Path
 from typing import Any
 import numpy as np
-#modules
 from app._paths import submission_root
 from app.gemini_client import gemini_generate_text, use_gemini
-from app.shared_models import (
-    causal_lm_model_id_task_b,
-    embedding_model_name_task_b,
-    get_causal_lm,
-    get_embedder,
-    inference_lock,
-)
 logger = logging.getLogger(__name__)
-_recommend_inflight = threading.Lock()
 def _resolve_catalog_path(raw: str) -> Path:
     p = Path(raw)
@@ -83,7 +72,7 @@ class CatalogIndex:
             )
         nq = np.linalg.norm(q)
         if nq == 0:
-            q = np.ones_like(q) / math.sqrt(len(q))
         else:
             q = q / nq
@@ -121,239 +110,74 @@ class CatalogIndex:
         ]
-def _task_b_rank_mode() -> str:
-    """llm (default) or retrieval (no causal LM; sub-second, Swagger-safe)."""
-    if os.environ.get("TASK_B_FAST_RANK", "").strip().lower() in ("1", "true", "yes"):
-        return "retrieval"
-    mode = os.environ.get("TASK_B_RANK_MODE", "llm").strip().lower()
-    if mode in ("retrieval", "retrieval_only", "fast", "embed"):
-        return "retrieval"
-    return "llm"
-def retrieval_rank(candidates: list[dict[str, Any]], top_n: int) -> list[dict[str, Any]]:
-    ordered = sorted(
-        candidates,
-        key=lambda c: float(c.get("retrieval_score", 0)),
-        reverse=True,
-    )[:top_n]
-    out: list[dict[str, Any]] = []
-    for i, c in enumerate(ordered, start=1):
-        cats = (c.get("categories") or "similar venues").strip()
-        if len(cats) > 55:
-            cats = cats[:52] + "…"
-        out.append(
-            {
-                "business_id": c["business_id"],
-                "rank": i,
-                "rationale": f"Good semantic fit for {cats}; aligns with persona signals.",
-            }
-        )
-    return out
-def popularity_fallback(candidates: list[dict[str, Any]], top_n: int) -> list[dict[str, Any]]:
-    ranked = sorted(
-        candidates,
-        key=lambda c: (float(c.get("review_count", 0)), float(c.get("stars", 0))),
-        reverse=True,
-    )
-    out = []
-    for i, c in enumerate(ranked[:top_n], start=1):
-        out.append(
-            {
-                "business_id": c["business_id"],
-                "rank": i,
-                "rationale": "High activity and average rating on Yelp (popularity prior).",
-            }
-        )
-    return out
-_RANK_SYSTEM = "Return valid JSON only. No markdown."
-def _build_rank_user_prompt(
     persona: str,
     chat_history: list[dict[str, str]],
     candidates: list[dict[str, Any]],
     top_n: int,
-) -> str:
-    hist_txt = ""
-    if chat_history:
-        lines = []
-        for turn in chat_history[-6:]:
-            role = turn.get("role", "user")
-            content = turn.get("content", "")
-            lines.append(f"{role}: {content}")
-        hist_txt = "\n".join(lines)
-    cand_payload = [
-        {
-            "id": c["business_id"],
-            "name": (c.get("name", "") or "")[:48],
-            "cat": (c.get("categories", "") or "")[:56],
-        }
-        for c in candidates
-    ]
-    return f"""Rank the best {top_n} businesses for this user (Nigerian English rationales, third person, under 14 words each).
 Persona:
 {persona.strip()[:1200]}
-Chat:
-{hist_txt or "(none)"}
 Candidates:
-{json.dumps(cand_payload, ensure_ascii=False)}
 Output ONLY a JSON array of {top_n} objects: {{"business_id":"<id>","rank":1,"rationale":"..."}} — distinct ids from candidates, rank 1 best."""
-def _normalize_ranked_output(
-    raw: str,
-    candidates: list[dict[str, Any]],
-    top_n: int,
-) -> list[dict[str, Any]]:
-    data = _parse_json_array(raw)
-    if not data:
-        logger.warning("Rank parse failed; using popularity fallback on candidates.")
-        return popularity_fallback(candidates, top_n)
-    seen: set[str] = set()
-    cleaned = []
-    for item in data:
-        if not isinstance(item, dict):
-            continue
-        bid = item.get("business_id")
-        if not bid or bid in seen:
-            continue
-        seen.add(str(bid))
-        cleaned.append(
-            {
-                "business_id": str(bid),
-                "rank": int(item.get("rank", len(cleaned) + 1)),
-                "rationale": str(item.get("rationale", "")).strip()
-                or "Matched persona and retrieval signals.",
-            }
-        )
-        if len(cleaned) >= top_n:
-            break
-    if len(cleaned) < min(top_n, len(candidates)):
-        for c in candidates:
-            if len(cleaned) >= top_n:
-                break
-            bid = c["business_id"]
-            if bid in seen:
-                continue
-            seen.add(bid)
-            cleaned.append(
-                {
-                    "business_id": bid,
-                    "rank": len(cleaned) + 1,
-                    "rationale": "Added by retrieval order after partial model output.",
-                }
-            )
-    cleaned.sort(key=lambda x: x["rank"])
-    return cleaned[:top_n]
-def chat_rank_gemini(
-    *,
-    persona: str,
-    chat_history: list[dict[str, str]],
-    candidates: list[dict[str, Any]],
-    top_n: int,
-) -> list[dict[str, Any]]:
-    user_prompt = _build_rank_user_prompt(persona, chat_history, candidates, top_n)
-    temp = float(os.environ.get("TASK_B_TEMPERATURE", "0.2"))
-    max_out = min(512, int(os.environ.get("TASK_B_MAX_OUTPUT_TOKENS", "256")))
     raw = gemini_generate_text(
-        system_instruction=_RANK_SYSTEM,
         user_text=user_prompt,
-        temperature=temp,
-        max_output_tokens=max_out,
     )
     return _normalize_ranked_output(raw, candidates, top_n)
-def chat_rank_local_hf(
-    *,
-    persona: str,
-    chat_history: list[dict[str, str]],
     candidates: list[dict[str, Any]],
     top_n: int,
-    tokenizer: Any,
-    model: Any,
-    device: str,
 ) -> list[dict[str, Any]]:
     try:
-        import torch  # type: ignore[import-untyped]
-    except ImportError as e:
-        raise RuntimeError("Local reranking needs PyTorch (install sentence-transformers or torch).") from e
-    user_prompt = _build_rank_user_prompt(persona, chat_history, candidates, top_n)
-    messages = [
-        {"role": "system", "content": _RANK_SYSTEM},
-        {"role": "user", "content": user_prompt},
-    ]
-    prompt_txt = tokenizer.apply_chat_template(
-        messages,
-        tokenize=False,
-        add_generation_prompt=True,
-    )
-    inputs = tokenizer(
-        prompt_txt,
-        return_tensors="pt",
-        truncation=True,
-        max_length=1536,
-    ).to(device)
-    if tokenizer.pad_token_id is None:
-        tokenizer.pad_token_id = tokenizer.eos_token_id
-    max_new_tokens = min(200, 24 + top_n * 32)
-    with inference_lock(), torch.no_grad():
-        out = model.generate(
-            **inputs,
-            max_new_tokens=max_new_tokens,
-            do_sample=False,
-            pad_token_id=tokenizer.pad_token_id,
-        )
-    gen_ids = out[0][inputs["input_ids"].shape[1] :]
-    raw = tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
-    return _normalize_ranked_output(raw, candidates, top_n)
-def _parse_json_array(raw: str) -> list[Any]:
-    raw = raw.strip()
-    try:
-        val = json.loads(raw)
-        return val if isinstance(val, list) else []
-    except json.JSONDecodeError:
-        pass
-    m = re.search(r"\[[\s\S]*\]", raw)
-    if m:
-        try:
-            val = json.loads(m.group(0))
-            return val if isinstance(val, list) else []
-        except json.JSONDecodeError:
-            pass
-    return []
-def build_query_text(persona: str, chat_history: list[dict[str, str]]) -> str:
-    parts = [persona.strip()]
-    for turn in chat_history[-4:]:
-        c = turn.get("content", "").strip()
-        if c:
-            parts.append(c)
-    return "\n".join(parts)
 class RecommendationService:
     def __init__(self) -> None:
-        self._local_llm_model_id = causal_lm_model_id_task_b()
-        self._local_model_name = embedding_model_name_task_b()
         catalog_raw = os.environ.get(
             "TASK_B_EMBEDDED_CATALOG", "data/business_catalog_embedded.jsonl"
         )
@@ -361,29 +185,13 @@ class RecommendationService:
         self.index = CatalogIndex(self.catalog_path)
         self._loaded = False
-    def _ensure_local_embedder(self) -> Any:
-        return get_embedder(self._local_model_name)
-    def _embed_persona_local(self, text: str) -> list[float]:
-        model = self._ensure_local_embedder()
-        t = text.replace("\n", " ")[:8000]
-        vec = model.encode([t], convert_to_numpy=True, normalize_embeddings=False)[0]
-        return vec.astype(float).tolist()
-    def _ensure_local_rank_llm(self) -> tuple[Any, Any, str]:
-        tok, mdl, dev = get_causal_lm(self._local_llm_model_id)
-        return tok, mdl, str(dev)
     def ensure_catalog(self) -> None:
         if self._loaded:
             return
-        logger.info("Loading Task B catalog from %s", self.catalog_path)
         self.index.load()
         self._loaded = True
-        logger.info("Task B catalog ready (%d businesses)", len(self.index._rows))
-    def warm(self) -> None:
-        self.ensure_catalog()
     def recommend(
         self,
@@ -395,98 +203,22 @@ class RecommendationService:
         top_k_retrieval: int = 20,
         top_n_final: int = 5,
     ) -> dict[str, Any]:
-        # Local HF rerank is slow on CPU; serialize. Gemini can run concurrently.
-        if not use_gemini():
-            if not _recommend_inflight.acquire(blocking=False):
-                raise RuntimeError(
-                    "Another recommendation is already running; wait for it to finish before retrying."
-                )
-            try:
-                return self._recommend_impl(
-                    persona,
-                    city=city,
-                    state=state,
-                    chat_history=chat_history,
-                    top_k_retrieval=top_k_retrieval,
-                    top_n_final=top_n_final,
-                )
-            finally:
-                _recommend_inflight.release()
-        return self._recommend_impl(
-            persona,
-            city=city,
-            state=state,
-            chat_history=chat_history,
-            top_k_retrieval=top_k_retrieval,
-            top_n_final=top_n_final,
-        )
-    def _recommend_impl(
-        self,
-        persona: str,
-        *,
-        city: str | None = None,
-        state: str | None = None,
-        chat_history: list[dict[str, str]] | None = None,
-        top_k_retrieval: int = 20,
-        top_n_final: int = 5,
-    ) -> dict[str, Any]:
-        t0 = time.perf_counter()
-        chat_history = chat_history or []
         self.ensure_catalog()
-        rank_mode = _task_b_rank_mode()
-        llm_cap = int(os.environ.get("TASK_B_LLM_CANDIDATE_CAP", "6"))
-        top_k_retrieval = max(top_k_retrieval, top_n_final)
         qtext = build_query_text(persona, chat_history)
-        qemb = self._embed_persona_local(qtext)
-        candidates = self.index.retrieve(qemb, top_k_retrieval, city, state)
-        logger.info("Task B retrieved %d candidates in %.2fs", len(candidates), time.perf_counter() - t0)
-        if not candidates:
-            raise RuntimeError("No retrieval candidates — check catalog filters and embeddings.")
-        rank_pool = candidates[: min(len(candidates), llm_cap)]
-        logger.info(
-            "Task B %s rank on %d of %d candidates (top_n=%d) …",
-            rank_mode,
-            len(rank_pool),
-            len(candidates),
-            top_n_final,
         )
-        t1 = time.perf_counter()
-        if rank_mode == "retrieval":
-            ranked = retrieval_rank(rank_pool, top_n_final)
-            rank_step = "retrieval_score_rank"
-        elif use_gemini():
-            ranked = chat_rank_gemini(
-                persona=persona,
-                chat_history=chat_history,
-                candidates=rank_pool,
-                top_n=top_n_final,
-            )
-            rank_step = "gemini_reason_and_rank"
-        else:
-            tok, mdl, dev = self._ensure_local_rank_llm()
-            ranked = chat_rank_local_hf(
-                persona=persona,
-                chat_history=chat_history,
-                candidates=rank_pool,
-                top_n=top_n_final,
-                tokenizer=tok,
-                model=mdl,
-                device=dev,
-            )
-            rank_step = "local_hf_llm_reason_and_rank"
-        logger.info("Task B rank done in %.2fs (total %.2fs)", time.perf_counter() - t1, time.perf_counter() - t0)
         return {
             "task": "2_recommendation",
             "agent_steps": [
                 "embedded_persona_context",
                 f"vector_retrieval_top_{top_k_retrieval}",
-                rank_step,
             ],
             "candidates_considered": len(candidates),
             "recommendations": ranked,

 import json
 import logging
 import os
 import re
 import time
 from pathlib import Path
 from typing import Any
 import numpy as np
 from app._paths import submission_root
 from app.gemini_client import gemini_generate_text, use_gemini
 logger = logging.getLogger(__name__)
 def _resolve_catalog_path(raw: str) -> Path:
     p = Path(raw)
             )
         nq = np.linalg.norm(q)
         if nq == 0:
+            q = np.ones_like(q) / np.sqrt(len(q))
         else:
             q = q / nq
         ]
+def build_query_text(persona: str, chat_history: list[dict[str, str]]) -> str:
+    parts = [persona.strip()]
+    for turn in chat_history[-4:]:
+        c = turn.get("content", "").strip()
+        if c:
+            parts.append(c)
+    return "\n".join(parts)
+def chat_rank_gemini(
+    *,
     persona: str,
     chat_history: list[dict[str, str]],
     candidates: list[dict[str, Any]],
     top_n: int,
+) -> list[dict[str, Any]]:
+    user_prompt = f"""Rank the best {top_n} businesses for this user with conversational rationales.
 Persona:
 {persona.strip()[:1200]}
+Chat History:
+{json.dumps(chat_history[-6:], ensure_ascii=False) if chat_history else "(none)"}
 Candidates:
+{json.dumps(candidates, ensure_ascii=False)}
 Output ONLY a JSON array of {top_n} objects: {{"business_id":"<id>","rank":1,"rationale":"..."}} — distinct ids from candidates, rank 1 best."""
     raw = gemini_generate_text(
+        system_instruction="Return valid JSON only.",
         user_text=user_prompt,
+        temperature=0.2,
+        max_output_tokens=512,
     )
     return _normalize_ranked_output(raw, candidates, top_n)
+def _normalize_ranked_output(
+    raw: str,
     candidates: list[dict[str, Any]],
     top_n: int,
 ) -> list[dict[str, Any]]:
     try:
+        data = json.loads(raw)
+        if not isinstance(data, list):
+            raise ValueError("Invalid JSON format for ranked output.")
+        return [
+            {
+                "business_id": item["business_id"],
+                "rank": item["rank"],
+                "rationale": item["rationale"],
+            }
+            for item in data[:top_n]
+        ]
+    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
+        logger.warning("Failed to parse Gemini output; falling back to retrieval order.")
+        return [
+            {
+                "business_id": c["business_id"],
+                "rank": i + 1,
+                "rationale": "Fallback rationale due to parsing error.",
+            }
+            for i, c in enumerate(candidates[:top_n])
+        ]
 class RecommendationService:
     def __init__(self) -> None:
         catalog_raw = os.environ.get(
             "TASK_B_EMBEDDED_CATALOG", "data/business_catalog_embedded.jsonl"
         )
         self.index = CatalogIndex(self.catalog_path)
         self._loaded = False
     def ensure_catalog(self) -> None:
         if self._loaded:
             return
+        logger.info("Loading catalog from %s", self.catalog_path)
         self.index.load()
         self._loaded = True
+        logger.info("Catalog loaded with %d businesses", len(self.index._rows))
     def recommend(
         self,
         top_k_retrieval: int = 20,
         top_n_final: int = 5,
     ) -> dict[str, Any]:
         self.ensure_catalog()
+        chat_history = chat_history or []
         qtext = build_query_text(persona, chat_history)
+        candidates = self.index.retrieve(qtext, top_k_retrieval, city, state)
+        ranked = chat_rank_gemini(
+            persona=persona,
+            chat_history=chat_history,
+            candidates=candidates,
+            top_n=top_n_final,
         )
         return {
             "task": "2_recommendation",
             "agent_steps": [
                 "embedded_persona_context",
                 f"vector_retrieval_top_{top_k_retrieval}",
+                "gemini_reason_and_rank",
             ],
             "candidates_considered": len(candidates),
             "recommendations": ranked,

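Review note on the ranking path above: this commit drops `_parse_json_array`, so `_normalize_ranked_output` now accepts only output that `json.loads` can parse directly, and it does not check that returned ids belong to the candidate list. The sketch below shows one way to keep the regex fallback and add id validation. It is an illustration only, not part of the commit; `parse_json_array` and `normalize_ranked` are hypothetical names, and the padding rationale string is an assumption.

```python
import json
import re
from typing import Any


def parse_json_array(raw: str) -> list[Any]:
    """Return the JSON array contained in raw, or [] if none parses."""
    raw = raw.strip()
    try:
        val = json.loads(raw)
        return val if isinstance(val, list) else []
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost [...] span, e.g. inside a Markdown fence.
    m = re.search(r"\[[\s\S]*\]", raw)
    if m:
        try:
            val = json.loads(m.group(0))
            return val if isinstance(val, list) else []
        except json.JSONDecodeError:
            pass
    return []


def normalize_ranked(
    raw: str, candidates: list[dict[str, Any]], top_n: int
) -> list[dict[str, Any]]:
    """Keep known, distinct business ids; pad from retrieval order."""
    known = [c["business_id"] for c in candidates]
    seen: set[str] = set()
    ranked: list[dict[str, Any]] = []
    for item in parse_json_array(raw):
        if not isinstance(item, dict):
            continue  # skip strings, numbers, nested lists
        bid = item.get("business_id")
        if bid not in known or bid in seen:
            continue  # unknown or repeated id
        seen.add(bid)
        ranked.append(
            {
                "business_id": bid,
                "rank": len(ranked) + 1,  # re-number after filtering
                "rationale": str(item.get("rationale", "")),
            }
        )
        if len(ranked) == top_n:
            return ranked
    # Pad with unused candidates in retrieval order (hypothetical rationale text).
    for bid in known:
        if len(ranked) == top_n:
            break
        if bid not in seen:
            seen.add(bid)
            ranked.append(
                {
                    "business_id": bid,
                    "rank": len(ranked) + 1,
                    "rationale": "Added from retrieval order.",
                }
            )
    return ranked
```

With this shape, a reply wrapped in a Markdown fence still parses, and an id the model made up is dropped rather than returned to the caller.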
app/shared_models.py CHANGED

@@ -5,12 +5,9 @@ import os
 import threading
 from typing import Any
-from app.gemini_client import use_gemini
 logger = logging.getLogger(__name__)
 _embedders: dict[str, Any] = {}
-_llm_cache: dict[str, tuple[Any, Any, Any]] = {}
 _inference_lock = threading.Lock()
@@ -23,15 +20,6 @@ def local_embedding_model() -> str:
     )
-def local_llm_model() -> str:
-    return (
-        os.environ.get("LOCAL_LLM_MODEL", "").strip()
-        or os.environ.get("TASK_B_LOCAL_LLM_MODEL", "").strip()
-        or os.environ.get("TASK_A_LOCAL_LLM_MODEL", "").strip()
-        or "Qwen/Qwen2.5-1.5B-Instruct"
-    )
 def embedding_model_name_task_a() -> str:
     override = os.environ.get("TASK_A_EMBEDDING_MODEL", "").strip()
     return override or local_embedding_model()
@@ -42,26 +30,11 @@ def embedding_model_name_task_b() -> str:
     return override or local_embedding_model()
-def causal_lm_model_id_task_a() -> str:
-    override = os.environ.get("TASK_A_LOCAL_LLM_MODEL", "").strip()
-    return override or local_llm_model()
-def causal_lm_model_id_task_b() -> str:
-    override = os.environ.get("TASK_B_LOCAL_LLM_MODEL", "").strip()
-    return override or local_llm_model()
 def unique_embedding_model_names() -> list[str]:
     names = {embedding_model_name_task_a(), embedding_model_name_task_b()}
     return sorted(names)
-def unique_llm_model_ids() -> list[str]:
-    ids = {causal_lm_model_id_task_a(), causal_lm_model_id_task_b()}
-    return sorted(ids)
 def get_embedder(model_name: str) -> Any:
     key = model_name.strip()
     if key not in _embedders:
@@ -78,44 +51,10 @@ def inference_lock() -> threading.Lock:
     return _inference_lock
-def get_causal_lm(model_id: str) -> tuple[Any, Any, Any]:
-    mid = model_id.strip()
-    if mid not in _llm_cache:
-        try:
-            import torch  # type: ignore[import-untyped]
-            from transformers import AutoModelForCausalLM, AutoTokenizer  # type: ignore[import-untyped]
-        except ImportError as e:
-            raise RuntimeError("transformers and torch required") from e
-        logger.info("Loading shared causal LM %s", mid)
-        tok = AutoTokenizer.from_pretrained(mid, trust_remote_code=True)
-        use_cuda = torch.cuda.is_available()
-        device_obj = torch.device("cuda" if use_cuda else "cpu")
-        dtype = torch.float16 if use_cuda else torch.float32
-        mdl = AutoModelForCausalLM.from_pretrained(
-            mid,
-            torch_dtype=dtype,
-            trust_remote_code=True,
-            low_cpu_mem_usage=True,
-        )
-        mdl = mdl.to(device_obj)
-        mdl.eval()
-        _llm_cache[mid] = (tok, mdl, device_obj)
-    return _llm_cache[mid]
 def warm_shared_weights() -> None:
     for name in unique_embedding_model_names():
         get_embedder(name)
-    if use_gemini():
-        logger.info(
-            "Shared weights ready (%d embedder(s); generation via Gemini API)",
-            len(_embedders),
-        )
-        return
-    for mid in unique_llm_model_ids():
-        get_causal_lm(mid)
     logger.info(
-        "Shared weights ready (%d embedder(s), %d causal LM(s))",
         len(_embedders),
-        len(_llm_cache),
     )

 import threading
 from typing import Any
 logger = logging.getLogger(__name__)
 _embedders: dict[str, Any] = {}
 _inference_lock = threading.Lock()
     )
 def embedding_model_name_task_a() -> str:
     override = os.environ.get("TASK_A_EMBEDDING_MODEL", "").strip()
     return override or local_embedding_model()
     return override or local_embedding_model()
 def unique_embedding_model_names() -> list[str]:
     names = {embedding_model_name_task_a(), embedding_model_name_task_b()}
     return sorted(names)
 def get_embedder(model_name: str) -> Any:
     key = model_name.strip()
     if key not in _embedders:
     return _inference_lock
 def warm_shared_weights() -> None:
     for name in unique_embedding_model_names():
         get_embedder(name)
     logger.info(
+        "Shared weights ready (%d embedder(s); generation via Gemini API)",
         len(_embedders),
     )
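
Review note on the embedder cache above: `get_embedder` keeps one instance per model name in `_embedders`, and the body of its loading branch is not visible in this diff. The sketch below shows the same lazy-cache pattern with a lock around the first load, so that concurrent first requests do not load the same model twice. `get_cached`, `_cache`, `_cache_lock`, and the injected `loader` callable are hypothetical names used for illustration. Whether the real function takes a lock is not shown here.

```python
import threading
from typing import Any, Callable

_cache: dict[str, Any] = {}
_cache_lock = threading.Lock()


def get_cached(model_name: str, loader: Callable[[str], Any]) -> Any:
    """Load model_name at most once, even under concurrent first calls."""
    key = model_name.strip()
    with _cache_lock:
        if key not in _cache:
            # loader stands in for the real model constructor (an assumption).
            _cache[key] = loader(key)
        return _cache[key]
```

Passing the loader in as an argument keeps the sketch runnable without any model library installed. Holding the lock during the load is the simplest choice, and it means a slow first load blocks other callers until it finishes.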

app/user_modeling.py CHANGED

@@ -10,11 +10,8 @@ from typing import Any
 from app._paths import submission_root
 from app.gemini_client import gemini_generate_chat, gemini_generate_text, use_gemini
 from app.shared_models import (
-    causal_lm_model_id_task_a,
     embedding_model_name_task_a,
-    get_causal_lm,
     get_embedder,
-    inference_lock,
 )
 from app.task_a_rag import TaskAReviewRagIndex
 from app.user_modeling_prompt import build_prompt_parts_with_rag
@@ -45,14 +42,13 @@ def _resolve_path(raw: str) -> Path:
 def _task_a_gen_step() -> str:
-    return "gemini_generate" if use_gemini() else "local_hf_causal_lm"
 class UserModelingService:
     def __init__(self) -> None:
         self._max_tokens = int(os.environ.get("TASK_A_MAX_TOKENS", "1024"))
         self._temperature = float(os.environ.get("TASK_A_TEMPERATURE", "0.35"))
-        self._local_llm_model_id = causal_lm_model_id_task_a()
         self._embedding_model_name = embedding_model_name_task_a()
         rag_raw = os.environ.get(
             "TASK_A_REVIEWS_EMBEDDED",
@@ -74,9 +70,6 @@ class UserModelingService:
         if self._rag_path.is_file():
             self._rag().load()
-    def _ensure_local_llm(self) -> tuple[Any, Any, Any]:
-        return get_causal_lm(self._local_llm_model_id)
     def _retrieve_examples(self, persona: str, product: str) -> list[dict[str, Any]]:
         if not self._rag_path.is_file():
             return []
@@ -86,14 +79,14 @@ class UserModelingService:
     def _generate(self, persona: str, product: str, examples: list[dict[str, Any]]) -> str:
         inst, user_body = build_prompt_parts_with_rag(persona, product, examples)
-        if use_gemini():
-            return gemini_generate_text(
-                system_instruction=inst,
-                user_text=user_body,
-                temperature=self._temperature,
-                max_output_tokens=min(int(self._max_tokens), 1024),
-            )
-        return self._generate_local(persona, product, examples)
     def _generate_fix(
         self, persona: str, product: str, prior_raw: str, examples: list[dict[str, Any]]
@@ -103,88 +96,19 @@ class UserModelingService:
             "Your answer must follow exactly:\nStars: <1-5>\nReview:\n<text>\n\n"
             "The Review must be first person (I/my/me), as the user who visited — not third person. Fix strictly."
         )
-        if use_gemini():
-            return gemini_generate_chat(
-                [
-                    {"role": "system", "content": inst},
-                    {"role": "user", "content": user_body},
-                    {"role": "assistant", "content": prior_raw},
-                    {"role": "user", "content": fix_user},
-                ],
-                temperature=0.2,
-                max_output_tokens=min(int(self._max_tokens), 1024),
-            )
-        return self._generate_local_fix(persona, product, prior_raw, examples)
-    def _generate_local(self, persona: str, product: str, examples: list[dict[str, Any]]) -> str:
-        tok, mdl, device = self._ensure_local_llm()
-        inst, user_body = build_prompt_parts_with_rag(persona, product, examples)
-        messages = [
-            {"role": "system", "content": inst},
-            {"role": "user", "content": user_body},
-        ]
-        prompt_txt = tok.apply_chat_template(
-            messages,
-            tokenize=False,
-            add_generation_prompt=True,
-        )
-        try:
-            import torch  # type: ignore[import-untyped]
-        except ImportError as e:
-            raise RuntimeError("Task 1 needs torch.") from e
-        inputs = tok(prompt_txt, return_tensors="pt").to(device)
-        if tok.pad_token_id is None:
-            tok.pad_token_id = tok.eos_token_id
-        max_new = min(int(self._max_tokens), 768)
-        with inference_lock(), torch.no_grad():
-            out = mdl.generate(
-                **inputs,
-                max_new_tokens=max_new,
-                do_sample=True,
-                temperature=self._temperature,
-                top_p=0.9,
-                pad_token_id=tok.pad_token_id,
-            )
-        gen_ids = out[0][inputs["input_ids"].shape[1] :]
-        return tok.decode(gen_ids, skip_special_tokens=True).strip()
-    def _generate_local_fix(
-        self, persona: str, product: str, prior_raw: str, examples: list[dict[str, Any]]
-    ) -> str:
-        tok, mdl, device = self._ensure_local_llm()
-        inst, user_body = build_prompt_parts_with_rag(persona, product, examples)
-        fix_user = (
-            "Your answer must follow exactly:\nStars: <1-5>\nReview:\n<text>\n\n"
-            "The Review must be first person (I/my/me), as the user who visited — not third person. Fix strictly."
-        )
-        messages = [
-            {"role": "system", "content": inst},
-            {"role": "user", "content": user_body},
-            {"role": "assistant", "content": prior_raw},
-            {"role": "user", "content": fix_user},
-        ]
-        prompt_txt = tok.apply_chat_template(
-            messages,
-            tokenize=False,
-            add_generation_prompt=True,
         )
-        import torch  # type: ignore[import-untyped]
-        inputs = tok(prompt_txt, return_tensors="pt").to(device)
-        if tok.pad_token_id is None:
-            tok.pad_token_id = tok.eos_token_id
-        max_new = min(int(self._max_tokens), 768)
-        with inference_lock(), torch.no_grad():
-            out = mdl.generate(
-                **inputs,
-                max_new_tokens=max_new,
-                do_sample=False,
-                pad_token_id=tok.pad_token_id,
-            )
-        gen_ids = out[0][inputs["input_ids"].shape[1] :]
-        return tok.decode(gen_ids, skip_special_tokens=True).strip()
     def generate(self, persona: str, product: str, *, include_raw: bool = False) -> dict[str, Any]:
         t0 = time.perf_counter()

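Review note on the fix prompt in `_generate_fix` above: it demands the exact shape `Stars: <1-5>`, then `Review:`, then the text. The project's own parser is not part of this diff, so the sketch below is only an assumption about how such output could be validated before deciding whether a fix turn is needed. `parse_stars_review_sketch` is a hypothetical name.

```python
import re
from typing import Optional

# One digit from 1 to 5 after "Stars:", at the start of a line.
_STARS_RE = re.compile(r"^\s*Stars:\s*([1-5])\b", re.IGNORECASE | re.MULTILINE)
# Everything after "Review:" up to the end of the text.
_REVIEW_RE = re.compile(
    r"^\s*Review:\s*(.*)", re.IGNORECASE | re.MULTILINE | re.DOTALL
)


def parse_stars_review_sketch(raw: str) -> Optional[tuple[int, str]]:
    """Return (stars, review) or None when the required format is missing."""
    stars = _STARS_RE.search(raw)
    review = _REVIEW_RE.search(raw)
    if not stars or not review:
        return None
    text = review.group(1).strip()
    if not text:
        return None
    return int(stars.group(1)), text
```

A `None` result is the natural trigger for a repair request such as the one `_generate_fix` sends, with the unparsed text passed along as `prior_raw`.
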
 from app._paths import submission_root
 from app.gemini_client import gemini_generate_chat, gemini_generate_text, use_gemini
 from app.shared_models import (
     embedding_model_name_task_a,
     get_embedder,
 )
 from app.task_a_rag import TaskAReviewRagIndex
 from app.user_modeling_prompt import build_prompt_parts_with_rag
 def _task_a_gen_step() -> str:
+    return "gemini_generate"
 class UserModelingService:
     def __init__(self) -> None:
         self._max_tokens = int(os.environ.get("TASK_A_MAX_TOKENS", "1024"))
         self._temperature = float(os.environ.get("TASK_A_TEMPERATURE", "0.35"))
         self._embedding_model_name = embedding_model_name_task_a()
         rag_raw = os.environ.get(
             "TASK_A_REVIEWS_EMBEDDED",
         if self._rag_path.is_file():
             self._rag().load()
     def _retrieve_examples(self, persona: str, product: str) -> list[dict[str, Any]]:
         if not self._rag_path.is_file():
             return []
     def _generate(self, persona: str, product: str, examples: list[dict[str, Any]]) -> str:
         inst, user_body = build_prompt_parts_with_rag(persona, product, examples)
+        if not use_gemini():
+            raise RuntimeError("Task 1 requires Gemini for generation.")
+        return gemini_generate_text(
+            system_instruction=inst,
+            user_text=user_body,
+            temperature=self._temperature,
+            max_output_tokens=min(int(self._max_tokens), 1024),
+        )
     def _generate_fix(
         self, persona: str, product: str, prior_raw: str, examples: list[dict[str, Any]]
             "Your answer must follow exactly:\nStars: <1-5>\nReview:\n<text>\n\n"
             "The Review must be first person (I/my/me), as the user who visited — not third person. Fix strictly."
         )
+        if not use_gemini():
+            raise RuntimeError("Task 1 requires Gemini for generation.")
+        return gemini_generate_chat(
+            [
+                {"role": "system", "content": inst},
+                {"role": "user", "content": user_body},
+                {"role": "assistant", "content": prior_raw},
+                {"role": "user", "content": fix_user},
+            ],
+            temperature=0.2,
+            max_output_tokens=min(int(self._max_tokens), 1024),
         )
     def generate(self, persona: str, product: str, *, include_raw: bool = False) -> dict[str, Any]:
         t0 = time.perf_counter()