BolyosCsaba Claude Sonnet 4.6 committed on
Commit
5f3a4a0
·
1 Parent(s): 847df57

Optimize Qwen3 for fast social chat: disable thinking mode, cap tokens


- Disable Qwen3's <think>...</think> reasoning block (enable_thinking=False)
which was the primary cause of timeouts — the model was spending most of
its time on chain-of-thought before generating a single word of response
- Reduce max_tokens from 16384 → 300 (conversational replies don't need 16k)
- Update system prompt to reinforce short, natural 1–3 sentence replies
- Add CLAUDE.md with architecture and routing documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (3)
  1. CLAUDE.md +82 -0
  2. config/config.yaml +9 -8
  3. src/llm_client.py +1 -0
CLAUDE.md ADDED
@@ -0,0 +1,82 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ Talker is a Gradio + FastAPI application deployed on HuggingFace Spaces that implements an AI chat agent with [Open Floor Protocol (OFP)](https://openfloor.dev) support. It uses Qwen3 (via `transformers`) for LLM inference, with ZeroGPU acceleration on HuggingFace Spaces.
+
+ ## Running the Application
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ # Access at http://localhost:7860
+ ```
+
+ ## Testing
+
+ ```bash
+ # Test OFP endpoints against deployed space
+ python test_ofp_endpoint.py
+
+ # Validate deployment with curl
+ ./validate_deployment.sh [BASE_URL]
+ # e.g. ./validate_deployment.sh https://bladeszasza-talker.hf.space
+ ```
+
+ ## Architecture
+
+ The app is a single `app.py` that wires together Gradio UI, FastAPI routes, and the `src/` modules:
+
+ - **`app.py`** — Entry point. Creates `LLMClient` and `ChatAgent`, defines the `@spaces.GPU` inference function (`llm_generate_gpu`), builds the Gradio `gr.Blocks` UI, and registers OFP API routes.
+ - **`src/llm_client.py`** — Wraps Qwen3 via HuggingFace `transformers`. Tokenizer loads eagerly; model weights load lazily on first GPU call inside `generate_response_from_messages()`.
+ - **`src/chat_agent.py`** — Maintains conversation history and agent stats (`messages_processed`, `responses_sent`). Handles OFP envelope processing via `process_envelope()`.
+ - **`src/ofp_client.py`** — Sends OFP envelopes to external conveners/agents via HTTP POST.
+ - **`src/models.py`** — Dataclasses for `Envelope`, `DialogEvent`, `Event`, `Identification`.
+ - **`config/config.yaml`** — All runtime config: agent URIs, LLM model/params, UI settings.
+
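[Editor's note] The module list above mentions the dataclasses in `src/models.py`. As a rough, hypothetical sketch of what such an envelope model could look like (the field names and nesting here are illustrative assumptions, not copied from the repository):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class Event:
    # e.g. "utterance" or "getManifests"
    event_type: str
    parameters: dict = field(default_factory=dict)

@dataclass
class Envelope:
    speaker_uri: str
    events: List[Event] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so this serializes cleanly
        return json.dumps(asdict(self))

env = Envelope(
    speaker_uri="tag:talker.service,2025:agent-01",
    events=[Event(event_type="utterance", parameters={"text": "hello"})],
)
```

Plain dataclasses keep the envelope schema declarative and trivially serializable for the HTTP POST path in `src/ofp_client.py`.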
+ ## Critical Routing Detail
+
+ Gradio mounts a SvelteKit catch-all at `/` that intercepts any routes registered directly on `demo.app`. Custom FastAPI routes **must** use a prefix that Gradio doesn't claim.
+
+ The app uses `APIRouter(prefix="/ofp-api")` and calls `demo.app.include_router(ofp_router)` **before** `demo.launch()`. External OFP endpoints are:
+ - `GET /ofp-api/manifest`
+ - `POST /ofp-api/ofp`
+
+ If `/ofp-api/` collides with Gradio internals, change it to `/xapi/` or another prefix and verify with:
+ ```python
+ print([r.path for r in demo.app.routes])
+ ```
+
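[Editor's note] The collision check described above can be reduced to a small pure function over registered route paths. This is a hypothetical helper, and the Gradio-like paths below are illustrative stand-ins, not captured from a real app:

```python
def prefix_is_free(existing_paths, prefix):
    """True if no registered route equals the prefix or lives under it."""
    prefix = prefix.rstrip("/")
    return not any(
        p == prefix or p.startswith(prefix + "/") for p in existing_paths
    )

# Illustrative paths only; dump the real list via demo.app.routes as shown above.
gradio_like = ["/", "/config", "/queue/join", "/theme.css"]
print(prefix_is_free(gradio_like, "/ofp-api"))  # safe to mount
print(prefix_is_free(gradio_like, "/config"))   # Gradio claims it
```

Note this inspects registered path strings only; the SvelteKit catch-all at `/` still matches anything at request time, which is why the router must be included before `demo.launch()`.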
+ ## ZeroGPU Constraint
+
+ `@spaces.GPU` decorated functions must be defined at **module level** in `app.py` (not inside classes or nested functions) so the HuggingFace startup scanner can detect them. The model is always invoked through `llm_generate_gpu()` — never call `llm_client.generate_response_from_messages()` directly from app code, as it requires a GPU context.
+
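[Editor's note] A minimal sketch of the module-level shape this implies. The `ImportError` fallback is an assumption added so the sketch runs outside Spaces, and the function body is a stub rather than the repo's real inference call:

```python
try:
    import spaces  # available on HuggingFace Spaces
except ImportError:
    class spaces:  # local stand-in so the sketch runs anywhere
        @staticmethod
        def GPU(fn):
            return fn

@spaces.GPU  # must sit at module level so the startup scanner finds it
def llm_generate_gpu(messages):
    # Real app: delegate to llm_client.generate_response_from_messages(messages)
    return f"stub reply to {len(messages)} message(s)"
```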
+ ## Configuration
+
+ Edit `config/config.yaml` to change the model or agent settings. Key fields:
+
+ ```yaml
+ agent:
+   speaker_uri: 'tag:talker.service,2025:agent-01'
+   service_url: 'https://<space>.hf.space/ofp-api/ofp'
+
+ llm:
+   model: 'Qwen/Qwen3-0.6B'  # 0.6B is fast on CPU; 4B is slower but better quality
+   max_tokens: 300
+   temperature: 0.7
+ ```
+
+ ## Environment Variables
+
+ - `HF_TOKEN` — Required for private HuggingFace models; not needed for public Qwen3 models.
+ - `OPENAI_API_KEY` — If switching `llm.provider` to `openai`.
+
+ ## OFP Event Handling
+
+ The `/ofp-api/ofp` endpoint processes two event types in-line (in `app.py`, not via `ChatAgent.process_envelope`):
+ - `getManifests` → returns `publishManifest` response immediately
+ - `utterance` → calls `llm_generate_gpu`, appends to `agent.conversation_history`, returns OFP utterance envelope
+
+ `ChatAgent.process_envelope()` in `src/chat_agent.py` exists but is not called by the main app flow — the API endpoint handles events directly. `ChatAgent` is used for state tracking (`conversation_history`, `messages_processed`, `responses_sent`) and the debug panel.
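[Editor's note] The two-way dispatch above can be sketched as a plain function. The event and response shapes here are simplified placeholders, not the actual OFP envelope schema, and `generate` stands in for the real `llm_generate_gpu` call:

```python
def handle_ofp_event(event, generate=lambda text: "stub reply"):
    """Dispatch the two event types the endpoint handles in-line."""
    if event["eventType"] == "getManifests":
        # Manifest requests are answered immediately, no LLM involved
        return {"eventType": "publishManifest"}
    if event["eventType"] == "utterance":
        # Utterances go through the GPU inference path
        reply = generate(event["text"])
        return {"eventType": "utterance", "text": reply}
    raise ValueError(f"unsupported event type: {event['eventType']}")

handle_ofp_event({"eventType": "getManifests"})
```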
config/config.yaml CHANGED
@@ -19,18 +19,19 @@ llm:
   # For HuggingFace: set HF_TOKEN environment variable
   api_url: null  # Optional: custom API endpoint
 
-  # Generation parameters (optimized for Qwen3)
-  max_tokens: 16384
+  # Generation parameters (optimized for responsive social chat)
+  # Qwen3 thinking mode is disabled in llm_client.py (enable_thinking=False)
+  # so max_tokens here caps the actual reply, not a long reasoning chain.
+  max_tokens: 300
   temperature: 0.7
-  top_p: 0.8
-  top_k: 20
+  top_p: 0.9
+  top_k: 40
 
   # System prompt
   system_prompt: |
-    You are Talker, a helpful AI assistant participating in an Open Floor Protocol conversation.
-    You provide clear, concise, and friendly responses.
-    You can discuss a wide range of topics and help with questions.
-    Please reason step by step when solving complex problems.
+    You are Talker, a friendly and witty conversational AI in an Open Floor Protocol multi-agent conversation.
+    Keep replies short and natural — 1 to 3 sentences. Be warm, direct, and engaging.
+    Do not explain your reasoning; just respond like a person in a real conversation.
 
   conversation:
     # Automatically respond to all messages
src/llm_client.py CHANGED
@@ -95,6 +95,7 @@ class LLMClient:
             messages,
             tokenize=False,
             add_generation_prompt=True,
+            enable_thinking=False,  # Qwen3: skip <think>...</think> block for fast chat
         )
 
         model_inputs = self.tokenizer([text], return_tensors="pt").to(self._model.device)
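[Editor's note] For context, a hedged sketch of how this call could be made defensive, since `enable_thinking` is a Qwen3 chat-template extension that other tokenizers' `apply_chat_template` implementations may not accept. `render_prompt` is a hypothetical helper, not a function from this repository:

```python
def render_prompt(tokenizer, messages):
    """Render a chat prompt, disabling Qwen3 thinking mode when supported."""
    try:
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,  # Qwen3: skip the <think>...</think> block
        )
    except TypeError:
        # If this tokenizer's method rejects the kwarg, retry without it.
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
```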