BolyosCsaba Claude Sonnet 4.6 committed on
Commit
5f3a4a0
·
1 Parent(s): 847df57

Optimize Qwen3 for fast social chat: disable thinking mode, cap tokens


- Disable Qwen3's <think>...</think> reasoning block (enable_thinking=False)
which was the primary cause of timeouts — the model was spending most of
its time on chain-of-thought before generating a single word of response
- Reduce max_tokens from 16384 → 300 (conversational replies don't need 16k)
- Update system prompt to reinforce short, natural 1–3 sentence replies
- Add CLAUDE.md with architecture and routing documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (3)
  1. CLAUDE.md +82 -0
  2. config/config.yaml +9 -8
  3. src/llm_client.py +1 -0
CLAUDE.md ADDED
@@ -0,0 +1,82 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ Talker is a Gradio + FastAPI application deployed on HuggingFace Spaces that implements an AI chat agent with [Open Floor Protocol (OFP)](https://openfloor.dev) support. It uses Qwen3 (via `transformers`) for LLM inference, with ZeroGPU acceleration on HuggingFace Spaces.
+
+ ## Running the Application
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ # Access at http://localhost:7860
+ ```
+
+ ## Testing
+
+ ```bash
+ # Test OFP endpoints against deployed space
+ python test_ofp_endpoint.py
+
+ # Validate deployment with curl
+ ./validate_deployment.sh [BASE_URL]
+ # e.g. ./validate_deployment.sh https://bladeszasza-talker.hf.space
+ ```
+
+ ## Architecture
+
+ The app is a single `app.py` that wires together Gradio UI, FastAPI routes, and the `src/` modules:
+
+ - **`app.py`** — Entry point. Creates `LLMClient` and `ChatAgent`, defines the `@spaces.GPU` inference function (`llm_generate_gpu`), builds the Gradio `gr.Blocks` UI, and registers OFP API routes.
+ - **`src/llm_client.py`** — Wraps Qwen3 via HuggingFace `transformers`. Tokenizer loads eagerly; model weights load lazily on first GPU call inside `generate_response_from_messages()`.
+ - **`src/chat_agent.py`** — Maintains conversation history and agent stats (`messages_processed`, `responses_sent`). Handles OFP envelope processing via `process_envelope()`.
+ - **`src/ofp_client.py`** — Sends OFP envelopes to external conveners/agents via HTTP POST.
+ - **`src/models.py`** — Dataclasses for `Envelope`, `DialogEvent`, `Event`, `Identification`.
+ - **`config/config.yaml`** — All runtime config: agent URIs, LLM model/params, UI settings.
+
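[Editor's note] The module list above mentions the dataclasses in `src/models.py`. As a rough, hypothetical sketch of what such an envelope model could look like (the field names and nesting here are illustrative assumptions, not copied from the repository):

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class Event:
    # e.g. "utterance" or "getManifests"
    event_type: str
    parameters: dict = field(default_factory=dict)

@dataclass
class Envelope:
    speaker_uri: str
    events: List[Event] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so this serializes cleanly
        return json.dumps(asdict(self))

env = Envelope(
    speaker_uri="tag:talker.service,2025:agent-01",
    events=[Event(event_type="utterance", parameters={"text": "hello"})],
)
```

Plain dataclasses keep the envelope schema declarative and trivially serializable for the HTTP POST path in `src/ofp_client.py`.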
+ ## Critical Routing Detail
+
+ Gradio mounts a SvelteKit catch-all at `/` that intercepts any routes registered directly on `demo.app`. Custom FastAPI routes **must** use a prefix that Gradio doesn't claim.
+
+ The app uses `APIRouter(prefix="/ofp-api")` and calls `demo.app.include_router(ofp_router)` **before** `demo.launch()`. External OFP endpoints are:
+ - `GET /ofp-api/manifest`
+ - `POST /ofp-api/ofp`
+
+ If `/ofp-api/` collides with Gradio internals, change it to `/xapi/` or another prefix and verify with:
+ ```python
+ print([r.path for r in demo.app.routes])
+ ```
+
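[Editor's note] The collision check described above can be reduced to a small pure function over registered route paths. This is a hypothetical helper, and the Gradio-like paths below are illustrative stand-ins, not captured from a real app:

```python
def prefix_is_free(existing_paths, prefix):
    """True if no registered route equals the prefix or lives under it."""
    prefix = prefix.rstrip("/")
    return not any(
        p == prefix or p.startswith(prefix + "/") for p in existing_paths
    )

# Illustrative paths only; dump the real list via demo.app.routes as shown above.
gradio_like = ["/", "/config", "/queue/join", "/theme.css"]
print(prefix_is_free(gradio_like, "/ofp-api"))  # safe to mount
print(prefix_is_free(gradio_like, "/config"))   # Gradio claims it
```

Note this inspects registered path strings only; the SvelteKit catch-all at `/` still matches anything at request time, which is why the router must be included before `demo.launch()`.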
+ ## ZeroGPU Constraint
+
+ `@spaces.GPU` decorated functions must be defined at **module level** in `app.py` (not inside classes or nested functions) so the HuggingFace startup scanner can detect them. The model is always invoked through `llm_generate_gpu()` — never call `llm_client.generate_response_from_messages()` directly from app code, as it requires a GPU context.
+
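[Editor's note] A minimal sketch of the module-level shape this implies. The `ImportError` fallback is an assumption added so the sketch runs outside Spaces, and the function body is a stub rather than the repo's real inference call:

```python
try:
    import spaces  # available on HuggingFace Spaces
except ImportError:
    class spaces:  # local stand-in so the sketch runs anywhere
        @staticmethod
        def GPU(fn):
            return fn

@spaces.GPU  # must sit at module level so the startup scanner finds it
def llm_generate_gpu(messages):
    # Real app: delegate to llm_client.generate_response_from_messages(messages)
    return f"stub reply to {len(messages)} message(s)"
```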
+ ## Configuration
+
+ Edit `config/config.yaml` to change the model or agent settings. Key fields:
+
+ ```yaml
+ agent:
+   speaker_uri: 'tag:talker.service,2025:agent-01'
+   service_url: 'https://<space>.hf.space/ofp-api/ofp'
+
+ llm:
+   model: 'Qwen/Qwen3-0.6B'  # 0.6B is fast on CPU; 4B is slower but better quality
+   max_tokens: 300
+   temperature: 0.7
+ ```
+
+ ## Environment Variables
+
+ - `HF_TOKEN` — Required for private HuggingFace models; not needed for public Qwen3 models.
+ - `OPENAI_API_KEY` — If switching `llm.provider` to `openai`.
+
+ ## OFP Event Handling
+
+ The `/ofp-api/ofp` endpoint processes two event types in-line (in `app.py`, not via `ChatAgent.process_envelope`):
+ - `getManifests` → returns `publishManifest` response immediately
+ - `utterance` → calls `llm_generate_gpu`, appends to `agent.conversation_history`, returns OFP utterance envelope
+
+ `ChatAgent.process_envelope()` in `src/chat_agent.py` exists but is not called by the main app flow — the API endpoint handles events directly. `ChatAgent` is used for state tracking (`conversation_history`, `messages_processed`, `responses_sent`) and the debug panel.
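[Editor's note] The two-way dispatch above can be sketched as a plain function. The event and response shapes here are simplified placeholders, not the actual OFP envelope schema, and `generate` stands in for the real `llm_generate_gpu` call:

```python
def handle_ofp_event(event, generate=lambda text: "stub reply"):
    """Dispatch the two event types the endpoint handles in-line."""
    if event["eventType"] == "getManifests":
        # Manifest requests are answered immediately, no LLM involved
        return {"eventType": "publishManifest"}
    if event["eventType"] == "utterance":
        # Utterances go through the GPU inference path
        reply = generate(event["text"])
        return {"eventType": "utterance", "text": reply}
    raise ValueError(f"unsupported event type: {event['eventType']}")

handle_ofp_event({"eventType": "getManifests"})
```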
config/config.yaml CHANGED
@@ -19,18 +19,19 @@ llm:
   # For HuggingFace: set HF_TOKEN environment variable
   api_url: null  # Optional: custom API endpoint
 
-  # Generation parameters (optimized for Qwen3)
-  max_tokens: 16384
+  # Generation parameters (optimized for responsive social chat)
+  # Qwen3 thinking mode is disabled in llm_client.py (enable_thinking=False)
+  # so max_tokens here caps the actual reply, not a long reasoning chain.
+  max_tokens: 300
   temperature: 0.7
-  top_p: 0.8
-  top_k: 20
+  top_p: 0.9
+  top_k: 40
 
   # System prompt
   system_prompt: |
-    You are Talker, a helpful AI assistant participating in an Open Floor Protocol conversation.
-    You provide clear, concise, and friendly responses.
-    You can discuss a wide range of topics and help with questions.
-    Please reason step by step when solving complex problems.
+    You are Talker, a friendly and witty conversational AI in an Open Floor Protocol multi-agent conversation.
+    Keep replies short and natural — 1 to 3 sentences. Be warm, direct, and engaging.
+    Do not explain your reasoning; just respond like a person in a real conversation.
 
   conversation:
     # Automatically respond to all messages
src/llm_client.py CHANGED
@@ -95,6 +95,7 @@ class LLMClient:
             messages,
             tokenize=False,
             add_generation_prompt=True,
+            enable_thinking=False,  # Qwen3: skip <think>...</think> block for fast chat
         )
 
         model_inputs = self.tokenizer([text], return_tensors="pt").to(self._model.device)
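[Editor's note] For context, a hedged sketch of how this call could be made defensive, since `enable_thinking` is a Qwen3 chat-template extension that other tokenizers' `apply_chat_template` implementations may not accept. `render_prompt` is a hypothetical helper, not a function from this repository:

```python
def render_prompt(tokenizer, messages):
    """Render a chat prompt, disabling Qwen3 thinking mode when supported."""
    try:
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False,  # Qwen3: skip the <think>...</think> block
        )
    except TypeError:
        # If this tokenizer's method rejects the kwarg, retry without it.
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
```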