BolyosCsaba and Claude Sonnet 4.6 committed on
Commit · 5f3a4a0
Parent(s): 847df57
Optimize Qwen3 for fast social chat: disable thinking mode, cap tokens

- Disable Qwen3's <think>...</think> reasoning block (enable_thinking=False),
  which was the primary cause of timeouts: the model was spending most of
  its time on chain-of-thought before generating a single word of response
- Reduce max_tokens from 16384 → 300 (conversational replies don't need 16k)
- Update system prompt to reinforce short, natural 1–3 sentence replies
- Add CLAUDE.md with architecture and routing documentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- CLAUDE.md +82 -0
- config/config.yaml +9 -8
- src/llm_client.py +1 -0
CLAUDE.md
ADDED
@@ -0,0 +1,82 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Talker is a Gradio + FastAPI application deployed on HuggingFace Spaces that implements an AI chat agent with [Open Floor Protocol (OFP)](https://openfloor.dev) support. It uses Qwen3 (via `transformers`) for LLM inference, with ZeroGPU acceleration on HuggingFace Spaces.

## Running the Application

```bash
pip install -r requirements.txt
python app.py
# Access at http://localhost:7860
```

## Testing

```bash
# Test OFP endpoints against deployed space
python test_ofp_endpoint.py

# Validate deployment with curl
./validate_deployment.sh [BASE_URL]
# e.g. ./validate_deployment.sh https://bladeszasza-talker.hf.space
```

## Architecture

The app is a single `app.py` that wires together the Gradio UI, FastAPI routes, and the `src/` modules:

- **`app.py`** – Entry point. Creates `LLMClient` and `ChatAgent`, defines the `@spaces.GPU` inference function (`llm_generate_gpu`), builds the Gradio `gr.Blocks` UI, and registers the OFP API routes.
- **`src/llm_client.py`** – Wraps Qwen3 via HuggingFace `transformers`. The tokenizer loads eagerly; model weights load lazily on the first GPU call inside `generate_response_from_messages()`.
- **`src/chat_agent.py`** – Maintains conversation history and agent stats (`messages_processed`, `responses_sent`). Handles OFP envelope processing via `process_envelope()`.
- **`src/ofp_client.py`** – Sends OFP envelopes to external conveners/agents via HTTP POST.
- **`src/models.py`** – Dataclasses for `Envelope`, `DialogEvent`, `Event`, `Identification`.
- **`config/config.yaml`** – All runtime config: agent URIs, LLM model/params, UI settings.
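To make the dataclass layer concrete, here is a minimal sketch of what envelope models of this shape might look like. The field names below are illustrative guesses based on the OFP envelope structure, not the actual definitions in `src/models.py`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# NOTE: field names are hypothetical; check src/models.py for the real ones.
@dataclass
class Identification:
    speakerUri: str
    serviceUrl: Optional[str] = None

@dataclass
class Event:
    eventType: str
    parameters: dict = field(default_factory=dict)

@dataclass
class Envelope:
    sender: Identification
    events: List[Event] = field(default_factory=list)

env = Envelope(
    sender=Identification(speakerUri="tag:talker.service,2025:agent-01"),
    events=[Event(eventType="utterance")],
)
```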

## Critical Routing Detail

Gradio mounts a SvelteKit catch-all at `/` that intercepts any routes registered directly on `demo.app`. Custom FastAPI routes **must** use a prefix that Gradio doesn't claim.

The app uses `APIRouter(prefix="/ofp-api")` and calls `demo.app.include_router(ofp_router)` **before** `demo.launch()`. External OFP endpoints are:

- `GET /ofp-api/manifest`
- `POST /ofp-api/ofp`

If `/ofp-api/` collides with Gradio internals, change it to `/xapi/` or another prefix and verify with:

```python
print([r.path for r in demo.app.routes])
```

## ZeroGPU Constraint

`@spaces.GPU`-decorated functions must be defined at **module level** in `app.py` (not inside classes or nested functions) so the HuggingFace startup scanner can detect them. The model is always invoked through `llm_generate_gpu()` – never call `llm_client.generate_response_from_messages()` directly from app code, as it requires a GPU context.
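A minimal sketch of the module-level pattern, with a no-op fallback for environments where the `spaces` package (available only on HuggingFace Spaces) is not installed. The function body is a stand-in, not the app's real inference code.

```python
try:
    import spaces  # only available on HuggingFace Spaces
    gpu_decorator = spaces.GPU
except ImportError:
    def gpu_decorator(fn):  # local-dev fallback: no-op decorator
        return fn

@gpu_decorator  # must be applied at module level so the ZeroGPU scanner finds it
def llm_generate_gpu(messages: list) -> str:
    # Placeholder body; the real function delegates to
    # llm_client.generate_response_from_messages(messages).
    return f"reply to {len(messages)} message(s)"
```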

## Configuration

Edit `config/config.yaml` to change the model or agent settings. Key fields:

```yaml
agent:
  speaker_uri: 'tag:talker.service,2025:agent-01'
  service_url: 'https://<space>.hf.space/ofp-api/ofp'

llm:
  model: 'Qwen/Qwen3-0.6B'  # 0.6B is fast on CPU; 4B is slower but better quality
  max_tokens: 300
  temperature: 0.7
```

## Environment Variables

- `HF_TOKEN` – Required for private HuggingFace models; not needed for public Qwen3 models.
- `OPENAI_API_KEY` – Needed only if `llm.provider` is switched to `openai`.

## OFP Event Handling

The `/ofp-api/ofp` endpoint processes two event types in-line (in `app.py`, not via `ChatAgent.process_envelope`):

- `getManifests` – returns a `publishManifest` response immediately
- `utterance` – calls `llm_generate_gpu`, appends to `agent.conversation_history`, returns an OFP utterance envelope

`ChatAgent.process_envelope()` in `src/chat_agent.py` exists but is not called by the main app flow – the API endpoint handles events directly. `ChatAgent` is used for state tracking (`conversation_history`, `messages_processed`, `responses_sent`) and the debug panel.
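The two-way dispatch described above can be condensed into a sketch like the following. This is a hypothetical simplification of the in-line handler in `app.py`; the real event and response shapes follow the OFP envelope spec, and `generate` stands in for `llm_generate_gpu`.

```python
def dispatch_event(event: dict, manifest: dict, generate) -> dict:
    """Route one OFP event to a response event (illustrative only)."""
    etype = event.get("eventType")
    if etype == "getManifests":
        # Manifest requests are answered immediately, no model call.
        return {"eventType": "publishManifest",
                "parameters": {"manifests": [manifest]}}
    if etype == "utterance":
        # Utterances go through the GPU inference function.
        reply = generate(event)
        return {"eventType": "utterance",
                "parameters": {"text": reply}}
    return {"eventType": "error",
            "parameters": {"reason": f"unsupported event: {etype}"}}

resp = dispatch_event({"eventType": "getManifests"}, {"name": "Talker"},
                      lambda e: "hi")
```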
config/config.yaml
CHANGED

@@ -19,18 +19,19 @@ llm:
   # For HuggingFace: set HF_TOKEN environment variable
   api_url: null  # Optional: custom API endpoint

-  # Generation parameters (optimized for
-  max_tokens: 16384
+  # Generation parameters (optimized for responsive social chat)
+  # Qwen3 thinking mode is disabled in llm_client.py (enable_thinking=False)
+  # so max_tokens here caps the actual reply, not a long reasoning chain.
+  max_tokens: 300
   temperature: 0.7
-  top_p: 0.
-  top_k:
+  top_p: 0.9
+  top_k: 40

   # System prompt
   system_prompt: |
-    You are Talker, a
-
-
-    Please reason step by step when solving complex problems.
+    You are Talker, a friendly and witty conversational AI in an Open Floor Protocol multi-agent conversation.
+    Keep replies short and natural – 1 to 3 sentences. Be warm, direct, and engaging.
+    Do not explain your reasoning; just respond like a person in a real conversation.

 conversation:
   # Automatically respond to all messages

src/llm_client.py
CHANGED

@@ -95,6 +95,7 @@ class LLMClient:
         messages,
         tokenize=False,
         add_generation_prompt=True,
+        enable_thinking=False,  # Qwen3: skip <think>...</think> block for fast chat
     )

     model_inputs = self.tokenizer([text], return_tensors="pt").to(self._model.device)