CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Talker is a Gradio + FastAPI application deployed on HuggingFace Spaces that implements an AI chat agent with Open Floor Protocol (OFP) support. It uses Qwen3 (via transformers) for LLM inference, with ZeroGPU acceleration.

Running the Application

pip install -r requirements.txt
python app.py
# Access at http://localhost:7860

Testing

# Test OFP endpoints against deployed space
python test_ofp_endpoint.py

# Validate deployment with curl
./validate_deployment.sh [BASE_URL]
# e.g. ./validate_deployment.sh https://bladeszasza-talker.hf.space

Architecture

The app is a single app.py that wires together Gradio UI, FastAPI routes, and the src/ modules:

  • app.py — Entry point. Creates LLMClient and ChatAgent, defines the @spaces.GPU inference function (llm_generate_gpu), builds the Gradio gr.Blocks UI, and registers OFP API routes.
  • src/llm_client.py — Wraps Qwen3 via HuggingFace transformers. The tokenizer loads eagerly; model weights load lazily on the first GPU call inside generate_response_from_messages().
  • src/chat_agent.py — Maintains conversation history and agent stats (messages_processed, responses_sent). Handles OFP envelope processing via process_envelope().
  • src/ofp_client.py — Sends OFP envelopes to external conveners/agents via HTTP POST.
  • src/models.py — Dataclasses for Envelope, DialogEvent, Event, Identification.
  • config/config.yaml — All runtime config: agent URIs, LLM model/params, UI settings.
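For orientation, the src/models.py dataclasses can be sketched roughly as follows. The exact field names are assumptions based on common OFP usage, not copied from the repo:

```python
from dataclasses import dataclass, field

@dataclass
class Identification:
    speakerUri: str                 # assumed field name, e.g. 'tag:talker.service,2025:agent-01'
    serviceUrl: str

@dataclass
class Event:
    eventType: str                  # e.g. "utterance" or "getManifests"
    parameters: dict = field(default_factory=dict)

@dataclass
class DialogEvent:
    speakerUri: str
    features: dict = field(default_factory=dict)

@dataclass
class Envelope:
    sender: Identification
    events: list = field(default_factory=list)
```

Check src/models.py for the authoritative definitions before relying on any of these names.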

Critical Routing Detail

Gradio mounts a SvelteKit catch-all at / that intercepts any routes registered directly on demo.app. Custom FastAPI routes must use a prefix that Gradio doesn't claim.

The app uses APIRouter(prefix="/ofp-api") and calls demo.app.include_router(ofp_router) before demo.launch(). External OFP endpoints are:

  • GET /ofp-api/manifest
  • POST /ofp-api/ofp

If /ofp-api/ collides with Gradio internals, change it to /xapi/ or another prefix and verify with:

print([r.path for r in demo.app.routes])
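The printed route list can also be turned into an assertion. A hypothetical helper (the function name is mine, not from the repo) that reports which expected OFP paths are missing from the app's route table:

```python
def missing_ofp_routes(route_paths, prefix="/ofp-api"):
    """Return the expected OFP paths absent from the app's route table."""
    expected = {f"{prefix}/manifest", f"{prefix}/ofp"}
    return sorted(expected - set(route_paths))
```

Usage: missing_ofp_routes([r.path for r in demo.app.routes]) should return an empty list when the router was registered correctly.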

ZeroGPU Constraint

@spaces.GPU decorated functions must be defined at module level in app.py (not inside classes or nested functions) so the HuggingFace startup scanner can detect them. The model is always invoked through llm_generate_gpu(); never call llm_client.generate_response_from_messages() directly from app code, as it requires a GPU context.
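A minimal sketch of that module-level shape, with a no-op fallback decorator so the same file also imports where the spaces package is absent (the fallback and the stubbed body are my assumptions; the real llm_generate_gpu delegates to the LLM client):

```python
try:
    import spaces
    gpu = spaces.GPU              # real decorator on HuggingFace Spaces
except ImportError:
    def gpu(fn):                  # no-op stand-in for local runs without `spaces`
        return fn

@gpu                              # at module level so the startup scanner can find it
def llm_generate_gpu(messages):
    # The real function calls llm_client.generate_response_from_messages();
    # stubbed here to keep the sketch self-contained.
    return f"echo: {messages[-1]['content']}"
```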

Configuration

Edit config/config.yaml to change the model or agent settings. Key fields:

agent:
  speaker_uri: 'tag:talker.service,2025:agent-01'
  service_url: 'https://<space>.hf.space/ofp-api/ofp'

llm:
  model: 'Qwen/Qwen3-0.6B'   # 0.6B is fast on CPU; 4B is slower but better quality
  max_tokens: 16384
  temperature: 0.7

Environment Variables

  • HF_TOKEN — Required for private HuggingFace models; not needed for public Qwen3 models.
  • OPENAI_API_KEY — If switching llm.provider to openai.
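Since both variables are optional in the default setup, reads should tolerate their absence (a sketch; the real app may load them differently):

```python
import os

# None is acceptable for public Qwen3 checkpoints.
hf_token = os.environ.get("HF_TOKEN")
# Only consulted when llm.provider is switched to openai.
openai_key = os.environ.get("OPENAI_API_KEY")
```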

OFP Event Handling

The /ofp-api/ofp endpoint processes two event types inline (in app.py, not via ChatAgent.process_envelope):

  • getManifests → returns a publishManifest response immediately
  • utterance → calls llm_generate_gpu, appends to agent.conversation_history, returns an OFP utterance envelope

ChatAgent.process_envelope() in src/chat_agent.py exists but is not called by the main app flow; the API endpoint handles events directly. ChatAgent is used for state tracking (conversation_history, messages_processed, responses_sent) and the debug panel.
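The two inline branches can be sketched as a plain dispatch function. The dict key names ("eventType", "parameters", "text") and the publishManifest payload are assumptions, and generate stands in for llm_generate_gpu:

```python
def handle_ofp_event(event, generate=None):
    """Dispatch one OFP event dict (sketch of the inline handling in app.py)."""
    generate = generate or (lambda text: f"reply to {text}")
    etype = event.get("eventType")
    if etype == "getManifests":
        # Answered immediately, without touching the LLM.
        return {"eventType": "publishManifest", "parameters": {"manifests": []}}
    if etype == "utterance":
        text = event["parameters"]["text"]
        # Real flow: llm_generate_gpu(...) plus an append to agent.conversation_history.
        return {"eventType": "utterance", "parameters": {"text": generate(text)}}
    return None  # unknown event types are ignored in this sketch
```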