Spaces:
Running
Running
add claude
Browse files
CLAUDE.md
ADDED
|
@@ -0,0 +1,57 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# CLAUDE.md
|
| 2 |
+
|
| 3 |
+
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
|
| 4 |
+
|
| 5 |
+
## What This Is
|
| 6 |
+
|
| 7 |
+
A Gradio-based chat interface for ServiceNow-AI's Apriel reasoning models, deployed as a HuggingFace Space. Users chat with vLLM-hosted models via an OpenAI-compatible API, with streaming responses and multimodal (text + image) support.
|
| 8 |
+
|
| 9 |
+
## Running Locally
|
| 10 |
+
|
| 11 |
+
```bash
|
| 12 |
+
# Install dependencies
|
| 13 |
+
pip install -r requirements.txt
|
| 14 |
+
|
| 15 |
+
# Run with hot reload (needs env vars β see below)
|
| 16 |
+
python gradio_runner.py app.py
|
| 17 |
+
|
| 18 |
+
# Or run directly
|
| 19 |
+
python app.py
|
| 20 |
+
```
|
| 21 |
+
|
| 22 |
+
The Makefile target `make runAppReloading` bundles env vars and launches with hot reload, but contains hardcoded tokens β use it only as a reference for which env vars are needed.
|
| 23 |
+
|
| 24 |
+
## Required Environment Variables
|
| 25 |
+
|
| 26 |
+
- `AUTH_TOKEN` β vLLM API auth token
|
| 27 |
+
- `HF_TOKEN` β HuggingFace token (for chat logging dataset)
|
| 28 |
+
- `VLLM_API_URL_APRIEL_1_6_15B` β single vLLM endpoint
|
| 29 |
+
- `VLLM_API_URL_LIST_APRIEL_1_6_15B` β comma-separated endpoints for load balancing
|
| 30 |
+
- `MODEL_NAME_APRIEL_1_6_15B` β model name on vLLM server
|
| 31 |
+
- `DEBUG_MODE` β "True"/"False" for verbose logging
|
| 32 |
+
- `APRIEL_PROMPT_DATASET` β HF dataset repo for chat logging
|
| 33 |
+
|
| 34 |
+
## Architecture
|
| 35 |
+
|
| 36 |
+
**app.py** β Main Gradio app (UI layout, streaming inference, session state). `run_chat_inference()` is the core generator that streams chat completions, handles reasoning tag splitting (`[BEGIN FINAL RESPONSE]`), and supports multimodal input (up to 5 images converted to base64).
|
| 37 |
+
|
| 38 |
+
**utils.py** β Model configuration registry (`models_config` dict) and logging helpers. Each model entry defines: HF URL, API name, vLLM endpoints, auth token, reasoning/multimodal flags, temperature, and output tags. Add new models here.
|
| 39 |
+
|
| 40 |
+
**log_chat.py** β Async queue-based chat logger. Writes to local `train.csv` and syncs to a HuggingFace Hub dataset. Uses a daemon thread to avoid blocking the UI. Has a `test_log_chat()` function for manual testing.
|
| 41 |
+
|
| 42 |
+
**theme.py** β Custom Gradio theme (Apriel) extending Soft theme with custom colors and fonts.
|
| 43 |
+
|
| 44 |
+
**styles.css** β Responsive CSS with dark mode support. Chat height uses CSS calc with breakpoints at 1280px, 1024px, 400px.
|
| 45 |
+
|
| 46 |
+
**timer.py** β Simple step-based timing utility for performance profiling.
|
| 47 |
+
|
| 48 |
+
## HuggingFace Space Deployment
|
| 49 |
+
|
| 50 |
+
The Space is configured via YAML frontmatter in `README.md` (sdk, sdk_version, app_file). The `sdk_version` must match the gradio version in `requirements.txt` β mismatches cause build failures.
|
| 51 |
+
|
| 52 |
+
## Key Patterns
|
| 53 |
+
|
| 54 |
+
- **Endpoint rotation**: `setup_model()` round-robins across vLLM endpoints from the comma-separated env var list
|
| 55 |
+
- **Session state**: A global `session_state` dict tracks streaming status, stop flags, chat/session IDs, and opt-out preference
|
| 56 |
+
- **Reasoning models**: Responses are split on `[BEGIN FINAL RESPONSE]` tag β content before is "thought", content after is the visible response
|
| 57 |
+
- **Concurrency**: Gradio queue with `default_concurrency_limit=4`
|