# Architecture
This document describes the system architecture of prompt-prix, including module responsibilities, data flow, and key design decisions.
## System Overview
```
┌───────────────────────────────────────────────────────────────────────┐
│                                Browser                                │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                        Gradio UI (ui.py)                        │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────────┐ │  │
│  │  │ Config Panel│  │ Prompt Input │  │ Model Output Tabs       │ │  │
│  │  │ • Servers   │  │ • Single     │  │ • Tab 1..10             │ │  │
│  │  │ • Models    │  │ • Batch      │  │ • Streaming display     │ │  │
│  │  │ • System    │  │ • Tools JSON │  │ • Status colors         │ │  │
│  │  │   Prompt    │  └──────────────┘  └─────────────────────────┘ │  │
│  │  └─────────────┘                                                │  │
│  │  ┌──────────────────────────────────────────────────────────┐  │  │
│  │  │ localStorage: servers, models, temperature, etc.         │  │  │
│  │  └──────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│                            Python Backend                             │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                           handlers.py                           │  │
│  │  • fetch_available_models() → ServerPool.refresh_manifests()    │  │
│  │  • initialize_session()     → Create ComparisonSession          │  │
│  │  • send_single_prompt()     → Work-stealing dispatcher          │  │
│  │  • export_markdown/json()   → Report generation                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                  │                                    │
│  ┌───────────────────────────────┼─────────────────────────────────┐  │
│  │  core.py                      │                                 │  │
│  │  ┌─────────────────────┐      │      ┌─────────────────────────┐│  │
│  │  │ ServerPool          │◄─────┴─────►│ ComparisonSession       ││  │
│  │  │ • servers: dict     │             │ • state: SessionState   ││  │
│  │  │ • refresh_manifest  │             │ • send_prompt_to_model  ││  │
│  │  │ • acquire/release   │             │ • get_context_display   ││  │
│  │  └─────────────────────┘             └─────────────────────────┘│  │
│  │  ┌──────────────────────────────────────────────────────────┐  │  │
│  │  │        stream_completion() / get_completion()            │  │  │
│  │  │  • Async HTTP streaming to LM Studio                     │  │  │
│  │  │  • Yields text chunks or returns full response           │  │  │
│  │  └──────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                  │                                    │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                            config.py                            │  │
│  │  Pydantic Models: ServerConfig, ModelContext, SessionState      │  │
│  │  Constants: DEFAULT_TEMPERATURE, DEFAULT_MAX_TOKENS, etc.       │  │
│  │  Environment: load_servers_from_env(), get_gradio_port()        │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│                           LM Studio Servers                           │
│  ┌────────────────────────┐        ┌────────────────────────┐         │
│  │  Server 1 (e.g. 3090)  │        │  Server 2 (e.g. 8000)  │         │
│  │  • GET /v1/models      │        │  • GET /v1/models      │         │
│  │  • POST /v1/chat/...   │        │  • POST /v1/chat/...   │         │
│  │  ├─ Model A            │        │  ├─ Model B, C         │         │
│  │  └─ Model B            │        │                        │         │
│  └────────────────────────┘        └────────────────────────┘         │
└───────────────────────────────────────────────────────────────────────┘
```
## Module Breakdown

### Directory Structure

```
prompt_prix/
├── main.py              # Entry point
├── ui.py                # Gradio UI definition
├── handlers.py          # Shared event handlers (fetch, stop)
├── state.py             # Global mutable state
├── core.py              # ServerPool, ComparisonSession, streaming
├── config.py            # Pydantic models, constants, env loading
├── parsers.py           # Input parsing utilities
├── export.py            # Report generation
├── dispatcher.py        # WorkStealingDispatcher for parallel execution
├── battery.py           # BatteryRunner, TestResult, BatteryRun
├── tabs/
│   ├── __init__.py
│   ├── battery/
│   │   ├── __init__.py
│   │   └── handlers.py  # Battery-specific handlers
│   └── compare/
│       ├── __init__.py
│       └── handlers.py  # Compare-specific handlers
├── adapters/
│   └── lmstudio.py      # LMStudioAdapter
└── benchmarks/
    ├── base.py          # TestCase protocol
    └── custom_json.py   # CustomJSONLoader
```
### config.py - Configuration & Data Models

**Purpose:** Define all Pydantic models for type-safe configuration and state.

| Class | Purpose |
|---|---|
| `ServerConfig` | Single LM Studio server state (URL, available_models, is_busy) |
| `ModelConfig` | Model identity and display name |
| `Message` | Single message in a conversation (role, content; supports multimodal) |
| `ModelContext` | Complete conversation history for one model |
| `SessionState` | Full session: models, contexts, system_prompt, halted status |
**Message Multimodal Support:**

The `Message` model supports both text and multimodal content:

```python
# Text-only message
Message(role="user", content="Hello")

# Multimodal message (text + image)
Message(role="user", content=[
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
])

# Helper methods
msg.get_text()    # Extract text content
msg.has_image()   # Check if message contains an image
```
**Key Functions:**

- `load_servers_from_env()` - Read `LM_STUDIO_SERVER_N` environment variables
- `get_default_servers()` - Return env servers or placeholder defaults
- `get_gradio_port()` - Read `GRADIO_PORT` or default to 7860
- `get_fara_config()` - Read `FARA_SERVER_URL` and `FARA_MODEL_ID` for the vision adapter
- `encode_image_to_data_url(path)` - Convert an image file to a base64 data URL
- `build_multimodal_content(text, image_path)` - Build OpenAI-format multimodal content
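For illustration, a minimal sketch of the `LM_STUDIO_SERVER_N` convention. The loop and URL normalization are assumptions; only the variable naming comes from the list above:

```python
# Hypothetical re-implementation of the env convention, not config.py itself:
# read LM_STUDIO_SERVER_1, LM_STUDIO_SERVER_2, ... until the first gap.
import os

def load_numbered_servers() -> list[str]:
    servers: list[str] = []
    n = 1
    while url := os.environ.get(f"LM_STUDIO_SERVER_{n}"):
        servers.append(url.rstrip("/"))  # trailing-slash normalization is an assumption
        n += 1
    return servers

# LM_STUDIO_SERVER_1=http://192.168.1.10:1234 → ["http://192.168.1.10:1234"]
```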
### core.py - Server Pool & Session Management

**Purpose:** Core business logic for server management and model interactions.

#### ServerPool

Manages multiple LM Studio servers:

```python
class ServerPool:
    servers: dict[str, ServerConfig]   # URL -> config
    _locks: dict[str, asyncio.Lock]    # URL -> lock

    async def refresh_all_manifests()                     # GET /v1/models on all servers
    def find_available_server(model_id) -> Optional[str]  # Find idle server with model
    async def acquire_server(url)                         # Mark busy, acquire lock
    def release_server(url)                               # Mark available, release lock
```
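A hedged usage sketch of this API (the constructor shape and model ID are assumptions):

```python
import asyncio
from prompt_prix.core import ServerPool

async def demo() -> None:
    pool = ServerPool(servers=["http://localhost:1234"])  # constructor shape assumed
    await pool.refresh_all_manifests()                    # fills available_models per server

    url = pool.find_available_server("qwen2.5-7b-instruct")
    if url is None:
        raise RuntimeError("no idle server hosts this model")

    await pool.acquire_server(url)   # mark busy + take the per-URL asyncio.Lock
    try:
        ...                          # issue a completion against `url`
    finally:
        pool.release_server(url)     # always return the server to the pool

asyncio.run(demo())
```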
#### ComparisonSession

Manages a comparison session:

```python
class ComparisonSession:
    server_pool: ServerPool
    state: SessionState    # Contains models, contexts, config

    async def send_prompt_to_model(model_id, prompt, on_chunk=None)
    async def send_prompt_to_all(prompt, on_chunk=None)
    def get_context_display(model_id) -> str
```
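And a sketch of driving a session; the `on_chunk` callback signature is an assumption inferred from the keyword argument above:

```python
async def compare(session: "ComparisonSession") -> None:
    def on_chunk(model_id: str, chunk: str) -> None:
        print(f"[{model_id}] {chunk}", end="", flush=True)  # live streaming output

    # Fan the same prompt out to every selected model.
    await session.send_prompt_to_all("Explain work stealing.", on_chunk=on_chunk)

    for model_id in session.state.models:
        print(session.get_context_display(model_id))  # rendered conversation per model
```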
#### Streaming Functions

```python
async def stream_completion(
    server_url, model_id, messages, temperature, max_tokens,
    timeout_seconds, tools=None, seed=None, repeat_penalty=None
) -> AsyncGenerator[str, None]:
    """Yields text chunks as they arrive via SSE.

    Args:
        seed: Optional int for reproducible outputs (passed to the model API)
        repeat_penalty: Optional float to penalize repeated tokens (1.0 = off)
    """

async def get_completion(...) -> str:
    """Non-streaming version; returns the full response."""
```
### handlers.py - Shared Event Handlers

**Purpose:** Shared async handlers used across multiple tabs.

| Handler | Purpose | Returns |
|---|---|---|
| `fetch_available_models(servers_text)` | Query all servers for available models | `(status, gr.update(choices=[...]))` |
| `handle_stop()` | Signal cancellation via global state | `status` |
| `_init_pool_and_validate(servers_text, models)` | Initialize ServerPool and validate models | `(pool, error_message)` |
### tabs/battery/handlers.py - Battery Tab Handlers

**Purpose:** Handlers specific to the Battery (benchmark) tab.

| Handler | Trigger | Returns |
|---|---|---|
| `validate_file(file_path)` | File upload | Validation status string |
| `get_test_ids(file_path)` | File upload | List of test IDs |
| `run_handler(file, models, servers, ...)` | "Run Battery" button | Generator yielding `(status, grid_df)` |
| `quick_prompt_handler(prompt, models, ...)` | "Run Prompt" button | Markdown results |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `export_csv()` | "Export CSV" button | `(status, preview)` |
| `get_cell_detail(model, test)` | Detail dropdown | Markdown detail |
| `refresh_grid(display_mode)` | Display mode change | Updated grid DataFrame |
### tabs/compare/handlers.py - Compare Tab Handlers

**Purpose:** Handlers specific to the Compare (interactive) tab.

| Handler | Trigger | Returns |
|---|---|---|
| `initialize_session(servers, models, system_prompt, ...)` | Auto-init on send | `(status, *model_tabs)` |
| `send_single_prompt(prompt, tools_json, image_path, seed, repeat_penalty)` | "Send to All" button | Generator yielding `(status, tab_states, *model_outputs)` |
| `export_markdown()` | "Export Markdown" button | `(status, preview)` |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `launch_beyond_compare(model_a, model_b)` | "Open in Beyond Compare" button | `status` |
**Compare Tab Features:**

- **Image Attachment:** Upload images for vision models (encoded as base64 data URLs)
- **Seed Parameter:** Set a seed for reproducible outputs across models
- **Repeat Penalty:** Configurable penalty (1.0-2.0) to reduce repetitive token generation
### dispatcher.py - Work-Stealing Dispatcher

**Purpose:** Parallel execution across multiple servers with work-stealing.

```python
class WorkStealingDispatcher:
    """Dispatches work items to servers using a work-stealing pattern."""

    async def dispatch(
        self,
        work_items: list[WorkItem],
        execute_fn: Callable[[WorkItem, str], Coroutine],
        on_progress: Optional[Callable[[str, str], None]] = None
    ) -> dict[str, Any]:
        """Execute work items in parallel across available servers."""
```
The dispatcher:

- Maintains a queue of work items (model + test case pairs)
- Finds idle servers that can run each work item
- Executes items in parallel across all available servers
- Supports cooperative cancellation via `state.should_stop()`
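A sketch of calling `dispatch()`; the `WorkItem` fields and the exact `execute_fn`/`on_progress` contracts are inferred from the signature above, not confirmed:

```python
from typing import Any

async def run_items(dispatcher: "WorkStealingDispatcher",
                    items: "list[WorkItem]") -> dict[str, Any]:

    async def execute(item: "WorkItem", server_url: str) -> str:
        # Run one (model, test case) pair on the server the dispatcher picked;
        # in prompt-prix this would call get_completion()/stream_completion().
        raise NotImplementedError  # elided for the sketch

    def on_progress(item_id: str, status: str) -> None:
        print(f"{item_id}: {status}")  # surface queued -> running -> done transitions

    return await dispatcher.dispatch(items, execute, on_progress=on_progress)
```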
### ui.py - Gradio UI Definition

**Purpose:** Define all Gradio components and wire up event bindings.

**Key Components:**

| Component | Type | Purpose |
|---|---|---|
| `servers_input` | Textbox | LM Studio server URLs (one per line) |
| `models_checkboxes` | CheckboxGroup | Select models to compare |
| `system_prompt_input` | Textbox (50 lines) | Editable system prompt |
| `temperature_slider` | Slider | Model temperature (0-2) |
| `timeout_slider` | Slider | Request timeout (30-600s) |
| `max_tokens_slider` | Slider | Max tokens (256-8192) |
| `seed_input` | Number | Optional seed for reproducible outputs |
| `repeat_penalty_slider` | Slider | Repeat penalty (1.0-2.0, default 1.1) |
| `prompt_input` | Textbox | User prompt entry |
| `image_input` | Image | Optional image attachment for vision models |
| `tools_input` | Code (JSON) | Tools for function calling |
| `model_outputs[0..9]` | Markdown | Model response tabs |
| `tab_states` | JSON (hidden) | Tab status for color updates |
**Event Bindings:**

- Buttons trigger async handlers
- `tab_states.change` triggers JavaScript for inline style updates
- `app.load` restores state from localStorage
### state.py - Global State

**Purpose:** Holds mutable state shared across handlers.

```python
server_pool: Optional[ServerPool] = None
session: Optional[ComparisonSession] = None
```

**Design Decision:** Separated into its own module to avoid circular imports between ui.py and handlers.py.
### parsers.py - Text Parsing Utilities

**Purpose:** Parse user input from UI components.

| Function | Input | Output |
|---|---|---|
| `parse_models_input(text)` | `"model1\nmodel2"` | `["model1", "model2"]` |
| `parse_servers_input(text)` | `"http://...\nhttp://..."` | `["http://...", "http://..."]` |
| `parse_prompts_file(content)` | File content | List of prompts |
| `load_system_prompt(file_path)` | Optional file path | System prompt string |
| `get_default_system_prompt()` | - | Default prompt from file or constant |
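The line-oriented parsers are simple enough to sketch; the real implementations may also strip comments or deduplicate, which is not confirmed here:

```python
def parse_lines(text: str) -> list[str]:
    """Shared shape of parse_models_input / parse_servers_input:
    one entry per non-empty line, whitespace trimmed."""
    return [line.strip() for line in text.splitlines() if line.strip()]

assert parse_lines("model1\nmodel2\n") == ["model1", "model2"]
```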
### export.py - Report Generation

**Purpose:** Generate exportable reports from session state.

```python
def generate_markdown_report(state: SessionState) -> str:
    """Create Markdown with header, system prompt, and all model conversations."""

def generate_json_report(state: SessionState) -> str:
    """Create structured JSON with configuration and conversations."""

def save_report(content: str, filepath: str):
    """Write report to file."""
```
### main.py - Entry Point

**Purpose:** Application entry point and backwards-compatibility exports.

```python
def run():
    app = create_app()
    app.launch(server_name="0.0.0.0", server_port=get_gradio_port())
```
## Data Flow: Sending a Prompt

```
1. User types prompt, clicks "Send Prompt"
   │
   ▼
2. ui.py: send_button.click(fn=send_single_prompt, inputs=[prompt, tools])
   │
   ▼
3. handlers.py: send_single_prompt(prompt, tools_json)
   │    - Validate session exists
   │    - Parse tools JSON
   │    - Add user message to all model contexts
   │    - Refresh server manifests
   │
   ▼
4. Work-Stealing Dispatcher Loop:
   │    ┌───────────────────────────────────────────┐
   │    │ For each idle server:                     │
   │    │   Find model in queue this server has     │
   │    │   If found: start async task              │
   │    └───────────────────────────────────────────┘
   │    │ await asyncio.sleep(0.1)
   │    │ yield (status, tab_states, *outputs) ─────►  UI updates
   │    │ Clean up completed tasks
   │    └────────── while queue or active_tasks
   │
   ▼
5. Each async task: run_model_on_server(model_id, server_url)
   │    - Mark model as "streaming"
   │    - Call stream_completion() ────────────────►  LM Studio API
   │    - Accumulate chunks in streaming_responses[model_id]
   │    - On complete: add assistant message to context
   │    - Release server
   │
   ▼
6. Final yield: ("✅ All responses complete", final_states, *final_outputs)
```
## State Management

### Session State (Python)

```python
class SessionState:
    models: list[str]                  # Selected models
    contexts: dict[str, ModelContext]  # model_id -> conversation
    system_prompt: str
    temperature: float
    timeout_seconds: int
    max_tokens: int
    halted: bool                       # True if any model failed
    halt_reason: Optional[str]
```
### UI State (Browser localStorage)

| Key | Type | Purpose |
|---|---|---|
| `promptprix_servers` | string | Server URLs (newline-separated) |
| `promptprix_model_choices` | JSON array | Available models from last fetch |
| `promptprix_models` | JSON array | Selected models |
| `promptprix_temperature` | float | Temperature setting |
| `promptprix_timeout` | int | Timeout setting |
| `promptprix_max_tokens` | int | Max tokens setting |
| `promptprix_tools` | string | Tools JSON |
| `promptprix_system_prompt` | string | System prompt text |
**Persistence:** State is saved only when the user clicks the "Save State" button (explicit save).
## Tab Status Visualization

Tab colors indicate model status during streaming:

| Status | Color | Border |
|---|---|---|
| `pending` | Red gradient (#fee2e2 → #fecaca) | 4px solid #ef4444 |
| `streaming` | Yellow gradient (#fef3c7 → #fde68a) | 4px solid #f59e0b |
| `completed` | Green gradient (#d1fae5 → #a7f3d0) | 4px solid #10b981 |

**Implementation:** Uses inline JavaScript styles (`element.style`) to override Gradio theme CSS.
## Error Handling

### Fail-Fast Validation

`initialize_session` validates that:

- Servers are configured
- Models are configured
- All selected models exist on at least one server

`send_single_prompt` validates that:

- The session is initialized
- The session is not halted
- The prompt is not empty
- The tools JSON is valid (if provided)
### Halt-on-Error

If any model fails during `send_prompt_to_all`:

- `state.halted = True`
- `state.halt_reason = "Model {model_id} failed: {error}"`
- Subsequent prompts are rejected
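A sketch of how the guard might look; where exactly it lives in core.py is an assumption, and the `SessionState` fields come from the State Management section above:

```python
async def _run_one(self, model_id: str, prompt: str) -> None:
    if self.state.halted:
        raise RuntimeError(f"Session halted: {self.state.halt_reason}")
    try:
        await self.send_prompt_to_model(model_id, prompt)
    except Exception as exc:
        self.state.halted = True                                   # poison the session
        self.state.halt_reason = f"Model {model_id} failed: {exc}"
        raise
```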
### Human-Readable Errors

The `LMStudioError` exception extracts error messages from LM Studio's JSON responses:

```
{"error": {"message": "Model not loaded"}} → "Model not loaded"
```
## Integration Points

### Upstream: Benchmark Sources
prompt-prix can consume test cases from established benchmark ecosystems:
| Source | Format | Usage |
|---|---|---|
| promptfoo | YAML with assertions | Full eval format with pass/fail criteria |
| Inspect AI | Python test definitions | Export prompts, import as JSON |
| Custom JSON | OpenAI-compatible messages | Direct load in prompt-prix |
See ADR-001 for rationale.
### API Layer: OpenAI-Compatible

All inference servers must expose OpenAI-compatible endpoints:

```
GET  /v1/models            → List available models
POST /v1/chat/completions  → Chat completion (streaming)
```
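Any HTTP client can probe these endpoints; here is a small async example (httpx is an illustrative choice, not necessarily what prompt-prix uses internally):

```python
import asyncio
import httpx

async def list_models(base_url: str) -> list[str]:
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.get(f"{base_url}/v1/models")
        resp.raise_for_status()
        return [m["id"] for m in resp.json()["data"]]  # OpenAI list schema

print(asyncio.run(list_models("http://localhost:1234")))
```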
Supported servers:
- LM Studio (native)
- Ollama (OpenAI mode)
- vLLM
- llama.cpp server
- Any OpenAI-compatible proxy
See ADR-003 for rationale.
### Fan-Out Dispatcher Pattern

The core abstraction is fan-out: one prompt dispatched to N models in parallel.

```
┌─────────────────────────────────────────────────────────────┐
│                     Fan-Out Dispatcher                      │
│                                                             │
│  Input: (prompt, [model_a, model_b, model_c])               │
│                 │                                           │
│  ┌──────────────┼──────────────┐                            │
│  ▼              ▼              ▼                            │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                      │
│  │ Model A │  │ Model B │  │ Model C │                      │
│  │ Server1 │  │ Server1 │  │ Server2 │                      │
│  └────┬────┘  └────┬────┘  └────┬────┘                      │
│       │            │            │                           │
│       ▼            ▼            ▼                           │
│   Response A   Response B   Response C                      │
│                                                             │
│  Output: {model_a: resp_a, model_b: resp_b, model_c: resp_c}│
└─────────────────────────────────────────────────────────────┘
```
### Work-Stealing Implementation
The dispatcher uses work-stealing for GPU efficiency:
- Queue: All models to process
- Acquire: Find idle server that has queued model
- Execute: Stream response, update UI
- Release: Server becomes available for next model
This maximizes utilization when models are distributed across multiple GPUs.
See ADR-002 for rationale.