Architecture

This document describes the system architecture of prompt-prix, including module responsibilities, data flow, and key design decisions.

System Overview

┌─────────────────────────────────────────────────────────────────────┐
│                               Browser                               │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                       Gradio UI (ui.py)                       │  │
│  │  ┌──────────────┐  ┌──────────────┐  ┌─────────────────────┐  │  │
│  │  │ Config Panel │  │ Prompt Input │  │ Model Output Tabs   │  │  │
│  │  │ • Servers    │  │ • Single     │  │ • Tab 1..10         │  │  │
│  │  │ • Models     │  │ • Batch      │  │ • Streaming display │  │  │
│  │  │ • System     │  │ • Tools JSON │  │ • Status colors     │  │  │
│  │  │   Prompt     │  └──────────────┘  └─────────────────────┘  │  │
│  │  └──────────────┘                                             │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │ localStorage: servers, models, temperature, etc.        │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           Python Backend                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                          handlers.py                          │  │
│  │  • fetch_available_models()  → ServerPool.refresh_manifests() │  │
│  │  • initialize_session()      → Create ComparisonSession       │  │
│  │  • send_single_prompt()      → Work-stealing dispatcher       │  │
│  │  • export_markdown/json()    → Report generation              │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│  ┌─────────────────────────────┼─────────────────────────────────┐  │
│  │           core.py           │                                 │  │
│  │  ┌─────────────────────┐    │    ┌─────────────────────────┐  │  │
│  │  │    ServerPool       │◄───┴───►│  ComparisonSession      │  │  │
│  │  │  • servers: dict    │         │  • state: SessionState  │  │  │
│  │  │  • refresh_manifest │         │  • send_prompt_to_model │  │  │
│  │  │  • acquire/release  │         │  • get_context_display  │  │  │
│  │  └─────────────────────┘         └─────────────────────────┘  │  │
│  │                                                               │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │  stream_completion() / get_completion()                 │  │  │
│  │  │  • Async HTTP streaming to LM Studio                    │  │  │
│  │  │  • Yields text chunks or returns full response          │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                           config.py                           │  │
│  │  Pydantic Models: ServerConfig, ModelContext, SessionState    │  │
│  │  Constants: DEFAULT_TEMPERATURE, DEFAULT_MAX_TOKENS, etc.     │  │
│  │  Environment: load_servers_from_env(), get_gradio_port()      │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          LM Studio Servers                          │
│  ┌────────────────────────┐    ┌────────────────────────┐           │
│  │  Server 1 (e.g. 3090)  │    │  Server 2 (e.g. 8000)  │           │
│  │  • GET /v1/models      │    │  • GET /v1/models      │           │
│  │  • POST /v1/chat/...   │    │  • POST /v1/chat/...   │           │
│  │  └─ Model A            │    │  └─ Model B, C         │           │
│  │  └─ Model B            │    │                        │           │
│  └────────────────────────┘    └────────────────────────┘           │
└─────────────────────────────────────────────────────────────────────┘

Module Breakdown

Directory Structure

prompt_prix/
β”œβ”€β”€ main.py              # Entry point
β”œβ”€β”€ ui.py                # Gradio UI definition
β”œβ”€β”€ handlers.py          # Shared event handlers (fetch, stop)
β”œβ”€β”€ state.py             # Global mutable state
β”œβ”€β”€ core.py              # ServerPool, ComparisonSession, streaming
β”œβ”€β”€ config.py            # Pydantic models, constants, env loading
β”œβ”€β”€ parsers.py           # Input parsing utilities
β”œβ”€β”€ export.py            # Report generation
β”œβ”€β”€ dispatcher.py        # WorkStealingDispatcher for parallel execution
β”œβ”€β”€ battery.py           # BatteryRunner, TestResult, BatteryRun
β”œβ”€β”€ tabs/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ battery/
β”‚   β”‚   β”œβ”€β”€ __init__.py
β”‚   β”‚   └── handlers.py  # Battery-specific handlers
β”‚   └── compare/
β”‚       β”œβ”€β”€ __init__.py
β”‚       └── handlers.py  # Compare-specific handlers
β”œβ”€β”€ adapters/
β”‚   └── lmstudio.py      # LMStudioAdapter
└── benchmarks/
    β”œβ”€β”€ base.py          # TestCase protocol
    └── custom_json.py   # CustomJSONLoader

config.py - Configuration & Data Models

Purpose: Define all Pydantic models for type-safe configuration and state.

| Class | Purpose |
|---|---|
| ServerConfig | Single LM Studio server state (URL, available_models, is_busy) |
| ModelConfig | Model identity and display name |
| Message | Single message in a conversation (role, content; supports multimodal) |
| ModelContext | Complete conversation history for one model |
| SessionState | Full session: models, contexts, system_prompt, halted status |

Message Multimodal Support: The Message model supports both text and multimodal content:

# Text-only message
Message(role="user", content="Hello")

# Multimodal message (text + image)
Message(role="user", content=[
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
])

# Helper methods
msg.get_text()   # Extract text content
msg.has_image()  # Check if message contains an image

Key Functions:

  • load_servers_from_env() - Read LM_STUDIO_SERVER_N environment variables
  • get_default_servers() - Return env servers or placeholder defaults
  • get_gradio_port() - Read GRADIO_PORT or default to 7860
  • get_fara_config() - Read FARA_SERVER_URL and FARA_MODEL_ID for vision adapter
  • encode_image_to_data_url(path) - Convert image file to base64 data URL
  • build_multimodal_content(text, image_path) - Build OpenAI-format multimodal content

core.py - Server Pool & Session Management

Purpose: Core business logic for server management and model interactions.

ServerPool

Manages multiple LM Studio servers:

class ServerPool:
    servers: dict[str, ServerConfig]  # URL -> config
    _locks: dict[str, asyncio.Lock]   # URL -> lock

    async def refresh_all_manifests()  # GET /v1/models on all servers
    def find_available_server(model_id) -> Optional[str]  # Find idle server with model
    async def acquire_server(url)      # Mark busy, acquire lock
    def release_server(url)            # Mark available, release lock
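
A sketch of the acquire/stream/release cycle a caller runs against the pool. The keyword arguments follow the stream_completion signature shown under Streaming Functions below; everything else here is illustrative:

```python
async def run_one(pool: ServerPool, model_id: str, messages: list) -> str:
    """Illustrative: claim a server, stream a completion, always release."""
    url = pool.find_available_server(model_id)
    if url is None:
        raise RuntimeError(f"no idle server has {model_id} loaded")
    await pool.acquire_server(url)        # mark busy, take the per-server lock
    chunks: list[str] = []
    try:
        async for chunk in stream_completion(
            url, model_id, messages,
            temperature=0.7, max_tokens=1024, timeout_seconds=120,
        ):
            chunks.append(chunk)
    finally:
        pool.release_server(url)          # free the server even on error
    return "".join(chunks)
```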

ComparisonSession

Manages a comparison session:

class ComparisonSession:
    server_pool: ServerPool
    state: SessionState  # Contains models, contexts, config

    async def send_prompt_to_model(model_id, prompt, on_chunk=None)
    async def send_prompt_to_all(prompt, on_chunk=None)
    def get_context_display(model_id) -> str
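
Illustrative usage of a session; the on_chunk callback signature (model_id, new text) is an assumption:

```python
async def demo(pool: ServerPool, state: SessionState) -> None:
    """Hypothetical driver: fan a prompt out and print chunks as they arrive."""
    session = ComparisonSession(server_pool=pool, state=state)

    def on_chunk(model_id: str, text: str) -> None:   # callback shape assumed
        print(f"[{model_id}] {text}")

    await session.send_prompt_to_all("Explain the repo layout.", on_chunk=on_chunk)
    print(session.get_context_display("model-a"))
```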

Streaming Functions

async def stream_completion(
    server_url, model_id, messages, temperature, max_tokens,
    timeout_seconds, tools=None, seed=None, repeat_penalty=None
) -> AsyncGenerator[str, None]:
    """Yields text chunks as they arrive via SSE.

    Args:
        seed: Optional int for reproducible outputs (passed to model API)
        repeat_penalty: Optional float to penalize repeated tokens (1.0 = off)
    """

async def get_completion(...) -> str:
    """Non-streaming version, returns full response."""

handlers.py - Shared Event Handlers

Purpose: Shared async handlers used across multiple tabs.

| Handler | Purpose | Returns |
|---|---|---|
| fetch_available_models(servers_text) | Query all servers for available models | (status, gr.update(choices=[...])) |
| handle_stop() | Signal cancellation via global state | status |
| _init_pool_and_validate(servers_text, models) | Initialize ServerPool and validate models | (pool, error_message) |

tabs/battery/handlers.py - Battery Tab Handlers

Purpose: Handlers specific to the Battery (benchmark) tab.

| Handler | Trigger | Returns |
|---|---|---|
| validate_file(file_path) | File upload | Validation status string |
| get_test_ids(file_path) | File upload | List of test IDs |
| run_handler(file, models, servers, ...) | "Run Battery" button | Generator yielding (status, grid_df) |
| quick_prompt_handler(prompt, models, ...) | "Run Prompt" button | Markdown results |
| export_json() | "Export JSON" button | (status, preview) |
| export_csv() | "Export CSV" button | (status, preview) |
| get_cell_detail(model, test) | Detail dropdown | Markdown detail |
| refresh_grid(display_mode) | Display mode change | Updated grid DataFrame |

tabs/compare/handlers.py - Compare Tab Handlers

Purpose: Handlers specific to the Compare (interactive) tab.

| Handler | Trigger | Returns |
|---|---|---|
| initialize_session(servers, models, system_prompt, ...) | Auto-init on send | (status, *model_tabs) |
| send_single_prompt(prompt, tools_json, image_path, seed, repeat_penalty) | "Send to All" button | Generator yielding (status, tab_states, *model_outputs) |
| export_markdown() | "Export Markdown" button | (status, preview) |
| export_json() | "Export JSON" button | (status, preview) |
| launch_beyond_compare(model_a, model_b) | "Open in Beyond Compare" button | status |

Compare Tab Features:

  • Image Attachment: Upload images for vision models (encoded as base64 data URLs)
  • Seed Parameter: Set a seed for reproducible outputs across models
  • Repeat Penalty: Configurable penalty (1.0-2.0) to reduce repetitive token generation

dispatcher.py - Work-Stealing Dispatcher

Purpose: Parallel execution across multiple servers with work-stealing.

class WorkStealingDispatcher:
    """Dispatches work items to servers using work-stealing pattern."""

    async def dispatch(
        self,
        work_items: list[WorkItem],
        execute_fn: Callable[[WorkItem, str], Coroutine],
        on_progress: Optional[Callable[[str, str], None]] = None
    ) -> dict[str, Any]:
        """Execute work items in parallel across available servers."""

The dispatcher:

  1. Maintains a queue of work items (model + test case pairs)
  2. Finds idle servers that can run each work item
  3. Executes items in parallel across all available servers
  4. Supports cooperative cancellation via state.should_stop()
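
Stripped to its essentials, the loop looks roughly like this (a sketch, assuming each WorkItem carries a model_id; the real class also tracks richer progress and error states):

```python
import asyncio


async def dispatch_sketch(queue: list, pool, execute_fn) -> dict:
    """Illustrative work-stealing loop over WorkItem-like objects."""
    active: dict[asyncio.Task, tuple] = {}   # task -> (item, server URL)
    results: dict = {}
    while queue or active:
        for item in list(queue):             # try to place every queued item
            url = pool.find_available_server(item.model_id)
            if url is None:
                continue                     # no idle server holds this model yet
            queue.remove(item)
            await pool.acquire_server(url)
            active[asyncio.create_task(execute_fn(item, url))] = (item, url)
        await asyncio.sleep(0.1)             # poll interval; UI refreshes here
        for task in [t for t in active if t.done()]:
            item, url = active.pop(task)     # reap finished work
            pool.release_server(url)
            results[item.model_id] = task.result()
    return results
```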

ui.py - Gradio UI Definition

Purpose: Define all Gradio components and wire up event bindings.

Key Components:

| Component | Type | Purpose |
|---|---|---|
| servers_input | Textbox | LM Studio server URLs (one per line) |
| models_checkboxes | CheckboxGroup | Select models to compare |
| system_prompt_input | Textbox (50 lines) | Editable system prompt |
| temperature_slider | Slider | Model temperature (0-2) |
| timeout_slider | Slider | Request timeout (30-600s) |
| max_tokens_slider | Slider | Max tokens (256-8192) |
| seed_input | Number | Optional seed for reproducible outputs |
| repeat_penalty_slider | Slider | Repeat penalty (1.0-2.0, default 1.1) |
| prompt_input | Textbox | User prompt entry |
| image_input | Image | Optional image attachment for vision models |
| tools_input | Code (JSON) | Tools for function calling |
| model_outputs[0..9] | Markdown | Model response tabs |
| tab_states | JSON (hidden) | Tab status for color updates |

Event Bindings:

  • Buttons trigger async handlers
  • tab_states.change triggers JavaScript for inline style updates
  • app.load restores state from localStorage

state.py - Global State

Purpose: Holds mutable state shared across handlers.

server_pool: Optional[ServerPool] = None
session: Optional[ComparisonSession] = None

Design Decision: Separated to avoid circular imports between ui.py and handlers.py.

parsers.py - Text Parsing Utilities

Purpose: Parse user input from UI components.

| Function | Input | Output |
|---|---|---|
| parse_models_input(text) | "model1\nmodel2" | ["model1", "model2"] |
| parse_servers_input(text) | "http://...\nhttp://..." | ["http://...", "http://..."] |
| parse_prompts_file(content) | File content | List of prompts |
| load_system_prompt(file_path) | Optional file path | System prompt string |
| get_default_system_prompt() | - | Default prompt from file or constant |
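
The line-based parsers are likely simple splitters; a plausible sketch:

```python
def parse_models_input(text: str) -> list[str]:
    """One identifier per line; blanks and surrounding whitespace ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_servers_input(text: str) -> list[str]:
    """Same splitting, trimming any trailing slash from each URL (assumed)."""
    return [line.strip().rstrip("/") for line in text.splitlines() if line.strip()]
```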

export.py - Report Generation

Purpose: Generate exportable reports from session state.

def generate_markdown_report(state: SessionState) -> str:
    """Create Markdown with header, system prompt, and all model conversations."""

def generate_json_report(state: SessionState) -> str:
    """Create structured JSON with configuration and conversations."""

def save_report(content: str, filepath: str):
    """Write report to file."""

main.py - Entry Point

Purpose: Application entry point and backwards-compatibility exports.

def run():
    app = create_app()
    app.launch(server_name="0.0.0.0", server_port=get_gradio_port())

Data Flow: Sending a Prompt

1. User types prompt, clicks "Send Prompt"
         │
         ▼
2. ui.py: send_button.click(fn=send_single_prompt, inputs=[prompt, tools])
         │
         ▼
3. handlers.py: send_single_prompt(prompt, tools_json)
   │ - Validate session exists
   │ - Parse tools JSON
   │ - Add user message to all model contexts
   │ - Refresh server manifests
         │
         ▼
4. Work-Stealing Dispatcher Loop:
   │ ┌─────────────────────────────────────────┐
   │ │ For each idle server:                   │
   │ │   Find model in queue this server has   │
   │ │   If found: start async task            │
   │ └─────────────────────────────────────────┘
   │ │ await asyncio.sleep(0.1)
   │ │ yield (status, tab_states, *outputs)  ──────► UI updates
   │ │ Clean up completed tasks
   │ └─────────── while queue or active_tasks
         │
         ▼
5. Each async task: run_model_on_server(model_id, server_url)
   │ - Mark model as "streaming"
   │ - Call stream_completion() ───────────────────► LM Studio API
   │ - Accumulate chunks in streaming_responses[model_id]
   │ - On complete: add assistant message to context
   │ - Release server
         │
         ▼
6. Final yield: ("✅ All responses complete", final_states, *final_outputs)

State Management

Session State (Python)

SessionState:
  models: list[str]                    # Selected models
  contexts: dict[str, ModelContext]    # model_id -> conversation
  system_prompt: str
  temperature: float
  timeout_seconds: int
  max_tokens: int
  halted: bool                         # True if any model failed
  halt_reason: Optional[str]

UI State (Browser localStorage)

| Key | Type | Purpose |
|---|---|---|
| promptprix_servers | string | Server URLs (newline-separated) |
| promptprix_model_choices | JSON array | Available models from last fetch |
| promptprix_models | JSON array | Selected models |
| promptprix_temperature | float | Temperature setting |
| promptprix_timeout | int | Timeout setting |
| promptprix_max_tokens | int | Max tokens setting |
| promptprix_tools | string | Tools JSON |
| promptprix_system_prompt | string | System prompt text |

Persistence: state is saved only when the user clicks the "Save State" button (explicit save).

Tab Status Visualization

Tab colors indicate model status during streaming:

| Status | Color | Border |
|---|---|---|
| pending | Red gradient (#fee2e2 → #fecaca) | 4px solid #ef4444 |
| streaming | Yellow gradient (#fef3c7 → #fde68a) | 4px solid #f59e0b |
| completed | Green gradient (#d1fae5 → #a7f3d0) | 4px solid #10b981 |

Implementation: inline styles are set from JavaScript (element.style) to override Gradio theme CSS.

Error Handling

Fail-Fast Validation

  1. initialize_session validates:
     • Servers are configured
     • Models are configured
     • All selected models exist on at least one server

  2. send_single_prompt validates:
     • Session is initialized
     • Session is not halted
     • Prompt is not empty
     • Tools JSON is valid (if provided)

Halt-on-Error

If any model fails during send_prompt_to_all:

  • state.halted = True
  • state.halt_reason = "Model {model_id} failed: {error}"
  • Subsequent prompts are rejected

Human-Readable Errors

The LMStudioError exception extracts error messages from LM Studio's JSON responses:

{"error": {"message": "Model not loaded"}}  β†’  "Model not loaded"

Integration Points

Upstream: Benchmark Sources

prompt-prix can consume test cases from established benchmark ecosystems:

| Source | Format | Usage |
|---|---|---|
| promptfoo | YAML with assertions | Full eval format with pass/fail criteria |
| Inspect AI | Python test definitions | Export prompts, import as JSON |
| Custom JSON | OpenAI-compatible messages | Direct load in prompt-prix |

See ADR-001 for rationale.

API Layer: OpenAI-Compatible

All inference servers must expose OpenAI-compatible endpoints:

GET  /v1/models              → List available models
POST /v1/chat/completions    → Chat completion (streaming)
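
For example, refreshing a server manifest is a single GET against the first endpoint (a sketch; fetch_model_ids is an illustrative helper name):

```python
import httpx


async def fetch_model_ids(server_url: str) -> list[str]:
    """List model ids from any OpenAI-compatible /v1/models endpoint."""
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"{server_url}/v1/models")
        resp.raise_for_status()
        return [m["id"] for m in resp.json()["data"]]
```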

Supported servers:

  • LM Studio (native)
  • Ollama (OpenAI mode)
  • vLLM
  • llama.cpp server
  • Any OpenAI-compatible proxy

See ADR-003 for rationale.

Fan-Out Dispatcher Pattern

The core abstraction is fan-out: one prompt dispatched to N models in parallel.

┌──────────────────────────────────────────────────────────────┐
│                      Fan-Out Dispatcher                      │
│                                                              │
│  Input: (prompt, [model_a, model_b, model_c])                │
│                        │                                     │
│         ┌──────────────┼──────────────┐                      │
│         ▼              ▼              ▼                      │
│    ┌─────────┐    ┌─────────┐    ┌─────────┐                 │
│    │ Model A │    │ Model B │    │ Model C │                 │
│    │ Server1 │    │ Server1 │    │ Server2 │                 │
│    └────┬────┘    └────┬────┘    └────┬────┘                 │
│         │              │              │                      │
│         ▼              ▼              ▼                      │
│    Response A     Response B     Response C                  │
│                                                              │
│  Output: {model_a: resp_a, model_b: resp_b, model_c: resp_c} │
└──────────────────────────────────────────────────────────────┘

Work-Stealing Implementation

The dispatcher uses work-stealing for GPU efficiency:

  1. Queue: All models to process
  2. Acquire: Find idle server that has queued model
  3. Execute: Stream response, update UI
  4. Release: Server becomes available for next model

This maximizes utilization when models are distributed across multiple GPUs.

See ADR-002 for rationale.

Architecture Decision Records

| ADR | Decision |
|---|---|
| 001 | Use existing benchmarks (promptfoo, Inspect AI) instead of custom eval schema |
| 002 | Fan-out pattern as core architectural abstraction |
| 003 | OpenAI-compatible API as sole integration layer |