# Architecture

This document describes the system architecture of prompt-prix, including module responsibilities, data flow, and key design decisions.

## System Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│                                 Browser                                 │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                         Gradio UI (ui.py)                         │  │
│  │ ┌──────────────┐  ┌──────────────┐  ┌─────────────────────┐       │  │
│  │ │ Config Panel │  │ Prompt Input │  │ Model Output Tabs   │       │  │
│  │ │ • Servers    │  │ • Single     │  │ • Tab 1..10         │       │  │
│  │ │ • Models     │  │ • Batch      │  │ • Streaming display │       │  │
│  │ │ • System     │  │ • Tools JSON │  │ • Status colors     │       │  │
│  │ │   Prompt     │  └──────────────┘  └─────────────────────┘       │  │
│  │ └──────────────┘                                                  │  │
│  │ ┌───────────────────────────────────────────────────┐             │  │
│  │ │ localStorage: servers, models, temperature, etc.  │             │  │
│  │ └───────────────────────────────────────────────────┘             │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                             Python Backend                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ handlers.py                                                       │  │
│  │ • fetch_available_models() → ServerPool.refresh_manifests()       │  │
│  │ • initialize_session()     → Create ComparisonSession             │  │
│  │ • send_single_prompt()     → Work-stealing dispatcher             │  │
│  │ • export_markdown/json()   → Report generation                    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ core.py                                                           │  │
│  │  ┌────────────────────┐          ┌──────────────────────────┐     │  │
│  │  │ ServerPool         │◄────────►│ ComparisonSession        │     │  │
│  │  │ • servers: dict    │          │ • state: SessionState    │     │  │
│  │  │ • refresh_manifest │          │ • send_prompt_to_model   │     │  │
│  │  │ • acquire/release  │          │ • get_context_display    │     │  │
│  │  └────────────────────┘          └──────────────────────────┘     │  │
│  │  ┌─────────────────────────────────────────────────┐              │  │
│  │  │ stream_completion() / get_completion()          │              │  │
│  │  │ • Async HTTP streaming to LM Studio             │              │  │
│  │  │ • Yields text chunks or returns full response   │              │  │
│  │  └─────────────────────────────────────────────────┘              │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ config.py                                                         │  │
│  │ Pydantic models: ServerConfig, ModelContext, SessionState         │  │
│  │ Constants: DEFAULT_TEMPERATURE, DEFAULT_MAX_TOKENS, etc.          │  │
│  │ Environment: load_servers_from_env(), get_gradio_port()           │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            LM Studio Servers                            │
│       ┌────────────────────────┐       ┌────────────────────────┐       │
│       │ Server 1 (e.g. 3090)   │       │ Server 2 (e.g. 8000)   │       │
│       │ • GET /v1/models       │       │ • GET /v1/models       │       │
│       │ • POST /v1/chat/...    │       │ • POST /v1/chat/...    │       │
│       │   └─ Model A           │       │   └─ Model B, C        │       │
│       │   └─ Model B           │       │                        │       │
│       └────────────────────────┘       └────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘
```
## Module Breakdown

### Directory Structure
```
prompt_prix/
├── main.py          # Entry point
├── ui.py            # Gradio UI definition
├── handlers.py      # Shared event handlers (fetch, stop)
├── state.py         # Global mutable state
├── core.py          # ServerPool, ComparisonSession, streaming
├── config.py        # Pydantic models, constants, env loading
├── parsers.py       # Input parsing utilities
├── export.py        # Report generation
├── dispatcher.py    # WorkStealingDispatcher for parallel execution
├── battery.py       # BatteryRunner, TestResult, BatteryRun
├── tabs/
│   ├── __init__.py
│   ├── battery/
│   │   ├── __init__.py
│   │   └── handlers.py   # Battery-specific handlers
│   └── compare/
│       ├── __init__.py
│       └── handlers.py   # Compare-specific handlers
├── adapters/
│   └── lmstudio.py       # LMStudioAdapter
└── benchmarks/
    ├── base.py           # TestCase protocol
    └── custom_json.py    # CustomJSONLoader
```
### config.py - Configuration & Data Models

**Purpose**: Define all Pydantic models for type-safe configuration and state.

| Class | Purpose |
|-------|---------|
| `ServerConfig` | Single LM Studio server state (URL, available_models, is_busy) |
| `ModelConfig` | Model identity and display name |
| `Message` | Single message in a conversation (role, content; supports multimodal) |
| `ModelContext` | Complete conversation history for one model |
| `SessionState` | Full session: models, contexts, system_prompt, halted status |

**Message Multimodal Support**:

The `Message` model supports both text and multimodal content:
```python
# Text-only message
Message(role="user", content="Hello")

# Multimodal message (text + image)
Message(role="user", content=[
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
])

# Helper methods
msg.get_text()    # Extract text content
msg.has_image()   # Check if message contains an image
```
**Key Functions**:

- `load_servers_from_env()` - Read `LM_STUDIO_SERVER_N` environment variables
- `get_default_servers()` - Return env servers or placeholder defaults
- `get_gradio_port()` - Read `GRADIO_PORT` or default to 7860
- `get_fara_config()` - Read `FARA_SERVER_URL` and `FARA_MODEL_ID` for the vision adapter
- `encode_image_to_data_url(path)` - Convert an image file to a base64 data URL
- `build_multimodal_content(text, image_path)` - Build OpenAI-format multimodal content
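The environment and image helpers are small enough to sketch. Below is a minimal illustration assuming the `LM_STUDIO_SERVER_1`, `LM_STUDIO_SERVER_2`, ... naming described above and plain-URL return values; the real implementations may differ:

```python
import base64
import mimetypes
import os


def load_servers_from_env() -> list[str]:
    """Collect LM_STUDIO_SERVER_1, LM_STUDIO_SERVER_2, ... until the first gap."""
    servers: list[str] = []
    n = 1
    while url := os.environ.get(f"LM_STUDIO_SERVER_{n}"):
        servers.append(url.rstrip("/"))
        n += 1
    return servers


def encode_image_to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{b64}"
```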
### core.py - Server Pool & Session Management

**Purpose**: Core business logic for server management and model interactions.

#### ServerPool

Manages multiple LM Studio servers:

```python
class ServerPool:
    servers: dict[str, ServerConfig]  # URL -> config
    _locks: dict[str, asyncio.Lock]   # URL -> lock

    async def refresh_all_manifests(self): ...                       # GET /v1/models on all servers
    def find_available_server(self, model_id) -> Optional[str]: ...  # Idle server hosting the model
    async def acquire_server(self, url): ...                         # Mark busy, acquire lock
    def release_server(self, url): ...                               # Mark available, release lock
```
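The acquire/release pairing is the important contract: a server is marked busy for exactly the duration of one completion. A hypothetical call site, built only from the interfaces shown above (`run_one` and its argument values are illustrative):

```python
async def run_one(pool: ServerPool, model_id: str, messages: list[dict]) -> str:
    """Run one completion on whichever idle server hosts the model (sketch)."""
    url = pool.find_available_server(model_id)
    if url is None:
        raise RuntimeError(f"No idle server hosts {model_id}")
    await pool.acquire_server(url)
    try:
        return await get_completion(
            url, model_id, messages,
            temperature=0.7, max_tokens=512, timeout_seconds=120,
        )
    finally:
        pool.release_server(url)  # always release, even if the request fails
```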
#### ComparisonSession

Manages a comparison session:

```python
class ComparisonSession:
    server_pool: ServerPool
    state: SessionState  # Contains models, contexts, config

    async def send_prompt_to_model(self, model_id, prompt, on_chunk=None): ...
    async def send_prompt_to_all(self, prompt, on_chunk=None): ...
    def get_context_display(self, model_id) -> str: ...
```
#### Streaming Functions

```python
async def stream_completion(
    server_url, model_id, messages, temperature, max_tokens,
    timeout_seconds, tools=None, seed=None, repeat_penalty=None
) -> AsyncGenerator[str, None]:
    """Yield text chunks as they arrive via SSE.

    Args:
        seed: Optional int for reproducible outputs (passed to the model API).
        repeat_penalty: Optional float penalizing repeated tokens (1.0 = off).
    """

async def get_completion(...) -> str:
    """Non-streaming version; returns the full response."""
```
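Consuming the stream is the standard `async for` pattern. A usage sketch against the signature above (server URL and model ID are placeholders):

```python
import asyncio


async def main() -> None:
    messages = [{"role": "user", "content": "Say hello."}]
    # Chunks print as they arrive over SSE; placeholder server and model.
    async for chunk in stream_completion(
        "http://localhost:1234", "some-model", messages,
        temperature=0.7, max_tokens=256, timeout_seconds=120,
    ):
        print(chunk, end="", flush=True)


asyncio.run(main())
```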
### handlers.py - Shared Event Handlers

**Purpose**: Shared async handlers used across multiple tabs.

| Handler | Purpose | Returns |
|---------|---------|---------|
| `fetch_available_models(servers_text)` | Query all servers for available models | `(status, gr.update(choices=[...]))` |
| `handle_stop()` | Signal cancellation via global state | `status` |
| `_init_pool_and_validate(servers_text, models)` | Initialize ServerPool and validate models | `(pool, error_message)` |
### tabs/battery/handlers.py - Battery Tab Handlers

**Purpose**: Handlers specific to the Battery (benchmark) tab.

| Handler | Trigger | Returns |
|---------|---------|---------|
| `validate_file(file_path)` | File upload | Validation status string |
| `get_test_ids(file_path)` | File upload | List of test IDs |
| `run_handler(file, models, servers, ...)` | "Run Battery" button | Generator yielding `(status, grid_df)` |
| `quick_prompt_handler(prompt, models, ...)` | "Run Prompt" button | Markdown results |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `export_csv()` | "Export CSV" button | `(status, preview)` |
| `get_cell_detail(model, test)` | Detail dropdown | Markdown detail |
| `refresh_grid(display_mode)` | Display mode change | Updated grid DataFrame |
### tabs/compare/handlers.py - Compare Tab Handlers

**Purpose**: Handlers specific to the Compare (interactive) tab.

| Handler | Trigger | Returns |
|---------|---------|---------|
| `initialize_session(servers, models, system_prompt, ...)` | Auto-init on send | `(status, *model_tabs)` |
| `send_single_prompt(prompt, tools_json, image_path, seed, repeat_penalty)` | "Send to All" button | Generator yielding `(status, tab_states, *model_outputs)` |
| `export_markdown()` | "Export Markdown" button | `(status, preview)` |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `launch_beyond_compare(model_a, model_b)` | "Open in Beyond Compare" button | `status` |

**Compare Tab Features**:

- **Image Attachment**: Upload images for vision models (encoded as base64 data URLs)
- **Seed Parameter**: Set a seed for reproducible outputs across models
- **Repeat Penalty**: Configurable penalty (1.0-2.0) to reduce repetitive token generation
### dispatcher.py - Work-Stealing Dispatcher

**Purpose**: Parallel execution across multiple servers with work-stealing.

```python
class WorkStealingDispatcher:
    """Dispatches work items to servers using a work-stealing pattern."""

    async def dispatch(
        self,
        work_items: list[WorkItem],
        execute_fn: Callable[[WorkItem, str], Coroutine],
        on_progress: Optional[Callable[[str, str], None]] = None,
    ) -> dict[str, Any]:
        """Execute work items in parallel across available servers."""
```
The dispatcher:

1. Maintains a queue of work items (model + test case pairs)
2. Finds idle servers that can run each work item
3. Executes items in parallel across all available servers
4. Supports cooperative cancellation via `state.should_stop()` (a condensed sketch of the loop follows this list)
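Condensed, the loop looks roughly like this. It leans on the `ServerPool` interface shown earlier; `item.model_id`, `item.id`, and the omitted cancellation check are assumptions for illustration, not the exact code:

```python
import asyncio
from typing import Any


async def dispatch_sketch(pool, work_items, execute_fn) -> dict[str, Any]:
    """Illustrative condensation of the dispatch loop described above."""
    queue = list(work_items)                 # 1. pending (model, test case) items
    active: dict[asyncio.Task, tuple] = {}
    results: dict[str, Any] = {}

    while queue or active:
        for item in list(queue):             # 2. match queued items to idle servers
            url = pool.find_available_server(item.model_id)
            if url is not None:
                queue.remove(item)
                await pool.acquire_server(url)
                task = asyncio.create_task(execute_fn(item, url))
                active[task] = (item, url)

        await asyncio.sleep(0.1)             # 3. let tasks stream in parallel

        for task in [t for t in active if t.done()]:
            item, url = active.pop(task)
            pool.release_server(url)         # 4. server is free for the next item
            results[item.id] = task.result()

    return results
```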
### ui.py - Gradio UI Definition

**Purpose**: Define all Gradio components and wire up event bindings.

**Key Components**:

| Component | Type | Purpose |
|-----------|------|---------|
| `servers_input` | Textbox | LM Studio server URLs (one per line) |
| `models_checkboxes` | CheckboxGroup | Select models to compare |
| `system_prompt_input` | Textbox (50 lines) | Editable system prompt |
| `temperature_slider` | Slider | Model temperature (0-2) |
| `timeout_slider` | Slider | Request timeout (30-600s) |
| `max_tokens_slider` | Slider | Max tokens (256-8192) |
| `seed_input` | Number | Optional seed for reproducible outputs |
| `repeat_penalty_slider` | Slider | Repeat penalty (1.0-2.0, default 1.1) |
| `prompt_input` | Textbox | User prompt entry |
| `image_input` | Image | Optional image attachment for vision models |
| `tools_input` | Code (JSON) | Tools for function calling |
| `model_outputs[0..9]` | Markdown | Model response tabs |
| `tab_states` | JSON (hidden) | Tab status for color updates |

**Event Bindings**:

- Buttons trigger async handlers
- `tab_states.change` triggers JavaScript for inline style updates
- `app.load` restores state from localStorage
### state.py - Global State

**Purpose**: Holds mutable state shared across handlers.

```python
server_pool: Optional[ServerPool] = None
session: Optional[ComparisonSession] = None
```

**Design Decision**: Kept in its own module to avoid circular imports between ui.py and handlers.py.
### parsers.py - Text Parsing Utilities

**Purpose**: Parse user input from UI components.

| Function | Input | Output |
|----------|-------|--------|
| `parse_models_input(text)` | `"model1\nmodel2"` | `["model1", "model2"]` |
| `parse_servers_input(text)` | `"http://...\nhttp://..."` | `["http://...", "http://..."]` |
| `parse_prompts_file(content)` | File content | List of prompts |
| `load_system_prompt(file_path)` | Optional file path | System prompt string |
| `get_default_system_prompt()` | - | Default prompt from file or constant |
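As an illustration, the two line-based parsers can be as small as the following sketch (not necessarily the exact implementation):

```python
def parse_models_input(text: str) -> list[str]:
    """One model ID per line; blank lines and surrounding whitespace dropped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_servers_input(text: str) -> list[str]:
    """One server URL per line; trailing slashes trimmed for consistent keys."""
    return [line.strip().rstrip("/") for line in text.splitlines() if line.strip()]
```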
### export.py - Report Generation

**Purpose**: Generate exportable reports from session state.

```python
def generate_markdown_report(state: SessionState) -> str:
    """Create Markdown with header, system prompt, and all model conversations."""

def generate_json_report(state: SessionState) -> str:
    """Create structured JSON with configuration and conversations."""

def save_report(content: str, filepath: str):
    """Write report to file."""
```
### main.py - Entry Point

**Purpose**: Application entry point and backwards-compatibility exports.

```python
def run():
    app = create_app()
    app.launch(server_name="0.0.0.0", server_port=get_gradio_port())
```
## Data Flow: Sending a Prompt

```
1. User types prompt, clicks "Send Prompt"
   │
   ▼
2. ui.py: send_button.click(fn=send_single_prompt, inputs=[prompt, tools])
   │
   ▼
3. handlers.py: send_single_prompt(prompt, tools_json)
   │   - Validate session exists
   │   - Parse tools JSON
   │   - Add user message to all model contexts
   │   - Refresh server manifests
   │
   ▼
4. Work-stealing dispatcher loop:
   │   ┌───────────────────────────────────────────┐
   │   │ For each idle server:                     │
   │   │   Find model in queue this server has     │
   │   │   If found: start async task              │
   │   └───────────────────────────────────────────┘
   │      │  await asyncio.sleep(0.1)
   │      │  yield (status, tab_states, *outputs) ──────► UI updates
   │      │  Clean up completed tasks
   │      └─────────── repeat while queue or active_tasks
   │
   ▼
5. Each async task: run_model_on_server(model_id, server_url)
   │   - Mark model as "streaming"
   │   - Call stream_completion() ────────────────────► LM Studio API
   │   - Accumulate chunks in streaming_responses[model_id]
   │   - On complete: add assistant message to context
   │   - Release server
   │
   ▼
6. Final yield: ("✓ All responses complete", final_states, *final_outputs)
```
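Steps 4-6 run inside an async generator: every `yield` becomes one UI repaint, which is what makes tokens appear incrementally. A self-contained toy version of that shape (the real handler streams from servers instead of `fake_stream`):

```python
import asyncio


async def fake_stream(model: str, outputs: dict[str, str]) -> None:
    """Stand-in for run_model_on_server(): accumulate chunks as they arrive."""
    for chunk in ("Hello", ", ", "world"):
        await asyncio.sleep(0.2)
        outputs[model] += chunk


async def send_prompt_sketch(models: list[str]):
    """Yield partial UI state while per-model tasks stream (illustrative)."""
    outputs = {m: "" for m in models}
    tasks = [asyncio.create_task(fake_stream(m, outputs)) for m in models]
    while not all(t.done() for t in tasks):
        await asyncio.sleep(0.1)                       # poll, then repaint
        yield ("Streaming...", *outputs.values())
    yield ("✓ All responses complete", *outputs.values())


async def main() -> None:
    async for status, *outs in send_prompt_sketch(["model-a", "model-b"]):
        print(status, outs)


asyncio.run(main())
```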
## State Management

### Session State (Python)

```python
SessionState:
    models: list[str]                  # Selected models
    contexts: dict[str, ModelContext]  # model_id -> conversation
    system_prompt: str
    temperature: float
    timeout_seconds: int
    max_tokens: int
    halted: bool                       # True if any model failed
    halt_reason: Optional[str]
```
### UI State (Browser localStorage)

| Key | Type | Purpose |
|-----|------|---------|
| `promptprix_servers` | string | Server URLs (newline-separated) |
| `promptprix_model_choices` | JSON array | Available models from last fetch |
| `promptprix_models` | JSON array | Selected models |
| `promptprix_temperature` | float | Temperature setting |
| `promptprix_timeout` | int | Timeout setting |
| `promptprix_max_tokens` | int | Max tokens setting |
| `promptprix_tools` | string | Tools JSON |
| `promptprix_system_prompt` | string | System prompt text |

**Persistence**: Only saved when the user clicks the "Save State" button (explicit save).
## Tab Status Visualization

Tab colors indicate model status during streaming:

| Status | Color | Border |
|--------|-------|--------|
| `pending` | Red gradient (#fee2e2 → #fecaca) | 4px solid #ef4444 |
| `streaming` | Yellow gradient (#fef3c7 → #fde68a) | 4px solid #f59e0b |
| `completed` | Green gradient (#d1fae5 → #a7f3d0) | 4px solid #10b981 |

**Implementation**: Uses inline JavaScript styles (`element.style`) to override Gradio's theme CSS.
## Error Handling

### Fail-Fast Validation

1. `initialize_session` validates:
   - Servers are configured
   - Models are configured
   - All selected models exist on at least one server
2. `send_single_prompt` validates:
   - Session is initialized
   - Session is not halted
   - Prompt is not empty
   - Tools JSON is valid (if provided)

### Halt-on-Error

If any model fails during `send_prompt_to_all`:

- `state.halted = True`
- `state.halt_reason = "Model {model_id} failed: {error}"`
- Subsequent prompts are rejected
### Human-Readable Errors

The `LMStudioError` exception extracts error messages from LM Studio's JSON responses:

```python
{"error": {"message": "Model not loaded"}} → "Model not loaded"
```
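A sketch of how that extraction might look (the class shape and raw-body fallback are assumptions; only the message-extraction behavior is documented above):

```python
import json


class LMStudioError(Exception):
    """Carries a human-readable message pulled from an error response (sketch)."""

    @classmethod
    def from_response_body(cls, body: str) -> "LMStudioError":
        try:
            message = json.loads(body)["error"]["message"]
        except (json.JSONDecodeError, KeyError, TypeError):
            message = body  # fall back to the raw body
        return cls(message)


# from_response_body('{"error": {"message": "Model not loaded"}}')
# -> LMStudioError("Model not loaded")
```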
## Integration Points

### Upstream: Benchmark Sources

prompt-prix can consume test cases from established benchmark ecosystems:

| Source | Format | Usage |
|--------|--------|-------|
| **promptfoo** | YAML with assertions | Full eval format with pass/fail criteria |
| **Inspect AI** | Python test definitions | Export prompts, import as JSON |
| **Custom JSON** | OpenAI-compatible messages | Direct load in prompt-prix |

See [ADR-001](adr/completed/001-use-existing-benchmarks.md) for rationale.
### API Layer: OpenAI-Compatible

All inference servers must expose OpenAI-compatible endpoints:

```
GET  /v1/models            → List available models
POST /v1/chat/completions  → Chat completion (streaming)
```
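For example, the manifest endpoint can be queried with plain `httpx` (the URL is a placeholder; the response shape follows the OpenAI spec):

```python
import httpx

# Ask a server which models it currently serves (placeholder URL).
resp = httpx.get("http://localhost:1234/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```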
Supported servers:

- LM Studio (native)
- Ollama (OpenAI mode)
- vLLM
- llama.cpp server
- Any OpenAI-compatible proxy

See [ADR-003](adr/completed/003-openai-compatible-api.md) for rationale.
## Fan-Out Dispatcher Pattern

The core abstraction is **fan-out**: one prompt dispatched to N models in parallel.

```
┌────────────────────────────────────────────────────────────────┐
│                       Fan-Out Dispatcher                       │
│                                                                │
│   Input: (prompt, [model_a, model_b, model_c])                 │
│                    │                                           │
│         ┌──────────┼──────────┐                                │
│         ▼          ▼          ▼                                │
│    ┌─────────┐ ┌─────────┐ ┌─────────┐                         │
│    │ Model A │ │ Model B │ │ Model C │                         │
│    │ Server1 │ │ Server1 │ │ Server2 │                         │
│    └────┬────┘ └────┬────┘ └────┬────┘                         │
│         │           │           │                              │
│         ▼           ▼           ▼                              │
│    Response A  Response B  Response C                          │
│                                                                │
│   Output: {model_a: resp_a, model_b: resp_b, model_c: resp_c}  │
└────────────────────────────────────────────────────────────────┘
```
### Work-Stealing Implementation

The dispatcher uses work-stealing for GPU efficiency:

1. **Queue**: All models to process
2. **Acquire**: Find an idle server that hosts a queued model
3. **Execute**: Stream the response, update the UI
4. **Release**: The server becomes available for the next model

This maximizes utilization when models are distributed across multiple GPUs.

See [ADR-002](adr/completed/002-fan-out-pattern-as-core.md) for rationale.
## Architecture Decision Records

| ADR | Decision |
|-----|----------|
| [001](adr/completed/001-use-existing-benchmarks.md) | Use existing benchmarks (promptfoo, Inspect AI) instead of a custom eval schema |
| [002](adr/completed/002-fan-out-pattern-as-core.md) | Fan-out pattern as the core architectural abstraction |
| [003](adr/completed/003-openai-compatible-api.md) | OpenAI-compatible API as the sole integration layer |