# Architecture
This document describes the system architecture of prompt-prix, including module responsibilities, data flow, and key design decisions.
## System Overview
```
┌───────────────────────────────────────────────────────────────────────┐
│                                Browser                                │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                        Gradio UI (ui.py)                        │  │
│  │  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────────┐ │  │
│  │  │ Config Panel│  │ Prompt Input │  │ Model Output Tabs       │ │  │
│  │  │ • Servers   │  │ • Single     │  │ • Tab 1..10             │ │  │
│  │  │ • Models    │  │ • Batch      │  │ • Streaming display     │ │  │
│  │  │ • System    │  │ • Tools JSON │  │ • Status colors         │ │  │
│  │  │   Prompt    │  └──────────────┘  └─────────────────────────┘ │  │
│  │  └─────────────┘                                                │  │
│  │  ┌──────────────────────────────────────────────────────────┐  │  │
│  │  │ localStorage: servers, models, temperature, etc.         │  │  │
│  │  └──────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│                            Python Backend                             │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                           handlers.py                           │  │
│  │  • fetch_available_models() → ServerPool.refresh_manifests()    │  │
│  │  • initialize_session()     → Create ComparisonSession          │  │
│  │  • send_single_prompt()     → Work-stealing dispatcher          │  │
│  │  • export_markdown/json()   → Report generation                 │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                  │                                    │
│  ┌───────────────────────────────┼─────────────────────────────────┐  │
│  │  core.py                      │                                 │  │
│  │  ┌─────────────────────┐      │      ┌─────────────────────────┐│  │
│  │  │ ServerPool          │◄─────┴─────►│ ComparisonSession       ││  │
│  │  │ • servers: dict     │             │ • state: SessionState   ││  │
│  │  │ • refresh_manifest  │             │ • send_prompt_to_model  ││  │
│  │  │ • acquire/release   │             │ • get_context_display   ││  │
│  │  └─────────────────────┘             └─────────────────────────┘│  │
│  │  ┌──────────────────────────────────────────────────────────┐  │  │
│  │  │        stream_completion() / get_completion()            │  │  │
│  │  │  • Async HTTP streaming to LM Studio                     │  │  │
│  │  │  • Yields text chunks or returns full response           │  │  │
│  │  └──────────────────────────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                  │                                    │
│  ┌─────────────────────────────────────────────────────────────────┐  │
│  │                            config.py                            │  │
│  │  Pydantic Models: ServerConfig, ModelContext, SessionState      │  │
│  │  Constants: DEFAULT_TEMPERATURE, DEFAULT_MAX_TOKENS, etc.       │  │
│  │  Environment: load_servers_from_env(), get_gradio_port()        │  │
│  └─────────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌───────────────────────────────────────────────────────────────────────┐
│                           LM Studio Servers                           │
│  ┌────────────────────────┐        ┌────────────────────────┐         │
│  │  Server 1 (e.g. 3090)  │        │  Server 2 (e.g. 8000)  │         │
│  │  • GET /v1/models      │        │  • GET /v1/models      │         │
│  │  • POST /v1/chat/...   │        │  • POST /v1/chat/...   │         │
│  │  ├─ Model A            │        │  ├─ Model B, C         │         │
│  │  └─ Model B            │        │                        │         │
│  └────────────────────────┘        └────────────────────────┘         │
└───────────────────────────────────────────────────────────────────────┘
```
## Module Breakdown

### Directory Structure

```
prompt_prix/
├── main.py              # Entry point
├── ui.py                # Gradio UI definition
├── handlers.py          # Shared event handlers (fetch, stop)
├── state.py             # Global mutable state
├── core.py              # ServerPool, ComparisonSession, streaming
├── config.py            # Pydantic models, constants, env loading
├── parsers.py           # Input parsing utilities
├── export.py            # Report generation
├── dispatcher.py        # WorkStealingDispatcher for parallel execution
├── battery.py           # BatteryRunner, TestResult, BatteryRun
├── tabs/
│   ├── __init__.py
│   ├── battery/
│   │   ├── __init__.py
│   │   └── handlers.py  # Battery-specific handlers
│   └── compare/
│       ├── __init__.py
│       └── handlers.py  # Compare-specific handlers
├── adapters/
│   └── lmstudio.py      # LMStudioAdapter
└── benchmarks/
    ├── base.py          # TestCase protocol
    └── custom_json.py   # CustomJSONLoader
```
### config.py - Configuration & Data Models

**Purpose:** Define all Pydantic models for type-safe configuration and state.

| Class | Purpose |
|---|---|
| `ServerConfig` | Single LM Studio server state (URL, available_models, is_busy) |
| `ModelConfig` | Model identity and display name |
| `Message` | Single message in a conversation (role, content; supports multimodal) |
| `ModelContext` | Complete conversation history for one model |
| `SessionState` | Full session: models, contexts, system_prompt, halted status |
**Message Multimodal Support:**

The `Message` model supports both text and multimodal content:

```python
# Text-only message
Message(role="user", content="Hello")

# Multimodal message (text + image)
Message(role="user", content=[
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
])

# Helper methods
msg.get_text()    # Extract text content
msg.has_image()   # Check if message contains an image
```
**Key Functions:**

- `load_servers_from_env()` - Read `LM_STUDIO_SERVER_N` environment variables
- `get_default_servers()` - Return env servers or placeholder defaults
- `get_gradio_port()` - Read `GRADIO_PORT` or default to 7860
- `get_fara_config()` - Read `FARA_SERVER_URL` and `FARA_MODEL_ID` for the vision adapter
- `encode_image_to_data_url(path)` - Convert an image file to a base64 data URL
- `build_multimodal_content(text, image_path)` - Build OpenAI-format multimodal content
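For illustration, a minimal sketch of the `LM_STUDIO_SERVER_N` convention. The loop and URL normalization are assumptions; only the variable naming comes from the list above:

```python
# Hypothetical re-implementation of the env convention, not config.py itself:
# read LM_STUDIO_SERVER_1, LM_STUDIO_SERVER_2, ... until the first gap.
import os

def load_numbered_servers() -> list[str]:
    servers: list[str] = []
    n = 1
    while url := os.environ.get(f"LM_STUDIO_SERVER_{n}"):
        servers.append(url.rstrip("/"))  # trailing-slash normalization is an assumption
        n += 1
    return servers

# LM_STUDIO_SERVER_1=http://192.168.1.10:1234 → ["http://192.168.1.10:1234"]
```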
### core.py - Server Pool & Session Management

**Purpose:** Core business logic for server management and model interactions.

#### ServerPool

Manages multiple LM Studio servers:

```python
class ServerPool:
    servers: dict[str, ServerConfig]   # URL -> config
    _locks: dict[str, asyncio.Lock]    # URL -> lock

    async def refresh_all_manifests()                     # GET /v1/models on all servers
    def find_available_server(model_id) -> Optional[str]  # Find idle server with model
    async def acquire_server(url)                         # Mark busy, acquire lock
    def release_server(url)                               # Mark available, release lock
```
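A hedged usage sketch of this API (the constructor shape and model ID are assumptions):

```python
import asyncio
from prompt_prix.core import ServerPool

async def demo() -> None:
    pool = ServerPool(servers=["http://localhost:1234"])  # constructor shape assumed
    await pool.refresh_all_manifests()                    # fills available_models per server

    url = pool.find_available_server("qwen2.5-7b-instruct")
    if url is None:
        raise RuntimeError("no idle server hosts this model")

    await pool.acquire_server(url)   # mark busy + take the per-URL asyncio.Lock
    try:
        ...                          # issue a completion against `url`
    finally:
        pool.release_server(url)     # always return the server to the pool

asyncio.run(demo())
```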
#### ComparisonSession

Manages a comparison session:

```python
class ComparisonSession:
    server_pool: ServerPool
    state: SessionState    # Contains models, contexts, config

    async def send_prompt_to_model(model_id, prompt, on_chunk=None)
    async def send_prompt_to_all(prompt, on_chunk=None)
    def get_context_display(model_id) -> str
```
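And a sketch of driving a session; the `on_chunk` callback signature is an assumption inferred from the keyword argument above:

```python
async def compare(session: "ComparisonSession") -> None:
    def on_chunk(model_id: str, chunk: str) -> None:
        print(f"[{model_id}] {chunk}", end="", flush=True)  # live streaming output

    # Fan the same prompt out to every selected model.
    await session.send_prompt_to_all("Explain work stealing.", on_chunk=on_chunk)

    for model_id in session.state.models:
        print(session.get_context_display(model_id))  # rendered conversation per model
```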
#### Streaming Functions

```python
async def stream_completion(
    server_url, model_id, messages, temperature, max_tokens,
    timeout_seconds, tools=None, seed=None, repeat_penalty=None
) -> AsyncGenerator[str, None]:
    """Yields text chunks as they arrive via SSE.

    Args:
        seed: Optional int for reproducible outputs (passed to the model API)
        repeat_penalty: Optional float to penalize repeated tokens (1.0 = off)
    """

async def get_completion(...) -> str:
    """Non-streaming version; returns the full response."""
```
### handlers.py - Shared Event Handlers

**Purpose:** Shared async handlers used across multiple tabs.

| Handler | Purpose | Returns |
|---|---|---|
| `fetch_available_models(servers_text)` | Query all servers for available models | `(status, gr.update(choices=[...]))` |
| `handle_stop()` | Signal cancellation via global state | `status` |
| `_init_pool_and_validate(servers_text, models)` | Initialize ServerPool and validate models | `(pool, error_message)` |
### tabs/battery/handlers.py - Battery Tab Handlers

**Purpose:** Handlers specific to the Battery (benchmark) tab.

| Handler | Trigger | Returns |
|---|---|---|
| `validate_file(file_path)` | File upload | Validation status string |
| `get_test_ids(file_path)` | File upload | List of test IDs |
| `run_handler(file, models, servers, ...)` | "Run Battery" button | Generator yielding `(status, grid_df)` |
| `quick_prompt_handler(prompt, models, ...)` | "Run Prompt" button | Markdown results |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `export_csv()` | "Export CSV" button | `(status, preview)` |
| `get_cell_detail(model, test)` | Detail dropdown | Markdown detail |
| `refresh_grid(display_mode)` | Display mode change | Updated grid DataFrame |
### tabs/compare/handlers.py - Compare Tab Handlers

**Purpose:** Handlers specific to the Compare (interactive) tab.

| Handler | Trigger | Returns |
|---|---|---|
| `initialize_session(servers, models, system_prompt, ...)` | Auto-init on send | `(status, *model_tabs)` |
| `send_single_prompt(prompt, tools_json, image_path, seed, repeat_penalty)` | "Send to All" button | Generator yielding `(status, tab_states, *model_outputs)` |
| `export_markdown()` | "Export Markdown" button | `(status, preview)` |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `launch_beyond_compare(model_a, model_b)` | "Open in Beyond Compare" button | `status` |
**Compare Tab Features:**

- **Image Attachment:** Upload images for vision models (encoded as base64 data URLs)
- **Seed Parameter:** Set a seed for reproducible outputs across models
- **Repeat Penalty:** Configurable penalty (1.0-2.0) to reduce repetitive token generation
### dispatcher.py - Work-Stealing Dispatcher

**Purpose:** Parallel execution across multiple servers with work-stealing.

```python
class WorkStealingDispatcher:
    """Dispatches work items to servers using a work-stealing pattern."""

    async def dispatch(
        self,
        work_items: list[WorkItem],
        execute_fn: Callable[[WorkItem, str], Coroutine],
        on_progress: Optional[Callable[[str, str], None]] = None
    ) -> dict[str, Any]:
        """Execute work items in parallel across available servers."""
```
The dispatcher:

- Maintains a queue of work items (model + test case pairs)
- Finds idle servers that can run each work item
- Executes items in parallel across all available servers
- Supports cooperative cancellation via `state.should_stop()`
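A sketch of calling `dispatch()`; the `WorkItem` fields and the exact `execute_fn`/`on_progress` contracts are inferred from the signature above, not confirmed:

```python
from typing import Any

async def run_items(dispatcher: "WorkStealingDispatcher",
                    items: "list[WorkItem]") -> dict[str, Any]:

    async def execute(item: "WorkItem", server_url: str) -> str:
        # Run one (model, test case) pair on the server the dispatcher picked;
        # in prompt-prix this would call get_completion()/stream_completion().
        raise NotImplementedError  # elided for the sketch

    def on_progress(item_id: str, status: str) -> None:
        print(f"{item_id}: {status}")  # surface queued -> running -> done transitions

    return await dispatcher.dispatch(items, execute, on_progress=on_progress)
```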
### ui.py - Gradio UI Definition

**Purpose:** Define all Gradio components and wire up event bindings.

**Key Components:**

| Component | Type | Purpose |
|---|---|---|
| `servers_input` | Textbox | LM Studio server URLs (one per line) |
| `models_checkboxes` | CheckboxGroup | Select models to compare |
| `system_prompt_input` | Textbox (50 lines) | Editable system prompt |
| `temperature_slider` | Slider | Model temperature (0-2) |
| `timeout_slider` | Slider | Request timeout (30-600s) |
| `max_tokens_slider` | Slider | Max tokens (256-8192) |
| `seed_input` | Number | Optional seed for reproducible outputs |
| `repeat_penalty_slider` | Slider | Repeat penalty (1.0-2.0, default 1.1) |
| `prompt_input` | Textbox | User prompt entry |
| `image_input` | Image | Optional image attachment for vision models |
| `tools_input` | Code (JSON) | Tools for function calling |
| `model_outputs[0..9]` | Markdown | Model response tabs |
| `tab_states` | JSON (hidden) | Tab status for color updates |
**Event Bindings:**

- Buttons trigger async handlers
- `tab_states.change` triggers JavaScript for inline style updates
- `app.load` restores state from localStorage
### state.py - Global State

**Purpose:** Holds mutable state shared across handlers.

```python
server_pool: Optional[ServerPool] = None
session: Optional[ComparisonSession] = None
```

**Design Decision:** Separated into its own module to avoid circular imports between ui.py and handlers.py.
### parsers.py - Text Parsing Utilities

**Purpose:** Parse user input from UI components.

| Function | Input | Output |
|---|---|---|
| `parse_models_input(text)` | `"model1\nmodel2"` | `["model1", "model2"]` |
| `parse_servers_input(text)` | `"http://...\nhttp://..."` | `["http://...", "http://..."]` |
| `parse_prompts_file(content)` | File content | List of prompts |
| `load_system_prompt(file_path)` | Optional file path | System prompt string |
| `get_default_system_prompt()` | - | Default prompt from file or constant |
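The line-oriented parsers are simple enough to sketch; the real implementations may also strip comments or deduplicate, which is not confirmed here:

```python
def parse_lines(text: str) -> list[str]:
    """Shared shape of parse_models_input / parse_servers_input:
    one entry per non-empty line, whitespace trimmed."""
    return [line.strip() for line in text.splitlines() if line.strip()]

assert parse_lines("model1\nmodel2\n") == ["model1", "model2"]
```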
### export.py - Report Generation

**Purpose:** Generate exportable reports from session state.

```python
def generate_markdown_report(state: SessionState) -> str:
    """Create Markdown with header, system prompt, and all model conversations."""

def generate_json_report(state: SessionState) -> str:
    """Create structured JSON with configuration and conversations."""

def save_report(content: str, filepath: str):
    """Write report to file."""
```
### main.py - Entry Point

**Purpose:** Application entry point and backwards-compatibility exports.

```python
def run():
    app = create_app()
    app.launch(server_name="0.0.0.0", server_port=get_gradio_port())
```
## Data Flow: Sending a Prompt

```
1. User types prompt, clicks "Send Prompt"
   │
   ▼
2. ui.py: send_button.click(fn=send_single_prompt, inputs=[prompt, tools])
   │
   ▼
3. handlers.py: send_single_prompt(prompt, tools_json)
   │    - Validate session exists
   │    - Parse tools JSON
   │    - Add user message to all model contexts
   │    - Refresh server manifests
   │
   ▼
4. Work-Stealing Dispatcher Loop:
   │    ┌───────────────────────────────────────────┐
   │    │ For each idle server:                     │
   │    │   Find model in queue this server has     │
   │    │   If found: start async task              │
   │    └───────────────────────────────────────────┘
   │    │ await asyncio.sleep(0.1)
   │    │ yield (status, tab_states, *outputs) ─────►  UI updates
   │    │ Clean up completed tasks
   │    └────────── while queue or active_tasks
   │
   ▼
5. Each async task: run_model_on_server(model_id, server_url)
   │    - Mark model as "streaming"
   │    - Call stream_completion() ────────────────►  LM Studio API
   │    - Accumulate chunks in streaming_responses[model_id]
   │    - On complete: add assistant message to context
   │    - Release server
   │
   ▼
6. Final yield: ("✅ All responses complete", final_states, *final_outputs)
```
## State Management

### Session State (Python)

```python
class SessionState:
    models: list[str]                  # Selected models
    contexts: dict[str, ModelContext]  # model_id -> conversation
    system_prompt: str
    temperature: float
    timeout_seconds: int
    max_tokens: int
    halted: bool                       # True if any model failed
    halt_reason: Optional[str]
```
### UI State (Browser localStorage)

| Key | Type | Purpose |
|---|---|---|
| `promptprix_servers` | string | Server URLs (newline-separated) |
| `promptprix_model_choices` | JSON array | Available models from last fetch |
| `promptprix_models` | JSON array | Selected models |
| `promptprix_temperature` | float | Temperature setting |
| `promptprix_timeout` | int | Timeout setting |
| `promptprix_max_tokens` | int | Max tokens setting |
| `promptprix_tools` | string | Tools JSON |
| `promptprix_system_prompt` | string | System prompt text |
**Persistence:** State is saved only when the user clicks the "Save State" button (explicit save).
## Tab Status Visualization

Tab colors indicate model status during streaming:

| Status | Color | Border |
|---|---|---|
| `pending` | Red gradient (#fee2e2 → #fecaca) | 4px solid #ef4444 |
| `streaming` | Yellow gradient (#fef3c7 → #fde68a) | 4px solid #f59e0b |
| `completed` | Green gradient (#d1fae5 → #a7f3d0) | 4px solid #10b981 |

**Implementation:** Uses inline JavaScript styles (`element.style`) to override Gradio theme CSS.
## Error Handling

### Fail-Fast Validation

`initialize_session` validates that:

- Servers are configured
- Models are configured
- All selected models exist on at least one server

`send_single_prompt` validates that:

- The session is initialized
- The session is not halted
- The prompt is not empty
- The tools JSON is valid (if provided)
### Halt-on-Error

If any model fails during `send_prompt_to_all`:

- `state.halted = True`
- `state.halt_reason = "Model {model_id} failed: {error}"`
- Subsequent prompts are rejected
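A sketch of how the guard might look; where exactly it lives in core.py is an assumption, and the `SessionState` fields come from the State Management section above:

```python
async def _run_one(self, model_id: str, prompt: str) -> None:
    if self.state.halted:
        raise RuntimeError(f"Session halted: {self.state.halt_reason}")
    try:
        await self.send_prompt_to_model(model_id, prompt)
    except Exception as exc:
        self.state.halted = True                                   # poison the session
        self.state.halt_reason = f"Model {model_id} failed: {exc}"
        raise
```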
### Human-Readable Errors

The `LMStudioError` exception extracts error messages from LM Studio's JSON responses:

```
{"error": {"message": "Model not loaded"}} → "Model not loaded"
```
## Integration Points

### Upstream: Benchmark Sources
prompt-prix can consume test cases from established benchmark ecosystems:
| Source | Format | Usage |
|---|---|---|
| promptfoo | YAML with assertions | Full eval format with pass/fail criteria |
| Inspect AI | Python test definitions | Export prompts, import as JSON |
| Custom JSON | OpenAI-compatible messages | Direct load in prompt-prix |
See ADR-001 for rationale.
### API Layer: OpenAI-Compatible

All inference servers must expose OpenAI-compatible endpoints:

```
GET  /v1/models            → List available models
POST /v1/chat/completions  → Chat completion (streaming)
```
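Any HTTP client can probe these endpoints; here is a small async example (httpx is an illustrative choice, not necessarily what prompt-prix uses internally):

```python
import asyncio
import httpx

async def list_models(base_url: str) -> list[str]:
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.get(f"{base_url}/v1/models")
        resp.raise_for_status()
        return [m["id"] for m in resp.json()["data"]]  # OpenAI list schema

print(asyncio.run(list_models("http://localhost:1234")))
```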
Supported servers:
- LM Studio (native)
- Ollama (OpenAI mode)
- vLLM
- llama.cpp server
- Any OpenAI-compatible proxy
See ADR-003 for rationale.
### Fan-Out Dispatcher Pattern

The core abstraction is fan-out: one prompt dispatched to N models in parallel.

```
┌─────────────────────────────────────────────────────────────┐
│                     Fan-Out Dispatcher                      │
│                                                             │
│  Input: (prompt, [model_a, model_b, model_c])               │
│                 │                                           │
│  ┌──────────────┼──────────────┐                            │
│  ▼              ▼              ▼                            │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                      │
│  │ Model A │  │ Model B │  │ Model C │                      │
│  │ Server1 │  │ Server1 │  │ Server2 │                      │
│  └────┬────┘  └────┬────┘  └────┬────┘                      │
│       │            │            │                           │
│       ▼            ▼            ▼                           │
│   Response A   Response B   Response C                      │
│                                                             │
│  Output: {model_a: resp_a, model_b: resp_b, model_c: resp_c}│
└─────────────────────────────────────────────────────────────┘
```
### Work-Stealing Implementation
The dispatcher uses work-stealing for GPU efficiency:
- Queue: All models to process
- Acquire: Find idle server that has queued model
- Execute: Stream response, update UI
- Release: Server becomes available for next model
This maximizes utilization when models are distributed across multiple GPUs.
See ADR-002 for rationale.