# Architecture

This document describes the system architecture of prompt-prix, including module responsibilities, data flow, and key design decisions.

## System Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                               Browser                               │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                       Gradio UI (ui.py)                       │  │
│  │ ┌─────────────┐  ┌──────────────┐  ┌───────────────────────┐  │  │
│  │ │ Config Panel│  │ Prompt Input │  │ Model Output Tabs     │  │  │
│  │ │ • Servers   │  │ • Single     │  │ • Tab 1..10           │  │  │
│  │ │ • Models    │  │ • Batch      │  │ • Streaming display   │  │  │
│  │ │ • System    │  │ • Tools JSON │  │ • Status colors       │  │  │
│  │ │   Prompt    │  └──────────────┘  └───────────────────────┘  │  │
│  │ └─────────────┘                                               │  │
│  │ ┌─────────────────────────────────────────────────────────┐   │  │
│  │ │ localStorage: servers, models, temperature, etc.        │   │  │
│  │ └─────────────────────────────────────────────────────────┘   │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           Python Backend                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                          handlers.py                          │  │
│  │ • fetch_available_models() → ServerPool.refresh_manifests()   │  │
│  │ • initialize_session() → Create ComparisonSession             │  │
│  │ • send_single_prompt() → Work-stealing dispatcher             │  │
│  │ • export_markdown/json() → Report generation                  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│  ┌─────────────────────────────┼─────────────────────────────────┐  │
│  │           core.py           │                                 │  │
│  │  ┌─────────────────────┐    │    ┌─────────────────────────┐  │  │
│  │  │ ServerPool          │◄───┴───►│ ComparisonSession       │  │  │
│  │  │ • servers: dict     │         │ • state: SessionState   │  │  │
│  │  │ • refresh_manifest  │         │ • send_prompt_to_model  │  │  │
│  │  │ • acquire/release   │         │ • get_context_display   │  │  │
│  │  └─────────────────────┘         └─────────────────────────┘  │  │
│  │                              │                                │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │       stream_completion() / get_completion()            │  │  │
│  │  │ • Async HTTP streaming to LM Studio                     │  │  │
│  │  │ • Yields text chunks or returns full response           │  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                │                                    │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                           config.py                           │  │
│  │ Pydantic Models: ServerConfig, ModelContext, SessionState     │  │
│  │ Constants: DEFAULT_TEMPERATURE, DEFAULT_MAX_TOKENS, etc.      │  │
│  │ Environment: load_servers_from_env(), get_gradio_port()       │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          LM Studio Servers                          │
│     ┌────────────────────────┐       ┌────────────────────────┐     │
│     │ Server 1 (e.g. 3090)   │       │ Server 2 (e.g. 8000)   │     │
│     │ • GET /v1/models       │       │ • GET /v1/models       │     │
│     │ • POST /v1/chat/...    │       │ • POST /v1/chat/...    │     │
│     │   └─ Model A           │       │   └─ Model B, C        │     │
│     │   └─ Model B           │       │                        │     │
│     └────────────────────────┘       └────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────┘
```
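The bottom layer of this diagram is just two OpenAI-compatible endpoints per server. As a minimal illustration of that traffic — using httpx and a placeholder localhost URL; the actual client code lives in core.py and may differ:

```python
import asyncio

import httpx

SERVER = "http://localhost:1234"  # placeholder; any OpenAI-compatible server works


async def main() -> None:
    async with httpx.AsyncClient() as client:
        # Manifest refresh: which models does this server expose?
        manifest = (await client.get(f"{SERVER}/v1/models")).json()
        model_id = manifest["data"][0]["id"]  # OpenAI-format model list

        # Streaming chat completion, consumed as server-sent events.
        async with client.stream(
            "POST",
            f"{SERVER}/v1/chat/completions",
            json={
                "model": model_id,
                "messages": [{"role": "user", "content": "Hello"}],
                "stream": True,
            },
        ) as response:
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    print(line[len("data: "):])  # JSON chunk or [DONE]


asyncio.run(main())
```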
## Module Breakdown

### Directory Structure

```
prompt_prix/
├── main.py          # Entry point
├── ui.py            # Gradio UI definition
├── handlers.py      # Shared event handlers (fetch, stop)
├── state.py         # Global mutable state
├── core.py          # ServerPool, ComparisonSession, streaming
├── config.py        # Pydantic models, constants, env loading
├── parsers.py       # Input parsing utilities
├── export.py        # Report generation
├── dispatcher.py    # WorkStealingDispatcher for parallel execution
├── battery.py       # BatteryRunner, TestResult, BatteryRun
├── tabs/
│   ├── __init__.py
│   ├── battery/
│   │   ├── __init__.py
│   │   └── handlers.py   # Battery-specific handlers
│   └── compare/
│       ├── __init__.py
│       └── handlers.py   # Compare-specific handlers
├── adapters/
│   └── lmstudio.py       # LMStudioAdapter
└── benchmarks/
    ├── base.py           # TestCase protocol
    └── custom_json.py    # CustomJSONLoader
```

### config.py - Configuration & Data Models

**Purpose**: Define all Pydantic models for type-safe configuration and state.

| Class | Purpose |
|-------|---------|
| `ServerConfig` | Single LM Studio server state (URL, available_models, is_busy) |
| `ModelConfig` | Model identity and display name |
| `Message` | Single message in a conversation (role, content - supports multimodal) |
| `ModelContext` | Complete conversation history for one model |
| `SessionState` | Full session: models, contexts, system_prompt, halted status |

**Message Multimodal Support**: The `Message` model supports both text and multimodal content:

```python
# Text-only message
Message(role="user", content="Hello")

# Multimodal message (text + image)
Message(role="user", content=[
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
])

# Helper methods
msg.get_text()   # Extract text content
msg.has_image()  # Check if message contains an image
```

**Key Functions**:

- `load_servers_from_env()` - Read `LM_STUDIO_SERVER_N` environment variables
- `get_default_servers()` - Return env servers or placeholder defaults
- `get_gradio_port()` - Read `GRADIO_PORT` or default to 7860
- `get_fara_config()` - Read `FARA_SERVER_URL` and `FARA_MODEL_ID` for the vision adapter
- `encode_image_to_data_url(path)` - Convert an image file to a base64 data URL
- `build_multimodal_content(text, image_path)` - Build OpenAI-format multimodal content (sketched below)
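The two image helpers compose: the builder embeds the encoder's data URL in OpenAI's content-list format. A minimal sketch of how they might be implemented — the signatures come from the list above, but the bodies are illustrative, including the assumption that the MIME type is guessed from the file extension:

```python
import base64
import mimetypes
from typing import Optional


def encode_image_to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "image/png"  # assumption: fall back to PNG if undetectable
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_multimodal_content(text: str, image_path: Optional[str] = None):
    """Build OpenAI-format content: a plain string, or a text + image_url list."""
    if image_path is None:
        return text
    return [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": encode_image_to_data_url(image_path)}},
    ]
```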
### core.py - Server Pool & Session Management

**Purpose**: Core business logic for server management and model interactions.

#### ServerPool

Manages multiple LM Studio servers:

```python
class ServerPool:
    servers: dict[str, ServerConfig]  # URL -> config
    _locks: dict[str, asyncio.Lock]   # URL -> lock

    async def refresh_all_manifests()                     # GET /v1/models on all servers
    def find_available_server(model_id) -> Optional[str]  # Find idle server with model
    async def acquire_server(url)                         # Mark busy, acquire lock
    def release_server(url)                               # Mark available, release lock
```

#### ComparisonSession

Manages a comparison session:

```python
class ComparisonSession:
    server_pool: ServerPool
    state: SessionState  # Contains models, contexts, config

    async def send_prompt_to_model(model_id, prompt, on_chunk=None)
    async def send_prompt_to_all(prompt, on_chunk=None)
    def get_context_display(model_id) -> str
```

#### Streaming Functions

```python
async def stream_completion(
    server_url, model_id, messages,
    temperature, max_tokens, timeout_seconds,
    tools=None, seed=None, repeat_penalty=None
) -> AsyncGenerator[str, None]:
    """Yields text chunks as they arrive via SSE.

    Args:
        seed: Optional int for reproducible outputs (passed to model API)
        repeat_penalty: Optional float to penalize repeated tokens (1.0 = off)
    """

async def get_completion(...) -> str:
    """Non-streaming version, returns full response."""
```
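Callers consume the stream with `async for`. A minimal usage sketch — the server URL and model ID are placeholders, and it assumes `messages` accepts OpenAI-style dicts (the code may pass `Message` objects instead):

```python
import asyncio

from prompt_prix.core import stream_completion


async def main() -> None:
    chunks: list[str] = []
    async for chunk in stream_completion(
        server_url="http://localhost:1234",  # placeholder server
        model_id="my-model",                 # placeholder model ID
        messages=[{"role": "user", "content": "Say hello."}],
        temperature=0.7,
        max_tokens=256,
        timeout_seconds=120,
    ):
        print(chunk, end="", flush=True)     # chunks arrive incrementally via SSE
        chunks.append(chunk)
    full_response = "".join(chunks)          # same text get_completion() would return


asyncio.run(main())
```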
### handlers.py - Shared Event Handlers

**Purpose**: Shared async handlers used across multiple tabs.

| Handler | Purpose | Returns |
|---------|---------|---------|
| `fetch_available_models(servers_text)` | Query all servers for available models | `(status, gr.update(choices=[...]))` |
| `handle_stop()` | Signal cancellation via global state | `status` |
| `_init_pool_and_validate(servers_text, models)` | Initialize ServerPool and validate models | `(pool, error_message)` |

### tabs/battery/handlers.py - Battery Tab Handlers

**Purpose**: Handlers specific to the Battery (benchmark) tab.

| Handler | Trigger | Returns |
|---------|---------|---------|
| `validate_file(file_path)` | File upload | Validation status string |
| `get_test_ids(file_path)` | File upload | List of test IDs |
| `run_handler(file, models, servers, ...)` | "Run Battery" button | Generator yielding `(status, grid_df)` |
| `quick_prompt_handler(prompt, models, ...)` | "Run Prompt" button | Markdown results |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `export_csv()` | "Export CSV" button | `(status, preview)` |
| `get_cell_detail(model, test)` | Detail dropdown | Markdown detail |
| `refresh_grid(display_mode)` | Display mode change | Updated grid DataFrame |

### tabs/compare/handlers.py - Compare Tab Handlers

**Purpose**: Handlers specific to the Compare (interactive) tab.

| Handler | Trigger | Returns |
|---------|---------|---------|
| `initialize_session(servers, models, system_prompt, ...)` | Auto-init on send | `(status, *model_tabs)` |
| `send_single_prompt(prompt, tools_json, image_path, seed, repeat_penalty)` | "Send to All" button | Generator yielding `(status, tab_states, *model_outputs)` |
| `export_markdown()` | "Export Markdown" button | `(status, preview)` |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `launch_beyond_compare(model_a, model_b)` | "Open in Beyond Compare" button | `status` |

**Compare Tab Features**:

- **Image Attachment**: Upload images for vision models (encoded as base64 data URLs)
- **Seed Parameter**: Set a seed for reproducible outputs across models
- **Repeat Penalty**: Configurable penalty (1.0-2.0) to reduce repetitive token generation

### dispatcher.py - Work-Stealing Dispatcher

**Purpose**: Parallel execution across multiple servers with work-stealing.

```python
class WorkStealingDispatcher:
    """Dispatches work items to servers using a work-stealing pattern."""

    async def dispatch(
        self,
        work_items: list[WorkItem],
        execute_fn: Callable[[WorkItem, str], Coroutine],
        on_progress: Optional[Callable[[str, str], None]] = None
    ) -> dict[str, Any]:
        """Execute work items in parallel across available servers."""
```

The dispatcher (see the sketch after this list):

1. Maintains a queue of work items (model + test case pairs)
2. Finds idle servers that can run each work item
3. Executes items in parallel across all available servers
4. Supports cooperative cancellation via `state.should_stop()`
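A condensed sketch of the loop these steps describe — not the actual `WorkStealingDispatcher` internals. `find_available_server`, `acquire_server`, and `release_server` come from `ServerPool` as documented above; the `WorkItem` fields (`id`, `model_id`) and the 0.1 s poll interval are assumptions for illustration:

```python
import asyncio
from typing import Any


async def work_stealing_loop(queue: list, pool, execute_fn, results: dict[str, Any]) -> None:
    """Illustrative work-stealing loop: each idle server 'steals' the first
    queued item it can run, until the queue and all in-flight tasks drain."""
    active: set[asyncio.Task] = set()
    while queue or active:
        for item in list(queue):
            url = pool.find_available_server(item.model_id)  # idle server with this model
            if url is None:
                continue                     # no capacity yet; try the next item
            queue.remove(item)
            await pool.acquire_server(url)   # mark busy before spawning the task

            async def run(item=item, url=url):
                try:
                    results[item.id] = await execute_fn(item, url)
                finally:
                    pool.release_server(url)  # server becomes available again

            active.add(asyncio.create_task(run()))
        await asyncio.sleep(0.1)              # assumed poll interval
        active = {t for t in active if not t.done()}
```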
| `["http://...", "http://..."]` | | `parse_prompts_file(content)` | File content | List of prompts | | `load_system_prompt(file_path)` | Optional file path | System prompt string | | `get_default_system_prompt()` | - | Default prompt from file or constant | ### export.py - Report Generation **Purpose**: Generate exportable reports from session state. ```python def generate_markdown_report(state: SessionState) -> str: """Create Markdown with header, system prompt, and all model conversations.""" def generate_json_report(state: SessionState) -> str: """Create structured JSON with configuration and conversations.""" def save_report(content: str, filepath: str): """Write report to file.""" ``` ### main.py - Entry Point **Purpose**: Application entry point and backwards-compatibility exports. ```python def run(): app = create_app() app.launch(server_name="0.0.0.0", server_port=get_gradio_port()) ``` ## Data Flow: Sending a Prompt ``` 1. User types prompt, clicks "Send Prompt" │ ▼ 2. ui.py: send_button.click(fn=send_single_prompt, inputs=[prompt, tools]) │ ▼ 3. handlers.py: send_single_prompt(prompt, tools_json) │ - Validate session exists │ - Parse tools JSON │ - Add user message to all model contexts │ - Refresh server manifests │ ▼ 4. Work-Stealing Dispatcher Loop: │ ┌─────────────────────────────────────────┐ │ │ For each idle server: │ │ │ Find model in queue this server has │ │ │ If found: start async task │ │ └─────────────────────────────────────────┘ │ │ await asyncio.sleep(0.1) │ │ yield (status, tab_states, *outputs) ──────► UI updates │ │ Clean up completed tasks │ └─────────── while queue or active_tasks │ ▼ 5. Each async task: run_model_on_server(model_id, server_url) │ - Mark model as "streaming" │ - Call stream_completion() ───────────────────► LM Studio API │ - Accumulate chunks in streaming_responses[model_id] │ - On complete: add assistant message to context │ - Release server │ ▼ 6. Final yield: ("✅ All responses complete", final_states, *final_outputs) ``` ## State Management ### Session State (Python) ```python SessionState: models: list[str] # Selected models contexts: dict[str, ModelContext] # model_id -> conversation system_prompt: str temperature: float timeout_seconds: int max_tokens: int halted: bool # True if any model failed halt_reason: Optional[str] ``` ### UI State (Browser localStorage) | Key | Type | Purpose | |-----|------|---------| | `promptprix_servers` | string | Server URLs (newline-separated) | | `promptprix_model_choices` | JSON array | Available models from last fetch | | `promptprix_models` | JSON array | Selected models | | `promptprix_temperature` | float | Temperature setting | | `promptprix_timeout` | int | Timeout setting | | `promptprix_max_tokens` | int | Max tokens setting | | `promptprix_tools` | string | Tools JSON | | `promptprix_system_prompt` | string | System prompt text | **Persistence**: Only saved when user clicks "Save State" button (explicit save). ## Tab Status Visualization Tab colors indicate model status during streaming: | Status | Color | Border | |--------|-------|--------| | `pending` | Red gradient (#fee2e2 → #fecaca) | 4px solid #ef4444 | | `streaming` | Yellow gradient (#fef3c7 → #fde68a) | 4px solid #f59e0b | | `completed` | Green gradient (#d1fae5 → #a7f3d0) | 4px solid #10b981 | **Implementation**: Uses inline JavaScript styles (`element.style`) to overcome Gradio theme CSS. ## Error Handling ### Fail-Fast Validation 1. 
## State Management

### Session State (Python)

```python
SessionState:
    models: list[str]                  # Selected models
    contexts: dict[str, ModelContext]  # model_id -> conversation
    system_prompt: str
    temperature: float
    timeout_seconds: int
    max_tokens: int
    halted: bool                       # True if any model failed
    halt_reason: Optional[str]
```

### UI State (Browser localStorage)

| Key | Type | Purpose |
|-----|------|---------|
| `promptprix_servers` | string | Server URLs (newline-separated) |
| `promptprix_model_choices` | JSON array | Available models from last fetch |
| `promptprix_models` | JSON array | Selected models |
| `promptprix_temperature` | float | Temperature setting |
| `promptprix_timeout` | int | Timeout setting |
| `promptprix_max_tokens` | int | Max tokens setting |
| `promptprix_tools` | string | Tools JSON |
| `promptprix_system_prompt` | string | System prompt text |

**Persistence**: Only saved when the user clicks the "Save State" button (explicit save).

## Tab Status Visualization

Tab colors indicate model status during streaming:

| Status | Color | Border |
|--------|-------|--------|
| `pending` | Red gradient (#fee2e2 → #fecaca) | 4px solid #ef4444 |
| `streaming` | Yellow gradient (#fef3c7 → #fde68a) | 4px solid #f59e0b |
| `completed` | Green gradient (#d1fae5 → #a7f3d0) | 4px solid #10b981 |

**Implementation**: Uses inline JavaScript styles (`element.style`) to override Gradio theme CSS.

## Error Handling

### Fail-Fast Validation

1. `initialize_session` validates:
   - Servers are configured
   - Models are configured
   - All selected models exist on at least one server
2. `send_single_prompt` validates:
   - Session is initialized
   - Session is not halted
   - Prompt is not empty
   - Tools JSON is valid (if provided)

### Halt-on-Error

If any model fails during `send_prompt_to_all`:

- `state.halted = True`
- `state.halt_reason = "Model {model_id} failed: {error}"`
- Subsequent prompts are rejected

### Human-Readable Errors

The `LMStudioError` exception extracts error messages from LM Studio's JSON responses:

```python
{"error": {"message": "Model not loaded"}} → "Model not loaded"
```

## Integration Points

### Upstream: Benchmark Sources

prompt-prix can consume test cases from established benchmark ecosystems:

| Source | Format | Usage |
|--------|--------|-------|
| **promptfoo** | YAML with assertions | Full eval format with pass/fail criteria |
| **Inspect AI** | Python test definitions | Export prompts, import as JSON |
| **Custom JSON** | OpenAI-compatible messages | Direct load in prompt-prix |

See [ADR-001](adr/completed/001-use-existing-benchmarks.md) for rationale.

### API Layer: OpenAI-Compatible

All inference servers must expose OpenAI-compatible endpoints:

```
GET  /v1/models             → List available models
POST /v1/chat/completions   → Chat completion (streaming)
```

Supported servers:

- LM Studio (native)
- Ollama (OpenAI mode)
- vLLM
- llama.cpp server
- Any OpenAI-compatible proxy

See [ADR-003](adr/completed/003-openai-compatible-api.md) for rationale.

## Fan-Out Dispatcher Pattern

The core abstraction is **fan-out**: one prompt dispatched to N models in parallel.

```
┌─────────────────────────────────────────────────────────────┐
│                     Fan-Out Dispatcher                      │
│                                                             │
│  Input: (prompt, [model_a, model_b, model_c])               │
│                      │                                      │
│       ┌──────────────┼──────────────┐                       │
│       ▼              ▼              ▼                       │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                  │
│  │ Model A │    │ Model B │    │ Model C │                  │
│  │ Server1 │    │ Server1 │    │ Server2 │                  │
│  └────┬────┘    └────┬────┘    └────┬────┘                  │
│       │              │              │                       │
│       ▼              ▼              ▼                       │
│   Response A     Response B     Response C                  │
│                                                             │
│  Output: {model_a: resp_a, model_b: resp_b, model_c: resp_c}│
└─────────────────────────────────────────────────────────────┘
```

### Work-Stealing Implementation

The dispatcher uses work-stealing for GPU efficiency:

1. **Queue**: All models to process
2. **Acquire**: Find an idle server that has a queued model
3. **Execute**: Stream the response, update the UI
4. **Release**: The server becomes available for the next model

This maximizes utilization when models are distributed across multiple GPUs. See [ADR-002](adr/completed/002-fan-out-pattern-as-core.md) for rationale.

## Architecture Decision Records

| ADR | Decision |
|-----|----------|
| [001](adr/completed/001-use-existing-benchmarks.md) | Use existing benchmarks (promptfoo, Inspect AI) instead of a custom eval schema |
| [002](adr/completed/002-fan-out-pattern-as-core.md) | Fan-out pattern as the core architectural abstraction |
| [003](adr/completed/003-openai-compatible-api.md) | OpenAI-compatible API as the sole integration layer |