# Architecture

This document describes the system architecture of prompt-prix, including module responsibilities, data flow, and key design decisions.

## System Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│                                 Browser                                 │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                         Gradio UI (ui.py)                         │  │
│  │ ┌──────────────┐  ┌──────────────┐  ┌─────────────────────┐       │  │
│  │ │ Config Panel │  │ Prompt Input │  │ Model Output Tabs   │       │  │
│  │ │ • Servers    │  │ • Single     │  │ • Tab 1..10         │       │  │
│  │ │ • Models     │  │ • Batch      │  │ • Streaming display │       │  │
│  │ │ • System     │  │ • Tools JSON │  │ • Status colors     │       │  │
│  │ │   Prompt     │  └──────────────┘  └─────────────────────┘       │  │
│  │ └──────────────┘                                                  │  │
│  │ ┌───────────────────────────────────────────────────┐             │  │
│  │ │ localStorage: servers, models, temperature, etc.  │             │  │
│  │ └───────────────────────────────────────────────────┘             │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                             Python Backend                              │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ handlers.py                                                       │  │
│  │ • fetch_available_models() → ServerPool.refresh_manifests()       │  │
│  │ • initialize_session()     → Create ComparisonSession             │  │
│  │ • send_single_prompt()     → Work-stealing dispatcher             │  │
│  │ • export_markdown/json()   → Report generation                    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ core.py                                                           │  │
│  │  ┌────────────────────┐          ┌──────────────────────────┐     │  │
│  │  │ ServerPool         │◄────────►│ ComparisonSession        │     │  │
│  │  │ • servers: dict    │          │ • state: SessionState    │     │  │
│  │  │ • refresh_manifest │          │ • send_prompt_to_model   │     │  │
│  │  │ • acquire/release  │          │ • get_context_display    │     │  │
│  │  └────────────────────┘          └──────────────────────────┘     │  │
│  │  ┌─────────────────────────────────────────────────┐              │  │
│  │  │ stream_completion() / get_completion()          │              │  │
│  │  │ • Async HTTP streaming to LM Studio             │              │  │
│  │  │ • Yields text chunks or returns full response   │              │  │
│  │  └─────────────────────────────────────────────────┘              │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                    │                                    │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ config.py                                                         │  │
│  │ Pydantic models: ServerConfig, ModelContext, SessionState         │  │
│  │ Constants: DEFAULT_TEMPERATURE, DEFAULT_MAX_TOKENS, etc.          │  │
│  │ Environment: load_servers_from_env(), get_gradio_port()           │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            LM Studio Servers                            │
│       ┌────────────────────────┐       ┌────────────────────────┐       │
│       │ Server 1 (e.g. 3090)   │       │ Server 2 (e.g. 8000)   │       │
│       │ • GET /v1/models       │       │ • GET /v1/models       │       │
│       │ • POST /v1/chat/...    │       │ • POST /v1/chat/...    │       │
│       │   └─ Model A           │       │   └─ Model B, C        │       │
│       │   └─ Model B           │       │                        │       │
│       └────────────────────────┘       └────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘
```
## Module Breakdown

### Directory Structure
```
prompt_prix/
├── main.py          # Entry point
├── ui.py            # Gradio UI definition
├── handlers.py      # Shared event handlers (fetch, stop)
├── state.py         # Global mutable state
├── core.py          # ServerPool, ComparisonSession, streaming
├── config.py        # Pydantic models, constants, env loading
├── parsers.py       # Input parsing utilities
├── export.py        # Report generation
├── dispatcher.py    # WorkStealingDispatcher for parallel execution
├── battery.py       # BatteryRunner, TestResult, BatteryRun
├── tabs/
│   ├── __init__.py
│   ├── battery/
│   │   ├── __init__.py
│   │   └── handlers.py   # Battery-specific handlers
│   └── compare/
│       ├── __init__.py
│       └── handlers.py   # Compare-specific handlers
├── adapters/
│   └── lmstudio.py       # LMStudioAdapter
└── benchmarks/
    ├── base.py           # TestCase protocol
    └── custom_json.py    # CustomJSONLoader
```
### config.py - Configuration & Data Models

**Purpose**: Define all Pydantic models for type-safe configuration and state.

| Class | Purpose |
|-------|---------|
| `ServerConfig` | Single LM Studio server state (URL, available_models, is_busy) |
| `ModelConfig` | Model identity and display name |
| `Message` | Single message in a conversation (role, content; supports multimodal) |
| `ModelContext` | Complete conversation history for one model |
| `SessionState` | Full session: models, contexts, system_prompt, halted status |

**Message Multimodal Support**:

The `Message` model supports both text and multimodal content:
```python
# Text-only message
Message(role="user", content="Hello")

# Multimodal message (text + image)
Message(role="user", content=[
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
])

# Helper methods
msg.get_text()    # Extract text content
msg.has_image()   # Check if message contains an image
```
**Key Functions**:

- `load_servers_from_env()` - Read `LM_STUDIO_SERVER_N` environment variables
- `get_default_servers()` - Return env servers or placeholder defaults
- `get_gradio_port()` - Read `GRADIO_PORT` or default to 7860
- `get_fara_config()` - Read `FARA_SERVER_URL` and `FARA_MODEL_ID` for the vision adapter
- `encode_image_to_data_url(path)` - Convert an image file to a base64 data URL
- `build_multimodal_content(text, image_path)` - Build OpenAI-format multimodal content
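The environment and image helpers are small enough to sketch. Below is a minimal illustration assuming the `LM_STUDIO_SERVER_1`, `LM_STUDIO_SERVER_2`, ... naming described above and plain-URL return values; the real implementations may differ:

```python
import base64
import mimetypes
import os


def load_servers_from_env() -> list[str]:
    """Collect LM_STUDIO_SERVER_1, LM_STUDIO_SERVER_2, ... until the first gap."""
    servers: list[str] = []
    n = 1
    while url := os.environ.get(f"LM_STUDIO_SERVER_{n}"):
        servers.append(url.rstrip("/"))
        n += 1
    return servers


def encode_image_to_data_url(path: str) -> str:
    """Read an image file and return it as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{b64}"
```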
### core.py - Server Pool & Session Management

**Purpose**: Core business logic for server management and model interactions.

#### ServerPool

Manages multiple LM Studio servers:

```python
class ServerPool:
    servers: dict[str, ServerConfig]  # URL -> config
    _locks: dict[str, asyncio.Lock]   # URL -> lock

    async def refresh_all_manifests(self): ...                       # GET /v1/models on all servers
    def find_available_server(self, model_id) -> Optional[str]: ...  # Idle server hosting the model
    async def acquire_server(self, url): ...                         # Mark busy, acquire lock
    def release_server(self, url): ...                               # Mark available, release lock
```
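The acquire/release pairing is the important contract: a server is marked busy for exactly the duration of one completion. A hypothetical call site, built only from the interfaces shown above (`run_one` and its argument values are illustrative):

```python
async def run_one(pool: ServerPool, model_id: str, messages: list[dict]) -> str:
    """Run one completion on whichever idle server hosts the model (sketch)."""
    url = pool.find_available_server(model_id)
    if url is None:
        raise RuntimeError(f"No idle server hosts {model_id}")
    await pool.acquire_server(url)
    try:
        return await get_completion(
            url, model_id, messages,
            temperature=0.7, max_tokens=512, timeout_seconds=120,
        )
    finally:
        pool.release_server(url)  # always release, even if the request fails
```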
#### ComparisonSession

Manages a comparison session:

```python
class ComparisonSession:
    server_pool: ServerPool
    state: SessionState  # Contains models, contexts, config

    async def send_prompt_to_model(self, model_id, prompt, on_chunk=None): ...
    async def send_prompt_to_all(self, prompt, on_chunk=None): ...
    def get_context_display(self, model_id) -> str: ...
```
#### Streaming Functions

```python
async def stream_completion(
    server_url, model_id, messages, temperature, max_tokens,
    timeout_seconds, tools=None, seed=None, repeat_penalty=None
) -> AsyncGenerator[str, None]:
    """Yield text chunks as they arrive via SSE.

    Args:
        seed: Optional int for reproducible outputs (passed to the model API).
        repeat_penalty: Optional float penalizing repeated tokens (1.0 = off).
    """

async def get_completion(...) -> str:
    """Non-streaming version; returns the full response."""
```
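Consuming the stream is the standard `async for` pattern. A usage sketch against the signature above (server URL and model ID are placeholders):

```python
import asyncio


async def main() -> None:
    messages = [{"role": "user", "content": "Say hello."}]
    # Chunks print as they arrive over SSE; placeholder server and model.
    async for chunk in stream_completion(
        "http://localhost:1234", "some-model", messages,
        temperature=0.7, max_tokens=256, timeout_seconds=120,
    ):
        print(chunk, end="", flush=True)


asyncio.run(main())
```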
### handlers.py - Shared Event Handlers

**Purpose**: Shared async handlers used across multiple tabs.

| Handler | Purpose | Returns |
|---------|---------|---------|
| `fetch_available_models(servers_text)` | Query all servers for available models | `(status, gr.update(choices=[...]))` |
| `handle_stop()` | Signal cancellation via global state | `status` |
| `_init_pool_and_validate(servers_text, models)` | Initialize ServerPool and validate models | `(pool, error_message)` |
### tabs/battery/handlers.py - Battery Tab Handlers

**Purpose**: Handlers specific to the Battery (benchmark) tab.

| Handler | Trigger | Returns |
|---------|---------|---------|
| `validate_file(file_path)` | File upload | Validation status string |
| `get_test_ids(file_path)` | File upload | List of test IDs |
| `run_handler(file, models, servers, ...)` | "Run Battery" button | Generator yielding `(status, grid_df)` |
| `quick_prompt_handler(prompt, models, ...)` | "Run Prompt" button | Markdown results |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `export_csv()` | "Export CSV" button | `(status, preview)` |
| `get_cell_detail(model, test)` | Detail dropdown | Markdown detail |
| `refresh_grid(display_mode)` | Display mode change | Updated grid DataFrame |
### tabs/compare/handlers.py - Compare Tab Handlers

**Purpose**: Handlers specific to the Compare (interactive) tab.

| Handler | Trigger | Returns |
|---------|---------|---------|
| `initialize_session(servers, models, system_prompt, ...)` | Auto-init on send | `(status, *model_tabs)` |
| `send_single_prompt(prompt, tools_json, image_path, seed, repeat_penalty)` | "Send to All" button | Generator yielding `(status, tab_states, *model_outputs)` |
| `export_markdown()` | "Export Markdown" button | `(status, preview)` |
| `export_json()` | "Export JSON" button | `(status, preview)` |
| `launch_beyond_compare(model_a, model_b)` | "Open in Beyond Compare" button | `status` |

**Compare Tab Features**:

- **Image Attachment**: Upload images for vision models (encoded as base64 data URLs)
- **Seed Parameter**: Set a seed for reproducible outputs across models
- **Repeat Penalty**: Configurable penalty (1.0-2.0) to reduce repetitive token generation
### dispatcher.py - Work-Stealing Dispatcher

**Purpose**: Parallel execution across multiple servers with work-stealing.

```python
class WorkStealingDispatcher:
    """Dispatches work items to servers using a work-stealing pattern."""

    async def dispatch(
        self,
        work_items: list[WorkItem],
        execute_fn: Callable[[WorkItem, str], Coroutine],
        on_progress: Optional[Callable[[str, str], None]] = None,
    ) -> dict[str, Any]:
        """Execute work items in parallel across available servers."""
```
The dispatcher:

1. Maintains a queue of work items (model + test case pairs)
2. Finds idle servers that can run each work item
3. Executes items in parallel across all available servers
4. Supports cooperative cancellation via `state.should_stop()` (a condensed sketch of the loop follows this list)
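Condensed, the loop looks roughly like this. It leans on the `ServerPool` interface shown earlier; `item.model_id`, `item.id`, and the omitted cancellation check are assumptions for illustration, not the exact code:

```python
import asyncio
from typing import Any


async def dispatch_sketch(pool, work_items, execute_fn) -> dict[str, Any]:
    """Illustrative condensation of the dispatch loop described above."""
    queue = list(work_items)                 # 1. pending (model, test case) items
    active: dict[asyncio.Task, tuple] = {}
    results: dict[str, Any] = {}

    while queue or active:
        for item in list(queue):             # 2. match queued items to idle servers
            url = pool.find_available_server(item.model_id)
            if url is not None:
                queue.remove(item)
                await pool.acquire_server(url)
                task = asyncio.create_task(execute_fn(item, url))
                active[task] = (item, url)

        await asyncio.sleep(0.1)             # 3. let tasks stream in parallel

        for task in [t for t in active if t.done()]:
            item, url = active.pop(task)
            pool.release_server(url)         # 4. server is free for the next item
            results[item.id] = task.result()

    return results
```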
### ui.py - Gradio UI Definition

**Purpose**: Define all Gradio components and wire up event bindings.

**Key Components**:

| Component | Type | Purpose |
|-----------|------|---------|
| `servers_input` | Textbox | LM Studio server URLs (one per line) |
| `models_checkboxes` | CheckboxGroup | Select models to compare |
| `system_prompt_input` | Textbox (50 lines) | Editable system prompt |
| `temperature_slider` | Slider | Model temperature (0-2) |
| `timeout_slider` | Slider | Request timeout (30-600s) |
| `max_tokens_slider` | Slider | Max tokens (256-8192) |
| `seed_input` | Number | Optional seed for reproducible outputs |
| `repeat_penalty_slider` | Slider | Repeat penalty (1.0-2.0, default 1.1) |
| `prompt_input` | Textbox | User prompt entry |
| `image_input` | Image | Optional image attachment for vision models |
| `tools_input` | Code (JSON) | Tools for function calling |
| `model_outputs[0..9]` | Markdown | Model response tabs |
| `tab_states` | JSON (hidden) | Tab status for color updates |

**Event Bindings**:

- Buttons trigger async handlers
- `tab_states.change` triggers JavaScript for inline style updates
- `app.load` restores state from localStorage
### state.py - Global State

**Purpose**: Holds mutable state shared across handlers.

```python
server_pool: Optional[ServerPool] = None
session: Optional[ComparisonSession] = None
```

**Design Decision**: Kept in its own module to avoid circular imports between ui.py and handlers.py.
### parsers.py - Text Parsing Utilities

**Purpose**: Parse user input from UI components.

| Function | Input | Output |
|----------|-------|--------|
| `parse_models_input(text)` | `"model1\nmodel2"` | `["model1", "model2"]` |
| `parse_servers_input(text)` | `"http://...\nhttp://..."` | `["http://...", "http://..."]` |
| `parse_prompts_file(content)` | File content | List of prompts |
| `load_system_prompt(file_path)` | Optional file path | System prompt string |
| `get_default_system_prompt()` | - | Default prompt from file or constant |
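As an illustration, the two line-based parsers can be as small as the following sketch (not necessarily the exact implementation):

```python
def parse_models_input(text: str) -> list[str]:
    """One model ID per line; blank lines and surrounding whitespace dropped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_servers_input(text: str) -> list[str]:
    """One server URL per line; trailing slashes trimmed for consistent keys."""
    return [line.strip().rstrip("/") for line in text.splitlines() if line.strip()]
```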
### export.py - Report Generation

**Purpose**: Generate exportable reports from session state.

```python
def generate_markdown_report(state: SessionState) -> str:
    """Create Markdown with header, system prompt, and all model conversations."""

def generate_json_report(state: SessionState) -> str:
    """Create structured JSON with configuration and conversations."""

def save_report(content: str, filepath: str):
    """Write report to file."""
```
### main.py - Entry Point

**Purpose**: Application entry point and backwards-compatibility exports.

```python
def run():
    app = create_app()
    app.launch(server_name="0.0.0.0", server_port=get_gradio_port())
```
## Data Flow: Sending a Prompt

```
1. User types prompt, clicks "Send Prompt"
   │
   ▼
2. ui.py: send_button.click(fn=send_single_prompt, inputs=[prompt, tools])
   │
   ▼
3. handlers.py: send_single_prompt(prompt, tools_json)
   │   - Validate session exists
   │   - Parse tools JSON
   │   - Add user message to all model contexts
   │   - Refresh server manifests
   │
   ▼
4. Work-stealing dispatcher loop:
   │   ┌───────────────────────────────────────────┐
   │   │ For each idle server:                     │
   │   │   Find model in queue this server has     │
   │   │   If found: start async task              │
   │   └───────────────────────────────────────────┘
   │      │  await asyncio.sleep(0.1)
   │      │  yield (status, tab_states, *outputs) ──────► UI updates
   │      │  Clean up completed tasks
   │      └─────────── repeat while queue or active_tasks
   │
   ▼
5. Each async task: run_model_on_server(model_id, server_url)
   │   - Mark model as "streaming"
   │   - Call stream_completion() ────────────────────► LM Studio API
   │   - Accumulate chunks in streaming_responses[model_id]
   │   - On complete: add assistant message to context
   │   - Release server
   │
   ▼
6. Final yield: ("✓ All responses complete", final_states, *final_outputs)
```
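Steps 4-6 run inside an async generator: every `yield` becomes one UI repaint, which is what makes tokens appear incrementally. A self-contained toy version of that shape (the real handler streams from servers instead of `fake_stream`):

```python
import asyncio


async def fake_stream(model: str, outputs: dict[str, str]) -> None:
    """Stand-in for run_model_on_server(): accumulate chunks as they arrive."""
    for chunk in ("Hello", ", ", "world"):
        await asyncio.sleep(0.2)
        outputs[model] += chunk


async def send_prompt_sketch(models: list[str]):
    """Yield partial UI state while per-model tasks stream (illustrative)."""
    outputs = {m: "" for m in models}
    tasks = [asyncio.create_task(fake_stream(m, outputs)) for m in models]
    while not all(t.done() for t in tasks):
        await asyncio.sleep(0.1)                       # poll, then repaint
        yield ("Streaming...", *outputs.values())
    yield ("✓ All responses complete", *outputs.values())


async def main() -> None:
    async for status, *outs in send_prompt_sketch(["model-a", "model-b"]):
        print(status, outs)


asyncio.run(main())
```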
## State Management

### Session State (Python)

```python
SessionState:
    models: list[str]                  # Selected models
    contexts: dict[str, ModelContext]  # model_id -> conversation
    system_prompt: str
    temperature: float
    timeout_seconds: int
    max_tokens: int
    halted: bool                       # True if any model failed
    halt_reason: Optional[str]
```
### UI State (Browser localStorage)

| Key | Type | Purpose |
|-----|------|---------|
| `promptprix_servers` | string | Server URLs (newline-separated) |
| `promptprix_model_choices` | JSON array | Available models from last fetch |
| `promptprix_models` | JSON array | Selected models |
| `promptprix_temperature` | float | Temperature setting |
| `promptprix_timeout` | int | Timeout setting |
| `promptprix_max_tokens` | int | Max tokens setting |
| `promptprix_tools` | string | Tools JSON |
| `promptprix_system_prompt` | string | System prompt text |

**Persistence**: Only saved when the user clicks the "Save State" button (explicit save).
## Tab Status Visualization

Tab colors indicate model status during streaming:

| Status | Color | Border |
|--------|-------|--------|
| `pending` | Red gradient (#fee2e2 → #fecaca) | 4px solid #ef4444 |
| `streaming` | Yellow gradient (#fef3c7 → #fde68a) | 4px solid #f59e0b |
| `completed` | Green gradient (#d1fae5 → #a7f3d0) | 4px solid #10b981 |

**Implementation**: Uses inline JavaScript styles (`element.style`) to override Gradio's theme CSS.
## Error Handling

### Fail-Fast Validation

1. `initialize_session` validates:
   - Servers are configured
   - Models are configured
   - All selected models exist on at least one server
2. `send_single_prompt` validates:
   - Session is initialized
   - Session is not halted
   - Prompt is not empty
   - Tools JSON is valid (if provided)

### Halt-on-Error

If any model fails during `send_prompt_to_all`:

- `state.halted = True`
- `state.halt_reason = "Model {model_id} failed: {error}"`
- Subsequent prompts are rejected
### Human-Readable Errors

The `LMStudioError` exception extracts error messages from LM Studio's JSON responses:

```python
{"error": {"message": "Model not loaded"}} → "Model not loaded"
```
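A sketch of how that extraction might look (the class shape and raw-body fallback are assumptions; only the message-extraction behavior is documented above):

```python
import json


class LMStudioError(Exception):
    """Carries a human-readable message pulled from an error response (sketch)."""

    @classmethod
    def from_response_body(cls, body: str) -> "LMStudioError":
        try:
            message = json.loads(body)["error"]["message"]
        except (json.JSONDecodeError, KeyError, TypeError):
            message = body  # fall back to the raw body
        return cls(message)


# from_response_body('{"error": {"message": "Model not loaded"}}')
# -> LMStudioError("Model not loaded")
```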
## Integration Points

### Upstream: Benchmark Sources

prompt-prix can consume test cases from established benchmark ecosystems:

| Source | Format | Usage |
|--------|--------|-------|
| **promptfoo** | YAML with assertions | Full eval format with pass/fail criteria |
| **Inspect AI** | Python test definitions | Export prompts, import as JSON |
| **Custom JSON** | OpenAI-compatible messages | Direct load in prompt-prix |

See [ADR-001](adr/completed/001-use-existing-benchmarks.md) for rationale.
### API Layer: OpenAI-Compatible

All inference servers must expose OpenAI-compatible endpoints:

```
GET  /v1/models            → List available models
POST /v1/chat/completions  → Chat completion (streaming)
```
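For example, the manifest endpoint can be queried with plain `httpx` (the URL is a placeholder; the response shape follows the OpenAI spec):

```python
import httpx

# Ask a server which models it currently serves (placeholder URL).
resp = httpx.get("http://localhost:1234/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```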
Supported servers:

- LM Studio (native)
- Ollama (OpenAI mode)
- vLLM
- llama.cpp server
- Any OpenAI-compatible proxy

See [ADR-003](adr/completed/003-openai-compatible-api.md) for rationale.
## Fan-Out Dispatcher Pattern

The core abstraction is **fan-out**: one prompt dispatched to N models in parallel.

```
┌────────────────────────────────────────────────────────────────┐
│                       Fan-Out Dispatcher                       │
│                                                                │
│   Input: (prompt, [model_a, model_b, model_c])                 │
│                    │                                           │
│         ┌──────────┼──────────┐                                │
│         ▼          ▼          ▼                                │
│    ┌─────────┐ ┌─────────┐ ┌─────────┐                         │
│    │ Model A │ │ Model B │ │ Model C │                         │
│    │ Server1 │ │ Server1 │ │ Server2 │                         │
│    └────┬────┘ └────┬────┘ └────┬────┘                         │
│         │           │           │                              │
│         ▼           ▼           ▼                              │
│    Response A  Response B  Response C                          │
│                                                                │
│   Output: {model_a: resp_a, model_b: resp_b, model_c: resp_c}  │
└────────────────────────────────────────────────────────────────┘
```
### Work-Stealing Implementation

The dispatcher uses work-stealing for GPU efficiency:

1. **Queue**: All models to process
2. **Acquire**: Find an idle server that hosts a queued model
3. **Execute**: Stream the response, update the UI
4. **Release**: The server becomes available for the next model

This maximizes utilization when models are distributed across multiple GPUs.

See [ADR-002](adr/completed/002-fan-out-pattern-as-core.md) for rationale.
## Architecture Decision Records

| ADR | Decision |
|-----|----------|
| [001](adr/completed/001-use-existing-benchmarks.md) | Use existing benchmarks (promptfoo, Inspect AI) instead of a custom eval schema |
| [002](adr/completed/002-fan-out-pattern-as-core.md) | Fan-out pattern as the core architectural abstraction |
| [003](adr/completed/003-openai-compatible-api.md) | OpenAI-compatible API as the sole integration layer |