metadata
title: Flow
emoji: 🔄
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
Flow
Evaluate and Optimize Coding Agent Configurations
Flow is a framework for running experiments on LLM coding agents. Compare context engineering strategies (message compaction, agent memory, sub-agents), evaluate results with LLM-as-Judge, and find optimal configurations that balance quality and token cost.
Features
- Ablation Studies: Test different agent configurations side-by-side
- LLM-as-Judge Evaluation: Automatically score agent outputs for correctness
- Pareto Analysis: Find optimal quality vs. cost tradeoffs
- Web UI: Visual interface for managing experiments and viewing results
- Config Export: Export winning configurations for production use
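As a rough illustration of the Pareto Analysis feature, the sketch below picks out configurations that no other configuration beats on both quality and token cost. The `Result` type, the `pareto_front` helper, and the numbers are hypothetical and made up for this example; they are not Flow's API or real measurements.

```python
from typing import NamedTuple


class Result(NamedTuple):
    """Hypothetical per-configuration summary (not a Flow type)."""
    name: str
    quality: float  # higher is better, e.g. a judge score in [0, 1]
    tokens: int     # lower is better


def pareto_front(results: list[Result]) -> list[Result]:
    """Return the results that are not dominated by any other result."""

    def dominates(a: Result, b: Result) -> bool:
        # a dominates b if it is at least as good on both axes
        # and strictly better on at least one.
        return (
            a.quality >= b.quality
            and a.tokens <= b.tokens
            and (a.quality > b.quality or a.tokens < b.tokens)
        )

    return [r for r in results if not any(dominates(o, r) for o in results)]


# Illustrative numbers only.
results = [
    Result("baseline", 0.70, 12000),
    Result("compaction", 0.72, 8000),
    Result("full", 0.80, 15000),
    Result("wasteful", 0.65, 14000),
]
front = pareto_front(results)
print([r.name for r in front])
```

Here `compaction` and `full` remain on the front: neither is both cheaper and better than the other, so choosing between them depends on how much extra quality is worth the extra tokens.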
Quick Start
1. Install
# Clone and install with uv
git clone https://github.com/victordibia/flow
cd flow
uv sync
2. Configure Azure OpenAI
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o"
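Before running an experiment, it can help to confirm that the three variables above are actually set in the current shell. The following is a minimal sketch; `missing_azure_vars` is a hypothetical helper written for this example and is not part of Flow.

```python
import os

# The three variables named in the configuration step above.
REQUIRED = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT",
)


def missing_azure_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_azure_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Azure OpenAI environment looks complete.")
```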
3. Run Optimization
# Run with built-in task suite
uv run flow optimize --suite coding
# Or with custom tasks
uv run flow optimize --tasks my_tasks.jsonl
4. Launch Web UI
uv run flow serve
# Opens at http://localhost:8091
CLI Commands
flow optimize [OPTIONS] # Run optimization experiments
flow serve # Start the web UI
flow run [TASK] # Run a single agent task
flow config # Show current configuration
flow init # Initialize Flow directories
What Gets Optimized
Flow tests different context engineering strategies:
| Strategy | Description |
|---|---|
| Message Compaction | Keep first N + last M messages, discard middle |
| Agent Memory | Persistent storage the agent controls |
| Sub-Agent Isolation | Delegate research to isolated sub-agent |
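The message compaction row ("keep first N + last M messages, discard middle") can be sketched in a few lines. This is an illustration of the idea only, not Flow's implementation; `compact_messages` and its `head_size` and `tail_size` parameters are names assumed for this example.

```python
def compact_messages(messages, head_size, tail_size):
    """Keep the first `head_size` and last `tail_size` messages, dropping the middle.

    Hypothetical helper illustrating the strategy; not Flow's code.
    """
    if head_size < 0 or tail_size < 0:
        raise ValueError("head_size and tail_size must be non-negative")
    # Nothing to discard if head and tail already cover the whole history.
    if len(messages) <= head_size + tail_size:
        return list(messages)
    # Guard tail_size == 0: messages[-0:] would return the full list.
    tail = messages[-tail_size:] if tail_size else []
    return list(messages[:head_size]) + list(tail)


history = [f"m{i}" for i in range(10)]
print(compact_messages(history, head_size=2, tail_size=3))
# → ['m0', 'm1', 'm7', 'm8', 'm9']
```

Keeping the head preserves the original task instructions, while keeping the tail preserves the most recent working state; the discarded middle is where the token savings come from.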
Example configurations:
from flow.experiments.ablation import AblationConfig

configs = [
    AblationConfig(name="baseline", enable_message_compaction=False),
    AblationConfig(name="compaction", enable_message_compaction=True, compaction_head_size=10),
    AblationConfig(name="full", enable_message_compaction=True, enable_memory_tool=True),
]
Task Format
Tasks are defined in JSONL format:
{"name": "fizzbuzz", "prompt": "Create fizzbuzz.py and run it", "criteria": [{"name": "correct", "instruction": "Output shows FizzBuzz pattern"}]}
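A task file can be sanity-checked before it is passed to `--tasks`. The sketch below parses JSONL text and checks for the fields seen in the example above. Treating `name`, `prompt`, and `criteria` as required is an assumption drawn from that single example, and `parse_tasks` is a hypothetical helper, not part of Flow.

```python
import json


def parse_tasks(text):
    """Parse JSONL text into task dicts, one JSON object per non-empty line.

    Assumes each task needs "name", "prompt", and "criteria", and each
    criterion needs "name" and "instruction" (inferred from the example).
    """
    tasks = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        task = json.loads(line)
        for key in ("name", "prompt", "criteria"):
            if key not in task:
                raise ValueError(f"line {lineno}: missing field '{key}'")
        for criterion in task["criteria"]:
            if "name" not in criterion or "instruction" not in criterion:
                raise ValueError(f"line {lineno}: incomplete criterion")
        tasks.append(task)
    return tasks


sample = (
    '{"name": "fizzbuzz", "prompt": "Create fizzbuzz.py and run it", '
    '"criteria": [{"name": "correct", "instruction": "Output shows FizzBuzz pattern"}]}\n'
)
tasks = parse_tasks(sample)
print(tasks[0]["name"], len(tasks[0]["criteria"]))
```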
Development
# Install dev dependencies
uv sync --dev
# Run tests
uv run pytest tests/ -v
# Type checking
uv run pyright src/
# Linting
uv run ruff check src/
uv run ruff format src/
License
MIT License - see LICENSE for details.
