Prompt Optimization Benchmark

Evolves plain-text LLM instruction prompts using SkyDiscover's evolutionary search.

How It Works

  1. Start with a seed prompt (e.g., a one-line instruction)
  2. Evaluate the prompt by running it on a QA task and measuring exact-match accuracy
  3. Use an LLM to rewrite the prompt guided by performance feedback and context from other candidates
  4. Repeat — the population of prompts improves over generations
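The four steps above can be sketched as a small loop. This is a minimal illustration, not SkyDiscover's actual implementation or API: `evolve_prompts`, `toy_evaluate`, `toy_rewrite`, and `HINTS` are hypothetical names, the LLM rewriter is replaced by a deterministic stand-in, and the parent selection rule (best of two random candidates) is an assumption.

```python
import random
from typing import Callable, List, Optional, Tuple

Scored = Tuple[float, str]  # (score, prompt text)


def evolve_prompts(
    seed_prompt: str,
    evaluate: Callable[[str], float],
    rewrite: Callable[[str, float, List[str]], str],
    generations: int = 5,
    population_size: int = 4,
    rng: Optional[random.Random] = None,
) -> Scored:
    """Return the best (score, prompt) pair found. Hypothetical sketch."""
    rng = rng or random.Random(0)
    # Steps 1 and 2: start from the seed prompt and score it.
    population: List[Scored] = [(evaluate(seed_prompt), seed_prompt)]
    for _ in range(generations):
        # Pick a parent: the better of up to two randomly sampled candidates.
        parent_score, parent = max(
            rng.sample(population, k=min(2, len(population)))
        )
        # Context: the other candidates, shown to the rewriter with the parent.
        context = [p for _, p in population if p != parent]
        # Step 3: rewrite the whole prompt (no diffs), then score new results.
        child = rewrite(parent, parent_score, context)
        if all(child != p for _, p in population):
            population.append((evaluate(child), child))
        # Step 4: keep only the top candidates for the next generation.
        population = sorted(population, reverse=True)[:population_size]
    return max(population)


# Toy stand-ins (assumptions): a real run would score prompts on a QA task
# and call an LLM to produce the rewrite.
HINTS = ["Answer concisely.", "Use only the given context.", "Think step by step."]


def toy_evaluate(prompt: str) -> float:
    """Score = fraction of hint sentences present in the prompt."""
    return sum(h in prompt for h in HINTS) / len(HINTS)


def toy_rewrite(parent: str, score: float, context: List[str]) -> str:
    """Append the first missing hint; a real rewriter would use score and context."""
    for hint in HINTS:
        if hint not in parent:
            return parent + " " + hint
    return parent


if __name__ == "__main__":
    best_score, best_prompt = evolve_prompts(
        "Answer the question.", toy_evaluate, toy_rewrite
    )
    print(best_score, best_prompt)
```

Passing `evaluate` and `rewrite` as callables keeps the loop independent of any particular model or dataset, so the same skeleton works with a real evaluator in place of the toy one.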

Key config: language: text and diff_based_generation: false. With these settings each iteration rewrites the prompt in full instead of emitting a diff, as it would for code.

Benchmarks

Task          Dataset   Metric                      Dir
Multi-hop QA  HotPotQA  Exact match accuracy (0–1)  hotpot_qa/
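For reference, exact-match accuracy can be computed roughly as follows. This is a sketch under assumptions: the normalization (lowercasing, stripping punctuation and English articles) follows a common QA convention and may differ from what the benchmark's evaluator.py does, and the names `normalize_answer` and `exact_match_accuracy` are hypothetical.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["The Eiffel Tower.", "Paris", "1999"]
    golds = ["eiffel tower", "London", "1999"]
    print(exact_match_accuracy(preds, golds))  # two of three match
```

The result always falls in the 0–1 range shown in the table, with 1.0 meaning every prediction matched its reference.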

Quick Start

cd benchmarks/prompt_optimization/hotpot_qa

# Install deps
pip install dspy litellm bm25s pystemmer datasets diskcache ujson

# Set API key
export OPENAI_API_KEY=...

# Run (first run downloads ~1.3GB of data)
uv run skydiscover-run initial_prompt.txt evaluator.py -c config_adaevolve.yaml -i 100  # AdaEvolve
uv run skydiscover-run initial_prompt.txt evaluator.py -c config_evox.yaml -i 100       # EvoX

See hotpot_qa/README.md for full details.