	# Prompt Optimization Benchmark

	Evolves plain-text LLM instruction prompts using SkyDiscover's evolutionary search.

	## How It Works

	1. Start with a seed prompt (e.g., a one-line instruction)
	2. Evaluate the prompt by running it on a QA task and measuring exact-match accuracy
	3. Use an LLM to rewrite the prompt guided by performance feedback and context from other candidates
	4. Repeat, so that the population of prompts improves over generations
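
The four steps above can be sketched as a small loop. This is a minimal illustration, not SkyDiscover's implementation: `evolve_prompts`, `toy_evaluate`, `toy_rewrite`, and `HINTS` are hypothetical names, the toy evaluator stands in for the real QA run, and the toy rewriter stands in for the LLM call. Parent selection here is simply greedy (always the current best), which AdaEvolve and EvoX presumably replace with their own strategies.

```python
from typing import Callable, List, Tuple

# Hypothetical hints used only by the toy evaluator/rewriter below.
HINTS = ["Think step by step.", "Answer with a short phrase.", "Use both passages."]


def toy_evaluate(prompt: str) -> float:
    """Stand-in for step 2: score in [0, 1]; the real benchmark runs a QA task."""
    return sum(h in prompt for h in HINTS) / len(HINTS)


def toy_rewrite(parent: str, score: float, context: List[str]) -> str:
    """Stand-in for step 3: a real run would ask an LLM for a full rewrite,
    passing the parent's score and the other candidates as context."""
    missing = [h for h in HINTS if h not in parent]
    return parent if not missing else f"{parent} {missing[0]}"


def evolve_prompts(
    seed_prompt: str,
    evaluate: Callable[[str], float],
    rewrite: Callable[[str, float, List[str]], str],
    iterations: int = 10,
    population_size: int = 4,
) -> Tuple[float, str]:
    """Greedy evolutionary loop: rewrite the best prompt, keep the top candidates."""
    # Step 1: start from the seed prompt.
    population = [(evaluate(seed_prompt), seed_prompt)]
    for _ in range(iterations):
        parent_score, parent = population[0]  # current best candidate
        context = [p for _, p in population[1:]]  # other candidates as context
        child = rewrite(parent, parent_score, context)
        if any(child == p for _, p in population):
            continue  # skip duplicates so the population stays diverse
        population.append((evaluate(child), child))
        # Step 4: keep only the highest-scoring prompts for the next generation.
        population.sort(key=lambda item: item[0], reverse=True)
        del population[population_size:]
    return population[0]


if __name__ == "__main__":
    score, prompt = evolve_prompts("Answer the question.", toy_evaluate, toy_rewrite, iterations=5)
    print(score, prompt)
```

Because the rewriter returns a whole new prompt string instead of a patch, the sketch also mirrors the full-rewrite behaviour the config settings select.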

	Key config: `language: text` and `diff_based_generation: false`. With these settings, each prompt is rewritten in full on every iteration instead of being edited as a diff, the way code would be.

	## Benchmarks

	| Task | Dataset | Metric | Directory |
	|------|---------|--------|-----------|
	| Multi-hop QA | [HotPotQA](https://hotpotqa.github.io/) | Exact match accuracy (0–1) | `hotpot_qa/` |
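
For readers unfamiliar with the metric, here is a minimal sketch of exact-match accuracy. It assumes the SQuAD-style answer normalization commonly used with HotPotQA (lowercasing, then dropping punctuation and the articles "a", "an", "the"); the benchmark's `evaluator.py` may normalize differently, and `normalize_answer` and `exact_match_accuracy` are illustrative names, not functions from this repository.

```python
import re
import string
from typing import Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after normalization (0 to 1)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        normalize_answer(pred) == normalize_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["The Eiffel Tower.", "paris", "1999"]
    golds = ["eiffel tower", "Paris", "2001"]
    print(exact_match_accuracy(preds, golds))
```

Exact match is strict: a prediction that contains the right answer plus extra words still scores zero, which is one reason the prompt wording matters so much for this metric.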

	## Quick Start

	```bash
	cd benchmarks/prompt_optimization/hotpot_qa

	# Install deps
	pip install dspy litellm bm25s pystemmer datasets diskcache ujson

	# Set API key
	export OPENAI_API_KEY=...

	# Run (first run downloads ~1.3GB of data)
	uv run skydiscover-run initial_prompt.txt evaluator.py -c config_adaevolve.yaml -i 100 # AdaEvolve
	uv run skydiscover-run initial_prompt.txt evaluator.py -c config_evox.yaml -i 100 # EvoX
	```

	See `hotpot_qa/README.md` for full details.