	# Analyzing Agent Trajectories

	How to inspect what the agents actually did during an experiment — distinct from "what was the final score." Use this to diagnose whether the agent is adding value or failing silently.

	Two agents to analyze separately:

	- Evolution Agent: the code-generation LLM that mutates programs each generation
	- Evaluation Agent (eval_agent / EV2): the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates

	---

	## Where to Find the Trajectories

	All artifacts live under the experiment directory. Assume `$RUN = results/frontier_cs_algorithmic/<experiment_name>` and `$P = $RUN/p<id>` for a single problem.

	### Evolution Agent trajectory

	| What | Path | Contents |
	|------|------|----------|
	| Full program history | `$P/evolution_db.sqlite` | `programs` table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata |
	| Code each gen | `$P/gen_N/main.cpp` | The evaluated source at generation N |
	| Patch each gen | `$P/gen_N/edit.diff`, `$P/gen_N/search_replace.txt` | What the evolution LLM changed |
	| Parent before patch | `$P/gen_N/original.cpp` | Code that was mutated |
	| Evaluation output | `$P/gen_N/results/metrics.json`, `$P/gen_N/results/correct.json` | Primary evaluator output (includes aux metrics merged as `aux_*`) |
	| LLM call traces (if `--trajectory-log`) | `$P/gen_N/llm_trajectories/` | Raw LLM prompts and responses for the code-gen LLM |
	| Run config | `$P/experiment_config.yaml` | All evolution settings used |

	### Evaluation Agent (EV2) trajectory

	| What | Path | Contents |
	|------|------|----------|
	| Service state | `$P/eval_agent_memory/service_state.json` | `total_agent_runs`, `last_agent_trigger_gen`, `agent_trigger_history`, full `generation_history` (primary_score per gen) |
	| Current diagnostic | `$P/eval_agent_memory/diagnostic_report.md` | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) |
	| Cross-gen memory | `$P/eval_agent_memory/EVAL_AGENTS.md` | Append-only log: hypothesis/verdict per trigger |
	| Current aux metrics script | `$P/eval_agent_memory/auxiliary_metrics.py` | The `evaluate_aux()` function the agent wrote |
	| Aux metrics backup | `$P/eval_agent_memory/auxiliary_metrics.py.bak` | Previous version before agent's last edit |
	| Per-trigger results | `$P/eval_agent_memory/agent_runs/gen_N_result.json` | Agent output summary per trigger |
	| Agent candidate (v4+) | `$P/eval_agent_memory/agent_candidate.cpp` | Code the agent wrote as a direct fix |
	| Per-gen aux snapshot | `$P/gen_N/results/auxiliary_metrics_snapshot.py` | Copy of `auxiliary_metrics.py` used at gen N |
	| LLM raw completions (if `OPENHANDS_LOG_COMPLETIONS=1`) | `$P/eval_agent_memory/llm_completions/` | Raw eval-agent LLM calls |
	| Full message trajectory (if `ENABLE_FULL_TRAJECTORY_LOG=1`) | Inside `agent_runs/gen_N_*.json` | All messages incl. tool calls |

	### Run-level logs (parallel scripts)

	| What | Path |
	|------|------|
	| Per-problem runner logs | `$RUN/_worker_logs/problem_<id>.log` |
	| Per-slot eval service logs | `$RUN/_worker_logs/eval_service_port_<port>.log` |

	### Logging flags (enable before running)

	| Flag | Effect |
	|------|--------|
	| `--trajectory-log` (runner CLI) | Save evolution sampler LLM prompts/responses |
	| `ENABLE_FULL_TRAJECTORY_LOG=1` (env) | Save eval agent's full message trajectories |
	| `OPENHANDS_LOG_COMPLETIONS=1` (env) | Save eval agent's raw LLM completions |

	---

	## Part 1: Evolution Agent Analysis

	The evolution agent's full trajectory lives in the `programs` table of `evolution_db.sqlite` under each `pN/` directory.

	### 1.1 Progress and Completion

	```python
	import sqlite3

	db = f"{run_dir}/p{pid}/evolution_db.sqlite"
	conn = sqlite3.connect(db)
	max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
	best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0]
	total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
	```

	- `max_gen < 49` and DB `mtime` old → runner is stuck (likely waiting on LLM API)
	- Check `ps aux | grep run_experiment` to see if workers are alive
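The stall check above can be scripted. The following is a minimal sketch, assuming a 50-generation run (last generation 49) and a 300-second staleness threshold; both defaults and the function name `is_stalled` are illustrative, not values read from the run config.

```python
import os
import sqlite3
import time


def is_stalled(db_path, last_gen=49, max_age_s=300):
    """Return (stalled, max_gen, age_s) for one problem's evolution DB.

    last_gen and max_age_s are assumed defaults, not read from the config.
    """
    conn = sqlite3.connect(db_path)
    try:
        max_gen = conn.execute(
            "SELECT MAX(generation) FROM programs"
        ).fetchone()[0]
    finally:
        conn.close()
    if max_gen is None:  # table exists but holds no programs yet
        max_gen = -1
    # Seconds since the DB file was last written.
    age_s = time.time() - os.path.getmtime(db_path)
    stalled = max_gen < last_gen and age_s > max_age_s
    return stalled, max_gen, age_s
```

A run that has reached `last_gen` is never reported as stalled, however old the file is, so finished experiments do not raise false alarms.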

	### 1.2 Archive and Parent Selection Health

	Evolution depends on `correct=1` to populate archive and select parents. If the archive is empty, evolution degenerates to "keep mutating the seed."

	```sql
	SELECT COUNT(*) FROM programs WHERE correct=1;
	```

	- Ratio `n_correct / n_total`:
	  - Near 0 → `correct` definition is broken for this task (it's too strict). Archive is always empty, no diversity.
	  - Near 1 → `correct` definition is loose (covers any runnable code). Archive and parents work normally.
	- In Frontier-CS, the original `correct=passed` (all test cases perfect) gave 0/51, so the archive was silently empty. Fixed by redefining `correct=True` for any compilable/runnable code.
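The ratio check can be wrapped in a small helper. This is a sketch only: the cutoffs `low` and `high` standing in for "near 0" and "near 1" are arbitrary assumptions, and `archive_health` is a hypothetical name.

```python
import sqlite3


def archive_health(conn, low=0.05, high=0.95):
    """Classify the correct/total ratio of the programs table.

    low/high are assumed cutoffs for "near 0" and "near 1".
    """
    # (correct = 1) evaluates to 0/1 in SQLite; NULL rows are ignored by SUM.
    n_correct, n_total = conn.execute(
        "SELECT COALESCE(SUM(correct = 1), 0), COUNT(*) FROM programs"
    ).fetchone()
    if n_total == 0:
        return "empty", 0.0
    ratio = n_correct / n_total
    if ratio <= low:
        return "too_strict", ratio  # archive likely empty
    if ratio >= high:
        return "loose", ratio  # archive and parent selection can work
    return "mixed", ratio
```

For the Frontier-CS case described above (0 correct out of 51), this returns `("too_strict", 0.0)`.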

	### 1.3 Per-Generation Best Score Trend

	Reveals whether evolution is actually progressing or stuck on a plateau:

	```python
	rows = conn.execute(
	    "SELECT generation, combined_score FROM programs ORDER BY generation"
	).fetchall()
	# Keep the highest non-NULL score seen in each generation.
	gen_best = {}
	for g, s in rows:
	    if s is not None and (g not in gen_best or s > gen_best[g]):
	        gen_best[g] = s
```
	```

	Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for:
	- Step jumps at trigger points → agent feedback or candidate helped
	- Gradual climb → normal evolution, agent contribution unclear
	- Noise dominates → variance too high, can't attribute improvements
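One way to look for step jumps at trigger points is to compare the best-so-far score just before a trigger with the best-so-far score a couple of generations later. The sketch below takes a `gen_best` mapping of generation to best score; the `window` and `min_jump` defaults are assumptions to tune per benchmark, and both function names are hypothetical.

```python
def running_best(gen_best):
    """Best-so-far score at each generation, in generation order."""
    out, best = {}, float("-inf")
    for g in sorted(gen_best):
        best = max(best, gen_best[g])
        out[g] = best
    return out


def jumps_after_triggers(gen_best, trigger_gens, window=2, min_jump=1.0):
    """Return [(trigger_gen, gain)] where the best-so-far score rose by at
    least min_jump within `window` generations after the trigger."""
    rb = running_best(gen_best)
    gens = sorted(rb)
    hits = []
    for t in trigger_gens:
        before = [rb[g] for g in gens if g <= t]
        after = [rb[g] for g in gens if t < g <= t + window]
        if before and after:
            gain = after[-1] - before[-1]
            if gain >= min_jump:
                hits.append((t, gain))
    return hits
```

A trigger that appears in the result for many problems is a candidate for real agent contribution; isolated hits are more likely noise.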

	### 1.4 Agent Candidates in the DB (v4+ only)

	When the eval agent writes code directly, candidates enter the `programs` table with `metadata.source = "agent_candidate"`.

	```sql
	SELECT COUNT(*) FROM programs
	WHERE metadata LIKE '%"source": "agent_candidate"%';
	```

	Warning: Don't use `LIKE '%agent_candidate%'` — that matches programs whose code content references the string. Always match the `source` key precisely.
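The `LIKE` pattern also depends on the exact `": "` spacing of the serialized JSON. A more robust alternative, sketched here under the assumption that `metadata` is stored as JSON text, is to parse each row in Python and compare the `source` key directly. `count_agent_candidates` is a hypothetical helper name.

```python
import json
import sqlite3


def count_agent_candidates(conn):
    """Count programs whose metadata JSON has source == "agent_candidate"."""
    n = 0
    for (raw,) in conn.execute("SELECT metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except (TypeError, ValueError):
            continue  # skip rows whose metadata is not valid JSON
        if isinstance(meta, dict) and meta.get("source") == "agent_candidate":
            n += 1
    return n
```

This ignores programs that merely mention the string elsewhere in their metadata, and it still counts rows serialized without a space after the colon.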

	### 1.5 Fair Comparison Under Compute Budget

	Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison:

	```sql
	SELECT MAX(combined_score) FROM programs WHERE generation <= (50 - n_candidates);
	```

	Compute `(50 - n_candidates)` per run to normalize total LLM calls, then compare.
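In the SQL above, `n_candidates` is a placeholder rather than a column, so the cutoff has to be bound per run. A minimal sketch with a parameterized query follows; it copies the cutoff expression from the text as-is, and whether candidate rows themselves fall under the cutoff depends on how the run numbers them, which this sketch does not address. The name `budget_matched_best` is hypothetical.

```python
import sqlite3


def budget_matched_best(conn, n_candidates, total_gens=50):
    """Best combined_score among generations <= total_gens - n_candidates."""
    cutoff = total_gens - n_candidates
    row = conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()
    return row[0]  # None if no program falls under the cutoff
```

Call it with the run's own candidate count, for example `budget_matched_best(conn, n_cand)`, and compare the result with the vanilla run's unrestricted best.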

	---

	## Part 2: Evaluation Agent Analysis

	The eval agent's trajectory lives in `pN/eval_agent_memory/` and `pN/gen_N/results/metrics.json` (aux metric outputs).

	### 2.1 Trigger History

	```python
	import json
	state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"))
	print(state["total_agent_runs"], state["last_agent_trigger_gen"])
	print(state["agent_trigger_history"])
	```

	- `total_agent_runs` should match `(max_gen - 5) // 5 + 1` for periodic mode (K=5)
	- If it's much lower, agent is being skipped (busy or erroring)
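The expected trigger count is simple arithmetic. The sketch below encodes the formula above and derives a skip ratio from it; the helper names are hypothetical and `k=5` mirrors the periodic mode mentioned in the text.

```python
def expected_agent_runs(max_gen, k=5):
    """Number of periodic triggers (every k generations) up to max_gen."""
    return max(0, (max_gen - k) // k + 1)


def agent_skip_ratio(total_agent_runs, max_gen, k=5):
    """Fraction of expected triggers that did not run (0.0 = none skipped)."""
    expected = expected_agent_runs(max_gen, k)
    if expected == 0:
        return 0.0
    return max(0.0, 1 - total_agent_runs / expected)
```

With `max_gen = 49` the expected count is 9, so a `total_agent_runs` of 3 means roughly two thirds of the triggers were skipped.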

	### 2.2 Diagnostic Report Quality

	Read `diagnostic_report.md` and classify:

	- Strong: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction
	- Weak: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code
	- Broken: empty file, parse errors, or repeats previous report

	Good diagnostic example (from p42):
	> "`aux_reported_L` is exactly `aux_L0_baseline`... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps `a` at 250. Increase to 500+."

	Weak example (from p241):
	> "Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric."

	### 2.3 Aux Metrics: Are They Actually Running Code?

	This is where silent failure hides. Check `auxiliary_metrics.py` for path correctness and check what it actually outputs.

	```python
	# Check the file
	aux = open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py").read()
	# Good pattern: subprocess.run + path to main.cpp
	# Red flag: only reads metrics.json or does Python simulation
	```

	Then check the latest generation's aux metric output in the DB:

	```python
	import sqlite3, json
	conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite")
	conn.row_factory = sqlite3.Row
	row = conn.execute(
	"SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
	).fetchone()
	pub = json.loads(row["public_metrics"])
	aux_keys = [k for k in pub if k.startswith("aux_")]
	```

	Classify what you see:

	- Meaningful: `aux_runtime_ms_N35`, `aux_num_rects`, `aux_reported_L_n11` — real measurements
	- Meta-only noise: `aux_aux_metric_eval_success`, `aux_aux_metric_error_code` — function ran but returned no real data
	- Compile failure: `aux_compile_success = 0` → path bug or broken aux_metrics

	Meta-only noise with no real measurements = aux metrics are silently broken.
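The classification can be automated on the `public_metrics` dictionary. This sketch treats keys containing `metric_eval` or `error_code` as meta-only, the same substrings the quick diagnostic script filters on; the return labels and the name `classify_aux` are illustrative assumptions.

```python
META_MARKERS = ("metric_eval", "error_code")  # bookkeeping keys, not measurements


def classify_aux(pub):
    """Classify the aux_* keys of one program's public metrics dict."""
    aux = {k: v for k, v in pub.items() if k.startswith("aux_")}
    if not aux:
        return "no_aux"
    if aux.get("aux_compile_success") == 0:
        return "compile_failure"
    real = [
        k for k in aux
        if k != "aux_compile_success"
        and not any(marker in k for marker in META_MARKERS)
    ]
    return "meaningful" if real else "meta_only"
```

A result of `"meta_only"` or `"compile_failure"` on the latest generation is the cue to run the path check described next.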

	### 2.4 The Path Bug Check

	The most common silent failure: `auxiliary_metrics.py` uses the wrong path to the code file.

	```bash
	grep "code_path\|main.cpp\|results_dir" pN/eval_agent_memory/auxiliary_metrics.py
	```

	- `os.path.join(results_dir, 'main.cpp')` — correct if service passes `gen_N/`
	- `os.path.join(results_dir, '..', 'main.cpp')` — correct if service passes `gen_N/results/`
	- Mismatch → compile always fails, every metric returns 0

	Cross-check: what does the service actually pass? Look in `ev2_service_standalone.py` for the `evaluate_func(request.results_dir)` call and whether it's adjusted.
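To see which of the two join expressions is valid for a given `results_dir`, it is enough to test both on disk. The sketch below is a standalone check, not part of the service; `locate_main_cpp` is a hypothetical helper.

```python
import os


def locate_main_cpp(results_dir):
    """Return the labels of the join expressions that resolve to a real file."""
    candidates = {
        "results_dir/main.cpp": os.path.join(results_dir, "main.cpp"),
        "results_dir/../main.cpp": os.path.join(results_dir, "..", "main.cpp"),
    }
    return [label for label, path in candidates.items() if os.path.isfile(path)]
```

Call it with the exact directory the service passes. If the label it returns differs from the expression used in `auxiliary_metrics.py`, the compile step cannot find the source file.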

	### 2.5 Agent Candidate Outputs (v4+)

	When the agent writes code directly, compare candidate score vs the original best:

	```python
	# Assumes conn.row_factory = sqlite3.Row, as set in section 2.3.
	rows = conn.execute(
	    "SELECT generation, combined_score, metadata FROM programs "
	    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' "
	    "ORDER BY generation"
	).fetchall()
	for r in rows:
	    meta = json.loads(r["metadata"])
	    print(f"gen={r['generation']} candidate={r['combined_score']:.2f} "
	          f"original={meta.get('original_score', '?')}")
	```

	- Candidate beats original → agent successfully identified and implemented a fix
	- Candidate = 0 → agent's code didn't compile or scored 0
	- Candidate < original → agent made things worse

	Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code.
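That ratio can be computed directly from the candidate rows. The sketch assumes `original_score` is stored as a number in the metadata JSON and skips rows where either score is missing; `candidate_win_rate` is a hypothetical name.

```python
import json


def candidate_win_rate(rows):
    """rows: iterable of (combined_score, metadata_json) for agent candidates.

    Returns (win_rate, n_compared); win_rate is None if nothing is comparable.
    """
    wins = total = 0
    for score, raw in rows:
        meta = json.loads(raw) if raw else {}
        original = meta.get("original_score")
        if score is None or original is None:
            continue  # cannot compare without both scores
        total += 1
        wins += score >= original
    return (wins / total if total else None), total
```

Feed it `(r["combined_score"], r["metadata"])` pairs from the candidate query above.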

	### 2.6 Effect Attribution

	Score gains are hard to attribute to the agent rather than to run-to-run variance. Use two cross-checks:

	- Compare to matched vanilla: fork from the same gen 5, same problems, both run 50 gens. Any diff beyond ±5 per problem and ±2 on average is plausibly signal, otherwise noise.
	- Look at trigger→jump alignment: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise.
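The matched-vanilla comparison can be sketched as follows, using the ±5 per-problem and ±2 average thresholds from the first bullet. The inputs are assumed to be dictionaries mapping problem id to best score, and `compare_to_vanilla` is a hypothetical helper.

```python
def compare_to_vanilla(agent_best, vanilla_best, per_problem=5.0, mean_thresh=2.0):
    """Compare best scores per problem between an agent run and a vanilla run."""
    # Only problems present in both runs can be compared.
    diffs = {
        p: agent_best[p] - vanilla_best[p]
        for p in agent_best
        if p in vanilla_best
    }
    flagged = {p: d for p, d in diffs.items() if abs(d) > per_problem}
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return {
        "diffs": diffs,
        "flagged": flagged,  # problems whose diff exceeds the per-problem band
        "mean_diff": mean_diff,
        "mean_is_signal": abs(mean_diff) > mean_thresh,
    }
```

Differences inside both bands should be read as noise, in line with the rule of thumb above.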

	---

	## Quick Diagnostic Script

	One-shot sanity check for a running experiment:

	```python
	import sqlite3, os, json, time

	def analyze_run(run_dir, pids, vanilla_dir=None):
	    """Print a one-line health summary for each problem id."""
	    for pid in pids:
	        db = f"{run_dir}/p{pid}/evolution_db.sqlite"
	        conn = sqlite3.connect(db); conn.row_factory = sqlite3.Row

	        mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0
	        best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
	        n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0]
	        n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
	        n_cand = conn.execute(
	            "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
	        ).fetchone()[0]

	        # Aux metrics of the latest generation, excluding meta-only keys.
	        row = conn.execute(
	            "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
	        ).fetchone()
	        conn.close()
	        pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {}
	        real_aux = [k for k in pub if k.startswith("aux_")
	                    and "metric_eval" not in k and "error_code" not in k]

	        state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"
	        runs = 0
	        if os.path.exists(state_path):
	            with open(state_path) as f:
	                runs = json.load(f)["total_agent_runs"]

	        age = time.time() - os.path.getmtime(db)  # seconds since last DB write
	        vanilla_best = ""
	        if vanilla_dir:
	            vconn = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite")
	            bv = vconn.execute(
	                "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
	            vconn.close()
	            vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}"

	        print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | "
	              f"correct={n_corr}/{n_total} | cands={n_cand} | "
	              f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s")
	```

	Red flags to watch for:

	| Symptom | Meaning |
	|---------|---------|
	| `correct=0/N` | `correct` definition too strict, archive broken |
	| `db_age > 300s` while `mg < 49` | runner stuck on LLM call |
	| `agent_runs << (mg-5)/5` | agent being skipped |
	| `real_aux = 0` but `agent_runs > 0` | aux metrics silently failing (likely path bug) |
	| `cands = 0` but using v4 | agent not producing candidates (check prompt) |
	| Big diff from vanilla but `cands = 0` | signal is from text feedback alone |