# Analyzing Agent Trajectories
How to inspect what the agents actually did during an experiment — distinct from "what was the final score." Use this to diagnose whether the agent is adding value or failing silently.
Two agents to analyze separately:
- **Evolution Agent**: the code-generation LLM that mutates programs each generation
- **Evaluation Agent (eval_agent / EV2)**: the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates
---
## Where to Find the Trajectories
All artifacts live under the experiment directory. Assume `$RUN = results/frontier_cs_algorithmic/<experiment_name>` and `$P = $RUN/p<id>` for a single problem.
### Evolution Agent trajectory
| What | Path | Contents |
|------|------|----------|
| Full program history | `$P/evolution_db.sqlite` | `programs` table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata |
| Code each gen | `$P/gen_N/main.cpp` | The evaluated source at generation N |
| Patch each gen | `$P/gen_N/edit.diff`, `$P/gen_N/search_replace.txt` | What the evolution LLM changed |
| Parent before patch | `$P/gen_N/original.cpp` | Code that was mutated |
| Evaluation output | `$P/gen_N/results/metrics.json`, `$P/gen_N/results/correct.json` | Primary evaluator output (includes aux metrics merged as `aux_*`) |
| LLM call traces (if `--trajectory-log`) | `$P/gen_N/llm_trajectories/` | Raw LLM prompts and responses for the code-gen LLM |
| Run config | `$P/experiment_config.yaml` | All evolution settings used |
### Evaluation Agent (EV2) trajectory
| What | Path | Contents |
|------|------|----------|
| Service state | `$P/eval_agent_memory/service_state.json` | `total_agent_runs`, `last_agent_trigger_gen`, `agent_trigger_history`, full `generation_history` (primary_score per gen) |
| Current diagnostic | `$P/eval_agent_memory/diagnostic_report.md` | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) |
| Cross-gen memory | `$P/eval_agent_memory/EVAL_AGENTS.md` | Append-only log — hypothesis/verdict per trigger |
| Current aux metrics script | `$P/eval_agent_memory/auxiliary_metrics.py` | The `evaluate_aux()` function the agent wrote |
| Aux metrics backup | `$P/eval_agent_memory/auxiliary_metrics.py.bak` | Previous version before agent's last edit |
| Per-trigger results | `$P/eval_agent_memory/agent_runs/gen_N_result.json` | Agent output summary per trigger |
| Agent candidate (v4+) | `$P/eval_agent_memory/agent_candidate.cpp` | Code the agent wrote as a direct fix |
| Per-gen aux snapshot | `$P/gen_N/results/auxiliary_metrics_snapshot.py` | Copy of `auxiliary_metrics.py` as used at gen N |
| LLM raw completions (if `OPENHANDS_LOG_COMPLETIONS=1`) | `$P/eval_agent_memory/llm_completions/` | Raw eval-agent LLM calls |
| Full message trajectory (if `ENABLE_FULL_TRAJECTORY_LOG=1`) | Inside `agent_runs/gen_N_*.json` | All messages incl. tool calls |
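
Before digging into a problem directory, it can help to confirm which of the fixed-name artifacts from the tables above actually exist. The following is a minimal sketch: `missing_artifacts` and the `EXPECTED` list are hypothetical names, and the list only covers files that do not depend on a generation number or an optional logging flag.

```python
import os

# Fixed-name artifacts from the tables above, relative to $P.
EXPECTED = [
    "evolution_db.sqlite",
    "experiment_config.yaml",
    "eval_agent_memory/service_state.json",
    "eval_agent_memory/diagnostic_report.md",
    "eval_agent_memory/EVAL_AGENTS.md",
    "eval_agent_memory/auxiliary_metrics.py",
]


def missing_artifacts(problem_dir, expected=EXPECTED):
    """Return the relative paths from `expected` that are absent under problem_dir."""
    return [
        rel for rel in expected
        if not os.path.exists(os.path.join(problem_dir, rel))
    ]
```

An empty result means the core files are present; a missing `eval_agent_memory/` entry on a vanilla run is expected, since that run has no eval agent.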
### Run-level logs (parallel scripts)
| What | Path |
|------|------|
| Per-problem runner logs | `$RUN/_worker_logs/problem_<id>.log` |
| Per-slot eval service logs | `$RUN/_worker_logs/eval_service_port_<port>.log` |
### Logging flags (enable before running)
| Flag | Effect |
|------|--------|
| `--trajectory-log` (runner CLI) | Save evolution sampler LLM prompts/responses |
| `ENABLE_FULL_TRAJECTORY_LOG=1` (env) | Save eval agent's full message trajectories |
| `OPENHANDS_LOG_COMPLETIONS=1` (env) | Save eval agent's raw LLM completions |
---
## Part 1: Evolution Agent Analysis
The evolution agent's full trajectory lives in the `evolution_db.sqlite` (the `programs` table) under each `pN/` directory.
### 1.1 Progress and Completion
```python
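import sqlite3

# run_dir and pid are placeholders: set them to $RUN and the problem id.
db = f"{run_dir}/p{pid}/evolution_db.sqlite"
conn = sqlite3.connect(db)
# Highest generation written so far (a 50-generation run ends at generation 49).
max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
# Best score over all programs, and the total number of programs stored.
best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]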
```
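- `max_gen < 49` and the DB file's `mtime` is more than a few minutes old (the quick script at the end flags 300 s) → runner is stuck, most likely waiting on the LLM API
- Run `ps aux | grep run_experiment` to check whether the workers are still alive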
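
The two checks above can be folded into one helper. This is a minimal sketch, assuming a 50-generation run (last generation index 49) and a 300-second staleness cutoff; `is_stalled` is a hypothetical name, and both defaults are assumptions to adjust per experiment.

```python
import os
import sqlite3
import time


def is_stalled(db_path, last_gen=49, stale_after_s=300):
    """Return (stalled, max_gen, age_s) for one problem's evolution DB.

    Assumptions: the run targets generations 0..last_gen, and a DB file
    untouched for more than stale_after_s seconds means nothing is being written.
    """
    # Read the age first so that opening the DB cannot influence it.
    age_s = time.time() - os.path.getmtime(db_path)
    conn = sqlite3.connect(db_path)
    try:
        max_gen = conn.execute(
            "SELECT MAX(generation) FROM programs"
        ).fetchone()[0]
    finally:
        conn.close()
    # MAX() over an empty table is NULL; treat that as "nothing written yet".
    max_gen = -1 if max_gen is None else max_gen
    stalled = max_gen < last_gen and age_s > stale_after_s
    return stalled, max_gen, age_s
```

A `True` result only says the DB is unfinished and quiet; the process check is still needed to tell a hung worker from a dead one.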
### 1.2 Archive and Parent Selection Health
Evolution relies on programs with `correct=1` to populate the archive and to select parents. If the archive is empty, evolution degenerates to "keep mutating the seed."
```sql
SELECT COUNT(*) FROM programs WHERE correct=1;
```
- Ratio `n_correct / n_total`:
  - Near 0 → `correct` definition is broken for this task (it's too strict). Archive is always empty, no diversity.
  - Near 1 → `correct` definition is loose (covers any runnable code). Archive and parents work normally.
- In Frontier-CS, the original `correct=passed` (all test cases perfect) gave 0/51 — **archive silently empty**. Fixed by redefining `correct=True` for any compilable/runnable code.
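
The ratio check can be scripted as follows. This is a sketch only: `archive_health` is a hypothetical helper, and the `low` cutoff for "near 0" is an illustrative assumption, not a value taken from the runner.

```python
import sqlite3


def archive_health(conn, low=0.05):
    """Return (n_correct, n_total, label) for the programs table.

    `low` is an illustrative cutoff for "near 0", not a value from the runner.
    """
    n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
    n_correct = conn.execute(
        "SELECT COUNT(*) FROM programs WHERE correct = 1"
    ).fetchone()[0]
    if n_total == 0:
        return n_correct, n_total, "no programs yet"
    ratio = n_correct / n_total
    label = "archive likely empty" if ratio <= low else "archive populated"
    return n_correct, n_total, label
```

The 0/51 case described above would come back as `(0, 51, "archive likely empty")`.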
### 1.3 Per-Generation Best Score Trend
Reveals whether evolution is actually progressing or stuck on a plateau:
```python
rows = conn.execute(
    "SELECT generation, combined_score FROM programs ORDER BY generation"
).fetchall()
gen_best = {}
for g, s in rows:
    if s is not None and (g not in gen_best or s > gen_best[g]):
        gen_best[g] = s
```
Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for:
- **Step jumps at trigger points** → agent feedback or candidate helped
- **Gradual climb** → normal evolution, agent contribution unclear
- **Noise dominates** → variance too high, can't attribute improvements
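
To make the "step jump at a trigger" pattern checkable rather than eyeballed, the `gen_best` dictionary can be scanned for rises in the running best. This is a minimal sketch: `jumps_near_triggers` is a hypothetical name, and both `window` (how many generations after a trigger still count as aligned) and `min_jump` (the smallest gain worth reporting) are assumptions to tune against the noise level of the task.

```python
def jumps_near_triggers(gen_best, triggers, window=2, min_jump=1.0):
    """List (generation, gain, near_trigger) for each rise in the running best.

    gen_best: dict of generation -> best score in that generation.
    triggers: generations at which the eval agent ran, e.g. [5, 10, 15].
    """
    out = []
    running = None  # best score seen in any earlier generation
    for g in sorted(gen_best):
        s = gen_best[g]
        if running is not None and s - running >= min_jump:
            # Aligned if the jump lands on a trigger or up to `window` gens after it.
            near = any(0 <= g - t <= window for t in triggers)
            out.append((g, s - running, near))
        if running is None or s > running:
            running = s
    return out
```

If most reported jumps have `near_trigger=True` across several problems, that supports the first pattern; a mix of `True` and `False` is what the "gradual climb" and "noise" patterns look like.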
### 1.4 Agent Candidates in the DB (v4+ only)
When the eval agent writes code directly, candidates enter the `programs` table with `metadata.source = "agent_candidate"`.
```sql
SELECT COUNT(*) FROM programs
WHERE metadata LIKE '%"source": "agent_candidate"%';
```
**Warning**: Don't use `LIKE '%agent_candidate%'` — that matches programs whose code content references the string. Always match the `source` key precisely.
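
The difference between the two patterns is easy to reproduce on a throwaway in-memory table. In this sketch the three metadata rows are made up for illustration, and the `note` key is hypothetical; the third count parses the JSON instead of pattern matching, which also avoids depending on the exact spacing after the colon.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE programs (id INTEGER PRIMARY KEY, metadata TEXT)")
rows = [
    {"source": "agent_candidate"},                       # a real candidate
    {"source": "evolution", "note": "agent_candidate"},  # only mentions the string
    {"source": "evolution"},
]
conn.executemany(
    "INSERT INTO programs (metadata) VALUES (?)",
    [(json.dumps(m),) for m in rows],
)

# Loose pattern: matches any row whose metadata contains the string anywhere.
loose = conn.execute(
    "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%agent_candidate%'"
).fetchone()[0]
# Precise pattern: matches only the "source" key with that value.
precise = conn.execute(
    "SELECT COUNT(*) FROM programs "
    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
).fetchone()[0]
# Parsed: decode each row and compare the key directly.
parsed = sum(
    1
    for (m,) in conn.execute("SELECT metadata FROM programs")
    if json.loads(m).get("source") == "agent_candidate"
)
print(loose, precise, parsed)  # prints: 2 1 1
```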
### 1.5 Fair Comparison Under Compute Budget
Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison:
```sql
SELECT MAX(combined_score) FROM programs WHERE generation <= (50 - n_candidates);
```
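`n_candidates` is a placeholder, not a SQL variable: substitute the candidate count from 1.4 for each run, so that both arms are compared at the same total number of code-generating LLM calls.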
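
The substitution can be done with a bound parameter instead of editing the SQL by hand. This is a sketch under the same cutoff rule as the query above (`generation <= 50 - n_candidates`); `budget_matched_best` is a hypothetical helper, and `total_gens=50` is an assumption that should match the run config.

```python
import sqlite3

# Same precise "source" match as in section 1.4.
CANDIDATE_FILTER = "metadata LIKE '%\"source\": \"agent_candidate\"%'"


def budget_matched_best(conn, total_gens=50):
    """Return (n_candidates, cutoff, best) with best taken up to the cutoff.

    cutoff = total_gens - n_candidates, following the query in the text.
    """
    n_candidates = conn.execute(
        f"SELECT COUNT(*) FROM programs WHERE {CANDIDATE_FILTER}"
    ).fetchone()[0]
    cutoff = total_gens - n_candidates
    best = conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()[0]
    return n_candidates, cutoff, best
```

Compare the returned `best` against the vanilla run's unrestricted `MAX(combined_score)`.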
---
## Part 2: Evaluation Agent Analysis
The eval agent's trajectory lives in `pN/eval_agent_memory/` and `pN/gen_N/results/metrics.json` (aux metric outputs).
### 2.1 Trigger History
```python
import json
state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"))
print(state["total_agent_runs"], state["last_agent_trigger_gen"])
print(state["agent_trigger_history"])
```
- `total_agent_runs` should match `(max_gen - 5) // 5 + 1` for periodic mode (K=5)
- If it's much lower, agent is being skipped (busy or erroring)
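
The expected count is simple enough to compute next to the recorded one. A small sketch, assuming periodic mode with the first trigger at generation K; the function names are hypothetical, and `state` is the dictionary loaded from `service_state.json` above.

```python
def expected_agent_runs(max_gen, k=5):
    """Triggers expected by generation max_gen when the agent fires every k
    generations starting at generation k (the formula from the text)."""
    if max_gen < k:
        return 0
    return (max_gen - k) // k + 1


def trigger_gap(state, max_gen, k=5):
    """Expected minus recorded runs; a large positive gap suggests skipped triggers."""
    return expected_agent_runs(max_gen, k) - state["total_agent_runs"]
```

For a finished 50-generation run (`max_gen = 49`, K=5) the expected count is 9, so `total_agent_runs` of 3 would give a gap of 6.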
### 2.2 Diagnostic Report Quality
Read `diagnostic_report.md` and classify:
- **Strong**: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction
- **Weak**: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code
- **Broken**: empty file, parse errors, or repeats previous report
Good diagnostic example (from p42):
> "`aux_reported_L` is exactly `aux_L0_baseline`... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps `a` at 250. Increase to 500+."
Weak example (from p241):
> "Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric."
### 2.3 Aux Metrics: Are They Actually Running Code?
This is where silent failure hides. Check `auxiliary_metrics.py` for path correctness, then check what it actually returned.
```python
# Check the file
aux = open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py").read()
# Good pattern: subprocess.run + path to main.cpp
# Red flag: only reads metrics.json or does Python simulation
```
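
The good-pattern and red-flag comments above can be turned into a crude string check on the script source. This is a heuristic sketch only: `aux_script_flags` is a hypothetical name, and substring matches are a hint about what the script does, not proof that it compiles or runs anything.

```python
def aux_script_flags(source):
    """Crude string checks on the text of auxiliary_metrics.py."""
    return {
        "calls_subprocess": "subprocess" in source,
        "references_main_cpp": "main.cpp" in source,
        "reads_metrics_json": "metrics.json" in source,
    }
```

A script with `reads_metrics_json` set and the other two unset matches the red flag and deserves a manual read.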
Then check the latest generation's aux metric output in the DB:
```python
import sqlite3, json
conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite")
conn.row_factory = sqlite3.Row
row = conn.execute(
"SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
).fetchone()
pub = json.loads(row["public_metrics"])
aux_keys = [k for k in pub if k.startswith("aux_")]
```
Classify what you see:
- **Meaningful**: `aux_runtime_ms_N35`, `aux_num_rects`, `aux_reported_L_n11` — real measurements
- **Meta-only noise**: `aux_aux_metric_eval_success`, `aux_aux_metric_error_code` — function ran but returned no real data
- **Compile failure**: `aux_compile_success = 0` → path bug or broken aux_metrics
Meta-only noise with no real measurements = aux metrics are silently broken.
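
The three-way classification can be applied to the `public_metrics` dictionary directly. A minimal sketch: `classify_aux` is a hypothetical helper, and the meta-key markers are the same two substrings that the quick diagnostic script in this document filters on.

```python
# Substrings that mark bookkeeping keys rather than measurements
# (same filter as the quick diagnostic script in this document).
META_MARKERS = ("metric_eval", "error_code")


def classify_aux(public_metrics):
    """Return (status, real_keys, meta_keys) for one program's public metrics."""
    aux = {k: v for k, v in public_metrics.items() if k.startswith("aux_")}
    meta = sorted(k for k in aux if any(m in k for m in META_MARKERS))
    real = sorted(k for k in aux if k not in meta)
    if aux.get("aux_compile_success") == 0:
        status = "compile failure"
    elif real:
        status = "meaningful"
    elif meta:
        status = "meta-only noise"
    else:
        status = "no aux metrics"
    return status, real, meta
```

Anything other than `"meaningful"` on a run where the agent has already triggered is worth following up with the path check in the next section.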
### 2.4 The Path Bug Check
The most common silent failure: `auxiliary_metrics.py` builds the wrong path to `main.cpp`.
```bash
grep "code_path\|main.cpp\|results_dir" pN/eval_agent_memory/auxiliary_metrics.py
```
- `os.path.join(results_dir, 'main.cpp')` — correct if service passes `gen_N/`
- `os.path.join(results_dir, '..', 'main.cpp')` — correct if service passes `gen_N/results/`
- Mismatch → compile always fails, every metric returns 0
Cross-check: what does the service actually pass? Look in `ev2_service_standalone.py` for the `evaluate_func(request.results_dir)` call and whether it's adjusted.
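
Given a concrete `results_dir` value (for example one copied from a service log), the two conventions can be tested directly on disk. A small sketch; `locate_main_cpp` and its return labels are hypothetical.

```python
import os


def locate_main_cpp(results_dir):
    """Report which of the two path conventions finds main.cpp.

    Returns "same_dir" if results_dir/main.cpp exists, "parent_dir" if
    results_dir/../main.cpp exists, or None when neither does.
    """
    if os.path.isfile(os.path.join(results_dir, "main.cpp")):
        return "same_dir"
    if os.path.isfile(os.path.join(results_dir, "..", "main.cpp")):
        return "parent_dir"
    return None
```

`"same_dir"` means the script should use `os.path.join(results_dir, 'main.cpp')`; `"parent_dir"` means it needs the `'..'` form. If the script uses the other one, that is the bug.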
### 2.5 Agent Candidate Outputs (v4+)
When the agent writes code directly, compare candidate score vs the original best:
```python
# Requires conn.row_factory = sqlite3.Row (set in 2.3) for access by column name.
rows = conn.execute(
    "SELECT generation, combined_score, metadata FROM programs "
    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' "
    "ORDER BY generation"
).fetchall()
for r in rows:
    meta = json.loads(r["metadata"])
    print(f"gen={r['generation']} candidate={r['combined_score']:.2f} "
          f"original={meta.get('original_score', '?')}")
```
- Candidate beats original → agent successfully identified and implemented a fix
- Candidate = 0 → agent's code didn't compile or scored 0
- Candidate < original → agent made things worse
Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code.
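
That ratio can be computed from the same rows. A minimal sketch, assuming `original_score` is present in the candidate's metadata as in the query above; `candidate_hit_rate` is a hypothetical helper, and rows lacking either score are skipped rather than counted as failures.

```python
import json


def candidate_hit_rate(rows):
    """rows: iterable of (combined_score, metadata_json) for agent candidates.

    Rows without a numeric score or without `original_score` are skipped.
    Returns (n_at_least_original, n_compared).
    """
    hits = compared = 0
    for score, metadata in rows:
        original = json.loads(metadata).get("original_score")
        if score is None or original is None:
            continue
        compared += 1
        if score >= original:
            hits += 1
    return hits, compared
```

Report both numbers rather than only the fraction, since a run with one or two candidates says little either way.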
### 2.6 Effect Attribution
Score gains are hard to attribute to the agent rather than to run-to-run variance. Use two cross-checks:
- **Compare to matched vanilla**: fork from the same gen 5, same problems, both run 50 gens. Any diff beyond ±5 per problem and ±2 on average is plausibly signal, otherwise noise.
- **Look at trigger→jump alignment**: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise.
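
The matched-vanilla check can be sketched as below, using the ±5 per-problem and ±2 average thresholds from the first bullet as defaults. `attribute_vs_vanilla` is a hypothetical helper; its inputs are dictionaries of problem id to best score, collected separately from the agent run and the vanilla run.

```python
def attribute_vs_vanilla(agent_best, vanilla_best, per_problem=5.0, on_average=2.0):
    """Compare best scores per problem between an agent run and a vanilla run.

    Only problems present in both dictionaries are compared.
    """
    shared = sorted(set(agent_best) & set(vanilla_best))
    diffs = {p: agent_best[p] - vanilla_best[p] for p in shared}
    # Problems whose difference exceeds the per-problem threshold in either direction.
    flagged = {p: d for p, d in diffs.items() if abs(d) > per_problem}
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return {
        "diffs": diffs,
        "beyond_per_problem": flagged,
        "mean_diff": mean_diff,
        "mean_beyond_threshold": abs(mean_diff) > on_average,
    }
```

Differences inside both thresholds should be read as noise, whichever direction they point.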
---
## Quick Diagnostic Script
One-shot sanity check for a running experiment:
```python
import sqlite3, os, json, time

def analyze_run(run_dir, pids, vanilla_dir=None):
    for pid in pids:
        db = f"{run_dir}/p{pid}/evolution_db.sqlite"
        conn = sqlite3.connect(db)
        conn.row_factory = sqlite3.Row
        mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0
        best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
        n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0]
        n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
        n_cand = conn.execute(
            "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
        ).fetchone()[0]
        row = conn.execute(
            "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
        ).fetchone()
        pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {}
        real_aux = [k for k in pub if k.startswith("aux_")
                    and "metric_eval" not in k and "error_code" not in k]
        state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"
        runs = json.load(open(state_path))["total_agent_runs"] if os.path.exists(state_path) else 0
        age = time.time() - os.path.getmtime(db)
        vanilla_best = ""
        if vanilla_dir:
            bv = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite").execute(
                "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
            vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}"
        print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | "
              f"correct={n_corr}/{n_total} | cands={n_cand} | "
              f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s")
```
Red flags to watch for:
| Symptom | Meaning |
|---------|---------|
| `correct=0/N` | `correct` definition too strict, archive broken |
| `db_age > 300s` while `mg < 49` | runner stuck on LLM call |
| `agent_runs << (mg-5)/5` | agent being skipped |
| `real_aux = 0` but `agent_runs > 0` | aux metrics silently failing (likely path bug) |
| `cands = 0` but using v4 | agent not producing candidates (check prompt) |
| Big diff from vanilla but `cands = 0` | signal is from text feedback alone |