| # Analyzing Agent Trajectories |
|
|
| How to inspect what the agents actually did during an experiment — distinct from "what was the final score." Use this to diagnose whether the agent is adding value or failing silently. |
|
|
| Two agents to analyze separately: |
|
|
| - **Evolution Agent**: the code-generation LLM that mutates programs each generation |
| - **Evaluation Agent (eval_agent / EV2)**: the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates |
| |
| --- |
| |
| ## Where to Find the Trajectories |
| |
| All artifacts live under the experiment directory. Assume `$RUN = results/frontier_cs_algorithmic/<experiment_name>` and `$P = $RUN/p<id>` for a single problem. |
| |
| ### Evolution Agent trajectory |
| |
| | What | Path | Contents | |
| |------|------|----------| |
| | Full program history | `$P/evolution_db.sqlite` | `programs` table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata | |
| | Code each gen | `$P/gen_N/main.cpp` | The evaluated source at generation N | |
| | Patch each gen | `$P/gen_N/edit.diff`, `$P/gen_N/search_replace.txt` | What the evolution LLM changed | |
| | Parent before patch | `$P/gen_N/original.cpp` | Code that was mutated | |
| | Evaluation output | `$P/gen_N/results/metrics.json`, `$P/gen_N/results/correct.json` | Primary evaluator output (includes aux metrics merged as `aux_*`) | |
| | LLM call traces (if `--trajectory-log`) | `$P/gen_N/llm_trajectories/` | Raw LLM prompts and responses for the code-gen LLM | |
| | Run config | `$P/experiment_config.yaml` | All evolution settings used | |
| |
| ### Evaluation Agent (EV2) trajectory |
| |
| | What | Path | Contents | |
| |------|------|----------| |
| | Service state | `$P/eval_agent_memory/service_state.json` | `total_agent_runs`, `last_agent_trigger_gen`, `agent_trigger_history`, full `generation_history` (primary_score per gen) | |
| | Current diagnostic | `$P/eval_agent_memory/diagnostic_report.md` | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) | |
| | Cross-gen memory | `$P/eval_agent_memory/EVAL_AGENTS.md` | Append-only log — hypothesis/verdict per trigger | |
| | Current aux metrics script | `$P/eval_agent_memory/auxiliary_metrics.py` | The `evaluate_aux()` function the agent wrote | |
| | Aux metrics backup | `$P/eval_agent_memory/auxiliary_metrics.py.bak` | Previous version before agent's last edit | |
| | Per-trigger results | `$P/eval_agent_memory/agent_runs/gen_N_result.json` | Agent output summary per trigger | |
| | Agent candidate (v4+) | `$P/eval_agent_memory/agent_candidate.cpp` | Code the agent wrote as a direct fix | |
| | Per-gen aux snapshot | `$P/gen_N/results/auxiliary_metrics_snapshot.py` | Copy of `auxiliary_metrics.py` used at gen N | |
| | LLM raw completions (if `OPENHANDS_LOG_COMPLETIONS=1`) | `$P/eval_agent_memory/llm_completions/` | Raw eval-agent LLM calls | |
| | Full message trajectory (if `ENABLE_FULL_TRAJECTORY_LOG=1`) | Inside `agent_runs/gen_N_*.json` | All messages incl. tool calls | |
| |
| ### Run-level logs (parallel scripts) |
| |
| | What | Path | |
| |------|------| |
| | Per-problem runner logs | `$RUN/_worker_logs/problem_<id>.log` | |
| | Per-slot eval service logs | `$RUN/_worker_logs/eval_service_port_<port>.log` | |
| |
| ### Logging flags (enable before running) |
| |
| | Flag | Effect | |
| |------|--------| |
| | `--trajectory-log` (runner CLI) | Save evolution sampler LLM prompts/responses | |
| | `ENABLE_FULL_TRAJECTORY_LOG=1` (env) | Save eval agent's full message trajectories | |
| | `OPENHANDS_LOG_COMPLETIONS=1` (env) | Save eval agent's raw LLM completions | |
| |
| --- |
| |
| ## Part 1: Evolution Agent Analysis |
| |
| The evolution agent's full trajectory lives in the `programs` table of `evolution_db.sqlite`, one database per problem directory (`$P/`). |
| |
| ### 1.1 Progress and Completion |
| |
| ```python |
| import sqlite3 |
| |
| db = f"{run_dir}/p{pid}/evolution_db.sqlite" |
| conn = sqlite3.connect(db) |
| max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] |
| best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] |
| total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0] |
| ``` |
| |
| - `max_gen < 49` and the DB `mtime` is old (the red-flag table at the end uses 300 s) → runner is stuck (likely waiting on LLM API) |
| - Check `ps aux | grep run_experiment` to see if workers are alive |
| |
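The two checks above can be folded into one helper. This is a minimal sketch: the name `is_stalled`, the `final_gen=49` default, and the 300 s age threshold are illustrative assumptions taken from the numbers used in this document, not part of the runner.

```python
import os
import sqlite3
import time


def is_stalled(db_path, final_gen=49, max_age_s=300.0, now=None):
    """Return True if the run is unfinished and the DB file has not been written recently.

    final_gen and max_age_s are assumed defaults; adjust them to the run config.
    """
    conn = sqlite3.connect(db_path)
    try:
        max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
    finally:
        conn.close()
    # An empty table means nothing has been evaluated yet
    max_gen = -1 if max_gen is None else max_gen
    age_s = (time.time() if now is None else now) - os.path.getmtime(db_path)
    return max_gen < final_gen and age_s > max_age_s
```

A `True` result only says the database is idle; confirming that workers are alive still requires the `ps` check.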
| ### 1.2 Archive and Parent Selection Health |
| |
| Evolution depends on `correct=1` to populate archive and select parents. If the archive is empty, evolution degenerates to "keep mutating the seed." |
| |
| ```sql |
| SELECT COUNT(*) FROM programs WHERE correct=1; |
| ``` |
| |
| - Ratio `n_correct / n_total`: |
|   - Near 0 → `correct` definition is broken for this task (it's too strict). Archive is always empty, no diversity. |
|   - Near 1 → `correct` definition is loose (covers any runnable code). Archive and parents work normally. |
| - In Frontier-CS, the original `correct=passed` (all test cases perfect) gave 0/51, so the **archive was silently empty**. Fixed by redefining `correct=True` for any compilable/runnable code. |
| |
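The ratio can be computed in one query. A minimal sketch, assuming `correct` is stored as an integer 0/1 column as the `WHERE correct=1` query suggests; `correct_ratio` is a hypothetical helper name.

```python
import sqlite3


def correct_ratio(conn):
    """Return (n_correct, n_total, ratio) for the programs table."""
    n_correct, n_total = conn.execute(
        # SUM over a boolean expression counts matching rows; NULLs are ignored
        "SELECT COALESCE(SUM(correct = 1), 0), COUNT(*) FROM programs"
    ).fetchone()
    ratio = n_correct / n_total if n_total else 0.0
    return n_correct, n_total, ratio
```

A ratio near 0 on a run that clearly produced runnable code points at the `correct` definition rather than at the programs.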
| ### 1.3 Per-Generation Best Score Trend |
| |
| Reveals whether evolution is actually progressing or stuck on a plateau: |
| |
| ```python |
| rows = conn.execute( |
| "SELECT generation, combined_score FROM programs ORDER BY generation" |
| ).fetchall() |
| gen_best = {} |
| for g, s in rows: |
|     if s is not None and (g not in gen_best or s > gen_best[g]): |
|         gen_best[g] = s |
| ``` |
| |
| Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for: |
| - **Step jumps at trigger points** → agent feedback or candidate helped |
| - **Gradual climb** → normal evolution, agent contribution unclear |
| - **Noise dominates** → variance too high, can't attribute improvements |
|
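One way to look for step jumps is to convert the per-generation bests into a best-so-far curve and measure the gain shortly after each trigger. The sketch below assumes a `{generation: best_score}` dict like `gen_best`; the `window=2` and `min_jump=1.0` defaults are illustrative thresholds, not values from the experiments.

```python
def running_best(gen_best):
    """Best-so-far score after each generation, from a {generation: best_score} dict."""
    out, best = {}, float("-inf")
    for g in sorted(gen_best):
        best = max(best, gen_best[g])
        out[g] = best
    return out


def jumps_after_triggers(gen_best, trigger_gens, window=2, min_jump=1.0):
    """Return {trigger_gen: gain} where best-so-far rose by >= min_jump within `window` gens."""
    rb = running_best(gen_best)
    gens = sorted(rb)
    hits = {}
    for t in trigger_gens:
        before = [rb[g] for g in gens if g <= t]
        after = [rb[g] for g in gens if t < g <= t + window]
        if before and after:
            gain = after[-1] - before[-1]
            if gain >= min_jump:
                hits[t] = gain
    return hits
```

For example, `jumps_after_triggers(gen_best, range(5, 50, 5))` lists the triggers that were followed by a jump. An empty result on a run whose score still improved suggests a gradual climb.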
|
| ### 1.4 Agent Candidates in the DB (v4+ only) |
|
|
| When the eval agent writes code directly, candidates enter the `programs` table with `metadata.source = "agent_candidate"`. |
|
|
| ```sql |
| SELECT COUNT(*) FROM programs |
| WHERE metadata LIKE '%"source": "agent_candidate"%'; |
| ``` |
|
|
| **Warning**: Don't use `LIKE '%agent_candidate%'` — that matches programs whose code content references the string. Always match the `source` key precisely. |
|
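The `LIKE` pattern also depends on the exact serialization (a single space after the colon). If that is in doubt, parsing the JSON is more tolerant. A minimal sketch, assuming `metadata` holds JSON text as the query above implies; `count_agent_candidates` is a hypothetical helper name.

```python
import json
import sqlite3


def count_agent_candidates(conn):
    """Count rows whose metadata JSON has source == "agent_candidate"."""
    n = 0
    for (raw,) in conn.execute("SELECT metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except (TypeError, ValueError):
            continue  # unparseable metadata is skipped, not counted
        if isinstance(meta, dict) and meta.get("source") == "agent_candidate":
            n += 1
    return n
```

This loads every metadata blob into Python, which is fine for a run of roughly 50 generations but slower than the SQL filter on large databases.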
|
| ### 1.5 Fair Comparison Under Compute Budget |
|
|
| Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison: |
|
|
| ```sql |
| SELECT MAX(combined_score) FROM programs WHERE generation <= (50 - n_candidates); |
| ``` |
|
|
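| `n_candidates` is a placeholder, not a column: substitute each run's candidate count (from the query in 1.4) so that `(50 - n_candidates)` normalizes total LLM calls per run, then compare. |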
|
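Because `n_candidates` differs per run, the cutoff is easiest to apply with a bound parameter. A minimal sketch; `budget_matched_best` and the `total_gens=50` default are assumptions for illustration.

```python
import sqlite3


def budget_matched_best(conn, n_candidates, total_gens=50):
    """Best combined_score restricted to generation <= total_gens - n_candidates.

    Returns None when no program falls inside the cutoff.
    """
    cutoff = total_gens - n_candidates
    row = conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()
    return row[0]
```

Compare this value for the agent run against the vanilla run's unrestricted best (`n_candidates=0`).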
|
| --- |
|
|
| ## Part 2: Evaluation Agent Analysis |
|
|
| The eval agent's trajectory lives in `$P/eval_agent_memory/` and `$P/gen_N/results/metrics.json` (aux metric outputs). |
|
|
| ### 2.1 Trigger History |
|
|
| ```python |
| import json |
| state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json")) |
| print(state["total_agent_runs"], state["last_agent_trigger_gen"]) |
| print(state["agent_trigger_history"]) |
| ``` |
|
|
| - `total_agent_runs` should match `(max_gen - 5) // 5 + 1` for periodic mode (K=5) |
| - If it's much lower, agent is being skipped (busy or erroring) |
|
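The expected count can be checked programmatically. The sketch below generalizes the K=5 formula to an arbitrary period `k`, assuming the first trigger fires at generation `k`; the `tolerance` argument and both function names are hypothetical.

```python
def expected_agent_runs(max_gen, k=5):
    """Expected trigger count in periodic mode: one run at gen k, 2k, ... up to max_gen."""
    return max(0, (max_gen - k) // k + 1)


def agent_is_skipped(state, max_gen, k=5, tolerance=1):
    """True if total_agent_runs trails the expected count by more than `tolerance`."""
    return state["total_agent_runs"] < expected_agent_runs(max_gen, k) - tolerance
```

With `max_gen = 49` and K=5 the expected count is 9 (triggers at gens 5, 10, ..., 45).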
|
| ### 2.2 Diagnostic Report Quality |
|
|
| Read `diagnostic_report.md` and classify: |
|
|
| - **Strong**: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction |
| - **Weak**: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code |
| - **Broken**: empty file, parse errors, or repeats previous report |
|
|
| Good diagnostic example (from p42): |
| > "`aux_reported_L` is exactly `aux_L0_baseline`... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps `a` at 250. Increase to 500+." |
|
|
| Weak example (from p241): |
| > "Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric." |
|
|
| ### 2.3 Aux Metrics: Are They Actually Running Code? |
|
|
| This is where silent failure hides. Check `auxiliary_metrics.py` for path correctness, then check the output it actually produced. |
| |
| ```python |
| # Check the file |
| aux = open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py").read() |
| # Good pattern: subprocess.run + path to main.cpp |
| # Red flag: only reads metrics.json or does Python simulation |
| ``` |
| |
| Then check the latest generation's aux metric output in the DB: |
| |
| ```python |
| import sqlite3, json |
| conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite") |
| conn.row_factory = sqlite3.Row |
| row = conn.execute( |
| "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1" |
| ).fetchone() |
| pub = json.loads(row["public_metrics"]) |
| aux_keys = [k for k in pub if k.startswith("aux_")] |
| ``` |
| |
| Classify what you see: |
|
|
| - **Meaningful**: `aux_runtime_ms_N35`, `aux_num_rects`, `aux_reported_L_n11` — real measurements |
| - **Meta-only noise**: `aux_aux_metric_eval_success`, `aux_aux_metric_error_code` — function ran but returned no real data |
| - **Compile failure**: `aux_compile_success = 0` → path bug or broken `auxiliary_metrics.py` |
| |
| Meta-only noise with no real measurements = aux metrics are silently broken. |
| |
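The three classes can be assigned mechanically from the `public_metrics` dict. This is a rough sketch: the substring markers mirror the filter used in the quick diagnostic script, and the return labels and the name `classify_aux` are hypothetical.

```python
# Substrings treated as meta-only keys (same filter as the quick diagnostic script)
META_MARKERS = ("metric_eval", "error_code")


def classify_aux(public_metrics):
    """Return 'compile_failure', 'meaningful', 'meta_only', or 'no_aux'."""
    aux = {k: v for k, v in public_metrics.items() if k.startswith("aux_")}
    if aux.get("aux_compile_success") == 0:
        return "compile_failure"
    real = [
        k for k in aux
        if k != "aux_compile_success" and not any(m in k for m in META_MARKERS)
    ]
    if real:
        return "meaningful"
    return "meta_only" if aux else "no_aux"
```

A `meaningful` result only means real-looking keys exist; whether the values make sense still needs a manual look.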
| ### 2.4 The Path Bug Check |
| |
| The most common silent failure: `auxiliary_metrics.py` uses the wrong path to the code file. |
|
|
| ```bash |
| grep "code_path\|main.cpp\|results_dir" pN/eval_agent_memory/auxiliary_metrics.py |
| ``` |
|
|
| - `os.path.join(results_dir, 'main.cpp')` — correct if service passes `gen_N/` |
| - `os.path.join(results_dir, '..', 'main.cpp')` — correct if service passes `gen_N/results/` |
| - Mismatch → compile always fails, every metric returns 0 |
|
|
| Cross-check: what does the service actually pass? Look in `ev2_service_standalone.py` for the `evaluate_func(request.results_dir)` call and whether it's adjusted. |
|
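When it is unclear which directory the service passes, probing both layouts on a real generation directory settles it quickly. A minimal sketch; `locate_source` and its return labels are hypothetical and only cover the two layouts listed above.

```python
import os


def locate_source(results_dir, filename="main.cpp"):
    """Report which of the two layouts actually contains the source file."""
    same_dir = os.path.join(results_dir, filename)
    parent_dir = os.path.normpath(os.path.join(results_dir, "..", filename))
    if os.path.isfile(same_dir):
        return "same_dir", same_dir      # service passed gen_N/
    if os.path.isfile(parent_dir):
        return "parent_dir", parent_dir  # service passed gen_N/results/
    return "missing", None
```

Call it with the value the service would pass; if the answer disagrees with the `os.path.join(...)` in `auxiliary_metrics.py`, that is the path bug.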
|
| ### 2.5 Agent Candidate Outputs (v4+) |
|
|
| When the agent writes code directly, compare candidate score vs the original best: |
|
|
| ```python |
| # Name-based access below needs conn.row_factory = sqlite3.Row (set in 2.3) |
| rows = conn.execute( |
|     "SELECT generation, combined_score, metadata FROM programs " |
|     "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' " |
|     "ORDER BY generation" |
| ).fetchall() |
| for r in rows: |
|     meta = json.loads(r["metadata"]) |
|     print(f"gen={r['generation']} candidate={r['combined_score']:.2f} " |
|           f"original={meta.get('original_score', '?')}") |
| ``` |
|
|
| - Candidate beats original → agent successfully identified and implemented a fix |
| - Candidate = 0 → agent's code didn't compile or scored 0 |
| - Candidate < original → agent made things worse |
|
|
| Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code. |
|
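The ratio can be computed from the same rows. A minimal sketch, assuming `metadata` carries a numeric `original_score` as in the loop above; `candidate_win_rate` is a hypothetical helper, and rows lacking either number are skipped rather than counted as losses.

```python
import json


def candidate_win_rate(rows):
    """rows: iterable of (combined_score, metadata_json). Returns (wins, comparable, rate)."""
    wins = comparable = 0
    for score, raw in rows:
        meta = json.loads(raw) if raw else {}
        original = meta.get("original_score")
        if score is None or not isinstance(original, (int, float)):
            continue  # cannot compare without both numbers
        comparable += 1
        wins += score >= original
    rate = wins / comparable if comparable else None
    return wins, comparable, rate
```

Pass it `[(r["combined_score"], r["metadata"]) for r in rows]`. A `rate` of `None` means no candidate had a recorded original score.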
|
| ### 2.6 Effect Attribution |
|
|
| Score gains are hard to attribute to the agent rather than to run-to-run variance. Use two cross-checks: |
|
|
| - **Compare to matched vanilla**: fork from the same gen 5, same problems, both run 50 gens. Any diff beyond ±5 per problem and ±2 on average is plausibly signal, otherwise noise. |
| - **Look at trigger→jump alignment**: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise. |
|
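The matched-vanilla comparison can be sketched as a small function over per-problem best scores. The ±5 and ±2 thresholds are the rules of thumb stated above; `attribute_effect` and the shape of its return value are assumptions for illustration.

```python
def attribute_effect(agent_best, vanilla_best, per_problem_thr=5.0, mean_thr=2.0):
    """Compare {problem_id: best_score} dicts from an agent run and a matched vanilla run."""
    shared = sorted(set(agent_best) & set(vanilla_best))
    diffs = {p: agent_best[p] - vanilla_best[p] for p in shared}
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return {
        "diffs": diffs,
        # Problems whose difference exceeds the per-problem noise band
        "beyond_threshold": [p for p in shared if abs(diffs[p]) > per_problem_thr],
        "mean_diff": mean_diff,
        "mean_is_signal": abs(mean_diff) > mean_thr,
    }
```

Only problems present in both runs are compared. With few shared problems the mean is itself noisy, so treat `mean_is_signal` as a hint rather than a test.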
|
| --- |
|
|
| ## Quick Diagnostic Script |
|
|
| One-shot sanity check for a running experiment: |
|
|
| ```python |
| import sqlite3, os, json, time |
| |
| def analyze_run(run_dir, pids, vanilla_dir=None): |
|     for pid in pids: |
|         db = f"{run_dir}/p{pid}/evolution_db.sqlite" |
|         conn = sqlite3.connect(db); conn.row_factory = sqlite3.Row |
| |
|         mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0 |
|         best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0 |
|         n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0] |
|         n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0] |
|         n_cand = conn.execute( |
|             "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'" |
|         ).fetchone()[0] |
| |
|         row = conn.execute( |
|             "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1" |
|         ).fetchone() |
|         pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {} |
|         real_aux = [k for k in pub if k.startswith("aux_") |
|                     and "metric_eval" not in k and "error_code" not in k] |
| |
|         state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json" |
|         runs = json.load(open(state_path))["total_agent_runs"] if os.path.exists(state_path) else 0 |
| |
|         age = time.time() - os.path.getmtime(db) |
|         vanilla_best = "" |
|         if vanilla_dir: |
|             bv = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite").execute( |
|                 "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0 |
|             vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}" |
| |
|         print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | " |
|               f"correct={n_corr}/{n_total} | cands={n_cand} | " |
|               f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s") |
| ``` |
|
|
| Red flags to watch for: |
|
|
| | Symptom | Meaning | |
| |---------|---------| |
| | `correct=0/N` | `correct` definition too strict, archive broken | |
| | `db_age > 300s` while `mg < 49` | runner stuck on LLM call | |
| | `agent_runs << (mg-5)//5 + 1` | agent being skipped | |
| | `real_aux = 0` but `agent_runs > 0` | aux metrics silently failing (likely path bug) | |
| | `cands = 0` but using v4 | agent not producing candidates (check prompt) | |
| | Big diff from vanilla but `cands = 0` | signal is from text feedback alone | |
|
|