Analyzing Agent Trajectories
How to inspect what the agents actually did during an experiment — distinct from "what was the final score." Use this to diagnose whether the agent is adding value or failing silently.
Two agents to analyze separately:
- Evolution Agent: the code-generation LLM that mutates programs each generation
- Evaluation Agent (eval_agent / EV2): the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates
Where to Find the Trajectories
All artifacts live under the experiment directory. Assume $RUN = results/frontier_cs_algorithmic/<experiment_name> and $P = $RUN/p<id> for a single problem.
Evolution Agent trajectory
| What | Path | Contents |
|---|---|---|
| Full program history | $P/evolution_db.sqlite | programs table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata |
| Code each gen | $P/gen_N/main.cpp | The evaluated source at generation N |
| Patch each gen | $P/gen_N/edit.diff, $P/gen_N/search_replace.txt | What the evolution LLM changed |
| Parent before patch | $P/gen_N/original.cpp | Code that was mutated |
| Evaluation output | $P/gen_N/results/metrics.json, $P/gen_N/results/correct.json | Primary evaluator output (includes aux metrics merged as aux_*) |
| LLM call traces (if --trajectory-log) | $P/gen_N/llm_trajectories/ | Raw LLM prompts and responses for the code-gen LLM |
| Run config | $P/experiment_config.yaml | All evolution settings used |
Evaluation Agent (EV2) trajectory
| What | Path | Contents |
|---|---|---|
| Service state | $P/eval_agent_memory/service_state.json | total_agent_runs, last_agent_trigger_gen, agent_trigger_history, full generation_history (primary_score per gen) |
| Current diagnostic | $P/eval_agent_memory/diagnostic_report.md | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) |
| Cross-gen memory | $P/eval_agent_memory/EVAL_AGENTS.md | Append-only log: hypothesis/verdict per trigger |
| Current aux metrics script | $P/eval_agent_memory/auxiliary_metrics.py | The evaluate_aux() function the agent wrote |
| Aux metrics backup | $P/eval_agent_memory/auxiliary_metrics.py.bak | Previous version before agent's last edit |
| Per-trigger results | $P/eval_agent_memory/agent_runs/gen_N_result.json | Agent output summary per trigger |
| Agent candidate (v4+) | $P/eval_agent_memory/agent_candidate.cpp | Code the agent wrote as a direct fix |
| Per-gen aux snapshot | $P/gen_N/results/auxiliary_metrics_snapshot.py | Copy of aux_metrics.py used at gen N |
| LLM raw completions (if OPENHANDS_LOG_COMPLETIONS=1) | $P/eval_agent_memory/llm_completions/ | Raw eval-agent LLM calls |
| Full message trajectory (if ENABLE_FULL_TRAJECTORY_LOG=1) | Inside agent_runs/gen_N_*.json | All messages incl. tool calls |
Run-level logs (parallel scripts)
| What | Path |
|---|---|
| Per-problem runner logs | $RUN/_worker_logs/problem_<id>.log |
| Per-slot eval service logs | $RUN/_worker_logs/eval_service_port_<port>.log |
Logging flags (enable before running)
| Flag | Effect |
|---|---|
| --trajectory-log (runner CLI) | Save evolution sampler LLM prompts/responses |
| ENABLE_FULL_TRAJECTORY_LOG=1 (env) | Save eval agent's full message trajectories |
| OPENHANDS_LOG_COMPLETIONS=1 (env) | Save eval agent's raw LLM completions |
Part 1: Evolution Agent Analysis
The evolution agent's full trajectory lives in the evolution_db.sqlite (the programs table) under each pN/ directory.
1.1 Progress and Completion
import sqlite3
db = f"{run_dir}/p{pid}/evolution_db.sqlite"
conn = sqlite3.connect(db)
max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
- max_gen < 49 and DB mtime old → runner is stuck (likely waiting on LLM API)
- Check ps aux | grep run_experiment to see if workers are alive
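The stuck-runner check can be sketched as a small helper. This is a minimal sketch: the function names are hypothetical, the 300-second staleness threshold is borrowed from the red-flags table at the end of this document, and the final generation is assumed to be 49.

```python
import os
import time


def is_stalled(max_gen, db_mtime, last_gen=49, stale_after_s=300, now=None):
    """Return True if the run is unfinished and the DB has not been written recently.

    last_gen and stale_after_s are assumptions; adjust them to the run's settings.
    """
    now = time.time() if now is None else now
    gen = -1 if max_gen is None else max_gen  # an empty programs table counts as unfinished
    return gen < last_gen and (now - db_mtime) > stale_after_s


def db_is_stalled(db_path, max_gen):
    """Convenience wrapper that reads the modification time from disk."""
    return is_stalled(max_gen, os.path.getmtime(db_path))
```

Calling db_is_stalled(db, max_gen) with the values from the snippet above gives a yes/no answer; a True result is the cue to check whether the worker processes are still alive.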
1.2 Archive and Parent Selection Health
Evolution depends on correct=1 to populate archive and select parents. If the archive is empty, evolution degenerates to "keep mutating the seed."
SELECT COUNT(*) FROM programs WHERE correct=1;
- Ratio n_correct / n_total:
  - Near 0 → correct definition is broken for this task (it's too strict). Archive is always empty, no diversity.
  - Near 1 → correct definition is loose (covers any runnable code). Archive and parents work normally.
- In Frontier-CS, the original correct=passed (all test cases perfect) gave 0/51, so the archive was silently empty. Fixed by redefining correct=True for any compilable/runnable code.
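The ratio described above can be computed directly from the programs table. A minimal sketch, assuming only the correct column documented earlier; the function name is hypothetical.

```python
import sqlite3


def correct_ratio(conn):
    """Return (n_correct, n_total, ratio) for the programs table."""
    n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
    n_correct = conn.execute(
        "SELECT COUNT(*) FROM programs WHERE correct=1"
    ).fetchone()[0]
    # Guard against an empty table so the ratio is always defined.
    ratio = n_correct / n_total if n_total else 0.0
    return n_correct, n_total, ratio
```

A ratio near 0 on a run with many programs is the empty-archive case described above.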
1.3 Per-Generation Best Score Trend
Reveals whether evolution is actually progressing or stuck on a plateau:
rows = conn.execute(
    "SELECT generation, combined_score FROM programs ORDER BY generation"
).fetchall()
gen_best = {}
for g, s in rows:
    if s is not None and (g not in gen_best or s > gen_best[g]):
        gen_best[g] = s
Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for:
- Step jumps at trigger points → agent feedback or candidate helped
- Gradual climb → normal evolution, agent contribution unclear
- Noise dominates → variance too high, can't attribute improvements
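One way to look for step jumps at trigger points is to convert gen_best into a best-so-far curve and measure the gain shortly after each trigger. The sketch below is illustrative only: both function names, the 2-generation window, and the min_jump threshold are assumptions to tune per task.

```python
def running_best(gen_best):
    """Turn per-generation best scores into a best-so-far curve."""
    curve, best = {}, float("-inf")
    for g in sorted(gen_best):
        best = max(best, gen_best[g])
        curve[g] = best
    return curve


def jumps_after_triggers(curve, trigger_gens, window=2, min_jump=1.0):
    """List (trigger_gen, gain) where best-so-far rose by >= min_jump
    within `window` generations after a trigger."""
    hits = []
    gens = sorted(curve)
    for t in trigger_gens:
        before = [curve[g] for g in gens if g <= t]
        after = [curve[g] for g in gens if t < g <= t + window]
        if before and after:
            gain = max(after) - before[-1]
            if gain >= min_jump:
                hits.append((t, gain))
    return hits
```

An empty result across many triggers points to gradual climb or noise rather than agent-driven jumps.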
1.4 Agent Candidates in the DB (v4+ only)
When the eval agent writes code directly, candidates enter the programs table with metadata.source = "agent_candidate".
SELECT COUNT(*) FROM programs
WHERE metadata LIKE '%"source": "agent_candidate"%';
Warning: Don't use LIKE '%agent_candidate%' — that matches programs whose code content references the string. Always match the source key precisely.
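An alternative to string matching is to parse the metadata column and compare the source key exactly. This is a sketch that assumes metadata is stored as JSON object text with a top-level source key, as the queries in this document imply; the function name is hypothetical.

```python
import json
import sqlite3


def count_agent_candidates(conn):
    """Count rows whose parsed metadata has source == 'agent_candidate'."""
    n = 0
    for (raw,) in conn.execute("SELECT metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except (TypeError, ValueError):
            continue  # skip rows whose metadata is not valid JSON
        if isinstance(meta, dict) and meta.get("source") == "agent_candidate":
            n += 1
    return n
```

Unlike a LIKE pattern, this does not depend on how the JSON was serialized (spacing after the colon, key order), at the cost of scanning every row in Python.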
1.5 Fair Comparison Under Compute Budget
Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison:
SELECT MAX(combined_score) FROM programs WHERE generation <= (50 - n_candidates);
Compute (50 - n_candidates) per run to normalize total LLM calls, then compare.
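The normalization above can be wrapped in a helper that takes the candidate count and applies the cutoff. A minimal sketch: the function name is hypothetical and total_gens=50 is assumed from the example in this section.

```python
import sqlite3


def budget_matched_best(conn, n_candidates, total_gens=50):
    """Best score using only generations <= total_gens - n_candidates.

    Returns (cutoff_generation, best_score); best_score is None if no rows qualify.
    """
    cutoff = total_gens - n_candidates
    row = conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()
    return cutoff, row[0]
```

Compare the returned score from the agent run against the vanilla run's unrestricted best to keep the total number of LLM calls roughly equal.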
Part 2: Evaluation Agent Analysis
The eval agent's trajectory lives in pN/eval_agent_memory/ and pN/gen_N/results/metrics.json (aux metric outputs).
2.1 Trigger History
import json
state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"))
print(state["total_agent_runs"], state["last_agent_trigger_gen"])
print(state["agent_trigger_history"])
- total_agent_runs should match (max_gen - 5) // 5 + 1 for periodic mode (K=5)
- If it's much lower, agent is being skipped (busy or erroring)
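The expected trigger count can be computed and compared against service_state.json. This sketch generalizes the K=5 formula to an arbitrary period k, which is an assumption (first trigger at generation k); the function names and the 0.5 tolerance standing in for "much lower" are also hypothetical.

```python
def expected_agent_runs(max_gen, k=5):
    """Expected number of periodic triggers, assuming the first fires at generation k."""
    if max_gen is None or max_gen < k:
        return 0
    return (max_gen - k) // k + 1


def agent_is_skipped(total_agent_runs, max_gen, k=5, tolerance=0.5):
    """Flag runs where the agent ran far less often than the schedule implies."""
    return total_agent_runs < tolerance * expected_agent_runs(max_gen, k)
```

For a finished 50-generation run (max_gen = 49) the expected count is 9, which matches the 9 triggers used in the compute-budget example earlier.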
2.2 Diagnostic Report Quality
Read diagnostic_report.md and classify:
- Strong: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction
- Weak: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code
- Broken: empty file, parse errors, or repeats previous report
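A crude first-pass triage of the report can be automated before reading it by hand. The heuristic below is an assumption, not part of the pipeline: it treats an empty or repeated report as broken, a report with standalone numbers and no hedging phrases as strong, and everything else as weak. The phrase list and function name are hypothetical, and parse errors are not detected.

```python
import re

# Hedging phrases taken from the weak-report description; extend as needed.
HEDGES = ("may have", "might have", "consider", "cannot be determined", "monitor")

# A digit that is not part of an identifier such as aux_L0 or n11.
STANDALONE_NUMBER = re.compile(r"(?<![A-Za-z_0-9])\d")


def classify_report(text, previous_text=None):
    """Return 'broken', 'strong', or 'weak' for a diagnostic report (heuristic only)."""
    body = (text or "").strip()
    if not body or (previous_text is not None and body == previous_text.strip()):
        return "broken"
    lowered = body.lower()
    has_numbers = bool(STANDALONE_NUMBER.search(body))
    hedged = any(phrase in lowered for phrase in HEDGES)
    return "strong" if has_numbers and not hedged else "weak"
```

Treat the label as a prompt for where to look first; a report can contain numbers and still name the wrong bottleneck.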
Good diagnostic example (from p42):
"aux_reported_L is exactly aux_L0_baseline... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps a at 250. Increase to 500+."
Weak example (from p241):
"Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric."
2.3 Aux Metrics: Are They Actually Running Code?
This is where silent failure hides. Check the aux_metrics.py for path correctness and run output.
# Check the file
aux = open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py").read()
# Good pattern: subprocess.run + path to main.cpp
# Red flag: only reads metrics.json or does Python simulation
Then check the latest generation's aux metric output in the DB:
import sqlite3, json
conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite")
conn.row_factory = sqlite3.Row
row = conn.execute(
"SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
).fetchone()
pub = json.loads(row["public_metrics"])
aux_keys = [k for k in pub if k.startswith("aux_")]
Classify what you see:
- Meaningful: aux_runtime_ms_N35, aux_num_rects, aux_reported_L_n11 (real measurements)
- Meta-only noise: aux_aux_metric_eval_success, aux_aux_metric_error_code (function ran but returned no real data)
- Compile failure: aux_compile_success = 0 → path bug or broken aux_metrics
Meta-only noise with no real measurements = aux metrics are silently broken.
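The classification above can be applied to the parsed public_metrics dict. A minimal sketch: the function name and status labels are hypothetical, the meta-key markers follow the Quick Diagnostic Script later in this document, and excluding aux_compile_success from the real measurements is an assumption.

```python
META_MARKERS = ("metric_eval", "error_code")


def classify_aux(pub):
    """Return (status, real_keys, meta_keys) for a public_metrics dict."""
    aux = {k: v for k, v in pub.items() if k.startswith("aux_")}
    meta = sorted(k for k in aux if any(m in k for m in META_MARKERS))
    # Assumption: the compile flag is bookkeeping, not a measurement.
    real = sorted(k for k in aux if k not in meta and k != "aux_compile_success")
    if aux.get("aux_compile_success") == 0:
        status = "compile_failure"
    elif real:
        status = "meaningful"
    elif meta:
        status = "meta_only"
    else:
        status = "no_aux"
    return status, real, meta
```

Passing the pub dict from the snippet above returns the status plus the key lists, so a meta_only result can be traced to the exact keys that were emitted.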
2.4 The Path Bug Check
The most common silent failure: aux_metrics.py uses the wrong path to the code file.
grep "code_path\|main.cpp\|results_dir" pN/eval_agent_memory/auxiliary_metrics.py
- os.path.join(results_dir, 'main.cpp'): correct if service passes gen_N/
- os.path.join(results_dir, '..', 'main.cpp'): correct if service passes gen_N/results/
- Mismatch → compile always fails, every metric returns 0
Cross-check: what does the service actually pass? Look in ev2_service_standalone.py for the evaluate_func(request.results_dir) call and whether it's adjusted.
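If the results_dir value the service passes is known (for example from a log line), the two path conventions can be tested against the filesystem. This is a sketch with a hypothetical function name; it only checks for the existence of main.cpp and says nothing about what the service does internally.

```python
import os


def locate_main_cpp(results_dir):
    """Return (join_expression, normalized_path) for the first convention
    that finds main.cpp, or None if neither does."""
    candidates = {
        "results_dir/main.cpp": os.path.join(results_dir, "main.cpp"),
        "results_dir/../main.cpp": os.path.join(results_dir, "..", "main.cpp"),
    }
    for label, path in candidates.items():
        if os.path.isfile(path):
            return label, os.path.normpath(path)
    return None
```

If the label returned here differs from the join used in auxiliary_metrics.py, that mismatch is the path bug described above.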
2.5 Agent Candidate Outputs (v4+)
When the agent writes code directly, compare candidate score vs the original best:
# Requires conn.row_factory = sqlite3.Row (set in 2.3) for name-based access.
rows = conn.execute(
    "SELECT generation, combined_score, metadata FROM programs "
    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' "
    "ORDER BY generation"
).fetchall()
for r in rows:
    meta = json.loads(r["metadata"])
    print(f"gen={r['generation']} candidate={r['combined_score']:.2f} "
          f"original={meta.get('original_score', '?')}")
- Candidate beats original → agent successfully identified and implemented a fix
- Candidate = 0 → agent's code didn't compile or scored 0
- Candidate < original → agent made things worse
Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code.
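The ratio can be computed from (candidate score, original score) pairs. A minimal sketch with a hypothetical function name; it assumes original_score is present in the candidate's metadata, as in the loop above, and skips pairs where either score is missing.

```python
def candidate_win_rate(pairs):
    """Fraction of candidates scoring >= the original best.

    pairs: iterable of (candidate_score, original_score).
    Returns None when no pair has both scores.
    """
    usable = [(c, o) for c, o in pairs if c is not None and o is not None]
    if not usable:
        return None
    wins = sum(1 for c, o in usable if c >= o)
    return wins / len(usable)
```

The pairs can be built from the query above, for instance as (r["combined_score"], json.loads(r["metadata"]).get("original_score")) for each row.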
2.6 Effect Attribution
Score gains are hard to attribute to the agent rather than to run-to-run variance. Use two cross-checks:
- Compare to matched vanilla: fork from the same gen 5, same problems, both run 50 gens. Any diff beyond ±5 per problem and ±2 on average is plausibly signal, otherwise noise.
- Look at trigger→jump alignment: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise.
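The matched-vanilla comparison can be sketched as follows. The thresholds are the ±5 per-problem and ±2 average values quoted above; the function name and the shape of the inputs (dicts mapping problem id to best score) are assumptions, and the two thresholds are reported separately rather than combined.

```python
def attribute_effect(agent_best, vanilla_best, per_problem_thr=5.0, mean_thr=2.0):
    """Compare best scores from matched agent and vanilla runs.

    agent_best / vanilla_best: dicts mapping problem id -> best score.
    Only problems present in both runs are compared.
    """
    shared = sorted(set(agent_best) & set(vanilla_best))
    diffs = {pid: agent_best[pid] - vanilla_best[pid] for pid in shared}
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    outliers = {pid: d for pid, d in diffs.items() if abs(d) > per_problem_thr}
    return {
        "mean_diff": mean_diff,
        "mean_is_signal": abs(mean_diff) > mean_thr,
        "per_problem_signal": outliers,
    }
```

Differences inside both thresholds are best read as noise; the trigger-to-jump alignment check is still needed before crediting the agent.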
Quick Diagnostic Script
One-shot sanity check for a running experiment:
import sqlite3, os, json, time

def analyze_run(run_dir, pids, vanilla_dir=None):
    for pid in pids:
        db = f"{run_dir}/p{pid}/evolution_db.sqlite"
        conn = sqlite3.connect(db)
        conn.row_factory = sqlite3.Row
        mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0
        best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
        n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0]
        n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
        n_cand = conn.execute(
            "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
        ).fetchone()[0]
        row = conn.execute(
            "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
        ).fetchone()
        pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {}
        real_aux = [k for k in pub if k.startswith("aux_")
                    and "metric_eval" not in k and "error_code" not in k]
        state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"
        runs = json.load(open(state_path))["total_agent_runs"] if os.path.exists(state_path) else 0
        age = time.time() - os.path.getmtime(db)
        vanilla_best = ""
        if vanilla_dir:
            bv = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite").execute(
                "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
            vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}"
        print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | "
              f"correct={n_corr}/{n_total} | cands={n_cand} | "
              f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s")
Red flags to watch for:
| Symptom | Meaning |
|---|---|
| correct=0/N | correct definition too strict, archive broken |
| db_age > 300s while mg < 49 | runner stuck on LLM call |
| agent_runs << (mg-5)/5 | agent being skipped |
| real_aux = 0 but agent_runs > 0 | aux metrics silently failing (likely path bug) |
| cands = 0 but using v4 | agent not producing candidates (check prompt) |
| Big diff from vanilla but cands = 0 | signal is from text feedback alone |
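The table can also be applied mechanically to the values the diagnostic script prints. This is a sketch under stated assumptions: the function name is hypothetical, "<<" is interpreted as less than half the expected count, uses_v4 must be supplied by the caller, and the vanilla-diff row is left out because it needs a matched baseline.

```python
def red_flags(mg, n_corr, n_total, runs, real_aux, cands, db_age,
              uses_v4=False, last_gen=49, k=5):
    """Return a list of red-flag messages for one problem's summary values."""
    flags = []
    if n_total and n_corr == 0:
        flags.append("correct=0/N: correct definition too strict, archive broken")
    if db_age > 300 and mg < last_gen:
        flags.append("runner stuck on LLM call")
    expected = (mg - k) / k
    # Assumption: "much lower" means under half of the expected trigger count.
    if expected > 0 and runs < 0.5 * expected:
        flags.append("agent being skipped")
    if real_aux == 0 and runs > 0:
        flags.append("aux metrics silently failing (likely path bug)")
    if uses_v4 and cands == 0:
        flags.append("agent not producing candidates (check prompt)")
    return flags
```

An empty list does not prove the run is healthy; it only means none of these specific symptoms were present.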