
Analyzing Agent Trajectories

This guide shows how to inspect what the agents actually did during an experiment, as distinct from "what was the final score." Use it to diagnose whether an agent is adding value or failing silently.

Two agents to analyze separately:

  • Evolution Agent: the code-generation LLM that mutates programs each generation
  • Evaluation Agent (eval_agent / EV2): the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates

Where to Find the Trajectories

All artifacts live under the experiment directory. Assume $RUN = results/frontier_cs_algorithmic/<experiment_name> and $P = $RUN/p<id> for a single problem.

Evolution Agent trajectory

| What | Path | Contents |
| --- | --- | --- |
| Full program history | $P/evolution_db.sqlite | programs table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata |
| Code each gen | $P/gen_N/main.cpp | The evaluated source at generation N |
| Patch each gen | $P/gen_N/edit.diff, $P/gen_N/search_replace.txt | What the evolution LLM changed |
| Parent before patch | $P/gen_N/original.cpp | Code that was mutated |
| Evaluation output | $P/gen_N/results/metrics.json, $P/gen_N/results/correct.json | Primary evaluator output (includes aux metrics merged as aux_*) |
| LLM call traces (if --trajectory-log) | $P/gen_N/llm_trajectories/ | Raw LLM prompts and responses for the code-gen LLM |
| Run config | $P/experiment_config.yaml | All evolution settings used |

Evaluation Agent (EV2) trajectory

| What | Path | Contents |
| --- | --- | --- |
| Service state | $P/eval_agent_memory/service_state.json | total_agent_runs, last_agent_trigger_gen, agent_trigger_history, full generation_history (primary_score per gen) |
| Current diagnostic | $P/eval_agent_memory/diagnostic_report.md | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) |
| Cross-gen memory | $P/eval_agent_memory/EVAL_AGENTS.md | Append-only log: hypothesis/verdict per trigger |
| Current aux metrics script | $P/eval_agent_memory/auxiliary_metrics.py | The evaluate_aux() function the agent wrote |
| Aux metrics backup | $P/eval_agent_memory/auxiliary_metrics.py.bak | Previous version before agent's last edit |
| Per-trigger results | $P/eval_agent_memory/agent_runs/gen_N_result.json | Agent output summary per trigger |
| Agent candidate (v4+) | $P/eval_agent_memory/agent_candidate.cpp | Code the agent wrote as a direct fix |
| Per-gen aux snapshot | $P/gen_N/results/auxiliary_metrics_snapshot.py | Copy of auxiliary_metrics.py used at gen N |
| LLM raw completions (if OPENHANDS_LOG_COMPLETIONS=1) | $P/eval_agent_memory/llm_completions/ | Raw eval-agent LLM calls |
| Full message trajectory (if ENABLE_FULL_TRAJECTORY_LOG=1) | Inside agent_runs/gen_N_*.json | All messages incl. tool calls |

Run-level logs (parallel scripts)

| What | Path |
| --- | --- |
| Per-problem runner logs | $RUN/_worker_logs/problem_<id>.log |
| Per-slot eval service logs | $RUN/_worker_logs/eval_service_port_<port>.log |

Logging flags (enable before running)

| Flag | Effect |
| --- | --- |
| --trajectory-log (runner CLI) | Save evolution sampler LLM prompts/responses |
| ENABLE_FULL_TRAJECTORY_LOG=1 (env) | Save eval agent's full message trajectories |
| OPENHANDS_LOG_COMPLETIONS=1 (env) | Save eval agent's raw LLM completions |

Part 1: Evolution Agent Analysis

The evolution agent's full trajectory lives in the programs table of evolution_db.sqlite under each p<id>/ directory.

1.1 Progress and Completion

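import sqlite3

# run_dir is the experiment directory ($RUN) and pid is the problem id; set both before running.
db = f"{run_dir}/p{pid}/evolution_db.sqlite"
conn = sqlite3.connect(db)
# Highest generation reached, best score so far, and number of evaluated programs.
max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]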
  • max_gen < 49 and the DB file's mtime is more than a few minutes old → runner is stuck (likely waiting on the LLM API)
  • Check ps aux | grep run_experiment to see whether the workers are still alive
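
The two stall conditions above can be folded into one check. This is a minimal sketch, assuming a 50-generation run (last index 49) and treating 300 seconds without a DB write as "old", the same threshold the red-flag table at the end of this document uses. The helper name `is_stalled` and its defaults are illustrative and not part of the runner.

```python
import os
import sqlite3
import time


def is_stalled(db_path, last_gen=49, stale_after_s=300.0, now=None):
    """Return True if the run is unfinished and the DB has not been written recently.

    last_gen and stale_after_s are assumptions; adjust them to the run config.
    """
    conn = sqlite3.connect(db_path)
    try:
        max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
    finally:
        conn.close()
    # Age of the DB file in seconds; `now` can be injected for testing.
    age_s = (now if now is not None else time.time()) - os.path.getmtime(db_path)
    unfinished = max_gen is None or max_gen < last_gen
    return unfinished and age_s > stale_after_s
```

A True result only says the DB stopped changing before the last generation; the ps check is still needed to tell a hung worker from a dead one.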

1.2 Archive and Parent Selection Health

Evolution depends on correct=1 to populate the archive and select parents. If the archive is empty, evolution degenerates to "keep mutating the seed."

SELECT COUNT(*) FROM programs WHERE correct=1;
  • Ratio n_correct / n_total:
    • Near 0 → correct definition is broken for this task (it's too strict). Archive is always empty, no diversity.
    • Near 1 → correct definition is loose (covers any runnable code). Archive and parents work normally.
  • In Frontier-CS, the original correct=passed (all test cases perfect) gave 0/51 — archive silently empty. Fixed by redefining correct=True for any compilable/runnable code.
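
The ratio can be computed in one query. A minimal sketch, assuming `correct` is stored as 0/1 in the programs table; the helper name `correct_ratio` is illustrative. It deliberately returns the raw ratio rather than a verdict, because what counts as "near 0" or "near 1" is a judgment call.

```python
import sqlite3


def correct_ratio(conn):
    """Return (n_correct, n_total, ratio) for the programs table."""
    # SUM(correct = 1) counts rows where the comparison is true; NULLs are ignored.
    n_correct, n_total = conn.execute(
        "SELECT COALESCE(SUM(correct = 1), 0), COUNT(*) FROM programs"
    ).fetchone()
    ratio = n_correct / n_total if n_total else 0.0
    return n_correct, n_total, ratio
```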

1.3 Per-Generation Best Score Trend

Reveals whether evolution is actually progressing or stuck on a plateau:

rows = conn.execute(
    "SELECT generation, combined_score FROM programs ORDER BY generation"
).fetchall()
gen_best = {}
for g, s in rows:
    if s is not None and (g not in gen_best or s > gen_best[g]):
        gen_best[g] = s

Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for:

  • Step jumps at trigger points → agent feedback or candidate helped
  • Gradual climb → normal evolution, agent contribution unclear
  • Noise dominates → variance too high, can't attribute improvements
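
To make the "step jump at a trigger point" pattern checkable, the sketch below lists every generation that sets a new running best and tags it with the most recent trigger within a small window. It assumes `gen_best` is the dict built above and that trigger generations are known (for example from agent_trigger_history). The function name and the default window of 2 generations are illustrative.

```python
def jumps_after_triggers(gen_best, trigger_gens, window=2, min_jump=0.0):
    """List (generation, gain, nearest_trigger) for each new running best.

    nearest_trigger is the latest trigger at most `window` generations before
    the jump (or at the same generation), or None if no trigger is that close.
    """
    jumps = []
    running_best = None
    for gen in sorted(gen_best):
        score = gen_best[gen]
        if running_best is not None and score - running_best > min_jump:
            recent = [t for t in trigger_gens if 0 <= gen - t <= window]
            jumps.append((gen, score - running_best, max(recent) if recent else None))
        if running_best is None or score > running_best:
            running_best = score
    return jumps
```

Jumps tagged None happened away from any trigger, which points to ordinary evolution rather than agent feedback.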

1.4 Agent Candidates in the DB (v4+ only)

When the eval agent writes code directly, candidates enter the programs table with metadata.source = "agent_candidate".

SELECT COUNT(*) FROM programs
WHERE metadata LIKE '%"source": "agent_candidate"%';

Warning: Don't use LIKE '%agent_candidate%' — that matches programs whose code content references the string. Always match the source key precisely.
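
The LIKE pattern also depends on how the JSON was serialized (it expects a space after the colon). Parsing the column avoids that. A minimal sketch, assuming metadata is stored as a JSON object in a text column, as the json.loads call in section 2.5 suggests; the helper name `count_agent_candidates` is illustrative.

```python
import json
import sqlite3


def count_agent_candidates(conn):
    """Count programs whose metadata JSON has source == "agent_candidate"."""
    count = 0
    for (raw,) in conn.execute("SELECT metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except (TypeError, ValueError):
            continue  # skip rows whose metadata is not valid JSON
        if isinstance(meta, dict) and meta.get("source") == "agent_candidate":
            count += 1
    return count
```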

1.5 Fair Comparison Under Compute Budget

Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison:

SELECT MAX(combined_score) FROM programs WHERE generation <= (50 - n_candidates);

Here n_candidates is a placeholder, not a column: substitute each run's candidate count (section 1.4) so that both runs are compared at the same total number of LLM calls.
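
The cutoff can be passed as a bound parameter instead of being edited into the SQL by hand. A minimal sketch that mirrors the query above; the helper name `budget_matched_best` and the `total_gens=50` default are assumptions for illustration.

```python
import sqlite3


def budget_matched_best(conn, n_candidates, total_gens=50):
    """Best combined_score among generations <= total_gens - n_candidates."""
    cutoff = total_gens - n_candidates
    row = conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()
    return row[0]  # None if no program falls inside the budget
```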


Part 2: Evaluation Agent Analysis

The eval agent's trajectory lives in p<id>/eval_agent_memory/ and p<id>/gen_N/results/metrics.json (aux metric outputs).

2.1 Trigger History

import json
state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"))
print(state["total_agent_runs"], state["last_agent_trigger_gen"])
print(state["agent_trigger_history"])
  • total_agent_runs should match (max_gen - 5) // 5 + 1 for periodic mode (K=5)
  • If it's much lower, agent is being skipped (busy or erroring)
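
The expected count is easy to get wrong by one, so here is the formula as code. A minimal sketch, assuming the agent fires at generations K, 2K, 3K, ... with no other trigger mode active; both function names are illustrative.

```python
def expected_agent_runs(max_gen, k=5):
    """Expected trigger count when the agent fires at generations k, 2k, 3k, ..."""
    if max_gen is None or max_gen < k:
        return 0
    return (max_gen - k) // k + 1


def agent_skip_count(state, max_gen, k=5):
    """Expected minus recorded runs; a large positive value means skipped triggers."""
    return expected_agent_runs(max_gen, k) - state.get("total_agent_runs", 0)
```

For a finished 50-generation run (max_gen = 49) this gives 9 expected triggers, matching generations 5 through 45.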

2.2 Diagnostic Report Quality

Read diagnostic_report.md and classify:

  • Strong: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction
  • Weak: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code
  • Broken: empty file, parse errors, or repeats previous report

Good diagnostic example (from p42):

"aux_reported_L is exactly aux_L0_baseline... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps a at 250. Increase to 500+."

Weak example (from p241):

"Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric."

2.3 Aux Metrics: Are They Actually Running Code?

This is where silent failure hides. Check auxiliary_metrics.py for path correctness, then check what it actually produced.

# Read the aux metrics script the eval agent wrote
with open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py") as f:
    aux = f.read()
# Good pattern: subprocess.run + path to main.cpp
# Red flag: only reads metrics.json or does Python simulation

Then check the latest generation's aux metric output in the DB:

import sqlite3, json
conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite")
conn.row_factory = sqlite3.Row
row = conn.execute(
    "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
).fetchone()
pub = json.loads(row["public_metrics"])
aux_keys = [k for k in pub if k.startswith("aux_")]

Classify what you see:

  • Meaningful: aux_runtime_ms_N35, aux_num_rects, aux_reported_L_n11 — real measurements
  • Meta-only noise: aux_aux_metric_eval_success, aux_aux_metric_error_code — function ran but returned no real data
  • Compile failure: aux_compile_success = 0 → path bug or broken auxiliary_metrics.py

Meta-only noise with no real measurements = aux metrics are silently broken.
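
The three categories can be applied mechanically to a public_metrics dict. A minimal sketch; the function name `classify_aux` and the verdict strings are illustrative, and the meta-key filter is the same substring test the quick diagnostic script uses.

```python
def classify_aux(public_metrics):
    """Split aux_* keys into real measurements and meta-only bookkeeping keys."""
    meta_markers = ("metric_eval", "error_code")  # same filter as the quick diagnostic script
    aux = {k: v for k, v in public_metrics.items() if k.startswith("aux_")}
    meta = sorted(k for k in aux if any(m in k for m in meta_markers))
    real = sorted(k for k in aux if k not in meta)
    if aux.get("aux_compile_success") == 0:
        verdict = "compile_failure"
    elif real:
        verdict = "meaningful"
    elif meta:
        verdict = "meta_only"  # function ran but returned no real data
    else:
        verdict = "no_aux"
    return verdict, real, meta
```

A "meaningful" verdict only means some non-meta key exists; whether its values are real measurements still needs a look at the numbers.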

2.4 The Path Bug Check

The most common silent failure: auxiliary_metrics.py uses the wrong path to the code file.

grep -E "code_path|main\.cpp|results_dir" "$P/eval_agent_memory/auxiliary_metrics.py"
  • os.path.join(results_dir, 'main.cpp') — correct if service passes gen_N/
  • os.path.join(results_dir, '..', 'main.cpp') — correct if service passes gen_N/results/
  • Mismatch → compile always fails, every metric returns 0

Cross-check: what does the service actually pass? Look in ev2_service_standalone.py for the evaluate_func(request.results_dir) call and check whether the path is adjusted before it is passed.
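
A quick way to settle which layout applies is to test both candidate paths against the directory the service passes. A minimal sketch with a hypothetical helper name, `locate_main_cpp`; it only checks the filesystem and makes no assumption about the service code.

```python
import os


def locate_main_cpp(results_dir):
    """Report which of the two candidate layouts actually contains main.cpp.

    Returns "same_dir" if results_dir is gen_N/, "parent_dir" if it is
    gen_N/results/, or None if neither path exists.
    """
    if os.path.isfile(os.path.join(results_dir, "main.cpp")):
        return "same_dir"
    if os.path.isfile(os.path.join(results_dir, "..", "main.cpp")):
        return "parent_dir"
    return None
```

If the answer disagrees with the os.path.join call found by grep, the aux script is compiling a file that does not exist.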

2.5 Agent Candidate Outputs (v4+)

When the agent writes code directly, compare candidate score vs the original best:

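# Assumes conn.row_factory = sqlite3.Row (set above) so columns can be read by name.
rows = conn.execute(
    "SELECT generation, combined_score, metadata FROM programs "
    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' "
    "ORDER BY generation"
).fetchall()
for r in rows:
    meta = json.loads(r["metadata"])
    score = r["combined_score"]
    # combined_score can be NULL if evaluation never finished
    score_txt = "n/a" if score is None else f"{score:.2f}"
    print(f"gen={r['generation']} candidate={score_txt} "
          f"original={meta.get('original_score', '?')}")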
  • Candidate beats original → agent successfully identified and implemented a fix
  • Candidate = 0 → agent's code didn't compile or scored 0
  • Candidate < original → agent made things worse

Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code.
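
That ratio can be computed directly from the DB. A minimal sketch, assuming each candidate's metadata carries the original_score key read in the loop above; the helper name `candidate_win_rate` is illustrative, and rows without a numeric score on either side are skipped rather than counted as losses.

```python
import json
import sqlite3


def candidate_win_rate(conn):
    """Fraction of agent candidates scoring at least their recorded original_score.

    Returns (wins, compared, rate) with rate None when nothing is comparable.
    """
    wins = compared = 0
    for score, raw in conn.execute("SELECT combined_score, metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except (TypeError, ValueError):
            continue  # metadata is not valid JSON
        if not isinstance(meta, dict) or meta.get("source") != "agent_candidate":
            continue
        original = meta.get("original_score")
        if isinstance(score, (int, float)) and isinstance(original, (int, float)):
            compared += 1
            wins += score >= original
    return wins, compared, (wins / compared if compared else None)
```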

2.6 Effect Attribution

Score gains are hard to attribute to the agent rather than to run-to-run variance. Use two cross-checks:

  • Compare to matched vanilla: fork from the same gen 5, same problems, both run 50 gens. Any diff beyond ±5 per problem and ±2 on average is plausibly signal, otherwise noise.
  • Look at trigger→jump alignment: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise.
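
The matched-vanilla check reduces to a per-problem diff plus a mean. A minimal sketch, assuming the best score per problem has already been collected into two dicts keyed by problem id; the function name is illustrative, and the tolerances default to the ±5 and ±2 thresholds quoted above.

```python
def compare_to_vanilla(agent_best, vanilla_best, per_problem_tol=5.0, mean_tol=2.0):
    """Compare per-problem best scores of an agent run against a matched vanilla run.

    Both arguments map problem id -> best combined_score. Returns the
    per-problem diffs, the ids whose |diff| exceeds per_problem_tol, the
    mean diff, and whether |mean diff| exceeds mean_tol.
    """
    shared = sorted(set(agent_best) & set(vanilla_best))  # only problems present in both runs
    diffs = {pid: agent_best[pid] - vanilla_best[pid] for pid in shared}
    outliers = [pid for pid in shared if abs(diffs[pid]) > per_problem_tol]
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return diffs, outliers, mean_diff, abs(mean_diff) > mean_tol
```

Problems missing from either run are dropped from the comparison, so check that the shared set is the one you intended before reading the mean.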

Quick Diagnostic Script

One-shot sanity check for a running experiment:

import sqlite3, os, json, time

def analyze_run(run_dir, pids, vanilla_dir=None):
    """Print a one-line health summary per problem."""
    for pid in pids:
        db = f"{run_dir}/p{pid}/evolution_db.sqlite"
        if not os.path.exists(db):
            # sqlite3.connect() would silently create an empty file here
            print(f"p{pid}: no evolution_db.sqlite found")
            continue
        conn = sqlite3.connect(db); conn.row_factory = sqlite3.Row

        mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0
        best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
        n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0]
        n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
        n_cand = conn.execute(
            "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
        ).fetchone()[0]

        # Aux keys from the latest generation, excluding meta-only bookkeeping keys
        row = conn.execute(
            "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
        ).fetchone()
        conn.close()
        pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {}
        real_aux = [k for k in pub if k.startswith("aux_")
                    and "metric_eval" not in k and "error_code" not in k]

        state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"
        runs = json.load(open(state_path))["total_agent_runs"] if os.path.exists(state_path) else 0

        # Seconds since the DB was last written; large values on an unfinished run suggest a stall
        age = time.time() - os.path.getmtime(db)
        vanilla_best = ""
        if vanilla_dir:
            vconn = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite")
            bv = vconn.execute(
                "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
            vconn.close()
            vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}"

        print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | "
              f"correct={n_corr}/{n_total} | cands={n_cand} | "
              f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s")

Red flags to watch for:

| Symptom | Meaning |
| --- | --- |
| correct=0/N | correct definition too strict, archive broken |
| db_age > 300s while mg < 49 | runner stuck on LLM call |
| agent_runs << (mg - 5) // 5 + 1 | agent being skipped |
| real_aux = 0 but agent_runs > 0 | aux metrics silently failing (likely path bug) |
| cands = 0 but using v4 | agent not producing candidates (check prompt) |
| Big diff from vanilla but cands = 0 | signal is from text feedback alone |