# Analyzing Agent Trajectories
How to inspect what the agents actually did during an experiment, as distinct from what the final score was. Use this to diagnose whether an agent is adding value or failing silently.
Two agents to analyze separately:
- **Evolution Agent**: the code-generation LLM that mutates programs each generation
- **Evaluation Agent (eval_agent / EV2)**: the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates
---
## Where to Find the Trajectories
All artifacts live under the experiment directory. Assume `$RUN = results/frontier_cs_algorithmic/<experiment_name>` and `$P = $RUN/p<id>` for a single problem.
### Evolution Agent trajectory
| What | Path | Contents |
|------|------|----------|
| Full program history | `$P/evolution_db.sqlite` | `programs` table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata |
| Code each gen | `$P/gen_N/main.cpp` | The evaluated source at generation N |
| Patch each gen | `$P/gen_N/edit.diff`, `$P/gen_N/search_replace.txt` | What the evolution LLM changed |
| Parent before patch | `$P/gen_N/original.cpp` | Code that was mutated |
| Evaluation output | `$P/gen_N/results/metrics.json`, `$P/gen_N/results/correct.json` | Primary evaluator output (includes aux metrics merged as `aux_*`) |
| LLM call traces (if `--trajectory-log`) | `$P/gen_N/llm_trajectories/` | Raw LLM prompts and responses for the code-gen LLM |
| Run config | `$P/experiment_config.yaml` | All evolution settings used |
### Evaluation Agent (EV2) trajectory
| What | Path | Contents |
|------|------|----------|
| Service state | `$P/eval_agent_memory/service_state.json` | `total_agent_runs`, `last_agent_trigger_gen`, `agent_trigger_history`, full `generation_history` (primary_score per gen) |
| Current diagnostic | `$P/eval_agent_memory/diagnostic_report.md` | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) |
| Cross-gen memory | `$P/eval_agent_memory/EVAL_AGENTS.md` | Append-only log — hypothesis/verdict per trigger |
| Current aux metrics script | `$P/eval_agent_memory/auxiliary_metrics.py` | The `evaluate_aux()` function the agent wrote |
| Aux metrics backup | `$P/eval_agent_memory/auxiliary_metrics.py.bak` | Previous version before agent's last edit |
| Per-trigger results | `$P/eval_agent_memory/agent_runs/gen_N_result.json` | Agent output summary per trigger |
| Agent candidate (v4+) | `$P/eval_agent_memory/agent_candidate.cpp` | Code the agent wrote as a direct fix |
| Per-gen aux snapshot | `$P/gen_N/results/auxiliary_metrics_snapshot.py` | Copy of `auxiliary_metrics.py` used at gen N |
| LLM raw completions (if `OPENHANDS_LOG_COMPLETIONS=1`) | `$P/eval_agent_memory/llm_completions/` | Raw eval-agent LLM calls |
| Full message trajectory (if `ENABLE_FULL_TRAJECTORY_LOG=1`) | Inside `agent_runs/gen_N_*.json` | All messages incl. tool calls |
### Run-level logs (parallel scripts)
| What | Path |
|------|------|
| Per-problem runner logs | `$RUN/_worker_logs/problem_<id>.log` |
| Per-slot eval service logs | `$RUN/_worker_logs/eval_service_port_<port>.log` |
### Logging flags (enable before running)
| Flag | Effect |
|------|--------|
| `--trajectory-log` (runner CLI) | Save evolution sampler LLM prompts/responses |
| `ENABLE_FULL_TRAJECTORY_LOG=1` (env) | Save eval agent's full message trajectories |
| `OPENHANDS_LOG_COMPLETIONS=1` (env) | Save eval agent's raw LLM completions |
---
## Part 1: Evolution Agent Analysis
The evolution agent's full trajectory lives in `evolution_db.sqlite` (the `programs` table) under each `pN/` directory.
### 1.1 Progress and Completion
```python
import sqlite3

# run_dir and pid are supplied by the caller: run_dir is $RUN, pid is the problem id
db = f"{run_dir}/p{pid}/evolution_db.sqlite"
conn = sqlite3.connect(db)
# MAX() returns None when the programs table is still empty
max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
```
- `max_gen < 49` and DB `mtime` old → runner is stuck (likely waiting on LLM API)
- Check `ps aux | grep run_experiment` to see if workers are alive
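The stuck check above can be sketched as a small helper. This is a minimal sketch, not part of the runner: `run_status` is a hypothetical name, the 300-second staleness threshold is taken from the red-flag table at the end of this document, and `final_gen=49` assumes a 50-generation run.

```python
import os
import sqlite3
import time


def run_status(db_path, final_gen=49, stale_after_s=300):
    """Return 'done', 'running', or 'stalled' for one problem's database.

    Hypothetical helper: thresholds are assumptions, not runner settings.
    """
    conn = sqlite3.connect(db_path)
    try:
        max_gen = conn.execute(
            "SELECT MAX(generation) FROM programs"
        ).fetchone()[0]
    finally:
        conn.close()

    if max_gen is not None and max_gen >= final_gen:
        return "done"

    # The file has not been written for a while and the run is unfinished.
    age_s = time.time() - os.path.getmtime(db_path)
    return "stalled" if age_s > stale_after_s else "running"
```

If the database runs in SQLite's WAL journal mode, recent writes can sit in the `-wal` file first, so the main file's `mtime` may lag. Treat `stalled` as a prompt to check `ps aux`, not as proof.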
### 1.2 Archive and Parent Selection Health
Evolution depends on `correct=1` to populate the archive and select parents. If the archive is empty, evolution degenerates to "keep mutating the seed."
```sql
SELECT COUNT(*) FROM programs WHERE correct=1;
```
- Ratio `n_correct / n_total`:
  - Near 0 → `correct` definition is broken for this task (it's too strict). Archive is always empty, no diversity.
  - Near 1 → `correct` definition is loose (covers any runnable code). Archive and parents work normally.
- In Frontier-CS, the original `correct=passed` (all test cases perfect) gave 0/51 — **archive silently empty**. Fixed by redefining `correct=True` for any compilable/runnable code.
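The ratio check can be wrapped in a helper that returns both the number and a label. This is a sketch under assumptions: `archive_health` is a hypothetical name, and the `low`/`high` cutoffs (0.05 and 0.95) are illustrative choices for "near 0" and "near 1", not values used by the framework.

```python
def archive_health(conn, low=0.05, high=0.95):
    """Return (ratio, label) for the share of programs with correct=1.

    `conn` is an open sqlite3 connection to evolution_db.sqlite.
    The cutoffs are illustrative assumptions.
    """
    n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
    n_correct = conn.execute(
        "SELECT COUNT(*) FROM programs WHERE correct=1"
    ).fetchone()[0]

    if n_total == 0:
        return 0.0, "no programs yet"

    ratio = n_correct / n_total
    if ratio <= low:
        label = "too strict: archive effectively empty"
    elif ratio >= high:
        label = "loose: archive and parent selection work"
    else:
        label = "mixed: inspect which programs fail"
    return ratio, label
```

A result like `(0.0, "too strict: ...")` on a run with dozens of programs is the Frontier-CS failure mode described above.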
### 1.3 Per-Generation Best Score Trend
Reveals whether evolution is actually progressing or stuck on a plateau:
```python
rows = conn.execute(
    "SELECT generation, combined_score FROM programs ORDER BY generation"
).fetchall()
gen_best = {}
for g, s in rows:
    if s is not None and (g not in gen_best or s > gen_best[g]):
        gen_best[g] = s
```
Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for:
- **Step jumps at trigger points** → agent feedback or candidate helped
- **Gradual climb** → normal evolution, agent contribution unclear
- **Noise dominates** → variance too high, can't attribute improvements
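To make "step jumps at trigger points" checkable, one option is to compute the best-so-far curve and list the generations where it rises by at least some amount, then see which of those fall shortly after a trigger. The sketch below assumes `gen_best` is the dict built above and `trigger_gens` is a plain list of integers; `min_jump` and `window` are illustrative parameters, not framework settings.

```python
def running_best(gen_best):
    """Best-so-far score per generation, in generation order."""
    out, best = {}, float("-inf")
    for g in sorted(gen_best):
        best = max(best, gen_best[g])
        out[g] = best
    return out


def jumps_near_triggers(gen_best, trigger_gens, min_jump=1.0, window=2):
    """Return (jumps, near).

    jumps: generations where the best-so-far score rose by >= min_jump.
    near:  the subset of jumps that occur 0..window gens after a trigger.
    """
    rb = running_best(gen_best)
    gens = sorted(rb)
    jumps = [g for prev, g in zip(gens, gens[1:])
             if rb[g] - rb[prev] >= min_jump]
    near = [g for g in jumps
            if any(0 <= g - t <= window for t in trigger_gens)]
    return jumps, near
```

If most entries in `jumps` also appear in `near`, the curve looks like the first pattern; if `jumps` is long and `near` is a small share of it, improvements are probably normal evolution or noise.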
### 1.4 Agent Candidates in the DB (v4+ only)
When the eval agent writes code directly, candidates enter the `programs` table with `metadata.source = "agent_candidate"`.
```sql
SELECT COUNT(*) FROM programs
WHERE metadata LIKE '%"source": "agent_candidate"%';
```
**Warning**: Don't use `LIKE '%agent_candidate%'` — that matches programs whose code content references the string. Always match the `source` key precisely.
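A stricter alternative is to parse the JSON in Python and compare the `source` key directly. This also avoids depending on the exact whitespace after the colon, which the `LIKE` pattern silently assumes. The sketch assumes `metadata` is stored as a JSON object string; `count_agent_candidates` is a hypothetical helper name.

```python
import json


def count_agent_candidates(conn):
    """Count programs whose metadata has source == "agent_candidate".

    `conn` is an open sqlite3 connection. Rows with empty or
    non-JSON metadata are skipped rather than counted.
    """
    n = 0
    for (raw,) in conn.execute("SELECT metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except json.JSONDecodeError:
            continue
        if isinstance(meta, dict) and meta.get("source") == "agent_candidate":
            n += 1
    return n
```

This scans every row, so it is slower than the SQL filter, but for a few hundred programs per problem the cost is negligible.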
### 1.5 Fair Comparison Under Compute Budget
Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison:
```sql
SELECT MAX(combined_score) FROM programs WHERE generation <= (50 - n_candidates);
```
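`n_candidates` is a placeholder, not a column: count the agent candidates for each run (section 1.4), compute the cutoff `50 - n_candidates`, and substitute it into the query. This normalizes the total number of LLM code-generation calls before comparing runs.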
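In Python the cutoff can be passed as a bound parameter instead of being pasted into the SQL. A minimal sketch, mirroring the query above: `budget_matched_best` is a hypothetical name and `total_gens=50` assumes the 50-generation setup used in this document.

```python
def budget_matched_best(conn, n_candidates, total_gens=50):
    """Best combined_score within the compute-matched generation cutoff.

    `conn` is an open sqlite3 connection to evolution_db.sqlite.
    Uses the same cutoff as the query above: total_gens - n_candidates.
    """
    cutoff = total_gens - n_candidates
    return conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()[0]
```

Call it with the agent run's own `n_candidates` and compare the result against the vanilla run's unrestricted best.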
---
## Part 2: Evaluation Agent Analysis
The eval agent's trajectory lives in `pN/eval_agent_memory/` and `pN/gen_N/results/metrics.json` (aux metric outputs).
### 2.1 Trigger History
```python
import json
state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"))
print(state["total_agent_runs"], state["last_agent_trigger_gen"])
print(state["agent_trigger_history"])
```
- `total_agent_runs` should match `(max_gen - 5) // 5 + 1` for periodic mode (K=5)
- If it's much lower, agent is being skipped (busy or erroring)
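The expected count is simple arithmetic. For a finished run with `max_gen = 49` and K=5, `(49 - 5) // 5 + 1 = 9`. The helper below generalizes the formula to other periods on the assumption that the first trigger fires at generation K; `expected_agent_runs` is a hypothetical name.

```python
def expected_agent_runs(max_gen, k=5):
    """Expected number of periodic triggers up to and including max_gen.

    Assumes triggers fire at generations k, 2k, 3k, ...
    """
    if max_gen < k:
        return 0
    return (max_gen - k) // k + 1
```

Compare the result with `state["total_agent_runs"]`; a gap of more than one or two runs is worth checking in the eval service logs.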
### 2.2 Diagnostic Report Quality
Read `diagnostic_report.md` and classify:
- **Strong**: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction
- **Weak**: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code
- **Broken**: empty file, parse errors, or repeats previous report
Good diagnostic example (from p42):
> "`aux_reported_L` is exactly `aux_L0_baseline`... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps `a` at 250. Increase to 500+."
Weak example (from p241):
> "Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric."
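When there are many problems to review, a rough text heuristic can pre-sort reports before reading them. This is only a triage sketch: it treats "contains a measurement-like token" as a proxy for Strong, which is an assumption and no substitute for reading the report. `triage_report` and the regular expression are hypothetical.

```python
import re

# Measurement-like tokens: "4500ms", "12 s", "35%", "lines 65", "n=1000".
MEASUREMENT = re.compile(
    r"\d+(?:\.\d+)?\s*(?:(?:ms|s)\b|%)"
    r"|\blines?\s+\d+"
    r"|\b[A-Za-z_]\w*\s*=\s*\d+"
)


def triage_report(text, previous_text=None):
    """Return 'broken', 'strong', or 'weak' for a diagnostic report.

    broken: empty, or identical to the previous report (if one is given).
    strong: contains at least one measurement-like token.
    weak:   everything else.
    """
    stripped = text.strip()
    if not stripped:
        return "broken"
    if previous_text is not None and stripped == previous_text.strip():
        return "broken"
    return "strong" if MEASUREMENT.search(stripped) else "weak"
```

On the two examples above, the p42 report is labeled `strong` (it cites line numbers) and the p241 report is labeled `weak` (no numbers at all).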
### 2.3 Aux Metrics: Are They Actually Running Code?
This is where silent failure hides. Check `auxiliary_metrics.py` for path correctness, then check what it actually returned.
```python
# Check the file
aux = open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py").read()
# Good pattern: subprocess.run + path to main.cpp
# Red flag: only reads metrics.json or does Python simulation
```
Then check the latest generation's aux metric output in the DB:
```python
import sqlite3, json
conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite")
conn.row_factory = sqlite3.Row
row = conn.execute(
"SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
).fetchone()
pub = json.loads(row["public_metrics"])
aux_keys = [k for k in pub if k.startswith("aux_")]
```
Classify what you see:
- **Meaningful**: `aux_runtime_ms_N35`, `aux_num_rects`, `aux_reported_L_n11` — real measurements
- **Meta-only noise**: `aux_aux_metric_eval_success`, `aux_aux_metric_error_code` — function ran but returned no real data
- **Compile failure**: `aux_compile_success = 0` → path bug or broken aux_metrics
Meta-only noise with no real measurements = aux metrics are silently broken.
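The three categories can be assigned mechanically from the `public_metrics` dict. In this sketch the meta-key filter (`metric_eval`, `error_code`) is the same one the quick diagnostic script at the end of this document uses; `classify_aux` and the extra `no_aux` label are hypothetical.

```python
# Substrings that mark bookkeeping keys rather than real measurements.
META_MARKERS = ("metric_eval", "error_code")


def classify_aux(public_metrics):
    """Return 'compile_failure', 'meaningful', 'meta_only', or 'no_aux'."""
    aux = {k: v for k, v in public_metrics.items() if k.startswith("aux_")}

    # A reported compile failure invalidates every other aux value.
    if aux.get("aux_compile_success") == 0:
        return "compile_failure"

    real = [
        k for k in aux
        if k != "aux_compile_success"
        and not any(marker in k for marker in META_MARKERS)
    ]
    if real:
        return "meaningful"
    return "meta_only" if aux else "no_aux"
```

`meta_only` on a problem where the agent has already run is the silent-failure case; `no_aux` before the first trigger is expected.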
### 2.4 The Path Bug Check
The most common silent failure: `auxiliary_metrics.py` uses the wrong path to the code file.
```bash
grep "code_path\|main.cpp\|results_dir" pN/eval_agent_memory/auxiliary_metrics.py
```
- `os.path.join(results_dir, 'main.cpp')` — correct if service passes `gen_N/`
- `os.path.join(results_dir, '..', 'main.cpp')` — correct if service passes `gen_N/results/`
- Mismatch → compile always fails, every metric returns 0
Cross-check: what does the service actually pass? Look in `ev2_service_standalone.py` for the `evaluate_func(request.results_dir)` call and whether it's adjusted.
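Given the `results_dir` value the service passes, you can test both join patterns directly and see which one resolves to a real file. A minimal sketch: `locate_source` is a hypothetical helper, and it assumes the source file is named `main.cpp` as in the tables above.

```python
import os


def locate_source(results_dir, filename="main.cpp"):
    """Return [(convention, path)] for each join pattern that finds the file."""
    candidates = {
        "results_dir is gen_N/": os.path.join(results_dir, filename),
        "results_dir is gen_N/results/": os.path.join(results_dir, "..", filename),
    }
    return [
        (label, os.path.normpath(path))
        for label, path in candidates.items()
        if os.path.isfile(path)
    ]
```

An empty list means neither pattern works for that directory. If the convention that matches differs from the one `auxiliary_metrics.py` uses, you have found the path bug.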
### 2.5 Agent Candidate Outputs (v4+)
When the agent writes code directly, compare candidate score vs the original best:
```python
# Reuses `conn` (with row_factory = sqlite3.Row) and `json` from section 2.3.
rows = conn.execute(
    "SELECT generation, combined_score, metadata FROM programs "
    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' "
    "ORDER BY generation"
).fetchall()
for r in rows:
    meta = json.loads(r["metadata"])
    print(f"gen={r['generation']} candidate={r['combined_score']:.2f} "
          f"original={meta.get('original_score', '?')}")
```
- Candidate beats original → agent successfully identified and implemented a fix
- Candidate = 0 → agent's code didn't compile or scored 0
- Candidate < original → agent made things worse
Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code.
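The ratio can be computed from the same rows. This sketch assumes `original_score` is stored as a number in `metadata`, as the loop above suggests; rows where either score is missing are skipped. `candidate_win_rate` is a hypothetical name.

```python
import json


def candidate_win_rate(rows):
    """Return (share of candidates with score >= original, n compared).

    rows: iterable of (combined_score, metadata_json) pairs.
    Returns (None, 0) when no row has both scores.
    """
    wins = total = 0
    for score, raw in rows:
        meta = json.loads(raw) if raw else {}
        original = meta.get("original_score")
        if score is None or not isinstance(original, (int, float)):
            continue
        total += 1
        if score >= original:
            wins += 1
    return (wins / total if total else None), total
```

With `sqlite3.Row` results, pass `[(r["combined_score"], r["metadata"]) for r in rows]`.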
### 2.6 Effect Attribution
Score gains are hard to attribute to the agent rather than to run-to-run variance. Use two cross-checks:
- **Compare to matched vanilla**: fork from the same gen 5, same problems, both run 50 gens. Any diff beyond ±5 per problem and ±2 on average is plausibly signal, otherwise noise.
- **Look at trigger→jump alignment**: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise.
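The matched-vanilla check reduces to per-problem differences plus their mean, compared against the ±5 and ±2 tolerances above. A minimal sketch, assuming you have already collected the best score per problem into two dicts keyed by problem id; `compare_to_vanilla` is a hypothetical name.

```python
def compare_to_vanilla(agent_best, vanilla_best,
                       per_problem_tol=5.0, mean_tol=2.0):
    """Compare best scores per problem between an agent run and vanilla.

    Only problems present in both dicts are compared.
    """
    diffs = {
        pid: agent_best[pid] - vanilla_best[pid]
        for pid in agent_best
        if pid in vanilla_best
    }
    # Problems whose difference exceeds the per-problem tolerance.
    notable = {pid: d for pid, d in diffs.items() if abs(d) > per_problem_tol}
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return {
        "diffs": diffs,
        "notable": notable,
        "mean_diff": mean_diff,
        "mean_is_signal": abs(mean_diff) > mean_tol,
    }
```

A single large entry in `notable` with a small `mean_diff` points at one lucky problem rather than a general agent effect.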
---
## Quick Diagnostic Script
One-shot sanity check for a running experiment:
```python
import sqlite3, os, json, time

def analyze_run(run_dir, pids, vanilla_dir=None):
    for pid in pids:
        db = f"{run_dir}/p{pid}/evolution_db.sqlite"
        conn = sqlite3.connect(db); conn.row_factory = sqlite3.Row
        mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0
        best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
        n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0]
        n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
        n_cand = conn.execute(
            "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
        ).fetchone()[0]
        row = conn.execute(
            "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
        ).fetchone()
        pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {}
        real_aux = [k for k in pub if k.startswith("aux_")
                    and "metric_eval" not in k and "error_code" not in k]
        state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"
        runs = json.load(open(state_path))["total_agent_runs"] if os.path.exists(state_path) else 0
        age = time.time() - os.path.getmtime(db)
        vanilla_best = ""
        if vanilla_dir:
            bv = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite").execute(
                "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
            vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}"
        print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | "
              f"correct={n_corr}/{n_total} | cands={n_cand} | "
              f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s")
```
Red flags to watch for:
| Symptom | Meaning |
|---------|---------|
| `correct=0/N` | `correct` definition too strict, archive broken |
| `db_age > 300s` while `mg < 49` | runner stuck on LLM call |
| `agent_runs << (mg - 5) // 5 + 1` | agent being skipped |
| `real_aux = 0` but `agent_runs > 0` | aux metrics silently failing (likely path bug) |
| `cands = 0` but using v4 | agent not producing candidates (check prompt) |
| Big diff from vanilla but `cands = 0` | signal is from text feedback alone |
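The table can be turned into an automatic check over the values the script prints. This is a sketch under assumptions: `red_flags` is a hypothetical helper, "much lower" is interpreted as fewer than half the expected agent runs, and `uses_v4` must be supplied by the caller because the script does not detect the agent version.

```python
def red_flags(mg, n_corr, n_total, n_cand, runs, n_real_aux, db_age_s,
              uses_v4=False, final_gen=49, k=5):
    """Return the list of red-flag meanings that apply to one problem."""
    flags = []
    if n_total > 0 and n_corr == 0:
        flags.append("correct definition too strict, archive broken")
    if db_age_s > 300 and mg < final_gen:
        flags.append("runner stuck on LLM call")

    # Expected periodic triggers so far, assuming they fire at k, 2k, ...
    expected = (mg - k) // k + 1 if mg >= k else 0
    if runs < expected / 2:  # "much lower": threshold is an assumption
        flags.append("agent being skipped")

    if n_real_aux == 0 and runs > 0:
        flags.append("aux metrics silently failing (likely path bug)")
    if uses_v4 and n_cand == 0:
        flags.append("agent not producing candidates (check prompt)")
    return flags
```

An empty list does not prove the run is healthy; it only means none of the symptoms in the table were detected.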