# Analyzing Agent Trajectories

How to inspect what the agents actually did during an experiment — distinct from "what was the final score." Use this to diagnose whether the agent is adding value or failing silently.

Two agents to analyze separately:

- **Evolution Agent**: the code-generation LLM that mutates programs each generation
- **Evaluation Agent (eval_agent / EV2)**: the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates

---

## Where to Find the Trajectories

All artifacts live under the experiment directory. Assume `$RUN = results/frontier_cs_algorithmic/<experiment_name>` and `$P = $RUN/p<id>` for a single problem.

### Evolution Agent trajectory

| What | Path | Contents |
|------|------|----------|
| Full program history | `$P/evolution_db.sqlite` | `programs` table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata |
| Code each gen | `$P/gen_N/main.cpp` | The evaluated source at generation N |
| Patch each gen | `$P/gen_N/edit.diff`, `$P/gen_N/search_replace.txt` | What the evolution LLM changed |
| Parent before patch | `$P/gen_N/original.cpp` | Code that was mutated |
| Evaluation output | `$P/gen_N/results/metrics.json`, `$P/gen_N/results/correct.json` | Primary evaluator output (includes aux metrics merged as `aux_*`) |
| LLM call traces (if `--trajectory-log`) | `$P/gen_N/llm_trajectories/` | Raw LLM prompts and responses for the code-gen LLM |
| Run config | `$P/experiment_config.yaml` | All evolution settings used |

### Evaluation Agent (EV2) trajectory

| What | Path | Contents |
|------|------|----------|
| Service state | `$P/eval_agent_memory/service_state.json` | `total_agent_runs`, `last_agent_trigger_gen`, `agent_trigger_history`, full `generation_history` (primary_score per gen) |
| Current diagnostic | `$P/eval_agent_memory/diagnostic_report.md` | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) |
| Cross-gen memory | `$P/eval_agent_memory/EVAL_AGENTS.md` | Append-only log — hypothesis/verdict per trigger |
| Current aux metrics script | `$P/eval_agent_memory/auxiliary_metrics.py` | The `evaluate_aux()` function the agent wrote |
| Aux metrics backup | `$P/eval_agent_memory/auxiliary_metrics.py.bak` | Previous version before agent's last edit |
| Per-trigger results | `$P/eval_agent_memory/agent_runs/gen_N_result.json` | Agent output summary per trigger |
| Agent candidate (v4+) | `$P/eval_agent_memory/agent_candidate.cpp` | Code the agent wrote as a direct fix |
| Per-gen aux snapshot | `$P/gen_N/results/auxiliary_metrics_snapshot.py` | Copy of `auxiliary_metrics.py` used at gen N |
| LLM raw completions (if `OPENHANDS_LOG_COMPLETIONS=1`) | `$P/eval_agent_memory/llm_completions/` | Raw eval-agent LLM calls |
| Full message trajectory (if `ENABLE_FULL_TRAJECTORY_LOG=1`) | Inside `agent_runs/gen_N_*.json` | All messages incl. tool calls |

### Run-level logs (parallel scripts)

| What | Path |
|------|------|
| Per-problem runner logs | `$RUN/_worker_logs/problem_<id>.log` |
| Per-slot eval service logs | `$RUN/_worker_logs/eval_service_port_<port>.log` |

### Logging flags (enable before running)

| Flag | Effect |
|------|--------|
| `--trajectory-log` (runner CLI) | Save evolution sampler LLM prompts/responses |
| `ENABLE_FULL_TRAJECTORY_LOG=1` (env) | Save eval agent's full message trajectories |
| `OPENHANDS_LOG_COMPLETIONS=1` (env) | Save eval agent's raw LLM completions |
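
Before digging into a problem directory, it can help to confirm which of the artifacts listed above actually exist. The following is a minimal sketch, assuming the relative paths from the tables; `EXPECTED_ARTIFACTS` and `missing_artifacts` are hypothetical names, and per-generation files and flag-dependent logs are left out.

```python
import os

# Relative paths copied from the tables above (per-gen and optional files omitted).
EXPECTED_ARTIFACTS = [
    "evolution_db.sqlite",
    "experiment_config.yaml",
    "eval_agent_memory/service_state.json",
    "eval_agent_memory/diagnostic_report.md",
    "eval_agent_memory/EVAL_AGENTS.md",
    "eval_agent_memory/auxiliary_metrics.py",
]


def missing_artifacts(problem_dir, expected=EXPECTED_ARTIFACTS):
    """Return the expected relative paths that do not exist under problem_dir."""
    return [
        rel for rel in expected
        if not os.path.exists(os.path.join(problem_dir, rel))
    ]
```

Calling `missing_artifacts(P)` with the `$P` directory returns an empty list when everything is in place. A missing `eval_agent_memory/` entry on a vanilla run is expected rather than an error.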

---

## Part 1: Evolution Agent Analysis

The evolution agent's full trajectory lives in the `programs` table of `evolution_db.sqlite` under each `pN/` directory.

### 1.1 Progress and Completion

```python
import sqlite3

# run_dir and pid are placeholders: the experiment directory ($RUN) and the problem id.
db = f"{run_dir}/p{pid}/evolution_db.sqlite"
conn = sqlite3.connect(db)
# MAX() returns None when the programs table is empty.
max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
```

- `max_gen < 49` and the DB `mtime` has not changed for several minutes → runner is probably stuck (likely waiting on the LLM API)
- Check `ps aux | grep run_experiment` to see if workers are alive
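
The staleness check can be sketched as a small function. This is an illustration only: `is_stuck` is a hypothetical helper, the final generation index of 49 follows the text above, and the 300-second threshold is an assumption you should tune to your LLM latency.

```python
import os
import time


def is_stuck(db_path, max_gen, last_gen=49, stale_after_s=300, now=None):
    """True if the run is unfinished and the DB file has not been written recently."""
    now = time.time() if now is None else now
    age = now - os.path.getmtime(db_path)  # seconds since the last write
    return max_gen < last_gen and age > stale_after_s
```

A `True` result is only a hint; confirm with the `ps aux` check before restarting anything.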

### 1.2 Archive and Parent Selection Health

Evolution depends on `correct=1` to populate archive and select parents. If the archive is empty, evolution degenerates to "keep mutating the seed."

```sql
SELECT COUNT(*) FROM programs WHERE correct=1;
```

- Ratio `n_correct / n_total`:
  - Near 0 → `correct` definition is broken for this task (it's too strict). Archive is always empty, no diversity.
  - Near 1 → `correct` definition is loose (covers any runnable code). Archive and parents work normally.
- In Frontier-CS, the original `correct=passed` (all test cases perfect) gave 0/51 — **archive silently empty**. Fixed by redefining `correct=True` for any compilable/runnable code.
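
To get the ratio in one query rather than two, something like the following should work. It is a sketch: `correct_ratio` is a hypothetical helper, and it assumes `correct` is stored as 0/1 (rows with `NULL` are counted in the total but not as correct).

```python
import sqlite3


def correct_ratio(conn):
    """Return (n_correct, n_total, ratio) for the programs table."""
    n_correct, n_total = conn.execute(
        # SUM over a boolean expression counts matching rows; COALESCE covers an empty table.
        "SELECT COALESCE(SUM(correct = 1), 0), COUNT(*) FROM programs"
    ).fetchone()
    ratio = n_correct / n_total if n_total else 0.0
    return n_correct, n_total, ratio
```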

### 1.3 Per-Generation Best Score Trend

Reveals whether evolution is actually progressing or stuck on a plateau:

```python
rows = conn.execute(
    "SELECT generation, combined_score FROM programs ORDER BY generation"
).fetchall()
gen_best = {}
for g, s in rows:
    if s is not None and (g not in gen_best or s > gen_best[g]):
        gen_best[g] = s
```

Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for:
- **Step jumps at trigger points** → agent feedback or candidate helped
- **Gradual climb** → normal evolution, agent contribution unclear
- **Noise dominates** → variance too high, can't attribute improvements
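
One way to make "step jump at a trigger" concrete is to compare the running best just before each trigger with the running best a couple of generations later. The sketch below takes a `gen_best` dict like the one built above; `running_best`, `jumps_after_triggers`, and the default `window=2` are illustrative choices, not part of the tooling.

```python
def running_best(gen_best):
    """Best score seen so far at each generation, in generation order."""
    out, best = {}, float("-inf")
    for g in sorted(gen_best):
        best = max(best, gen_best[g])
        out[g] = best
    return out


def jumps_after_triggers(gen_best, triggers, window=2):
    """Gain in running best within `window` generations after each trigger."""
    rb = running_best(gen_best)
    gens = sorted(rb)
    jumps = {}
    for t in triggers:
        before = [rb[g] for g in gens if g <= t]
        after = [rb[g] for g in gens if t < g <= t + window]
        if before and after:  # skip triggers with no data on either side
            jumps[t] = after[-1] - before[-1]
    return jumps
```

Large values concentrated at trigger generations suggest agent impact; gains of similar size at non-trigger generations suggest ordinary evolution or noise.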

### 1.4 Agent Candidates in the DB (v4+ only)

When the eval agent writes code directly, candidates enter the `programs` table with `metadata.source = "agent_candidate"`.

```sql
SELECT COUNT(*) FROM programs
WHERE metadata LIKE '%"source": "agent_candidate"%';
```

**Warning**: Don't use `LIKE '%agent_candidate%'` — that matches programs whose code content references the string. Always match the `source` key precisely.
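
If you would rather not depend on the exact JSON spacing that the `LIKE` pattern assumes, parsing the metadata in Python is a stricter alternative. This is a sketch under the assumption that `metadata` is a JSON object stored as text; `count_agent_candidates` is a hypothetical helper.

```python
import json
import sqlite3


def count_agent_candidates(conn):
    """Count rows whose metadata JSON has source == 'agent_candidate'."""
    n = 0
    for (raw,) in conn.execute("SELECT metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except (TypeError, ValueError):
            continue  # skip rows with unparseable metadata
        if isinstance(meta, dict) and meta.get("source") == "agent_candidate":
            n += 1
    return n
```

This reads every row, so it is slower than the SQL filter on large databases, but it cannot be fooled by the string appearing elsewhere in the metadata.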

### 1.5 Fair Comparison Under Compute Budget

Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison:

```sql
SELECT MAX(combined_score) FROM programs WHERE generation <= (50 - n_candidates);
```

`n_candidates` is a placeholder, not a column: substitute each run's candidate count (section 1.4) so that `(50 - n_candidates)` normalizes total LLM calls, then compare.
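
The same query can be run from Python with the cutoff bound as a parameter. This sketch mirrors the `<=` cutoff of the SQL above; `budget_matched_best` is a hypothetical helper and `total_gens=50` follows the budget stated in the text.

```python
import sqlite3


def budget_matched_best(conn, n_candidates, total_gens=50):
    """Best score within the generation budget reduced by the candidate count."""
    cutoff = total_gens - n_candidates
    return conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()[0]
```

Pass the vanilla run `n_candidates=0` and the agent run its own candidate count, then compare the two results.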

---

## Part 2: Evaluation Agent Analysis

The eval agent's trajectory lives in `pN/eval_agent_memory/` and `pN/gen_N/results/metrics.json` (aux metric outputs).

### 2.1 Trigger History

```python
import json
state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"))
print(state["total_agent_runs"], state["last_agent_trigger_gen"])
print(state["agent_trigger_history"])
```

- `total_agent_runs` should match `(max_gen - 5) // 5 + 1` for periodic mode (K=5)
- If it's much lower, agent is being skipped (busy or erroring)
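
The expected count and the shortfall can be computed directly. A minimal sketch, assuming periodic triggers at generations K, 2K, and so on; `expected_agent_runs` and `agent_skip_rate` are hypothetical helper names.

```python
def expected_agent_runs(max_gen, k=5):
    """Expected trigger count if the agent fires at gens k, 2k, ... up to max_gen."""
    return max(0, (max_gen - k) // k + 1)


def agent_skip_rate(total_agent_runs, max_gen, k=5):
    """Fraction of expected triggers that did not produce an agent run."""
    expected = expected_agent_runs(max_gen, k)
    if expected == 0:
        return 0.0
    return max(0.0, 1 - total_agent_runs / expected)
```

For a finished 50-generation run (`max_gen = 49`) this expects 9 triggers, matching the 9 candidates used in the compute-budget example.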

### 2.2 Diagnostic Report Quality

Read `diagnostic_report.md` and classify:

- **Strong**: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction
- **Weak**: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code
- **Broken**: empty file, parse errors, or repeats previous report

Good diagnostic example (from p42):
> "`aux_reported_L` is exactly `aux_L0_baseline`... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps `a` at 250. Increase to 500+."

Weak example (from p241):
> "Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric."
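
When many reports need triage, a crude keyword pass can sort them before you read them. The heuristic below is only a rough sketch: `triage_report` and the phrase list are assumptions drawn from the examples above, and it does not replace reading the report.

```python
import re

# Hedging phrases taken from the weak examples above; extend as needed.
HEDGE_PHRASES = ("may have", "might have", "consider", "cannot be determined", "monitor")


def triage_report(text, previous_text=None):
    """Rough first-pass label: 'broken', 'weak', or 'strong'."""
    body = text.strip()
    if not body or (previous_text is not None and body == previous_text.strip()):
        return "broken"  # empty file or verbatim repeat of the previous report
    has_number = re.search(r"\d", body) is not None
    hedges = sum(body.lower().count(p) for p in HEDGE_PHRASES)
    if has_number and hedges == 0:
        return "strong"
    return "weak"
```

Anything labeled `strong` still deserves a manual check that the cited numbers come from real measurements.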

### 2.3 Aux Metrics: Are They Actually Running Code?

This is where silent failure hides. Check `auxiliary_metrics.py` for path correctness, then check what it actually outputs.

```python
# Read the agent-written script so you can inspect how it obtains its numbers
with open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py") as f:
    aux = f.read()
# Good pattern: subprocess.run + path to main.cpp
# Red flag: only reads metrics.json or does Python simulation
```

Then check the latest generation's aux metric output in the DB:

```python
import sqlite3, json
conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite")
conn.row_factory = sqlite3.Row
row = conn.execute(
    "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
).fetchone()
pub = json.loads(row["public_metrics"])
aux_keys = [k for k in pub if k.startswith("aux_")]
```

Classify what you see:

- **Meaningful**: `aux_runtime_ms_N35`, `aux_num_rects`, `aux_reported_L_n11` — real measurements
- **Meta-only noise**: `aux_aux_metric_eval_success`, `aux_aux_metric_error_code` — function ran but returned no real data
- **Compile failure**: `aux_compile_success = 0` → path bug or broken aux_metrics

Meta-only noise with no real measurements = aux metrics are silently broken.
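
The classification above can be applied mechanically to a `public_metrics` dict. This is a sketch: `classify_aux` is a hypothetical helper, and the marker substrings are taken from the meta-only key names listed above.

```python
# Substrings that identify bookkeeping keys rather than measurements.
META_MARKERS = ("metric_eval", "error_code")


def classify_aux(pub):
    """Return (status, real_keys, meta_keys) for the aux_* entries in pub."""
    aux = [k for k in pub if k.startswith("aux_")]
    meta = [k for k in aux if any(m in k for m in META_MARKERS)]
    real = [k for k in aux if k not in meta]
    if pub.get("aux_compile_success") == 0:
        status = "compile_failure"
    elif real:
        status = "meaningful"
    elif meta:
        status = "meta_only"
    else:
        status = "no_aux"
    return status, real, meta
```

A `meta_only` or `compile_failure` status on the latest generation is the cue to run the path check in the next section.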

### 2.4 The Path Bug Check

The most common silent failure: `auxiliary_metrics.py` uses the wrong path to the code file.

```bash
grep "code_path\|main.cpp\|results_dir" pN/eval_agent_memory/auxiliary_metrics.py
```

- `os.path.join(results_dir, 'main.cpp')` — correct if service passes `gen_N/`
- `os.path.join(results_dir, '..', 'main.cpp')` — correct if service passes `gen_N/results/`
- Mismatch → compile always fails, every metric returns 0

Cross-check what the service actually passes: in `ev2_service_standalone.py`, find the `evaluate_func(request.results_dir)` call and check whether `results_dir` is adjusted before it is passed.
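
Given a concrete `results_dir` value, you can test which of the two join patterns actually resolves to a file. A minimal sketch; `locate_main_cpp` and its labels are hypothetical.

```python
import os


def locate_main_cpp(results_dir):
    """Return (label, path) for the join pattern that finds main.cpp, else None."""
    candidates = {
        "same_dir": os.path.join(results_dir, "main.cpp"),
        "parent_dir": os.path.join(results_dir, "..", "main.cpp"),
    }
    for label, path in candidates.items():
        if os.path.isfile(path):
            return label, os.path.normpath(path)
    return None
```

If the label does not match the pattern used in `auxiliary_metrics.py`, the script is compiling a file that does not exist.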

### 2.5 Agent Candidate Outputs (v4+)

When the agent writes code directly, compare candidate score vs the original best:

```python
rows = conn.execute(
    "SELECT generation, combined_score, metadata FROM programs "
    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' "
    "ORDER BY generation"
).fetchall()
for r in rows:
    meta = json.loads(r["metadata"])
    print(f"gen={r['generation']} candidate={r['combined_score']:.2f} "
          f"original={meta.get('original_score', '?')}")
```

- Candidate beats original → agent successfully identified and implemented a fix
- Candidate = 0 → agent's code didn't compile or scored 0
- Candidate < original → agent made things worse

Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code.
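
The ratio can be computed from (candidate, original) score pairs, for example built from `combined_score` and the `original_score` metadata key used in the snippet above. This is a sketch; `candidate_win_rate` is a hypothetical helper, and pairs with a missing score are dropped.

```python
def candidate_win_rate(pairs):
    """Fraction of (candidate, original) score pairs where candidate >= original."""
    scored = [(c, o) for c, o in pairs if c is not None and o is not None]
    if not scored:
        return None  # nothing to compare
    return sum(c >= o for c, o in scored) / len(scored)
```

With only a handful of candidates per problem, pool the pairs across problems before reading much into the number.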

### 2.6 Effect Attribution

Score gains are hard to attribute to the agent rather than to run-to-run variance. Use two cross-checks:

- **Compare to matched vanilla**: fork from the same gen 5, same problems, both run 50 gens. Any diff beyond ±5 per problem and ±2 on average is plausibly signal, otherwise noise.
- **Look at trigger→jump alignment**: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise.
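
The matched-vanilla comparison can be sketched as follows, taking per-problem best scores as dicts keyed by problem id. `compare_to_vanilla` is a hypothetical helper; the default bands of 5 and 2 are the thresholds stated above.

```python
def compare_to_vanilla(agent_best, vanilla_best, per_problem=5.0, mean_thresh=2.0):
    """Per-problem diffs, mean diff, and which of them exceed the noise bands."""
    shared = sorted(set(agent_best) & set(vanilla_best))  # problems present in both runs
    diffs = {p: agent_best[p] - vanilla_best[p] for p in shared}
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return {
        "diffs": diffs,
        "mean_diff": mean_diff,
        "problems_beyond_band": [p for p in shared if abs(diffs[p]) > per_problem],
        "mean_beyond_band": abs(mean_diff) > mean_thresh,
    }
```

Problems that appear in only one of the two runs are ignored, so check the size of `diffs` before interpreting the mean.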

---

## Quick Diagnostic Script

One-shot sanity check for a running experiment:

```python
import sqlite3, os, json, time

def analyze_run(run_dir, pids, vanilla_dir=None):
    for pid in pids:
        db = f"{run_dir}/p{pid}/evolution_db.sqlite"
        conn = sqlite3.connect(db)
        conn.row_factory = sqlite3.Row

        mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0
        best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
        n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0]
        n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
        n_cand = conn.execute(
            "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
        ).fetchone()[0]

        row = conn.execute(
            "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
        ).fetchone()
        pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {}
        real_aux = [k for k in pub if k.startswith("aux_")
                    and "metric_eval" not in k and "error_code" not in k]

        state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"
        runs = json.load(open(state_path))["total_agent_runs"] if os.path.exists(state_path) else 0

        age = time.time() - os.path.getmtime(db)
        vanilla_best = ""
        if vanilla_dir:
            bv = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite").execute(
                "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
            vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}"

        print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | "
              f"correct={n_corr}/{n_total} | cands={n_cand} | "
              f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s")
```

Red flags to watch for:

| Symptom | Meaning |
|---------|---------|
| `correct=0/N` | `correct` definition too strict, archive broken |
| `db_age > 300s` while `mg < 49` | runner stuck on LLM call |
| `agent_runs << (mg - 5) // 5 + 1` | agent being skipped |
| `real_aux = 0` but `agent_runs > 0` | aux metrics silently failing (likely path bug) |
| `cands = 0` but using v4+ | agent not producing candidates (check prompt) |
| Big diff from vanilla but `cands = 0` | signal is from text feedback alone |
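
The first five rows of the table can be checked automatically from the per-problem numbers the script prints. The function below is a sketch: `red_flags`, the `stats` dict keys, and the flag names are hypothetical, and reading "much lower" as "fewer than half the expected runs" is an assumption.

```python
def red_flags(stats, uses_candidates=False, last_gen=49, k=5, stale_after_s=300):
    """Map one problem's summary stats to the red-flag symptoms in the table."""
    flags = []
    expected_runs = max(0, (stats["mg"] - k) // k + 1)
    if stats["n_total"] > 0 and stats["n_corr"] == 0:
        flags.append("archive_broken")
    if stats["db_age"] > stale_after_s and stats["mg"] < last_gen:
        flags.append("runner_stuck")
    # Assumption: "much lower" means fewer than half the expected agent runs.
    if expected_runs > 0 and stats["runs"] < expected_runs / 2:
        flags.append("agent_skipped")
    if stats["real_aux"] == 0 and stats["runs"] > 0:
        flags.append("aux_silently_failing")
    if uses_candidates and stats["n_cand"] == 0:
        flags.append("no_candidates")
    return flags
```

Set `uses_candidates=True` only for v4+ runs, where the agent is expected to write code directly.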