# Analyzing Agent Trajectories

How to inspect what the agents actually did during an experiment — distinct from "what was the final score." Use this to diagnose whether the agent is adding value or failing silently.

Two agents to analyze separately:

- **Evolution Agent**: the code-generation LLM that mutates programs each generation
- **Evaluation Agent (eval_agent / EV2)**: the meta-agent that periodically analyzes the best code and produces diagnostic feedback + aux metrics + code candidates

---

## Where to Find the Trajectories

All artifacts live under the experiment directory. Assume `$RUN = results/frontier_cs_algorithmic/<experiment_name>` and `$P = $RUN/p<id>` for a single problem.

### Evolution Agent trajectory

| What | Path | Contents |
|------|------|----------|
| Full program history | `$P/evolution_db.sqlite` | `programs` table: id, generation, combined_score, correct, code, parent_id, archive/inspiration ids, public/private metrics, text_feedback, metadata |
| Code each gen | `$P/gen_N/main.cpp` | The evaluated source at generation N |
| Patch each gen | `$P/gen_N/edit.diff`, `$P/gen_N/search_replace.txt` | What the evolution LLM changed |
| Parent before patch | `$P/gen_N/original.cpp` | Code that was mutated |
| Evaluation output | `$P/gen_N/results/metrics.json`, `$P/gen_N/results/correct.json` | Primary evaluator output (includes aux metrics merged as `aux_*`) |
| LLM call traces (if `--trajectory-log`) | `$P/gen_N/llm_trajectories/` | Raw LLM prompts and responses for the code-gen LLM |
| Run config | `$P/experiment_config.yaml` | All evolution settings used |

### Evaluation Agent (EV2) trajectory

| What | Path | Contents |
|------|------|----------|
| Service state | `$P/eval_agent_memory/service_state.json` | `total_agent_runs`, `last_agent_trigger_gen`, `agent_trigger_history`, full `generation_history` (primary_score per gen) |
| Current diagnostic | `$P/eval_agent_memory/diagnostic_report.md` | Latest Hypothesis / Experiment / Verdict / Direction (overwritten each trigger) |
| Cross-gen memory | `$P/eval_agent_memory/EVAL_AGENTS.md` | Append-only log — hypothesis/verdict per trigger |
| Current aux metrics script | `$P/eval_agent_memory/auxiliary_metrics.py` | The `evaluate_aux()` function the agent wrote |
| Aux metrics backup | `$P/eval_agent_memory/auxiliary_metrics.py.bak` | Previous version before agent's last edit |
| Per-trigger results | `$P/eval_agent_memory/agent_runs/gen_N_result.json` | Agent output summary per trigger |
| Agent candidate (v4+) | `$P/eval_agent_memory/agent_candidate.cpp` | Code the agent wrote as a direct fix |
| Per-gen aux snapshot | `$P/gen_N/results/auxiliary_metrics_snapshot.py` | Copy of `auxiliary_metrics.py` used at gen N |
| LLM raw completions (if `OPENHANDS_LOG_COMPLETIONS=1`) | `$P/eval_agent_memory/llm_completions/` | Raw eval-agent LLM calls |
| Full message trajectory (if `ENABLE_FULL_TRAJECTORY_LOG=1`) | Inside `agent_runs/gen_N_*.json` | All messages incl. tool calls |

### Run-level logs (parallel scripts)

| What | Path |
|------|------|
| Per-problem runner logs | `$RUN/_worker_logs/problem_<id>.log` |
| Per-slot eval service logs | `$RUN/_worker_logs/eval_service_port_<port>.log` |

### Logging flags (enable before running)

| Flag | Effect |
|------|--------|
| `--trajectory-log` (runner CLI) | Save evolution sampler LLM prompts/responses |
| `ENABLE_FULL_TRAJECTORY_LOG=1` (env) | Save eval agent's full message trajectories |
| `OPENHANDS_LOG_COMPLETIONS=1` (env) | Save eval agent's raw LLM completions |

---

## Part 1: Evolution Agent Analysis

The evolution agent's full trajectory lives in the `programs` table of `evolution_db.sqlite`, one database per `pN/` directory.

### 1.1 Progress and Completion

```python
import sqlite3

# run_dir is $RUN and pid is the problem id (see "Where to Find the Trajectories").
db = f"{run_dir}/p{pid}/evolution_db.sqlite"
conn = sqlite3.connect(db)
# MAX() returns None when the programs table is still empty.
max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
```

- `max_gen < 49` and the DB file's `mtime` is more than ~300 s old → runner is stuck (likely waiting on the LLM API)
- Check `ps aux | grep run_experiment` to see if workers are alive
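
The two signals above can be folded into one check. This is a minimal sketch, assuming a 50-generation run (last generation index 49) and a 300 s staleness threshold. The function name `is_stuck` and both defaults are illustrative and not part of the runner.

```python
import os
import sqlite3
import time

def is_stuck(db_path, last_gen=49, stale_after_s=300, now=None):
    """Return True if the run is unfinished and the DB file has not changed recently.

    last_gen and stale_after_s are assumed defaults, not values read from the runner.
    """
    now = time.time() if now is None else now
    conn = sqlite3.connect(db_path)
    try:
        max_gen = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0]
    finally:
        conn.close()
    # An empty table (max_gen is None) also counts as unfinished.
    unfinished = max_gen is None or max_gen < last_gen
    stale = (now - os.path.getmtime(db_path)) > stale_after_s
    return unfinished and stale
```

A `True` result only says the database is idle; confirming that the worker process is still alive remains a separate step.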

### 1.2 Archive and Parent Selection Health

Evolution depends on `correct=1` to populate archive and select parents. If the archive is empty, evolution degenerates to "keep mutating the seed."

```sql
SELECT COUNT(*) FROM programs WHERE correct=1;
```

- Ratio `n_correct / n_total`:
  - Near 0 → `correct` definition is broken for this task (it's too strict). Archive is always empty, no diversity.
  - Near 1 → `correct` definition is loose (covers any runnable code). Archive and parents work normally.
- In Frontier-CS, the original `correct=passed` (all test cases perfect) gave 0/51 — **archive silently empty**. Fixed by redefining `correct=True` for any compilable/runnable code.
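
The ratio can be computed directly from the `programs` table. A small sketch, where `correct_ratio` is a hypothetical helper name and the connection is assumed to point at one problem's `evolution_db.sqlite`:

```python
import sqlite3

def correct_ratio(conn):
    """Return (n_correct, n_total, ratio) for the programs table."""
    n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
    n_correct = conn.execute(
        "SELECT COUNT(*) FROM programs WHERE correct=1"
    ).fetchone()[0]
    # Guard against an empty table so the ratio is always defined.
    ratio = n_correct / n_total if n_total else 0.0
    return n_correct, n_total, ratio
```

A ratio of exactly 0 with a non-empty table is the "archive silently empty" case described above.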

### 1.3 Per-Generation Best Score Trend

Reveals whether evolution is actually progressing or stuck on a plateau:

```python
rows = conn.execute(
    "SELECT generation, combined_score FROM programs ORDER BY generation"
).fetchall()
gen_best = {}
for g, s in rows:
    if s is not None and (g not in gen_best or s > gen_best[g]):
        gen_best[g] = s
```

Mark the agent trigger generations (e.g. gen 5, 10, 15, ...) and look for:
- **Step jumps at trigger points** → agent feedback or candidate helped
- **Gradual climb** → normal evolution, agent contribution unclear
- **Noise dominates** → variance too high, can't attribute improvements
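
One way to make "step jumps at trigger points" concrete is to turn `gen_best` into a best-so-far curve and list the generations where it rises. The sketch below assumes a jump counts as trigger-aligned when it lands 1 to `window` generations after a trigger. Both `window` and `min_jump` are illustrative thresholds, and the function names are hypothetical.

```python
def running_best(gen_best):
    """Convert {generation: best score in that generation} to a best-so-far curve."""
    curve, best = {}, float("-inf")
    for g in sorted(gen_best):
        best = max(best, gen_best[g])
        curve[g] = best
    return curve

def jumps_near_triggers(gen_best, trigger_gens, window=2, min_jump=1.0):
    """Return (generation, gain, near_trigger) for each rise of at least min_jump."""
    curve = running_best(gen_best)
    gens = sorted(curve)
    out = []
    for prev, g in zip(gens, gens[1:]):
        gain = curve[g] - curve[prev]
        if gain >= min_jump:
            # Aligned if the jump lands within `window` generations after a trigger.
            near = any(0 < g - t <= window for t in trigger_gens)
            out.append((g, gain, near))
    return out
```

If most entries have `near_trigger` set to `False`, the climb looks like ordinary evolution rather than an agent effect.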

### 1.4 Agent Candidates in the DB (v4+ only)

When the eval agent writes code directly, candidates enter the `programs` table with `metadata.source = "agent_candidate"`.

```sql
SELECT COUNT(*) FROM programs
WHERE metadata LIKE '%"source": "agent_candidate"%';
```

**Warning**: Don't use `LIKE '%agent_candidate%'` — that matches programs whose code content references the string. Always match the `source` key precisely.
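
If the exact spacing of the serialized JSON is in doubt, the `source` key can be matched after parsing instead of with `LIKE`. A sketch, assuming `metadata` is stored as JSON text; `count_agent_candidates` is a hypothetical helper name:

```python
import json
import sqlite3

def count_agent_candidates(conn):
    """Count programs whose parsed metadata has source == "agent_candidate"."""
    n = 0
    for (raw,) in conn.execute("SELECT metadata FROM programs"):
        if not raw:
            continue
        try:
            meta = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip rows whose metadata is not valid JSON
        if isinstance(meta, dict) and meta.get("source") == "agent_candidate":
            n += 1
    return n
```

This is slower than the SQL filter because it reads every row, but it does not depend on how the writer formatted the JSON and it ignores rows that merely mention the string elsewhere.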

### 1.5 Fair Comparison Under Compute Budget

Each agent candidate is an extra LLM call. If the agent triggered 9 times and wrote 9 candidates, the agent run had 50 + 9 = 59 code generations vs vanilla's 50. For an apples-to-apples comparison:

```sql
-- Bind :gen_cutoff to (50 - n_candidates), computed per run.
SELECT MAX(combined_score) FROM programs WHERE generation <= :gen_cutoff;
```

Compute `(50 - n_candidates)` per run to normalize total LLM calls, then compare.
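
A sketch of the budget-matched query from Python, using a bound parameter for the cutoff. The helper name `budget_matched_best` and the `total_gens=50` default are assumptions for illustration.

```python
import sqlite3

def budget_matched_best(conn, n_candidates, total_gens=50):
    """Best score among programs up to the generation cutoff total_gens - n_candidates."""
    cutoff = total_gens - n_candidates
    return conn.execute(
        "SELECT MAX(combined_score) FROM programs WHERE generation <= ?",
        (cutoff,),
    ).fetchone()[0]
```

Call it on the agent run with that run's own candidate count, then compare the result against the vanilla run's unrestricted best.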

---

## Part 2: Evaluation Agent Analysis

The eval agent's trajectory lives in `pN/eval_agent_memory/` and `pN/gen_N/results/metrics.json` (aux metric outputs).

### 2.1 Trigger History

```python
import json
state = json.load(open(f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"))
print(state["total_agent_runs"], state["last_agent_trigger_gen"])
print(state["agent_trigger_history"])
```

- `total_agent_runs` should match `(max_gen - 5) // 5 + 1` for periodic mode (K=5)
- If it's much lower, agent is being skipped (busy or erroring)
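
The expected count can be checked mechanically. A sketch that generalizes the formula above to an arbitrary period `k`, assuming the first trigger fires at generation `k`; the function names are hypothetical.

```python
def expected_agent_runs(max_gen, k=5):
    """Number of periodic triggers at generations k, 2k, ... up to max_gen."""
    if max_gen is None or max_gen < k:
        return 0
    return (max_gen - k) // k + 1

def missing_agent_runs(state, max_gen, k=5):
    """How many expected triggers are absent from service_state.json."""
    return max(0, expected_agent_runs(max_gen, k) - state.get("total_agent_runs", 0))
```

For a finished run with `max_gen = 49` and `k = 5`, the expected count is 9.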

### 2.2 Diagnostic Report Quality

Read `diagnostic_report.md` and classify:

- **Strong**: specific numbers ("runtime 4500ms at n=1000"), names the exact bottleneck function/line, gives concrete direction
- **Weak**: vague phrases ("may have TLE", "consider optimizing"), no measured evidence, speculative on already-solved code
- **Broken**: empty file, parse errors, or repeats previous report

Good diagnostic example (from p42):
> "`aux_reported_L` is exactly `aux_L0_baseline`... algorithm is falling back to baseline unrotated packing... search loop (lines 65-93) caps `a` at 250. Increase to 500+."

Weak example (from p241):
> "Might have TLE for larger inputs. Cannot be determined yet. Monitor the metric."

### 2.3 Aux Metrics: Are They Actually Running Code?

This is where silent failure hides. Check `auxiliary_metrics.py` for path correctness, then check what it actually outputs.

```python
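# Read the aux metrics script the agent wrote
with open(f"{run_dir}/p{pid}/eval_agent_memory/auxiliary_metrics.py") as f:
    aux = f.read()
# Good pattern: subprocess.run + path to main.cpp
runs_code = "subprocess" in aux and "main.cpp" in aux
# Red flag: only reads metrics.json or does Python simulation
print("invokes the compiled program:", runs_code)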
```

Then check the latest generation's aux metric output in the DB:

```python
import sqlite3, json
conn = sqlite3.connect(f"{run_dir}/p{pid}/evolution_db.sqlite")
conn.row_factory = sqlite3.Row
row = conn.execute(
    "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
).fetchone()
pub = json.loads(row["public_metrics"])
aux_keys = [k for k in pub if k.startswith("aux_")]
```

Classify what you see:

- **Meaningful**: `aux_runtime_ms_N35`, `aux_num_rects`, `aux_reported_L_n11` — real measurements
- **Meta-only noise**: `aux_aux_metric_eval_success`, `aux_aux_metric_error_code` — function ran but returned no real data
- **Compile failure**: `aux_compile_success = 0` → path bug or broken aux_metrics

Meta-only noise with no real measurements = aux metrics are silently broken.
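
The three categories can be assigned automatically from the `public_metrics` dict. This sketch reuses the `metric_eval` / `error_code` substrings that the Quick Diagnostic Script filters on. The status labels and the `classify_aux` name are illustrative.

```python
META_MARKERS = ("metric_eval", "error_code")  # substrings of bookkeeping-only keys

def classify_aux(pub):
    """Return (status, real_keys, meta_keys) for a public_metrics dict."""
    aux = {k: v for k, v in pub.items() if k.startswith("aux_")}
    meta = sorted(k for k in aux if any(m in k for m in META_MARKERS))
    real = sorted(k for k in aux if k not in meta and k != "aux_compile_success")
    if aux.get("aux_compile_success") == 0:
        status = "compile_failure"
    elif real:
        status = "meaningful"
    elif meta:
        status = "meta_only"
    else:
        status = "no_aux"
    return status, real, meta
```

A `meta_only` status on a problem where the agent has already run is the silent-failure case.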

### 2.4 The Path Bug Check

The most common silent failure: `auxiliary_metrics.py` uses the wrong path to the code file.

```bash
grep "code_path\|main.cpp\|results_dir" "$P/eval_agent_memory/auxiliary_metrics.py"
```

- `os.path.join(results_dir, 'main.cpp')` — correct if service passes `gen_N/`
- `os.path.join(results_dir, '..', 'main.cpp')` — correct if service passes `gen_N/results/`
- Mismatch → compile always fails, every metric returns 0

Cross-check: what does the service actually pass? Look in `ev2_service_standalone.py` for the `evaluate_func(request.results_dir)` call and whether it's adjusted.
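
To see which of the two join patterns would resolve for a given `results_dir`, both can be tested against the filesystem. A sketch with a hypothetical helper `resolve_main_cpp`; it only inspects paths on disk and does not look at what the service actually passes.

```python
import os

def resolve_main_cpp(results_dir):
    """Return the join patterns that point at an existing main.cpp."""
    candidates = {
        "same_dir": os.path.join(results_dir, "main.cpp"),
        "parent_dir": os.path.join(results_dir, "..", "main.cpp"),
    }
    return {
        name: os.path.normpath(path)
        for name, path in candidates.items()
        if os.path.isfile(path)
    }
```

An empty result means neither pattern finds the source file from that directory, which matches the "compile always fails" symptom.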

### 2.5 Agent Candidate Outputs (v4+)

When the agent writes code directly, compare candidate score vs the original best:

```python
# Reuses `conn` from 2.3 (row_factory = sqlite3.Row, so columns are addressable by name).
rows = conn.execute(
    "SELECT generation, combined_score, metadata FROM programs "
    "WHERE metadata LIKE '%\"source\": \"agent_candidate\"%' "
    "ORDER BY generation"
).fetchall()
for r in rows:
    meta = json.loads(r["metadata"])
    score = r["combined_score"] or 0.0  # a NULL score is shown as 0
    print(f"gen={r['generation']} candidate={score:.2f} "
          f"original={meta.get('original_score', '?')}")
```

- Candidate beats original → agent successfully identified and implemented a fix
- Candidate = 0 → agent's code didn't compile or scored 0
- Candidate < original → agent made things worse

Ratio of "candidate ≥ original" is a direct measure of whether the agent is producing useful code.
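
The ratio itself can be computed in a few lines. A sketch, assuming candidates carry a numeric `original_score` in their metadata as in the snippet above. Rows without one are skipped, and `candidate_success_ratio` is a hypothetical name.

```python
import json
import sqlite3

def candidate_success_ratio(conn):
    """Return (wins, comparable, ratio) where a win is candidate score >= original."""
    wins = total = 0
    for score, raw in conn.execute("SELECT combined_score, metadata FROM programs"):
        try:
            meta = json.loads(raw) if raw else {}
        except (TypeError, ValueError):
            continue
        if not isinstance(meta, dict) or meta.get("source") != "agent_candidate":
            continue
        original = meta.get("original_score")
        if score is None or not isinstance(original, (int, float)):
            continue  # cannot compare without both scores
        total += 1
        if score >= original:
            wins += 1
    return wins, total, (wins / total if total else None)
```

A ratio of `None` means no candidate had both scores available, which is worth investigating on its own.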

### 2.6 Effect Attribution

Score gains can't easily be attributed to the agent rather than to run-to-run variance. Use two cross-checks:

- **Compare to matched vanilla**: fork from the same gen 5, same problems, both run 50 gens. Differences beyond ±5 on a single problem and beyond ±2 averaged across problems are plausibly signal; smaller differences are noise.
- **Look at trigger→jump alignment**: does the score jump within 1-2 gens of a trigger? Consistent alignment across problems = signal. Random placement = noise.
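
The matched-vanilla comparison can be sketched as follows, using the ±5 per-problem and ±2 average thresholds from the first bullet as defaults. The inputs are assumed to be `{problem_id: best_score}` dicts built separately for each run, and the function name is hypothetical.

```python
def compare_to_vanilla(agent_best, vanilla_best, per_problem_tol=5.0, mean_tol=2.0):
    """Compare best scores on the problems present in both runs."""
    common = sorted(set(agent_best) & set(vanilla_best))
    diffs = {p: agent_best[p] - vanilla_best[p] for p in common}
    # Problems whose difference exceeds the per-problem noise band.
    flagged = {p: d for p, d in diffs.items() if abs(d) > per_problem_tol}
    mean_diff = sum(diffs.values()) / len(diffs) if diffs else 0.0
    return {
        "diffs": diffs,
        "flagged": flagged,
        "mean_diff": mean_diff,
        "mean_is_signal": abs(mean_diff) > mean_tol,
    }
```

Problems that appear in only one run are dropped from the comparison, so the two runs stay matched.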

---

## Quick Diagnostic Script

One-shot sanity check for a running experiment:

```python
import sqlite3, os, json, time

def analyze_run(run_dir, pids, vanilla_dir=None):
    for pid in pids:
        db = f"{run_dir}/p{pid}/evolution_db.sqlite"
        conn = sqlite3.connect(db)
        conn.row_factory = sqlite3.Row

        mg = conn.execute("SELECT MAX(generation) FROM programs").fetchone()[0] or 0
        best = conn.execute("SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
        n_corr = conn.execute("SELECT COUNT(*) FROM programs WHERE correct=1").fetchone()[0]
        n_total = conn.execute("SELECT COUNT(*) FROM programs").fetchone()[0]
        n_cand = conn.execute(
            "SELECT COUNT(*) FROM programs WHERE metadata LIKE '%\"source\": \"agent_candidate\"%'"
        ).fetchone()[0]

        row = conn.execute(
            "SELECT public_metrics FROM programs ORDER BY generation DESC LIMIT 1"
        ).fetchone()
        pub = json.loads(row["public_metrics"]) if row and row["public_metrics"] else {}
        real_aux = [k for k in pub if k.startswith("aux_")
                    and "metric_eval" not in k and "error_code" not in k]

        state_path = f"{run_dir}/p{pid}/eval_agent_memory/service_state.json"
        runs = json.load(open(state_path))["total_agent_runs"] if os.path.exists(state_path) else 0

        age = time.time() - os.path.getmtime(db)
        vanilla_best = ""
        if vanilla_dir:
            bv = sqlite3.connect(f"{vanilla_dir}/p{pid}/evolution_db.sqlite").execute(
                "SELECT MAX(combined_score) FROM programs").fetchone()[0] or 0
            vanilla_best = f" | vanilla={bv:.2f} diff={best-bv:+.2f}"

        print(f"p{pid}: gen={mg}/49 | best={best:.2f}{vanilla_best} | "
              f"correct={n_corr}/{n_total} | cands={n_cand} | "
              f"agent_runs={runs} | real_aux={len(real_aux)} | db_age={age:.0f}s")
```

Red flags to watch for:

| Symptom | Meaning |
|---------|---------|
| `correct=0/N` | `correct` definition too strict, archive broken |
| `db_age > 300s` while `mg < 49` | runner stuck on LLM call |
| `agent_runs << (mg-5)//5 + 1` | agent being skipped |
| `real_aux = 0` but `agent_runs > 0` | aux metrics silently failing (likely path bug) |
| `cands = 0` but using v4 | agent not producing candidates (check prompt) |
| Big diff from vanilla but `cands = 0` | signal is from text feedback alone |
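
The first five rows of the table can be turned into an automated check over the values that `analyze_run` prints. This is a sketch under stated assumptions: "`<<`" is interpreted as "fewer than half the expected runs", and `uses_v4` is a flag the caller supplies. The last row is left out because it is an interpretation rather than a failure.

```python
def red_flags(mg, n_corr, n_total, cands, agent_runs, real_aux, db_age,
              last_gen=49, stale_after_s=300, k=5, uses_v4=False):
    """Return the list of red-flag meanings that apply to one problem."""
    flags = []
    if n_total and n_corr == 0:
        flags.append("correct definition too strict, archive broken")
    if db_age > stale_after_s and mg < last_gen:
        flags.append("runner stuck on LLM call")
    expected = (mg - k) // k + 1 if mg >= k else 0
    # Assumption: "much lower" means fewer than half of the expected triggers.
    if agent_runs < expected // 2:
        flags.append("agent being skipped")
    if real_aux == 0 and agent_runs > 0:
        flags.append("aux metrics silently failing (likely path bug)")
    if uses_v4 and cands == 0:
        flags.append("agent not producing candidates (check prompt)")
    return flags
```

An empty list means none of the listed symptoms applies; it does not prove the agent is adding value.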