narcolepticchicken committed on
Commit 6c0e47f · verified · 1 Parent(s): a3dbabc

Session runbook for mechanism/baselines jobs

Files changed (1)
  1. SESSION_RUNBOOK.md +87 -0
SESSION_RUNBOOK.md ADDED
@@ -0,0 +1,87 @@
+ # OCC Collapse Mechanism — Runbook
+ ## Session: 2026-05-11
+
+ ### JOBS RUNNING (5 total, all session e95fd6cc)
+
+ | Job ID | Hardware | Script | Timeout | Status |
+ |--------|----------|--------|---------|--------|
+ | `6a0236d6317220dbbd1a7c07` | H200 | occ_debate_collapse_mechanism_v3.py | 24h | RUNNING |
+ | `6a0236d6aff1cd33e8f33ee6` | a10g-large | occ_cheap_baselines.py | 6h | RUNNING |
+ | `6a0236d6317220dbbd1a7c09` | a10g-large | occ_strong_baselines.py | 6h | RUNNING |
+ | `6a022292aff1cd33e8f33ded` | a10g-large | occ_strong_baselines.py (older) | 6h | RUNNING |
+ | `6a022033317220dbbd1a7b8c` | a10g-large | occ_cheap_baselines.py (older) | 6h | RUNNING |
+
+ ### DO NOT SUBMIT NEW JOBS until these complete. Session ID e95fd6cc is shared; submitting a new job WILL cancel all running jobs.
+
+ ### Data locations (on narcolepticchicken/occ-stack Hub)
+
+ | File | Produced by |
+ |------|-------------|
+ | `reports/debate_collapse_mechanism_results.json` | Mechanism v3 (pushes incrementally after each condition) |
+ | `reports/cheap_baselines_results.json` | Cheap baselines |
+ | `reports/strong_baselines_results.json` | Strong baselines |
+ | `reports/debate_extended_baselines_2seed.json` | Pre-existing v2 data (88.3% → 56.7% collapse) |
+
+ ### When mechanism data arrives, run:
+
+ ```bash
+ # 1. Download results
+ python -c "
+ import shutil
+ from huggingface_hub import hf_hub_download
+ # Download into the local Hub cache, then copy into the working directory
+ p = hf_hub_download('narcolepticchicken/occ-stack', 'reports/debate_collapse_mechanism_results.json')
+ shutil.copy(p, './debate_collapse_mechanism_results.json')
+ "
+
+ # 2. Run the analysis harness (v2.1, handles both v2 and v3 formats)
+ python jobs/analyze_collapse.py debate_collapse_mechanism_results.json
+
+ # 3. Outputs are written to reports/analysis/:
+ # - condition_summary.csv
+ # - per_topic_outcomes.csv
+ # - round_flip_matrix.csv
+ # - hypothesis_verdicts.json
+ # - fig_accuracy_by_condition.png
+ # - fig_honest_retention.png
+ # - fig_flip_rate.png
+ # - fig_adversary_skill.png
+ ```
+
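Before moving on to the memo, it may help to confirm that the harness wrote every file listed in step 3. The following is a minimal sketch under stated assumptions, not part of the repository: `missing_outputs` and `EXPECTED_OUTPUTS` are hypothetical names, and the check only tests that the eight listed filenames exist under `reports/analysis/`, not that their contents are valid.

```python
from pathlib import Path

# Filenames copied from the step 3 output list in the runbook.
EXPECTED_OUTPUTS = [
    "condition_summary.csv",
    "per_topic_outcomes.csv",
    "round_flip_matrix.csv",
    "hypothesis_verdicts.json",
    "fig_accuracy_by_condition.png",
    "fig_honest_retention.png",
    "fig_flip_rate.png",
    "fig_adversary_skill.png",
]


def missing_outputs(analysis_dir="reports/analysis"):
    """Return the expected output filenames that are not present in analysis_dir."""
    root = Path(analysis_dir)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs:", ", ".join(missing))
    else:
        print("All expected analysis outputs are present.")
```

Running this from the repository root after step 2 gives a quick signal on whether the harness finished, which matters because the mechanism job pushes results incrementally and a partial results file may not yield every output.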
+ ### Then fill in the v13 memo:
+
+ ```bash
+ # Fill {VALUE} placeholders in reports/v13_mechanism_memo.md
+ # Data comes from reports/analysis/condition_summary.csv and reports/analysis/hypothesis_verdicts.json
+ ```
+
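Since the memo is filled by hand, a small check for leftover placeholders can catch omissions. This is a sketch only: it assumes placeholders are uppercase tokens wrapped in braces (such as `{VALUE}`), which the runbook does not spell out, and `unfilled_placeholders` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Assumption: a placeholder is an uppercase token in braces, e.g. {VALUE}.
PLACEHOLDER = re.compile(r"\{[A-Z][A-Z0-9_]*\}")


def unfilled_placeholders(text):
    """Return (line_number, placeholder) pairs for every placeholder left in text."""
    found = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            found.append((lineno, match.group(0)))
    return found


if __name__ == "__main__":
    memo = Path("reports/v13_mechanism_memo.md")
    if memo.is_file():
        for lineno, token in unfilled_placeholders(memo.read_text(encoding="utf-8")):
            print(f"{memo}:{lineno}: unfilled {token}")
```

If the template uses a different placeholder syntax, only the `PLACEHOLDER` pattern needs to change.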
+ ### Infrastructure
+
+ - Analysis harness: `jobs/analyze_collapse.py` (v2.1; handles both `per_seed` and `seeds` keys)
+ - v13 memo template: `reports/v13_mechanism_memo.md`
+ - All scripts: `narcolepticchicken/occ-stack` on the Hub
+
+ ### Pre-registered hypotheses
+
+ Evaluated automatically by the analysis harness using the rules in the `HYPOTHESIS_RULES` dict:
+
+ | Hypothesis | Test (conditions compared or metric) |
+ |-----------|-----------|
+ | H1: Volume amplification | `equal_token_unequal_turn` vs `baseline_1round_traced` |
+ | H2: Turn-order effect | `randomized_order_3round` vs `equal_3round_traced` |
+ | H3: Voting vulnerability | `judge_vote` + `confidence_weighted` vs `equal_3round_traced` |
+ | H4: Contamination | Honest retention rate at round 3 |
+ | H5: Confidence distortion | `confidence_weighted` vs `equal_3round_traced` |
+ | H6: Skill dependency | weak vs normal vs strong vs oracle adversary |
+ | H7: Topic vulnerability | Per-topic variance in collapse |
+
+ ### Expected results (from 2-seed pilot)
+
+ - 1-round baseline: 88.3% accuracy
+ - 3-round equal: 56.7% accuracy (31.6pp collapse)
+ - Random 25% drop: 85.0% accuracy with 26.5% token savings
+ - OCC credit: prevents catastrophic collapse but doesn't beat random gating at moderate budgets
+
+ ### Key fix: `judge_vote_3round`
+
+ v2 returned 0/30 because `extract_position()` only checked first-line prefixes.
+ v3 uses `extract_judge_answer()`, which matches the regex `\b(yes|no)\b` and breaks ties by taking the last occurrence.
+ The judge prompt now asks "Based on the debate, the correct answer is: ", and the judge generates 32 tokens at temperature 0.1.
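
The extraction rule described above can be sketched as follows. This is an illustration under assumptions, not the code in `occ_debate_collapse_mechanism_v3.py`: the function name `extract_judge_answer_sketch` is hypothetical, case-insensitive matching is assumed, and the "last-occurrence tiebreaker" is read as "when both words appear, the last one wins".

```python
import re

# Whole-word yes/no match, as named in the runbook.
# Assumption: matching is case-insensitive.
YES_NO = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def extract_judge_answer_sketch(text):
    """Return 'yes' or 'no' from free-form judge output, or None if neither appears.

    Assumed tiebreak: when both words occur, the last occurrence wins.
    """
    matches = YES_NO.findall(text)
    if not matches:
        return None
    return matches[-1].lower()


if __name__ == "__main__":
    print(extract_judge_answer_sketch("Yes. The evidence supports it."))
    print(extract_judge_answer_sketch("Some say yes, but the correct answer is No."))
    print(extract_judge_answer_sketch("I cannot tell."))
```

Unlike a first-line prefix check, this accepts answers that appear mid-sentence, and the word boundaries keep words such as "cannot" or "nothing" from counting as "no".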