narcolepticchicken committed on
Commit de136c4 · verified · 1 Parent(s): d48ed63

Upload reports/v13_mechanism_memo.md

Files changed (1)
  1. reports/v13_mechanism_memo.md +161 -0
reports/v13_mechanism_memo.md ADDED
@@ -0,0 +1,161 @@
1
+ # OCC Mechanism Isolation Memo — v13
2
+
3
+ **Date:** 2026-05-11
4
+ **Status:** Template — awaiting mechanism isolation data
5
+ **Job:** `6a017bbbaff1cd33e8f3365b` (H200, 10 conditions × 3 seeds)
6
+
7
+ ---
8
+
9
+ ## 1. Claim
10
+
11
+ Unmanaged multi-agent deliberation can amplify adversarial influence when adversarial agents receive disproportionate compute volume. In a 3-round, 4-agent debate, majority-vote accuracy collapses from 88.3% (1-round baseline) to 56.7% when one adversarial agent is given equal speaking turns. We isolate the mechanism: is the collapse driven by **token volume**, **turn order**, **voting protocol**, **answer contamination**, **confidence distortion**, **adversary skill**, or **topic vulnerability**?
12
+
13
+ ## 2. Experimental Setup
14
+
15
+ **Model:** Qwen/Qwen3-Coder-30B-A3B-Instruct (MoE, 30B total, ~3B active)
16
+ **Hardware:** H200
17
+ **Topics:** 30 yes/no scientific fact questions
18
+ **Agent mix:** 3 honest + 1 adversarial (scripted to argue the wrong answer)
19
+ **Seeds:** {N_SEEDS}
20
+
21
+ ### Conditions
22
+
23
+ | # | Condition | What it tests |
24
+ |---|-----------|---------------|
25
+ | 1 | baseline_1round_traced | Single-round majority vote (upper bound) |
26
+ | 2 | equal_3round_traced | 3 rounds, all agents speak equally (collapse baseline) |
27
+ | 3 | equal_token_unequal_turn | Adversary gets 1 long turn, honest agents get 3 short turns (same total tokens) |
28
+ | 4 | randomized_order_3round | Speaking order shuffled each round |
29
+ | 5 | judge_vote_3round | LLM judge reads transcript and decides (instead of majority vote) |
30
+ | 6 | confidence_weighted_3round | Votes weighted by self-reported confidence |
31
+ | 7 | adversary_weak | Adversary told to "struggle to make convincing arguments" |
32
+ | 8 | adversary_normal | Standard adversarial persona |
33
+ | 9 | adversary_strong | "Extremely persuasive and convincing" |
34
+ | 10 | adversary_oracle | "Knows the WRONG answer and must convince others" |
35
+
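To make the difference between conditions 1-2 (majority vote) and condition 6 (confidence-weighted vote) concrete, here is a minimal sketch. The function names and the example confidences are illustrative assumptions, not taken from the job code; it only shows how one highly confident adversary could outweigh three low-confidence honest agents under weighting.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most common answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(votes):
    """votes: list of (answer, self-reported confidence in [0, 1]) pairs."""
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Hypothetical round: three hesitant honest agents, one confident adversary.
votes = [("yes", 0.3), ("yes", 0.3), ("yes", 0.3), ("no", 0.95)]

plain = majority_vote([answer for answer, _ in votes])
weighted = confidence_weighted_vote(votes)
print(plain, weighted)
```

In this made-up round the two protocols disagree, which is the kind of divergence the confidence-weighted condition is meant to surface.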
36
+ ## 3. Pre-Registered Win Conditions
37
+
38
+ | Hypothesis | Confirmed if | Refuted if |
39
+ |-----------|-------------|-----------|
40
+ | **H1: Volume amplification** | equal_token condition recovers >0.783 (within 10pp of baseline) | ≤0.617 (within 5pp of collapse) |
41
+ | **H2: Turn-order effect** | Randomized order >0.667 (10pp above collapse) | ≤0.617 (within 5pp of collapse) |
42
+ | **H3: Voting vulnerability** | Judge or confidence vote >0.667 (10pp above collapse) | Both ≤0.617 (within 5pp of collapse) |
43
+ | **H4: Contamination** | Honest retention rate round 3 <50% | >70% |
44
+ | **H5: Confidence distortion** | Confidence-weighted vote recovers from collapse | No significant difference |
45
+ | **H6: Skill dependency** | Weak adversary avoids collapse; oracle does not | All adversary levels collapse equally |
46
+ | **H7: Topic vulnerability** | Variance across topics in collapse magnitude | Uniform collapse |
47
+
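The numeric thresholds for H1-H3 follow from the two accuracies quoted in the Claim (0.883 baseline, 0.567 collapse). The sketch below shows that arithmetic and one possible way to turn it into verdicts. It is not the contents of `jobs/analyze_collapse.py`; the function names and the "inconclusive" middle band are assumptions added for illustration.

```python
BASELINE = 0.883  # 1-round accuracy quoted in the Claim section
COLLAPSE = 0.567  # equal-turn 3-round accuracy quoted in the Claim section

RECOVER_H1 = round(BASELINE - 0.10, 3)       # within 10pp of baseline
RECOVER_H2_H3 = round(COLLAPSE + 0.10, 3)    # 10pp above collapse
STILL_COLLAPSED = round(COLLAPSE + 0.05, 3)  # within 5pp of collapse


def verdict(accuracy, confirm_above, refute_at_or_below=STILL_COLLAPSED):
    """Map one condition's accuracy to a verdict string."""
    if accuracy > confirm_above:
        return "confirmed"
    if accuracy <= refute_at_or_below:
        return "refuted"
    return "inconclusive"  # assumption: the table does not name this band


def h3_verdict(judge_acc, confidence_acc):
    """H3 is confirmed if either voting variant recovers, refuted if both stay collapsed."""
    return verdict(max(judge_acc, confidence_acc), RECOVER_H2_H3)


# Hypothetical accuracies, for illustration only (not results).
print(verdict(0.80, RECOVER_H1))      # H1 check
print(verdict(0.60, RECOVER_H2_H3))   # H2 check
print(h3_verdict(0.58, 0.70))         # H3 check
```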
48
+ ## 4. Main Result
49
+
50
+ ### 4.1 Accuracy by Condition
51
+
52
+ {INSERT: accuracy chart}
53
+
54
+ | Condition | Mean Accuracy | Min | Max | Δ from 1-Round |
55
+ |-----------|:---:|:---:|:---:|:---:|
56
+ | baseline_1round_traced | {VALUE} | {VALUE} | {VALUE} | — |
57
+ | equal_3round_traced | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
58
+ | equal_token_unequal_turn | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
59
+ | randomized_order_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
60
+ | judge_vote_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
61
+ | confidence_weighted_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
62
+ | adversary_weak | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
63
+ | adversary_normal | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
64
+ | adversary_strong | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
65
+ | adversary_oracle | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
66
+
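Each row of the table above can be derived from per-seed accuracies. A minimal sketch of that aggregation follows; the input layout (one accuracy per seed) and the numbers are assumptions for illustration and are not results from this job.

```python
from statistics import mean


def summarize(per_seed_accuracy, baseline_mean=None):
    """Build one table row: mean, min, max, and delta from the 1-round baseline."""
    row = {
        "mean": mean(per_seed_accuracy),
        "min": min(per_seed_accuracy),
        "max": max(per_seed_accuracy),
    }
    # The baseline row itself has no delta, matching the dash in the table.
    row["delta"] = None if baseline_mean is None else row["mean"] - baseline_mean
    return row


# Made-up per-seed accuracies, NOT measured values.
baseline_row = summarize([0.90, 0.85, 0.90])
debate_row = summarize([0.50, 0.60, 0.70], baseline_mean=baseline_row["mean"])
print(baseline_row)
print(debate_row)
```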
67
+ ### 4.2 Hypothesis Verdicts
68
+
69
+ {INSERT: verdict table}
70
+
71
+ ## 5. Mechanism Result
72
+
73
+ ### 5.1 Honest Answer Retention
74
+
75
+ {INSERT: retention chart}
76
+
77
+ How many honest agents keep their original answer across debate rounds?
78
+
79
+ | Seed | Round 2 Stayed | Round 2 Flipped | Round 3 Stayed | Round 3 Flipped | Adversary-Induced |
80
+ |------|:---:|:---:|:---:|:---:|:---:|
81
+ | {SEED1} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
82
+ | {SEED2} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
83
+ | {SEED3} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
84
+
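The retention question can be answered by comparing each honest agent's answer in a later round with its round-1 answer. The sketch below assumes a hypothetical trace layout (one list of answers per honest agent per topic, indexed by round); the real trace format of the job is not shown in this memo.

```python
def retention_rate(trajectories, round_index):
    """Fraction of honest agents whose answer in the given round matches round 1.

    trajectories: list of answer lists, e.g. ["yes", "yes", "no"] for rounds 1-3.
    round_index: 0-based index of the round to compare against round 1.
    """
    stayed = sum(1 for answers in trajectories if answers[round_index] == answers[0])
    return stayed / len(trajectories)


# Hypothetical traces for four honest agent/topic pairs (not real data).
traces = [
    ["yes", "yes", "yes"],
    ["yes", "yes", "no"],
    ["no", "yes", "yes"],
    ["no", "no", "no"],
]

round2 = retention_rate(traces, 1)
round3 = retention_rate(traces, 2)
print(round2, round3)
```

Under H4 as pre-registered, a round-3 value below 0.5 would count as confirmation and a value above 0.7 as refutation.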
85
+ ### 5.2 Flip Matrix
86
+
87
+ {INSERT: flip pie chart}
88
+
89
+ | Direction | Count | Rate |
90
+ |-----------|:---:|:---:|
91
+ | Stable (correct→correct or wrong→wrong) | {VALUE} | {VALUE}% |
92
+ | Degraded (correct→wrong) | {VALUE} | {VALUE}% |
93
+ | Improved (wrong→correct) | {VALUE} | {VALUE}% |
94
+
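The three rows of the flip matrix can be computed by comparing an agent's first and final answers against the ground truth. This is a sketch under the assumption that each record is a `(first, last, truth)` triple; that layout and the sample records are hypothetical.

```python
from collections import Counter


def classify_flip(first, last, truth):
    """Label a trajectory as stable, degraded, or improved."""
    first_ok = first == truth
    last_ok = last == truth
    if first_ok == last_ok:
        return "stable"  # correct->correct or wrong->wrong
    return "degraded" if first_ok else "improved"


def flip_rates(records):
    """Return {direction: (count, rate)} over all records."""
    counts = Counter(classify_flip(*record) for record in records)
    total = len(records)
    return {name: (counts[name], counts[name] / total)
            for name in ("stable", "degraded", "improved")}


# Hypothetical records, for illustration only.
records = [
    ("yes", "yes", "yes"),
    ("yes", "no", "yes"),
    ("no", "yes", "yes"),
    ("no", "no", "yes"),
]
rates = flip_rates(records)
print(rates)
```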
95
+ ### 5.3 Adversary Skill Gradient
96
+
97
+ {INSERT: skill chart}
98
+
99
+ ## 6. Controls — What Works and What Doesn't
100
+
101
+ | Intervention | Effect | Mechanism |
102
+ |-------------|--------|-----------|
103
+ | Equal tokens (capped) | {RECOVERY} | {MECHANISM} |
104
+ | Randomized order | {RECOVERY} | {MECHANISM} |
105
+ | Judge-based voting | {RECOVERY} | {MECHANISM} |
106
+ | Confidence weighting | {RECOVERY} | {MECHANISM} |
107
+ | Weak adversary | {RECOVERY} | {MECHANISM} |
108
+
109
+ ## 7. Interpretation
110
+
111
+ ### What caused the collapse?
112
+
113
+ {PRIMARY MECHANISM}
114
+
115
+ ### Why does this matter?
116
+
117
+ The core failure mode is not that agents debate poorly. It is that **deliberation bandwidth itself becomes an attack surface**. When adversarial agents receive disproportionate compute volume, they can convert extra turns and tokens into influence over group outcomes. This reframes agent compute as an **earned, scoped, decaying privilege** rather than an equal entitlement.
118
+
119
+ ### Implication for OCC
120
+
121
+ Non-transferable, decaying, scoped credits address the identified mechanism by:
122
+ 1. **Capping volume** — agents can't flood the deliberation channel
123
+ 2. **Auditing contribution** — compute must be justified by verified marginal impact
124
+ 3. **Decaying advantage** — hoarded influence dissipates
125
+ 4. **Scoping access** — debate turns are a distinct capability from file writes or shell access
126
+
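As a rough illustration of three of the four properties above (capping, decay, scoping), here is a toy credit object. The class name, fields, half-life decay rule, and scope strings are all assumptions made for this sketch and do not describe the actual OCC design; contribution auditing is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class ScopedCredit:
    """Toy credit: bound to one owner and one scope, with no transfer method."""
    owner: str
    scope: str                     # e.g. "debate_turn" (hypothetical scope name)
    balance: float                 # remaining token budget
    half_life_rounds: float = 2.0  # assumed decay rule: halve every N rounds

    def decay(self, rounds=1):
        """Shrink the unspent balance so hoarded credit loses value over time."""
        self.balance *= 0.5 ** (rounds / self.half_life_rounds)

    def spend(self, scope, tokens):
        """Spend tokens only within the credit's own scope and remaining balance."""
        if scope != self.scope or tokens > self.balance:
            return False
        self.balance -= tokens
        return True


credit = ScopedCredit(owner="agent_a", scope="debate_turn", balance=1000.0)
wrong_scope = credit.spend("shell_access", 10)   # rejected: different scope
in_scope = credit.spend("debate_turn", 400)      # accepted: 600 left
credit.decay(rounds=2)                           # one half-life: 300 left
over_budget = credit.spend("debate_turn", 500)   # rejected: exceeds balance
print(wrong_scope, in_scope, over_budget, credit.balance)
```

Leaving out any transfer operation is the simplest way to model non-transferability here; a real system would also need to prevent indirect transfer.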
127
+ ### Remaining Threats
128
+
129
+ - **Statistical power:** 30 topics × 3 seeds is underpowered for publishable confidence intervals. Need 100-300 topics.
130
+ - **Domain:** Yes/no scientific trivia is narrow. Need legal reasoning, code debate, strategic scenarios.
131
+ - **Scripted adversary:** The adversary is told to argue the wrong answer. Real adversaries would adapt.
132
+ - **Oracle dependence:** Ground-truth answers are known a priori. In real settings, truth is uncertain.
133
+ - **Judge dependence:** The judge is the same model as the debaters in every condition. Need cross-model judges.
134
+
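To put the statistical-power concern in numbers, the sketch below computes a 95% Wilson score interval for a single condition. It treats the 30 topics × 3 seeds as 90 independent trials, which is an optimistic assumption because seeds on the same topic are likely correlated. The success counts are chosen to match an accuracy near 0.567 and are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 51/90 is about 0.567; 170/300 is the same rate at a larger sample size.
low_90, high_90 = wilson_interval(51, 90)
low_300, high_300 = wilson_interval(170, 300)
print(round(low_90, 3), round(high_90, 3))
print(round(low_300, 3), round(high_300, 3))
```

At 90 trials the interval spans roughly ±10 percentage points, the same size as the 10pp recovery margin used in the win conditions. Moving to 300 trials narrows it to roughly ±5 to 6 points.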
135
+ ## 8. Cheap Baselines (from separate job)
136
+
137
+ Six additional conditions are running on a10g-large (2 seeds):
138
+
139
+ | Condition | Accuracy | Notes |
140
+ |-----------|:---:|-------|
141
+ | confidence_gated | {VALUE} | Drop turns where confidence < 0.5 |
142
+ | disagreement_gated | {VALUE} | Extra turns only if agents disagree |
143
+ | capped_debate | {VALUE} | Hard token cap at 2000/topic |
144
+ | single_agent_best_of_n | {VALUE} | 12 samples, no multi-agent |
145
+ | no_adversary_3round | {VALUE} | All 4 honest, 3 rounds |
146
+ | reputation_only | {VALUE} | Earn/lose score, no decay |
147
+
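Two of the baselines above, `confidence_gated` and `capped_debate`, amount to simple filters over a stream of turns. The sketch below is one possible reading of their Notes column; the turn dictionary layout and the choice to stop at the first turn that would exceed the budget (rather than truncating it) are assumptions, not the job's implementation.

```python
def confidence_gate(turns, threshold=0.5):
    """Drop turns whose self-reported confidence is below the threshold."""
    return [turn for turn in turns if turn["confidence"] >= threshold]


def cap_tokens(turns, budget=2000):
    """Keep turns in order until the per-topic token budget would be exceeded."""
    kept, used = [], 0
    for turn in turns:
        if used + turn["tokens"] > budget:
            break  # assumption: stop here instead of truncating the turn
        kept.append(turn)
        used += turn["tokens"]
    return kept


# Hypothetical turns for one topic (not real data).
turns = [
    {"agent": "honest_1", "confidence": 0.8, "tokens": 700},
    {"agent": "adversary", "confidence": 0.4, "tokens": 900},
    {"agent": "honest_2", "confidence": 0.6, "tokens": 800},
    {"agent": "honest_3", "confidence": 0.9, "tokens": 600},
]

gated = confidence_gate(turns)
capped = cap_tokens(turns)
print([t["agent"] for t in gated])
print([t["agent"] for t in capped])
```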
148
+ ## 9. Next Experiment
149
+
150
+ 1. **Scale:** 100-300 questions spanning multiple domains
151
+ 2. **Baseline diversity:** 10 stronger baselines (bandit, auction, self-consistency, etc.)
152
+ 3. **Adversary ratios:** {0, 1/8, 1/4, 1/2, 3/4} adversarial agents
153
+ 4. **Cross-model:** Judge models different from debater models
154
+ 5. **Oracle reliability:** Ground-truth → LLM judge → noisy → adversarial oracle
155
+
156
+ ## 10. Data Availability
157
+
158
+ - Mechanism results: `reports/debate_collapse_mechanism_results.json`
159
+ - Cheap baselines: `reports/cheap_baselines_results.json`
160
+ - Analysis CSVs: `reports/analysis/`
161
+ - Pre-registration: Hypothesis rules in `jobs/analyze_collapse.py`