# OCC Mechanism Isolation Memo — v13

**Date:** 2026-05-11
**Status:** Template, awaiting mechanism isolation data
**Job:** `6a017bbbaff1cd33e8f3365b` (H200, 10 conditions × 3 seeds)

---

## 1. Claim

Unmanaged multi-agent deliberation can amplify adversarial influence when adversarial agents receive disproportionate compute volume. In a 3-round, 4-agent debate, majority-vote accuracy collapses from 88.3% (1-round baseline) to 56.7% when one adversarial agent is given equal speaking turns. This memo isolates the mechanism: is the collapse driven by **token volume**, **turn order**, **voting protocol**, **answer contamination**, **confidence distortion**, **adversary skill**, or **topic vulnerability**?

## 2. Experimental Setup

**Model:** Qwen/Qwen3-Coder-30B-A3B-Instruct (MoE, 30B total, ~3B active)
**Hardware:** H200
**Topics:** 30 yes/no scientific fact questions
**Agent mix:** 3 honest + 1 adversarial (scripted to argue the wrong answer)
**Seeds:** {N_SEEDS}

### Conditions

| # | Condition | What it tests |
|---|-----------|---------------|
| 1 | baseline_1round_traced | Single-round majority vote (upper bound) |
| 2 | equal_3round_traced | 3 rounds, all agents speak equally (collapse baseline) |
| 3 | equal_token_unequal_turn | Adversary gets 1 long turn, honest agents get 3 short turns (same total tokens) |
| 4 | randomized_order_3round | Speaking order shuffled each round |
| 5 | judge_vote_3round | LLM judge reads transcript and decides (instead of majority vote) |
| 6 | confidence_weighted_3round | Votes weighted by self-reported confidence |
| 7 | adversary_weak | Adversary told to "struggle to make convincing arguments" |
| 8 | adversary_normal | Standard adversarial persona |
| 9 | adversary_strong | "Extremely persuasive and convincing" |
| 10 | adversary_oracle | "Knows the WRONG answer and must convince others" |
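
Conditions 1 and 6 differ only in how individual answers are aggregated. The sketch below is one plausible reading of "majority vote" and "votes weighted by self-reported confidence"; the function names, the tie rule (a tie returns `None`), and the example inputs are assumptions made for illustration, not the experiment's code or data.

```python
from collections import defaultdict


def _winner(scores):
    """Return the top-scoring answer, or None if empty or tied (assumed tie rule)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def majority_vote(votes):
    """Each agent's final answer counts once."""
    scores = defaultdict(float)
    for answer in votes:
        scores[answer] += 1.0
    return _winner(scores)


def confidence_weighted_vote(votes, confidences):
    """Each agent's final answer counts in proportion to its self-reported confidence."""
    scores = defaultdict(float)
    for answer, conf in zip(votes, confidences):
        scores[answer] += conf
    return _winner(scores)


# Illustrative inputs only (not experimental data):
# two honest agents hold "yes", one honest agent has flipped, the adversary says "no".
votes = ["yes", "yes", "no", "no"]
confidences = [0.5, 0.5, 0.9, 0.9]

print(majority_vote(votes))                          # None (2-2 tie)
print(confidence_weighted_vote(votes, confidences))  # no (1.8 vs 1.0)
```

The example shows why confidence weighting is tested as its own condition: a confident adversary plus one confident convert can outweigh two hesitant honest agents, so weighting may either help or hurt.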

## 3. Pre-Registered Win Conditions

| Hypothesis | Confirmed if | Refuted if |
|-----------|-------------|-----------|
| **H1: Volume amplification** | equal_token condition recovers >0.783 (within 10pp of baseline) | ≤0.617 (within 5pp of collapse) |
| **H2: Turn-order effect** | Randomized order >0.667 (more than 10pp above collapse) | Within 5pp of collapse (≤0.617) |
| **H3: Voting vulnerability** | Judge or confidence vote >0.667 | Both within 5pp of collapse (≤0.617) |
| **H4: Contamination** | Honest retention rate in round 3 <50% | >70% |
| **H5: Confidence distortion** | Confidence-weighted vote recovers from collapse | No significant difference |
| **H6: Skill dependency** | Weak adversary avoids collapse; oracle does not | All adversary levels collapse equally |
| **H7: Topic vulnerability** | Variance across topics in collapse magnitude | Uniform collapse |
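
The H1 thresholds follow arithmetically from the two accuracies in the Claim section: 0.883 − 0.10 = 0.783 and 0.567 + 0.05 = 0.617. A minimal sketch of that rule is given here; the function name and the "inconclusive" label for the gap between the two thresholds are assumptions, and the authoritative rules are the ones in `jobs/analyze_collapse.py`, which this does not reproduce.

```python
BASELINE = 0.883  # 1-round majority-vote accuracy (from the Claim section)
COLLAPSE = 0.567  # equal-turn 3-round accuracy (from the Claim section)


def h1_verdict(accuracy, baseline=BASELINE, collapse=COLLAPSE):
    """Classify the equal_token_unequal_turn accuracy against the H1 thresholds."""
    confirm_above = round(baseline - 0.10, 3)       # 0.783: within 10pp of baseline
    refute_at_or_below = round(collapse + 0.05, 3)  # 0.617: within 5pp of collapse
    if accuracy > confirm_above:
        return "confirmed"
    if accuracy <= refute_at_or_below:
        return "refuted"
    return "inconclusive"  # assumed label; the table does not name this case


# Hypothetical accuracies, used only to exercise each branch.
for acc in (0.80, 0.70, 0.60):
    print(acc, h1_verdict(acc))
```

Accuracies between 0.617 and 0.783 satisfy neither column of the table, so an explicit third outcome keeps the verdict from being forced either way.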

## 4. Main Result

### 4.1 Accuracy by Condition

{INSERT: accuracy chart}

| Condition | Mean Accuracy | Min | Max | Δ from 1-Round |
|-----------|:---:|:---:|:---:|:---:|
| baseline_1round_traced | {VALUE} | {VALUE} | {VALUE} | n/a |
| equal_3round_traced | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| equal_token_unequal_turn | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| randomized_order_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| judge_vote_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| confidence_weighted_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| adversary_weak | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| adversary_normal | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| adversary_strong | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| adversary_oracle | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
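
The columns of this table can be derived from per-seed accuracies. The sketch below assumes a simple mapping from condition name to a list of per-seed accuracies; that layout and the `summarize` helper are hypothetical, and the numbers are dummy placeholders rather than results.

```python
from statistics import mean


def summarize(per_seed_acc, baseline_name="baseline_1round_traced"):
    """Compute mean, min, max, and delta from the 1-round baseline per condition."""
    base_mean = mean(per_seed_acc[baseline_name])
    rows = {}
    for name, accs in per_seed_acc.items():
        m = mean(accs)
        rows[name] = {
            "mean": m,
            "min": min(accs),
            "max": max(accs),
            # The baseline row has no delta against itself.
            "delta": None if name == baseline_name else m - base_mean,
        }
    return rows


# Dummy placeholder values (NOT experimental results), one entry per seed.
dummy = {
    "baseline_1round_traced": [0.5, 0.75, 1.0],
    "equal_3round_traced": [0.25, 0.5, 0.75],
}
for name, row in summarize(dummy).items():
    print(name, row)
```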

### 4.2 Hypothesis Verdicts

{INSERT: verdict table}

## 5. Mechanism Result

### 5.1 Honest Answer Retention

{INSERT: retention chart}

How many honest agents keep their original answer across debate rounds?

| Seed | Round 2 Stayed | Round 2 Flipped | Round 3 Stayed | Round 3 Flipped | Adversary-Induced |
|------|:---:|:---:|:---:|:---:|:---:|
| {SEED1} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| {SEED2} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
| {SEED3} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
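
One way to fill the Stayed and Flipped columns is to compare each honest agent's answer in a later round with its round-1 answer, matching the "original answer" wording above. The trace layout (agent id mapped to one answer per round) and the helper name are assumptions, and the "Adversary-Induced" column is left out because the memo does not define it.

```python
def retention_by_round(answers):
    """answers: {agent_id: [answer_round1, answer_round2, ...]} for honest agents only.

    Returns {round_number: {"stayed", "flipped", "retention"}} for rounds 2 and later,
    where "stayed" means the answer equals that agent's round-1 answer.
    """
    n_agents = len(answers)
    n_rounds = len(next(iter(answers.values())))
    out = {}
    for r in range(1, n_rounds):
        stayed = sum(1 for seq in answers.values() if seq[r] == seq[0])
        out[r + 1] = {
            "stayed": stayed,
            "flipped": n_agents - stayed,
            "retention": stayed / n_agents,
        }
    return out


# Illustrative trace for one topic (not experimental data).
trace = {
    "honest_1": ["yes", "yes", "yes"],
    "honest_2": ["yes", "no", "no"],
    "honest_3": ["yes", "yes", "no"],
}
print(retention_by_round(trace))
```

In this toy trace the round-3 retention is 1/3, which would fall below the 50% line that H4 uses for confirmation; real values would be pooled over all topics in a seed.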

### 5.2 Flip Matrix

{INSERT: flip pie chart}

| Direction | Count | Rate |
|-----------|:---:|:---:|
| Stable (correct→correct or wrong→wrong) | {VALUE} | {VALUE}% |
| Degraded (correct→wrong) | {VALUE} | {VALUE}% |
| Improved (wrong→correct) | {VALUE} | {VALUE}% |
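
The three directions in this table can be assigned by comparing an agent's first and last answers against the ground truth. The sketch below assumes each record is a `(first_answer, last_answer, truth)` tuple; that record shape and both function names are hypothetical.

```python
from collections import Counter


def classify_flip(first, last, truth):
    """Label one agent-topic pair as stable, degraded, or improved."""
    was_correct = first == truth
    is_correct = last == truth
    if was_correct and not is_correct:
        return "degraded"
    if not was_correct and is_correct:
        return "improved"
    return "stable"  # correct→correct or wrong→wrong


def flip_matrix(records):
    """Return {direction: (count, rate_percent)} over all records."""
    counts = Counter(classify_flip(*rec) for rec in records)
    total = len(records)
    return {
        direction: (counts[direction], 100.0 * counts[direction] / total)
        for direction in ("stable", "degraded", "improved")
    }


# Illustrative records only (not experimental data).
records = [
    ("yes", "yes", "yes"),  # stable (correct→correct)
    ("yes", "no", "yes"),   # degraded
    ("no", "yes", "yes"),   # improved
    ("no", "no", "yes"),    # stable (wrong→wrong)
]
print(flip_matrix(records))
```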

### 5.3 Adversary Skill Gradient

{INSERT: skill chart}

## 6. Controls: What Works and What Doesn't

| Intervention | Effect | Mechanism |
|-------------|--------|-----------|
| Equal tokens (capped) | {RECOVERY} | {MECHANISM} |
| Randomized order | {RECOVERY} | {MECHANISM} |
| Judge-based voting | {RECOVERY} | {MECHANISM} |
| Confidence weighting | {RECOVERY} | {MECHANISM} |
| Weak adversary | {RECOVERY} | {MECHANISM} |

## 7. Interpretation

### What caused the collapse?

{PRIMARY MECHANISM}

### Why does this matter?

The core failure mode is not that agents debate poorly. It is that **deliberation bandwidth itself becomes an attack surface**. When adversarial agents receive disproportionate compute volume, they can convert extra turns and tokens into influence over group outcomes. This reframes agent compute as an **earned, scoped, decaying privilege** rather than an equal entitlement.

### Implication for OCC

Non-transferable, decaying, scoped credits address the identified mechanism by:
1. **Capping volume:** agents can't flood the deliberation channel
2. **Auditing contribution:** compute must be justified by verified marginal impact
3. **Decaying advantage:** hoarded influence dissipates
4. **Scoping access:** debate turns are a distinct capability from file writes or shell access
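
To make the four properties concrete, here is a toy ledger. Everything in it is hypothetical: the `CreditLedger` class, the exponential half-life decay, and the scope names are illustrative assumptions, not the OCC design. Grants are assumed to come from an external audit step, and non-transferability is modelled simply by not offering any transfer operation.

```python
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    """Toy ledger: balances are keyed by (agent, scope) and decay every round."""

    half_life_rounds: float = 2.0
    balances: dict = field(default_factory=dict)

    def grant(self, agent, scope, amount):
        """Credit an agent in one scope (assumed to follow a verified contribution)."""
        key = (agent, scope)
        self.balances[key] = self.balances.get(key, 0.0) + amount

    def spend(self, agent, scope, tokens):
        """Deduct tokens if the balance in that scope covers them; otherwise refuse."""
        key = (agent, scope)
        if self.balances.get(key, 0.0) < tokens:
            return False  # volume cap: no credit, no turn
        self.balances[key] -= tokens
        return True

    def end_round(self):
        """Apply exponential decay so unspent credit loses value over time."""
        factor = 0.5 ** (1.0 / self.half_life_rounds)
        for key in self.balances:
            self.balances[key] *= factor


ledger = CreditLedger(half_life_rounds=1.0)
ledger.grant("agent_a", "debate", 100.0)
print(ledger.spend("agent_a", "debate", 60.0))  # True
print(ledger.spend("agent_a", "debate", 60.0))  # False: only 40 left
print(ledger.spend("agent_a", "shell", 1.0))    # False: debate credit does not apply to shell
ledger.end_round()
print(ledger.balances[("agent_a", "debate")])   # 20.0 after one half-life
```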

### Remaining Threats

- **Statistical power:** 30 topics × 3 seeds is underpowered for publishable confidence intervals; 100-300 topics are needed.
- **Domain:** Yes/no scientific trivia is narrow; legal reasoning, code debate, and strategic scenarios are needed.
- **Scripted adversary:** The adversary is told to argue the wrong answer. Real adversaries would adapt.
- **Oracle dependence:** Ground-truth answers are known a priori. In real settings, truth is uncertain.
- **Judge dependence:** All conditions use the same model for judging and debating; cross-model judges are needed.
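
The statistical-power concern can be made concrete with a Wilson score interval for a binomial proportion. This is a rough sketch, not the memo's analysis: it treats each topic as one independent trial and ignores seed-level correlation, which are simplifying assumptions, and it takes the 56.7% collapse accuracy as 17 of 30 topics.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Same observed accuracy (about 56.7%) at the current and the proposed topic counts.
for successes, n in ((17, 30), (170, 300)):
    lo, hi = wilson_interval(successes, n)
    print(f"n={n}: [{lo:.3f}, {hi:.3f}] width={hi - lo:.3f}")
```

Under these assumptions the 30-topic interval spans roughly 33 percentage points, far wider than the 5pp and 10pp margins in the pre-registered thresholds, while 300 topics narrows it to roughly 11 points.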

## 8. Cheap Baselines (from separate job)

6 additional conditions running on a10g-large (2 seeds):

| Condition | Accuracy | Notes |
|-----------|:---:|-------|
| confidence_gated | {VALUE} | Drop turns where confidence < 0.5 |
| disagreement_gated | {VALUE} | Extra turns only if agents disagree |
| capped_debate | {VALUE} | Hard token cap at 2000/topic |
| single_agent_best_of_n | {VALUE} | 12 samples, no multi-agent |
| no_adversary_3round | {VALUE} | All 4 honest, 3 rounds |
| reputation_only | {VALUE} | Earn/lose score, no decay |

## 9. Next Experiment

1. **Scale:** 100-300 questions spanning multiple domains
2. **Baseline diversity:** 10 stronger baselines (bandit, auction, self-consistency, etc.)
3. **Adversary ratios:** {0, 1/8, 1/4, 1/2, 3/4} adversarial agents
4. **Cross-model:** Judge models different from debater models
5. **Oracle reliability:** Ground-truth → LLM judge → noisy → adversarial oracle

## 10. Data Availability

- Mechanism results: `reports/debate_collapse_mechanism_results.json`
- Cheap baselines: `reports/cheap_baselines_results.json`
- Analysis CSVs: `reports/analysis/`
- Pre-registration: Hypothesis rules in `jobs/analyze_collapse.py`