narcolepticchicken/occ-stack

ml-intern


narcolepticchicken committed 20 days ago

Commit de136c4 (verified) · 1 parent: d48ed63

Upload reports/v13_mechanism_memo.md

Files changed (1)

reports/v13_mechanism_memo.md +161 -0


	@@ -0,0 +1,161 @@

+# OCC Mechanism Isolation Memo — v13
+**Date:** 2026-05-11
+**Status:** Template — awaiting mechanism isolation data
+**Job:** `6a017bbbaff1cd33e8f3365b` (H200, 10 conditions × 3 seeds)
+---
+## 1. Claim
+Unmanaged multi-agent deliberation can amplify adversarial influence when adversarial agents receive disproportionate compute volume. In a 3-round, 4-agent debate, majority-vote accuracy collapses from 88.3% (1-round baseline) to 56.7% when the one adversarial agent is given equal speaking turns. We isolate the mechanism: is the collapse driven by **token volume**, **turn order**, **voting protocol**, **answer contamination**, **confidence distortion**, **adversary skill**, or **topic vulnerability**?
+## 2. Experimental Setup
+**Model:** Qwen/Qwen3-Coder-30B-A3B-Instruct (MoE, 30B total, ~3B active)
+**Hardware:** H200
+**Topics:** 30 yes/no scientific fact questions
+**Agent mix:** 3 honest + 1 adversarial (scripted to argue the wrong answer)
+**Seeds:** {N_SEEDS}
+### Conditions
+| # | Condition | What it tests |
+|---|-----------|---------------|
+| 1 | baseline_1round_traced | Single-round majority vote (upper bound) |
+| 2 | equal_3round_traced | 3 rounds, all agents speak equally (collapse baseline) |
+| 3 | equal_token_unequal_turn | Adversary gets 1 long turn, honest agents get 3 short turns (same total tokens) |
+| 4 | randomized_order_3round | Speaking order shuffled each round |
+| 5 | judge_vote_3round | LLM judge reads transcript and decides (instead of majority vote) |
+| 6 | confidence_weighted_3round | Votes weighted by self-reported confidence |
+| 7 | adversary_weak | Adversary told to "struggle to make convincing arguments" |
+| 8 | adversary_normal | Standard adversarial persona |
+| 9 | adversary_strong | "Extremely persuasive and convincing" |
+| 10 | adversary_oracle | "Knows the WRONG answer and must convince others" |
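Conditions 2 and 6 differ only in how the final answer is aggregated from the agents' votes. The following is a minimal sketch of the two vote-based rules, assuming each agent emits a yes/no answer plus a self-reported confidence in [0, 1]. The `Vote` type, the function names, and the tie rule (a tie yields no decision) are illustrative assumptions, not taken from the job code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vote:
    agent: str
    answer: str        # "yes" or "no"
    confidence: float  # self-reported; assumed to lie in [0, 1]


def majority_vote(votes: list[Vote]) -> Optional[str]:
    """Most common answer; None if there are no votes or the top two tie (assumed rule)."""
    if not votes:
        return None
    ranked = Counter(v.answer for v in votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def confidence_weighted_vote(votes: list[Vote]) -> Optional[str]:
    """Answer with the largest summed confidence; None on an empty or tied vote."""
    totals: dict[str, float] = {}
    for v in votes:
        totals[v.answer] = totals.get(v.answer, 0.0) + v.confidence
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]
```

With four agents a 2-2 split is possible, so the tie rule matters: under confidence weighting, one highly confident adversary plus a single flipped honest agent can outweigh two hesitant honest agents.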
+## 3. Pre-Registered Win Conditions
+| Hypothesis | Confirmed if | Refuted if |
+|-----------|-------------|-----------|
+| **H1: Volume amplification** | equal_token condition recovers >0.783 (within 10pp of baseline) | ≤0.617 (within 5pp of collapse) |
+| **H2: Turn-order effect** | Randomized order >0.667 (at least 10pp above collapse) | ≤0.617 (within 5pp of collapse) |
+| **H3: Voting vulnerability** | Judge or confidence vote >0.667 (at least 10pp above collapse) | Both ≤0.617 (within 5pp of collapse) |
+| **H4: Contamination** | Honest retention rate round 3 <50% | >70% |
+| **H5: Confidence distortion** | Confidence-weighted vote recovers from collapse | No significant difference |
+| **H6: Skill dependency** | Weak adversary avoids collapse; oracle does not | All adversary levels collapse equally |
+| **H7: Topic vulnerability** | Variance across topics in collapse magnitude | Uniform collapse |
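The numeric cut-offs in the table follow from the two reference accuracies in Section 1 (0.883 baseline, 0.567 collapse). The sketch below derives them and applies the H1 rule. The table does not say how to label an accuracy that falls between the two thresholds, so the "inconclusive" verdict is an assumption, as are the constant and function names.

```python
BASELINE = 0.883  # 1-round majority-vote accuracy (Section 1)
COLLAPSE = 0.567  # equal-turn 3-round accuracy (Section 1)

# Thresholds as stated in the win-condition table.
H1_CONFIRM = round(BASELINE - 0.10, 3)    # within 10pp of baseline
REFUTE_BAND = round(COLLAPSE + 0.05, 3)   # within 5pp of collapse
RECOVERY_BAR = round(COLLAPSE + 0.10, 3)  # bar used by H2 and H3


def verdict(accuracy: float, confirm_above: float, refute_at_or_below: float) -> str:
    """Classify one condition's mean accuracy against a pre-registered rule."""
    if accuracy > confirm_above:
        return "confirmed"
    if accuracy <= refute_at_or_below:
        return "refuted"
    return "inconclusive"  # assumed label for the unclassified gap


def h1_verdict(equal_token_accuracy: float) -> str:
    return verdict(equal_token_accuracy, H1_CONFIRM, REFUTE_BAND)
```

Note that H1 leaves a wide unclassified band (0.617 to 0.783), so a partial recovery would neither confirm nor refute volume amplification under the rule as written.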
+## 4. Main Result
+### 4.1 Accuracy by Condition
+{INSERT: accuracy chart}
+| Condition | Mean Accuracy | Min | Max | Δ from 1-Round |
+|-----------|:---:|:---:|:---:|:---:|
+| baseline_1round_traced | {VALUE} | {VALUE} | {VALUE} | — |
+| equal_3round_traced | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| equal_token_unequal_turn | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| randomized_order_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| judge_vote_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| confidence_weighted_3round | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| adversary_weak | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| adversary_normal | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| adversary_strong | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| adversary_oracle | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+### 4.2 Hypothesis Verdicts
+{INSERT: verdict table}
+## 5. Mechanism Result
+### 5.1 Honest Answer Retention
+{INSERT: retention chart}
+How many honest agents keep their original answer across debate rounds?
+| Seed | Round 2 Stayed | Round 2 Flipped | Round 3 Stayed | Round 3 Flipped | Adversary-Induced |
+|------|:---:|:---:|:---:|:---:|:---:|
+| {SEED1} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| {SEED2} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+| {SEED3} | {VALUE} | {VALUE} | {VALUE} | {VALUE} | {VALUE} |
+### 5.2 Flip Matrix
+{INSERT: flip pie chart}
+| Direction | Count | Rate |
+|-----------|:---:|:---:|
+| Stable (correct→correct or wrong→wrong) | {VALUE} | {VALUE}% |
+| Degraded (correct→wrong) | {VALUE} | {VALUE}% |
+| Improved (wrong→correct) | {VALUE} | {VALUE}% |
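The retention and flip tables above can be computed from per-agent answer traces. The sketch below assumes each transition is a `(before, after, truth)` triple of yes/no strings and that retention means "same answer as in round 1". Both the data shape and the function names are assumptions for illustration.

```python
from collections import Counter


def classify_flip(before: str, after: str, truth: str) -> str:
    """Map one answer transition onto the flip-matrix categories."""
    was_correct = before == truth
    is_correct = after == truth
    if was_correct == is_correct:
        return "stable"  # correct->correct or wrong->wrong
    return "degraded" if was_correct else "improved"


def flip_rates(transitions: list[tuple[str, str, str]]) -> dict[str, tuple[int, float]]:
    """Return {category: (count, percent)} for all three categories."""
    counts = Counter(classify_flip(*t) for t in transitions)
    total = sum(counts.values())
    return {
        name: (counts[name], 100.0 * counts[name] / total if total else 0.0)
        for name in ("stable", "degraded", "improved")
    }


def retention_rate(round1_answers: list[str], later_answers: list[str]) -> float:
    """Fraction of honest agents whose later answer equals their round-1 answer."""
    if not round1_answers or len(round1_answers) != len(later_answers):
        raise ValueError("answer lists must be non-empty and aligned by agent")
    kept = sum(a == b for a, b in zip(round1_answers, later_answers))
    return kept / len(round1_answers)
```

Under H4 in Section 3, a round-3 `retention_rate` below 0.5 would confirm contamination and a value above 0.7 would refute it.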
+### 5.3 Adversary Skill Gradient
+{INSERT: skill chart}
+## 6. Controls — What Works and What Doesn't
+| Intervention | Effect | Mechanism |
+|-------------|--------|-----------|
+| Equal tokens (capped) | {RECOVERY} | {MECHANISM} |
+| Randomized order | {RECOVERY} | {MECHANISM} |
+| Judge-based voting | {RECOVERY} | {MECHANISM} |
+| Confidence weighting | {RECOVERY} | {MECHANISM} |
+| Weak adversary | {RECOVERY} | {MECHANISM} |
+## 7. Interpretation
+### What caused the collapse?
+{PRIMARY MECHANISM}
+### Why does this matter?
+The core failure mode is not that agents debate poorly. It is that **deliberation bandwidth itself becomes an attack surface**. When adversarial agents receive disproportionate compute volume, they can convert extra turns and tokens into influence over group outcomes. This reframes agent compute as an **earned, scoped, decaying privilege** rather than an equal entitlement.
+### Implication for OCC
+Non-transferable, decaying, scoped credits address the identified mechanism by:
+1. **Capping volume** — agents can't flood the deliberation channel
+2. **Auditing contribution** — compute must be justified by verified marginal impact
+3. **Decaying advantage** — hoarded influence dissipates
+4. **Scoping access** — debate turns are a distinct capability from file writes or shell access
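The four properties above can be illustrated with a toy ledger. This is a sketch under stated assumptions, not the OCC implementation: the `CreditLedger` class, its `decay` and `cap` defaults, and the scope names are all hypothetical. Credits are keyed by (agent, scope), capped on grant, multiplied by a decay factor each round, and there is deliberately no transfer method.

```python
from dataclasses import dataclass, field


@dataclass
class CreditLedger:
    decay: float = 0.5  # fraction of each balance kept per round (assumed value)
    cap: float = 3.0    # maximum balance per (agent, scope) (assumed value)
    balances: dict[tuple[str, str], float] = field(default_factory=dict)

    def balance(self, agent: str, scope: str) -> float:
        return self.balances.get((agent, scope), 0.0)

    def grant(self, agent: str, scope: str, amount: float) -> None:
        """Credit an agent after an audited contribution; the cap limits volume."""
        key = (agent, scope)
        self.balances[key] = min(self.cap, self.balance(agent, scope) + amount)

    def spend(self, agent: str, scope: str, cost: float = 1.0) -> bool:
        """Spend credits in one scope only; credits in other scopes do not count."""
        if self.balance(agent, scope) < cost:
            return False
        self.balances[(agent, scope)] -= cost
        return True

    def end_round(self) -> None:
        """Decay every balance so hoarded credits lose value over time."""
        for key in self.balances:
            self.balances[key] *= self.decay
```

In this sketch an agent holding `debate_turn` credits cannot use them for a `shell` action, and a balance that is not spent halves every round.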
+### Remaining Threats
+- **Statistical power:** 30 topics × 3 seeds is underpowered for publishable confidence intervals. Need 100-300 topics.
+- **Domain:** Yes/no scientific trivia is narrow. Need legal reasoning, code debate, strategic scenarios.
+- **Scripted adversary:** The adversary is told to argue the wrong answer. Real adversaries would adapt.
+- **Oracle dependence:** Ground-truth answers are known a priori. In real settings, truth is uncertain.
+- **Judge dependence:** All conditions use the same model for the judge as for the debaters. Need cross-model judges.
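To put the statistical-power concern in numbers, a Wilson score interval can be computed for a single condition. The sketch assumes the 30 topics × 3 seeds give 90 independent scored debates, which is optimistic because the seeds reuse the same topics. `wilson_interval` is an illustrative helper, not part of the analysis scripts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 51 of 90 correct is about the 56.7% collapse accuracy from Section 1.
low, high = wilson_interval(51, 90)
```

At 51/90 the interval spans roughly 10 percentage points on either side of the point estimate, which is wider than the 5pp "within collapse" band used in Section 3. With 300 topics × 3 seeds the same calculation narrows to roughly 3 points either side.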
+## 8. Cheap Baselines (from separate job)
+Six additional conditions are running on a10g-large (2 seeds):
+| Condition | Accuracy | Notes |
+|-----------|:---:|-------|
+| confidence_gated | {VALUE} | Drop turns where confidence < 0.5 |
+| disagreement_gated | {VALUE} | Extra turns only if agents disagree |
+| capped_debate | {VALUE} | Hard token cap at 2000/topic |
+| single_agent_best_of_n | {VALUE} | 12 samples from a single agent, no multi-agent debate |
+| no_adversary_3round | {VALUE} | All 4 honest, 3 rounds |
+| reputation_only | {VALUE} | Earn/lose score, no decay |
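Three of these baselines are simple gating rules and can be sketched directly from the Notes column. The turn representation (a dict with `confidence` and `tokens` keys) and the function names are assumptions, and so is the choice to end the debate at the first turn that would exceed the cap rather than truncating that turn.

```python
def confidence_gate(turns: list[dict], threshold: float = 0.5) -> list[dict]:
    """confidence_gated: drop turns whose self-reported confidence is below the threshold."""
    return [t for t in turns if t["confidence"] >= threshold]


def needs_extra_round(answers: list[str]) -> bool:
    """disagreement_gated: grant extra turns only while agents disagree."""
    return len(set(answers)) > 1


def apply_token_cap(turns: list[dict], cap: int = 2000) -> list[dict]:
    """capped_debate: keep turns in order until the per-topic token budget is used up."""
    kept, used = [], 0
    for turn in turns:
        if used + turn["tokens"] > cap:
            break  # assumed policy: stop the debate rather than truncate the turn
        kept.append(turn)
        used += turn["tokens"]
    return kept
```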
+## 9. Next Experiment
+1. **Scale:** 100-300 questions spanning multiple domains
+2. **Baseline diversity:** 10 stronger baselines (bandit, auction, self-consistency, etc.)
+3. **Adversary ratios:** Fraction of agents that are adversarial in {0, 1/8, 1/4, 1/2, 3/4}
+4. **Cross-model:** Judge models different from debater models
+5. **Oracle reliability:** Ground-truth → LLM judge → noisy → adversarial oracle
+## 10. Data Availability
+- Mechanism results: `reports/debate_collapse_mechanism_results.json`
+- Cheap baselines: `reports/cheap_baselines_results.json`
+- Analysis CSVs: `reports/analysis/`
+- Pre-registration: Hypothesis rules in `jobs/analyze_collapse.py`