narcolepticchicken committed on
Commit
ceee38b
·
verified ·
1 Parent(s): 745d481

Upload design.md

Files changed (1)
  1. design.md +119 -140
design.md CHANGED
@@ -1,153 +1,132 @@
1
  # OCC Design Document
2
 
3
- ## 1. Core Principles
4
 
5
- 1. **Verified Impact First**: Credits are earned only after an oracle verifies marginal value.
6
- 2. **Non-Transferable Credits**: Agents cannot launder credits through others.
7
- 3. **Decaying Credits**: Hoarding is discouraged; use-it-or-lose-it dynamics.
8
- 4. **Capability-Based Rights**: Rights are per-resource, not blanket access.
9
- 5. **Auditable Accounting**: Every credit change has provenance.
10
 
11
- ## 2. Impact Oracle
12
 
13
- ### Scoring Modes
14
 
15
- **Code Tasks**
16
- - `unit_test_pass`: binary pass/fail
17
- - `pass_at_k`: fraction passing among k samples
18
- - `regression`: does the new state break prior passing tests?
19
- - `compute_comparison`: score normalized by tokens/FLOPs used
20
 
21
- **Retrieval QA Tasks**
22
- - `answer_correctness`: exact / fuzzy match to gold
23
- - `evidence_support`: NLI entailment check on retrieved evidence
24
- - `hallucination`: NLI contradiction or unsupported claims
25
- - `abstention_utility`: correct abstention on unanswerable questions
26
- - `calibration`: Brier score / ECE on confidence predictions
27
- - `proper_score`: proper scoring rule reward
28
 
29
- **Multi-Agent Debate Tasks**
30
- - `decision_quality`: final answer correctness
31
- - `influence_efficiency`: marginal contribution per token/compute
32
- - `throughput`: decisions per compute unit
33
 
34
- ### Reward Formula
35
 
36
  ```
37
- reward = verified_task_score
38
- + abstention_utility
39
- + calibration_bonus
40
- - hallucination_penalty
41
- - confident_wrong_penalty
42
- - compute_cost_penalty
43
- - gaming_penalty
44
-
45
- where:
46
- verified_task_score ∈ [0, 1] (pass/fail or accuracy)
47
- abstention_utility ∈ {-1, 0, +1} (+1 for correct abstain, -1 for incorrect abstain)
48
- calibration_bonus = (1 - brier_score) * 0.2
49
- hallucination_penalty = contradiction_score * 0.5
50
- confident_wrong_penalty = confidence * (1 - correct) * 0.3
51
- compute_cost_penalty = (cost / budget) * 0.2
52
- gaming_penalty = detected_pattern_penalty (see below)
 
 
53
  ```
54
 
55
- ### Gaming Detection
56
-
57
- - **Spam**: repeated low-value actions within short window → penalty
58
- - **Hoarding**: credit balance above threshold for N epochs → decay acceleration
59
- - **Transfer**: indirect credit laundering via coordinated task submission → ban
60
- - **Judge exploitation**: output distribution shift toward weak-judge preferences → KL penalty
61
- - **Over-abstention**: abstention rate > threshold → negative reward
62
- - **Verbose padding**: tokens per unit impact below threshold → penalty
63
-
64
- ## 3. Credit Ledger
65
-
66
- ### Schema
67
-
68
- Each entry: `(agent_id, task_id, action_id, earned, spent, decayed, remaining, reason, oracle_score, compute_cost, timestamp, capability_scope)`
69
-
70
- ### Rules
71
-
72
- 1. **Non-transferable**: `transfer(from, to, amount)` always returns `False`.
73
- 2. **Decay**: `remaining *= exp(-lambda * delta_t)` each evaluation cycle.
74
- 3. **Task scope**: credits earned in task A cannot fund task B unless explicitly pooled.
75
- 4. **Capability scope**: credits for "retrieval" cannot fund "file_write".
76
- 5. **Revocation**: negative outcomes can revoke credits retroactively within a window.
77
- 6. **Provenance**: every entry references an oracle decision hash.
78
-
79
- ## 4. Resource Broker
80
-
81
- ### Decision Matrix
82
-
83
- | Condition | Decision |
84
- |-----------|----------|
85
- | credit >= threshold, low risk | `allow` |
86
- | credit < threshold, low risk | `deny` |
87
- | credit >= threshold, high risk | `require_approval` |
88
- | credit >= threshold, suspicious pattern | `downgrade` or `escalate` |
89
- | emergency override | `escalate` |
90
-
91
- ### Resources
92
-
93
- - `model_call_small` / `model_call_large`
94
- - `retrieval_call`
95
- - `verifier_call`
96
- - `debate_turn`
97
- - `file_write`
98
- - `shell_execute`
99
- - `memory_write`
100
- - `human_escalation`
101
-
102
- ## 5. GRPO Hook
103
-
104
- We implement a reward function compatible with TRL's GRPOTrainer that maps Oracle outputs to per-group rewards. Since full training may be compute-limited, we provide:
105
-
106
- 1. `reward_fn(completions, oracle_scores)` — returns tensor of rewards
107
- 2. `GRPOHook` class — wraps Oracle + Ledger + Broker for online evaluation
108
- 3. `OfflineComparator` — compares policies using saved trajectories when training is infeasible
109
-
110
- ## 6. Benchmarks
111
-
112
- ### Benchmark 1: Code Compute Allocation
113
- - Dataset: `openai/openai_humaneval` or `evalplus/humanevalplus`
114
- - Baselines: fixed compute, verifier retries, OCC allocation
115
- - Metrics: pass@1, pass@k, tokens used, model calls, cost, compute saved at iso-accuracy
116
-
117
- ### Benchmark 2: Retrieval QA
118
- - Dataset: synthetic grounded QA + adversarial evidence
119
- - Baselines: direct answer, RAG, RAG+verifier, OCC
120
- - Metrics: correctness, hallucination rate, abstention utility, ECE, retrieval calls, cost
121
-
122
- ### Benchmark 3: Multi-Agent Debate
123
- - Dataset: synthetic factual disputes + code debates
124
- - Baselines: equal turns, majority vote, confidence-weighted, OCC
125
- - Metrics: decision quality, compute used, quality per GPU-second, bad-agent containment
126
-
127
- ## 7. Ablations
128
-
129
- 1. No credit ledger (oracle score used directly)
130
- 2. Transferable credits
131
- 3. Non-decaying credits
132
- 4. No abstention reward
133
- 5. No calibration penalty
134
- 6. No cost penalty
135
- 7. No anti-gaming penalty
136
- 8. No broker (oracle score only)
137
- 9. Broker with static rules
138
- 10. Broker with learned/score-based rights
139
-
140
- ## 8. Anti-Gaming Tests
141
-
142
- - Spam low-value actions
143
- - Hoard credits
144
- - Transfer credit indirectly
145
- - Exploit weak judge
146
- - Verbose but low-value debate turns
147
- - Over-abstention
148
- - Overuse retrieval
149
- - Manipulate confidence
150
- - Optimize for unit tests while breaking hidden tests
151
- - Collude in multi-agent debate
152
-
153
- Measure: gaming success rate, credit leakage, robustness under judge replacement, quality degradation, broker containment.
 
1
  # OCC Design Document
2
 
3
+ ## Philosophy
4
 
5
+ Compute is the scarce resource in modern agent systems. Every action (a tool call, a retrieval, a debate turn, a verification pass) costs tokens, GPU seconds, or API dollars. Current systems allocate compute upfront (fixed budget) or reactively (retry on failure). OCC proposes a **proactive, earned-compute** model where agents must demonstrate marginal value before receiving more resources.
6
 
7
+ ## Core Thesis
8
 
9
+ > "Agents should earn compute, not spend it."
10
 
11
+ The system is named Oracle-Credit-Compute because every compute decision flows through three stages:
12
+ 1. **Oracle:** Score the marginal impact of an action
13
+ 2. **Credit:** Update the agent's credit balance based on that score
14
+ 3. **Compute:** The broker decides whether to grant the requested resource
 
15
 
16
+ ## Architecture
17
 
18
+ ### Impact Oracle
19
 
20
+ **Design principle:** Rule-based, auditable, resistant to reward hacking.
21
+
22
+ Neural reward models are vulnerable to Goodhart's Law and reward hacking (Gao et al., 2023; Skalse et al., 2022). A policy optimized against a neural RM can reach high scores without producing correct answers. OCC instead uses rule-based scoring with the following properties:
23
+
24
+ - **Verifiable outcomes:** Code correctness is checked by running tests. QA correctness is checked against gold answers. Debate quality is checked against ground truth.
25
+ - **Cost-adjusted scores:** Every score subtracts a compute cost penalty. This prevents agents from achieving correctness through brute-force token spending.
26
+ - **Proper scoring rules:** Calibration bonus via Brier score encourages well-calibrated confidence, not just correctness.
27
+ - **Anti-gaming detectors:** Explicit checks for hidden-test gaming, spam, collusion, and over-abstention.
28
+
29
+ ### Credit Ledger
30
+
31
+ **Design principle:** Non-transferable, decaying, capability-scoped.
32
+
33
+ - **Non-transferable:** `transfer()` always returns `False`. This prevents colluding agents from pooling credits or laundering them through intermediaries.
34
+ - **Exponential decay:** Idle credits decay at rate λ per time step. This prevents hoarding and encourages agents to use credits or lose them.
35
+ - **Capability-scoped:** Credits are scoped to specific capabilities (`retrieval`, `model_call`, `file_write`). An agent that is good at retrieval should not automatically get dangerous write permissions.
36
+ - **Full provenance:** Every entry has an oracle score, compute cost, timestamp, and reason. This enables auditing and debugging.
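
The four ledger properties above can be illustrated with a minimal sketch. This is not the project's implementation: the class names, field names, the `decay_rate` default, and the choice to apply decay lazily whenever a balance is touched are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    """One provenance record (field names are illustrative assumptions)."""
    agent_id: str
    capability: str
    amount: float
    oracle_score: float
    compute_cost: float
    timestamp: float
    reason: str


@dataclass
class CreditLedger:
    decay_rate: float = 0.1  # lambda per time step (assumed value)
    balances: dict = field(default_factory=dict)   # (agent, capability) -> credits
    last_seen: dict = field(default_factory=dict)  # (agent, capability) -> last update time
    entries: list = field(default_factory=list)

    def _decay(self, key, now):
        # Exponentially decay the balance over the idle interval since the last update.
        elapsed = now - self.last_seen.get(key, now)
        self.balances[key] = self.balances.get(key, 0.0) * math.exp(-self.decay_rate * elapsed)
        self.last_seen[key] = now

    def earn(self, agent_id, capability, amount, oracle_score, compute_cost, now, reason):
        # Credits are stored per (agent, capability), so they never cross scopes.
        key = (agent_id, capability)
        self._decay(key, now)
        self.balances[key] += amount
        self.entries.append(
            LedgerEntry(agent_id, capability, amount, oracle_score, compute_cost, now, reason)
        )

    def balance(self, agent_id, capability, now):
        key = (agent_id, capability)
        self._decay(key, now)
        return self.balances[key]

    def transfer(self, from_agent, to_agent, amount):
        # Non-transferable by design: every transfer attempt is refused.
        return False
```

Keying balances by `(agent_id, capability)` is one simple way to get capability scoping: credits earned for `retrieval` are simply invisible when the broker asks about `file_write`.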
37
+
38
+ ### Resource Broker
39
+
40
+ **Design principle:** Risk-adjusted, capability-based, dynamic.
41
+
42
+ Resources are classified by risk:
43
+ - **Low:** `retrieval_call`, `debate_turn` — threshold 0.5 credits
44
+ - **Medium:** `model_call`, `verifier_call`, `memory_write` — threshold 2.0 credits
45
+ - **High:** `file_write`, `shell_execute`, `human_escalation` — threshold 5.0 credits, may require approval
46
+
47
+ The broker can make six decisions:
48
+ - `ALLOW`: credits ≥ threshold, no flags
49
+ - `DENY`: credits < threshold × 0.5
50
+ - `REQUIRE_APPROVAL`: high-risk resource and a high risk score
51
+ - `DOWNGRADE`: credits between 0.5× and 1.0× threshold → downgrade to cheaper resource
52
+ - `ESCALATE`: repeated denials from same agent
53
+ - `ASK_JUSTIFICATION`: credits insufficient but agent has some history
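
A minimal sketch of how the six decisions might be ordered. The thresholds and resource classes come from the lists above, but the precedence of the checks, the `risk_cutoff` and `max_denials` parameters, and the `has_history` flag are assumptions; the document does not specify how overlapping conditions (for example `DENY` versus `ASK_JUSTIFICATION`) are resolved.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    DOWNGRADE = "downgrade"
    ESCALATE = "escalate"
    ASK_JUSTIFICATION = "ask_justification"


# Thresholds and risk classes as listed in the design document.
RISK_THRESHOLDS = {"low": 0.5, "medium": 2.0, "high": 5.0}
RESOURCE_RISK = {
    "retrieval_call": "low", "debate_turn": "low",
    "model_call": "medium", "verifier_call": "medium", "memory_write": "medium",
    "file_write": "high", "shell_execute": "high", "human_escalation": "high",
}


def decide(resource, credits, risk_score=0.0, prior_denials=0, has_history=False,
           risk_cutoff=0.7, max_denials=3):
    """Return a broker decision; the check order below is an assumed precedence."""
    risk = RESOURCE_RISK[resource]
    threshold = RISK_THRESHOLDS[risk]
    if prior_denials >= max_denials:
        return Decision.ESCALATE           # repeated denials from the same agent
    if risk == "high" and risk_score >= risk_cutoff:
        return Decision.REQUIRE_APPROVAL   # high-risk resource plus high risk score
    if credits >= threshold:
        return Decision.ALLOW
    if credits >= 0.5 * threshold:
        return Decision.DOWNGRADE          # marginal balance: offer a cheaper resource
    if has_history:
        return Decision.ASK_JUSTIFICATION  # too few credits, but some track record
    return Decision.DENY
```

For instance, `decide("model_call", 1.5)` falls between 0.5× and 1.0× of the medium threshold (2.0) and so yields `Decision.DOWNGRADE`.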
54
+
55
+ ### GRPO/RL Hook
56
+
57
+ **Design principle:** Reward = verified impact - compute cost.
58
+
59
+ The reward function wraps the Impact Oracle and produces a scalar reward per completion. It is designed to be passed directly to TRL's `GRPOTrainer` as `reward_funcs`.
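
As a rough sketch of the expected shape: TRL's `GRPOTrainer` calls each custom reward function with keyword arguments such as `completions` (plus any extra dataset columns) and expects a list with one float per completion. The `score_completion` helper and the `reference` column below are hypothetical stand-ins for the Impact Oracle and a gold-answer field, and the sketch assumes plain-string completions rather than the conversational format.

```python
def score_completion(completion: str, reference: str) -> float:
    """Hypothetical oracle stand-in: exact-match correctness minus a length-based cost."""
    correct = 1.0 if completion.strip() == reference.strip() else 0.0
    return correct - 0.0001 * len(completion)


def occ_reward_func(completions, reference=None, **kwargs):
    """Return one scalar reward per completion; unused keyword arguments are ignored."""
    references = reference if reference is not None else [""] * len(completions)
    return [score_completion(c, r) for c, r in zip(completions, references)]
```

Under these assumptions the function could be passed as `reward_funcs=occ_reward_func` when constructing the trainer.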
60
+
61
+ The offline comparator allows policy comparison without training:
62
+ 1. Generate trajectories from two policies on the same test set
63
+ 2. Score both with the same reward hook
64
+ 3. Compare mean rewards, win rates, and failure rates
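
The three comparison steps can be sketched as a small function over already-scored trajectories. The function name, the paired-list input, and the `failure_threshold` parameter (a reward below it counts as a failure) are assumptions for illustration.

```python
from statistics import mean


def compare_policies(rewards_a, rewards_b, failure_threshold=0.0):
    """Compare two policies scored on the same test items, in the same order."""
    if not rewards_a or len(rewards_a) != len(rewards_b):
        raise ValueError("both policies must be scored on the same non-empty test set")
    n = len(rewards_a)
    wins_a = sum(a > b for a, b in zip(rewards_a, rewards_b))
    wins_b = sum(b > a for a, b in zip(rewards_a, rewards_b))
    return {
        "mean_a": mean(rewards_a),
        "mean_b": mean(rewards_b),
        "win_rate_a": wins_a / n,  # ties count for neither policy
        "win_rate_b": wins_b / n,
        "failure_rate_a": sum(r < failure_threshold for r in rewards_a) / n,
        "failure_rate_b": sum(r < failure_threshold for r in rewards_b) / n,
    }
```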
65
+
66
+ ## Reward Formula
67
 
68
  ```
69
+ reward =
70
+ verified_task_score
71
+ + abstention_utility
72
+ + calibration_bonus
73
+ - hallucination_penalty
74
+ - confident_wrong_penalty
75
+ - compute_cost_penalty
76
+ - gaming_penalty
77
+
78
+ Where:
79
+ verified_task_score = correctness * weight_correctness
80
+ abstention_utility = +1.0 if correct abstain, -1.0 if wrong abstain, 0 otherwise
81
+ calibration_bonus = (1 - brier_score) * weight_calibration
82
+ brier_score = (confidence - outcome)^2
83
+ hallucination_penalty = 2.0 if entailment < 0.5 and contradiction > 0.5
84
+ confident_wrong_penalty = 3.0 if confidence > 0.8 and correctness < 0.5
85
+ compute_cost_penalty = compute_cost * 0.0001
86
+ gaming_penalty = 2.0 if hidden_tests fail while public pass
87
  ```
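
The formula translates almost line for line into Python. The sketch below makes several assumptions that the document leaves open: `weight_correctness=1.0` and `weight_calibration=0.2` are placeholder weights, `outcome` is taken to be 1 when `correctness >= 0.5`, every conditional penalty is 0 when its condition does not hold, and a "correct abstain" means abstaining on an unanswerable question.

```python
def occ_reward(correctness, confidence, compute_cost, *,
               abstained=False, answerable=True,
               entailment=1.0, contradiction=0.0,
               public_pass=True, hidden_pass=True,
               weight_correctness=1.0, weight_calibration=0.2):
    """Scalar reward following the formula above (weights are assumed defaults)."""
    verified_task_score = correctness * weight_correctness

    # +1 for abstaining on an unanswerable question, -1 for abstaining on an answerable one.
    abstention_utility = 0.0
    if abstained:
        abstention_utility = -1.0 if answerable else 1.0

    outcome = 1.0 if correctness >= 0.5 else 0.0  # assumed binarisation
    brier_score = (confidence - outcome) ** 2
    calibration_bonus = (1.0 - brier_score) * weight_calibration

    hallucination_penalty = 2.0 if (entailment < 0.5 and contradiction > 0.5) else 0.0
    confident_wrong_penalty = 3.0 if (confidence > 0.8 and correctness < 0.5) else 0.0
    compute_cost_penalty = compute_cost * 0.0001
    gaming_penalty = 2.0 if (public_pass and not hidden_pass) else 0.0

    return (verified_task_score + abstention_utility + calibration_bonus
            - hallucination_penalty - confident_wrong_penalty
            - compute_cost_penalty - gaming_penalty)
```

With these assumed weights, a correct answer at confidence 0.9 costing 1000 compute units scores 1.0 + 0.198 - 0.1 = 1.098, while the same answer that passes public tests but fails hidden tests drops by 2.0 to -0.902.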
88
 
89
+ ## Anti-Gaming Design
90
+
91
+ ### Spam Attacks
92
+ - Detection: Repeated low-value actions (compute > 100, raw_score < 0.5)
93
+ - Containment: Oracle subtracts gaming_penalty. Ledger can revoke all credits on explicit detection.
94
+
95
+ ### Hidden-Test Gaming
96
+ - Detection: `public_pass=True` but `hidden_pass=False`
97
+ - Containment: Immediate gaming_penalty=2.0 subtracted from raw score.
98
+
99
+ ### Credit Laundering
100
+ - Prevention: `transfer()` returns `False` unconditionally.
101
+
102
+ ### Credit Hoarding
103
+ - Prevention: Exponential decay on idle credits.
104
+
105
+ ### Over-Abstention
106
+ - Detection: Agent abstains on answerable questions.
107
+ - Containment: Wrong abstentions get -abstention_bonus (-1.0).
108
+
109
+ ### Confidence Manipulation
110
+ - Detection: The Brier score inside the calibration bonus penalizes confidence that does not match outcomes.
111
+ - Containment: Overconfident wrong answers get confident_wrong_penalty=3.0.
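
Two of the detectors above are simple enough to sketch directly. The spam thresholds (compute > 100, raw_score < 0.5) are taken from the text; the `min_repeats` count that defines "repeated" and the dictionary layout of an action are assumptions.

```python
def is_low_value(action, compute_floor=100, score_ceiling=0.5):
    """An action is low-value if it was expensive and scored poorly."""
    return action["compute"] > compute_floor and action["raw_score"] < score_ceiling


def detect_spam(actions, min_repeats=3):
    """Flag an agent whose recent actions include at least min_repeats low-value ones."""
    return sum(is_low_value(a) for a in actions) >= min_repeats


def detect_hidden_test_gaming(public_pass, hidden_pass):
    """Flag solutions that pass the public tests but fail the hidden ones."""
    return public_pass and not hidden_pass
```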
112
+
113
+ ## Compute Budgeting
114
+
115
+ The system assumes a fixed compute budget per task. The broker enforces this by:
116
+ 1. Tracking cumulative compute cost in the ledger entries
117
+ 2. Denying requests when the agent's credit balance is below the threshold
118
+ 3. Downgrading to cheaper resources when balance is marginal
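
Step 1 can be sketched as a simple aggregation over ledger entries. The sketch assumes each entry is a dictionary with `task_id` and `compute_cost` keys and that the per-task budget is supplied by the caller; both helper names are hypothetical.

```python
from collections import defaultdict


def compute_spent_by_task(entries):
    """Sum compute_cost over ledger entries, grouped by task_id."""
    totals = defaultdict(float)
    for entry in entries:
        totals[entry["task_id"]] += entry["compute_cost"]
    return dict(totals)


def remaining_budget(entries, task_id, budget):
    """Compute left for a task; a negative value means the budget is exhausted."""
    return budget - compute_spent_by_task(entries).get(task_id, 0.0)
```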
119
+
120
+ ## Failure Modes
121
+
122
+ 1. **Oracle brittleness:** If the scoring rules are incomplete, agents will find and exploit the gaps.
123
+ 2. **Broker conservatism:** If thresholds are too high, agents cannot act even when they should.
124
+ 3. **Decay too aggressive:** If λ is too high, agents lose credits before completing multi-step tasks.
125
+ 4. **Scope explosion:** Capability-scoped credits multiply the state space.
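
Failure mode 3 can be quantified from the decay rule itself. Assuming an idle balance follows $b(t) = b_0 e^{-\lambda t}$ (the symbols $b_0$, $t$, and $T_{1/2}$ are introduced here for illustration), the half-life of a credit is:

```latex
\begin{aligned}
b(t) &= b_0\, e^{-\lambda t} \\
b(T_{1/2}) = \tfrac{1}{2}\, b_0 \;&\Longrightarrow\; T_{1/2} = \frac{\ln 2}{\lambda}
\end{aligned}
```

For example, with λ = 0.1 per time step the half-life is about 6.9 steps, so credits left idle across a 10-step task retain only $e^{-1} \approx 37\%$ of their value. Choosing λ therefore amounts to choosing how long a multi-step task may run before its early earnings become negligible.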
126
+
127
+ ## Future Extensions
128
+
129
+ 1. **Hierarchical broker:** Nested capability scopes (e.g., `model_call/code` vs `model_call/qa`).
130
+ 2. **Dynamic thresholds:** Learn thresholds from historical data rather than hardcoding.
131
+ 3. **Peer review:** Multiple oracles vote on controversial actions.
132
+ 4. **Human-in-the-loop:** Escalate high-risk decisions to human reviewers with credit incentives.