Upload design.md
design.md
CHANGED
|
@@ -1,153 +1,132 @@
|
|
| 1 |
# OCC Design Document
|
| 2 |
|
| 3 |
-
##
|
| 4 |
|
| 5 |
-
|
| 6 |
-
2. **Non-Transferable Credits**: Agents cannot launder credits through others.
|
| 7 |
-
3. **Decaying Credits**: Hoarding is discouraged; use-it-or-lose-it dynamics.
|
| 8 |
-
4. **Capability-Based Rights**: Rights are per-resource, not blanket access.
|
| 9 |
-
5. **Auditable Accounting**: Every credit change has provenance.
|
| 10 |
|
| 11 |
-
##
|
| 12 |
|
| 13 |
-
|
| 14 |
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
- `compute_comparison`: score normalized by tokens/FLOPs used
|
| 20 |
|
| 21 |
-
|
| 22 |
-
- `answer_correctness`: exact / fuzzy match to gold
|
| 23 |
-
- `evidence_support`: NLI entailment check on retrieved evidence
|
| 24 |
-
- `hallucination`: NLI contradiction or unsupported claims
|
| 25 |
-
- `abstention_utility`: correct abstention on unanswerable questions
|
| 26 |
-
- `calibration`: Brier score / ECE on confidence predictions
|
| 27 |
-
- `proper_score`: proper scoring rule reward
|
| 28 |
|
| 29 |
-
|
| 30 |
-
- `decision_quality`: final answer correctness
|
| 31 |
-
- `influence_efficiency`: marginal contribution per token/compute
|
| 32 |
-
- `throughput`: decisions per compute unit
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
```
|
| 37 |
-
reward =
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
|
|
|
|
|
|
| 53 |
```
|
| 54 |
|
| 55 |
-
##
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
-
|
| 59 |
-
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
##
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
-
|
| 99 |
-
- `memory_write`
|
| 100 |
-
- `human_escalation`
|
| 101 |
-
|
| 102 |
-
## 5. GRPO Hook
|
| 103 |
-
|
| 104 |
-
We implement a reward function compatible with TRL's GRPOTrainer that maps Oracle outputs to per-group rewards. Since full training may be compute-limited, we provide:
|
| 105 |
-
|
| 106 |
-
1. `reward_fn(completions, oracle_scores)` — returns tensor of rewards
|
| 107 |
-
2. `GRPOHook` class — wraps Oracle + Ledger + Broker for online evaluation
|
| 108 |
-
3. `OfflineComparator` — compares policies using saved trajectories when training is infeasible
|
| 109 |
-
|
| 110 |
-
## 6. Benchmarks
|
| 111 |
-
|
| 112 |
-
### Benchmark 1: Code Compute Allocation
|
| 113 |
-
- Dataset: `openai/openai_humaneval` or `evalplus/humanevalplus`
|
| 114 |
-
- Baselines: fixed compute, verifier retries, OCC allocation
|
| 115 |
-
- Metrics: pass@1, pass@k, tokens used, model calls, cost, compute saved at iso-accuracy
|
| 116 |
-
|
| 117 |
-
### Benchmark 2: Retrieval QA
|
| 118 |
-
- Dataset: synthetic grounded QA + adversarial evidence
|
| 119 |
-
- Baselines: direct answer, RAG, RAG+verifier, OCC
|
| 120 |
-
- Metrics: correctness, hallucination rate, abstention utility, ECE, retrieval calls, cost
|
| 121 |
-
|
| 122 |
-
### Benchmark 3: Multi-Agent Debate
|
| 123 |
-
- Dataset: synthetic factual disputes + code debates
|
| 124 |
-
- Baselines: equal turns, majority vote, confidence-weighted, OCC
|
| 125 |
-
- Metrics: decision quality, compute used, quality per GPU-second, bad-agent containment
|
| 126 |
-
|
| 127 |
-
## 7. Ablations
|
| 128 |
-
|
| 129 |
-
1. No credit ledger (oracle score used directly)
|
| 130 |
-
2. Transferable credits
|
| 131 |
-
3. Non-decaying credits
|
| 132 |
-
4. No abstention reward
|
| 133 |
-
5. No calibration penalty
|
| 134 |
-
6. No cost penalty
|
| 135 |
-
7. No anti-gaming penalty
|
| 136 |
-
8. No broker (oracle score only)
|
| 137 |
-
9. Broker with static rules
|
| 138 |
-
10. Broker with learned/score-based rights
|
| 139 |
-
|
| 140 |
-
## 8. Anti-Gaming Tests
|
| 141 |
-
|
| 142 |
-
- Spam low-value actions
|
| 143 |
-
- Hoard credits
|
| 144 |
-
- Transfer credit indirectly
|
| 145 |
-
- Exploit weak judge
|
| 146 |
-
- Verbose but low-value debate turns
|
| 147 |
-
- Over-abstention
|
| 148 |
-
- Overuse retrieval
|
| 149 |
-
- Manipulate confidence
|
| 150 |
-
- Optimize for unit tests while breaking hidden tests
|
| 151 |
-
- Collude in multi-agent debate
|
| 152 |
-
|
| 153 |
-
Measure: gaming success rate, credit leakage, robustness under judge replacement, quality degradation, broker containment.
|
|
|
|
| 1 |
# OCC Design Document
|
| 2 |
|
| 3 |
+
## Philosophy
|
| 4 |
|
| 5 |
+
Compute is the scarce resource in modern agent systems. Every action — a tool call, a retrieval, a debate turn, a verification pass — costs tokens, GPU seconds, or API dollars. Current systems allocate compute upfront (fixed budget) or reactively (retry on failure). OCC proposes a **proactive, earned-compute** model where agents must demonstrate marginal value before receiving more resources.
|
| 6 |
|
| 7 |
+
## Core Thesis
|
| 8 |
|
| 9 |
+
> "Agents should earn compute, not spend it."
|
| 10 |
|
| 11 |
+
The system is named Oracle-Credit-Compute because every compute decision flows through three stages:
|
| 12 |
+
1. **Oracle:** Score the marginal impact of an action
|
| 13 |
+
2. **Credit:** Update the agent's credit balance based on that score
|
| 14 |
+
3. **Compute:** The broker decides whether to grant the requested resource
|
| 15 |
|
| 16 |
+
## Architecture
|
| 17 |
|
| 18 |
+
### Impact Oracle
|
| 19 |
|
| 20 |
+
**Design principle:** Rule-based, auditable, resistant to reward hacking.
|
| 21 |
+
|
| 22 |
+
Neural reward models are vulnerable to Goodhart's Law and reward hacking (Gao et al., 2023; Skalse et al., 2022). A neural RM can be optimized to produce high scores without producing correct answers. OCC uses rule-based scoring with the following properties:
|
| 23 |
+
|
| 24 |
+
- **Verifiable outcomes:** Code correctness is checked by running tests. QA correctness is checked against gold answers. Debate quality is checked against ground truth.
|
| 25 |
+
- **Cost-adjusted scores:** Every score subtracts a compute cost penalty. This prevents agents from achieving correctness through brute-force token spending.
|
| 26 |
+
- **Proper scoring rules:** Calibration bonus via Brier score encourages well-calibrated confidence, not just correctness.
|
| 27 |
+
- **Anti-gaming detectors:** Explicit checks for hidden-test gaming, spam, collusion, and over-abstention.
|
| 28 |
+
|
| 29 |
+
### Credit Ledger
|
| 30 |
+
|
| 31 |
+
**Design principle:** Non-transferable, decaying, capability-scoped.
|
| 32 |
+
|
| 33 |
+
- **Non-transferable:** `transfer()` always returns `False`. This prevents colluding agents from pooling credits or laundering them through intermediaries.
|
| 34 |
+
- **Exponential decay:** Idle credits decay at rate λ per time step. This prevents hoarding and encourages agents to use credits or lose them.
|
| 35 |
+
- **Capability-scoped:** Credits are scoped to specific capabilities (`retrieval`, `model_call`, `file_write`). An agent that is good at retrieval should not automatically get dangerous write permissions.
|
| 36 |
+
- **Full provenance:** Every entry has an oracle score, compute cost, timestamp, and reason. This enables auditing and debugging.
|
| 37 |
+
|
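The four ledger properties above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the class and field names (`CreditLedger`, `LedgerEntry`, `decay_rate`) are hypothetical, crediting the raw oracle score is an assumption, and decay is applied to every balance rather than only idle ones to keep the sketch short.

```python
import math
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    """Provenance record: who, which capability, why, and at what cost."""
    agent_id: str
    capability: str
    oracle_score: float
    compute_cost: float
    timestamp: float
    reason: str


@dataclass
class CreditLedger:
    decay_rate: float = 0.05  # hypothetical value for lambda
    balances: dict = field(default_factory=dict)  # (agent, capability) -> credits
    entries: list = field(default_factory=list)

    def credit(self, agent_id, capability, oracle_score, compute_cost, timestamp, reason):
        """Add the oracle score to one capability-scoped balance and log it."""
        key = (agent_id, capability)
        self.balances[key] = self.balances.get(key, 0.0) + oracle_score
        self.entries.append(
            LedgerEntry(agent_id, capability, oracle_score, compute_cost, timestamp, reason)
        )

    def balance(self, agent_id, capability):
        """Credits earned for one capability never count toward another."""
        return self.balances.get((agent_id, capability), 0.0)

    def decay(self, steps=1):
        """Exponential decay; simplified to apply to all balances, not only idle ones."""
        factor = math.exp(-self.decay_rate * steps)
        for key in self.balances:
            self.balances[key] *= factor

    def transfer(self, src_agent, dst_agent, capability, amount):
        """Credits are non-transferable, so every transfer is refused."""
        return False
```

Keying balances on `(agent, capability)` is what keeps a strong retrieval record from unlocking `file_write`.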
| 38 |
+
### Resource Broker
|
| 39 |
+
|
| 40 |
+
**Design principle:** Risk-adjusted, capability-based, dynamic.
|
| 41 |
+
|
| 42 |
+
Resources are classified by risk:
|
| 43 |
+
- **Low:** `retrieval_call`, `debate_turn` — threshold 0.5 credits
|
| 44 |
+
- **Medium:** `model_call`, `verifier_call`, `memory_write` — threshold 2.0 credits
|
| 45 |
+
- **High:** `file_write`, `shell_execute`, `human_escalation` — threshold 5.0 credits, may require approval
|
| 46 |
+
|
| 47 |
+
The broker can make six decisions:
|
| 48 |
+
- `ALLOW`: credits ≥ threshold, no flags
|
| 49 |
+
- `DENY`: credits < threshold × 0.5
|
| 50 |
+
- `REQUIRE_APPROVAL`: high-risk + high risk score
|
| 51 |
+
- `DOWNGRADE`: credits between 0.5× and 1.0× threshold → downgrade to cheaper resource
|
| 52 |
+
- `ESCALATE`: repeated denials from same agent
|
| 53 |
+
- `ASK_JUSTIFICATION`: credits insufficient but agent has some history
|
| 54 |
+
|
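One way to read the decision rules above is as an ordered cascade. The sketch below is illustrative only. The thresholds and risk classes come from the list above, but the precedence of the checks, the `risk_cutoff` and `escalate_after` values, and the use of `has_history` to separate `ASK_JUSTIFICATION` from `DENY` are all assumptions. Flag handling is omitted.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    REQUIRE_APPROVAL = "REQUIRE_APPROVAL"
    DOWNGRADE = "DOWNGRADE"
    ESCALATE = "ESCALATE"
    ASK_JUSTIFICATION = "ASK_JUSTIFICATION"


# Risk class and credit threshold per resource, as listed in the document.
RESOURCES = {
    "retrieval_call": ("low", 0.5),
    "debate_turn": ("low", 0.5),
    "model_call": ("medium", 2.0),
    "verifier_call": ("medium", 2.0),
    "memory_write": ("medium", 2.0),
    "file_write": ("high", 5.0),
    "shell_execute": ("high", 5.0),
    "human_escalation": ("high", 5.0),
}


def decide(resource, credits, risk_score=0.0, prior_denials=0, has_history=False,
           risk_cutoff=0.7, escalate_after=3):
    """Return a broker decision; check order and cutoffs are assumed, not sourced."""
    risk, threshold = RESOURCES[resource]
    if prior_denials >= escalate_after:          # repeated denials from same agent
        return Decision.ESCALATE
    if risk == "high" and risk_score > risk_cutoff:
        return Decision.REQUIRE_APPROVAL
    if credits >= threshold:
        return Decision.ALLOW
    if credits >= 0.5 * threshold:               # marginal balance
        return Decision.DOWNGRADE
    if has_history:                              # insufficient credits, some track record
        return Decision.ASK_JUSTIFICATION
    return Decision.DENY
```

For example, `decide("model_call", 1.5)` falls between 0.5× and 1.0× of the 2.0 threshold and yields `Decision.DOWNGRADE`.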
| 55 |
+
### GRPO/RL Hook
|
| 56 |
+
|
| 57 |
+
**Design principle:** Reward = verified impact - compute cost.
|
| 58 |
+
|
| 59 |
+
The reward function wraps the Impact Oracle and produces a scalar reward per completion. It is designed to be passed directly to TRL's `GRPOTrainer` as `reward_funcs`.
|
| 60 |
+
|
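As a rough sketch of the hook's shape: TRL documents custom reward functions as callables that receive `prompts` and `completions` plus extra keyword arguments and return one float per completion. The factory below follows that convention without importing TRL. `make_occ_reward_func`, `score_fn`, and `cost_per_token` are hypothetical names; `score_fn` stands in for the Impact Oracle, and whitespace token counting is an assumed proxy for compute cost.

```python
def make_occ_reward_func(score_fn, cost_per_token=0.0001):
    """Build a reward function: verified impact minus a compute cost penalty.

    score_fn(prompt, text) -> float is a placeholder for the Impact Oracle.
    """

    def occ_reward(prompts, completions, **kwargs):
        rewards = []
        for prompt, completion in zip(prompts, completions):
            # Plain-text completions are strings; conversational ones are
            # lists of message dicts, so take the first message's content.
            if isinstance(completion, str):
                text = completion
            else:
                text = completion[0]["content"]
            impact = score_fn(prompt, text)
            # Assumed cost proxy: number of whitespace-separated tokens.
            cost = len(text.split()) * cost_per_token
            rewards.append(float(impact - cost))
        return rewards

    return occ_reward
```

The returned `occ_reward` is the object that would be handed to the trainer's `reward_funcs` argument.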
| 61 |
+
The offline comparator allows policy comparison without training:
|
| 62 |
+
1. Generate trajectories from two policies on the same test set
|
| 63 |
+
2. Score both with the same reward hook
|
| 64 |
+
3. Compare mean rewards, win rates, and failure rates
|
| 65 |
+
|
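The three comparison steps can be illustrated with a small helper that takes per-example rewards already produced by the same reward hook. `compare_policies` and `failure_threshold` are hypothetical; in particular, counting a reward below the threshold as a "failure" is an assumption, since the document does not define failure rate.

```python
from statistics import mean


def compare_policies(rewards_a, rewards_b, failure_threshold=0.0):
    """Compare two policies scored on the same test set, example by example."""
    if not rewards_a or len(rewards_a) != len(rewards_b):
        raise ValueError("both policies need rewards for the same non-empty test set")
    n = len(rewards_a)
    wins_a = sum(a > b for a, b in zip(rewards_a, rewards_b))
    wins_b = sum(b > a for a, b in zip(rewards_a, rewards_b))
    return {
        "mean_a": mean(rewards_a),
        "mean_b": mean(rewards_b),
        "win_rate_a": wins_a / n,   # ties count for neither policy
        "win_rate_b": wins_b / n,
        # Assumed definition: a failure is a reward below the threshold.
        "failure_rate_a": sum(r < failure_threshold for r in rewards_a) / n,
        "failure_rate_b": sum(r < failure_threshold for r in rewards_b) / n,
    }
```

Pairing rewards by example keeps the win rate meaningful even when the two policies have similar means.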
| 66 |
+
## Reward Formula
|
| 67 |
|
| 68 |
```
|
| 69 |
+
reward =
|
| 70 |
+
verified_task_score
|
| 71 |
+
+ abstention_utility
|
| 72 |
+
+ calibration_bonus
|
| 73 |
+
- hallucination_penalty
|
| 74 |
+
- confident_wrong_penalty
|
| 75 |
+
- compute_cost_penalty
|
| 76 |
+
- gaming_penalty
|
| 77 |
+
|
| 78 |
+
Where:
|
| 79 |
+
verified_task_score = correctness * weight_correctness
|
| 80 |
+
abstention_utility = +1.0 if correct abstain, -1.0 if wrong abstain
|
| 81 |
+
calibration_bonus = (1 - brier_score) * weight_calibration
|
| 82 |
+
brier_score = (confidence - outcome)^2
|
| 83 |
+
hallucination_penalty = 2.0 if entailment < 0.5 and contradiction > 0.5
|
| 84 |
+
confident_wrong_penalty = 3.0 if confidence > 0.8 and correctness < 0.5
|
| 85 |
+
compute_cost_penalty = compute_cost * 0.0001
|
| 86 |
+
gaming_penalty = 2.0 if hidden_tests fail while public pass
|
| 87 |
```
|
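The formula translates fairly directly into code. The sketch below is one possible reading, not the reference implementation: the `Outcome` container and default weights are hypothetical, `outcome` in the Brier score is assumed to equal `correctness`, each conditional penalty is assumed to be 0 when its condition does not hold, and a "correct" abstention is assumed to mean abstaining on an unanswerable question.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    correctness: float          # verified task score in [0, 1]
    confidence: float           # agent's stated confidence in [0, 1]
    compute_cost: float         # raw compute units consumed
    abstained: bool = False
    answerable: bool = True
    entailment: float = 1.0
    contradiction: float = 0.0
    public_pass: bool = True
    hidden_pass: bool = True


def compute_reward(o, weight_correctness=1.0, weight_calibration=0.5):
    """Evaluate the reward formula term by term (weights are placeholder values)."""
    verified_task_score = o.correctness * weight_correctness
    if o.abstained:
        abstention_utility = 1.0 if not o.answerable else -1.0
    else:
        abstention_utility = 0.0
    brier_score = (o.confidence - o.correctness) ** 2
    calibration_bonus = (1 - brier_score) * weight_calibration
    hallucination_penalty = 2.0 if (o.entailment < 0.5 and o.contradiction > 0.5) else 0.0
    confident_wrong_penalty = 3.0 if (o.confidence > 0.8 and o.correctness < 0.5) else 0.0
    compute_cost_penalty = o.compute_cost * 0.0001
    gaming_penalty = 2.0 if (o.public_pass and not o.hidden_pass) else 0.0
    return (
        verified_task_score
        + abstention_utility
        + calibration_bonus
        - hallucination_penalty
        - confident_wrong_penalty
        - compute_cost_penalty
        - gaming_penalty
    )
```

With these placeholder weights, a correct answer at confidence 0.9 costing 1000 compute units scores about 1.0 + 0.495 - 0.1 = 1.395, while a wrong answer at the same confidence and zero cost scores about 0.095 - 3.0 = -2.905.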
| 88 |
|
| 89 |
+
## Anti-Gaming Design
|
| 90 |
+
|
| 91 |
+
### Spam Attacks
|
| 92 |
+
- Detection: Repeated low-value actions (compute > 100, raw_score < 0.5)
|
| 93 |
+
- Containment: Oracle subtracts gaming_penalty. Ledger can revoke all credits on explicit detection.
|
| 94 |
+
|
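A minimal sketch of the spam check, assuming each action is recorded as a `(compute, raw_score)` pair. The per-action cutoffs come from the detection rule above; `min_repeats`, which decides how many low-value actions count as "repeated", is a hypothetical parameter.

```python
def is_low_value(compute, raw_score):
    """Per-action rule from the document: costly but low-scoring."""
    return compute > 100 and raw_score < 0.5


def detect_spam(actions, min_repeats=3):
    """Flag an agent once enough of its actions are low-value.

    actions: iterable of (compute, raw_score) pairs.
    min_repeats: assumed cutoff for what counts as "repeated".
    """
    low_value_count = sum(1 for compute, raw_score in actions
                          if is_low_value(compute, raw_score))
    return low_value_count >= min_repeats
```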
| 95 |
+
### Hidden-Test Gaming
|
| 96 |
+
- Detection: `public_pass=True` but `hidden_pass=False`
|
| 97 |
+
- Containment: Immediate gaming_penalty=2.0 subtracted from raw score.
|
| 98 |
+
|
| 99 |
+
### Credit Laundering
|
| 100 |
+
- Prevention: `transfer()` returns `False` unconditionally.
|
| 101 |
+
|
| 102 |
+
### Credit Hoarding
|
| 103 |
+
- Prevention: Exponential decay on idle credits.
|
| 104 |
+
|
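To make the decay concrete, assume the continuous form of exponential decay (a discrete per-step factor of $1-\lambda$ would be an equally valid reading of "rate λ per time step"). An idle balance $b_0$ then evolves as:

$$
b(t) = b_0 \, e^{-\lambda t}, \qquad t_{1/2} = \frac{\ln 2}{\lambda}
$$

With a hypothetical $\lambda = 0.1$, the half-life is $\ln 2 / 0.1 \approx 6.93$ steps. An idle balance of 5.0 credits falls below the 2.0-credit medium-risk threshold once $t > \ln(5.0/2.0)/0.1 \approx 9.16$ steps, which bounds how long an agent can pause before losing access to `model_call`.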
| 105 |
+
### Over-Abstention
|
| 106 |
+
- Detection: Agent abstains on answerable questions.
|
| 107 |
+
- Containment: A wrong abstention receives an `abstention_utility` of -1.0, mirroring the +1.0 awarded for a correct one.
|
| 108 |
+
|
| 109 |
+
### Confidence Manipulation
|
| 110 |
+
- Detection: The Brier score inside the calibration bonus shrinks the reward whenever stated confidence diverges from the verified outcome.
|
| 111 |
+
- Containment: Overconfident wrong answers get confident_wrong_penalty=3.0.
|
| 112 |
+
|
| 113 |
+
## Compute Budgeting
|
| 114 |
+
|
| 115 |
+
The system assumes a fixed compute budget per task. The broker enforces this by:
|
| 116 |
+
1. Tracking cumulative compute cost in the ledger entries
|
| 117 |
+
2. Denying requests when the agent's credit balance is below the threshold
|
| 118 |
+
3. Downgrading to cheaper resources when balance is marginal
|
| 119 |
+
|
| 120 |
+
## Failure Modes
|
| 121 |
+
|
| 122 |
+
1. **Oracle brittleness:** If the scoring rules are incomplete, agents will find and exploit the gaps.
|
| 123 |
+
2. **Broker conservatism:** If thresholds are too high, agents cannot act even when they should.
|
| 124 |
+
3. **Decay too aggressive:** If λ is too high, agents lose credits before completing multi-step tasks.
|
| 125 |
+
4. **Scope explosion:** Capability-scoped credits multiply the state space, because balances are tracked per capability rather than per agent.
|
| 126 |
+
|
| 127 |
+
## Future Extensions
|
| 128 |
+
|
| 129 |
+
1. **Hierarchical broker:** Nested capability scopes (e.g., `model_call/code` vs `model_call/qa`).
|
| 130 |
+
2. **Dynamic thresholds:** Learn thresholds from historical data rather than hardcoding.
|
| 131 |
+
3. **Peer review:** Multiple oracles vote on controversial actions.
|
| 132 |
+
4. **Human-in-the-loop:** Escalate high-risk decisions to human reviewers with credit incentives.
|