narcolepticchicken/occ-stack

narcolepticchicken committed 27 days ago

Commit ceee38b (verified) · 1 parent: 745d481

Upload design.md

Files changed (1)

design.md +119 -140

@@ -1,153 +1,132 @@
 # OCC Design Document
-## 1. Core Principles
-1. **Verified Impact First**: Credits are earned only after an oracle verifies marginal value.
-2. **Non-Transferable Credits**: Agents cannot launder credits through others.
-3. **Decaying Credits**: Hoarding is discouraged; use-it-or-lose-it dynamics.
-4. **Capability-Based Rights**: Rights are per-resource, not blanket access.
-5. **Auditable Accounting**: Every credit change has provenance.
-## 2. Impact Oracle
-### Scoring Modes
-**Code Tasks**
-- `unit_test_pass`: binary pass/fail
-- `pass_at_k`: fraction passing among k samples
-- `regression`: does the new state break prior passing tests?
-- `compute_comparison`: score normalized by tokens/FLOPs used
-**Retrieval QA Tasks**
-- `answer_correctness`: exact / fuzzy match to gold
-- `evidence_support`: NLI entailment check on retrieved evidence
-- `hallucination`: NLI contradiction or unsupported claims
-- `abstention_utility`: correct abstention on unanswerable questions
-- `calibration`: Brier score / ECE on confidence predictions
-- `proper_score`: proper scoring rule reward
-**Multi-Agent Debate Tasks**
-- `decision_quality`: final answer correctness
-- `influence_efficiency`: marginal contribution per token/compute
-- `throughput`: decisions per compute unit
-### Reward Formula
 ```
-reward = verified_task_score
-       + abstention_utility
-       + calibration_bonus
-       - hallucination_penalty
-       - confident_wrong_penalty
-       - compute_cost_penalty
-       - gaming_penalty
-where:
-  verified_task_score ∈ [0, 1]   (pass/fail or accuracy)
-  abstention_utility ∈ {-1, 0, +1}  (+1 for correct abstain, -1 for incorrect abstain)
-  calibration_bonus = (1 - brier_score) * 0.2
-  hallucination_penalty = contradiction_score * 0.5
-  confident_wrong_penalty = confidence * (1 - correct) * 0.3
-  compute_cost_penalty = (cost / budget) * 0.2
-  gaming_penalty = detected_pattern_penalty (see below)
 ```
-### Gaming Detection
-- **Spam**: repeated low-value actions within short window → penalty
-- **Hoarding**: credit balance above threshold for N epochs → decay acceleration
-- **Transfer**: indirect credit laundering via coordinated task submission → ban
-- **Judge exploitation**: output distribution shift toward weak-judge preferences → KL penalty
-- **Over-abstention**: abstention rate > threshold → negative reward
-- **Verbose padding**: tokens per unit impact below threshold → penalty
-## 3. Credit Ledger
-### Schema
-Each entry: `(agent_id, task_id, action_id, earned, spent, decayed, remaining, reason, oracle_score, compute_cost, timestamp, capability_scope)`
-### Rules
-1. **Non-transferable**: `transfer(from, to, amount)` always returns `False`.
-2. **Decay**: `remaining *= exp(-lambda * delta_t)` each evaluation cycle.
-3. **Task scope**: credits earned in task A cannot fund task B unless explicitly pooled.
-4. **Capability scope**: credits for "retrieval" cannot fund "file_write".
-5. **Revocation**: negative outcomes can revoke credits retroactively within a window.
-6. **Provenance**: every entry references an oracle decision hash.
-## 4. Resource Broker
-### Decision Matrix
-| Condition | Decision |
-|-----------|----------|
-| credit >= threshold, low risk | `allow` |
-| credit < threshold, low risk | `deny` |
-| credit >= threshold, high risk | `require_approval` |
-| credit >= threshold, suspicious pattern | `downgrade` or `escalate` |
-| emergency override | `escalate` |
-### Resources
-- `model_call_small` / `model_call_large`
-- `retrieval_call`
-- `verifier_call`
-- `debate_turn`
-- `file_write`
-- `shell_execute`
-- `memory_write`
-- `human_escalation`
-## 5. GRPO Hook
-We implement a reward function compatible with TRL's GRPOTrainer that maps Oracle outputs to per-group rewards. Since full training may be compute-limited, we provide:
-1. `reward_fn(completions, oracle_scores)` — returns tensor of rewards
-2. `GRPOHook` class — wraps Oracle + Ledger + Broker for online evaluation
-3. `OfflineComparator` — compares policies using saved trajectories when training is infeasible
-## 6. Benchmarks
-### Benchmark 1: Code Compute Allocation
-- Dataset: `openai/openai_humaneval` or `evalplus/humanevalplus`
-- Baselines: fixed compute, verifier retries, OCC allocation
-- Metrics: pass@1, pass@k, tokens used, model calls, cost, compute saved at iso-accuracy
-### Benchmark 2: Retrieval QA
-- Dataset: synthetic grounded QA + adversarial evidence
-- Baselines: direct answer, RAG, RAG+verifier, OCC
-- Metrics: correctness, hallucination rate, abstention utility, ECE, retrieval calls, cost
-### Benchmark 3: Multi-Agent Debate
-- Dataset: synthetic factual disputes + code debates
-- Baselines: equal turns, majority vote, confidence-weighted, OCC
-- Metrics: decision quality, compute used, quality per GPU-second, bad-agent containment
-## 7. Ablations
-1. No credit ledger (oracle score used directly)
-2. Transferable credits
-3. Non-decaying credits
-4. No abstention reward
-5. No calibration penalty
-6. No cost penalty
-7. No anti-gaming penalty
-8. No broker (oracle score only)
-9. Broker with static rules
-10. Broker with learned/score-based rights
-## 8. Anti-Gaming Tests
-- Spam low-value actions
-- Hoard credits
-- Transfer credit indirectly
-- Exploit weak judge
-- Verbose but low-value debate turns
-- Over-abstention
-- Overuse retrieval
-- Manipulate confidence
-- Optimize for unit tests while breaking hidden tests
-- Collude in multi-agent debate
-Measure: gaming success rate, credit leakage, robustness under judge replacement, quality degradation, broker containment.

 # OCC Design Document
+## Philosophy
+Compute is the scarce resource in modern agent systems. Every action — a tool call, a retrieval, a debate turn, a verification pass — costs tokens, GPU seconds, or API dollars. Current systems allocate compute upfront (fixed budget) or reactively (retry on failure). OCC proposes a **proactive, earned-compute** model where agents must demonstrate marginal value before receiving more resources.
+## Core Thesis
+> "Agents should earn compute, not spend it."
+The system is named Oracle-Credit-Compute because every compute decision flows through three stages:
+1. **Oracle:** Score the marginal impact of an action
+2. **Credit:** Update the agent's credit balance based on that score
+3. **Compute:** The broker decides whether to grant the requested resource
+## Architecture
+### Impact Oracle
+**Design principle:** Rule-based, auditable, resistant to reward hacking.
+Neural reward models are vulnerable to Goodhart's Law and reward hacking (Gao et al., 2023; Skalse et al., 2022). A neural RM can be optimized to produce high scores without producing correct answers. OCC uses rule-based scoring with the following properties:
+- **Verifiable outcomes:** Code correctness is checked by running tests. QA correctness is checked against gold answers. Debate quality is checked against ground truth.
+- **Cost-adjusted scores:** Every score subtracts a compute cost penalty. This prevents agents from achieving correctness through brute-force token spending.
+- **Proper scoring rules:** Calibration bonus via Brier score encourages well-calibrated confidence, not just correctness.
+- **Anti-gaming detectors:** Explicit checks for hidden-test gaming, spam, collusion, and over-abstention.
+### Credit Ledger
+**Design principle:** Non-transferable, decaying, capability-scoped.
+- **Non-transferable:** `transfer()` always returns `False`. This prevents colluding agents from pooling credits or laundering them through intermediaries.
+- **Exponential decay:** Idle credits decay at rate λ per time step. This prevents hoarding and encourages agents to use credits or lose them.
+- **Capability-scoped:** Credits are scoped to specific capabilities (`retrieval`, `model_call`, `file_write`). An agent that is good at retrieval should not automatically get dangerous write permissions.
+- **Full provenance:** Every entry has an oracle score, compute cost, timestamp, and reason. This enables auditing and debugging.
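
The four ledger properties in the added lines above can be sketched as follows. This is a minimal illustration rather than the repository's implementation: the class name `CreditLedger`, the `LedgerEntry` fields, and the default `decay_rate` of 0.1 are assumptions, and for simplicity the sketch decays every balance rather than only idle ones.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    """One provenance record; the field names are illustrative."""
    agent_id: str
    capability: str
    amount: float
    oracle_score: float
    compute_cost: float
    timestamp: float
    reason: str


class CreditLedger:
    def __init__(self, decay_rate: float = 0.1):
        self.decay_rate = decay_rate        # lambda; 0.1 is an assumed value
        self.balances = defaultdict(float)  # (agent_id, capability) -> credits
        self.entries = []                   # append-only provenance log

    def earn(self, agent_id, capability, amount, oracle_score,
             compute_cost, timestamp, reason):
        # Credits are keyed by capability, so they cannot fund another scope.
        self.balances[(agent_id, capability)] += amount
        self.entries.append(LedgerEntry(agent_id, capability, amount,
                                        oracle_score, compute_cost,
                                        timestamp, reason))

    def transfer(self, from_agent, to_agent, amount) -> bool:
        # Non-transferable by design: the request is always refused.
        return False

    def decay(self, steps: float = 1.0) -> None:
        # Exponential decay: balance *= exp(-lambda * steps).
        factor = math.exp(-self.decay_rate * steps)
        for key in self.balances:
            self.balances[key] *= factor

    def balance(self, agent_id, capability) -> float:
        # .get avoids creating empty balances for unseen scopes.
        return self.balances.get((agent_id, capability), 0.0)
```

Keying balances by `(agent_id, capability)` means credits earned for `retrieval` are simply invisible when the broker asks about `file_write`.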
+### Resource Broker
+**Design principle:** Risk-adjusted, capability-based, dynamic.
+Resources are classified by risk:
+- **Low:** `retrieval_call`, `debate_turn` — threshold 0.5 credits
+- **Medium:** `model_call`, `verifier_call`, `memory_write` — threshold 2.0 credits
+- **High:** `file_write`, `shell_execute`, `human_escalation` — threshold 5.0 credits, may require approval
+The broker can make six decisions:
+- `ALLOW`: credits ≥ threshold, no flags
+- `DENY`: credits < threshold × 0.5
+- `REQUIRE_APPROVAL`: high-risk resource and a high risk score for the request
+- `DOWNGRADE`: credits between 0.5× and 1.0× threshold → downgrade to cheaper resource
+- `ESCALATE`: repeated denials from same agent
+- `ASK_JUSTIFICATION`: credits insufficient but agent has some history
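
The six decisions listed above overlap in places (for example, `DENY` and `ASK_JUSTIFICATION` both apply to low balances), so an implementation has to pick a precedence order. The sketch below shows one possible ordering. The thresholds come from the risk classes above; the order of the checks, the `risk_score > 0.8` cutoff, the three-denial limit, and all function and parameter names are assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    DOWNGRADE = "downgrade"
    ESCALATE = "escalate"
    ASK_JUSTIFICATION = "ask_justification"


# Resource -> (risk class, credit threshold), taken from the design text.
RESOURCES = {
    "retrieval_call": ("low", 0.5),
    "debate_turn": ("low", 0.5),
    "model_call": ("medium", 2.0),
    "verifier_call": ("medium", 2.0),
    "memory_write": ("medium", 2.0),
    "file_write": ("high", 5.0),
    "shell_execute": ("high", 5.0),
    "human_escalation": ("high", 5.0),
}


def decide(resource, credits, risk_score=0.0, prior_denials=0,
           has_history=False):
    risk, threshold = RESOURCES[resource]
    if prior_denials >= 3:                      # assumed escalation limit
        return Decision.ESCALATE
    if credits >= threshold:
        if risk == "high" and risk_score > 0.8:  # assumed risk cutoff
            return Decision.REQUIRE_APPROVAL
        return Decision.ALLOW
    if credits >= 0.5 * threshold:              # marginal balance
        return Decision.DOWNGRADE
    # Below half the threshold: ask agents with a track record to justify
    # the request, and deny everyone else.
    return Decision.ASK_JUSTIFICATION if has_history else Decision.DENY
```

For instance, `decide("model_call", 1.5)` falls in the marginal band (1.0 to 2.0 credits) and yields `Decision.DOWNGRADE`.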
+### GRPO/RL Hook
+**Design principle:** Reward = verified impact - compute cost.
+The reward function wraps the Impact Oracle and produces a scalar reward per completion. It is designed to be passed directly to TRL's `GRPOTrainer` as `reward_funcs`.
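
As far as TRL's documentation describes them, custom reward functions receive the batch of completions, plus any extra dataset columns as keyword arguments, and return one float per completion. The sketch below follows that shape without importing TRL. The column names `gold` and `compute_cost` and the exact-match scoring are hypothetical stand-ins for the Impact Oracle, not the repository's hook.

```python
def occ_reward_func(completions, gold, compute_cost, **kwargs):
    """Return one scalar reward per completion: verified impact minus cost.

    `gold` and `compute_cost` are assumed dataset columns; exact match is a
    placeholder for the rule-based oracle.
    """
    rewards = []
    for completion, answer, cost in zip(completions, gold, compute_cost):
        # Completions may be plain strings or chat-style message lists.
        if isinstance(completion, str):
            text = completion
        else:
            text = completion[0]["content"]
        correct = 1.0 if text.strip() == answer.strip() else 0.0
        rewards.append(correct - cost * 0.0001)
    return rewards
```

A function of this shape is what would be handed to the trainer through its `reward_funcs` argument.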
+The offline comparator allows policy comparison without training:
+1. Generate trajectories from two policies on the same test set
+2. Score both with the same reward hook
+3. Compare mean rewards, win rates, and failure rates
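
The three comparator steps above reduce to a few aggregate statistics once both policies have been scored. The following is a rough sketch under stated assumptions: the function name `compare_policies` is hypothetical, the two reward lists are assumed to be aligned on the same test items, and a "failure" is assumed to mean a reward below `failure_threshold`, which the design does not define.

```python
from statistics import mean


def compare_policies(rewards_a, rewards_b, failure_threshold=0.0):
    """Compare two policies from per-item rewards on the same test set."""
    if not rewards_a or len(rewards_a) != len(rewards_b):
        raise ValueError("reward lists must be non-empty and aligned")
    n = len(rewards_a)
    pairs = list(zip(rewards_a, rewards_b))
    return {
        "mean_a": mean(rewards_a),
        "mean_b": mean(rewards_b),
        # Ties count for neither side, so the win rates need not sum to 1.
        "win_rate_a": sum(a > b for a, b in pairs) / n,
        "win_rate_b": sum(b > a for a, b in pairs) / n,
        "failure_rate_a": sum(r < failure_threshold for r in rewards_a) / n,
        "failure_rate_b": sum(r < failure_threshold for r in rewards_b) / n,
    }
```

Reporting win rate alongside the mean helps separate a policy that is slightly better everywhere from one that wins rarely but by a large margin.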
+## Reward Formula
 ```
+reward =
+  verified_task_score
+  + abstention_utility
+  + calibration_bonus
+  - hallucination_penalty
+  - confident_wrong_penalty
+  - compute_cost_penalty
+  - gaming_penalty
+Where:
+  verified_task_score = correctness * weight_correctness
+  abstention_utility = +1.0 if correct abstain, -1.0 if wrong abstain, else 0
+  calibration_bonus = (1 - brier_score) * weight_calibration
+  brier_score = (confidence - outcome)^2
+  hallucination_penalty = 2.0 if entailment < 0.5 and contradiction > 0.5, else 0
+  confident_wrong_penalty = 3.0 if confidence > 0.8 and correctness < 0.5, else 0
+  compute_cost_penalty = compute_cost * 0.0001
+  gaming_penalty = 2.0 if hidden tests fail while public tests pass, else 0
 ```
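
The formula in the added lines translates almost line for line into Python. In this sketch the constants (2.0, 3.0, 0.0001) come from the formula itself, while several things are assumptions: the `Outcome` field names, the default weights (`weight_correctness=1.0`, `weight_calibration=0.2`), treating every conditional penalty as zero when its condition does not hold, and using `correctness` as the Brier `outcome`.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Inputs to the reward formula; field names are illustrative."""
    correctness: float          # verified correctness in [0, 1]
    confidence: float           # agent's stated confidence in [0, 1]
    compute_cost: float         # raw cost; units are not fixed by the design
    entailment: float = 1.0     # NLI entailment score for the evidence
    contradiction: float = 0.0  # NLI contradiction score for the evidence
    abstained: bool = False
    answerable: bool = True
    public_pass: bool = True
    hidden_pass: bool = True


def occ_reward(o: Outcome, weight_correctness: float = 1.0,
               weight_calibration: float = 0.2) -> float:
    # Both weights are assumed defaults; the design leaves them open.
    verified_task_score = o.correctness * weight_correctness
    if o.abstained:
        # Abstaining is correct only when the question is unanswerable.
        abstention_utility = -1.0 if o.answerable else 1.0
    else:
        abstention_utility = 0.0
    brier_score = (o.confidence - o.correctness) ** 2
    calibration_bonus = (1.0 - brier_score) * weight_calibration
    hallucinated = o.entailment < 0.5 and o.contradiction > 0.5
    hallucination_penalty = 2.0 if hallucinated else 0.0
    confident_wrong = o.confidence > 0.8 and o.correctness < 0.5
    confident_wrong_penalty = 3.0 if confident_wrong else 0.0
    compute_cost_penalty = o.compute_cost * 0.0001
    gamed = o.public_pass and not o.hidden_pass
    gaming_penalty = 2.0 if gamed else 0.0
    return (verified_task_score + abstention_utility + calibration_bonus
            - hallucination_penalty - confident_wrong_penalty
            - compute_cost_penalty - gaming_penalty)
```

Under these assumed weights, a correct answer at confidence 0.9 and cost 1000 scores 1.0 + 0.198 - 0.1 = 1.098, while the same confidence on a wrong answer at zero cost scores 0.038 - 3.0 = -2.962.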
+## Anti-Gaming Design
+### Spam Attacks
+- Detection: Repeated low-value actions (compute > 100, raw_score < 0.5)
+- Containment: Oracle subtracts gaming_penalty. Ledger can revoke all credits on explicit detection.
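
The spam rule above gives the per-action condition (compute > 100, raw_score < 0.5) but not how many repeats count as spam. A minimal sketch, assuming a hypothetical `min_repeats` of 3 over a caller-supplied window of recent actions represented as dictionaries:

```python
def is_spam(recent_actions, min_repeats=3):
    """Flag an agent whose recent window has repeated low-value actions.

    `recent_actions` is a list of dicts with `compute` and `raw_score` keys;
    both that shape and `min_repeats` are assumptions.
    """
    low_value = [
        a for a in recent_actions
        if a["compute"] > 100 and a["raw_score"] < 0.5
    ]
    return len(low_value) >= min_repeats
```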
+### Hidden-Test Gaming
+- Detection: `public_pass=True` but `hidden_pass=False`
+- Containment: Immediate gaming_penalty=2.0 subtracted from raw score.
+### Credit Laundering
+- Prevention: `transfer()` returns `False` unconditionally.
+### Credit Hoarding
+- Prevention: Exponential decay on idle credits.
+### Over-Abstention
+- Detection: Agent abstains on answerable questions.
+- Containment: Wrong abstentions get -abstention_bonus (-1.0).
+### Confidence Manipulation
+- Detection: The Brier score inside the calibration bonus exposes confidence that does not match outcomes.
+- Containment: Overconfident wrong answers get confident_wrong_penalty=3.0.
+## Compute Budgeting
+The system assumes a fixed compute budget per task. The broker enforces this by:
+1. Tracking cumulative compute cost in the ledger entries
+2. Denying requests when the agent's credit balance falls below half of the resource's threshold (the `DENY` rule)
+3. Downgrading to cheaper resources when the balance is marginal (between 0.5× and 1.0× the threshold)
+## Failure Modes
+1. **Oracle brittleness:** If the scoring rules are incomplete, agents will find and exploit the gaps.
+2. **Broker conservatism:** If thresholds are too high, agents cannot act even when they should.
+3. **Decay too aggressive:** If λ is too high, agents lose credits before completing multi-step tasks.
+4. **Scope explosion:** Capability-scoped credits multiply the state space, since every agent carries a separate balance for each capability.
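
Failure mode 3 can be made concrete with the half-life of a balance. Assuming the continuous form of the decay rule used in the previous revision of this document (`remaining *= exp(-lambda * delta_t)`), a balance $c_0$ left idle for $t$ time steps becomes:

$$
c(t) = c_0 \, e^{-\lambda t}, \qquad t_{1/2} = \frac{\ln 2}{\lambda}
$$

As an illustration with an assumed $\lambda = 0.1$, the half-life is about 6.9 steps, and an agent that needs 20 steps to finish a multi-step task keeps only $e^{-2} \approx 13.5\%$ of credits earned at the start. Choosing $\lambda$ so that $t_{1/2}$ is at least the typical task length is one way to avoid this failure mode.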
+## Future Extensions
+1. **Hierarchical broker:** Nested capability scopes (e.g., `model_call/code` vs `model_call/qa`).
+2. **Dynamic thresholds:** Learn thresholds from historical data rather than hardcoding.
+3. **Peer review:** Multiple oracles vote on controversial actions.
+4. **Human-in-the-loop:** Escalate high-risk decisions to human reviewers with credit incentives.