OCC: Formal System Definition

Overview

OCC (Oracle-Credit-Compute) is a mechanism-design layer that governs agent access to compute, retrieval, debate turns, tool execution, and other resources. It treats compute allocation as a security boundary rather than a performance optimization.

Core Insight

In multi-agent systems, compute is not neutral. Extra turns, tokens, and tool calls can amplify adversarial influence unless access to deliberation is governed by verified marginal contribution. OCC makes agent compute scarce, earned, scoped, decaying, and auditable.

Formal Definition

Entities

Let:

A = {a₁, a₂, ..., aₙ} be a set of agents
T = {t₁, t₂, ..., tₘ} be a set of tasks
R = {r₁, r₂, ..., rₖ} be a set of resource types (model calls, retrieval, debate turns, tool execution, file writes, etc.)
C = {c₁, c₂, ..., cₗ} be a set of capability scopes
O be an Impact Oracle that maps (action, context, outcome) → score ∈ [−1, 1]

Credit State

Each agent a has a credit balance at time step t:

credit[a, t] ∈ ℝ₊  (non-negative real)

Credits are:

Non-transferable: ∀a,b ∈ A, a≠b, no action by a can increase credit[b,t]
Decaying: absent earning or spending, credit[a, t+1] = decay(credit[a,t]), where decay(x) = x · δ with δ ∈ (0,1) (see Decay Schedule for the task-scoped variant)
Task-scoped: credits can be bound to a specific task τ
Capability-scoped: credits can be earmarked for capability scope c
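One way to hold these properties in code is a per-agent account split into buckets by capability scope and, optionally, task. The sketch below is illustrative only: `CreditAccount`, `apply_delta`, and the `(scope, task_id)` bucket key are hypothetical names, not the occ-stack API, and non-transferability is modeled simply by exposing no transfer operation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical bucket key: (capability_scope, task_id); task_id=None means "not task-bound".
BucketKey = Tuple[str, Optional[str]]


@dataclass
class CreditAccount:
    """One agent's credit, split into scope- and task-bound buckets (illustrative only)."""

    agent_id: str
    buckets: Dict[BucketKey, float] = field(default_factory=dict)

    def balance(self, scope: str, task_id: Optional[str] = None) -> float:
        return self.buckets.get((scope, task_id), 0.0)

    def apply_delta(self, amount: float, scope: str, task_id: Optional[str] = None) -> float:
        """Add an earned (or penalized) amount; balances never go below zero."""
        key = (scope, task_id)
        self.buckets[key] = max(0.0, self.buckets.get(key, 0.0) + amount)
        return self.buckets[key]

    def decay(self, delta: float) -> None:
        """Multiplicative decay applied to every bucket."""
        for key in self.buckets:
            self.buckets[key] *= delta

    # There is deliberately no transfer() method: an agent's actions can only
    # change its own account, which is how this sketch models non-transferability.
```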

Earning Function

earn(a, action, oracle_score, compute_cost) → Δ ∈ ℝ

Δ = f(oracle_score, compute_cost, calibration, abstention_utility)

Where f must satisfy:

oracle_score < 0 ⇒ Δ ≤ 0 (negative contribution yields ≤ 0 credit)
oracle_score = 0 ⇒ Δ = 0 (neutral action neither earns nor loses)
oracle_score > 0 ⇒ Δ > 0 (positive contribution earns credit)
compute_cost > 0 reduces Δ, without overriding the three sign rules above
calibration_error > threshold reduces Δ
confident_wrong action (high confidence + oracle_score < 0) ⇒ Δ < 0 (penalty)
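The constraints above do not fix a particular f. The sketch below is one hypothetical choice that satisfies all six; the parameter names (`cost_rate`, `calib_threshold`, `calib_factor`, `confident_threshold`, `confident_wrong_factor`) and their defaults are assumptions, and the agent and action arguments of earn are omitted while calibration error and confidence are passed explicitly.

```python
def earn(oracle_score: float, compute_cost: float, calibration_error: float,
         confidence: float, *, cost_rate: float = 0.1, calib_threshold: float = 0.2,
         calib_factor: float = 0.5, confident_threshold: float = 0.8,
         confident_wrong_factor: float = 2.0) -> float:
    """One hypothetical f satisfying the earning constraints (all parameters illustrative)."""
    if oracle_score == 0:
        return 0.0  # neutral action: neither earns nor loses
    cost_factor = 1.0 + cost_rate * max(compute_cost, 0.0)  # always >= 1
    if oracle_score > 0:
        delta = oracle_score / cost_factor  # stays > 0, shrinks as cost grows
        if calibration_error > calib_threshold:
            delta *= calib_factor  # miscalibration reduces, but does not flip, the reward
        return delta
    # oracle_score < 0: delta < 0, made more negative by cost and miscalibration
    delta = oracle_score * cost_factor
    if calibration_error > calib_threshold:
        delta /= calib_factor
    if confidence >= confident_threshold:
        delta *= confident_wrong_factor  # confident_wrong penalty
    return delta
```

Scaling by the cost factor, rather than subtracting a cost term, keeps the sign of Δ tied to the sign of oracle_score: an additive cost could turn a small positive score into a loss and violate the third rule.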

Spend Function

spend(a, resource_type, capability_scope) → {allow, allow_with_warning, deny, downgrade, escalate, require_approval}

allow if: credit[a,t] ≥ cost(resource_type, capability_scope)
             AND a has capability_scope_policy[scope]
             AND credit_decay_rate[a] ≤ max_decay
             AND gaming_score[a] ≤ gaming_threshold

Decay Schedule

decay(credit[t]) = credit[t] · δ

where:
  δ = 0.995  (per-turn decay, ~5% per 10 turns)
  Or task-scoped: δ = 1.0 until task completion, then δ = 0.0 (credits expire)
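A minimal sketch of both schedules, plus two derived quantities: the half-life of an un-replenished balance, and the steady-state balance of an agent that earns a fixed amount e every turn, which is the fixed point e / (1 − δ) of b = δ·b + e. The function names are hypothetical. The steady-state figure shows that decay by itself already bounds what a constant trickle of earnings can accumulate; credit caps then bound it further.

```python
import math


def decay_per_turn(credit: float, delta: float = 0.995) -> float:
    """Per-turn multiplicative decay: credit[t+1] = credit[t] * delta."""
    return credit * delta


def decay_task_scoped(credit: float, task_complete: bool) -> float:
    """Task-scoped schedule: delta = 1.0 while the task runs, 0.0 once it completes."""
    return 0.0 if task_complete else credit


def half_life(delta: float) -> float:
    """Turns until an un-replenished balance halves under per-turn decay."""
    return math.log(0.5) / math.log(delta)


def steady_state(earn_per_turn: float, delta: float) -> float:
    """Fixed point of b = delta * b + e, i.e. e / (1 - delta)."""
    return earn_per_turn / (1.0 - delta)


if __name__ == "__main__":
    b = 100.0
    for _ in range(10):
        b = decay_per_turn(b)
    print(f"after 10 turns: {b:.2f}")  # about 95.11, i.e. roughly 5% lost
    print(f"half-life: {half_life(0.995):.1f} turns")  # about 138 turns
    print(f"steady state at 1 credit/turn: {steady_state(1.0, 0.995):.0f}")
```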

Credit Caps

credit[a,t] ≤ credit_cap(capability_scope)

credit_cap bounds resource access; the most calls a full balance can buy is:
  Model calls:     ⌊credit_cap / cost_per_call⌋
  Retrieval calls: ⌊credit_cap / cost_per_retrieval⌋
  Debate turns:    ⌊credit_cap / cost_per_turn⌋
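A minimal sketch of the floor computation, using the base costs from the Resource Types and Costs table and an illustrative cap of 100 credits (the cap value is an assumption; the document sets caps per capability scope, and `max_accesses` is a hypothetical helper).

```python
import math

# Base costs in credits, from the Resource Types and Costs table.
BASE_COST = {
    "model_call_small": 1,
    "model_call_large": 5,
    "retrieval_call": 2,
    "debate_turn": 3,
}


def max_accesses(credit_cap: float, resource: str) -> int:
    """Upper bound on calls a fully capped balance can buy: floor(cap / cost)."""
    return math.floor(credit_cap / BASE_COST[resource])


if __name__ == "__main__":
    for name in BASE_COST:
        print(name, max_accesses(100, name))  # illustrative cap of 100 credits
```

Because balances decay, these counts are upper bounds on what a capped balance can buy at one moment, not a per-task allowance.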

Oracle Scoring

oracle_score = α₁ · correctness(a, t, outcome)
             + α₂ · evidence_support(a, t, evidence)
             + α₃ · improvement_over_prior(a, t, prior_state)
             + α₄ · calibration(a, t, prediction, outcome)
             + α₅ · abstention_utility(a, t, decision_to_abstain)
             − β₁ · hallucination(a, t, evidence)
             − β₂ · confident_wrong(a, t, prediction, outcome, confidence)
             − β₃ · wasteful_compute(a, t, compute_used, value_produced)
             − β₄ · gaming_suspicion(a, t, action_pattern)

where the five reward terms are signed and the four penalty terms are non-negative magnitudes (their sign comes from the −β weights):
  correctness:        1 if correct, 0 if incorrect, −1 if harmful
  evidence_support:   1 if evidence fully supports, 0 if neutral, −1 if contradicts
  improvement:        > 0 if better than prior, 0 if same, < 0 if worse
  calibration:        > 0 if well-calibrated, < 0 if overconfident
  abstention_utility: > 0 if abstaining was correct, < 0 if the question was answerable and abstention was evasive
  hallucination:      > 0 if a generated claim contradicts the evidence, else 0
  confident_wrong:    > 0 if high confidence AND incorrect (weighted more heavily than an ordinary wrong answer)
  wasteful_compute:   > 0 if compute used ≫ value produced, else 0
  gaming_suspicion:   > 0 if the action pattern matches known gaming signatures, else 0

Default weights (tunable):
  α = [0.30, 0.15, 0.10, 0.10, 0.15]
  β = [0.20, 0.25, 0.15, 0.20]
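A minimal sketch of the weighted sum with the default weights. Two assumptions are made here: penalty terms are passed as non-negative magnitudes (their sign comes from the −β weights), and the result is clipped to [−1, 1] to match the oracle's stated range under Entities. The function name and argument layout are hypothetical.

```python
from typing import Sequence

ALPHA = (0.30, 0.15, 0.10, 0.10, 0.15)  # default reward weights from this document
BETA = (0.20, 0.25, 0.15, 0.20)  # default penalty weights from this document


def oracle_score(rewards: Sequence[float], penalties: Sequence[float],
                 alpha: Sequence[float] = ALPHA, beta: Sequence[float] = BETA) -> float:
    """Weighted reward terms minus weighted penalty magnitudes, clipped to [-1, 1].

    rewards:   correctness, evidence_support, improvement, calibration, abstention_utility
    penalties: hallucination, confident_wrong, wasteful_compute, gaming_suspicion (>= 0)
    """
    if len(rewards) != len(alpha) or len(penalties) != len(beta):
        raise ValueError("expected 5 reward terms and 4 penalty terms")
    raw = (sum(a * x for a, x in zip(alpha, rewards))
           - sum(b * p for b, p in zip(beta, penalties)))
    return max(-1.0, min(1.0, raw))
```

With every reward term at 1 and no penalties, the default weights give 0.80 (the α weights sum to 0.80), while the worst case, every reward term at −1 and every penalty at 1, gives −1.60 before clipping.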

Reward Function (for RL/GRPO)

reward(a, action, context, outcome) =
    oracle_score(a, action, context, outcome)
    + abstention_utility
    + calibration_bonus
    − hallucination_penalty
    − confident_wrong_penalty
    − compute_cost · cost_multiplier
    − gaming_penalty(a, history)

Constrained to [−1, 1].
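A minimal sketch of the reward as written, assuming "constrained" means clipping, together with the group-relative normalization GRPO applies to it. In GRPO several completions are sampled for the same prompt, and each completion's advantage is its reward standardized by the group's mean and standard deviation. The sketch uses the population standard deviation and a small epsilon; implementations differ on both, and `cost_multiplier=0.01` is a hypothetical default. How occ-stack's grpo_hook.py wires this into TRL is not shown here.

```python
from statistics import mean, pstdev
from typing import List


def reward(oracle_score: float, abstention_utility: float, calibration_bonus: float,
           hallucination_penalty: float, confident_wrong_penalty: float,
           compute_cost: float, gaming_penalty: float,
           cost_multiplier: float = 0.01) -> float:
    """OCC reward as listed above, clipped to [-1, 1]."""
    r = (oracle_score + abstention_utility + calibration_bonus
         - hallucination_penalty - confident_wrong_penalty
         - compute_cost * cost_multiplier - gaming_penalty)
    return max(-1.0, min(1.0, r))


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: rewards standardized within one prompt's group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group earns the same reward, all advantages are zero, so that group contributes no learning signal regardless of how high the shared reward is.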

System Invariants

Non-transferability: ∀a,b ∈ A, a≠b: Δcredit[b] from a's action = 0
Monotone decay: ∀a: credit[a, t+1] ≤ credit[a, t] unless credit was earned at step t
Capability scoping: access(r) requires scope_policy[r] AND credit ≥ cost(r)
External verification: oracle_score is computed by the oracle O alone; agent a cannot self-report or set its own score
Append-only ledger: credit events are immutable once recorded
Oracle separation: spending agent cannot directly influence oracle O
Negative contribution: oracle_score < 0 ⇒ Δ ≤ 0
Credit ≠ identity trust: high credit does not imply trusted access to all resources
Reversal possible: credit can be retroactively reduced on new evidence
Bounded credit: credit[a,t] ≤ credit_cap(scope) always
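Several of these invariants can be checked mechanically on a single agent's balance history. The sketch below checks non-negativity, bounded credit, and monotone decay over a list of hypothetical `Step` records; non-transferability and oracle separation are properties of the whole system and cannot be verified from one agent's trace.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Step:
    """One hypothetical balance transition for a single agent and scope."""

    before: float
    after: float
    earned: float  # credit granted during this step (>= 0)
    cap: float  # credit_cap(scope) in force for this step


def check_invariants(steps: Iterable[Step]) -> List[str]:
    """Return human-readable violations of non-negativity, bounded credit, and monotone decay."""
    violations: List[str] = []
    for i, s in enumerate(steps):
        if s.after < 0:
            violations.append(f"step {i}: negative balance {s.after}")
        if s.after > s.cap:
            violations.append(f"step {i}: balance {s.after} exceeds cap {s.cap}")
        if s.earned == 0 and s.after > s.before:
            violations.append(f"step {i}: balance grew without earning")
    return violations
```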

Ledger Event Schema

Every credit mutation produces an immutable event:

Event	Fields
CREDIT_GRANTED	agent_id, amount, reason, oracle_score, task_id, timestamp
CREDIT_DECAYED	agent_id, amount_decayed, new_balance, timestamp
CREDIT_SPENT	agent_id, amount, resource_type, capability_scope, task_id, timestamp
TURN_DENIED	agent_id, reason (insufficient_credit/wrong_scope/gaming_threshold), timestamp
ORACLE_SCORE_RECORDED	agent_id, action_id, score, confidence, evidence_ref, timestamp
CAPABILITY_SCOPE_CHANGED	agent_id, old_scope, new_scope, reason, timestamp
AGENT_PENALIZED	agent_id, penalty_amount, reason, evidence, timestamp
VERIFICATION_REVERSED	original_event_hash, new_score, reason, timestamp
POOL_EXHAUSTED	task_id, remaining_credit, timestamp
POLICY_UPDATED	parameter_changes, reason, timestamp

Each event includes:

event_hash: SHA-256 of (previous_event_hash + event_data)
parent_event_hash: chain to previous event
agent_id
task_id
timestamp (UTC ISO 8601)
capability_scope
oracle_id
score (if applicable)
credit_delta
reason (human-readable)
evidence_pointer (URI or hash to evidence)
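A minimal in-memory sketch of the hash chain, assuming canonical JSON (sorted keys, compact separators) as the serialization of event_data and an all-zero parent hash for the first event; both choices, the `Ledger` class, and its methods are hypothetical. Only a few schema fields are shown.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, List

GENESIS = "0" * 64  # hypothetical parent hash for the first event


class Ledger:
    """Append-only, hash-chained event log (illustrative, in-memory)."""

    def __init__(self) -> None:
        self._events: List[Dict[str, Any]] = []

    @staticmethod
    def _hash(parent: str, body: Dict[str, Any]) -> str:
        # event_hash = SHA-256(previous_event_hash + canonical JSON of event_data)
        payload = parent + json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, event_type: str, **data: Any) -> Dict[str, Any]:
        parent = self._events[-1]["event_hash"] if self._events else GENESIS
        body = {
            "event_type": event_type,
            "timestamp": datetime.now(timezone.utc).isoformat(),  # UTC, ISO 8601
            **data,
        }
        event = {**body, "parent_event_hash": parent, "event_hash": self._hash(parent, body)}
        self._events.append(event)
        return dict(event)  # a copy, so callers cannot mutate the stored record

    def verify(self) -> bool:
        """Recompute every link; False if any event or link was altered."""
        parent = GENESIS
        for ev in self._events:
            body = {k: v for k, v in ev.items() if k not in ("event_hash", "parent_event_hash")}
            if ev["parent_event_hash"] != parent or ev["event_hash"] != self._hash(parent, body):
                return False
            parent = ev["event_hash"]
        return True
```

Python cannot make the in-memory list truly immutable, so the chain detects tampering rather than preventing it: changing any stored field makes `verify()` return False, and an attacker who also recomputes that event's hash breaks the link to the next event instead.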

Resource Broker Decision Model

For each request (agent a, resource r, scope c):

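def decide(a, r, c):
    # credit, cost, has_scope, downgraded, gaming_score, risk and the two
    # thresholds are supplied by the broker; they are not defined here.
    if not has_scope(a, c):
        return DENY(reason="missing capability scope")

    # Agent-level gaming check runs before any credit path, so a suspected
    # gamer cannot obtain a DOWNGRADE without approval.
    if gaming_score[a] > GAMING_THRESHOLD:
        return REQUIRE_APPROVAL(reason="gaming suspicion")

    if credit[a] < cost(r, c):
        alt = downgraded(r)  # None if r has no cheaper tier
        if alt is None or credit[a] < cost(alt, c):
            return DENY(reason="insufficient credit")
        # The downgraded resource still gets the risk check
        if risk(alt, a, c) > RISK_THRESHOLD:
            return REQUIRE_APPROVAL(reason="high-risk action")
        return DOWNGRADE(alternative=alt, reason="insufficient credit for requested tier")

    if risk(r, a, c) > RISK_THRESHOLD:
        return REQUIRE_APPROVAL(reason="high-risk action")

    if credit[a] < 2 * cost(r, c):  # after this spend, another request of this kind is unaffordable
        return ALLOW_WITH_WARNING(reason="low credit warning")

    return ALLOW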

Resource Types and Costs

Resource	Base Cost (credits)	Capability Scope
model_call_small	1	basic_inference
model_call_large	5	premium_inference
retrieval_call	2	retrieval
verifier_call	3	verification
debate_turn	3	deliberation
file_write	5	tool_execution
shell_exec	8	tool_execution
memory_write	2	memory
human_escalation	20	escalation
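A self-contained sketch of the broker decision using the base costs and scopes from this table. `DOWNGRADES`, `HIGH_RISK`, and `GAMING_THRESHOLD` are hypothetical stand-ins for the broker's downgraded(), risk(), and threshold configuration, and each resource's scope is taken from the table rather than passed separately. The gaming check runs before the credit checks, and a downgraded resource still goes through the risk check, so the DOWNGRADE path cannot bypass either.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# (base cost in credits, capability scope), from the Resource Types and Costs table
RESOURCES: Dict[str, Tuple[int, str]] = {
    "model_call_small": (1, "basic_inference"),
    "model_call_large": (5, "premium_inference"),
    "retrieval_call": (2, "retrieval"),
    "verifier_call": (3, "verification"),
    "debate_turn": (3, "deliberation"),
    "file_write": (5, "tool_execution"),
    "shell_exec": (8, "tool_execution"),
    "memory_write": (2, "memory"),
    "human_escalation": (20, "escalation"),
}

# Hypothetical broker configuration (not specified in the document)
DOWNGRADES: Dict[str, str] = {"model_call_large": "model_call_small"}
HIGH_RISK: Set[str] = {"file_write", "shell_exec"}  # stand-in for risk(r, a, c) > RISK_THRESHOLD
GAMING_THRESHOLD = 0.7


@dataclass
class Agent:
    credit: float
    scopes: Set[str] = field(default_factory=set)
    gaming_score: float = 0.0


def decide(agent: Agent, resource: str) -> Tuple[str, Optional[str]]:
    """Return (decision, detail) for one resource request."""
    cost, scope = RESOURCES[resource]
    if scope not in agent.scopes:
        return "DENY", "missing capability scope"
    # Agent-level check first, so no credit path (including DOWNGRADE) skips it.
    if agent.gaming_score > GAMING_THRESHOLD:
        return "REQUIRE_APPROVAL", "gaming suspicion"
    if agent.credit < cost:
        alt = DOWNGRADES.get(resource)  # None if there is no cheaper tier
        if alt is None or agent.credit < RESOURCES[alt][0]:
            return "DENY", "insufficient credit"
        if alt in HIGH_RISK:
            return "REQUIRE_APPROVAL", "high-risk action"
        return "DOWNGRADE", alt
    if resource in HIGH_RISK:
        return "REQUIRE_APPROVAL", "high-risk action"
    if agent.credit < 2 * cost:  # after this spend, another request of this kind is unaffordable
        return "ALLOW_WITH_WARNING", "low credit warning"
    return "ALLOW", None
```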

When To Use OCC

OCC is valuable when	OCC is overkill when
Agents have heterogeneous reliability	Single-agent tasks suffice
Long-running tasks need budget discipline	Ground truth is immediate and cheap
Debate/collaboration can be poisoned	Adversarial participation is impossible
Compute is expensive	All agents have equal trust and capability
Auditability matters	Task budget is tiny (a few calls)
Agents can earn durable authority	Latency matters more than robustness
Post-hoc accountability required	Verifier/oracle cost exceeds saved compute
Agents can game naive allocation	There are no bad actors in the system

Threat Model

Attack	What Adversary Controls	Success Condition	OCC Defense	Residual Risk
Credit farming	Task selection	Accumulates budget via easy tasks	Decay + credit caps	Slow gaming over many cheap tasks
Collusion	Multiple agent identities	Transfers influence between agents	Non-transferability	Vote-ring behavior (same answer)
Oracle spoofing	Persuasive but wrong answers	Earns false credit	Verifier separation from spender	Judge hacking via prompt injection
Griefing	Burns others' budget	Lowers group accuracy	Capability-scoped spend	Indirect poisoning via bad data
Sandbagging	Hides competence	Manipulates future allocation	Decay + exploration bonus	Hard to detect without history
Identity laundering	Resets agent identity	Escapes penalties	Identity binding to account	Account churn (rate-limited)
Sybil agents	Many weak agents	Captures compute pool	Admission control	Deployment-specific new-account policy
Strategic abstention	Avoids penalties	Hoards credit	Reward shaping for participation	Conservatism bias
Verbosity gaming	Produces long but vacuous responses	Appears high-quality to weak oracle	Token-cost multiplier in reward	Requires quality oracle
Confidence manipulation	Overstates certainty	Earns calibration bonus deceptively	Proper scoring rules	Hard to calibrate perfectly
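The "proper scoring rules" defense against confidence manipulation rests on one property: under a strictly proper rule, an agent maximizes its expected score by reporting its true belief. The document does not say which rule OCC uses; the sketch below assumes the Brier score (negated so that higher is better) purely as an illustration, and the function names are hypothetical.

```python
def brier_score(confidence: float, correct: bool) -> float:
    """Negated Brier score for a binary claim: higher is better, maximum 0."""
    outcome = 1.0 if correct else 0.0
    return -(confidence - outcome) ** 2


def expected_score(reported: float, true_prob: float) -> float:
    """Expected score of reporting `reported` when the claim is correct with probability true_prob."""
    return (true_prob * brier_score(reported, True)
            + (1 - true_prob) * brier_score(reported, False))


if __name__ == "__main__":
    honest_belief = 0.6
    for reported in (0.6, 0.8, 1.0):
        print(reported, round(expected_score(reported, honest_belief), 3))
```

For an agent whose honest belief is 0.6, reporting 0.6 gives an expected score of −0.24, while overstating at 0.8 gives −0.28 and claiming certainty gives −0.40, so inflating confidence costs expected reward.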

Relationship to Prior Work

OCC builds on:

AI safety debate (Irving, Christiano, Amodei 2018): Debate as a mechanism for surfacing truth. OCC adds: debate turns are not free speech — they are auditable compute privileges.
GRPO/RLVR (Shao et al. 2024): Group-relative policy optimization. OCC supplies the reward function used to train allocation policies with GRPO.
Proper scoring rules: OCC's calibration and abstention rewards are proper scoring rule implementations.
Capability-based security: OCC's broker follows OS capability-system principles applied to agent API access.

OCC departs from:

Budget-aware reasoning (e.g., token-budget RL): OCC is not about minimizing compute — it's about governing compute access.
Adaptive inference (early exit, cascade): OCC governs who gets compute, not when to stop computing.
Multi-agent debate for accuracy: OCC does not claim debate improves accuracy. It claims debate without allocation control amplifies adversarial influence.

Implementation Reference

Python package at: https://huggingface.co/narcolepticchicken/occ-stack

/occ
  /oracle     → oracle.py     (Impact Oracle: scoring, marginal impact, proper scoring)
  /ledger     → ledger.py     (Credit Ledger: non-transferable, decaying, scoped credits)
  /broker     → broker.py     (Resource Broker: capability-based access control)
  /rl         → reward.py     (Reward function combining oracle + anti-gaming)
             → grpo_hook.py  (TRL GRPOTrainer integration)
  /benchmarks → benchmark_debate.py, benchmark_code.py, benchmark_retrieval_qa.py
  /configs    → YAML configurations for experiments
  /reports    → results, analysis, final report

Last updated: May 8, 2026. Version: 1.0.