OCC: Oracle-Credit-Compute System

A minimal open-source stack for cost-aware, compute-efficient agent systems.

What is OCC?

Modern agent systems waste test-time compute because every tool call, retrieval, debate turn, or verification pass consumes resources without proving marginal value. OCC treats compute as a budgeted, non-transferable resource that agents must earn through verified impact.

Core Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Impact Oracle  │────▶│  Credit Ledger  │────▶│ Resource Broker │
│  (score action) │     │  (earn/spend)   │     │ (allow/deny)    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                                              │
         └──────────────────┬───────────────────────────┘
                            ▼
                     ┌───────────────┐
                     │  GRPO/RL Hook │
                     │ (reward func) │
                     └───────────────┘

1. Impact Oracle (oracle/)

Rule-based scoring for:

  • Code tasks: unit tests, pass@k, regression detection, hidden-test gaming detection
  • Retrieval QA: answer correctness, evidence NLI (entailment/contradiction), abstention utility, calibration bonus (Brier score)
  • Multi-agent debate: decision quality, marginal contribution, influence efficiency

All scores are cost-adjusted: reward = verified_impact - compute_cost * penalty_rate
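
As a minimal sketch of the cost-adjusted formula above (not the repository's actual oracle code), the function below applies it to two hypothetical actions. The function name, the token counts, and the penalty rate of 0.0001 are illustrative assumptions.

```python
def cost_adjusted_reward(verified_impact: float,
                         compute_cost: float,
                         penalty_rate: float) -> float:
    """reward = verified_impact - compute_cost * penalty_rate"""
    return verified_impact - compute_cost * penalty_rate


# Hypothetical numbers: both actions achieve the same verified impact,
# but the second one spends ten times more compute to get there.
PENALTY_RATE = 0.0001  # assumed value, not taken from the repository
cheap = cost_adjusted_reward(1.0, compute_cost=500, penalty_rate=PENALTY_RATE)
expensive = cost_adjusted_reward(1.0, compute_cost=5000, penalty_rate=PENALTY_RATE)

print(f"cheap={cheap:.2f} expensive={expensive:.2f}")
```

With equal impact, the cheaper action earns the higher reward, which is the behavior the penalty term is meant to encourage.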

2. Credit Ledger (ledger/)

  • Non-transferable credits (laundering prevention)
  • Exponential decay on idle credits (hoarding prevention)
  • Capability-scoped rights (retrieval credits ≠ file-write credits)
  • Full provenance with oracle hash and reason
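
The four ledger properties above can be sketched in a few lines. This is an illustrative toy, not the code in ledger/: the class name, method names, half-life parameter, and the use of SHA-256 for the oracle hash are all assumptions.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    agent_id: str
    capability: str
    amount: float
    reason: str
    oracle_hash: str  # provenance: digest of the oracle output (assumed SHA-256)
    timestamp: float


class CreditLedger:
    """Toy ledger. There is deliberately no transfer() method, so credits
    cannot move between agents (non-transferable)."""

    def __init__(self, half_life_s: float = 3600.0):
        self.half_life_s = half_life_s
        self.balances = {}  # (agent_id, capability) -> (balance, last_update)
        self.entries = []

    def balance(self, agent_id: str, capability: str, now: float) -> float:
        # Exponential decay on idle credits discourages hoarding.
        amount, last = self.balances.get((agent_id, capability), (0.0, now))
        idle = max(0.0, now - last)
        return amount * math.exp(-math.log(2) * idle / self.half_life_s)

    def earn(self, agent_id, capability, amount, reason, oracle_payload, now):
        new_balance = self.balance(agent_id, capability, now) + amount
        self.balances[(agent_id, capability)] = (new_balance, now)
        digest = hashlib.sha256(oracle_payload.encode()).hexdigest()
        self.entries.append(
            LedgerEntry(agent_id, capability, amount, reason, digest, now))

    def spend(self, agent_id, capability, amount, reason, now) -> bool:
        # Credits are scoped per capability: retrieval credits cannot
        # pay for a file write.
        current = self.balance(agent_id, capability, now)
        if amount > current:
            return False
        self.balances[(agent_id, capability)] = (current - amount, now)
        self.entries.append(
            LedgerEntry(agent_id, capability, -amount, reason, "", now))
        return True


ledger = CreditLedger(half_life_s=100.0)
ledger.earn("agent_a", "retrieval_call", 10.0, "correct answer", "oracle:ok", now=0.0)
after_idle = ledger.balance("agent_a", "retrieval_call", now=100.0)  # one half-life later
wrong_scope = ledger.spend("agent_a", "file_write", 1.0, "write file", now=100.0)
right_scope = ledger.spend("agent_a", "retrieval_call", 3.0, "retrieve", now=100.0)
```

Passing `now` explicitly keeps the sketch deterministic; a real ledger would more likely read a clock.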

3. Resource Broker (broker/)

Capability-based access control:

  • Low risk: retrieval_call, debate_turn
  • Medium risk: model_call, verifier_call, memory_write
  • High risk: file_write, shell_execute, human_escalation

Decisions: allow, deny, require_approval, downgrade, escalate, ask_justification
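
To make the risk tiers and decisions above concrete, here is a hedged sketch of a broker decision function. The tier table comes from the list above; the policy itself (deny on insufficient credits, ask for justification on medium risk, require approval on high risk) is an assumption for illustration, and the `downgrade` and `escalate` decisions are not modeled.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    DOWNGRADE = "downgrade"
    ESCALATE = "escalate"
    ASK_JUSTIFICATION = "ask_justification"


# Risk tiers as listed in this README.
RISK_TIERS = {
    "retrieval_call": "low",
    "debate_turn": "low",
    "model_call": "medium",
    "verifier_call": "medium",
    "memory_write": "medium",
    "file_write": "high",
    "shell_execute": "high",
    "human_escalation": "high",
}


def decide(capability: str, balance: float, cost: float,
           justification: Optional[str] = None) -> Decision:
    """Hypothetical policy, not the one implemented in broker/."""
    tier = RISK_TIERS.get(capability)
    if tier is None or balance < cost:
        # Unknown capability or not enough credits in scope.
        return Decision.DENY
    if tier == "low":
        return Decision.ALLOW
    if tier == "medium":
        return Decision.ALLOW if justification else Decision.ASK_JUSTIFICATION
    # High-risk actions always go through an approval step in this sketch.
    return Decision.REQUIRE_APPROVAL


print(decide("retrieval_call", balance=5.0, cost=1.0).value)
print(decide("file_write", balance=5.0, cost=1.0).value)
```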

4. GRPO/RL Hook (rl/)

A TRL-compatible reward function that wraps the Impact Oracle. Includes an offline policy comparator for ablation studies without GPU training.
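
For orientation, TRL's GRPO trainer accepts custom reward functions that receive `completions` (plus other keyword arguments, including extra dataset columns) and return one float per completion. The sketch below follows that shape without importing TRL. The toy oracle, the `reference` dataset column, the whitespace token count, and the penalty rate are all assumptions, not the contents of rl/.

```python
PENALTY_RATE = 0.001  # assumed value


def toy_oracle(completion: str, reference: str) -> float:
    """Stand-in for the Impact Oracle: 1.0 if the reference answer appears."""
    return 1.0 if reference.strip() in completion else 0.0


def occ_reward(completions, reference, **kwargs):
    """Cost-adjusted reward, one float per completion.

    `reference` is a hypothetical dataset column; any other keyword
    arguments (such as `prompts`) are ignored here.
    """
    rewards = []
    for completion, ref in zip(completions, reference):
        impact = toy_oracle(completion, ref)
        cost = len(completion.split())  # crude token-count proxy
        rewards.append(impact - cost * PENALTY_RATE)
    return rewards


rewards = occ_reward(
    prompts=["What is 6 * 7?", "What is 6 * 7?"],
    completions=["the answer is 42", "a much longer answer that never states it"],
    reference=["42", "42"],
)
print(rewards)
```

A function of this shape could then be passed in the trainer's list of reward functions; this example assumes plain-string completions rather than the conversational format.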

Installation

pip install -e .
# For NLI evidence scoring:
pip install sentence-transformers
# For real LLM inference:
pip install transformers datasets
# For GRPO training:
pip install trl accelerate

Quick Start

# Run all benchmarks and ablations
python -m benchmarks.eval_runner

# Run individual benchmarks
python -m benchmarks.benchmark_code
python -m benchmarks.benchmark_retrieval_qa
python -m benchmarks.benchmark_debate

# Run with real NLI model (requires sentence-transformers)
python -m benchmarks.benchmark_retrieval_qa_nli

# Adversarial debate benchmark
python -m benchmarks.benchmark_debate_adversarial

# GRPO offline demonstrator
python -m rl.grpo_train_demo

Benchmark Results

Code Compute Allocation (Simulated)

| Strategy | pass@1 | Compute | Savings |
|---|---|---|---|
| Fixed (expensive agent) | 0.780 | 17,500 | n/a |
| Verifier-guided retries | 0.980 | 26,600 | -52% |
| OCC tiered escalation | 0.780 | 8,350 | 52.3% |

OCC tries cheap agents first and escalates only on failure. At iso-accuracy (0.780 pass@1), it reduces compute from 17,500 to 8,350, a 52% saving relative to the fixed expensive agent.
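
The escalation idea and the savings arithmetic can be sketched as follows. The agent names, costs, and solver functions are hypothetical; only the compute totals passed to `savings` come from the table above.

```python
from typing import Callable, List, Tuple

# (name, compute cost per attempt, solver returning True on success)
Tier = Tuple[str, int, Callable[[str], bool]]


def tiered_escalation(task: str, tiers: List[Tier]) -> Tuple[str, int, bool]:
    """Try tiers in order (cheapest first); stop at the first success."""
    spent = 0
    for name, cost, solve in tiers:
        spent += cost
        if solve(task):
            return name, spent, True
    return tiers[-1][0], spent, False


def savings(baseline_compute: float, strategy_compute: float) -> float:
    """Fraction of compute saved relative to the baseline."""
    return 1.0 - strategy_compute / baseline_compute


# Hypothetical agents: the cheap one only solves easy tasks.
tiers = [
    ("cheap", 100, lambda task: task == "easy"),
    ("expensive", 700, lambda task: True),
]
easy_result = tiered_escalation("easy", tiers)
hard_result = tiered_escalation("hard", tiers)

# Compute totals from the table: fixed = 17,500, OCC = 8,350, retries = 26,600.
occ_savings = savings(17_500, 8_350)
retry_savings = savings(17_500, 26_600)
print(f"OCC: {occ_savings:.1%}, verifier-guided retries: {retry_savings:.0%}")
```

Note that a hard task costs more under escalation than going straight to the expensive agent (800 versus 700 in this toy setup), so the net saving depends on how many tasks the cheap tier can solve.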

Code Compute Allocation (Real LLM - Qwen2.5-Coder-0.5B)

Results pending: the GPU job is running on a T4. Script: jobs/run_real_llm_standalone.py

Retrieval QA (with real NLI - cross-encoder/nli-deberta-v3-xsmall)

| Strategy | Accuracy | ECE | Retrievals |
|---|---|---|---|
| Direct answer | 0.580 | 0.226 | 0 |
| RAG baseline | 0.750 | 0.167 | 338 |
| RAG + verifier | 0.790 | 0.151 | 344 |
| OCC baseline | 0.710 | 0.201 | 227 |
| OCC + real NLI | needs calibration | n/a | 220 |

Note: OCC + NLI shows stronger evidence quality, but the broker thresholds are too conservative on neutral evidence. They need tuning before production use.
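
For readers unfamiliar with the calibration metrics mentioned in this README, the sketch below computes expected calibration error (ECE) and the Brier score in their standard forms. The use of 10 equal-width bins is an assumption; the README does not state how the benchmark bins confidences.

```python
from typing import List


def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def brier_score(confidences: List[float], correct: List[bool]) -> float:
    """Mean squared error between confidence and the 0/1 outcome."""
    return sum((c - float(ok)) ** 2
               for c, ok in zip(confidences, correct)) / len(confidences)


# Hypothetical predictions: 90% confident each time, right 3 times out of 4.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [True, True, True, False]
ece = expected_calibration_error(confs, hits)
brier = brier_score(confs, hits)
print(f"ECE={ece:.3f} Brier={brier:.3f}")
```

Lower is better for both metrics; in this toy case the model is overconfident by 15 points.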

Multi-Agent Debate

With 50% adversarial agents:

| Strategy | Accuracy | Quality/Compute |
|---|---|---|
| Equal turns | 0.760 | 0.001275 |
| Confidence-weighted | 0.560 | 0.000924 |
| OCC credit allocation | 0.760 | 0.001196 |

OCC contains adversarial agents, matching the equal-turns accuracy of 0.760, while confidence-weighted voting collapses to 0.560 because bad agents exploit high confidence.
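
A toy example shows why self-reported confidence is exploitable while earned credits are harder to fake. Everything here is hypothetical: the agent names, confidences, and credit balances are invented, and the benchmark may allocate debate turns rather than vote weight. Votes are weighted directly only to keep the sketch short.

```python
from collections import defaultdict
from typing import Dict


def weighted_vote(votes: Dict[str, str], weights: Dict[str, float]) -> str:
    """Return the answer with the largest total weight."""
    totals = defaultdict(float)
    for agent, answer in votes.items():
        totals[answer] += weights[agent]
    return max(totals, key=totals.get)


# Two honest agents answer "A" (correct); two adversarial agents answer "B".
votes = {"honest_1": "A", "honest_2": "A", "adv_1": "B", "adv_2": "B"}

# Adversarial agents simply claim high confidence.
confidence = {"honest_1": 0.70, "honest_2": 0.70, "adv_1": 0.95, "adv_2": 0.95}

# Credits are earned through verified impact, so adversarial agents hold few.
credits = {"honest_1": 8.0, "honest_2": 8.0, "adv_1": 1.0, "adv_2": 1.0}

by_confidence = weighted_vote(votes, confidence)
by_credits = weighted_vote(votes, credits)
print(f"confidence-weighted: {by_confidence}, credit-weighted: {by_credits}")
```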

Anti-Gaming

| Attack | Detection | Containment |
|---|---|---|
| Spam low-value actions | 100% credit exhaustion | Credits = 0 |
| Hidden-test gaming | 100% oracle detection | Immediate penalty |
| Over-abstention | 70% oracle penalization | Wrong abstentions punished |

Project Structure

/occ
  /oracle        - Impact Oracle implementation
  /ledger        - Credit Ledger with decay and provenance
  /broker        - Capability-based Resource Broker
  /rl            - GRPO reward hooks and offline comparator
  /benchmarks    - Code, QA, and debate benchmarks
  /jobs          - GPU job scripts for real LLM inference
  /reports       - Evaluation results (JSON)
  /configs       - Configuration files

Limitations & Next Steps

  1. Retrieval QA needs better NLI calibration. Real NLI scores are strong, but broker thresholds are too conservative on neutral evidence.
  2. All benchmarks use simulated agents for tractability. The real LLM inference script (jobs/run_real_llm_standalone.py) has been submitted as a GPU job.
  3. The GRPO training hook is implemented but has not been trained on real data. The offline comparator validates the reward design.
  4. The cost model is token-count only. A realistic cost model should also include model size, latency, and API pricing.

Citation

@software{occ_stack,
  title = {OCC: Oracle-Credit-Compute System for Agentic Compute Allocation},
  author = {narcolepticchicken},
  year = {2026},
  url = {https://huggingface.co/narcolepticchicken/occ-stack}
}

License

Apache 2.0