OCC: Oracle-Credit-Compute System

A minimal open-source stack for cost-aware, compute-efficient agent systems.

What is OCC?

Modern agent systems waste test-time compute because every tool call, retrieval, debate turn, or verification pass consumes resources without proving marginal value. OCC treats compute as a budgeted, non-transferable resource that agents must earn through verified impact.

Core Architecture

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Impact Oracle  │────▶│  Credit Ledger  │────▶│ Resource Broker │
│  (score action) │     │  (earn/spend)   │     │ (allow/deny)    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                                              │
         └──────────────────┬───────────────────────────┘
                            ▼
                     ┌──────────────┐
                     │  GRPO/RL Hook│
                     │ (reward func)│
                     └──────────────┘

1. Impact Oracle (`oracle/`)

Rule-based scoring for:

Code tasks: unit tests, pass@k, regression detection, hidden-test gaming
Retrieval QA: answer correctness, evidence NLI (entailment/contradiction), abstention utility, calibration bonus (Brier score)
Multi-agent debate: decision quality, marginal contribution, influence efficiency

All scores are cost-adjusted: reward = verified_impact - compute_cost * penalty_rate
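
The cost adjustment above can be sketched in a few lines. This is a minimal illustration, not the code in `oracle/`: the function name, the default `penalty_rate`, and the choice of tokens as the cost unit are assumptions.

```python
def cost_adjusted_reward(verified_impact: float,
                         compute_cost: float,
                         penalty_rate: float = 0.001) -> float:
    """reward = verified_impact - compute_cost * penalty_rate.

    penalty_rate=0.001 is an illustrative default, not a value from the repo.
    """
    return verified_impact - compute_cost * penalty_rate


# A verified success that cost 200 units keeps most of its reward.
useful = cost_adjusted_reward(verified_impact=1.0, compute_cost=200)

# An action with no verified impact is a pure loss, scaled by what it spent.
wasted = cost_adjusted_reward(verified_impact=0.0, compute_cost=500)
```

Under this form, an action only earns a positive reward when its verified impact exceeds its penalized cost, which is what makes expensive low-value calls unattractive.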

2. Credit Ledger (`ledger/`)

Non-transferable credits (laundering prevention)
Exponential decay on idle credits (hoarding prevention)
Capability-scoped rights (retrieval credits ≠ file-write credits)
Full provenance with oracle hash and reason
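
A rough sketch of how these four properties could fit together is shown below. The class names, the half-life parameter, and the use of SHA-256 for the oracle hash are assumptions made for illustration; the actual `ledger/` implementation may differ.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    agent: str
    capability: str
    delta: float
    reason: str
    oracle_hash: str  # empty for spends in this sketch


@dataclass
class CreditLedger:
    half_life: float = 100.0                      # idle steps until a balance halves (assumed)
    balances: dict = field(default_factory=dict)  # (agent, capability) -> credits
    history: list = field(default_factory=list)   # append-only provenance log

    def earn(self, agent, capability, amount, reason, oracle_payload):
        # Record a hash of the oracle output that justified the credit.
        digest = hashlib.sha256(oracle_payload.encode()).hexdigest()
        key = (agent, capability)
        self.balances[key] = self.balances.get(key, 0.0) + amount
        self.history.append(LedgerEntry(agent, capability, amount, reason, digest))

    def spend(self, agent, capability, amount, reason):
        # Credits are scoped: only the balance for this exact capability counts.
        key = (agent, capability)
        if self.balances.get(key, 0.0) < amount:
            return False
        self.balances[key] -= amount
        self.history.append(LedgerEntry(agent, capability, -amount, reason, ""))
        return True

    def decay(self, idle_steps):
        # Exponential decay discourages hoarding.
        factor = 0.5 ** (idle_steps / self.half_life)
        for key in self.balances:
            self.balances[key] *= factor

    # There is deliberately no transfer() method: credits cannot move between agents.
```

In this sketch, non-transferability is enforced by omission: the only ways to change a balance are earning from an oracle result, spending, and decay.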

3. Resource Broker (`broker/`)

Capability-based access control:

Low risk: retrieval_call, debate_turn
Medium risk: model_call, verifier_call, memory_write
High risk: file_write, shell_execute, human_escalation

Decisions: allow, deny, require_approval, downgrade, escalate, ask_justification
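
To make the tiers concrete, here is a small sketch of a broker decision function. The capability names and risk tiers come from the list above; the decision policy itself (the credit check, the 2x margin rule) is a hypothetical example, and it only exercises four of the six decision types.

```python
RISK_TIER = {
    "retrieval_call": "low", "debate_turn": "low",
    "model_call": "medium", "verifier_call": "medium", "memory_write": "medium",
    "file_write": "high", "shell_execute": "high", "human_escalation": "high",
}


def decide(capability: str, balance: float, cost: float) -> str:
    """Return a broker decision for one requested action (illustrative policy)."""
    tier = RISK_TIER.get(capability)
    if tier is None or balance < cost:
        # Unknown capabilities and unaffordable actions are refused outright.
        return "deny"
    if tier == "high":
        # High-risk actions need sign-off even when the agent can pay.
        return "require_approval"
    if tier == "medium" and balance < 2 * cost:
        # Thin margin on a medium-risk action: ask the agent to justify it.
        return "ask_justification"
    return "allow"
```

For example, `decide("retrieval_call", balance=5, cost=1)` allows the call, while `decide("shell_execute", balance=100, cost=1)` still requires approval regardless of the balance.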

4. GRPO/RL Hook (`rl/`)

TRL-compatible reward function wrapping the Impact Oracle. Includes an offline policy comparator for running ablation studies without GPU training.
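
As a sketch of what "TRL-compatible" can look like: recent TRL versions let `GRPOTrainer` take custom reward functions that receive the generated completions (plus extra keyword arguments) and return one float per completion. The oracle below is a stand-in, and the function names, the `PASS` marker, and the penalty rate are all hypothetical; the real hook in `rl/` would call the Impact Oracle instead.

```python
def score_completion(completion: str) -> float:
    """Stand-in oracle (hypothetical): 1.0 if the completion reports a pass."""
    return 1.0 if "PASS" in completion else 0.0


def occ_reward(completions, **kwargs):
    """Cost-adjusted reward, one float per completion.

    Assumes plain-string completions; conversational datasets pass
    lists of messages instead and would need unpacking first.
    """
    penalty_rate = 0.001  # illustrative value
    rewards = []
    for text in completions:
        cost = len(text.split())  # whitespace tokens as a crude cost proxy
        rewards.append(score_completion(text) - cost * penalty_rate)
    return rewards
```

A function of this shape would be passed to the trainer as `reward_funcs=occ_reward`; check the exact signature against the TRL version you have installed.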

Installation

pip install -e .
# For NLI evidence scoring:
pip install sentence-transformers
# For real LLM inference:
pip install transformers datasets
# For GRPO training:
pip install trl accelerate

Quick Start

# Run all benchmarks and ablations
python -m benchmarks.eval_runner

# Run individual benchmarks
python -m benchmarks.benchmark_code
python -m benchmarks.benchmark_retrieval_qa
python -m benchmarks.benchmark_debate

# Run with real NLI model (requires sentence-transformers)
python -m benchmarks.benchmark_retrieval_qa_nli

# Adversarial debate benchmark
python -m benchmarks.benchmark_debate_adversarial

# GRPO offline demonstrator
python -m rl.grpo_train_demo

Benchmark Results

Code Compute Allocation (Simulated)

Strategy	pass@1	Compute	Savings vs. fixed
Fixed (expensive agent)	0.780	17,500	—
Verifier-guided retries	0.980	26,600	-52.0%
OCC tiered escalation	0.780	8,350	52.3%

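OCC tries cheap agents first and escalates only on failure. At iso-accuracy (0.780 pass@1), it cuts compute by 52.3% relative to the fixed expensive agent (8,350 vs. 17,500). Verifier-guided retries reach 0.980 pass@1 but spend 52% more compute than the fixed baseline.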
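
The escalation loop and the savings figure can be sketched as follows. The agent tuples and solver functions are hypothetical placeholders, not the benchmark's simulated agents; only the compute totals passed to `savings` come from the table.

```python
def tiered_escalation(task, tiers):
    """Try agents cheapest-first; stop at the first one that solves the task.

    tiers: list of (name, compute_cost, solve_fn), ordered cheap to expensive.
    Returns (name of the solving agent or None, total compute spent).
    """
    spent = 0
    for name, cost, solve in tiers:
        spent += cost  # every attempt is paid for, including failed ones
        if solve(task):
            return name, spent
    return None, spent


def savings(baseline: float, actual: float) -> float:
    """Fraction of baseline compute saved (negative means more was spent)."""
    return (baseline - actual) / baseline


# Hypothetical two-tier setup: the cheap agent only handles easy tasks.
tiers = [
    ("cheap", 10, lambda task: task == "easy"),
    ("expensive", 100, lambda task: True),
]

occ_savings = savings(17_500, 8_350)     # about 0.523, the 52.3% in the table
retry_savings = savings(17_500, 26_600)  # -0.52, the -52% in the table
```

Note that a failed cheap attempt is not free: a hard task costs 110 units here rather than 100, so escalation only pays off when enough tasks are solved at the cheap tier.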

Code Compute Allocation (Real LLM - Qwen2.5-Coder-0.5B)

GPU job running on T4. Script: jobs/run_real_llm_standalone.py

Retrieval QA (with real NLI - cross-encoder/nli-deberta-v3-xsmall)

Strategy	Accuracy	ECE	Retrievals
Direct answer	0.580	0.226	0
RAG baseline	0.750	0.167	338
RAG + verifier	0.790	0.151	344
OCC baseline	0.710	0.201	227
OCC + real NLI	needs calibration	—	220

Note: OCC + NLI shows stronger evidence quality but broker thresholds are too conservative on neutral evidence. Needs tuning for production use.
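
For readers unfamiliar with the ECE column, the sketch below shows the standard binned expected calibration error. The README does not state how the benchmark bins predictions, so the 10 equal-width bins here are an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

Lower is better: a system that says 0.9 and is right 75% of the time contributes a gap of 0.15 for that bin.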

Multi-Agent Debate

With 50% adversarial agents:

Strategy	Accuracy	Quality/Compute
Equal turns	0.760	0.001275
Confidence-weighted	0.560	0.000924
OCC credit allocation	0.760	0.001196

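OCC credit allocation matches equal-turn accuracy (0.760), while confidence-weighted voting collapses to 0.560 because adversarial agents exploit high stated confidence. Equal turns keeps a slightly higher quality-per-compute ratio (0.001275 vs. 0.001196).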
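
The failure mode of confidence weighting is easy to reproduce in a toy example. The sketch below is one plausible reading, not the benchmark's mechanism: it assumes earned credits are used directly as vote weights, and all agent names and numbers are invented for illustration.

```python
from collections import defaultdict


def weighted_vote(votes, weights):
    """Return the answer with the largest total weight."""
    totals = defaultdict(float)
    for agent, answer in votes.items():
        totals[answer] += weights[agent]
    return max(totals, key=totals.get)


# Two honest agents answer "A"; two adversaries push "B" (50% adversarial).
votes = {"honest_1": "A", "honest_2": "A", "adversary_1": "B", "adversary_2": "B"}

# Self-reported confidence is free to inflate.
confidence = {"honest_1": 0.6, "honest_2": 0.7, "adversary_1": 0.99, "adversary_2": 0.99}

# Credits must be earned through oracle-verified impact (hypothetical balances).
credits = {"honest_1": 5.0, "honest_2": 4.0, "adversary_1": 0.5, "adversary_2": 0.0}

by_confidence = weighted_vote(votes, confidence)  # adversaries win
by_credits = weighted_vote(votes, credits)        # honest agents win
```

The difference is that confidence is asserted by the agent itself, while credits depend on an external check, so an adversary cannot raise its own weight by claiming certainty.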

Anti-Gaming

Attack	Detection	Containment
Spam low-value actions	100% credit exhaustion	Credits = 0
Hidden-test gaming	100% oracle detection	Immediate penalty
Over-abstention	70% oracle penalization	Wrong abstentions punished

Project Structure

/occ
  /oracle        - Impact Oracle implementation
  /ledger        - Credit Ledger with decay and provenance
  /broker        - Capability-based Resource Broker
  /rl            - GRPO reward hooks and offline comparator
  /benchmarks    - Code, QA, and debate benchmarks
  /jobs          - GPU job scripts for real LLM inference
  /reports       - Evaluation results (JSON)
  /configs       - Configuration files

Limitations & Next Steps

Retrieval QA needs better NLI calibration. Real NLI scores are strong, but the broker thresholds are too conservative on neutral evidence.
All benchmarks use simulated agents for tractability. The real-LLM inference script (jobs/run_real_llm_standalone.py) has been submitted as a GPU job, and its results are not yet included here.
The GRPO training hook is implemented but has not been trained on real data. The offline comparator validates the reward design.
Cost model is token-count only. Real cost should include model size, latency, and API pricing.
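
As a sketch of the last point, a richer cost model could blend the factors named above into a single number. Everything here is hypothetical: the field names, the weights, and the linear form are illustrative choices, not a design from this repository.

```python
from dataclasses import dataclass


@dataclass
class CallCost:
    tokens: int
    model_params_b: float     # model size in billions of parameters
    latency_s: float          # wall-clock latency in seconds
    usd_per_1k_tokens: float  # API price; 0.0 for self-hosted models


def token_cost(call: CallCost) -> float:
    """The current token-count-only cost model."""
    return float(call.tokens)


def blended_cost(call: CallCost,
                 w_compute: float = 1.0,
                 w_latency: float = 10.0,
                 w_price: float = 1000.0) -> float:
    """Illustrative linear blend of size-scaled tokens, latency, and price."""
    compute = call.tokens * call.model_params_b
    price = call.tokens / 1000 * call.usd_per_1k_tokens
    return w_compute * compute + w_latency * call.latency_s + w_price * price


small_local = CallCost(tokens=1000, model_params_b=0.5, latency_s=2.0, usd_per_1k_tokens=0.0)
```

With a blend like this, 1,000 tokens from a 0.5B local model and 1,000 tokens from a large hosted model no longer cost the same, which the token-only model cannot express.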

Citation

@software{occ_stack,
  title = {OCC: Oracle-Credit-Compute System for Agentic Compute Allocation},
  author = {narcolepticchicken},
  year = {2026},
  url = {https://huggingface.co/narcolepticchicken/occ-stack}
}

License

Apache 2.0