	# OCC: Oracle-Credit-Compute System

	A minimal open-source stack for cost-aware, compute-efficient agent systems.

	## What is OCC?

	Modern agent systems waste test-time compute because every tool call, retrieval, debate turn, or verification pass consumes resources without proving marginal value. OCC treats compute as a budgeted, non-transferable resource that agents must earn through verified impact.

	## Core Architecture

	```
	┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
	│  Impact Oracle  │────▶│  Credit Ledger  │────▶│ Resource Broker │
	│ (score action)  │     │  (earn/spend)   │     │  (allow/deny)   │
	└─────────────────┘     └─────────────────┘     └─────────────────┘
	         │                                               │
	         └───────────────────────┬───────────────────────┘
	                                 ▼
	                         ┌───────────────┐
	                         │ GRPO/RL Hook  │
	                         │ (reward func) │
	                         └───────────────┘
	```

	### 1. Impact Oracle (`oracle/`)

	Rule-based scoring for:
	- Code tasks: unit tests, pass@k, regression detection, hidden-test gaming
	- Retrieval QA: answer correctness, evidence NLI (entailment/contradiction), abstention utility, calibration bonus (Brier score)
	- Multi-agent debate: decision quality, marginal contribution, influence efficiency

	All scores are cost-adjusted: `reward = verified_impact - compute_cost * penalty_rate`
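The formula above can be sketched in a few lines. This is a minimal illustration, not the code in `oracle/`: the function name and the default `penalty_rate` of 0.001 are assumptions chosen for the example.

```python
def cost_adjusted_reward(verified_impact: float,
                         compute_cost: float,
                         penalty_rate: float = 0.001) -> float:
    """reward = verified_impact - compute_cost * penalty_rate

    penalty_rate=0.001 is an illustrative default, not a value from the repo.
    """
    return verified_impact - compute_cost * penalty_rate


# A correct action that used 200 cost units keeps most of its impact.
useful = cost_adjusted_reward(verified_impact=1.0, compute_cost=200)

# An action with no verified impact is charged for the compute it consumed.
wasteful = cost_adjusted_reward(verified_impact=0.0, compute_cost=500)
```

Under these assumed numbers the useful action scores about 0.8 and the wasteful one goes negative, which is the behavior the cost adjustment is meant to produce.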

	### 2. Credit Ledger (`ledger/`)

	- Non-transferable credits (laundering prevention)
	- Exponential decay on idle credits (hoarding prevention)
	- Capability-scoped rights (retrieval credits ≠ file-write credits)
	- Full provenance with oracle hash and reason
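The four ledger properties above can be illustrated with a small in-memory sketch. Everything here is hypothetical (class names, the half-life parameter, SHA-256 as the oracle hash); the actual `ledger/` implementation may differ.

```python
import hashlib
import math
from dataclasses import dataclass, field


@dataclass
class LedgerEntry:
    agent_id: str
    capability: str
    amount: float
    reason: str
    oracle_hash: str  # hash of the oracle output that justified the entry


@dataclass
class CreditLedger:
    half_life: float = 100.0                        # assumed idle half-life, arbitrary time units
    balances: dict = field(default_factory=dict)    # (agent, capability) -> (amount, last_update)
    history: list = field(default_factory=list)     # provenance trail

    def balance(self, agent_id, capability, now):
        """Balance after exponential decay over the idle period."""
        amount, last_update = self.balances.get((agent_id, capability), (0.0, now))
        idle = max(0.0, now - last_update)
        return amount * math.exp(-math.log(2) * idle / self.half_life)

    def earn(self, agent_id, capability, amount, reason, oracle_output, now):
        new_amount = self.balance(agent_id, capability, now) + amount
        self.balances[(agent_id, capability)] = (new_amount, now)
        digest = hashlib.sha256(oracle_output.encode()).hexdigest()
        self.history.append(LedgerEntry(agent_id, capability, amount, reason, digest))

    def spend(self, agent_id, capability, amount, reason, now):
        """Spend only from the matching capability scope; returns False if short."""
        available = self.balance(agent_id, capability, now)
        if amount > available:
            return False
        self.balances[(agent_id, capability)] = (available - amount, now)
        self.history.append(LedgerEntry(agent_id, capability, -amount, reason, ""))
        return True

    # Deliberately no transfer() method: credits cannot move between agents.


ledger = CreditLedger(half_life=10.0)
ledger.earn("agent_a", "retrieval_call", 8.0, "passed unit tests", "oracle-output", now=0.0)
after_idle = ledger.balance("agent_a", "retrieval_call", now=10.0)
cross_scope = ledger.spend("agent_a", "file_write", 1.0, "write patch", now=10.0)
```

After one half-life of idling, the 8 credits decay to about 4, and the attempt to spend retrieval credits on `file_write` is refused because that scope has no balance.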

	### 3. Resource Broker (`broker/`)

	Capability-based access control:
	- Low risk: `retrieval_call`, `debate_turn`
	- Medium risk: `model_call`, `verifier_call`, `memory_write`
	- High risk: `file_write`, `shell_execute`, `human_escalation`

	Decisions: `allow`, `deny`, `require_approval`, `downgrade`, `escalate`, `ask_justification`
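A rough sketch of how risk tiers might map to decisions follows. The capability names and decision values come from the lists above; the decision policy itself (deny when credits are short, require approval for high-risk actions) is an assumption for illustration and uses only three of the six decisions.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"
    DOWNGRADE = "downgrade"
    ESCALATE = "escalate"
    ASK_JUSTIFICATION = "ask_justification"


RISK_TIERS = {
    "retrieval_call": "low",
    "debate_turn": "low",
    "model_call": "medium",
    "verifier_call": "medium",
    "memory_write": "medium",
    "file_write": "high",
    "shell_execute": "high",
    "human_escalation": "high",
}


def decide(capability: str, balance: float, cost: float) -> Decision:
    """Hypothetical policy: unknown or unaffordable actions are denied,
    high-risk actions always need approval, everything else is allowed."""
    tier = RISK_TIERS.get(capability)
    if tier is None or balance < cost:
        return Decision.DENY
    if tier == "high":
        return Decision.REQUIRE_APPROVAL
    return Decision.ALLOW


cheap_lookup = decide("retrieval_call", balance=5.0, cost=1.0)
shell_request = decide("shell_execute", balance=5.0, cost=1.0)
broke_agent = decide("model_call", balance=0.0, cost=1.0)
```

A fuller policy would presumably also use `downgrade`, `escalate`, and `ask_justification`; the README does not say when each one fires, so they are left unused here.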

	### 4. GRPO/RL Hook (`rl/`)

	A TRL-compatible reward function that wraps the Impact Oracle. It includes an offline policy comparator, so ablation studies can run without GPU training.
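As a sketch of what "TRL-compatible" can mean: TRL's `GRPOTrainer` accepts custom reward functions that take `prompts` and `completions` (plus extra keyword arguments) and return one float per completion. The wrapper below follows that shape. The names `make_occ_reward` and `toy_oracle`, the whitespace token count used as cost, and the penalty rate are all assumptions, not the code in `rl/`.

```python
from typing import Callable, List


def make_occ_reward(score_fn: Callable[[str, str], float],
                    penalty_rate: float = 0.001):
    """Wrap an oracle scoring function as a GRPO-style reward function.

    score_fn is a hypothetical stand-in for the Impact Oracle.
    """
    def occ_reward(prompts: List[str], completions: List[str], **kwargs) -> List[float]:
        rewards = []
        for prompt, completion in zip(prompts, completions):
            impact = score_fn(prompt, completion)
            cost = len(completion.split())  # crude token-count proxy (assumption)
            rewards.append(impact - cost * penalty_rate)
        return rewards

    return occ_reward


def toy_oracle(prompt: str, completion: str) -> float:
    """Toy scorer for the example: full impact if the answer contains '42'."""
    return 1.0 if "42" in completion else 0.0


reward_func = make_occ_reward(toy_oracle, penalty_rate=0.01)
rewards = reward_func(
    prompts=["What is 6*7?", "What is 6*7?"],
    completions=["42", "I am not sure about that"],
)
```

This sketch assumes plain-string completions; conversational datasets pass lists of message dicts instead and would need the text extracted first. A function of this shape could then be passed through the trainer's `reward_funcs` argument.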

	## Installation

	```bash
	pip install -e .
	# For NLI evidence scoring:
	pip install sentence-transformers
	# For real LLM inference:
	pip install transformers datasets
	# For GRPO training:
	pip install trl accelerate
	```

	## Quick Start

	```bash
	# Run all benchmarks and ablations
	python -m benchmarks.eval_runner

	# Run individual benchmarks
	python -m benchmarks.benchmark_code
	python -m benchmarks.benchmark_retrieval_qa
	python -m benchmarks.benchmark_debate

	# Run with real NLI model (requires sentence-transformers)
	python -m benchmarks.benchmark_retrieval_qa_nli

	# Adversarial debate benchmark
	python -m benchmarks.benchmark_debate_adversarial

	# GRPO offline demonstrator
	python -m rl.grpo_train_demo
	```

	## Benchmark Results

	### Code Compute Allocation (Simulated)

	| Strategy | pass@1 | Compute | Savings |
	|----------|--------|---------|---------|
	| Fixed (expensive agent) | 0.780 | 17,500 | n/a |
	| Verifier-guided retries | 0.980 | 26,600 | -52% |
	| OCC tiered escalation | 0.780 | 8,350 | 52.3% |

	OCC tries cheap agents first and escalates only on failure. At iso-accuracy (0.780 pass@1), it cuts compute by 52.3% relative to the fixed expensive agent (8,350 vs. 17,500).
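The escalation loop and the savings arithmetic can be sketched as follows. The tier names, per-tier costs, and the `passes` callback are hypothetical; only the compute totals in the `savings` calls come from the table above.

```python
def tiered_escalation(task, tiers, passes):
    """Try tiers cheapest-first and stop at the first one that passes.

    tiers:  list of (name, cost) pairs ordered from cheapest to most expensive.
    passes: callable (tier_name, task) -> bool, e.g. a unit-test check.
    Returns (winning_tier_or_None, total_compute_spent).
    """
    spent = 0
    for name, cost in tiers:
        spent += cost
        if passes(name, task):
            return name, spent
    return None, spent


def savings(baseline_compute: float, actual_compute: float) -> float:
    """Fractional compute saved relative to a baseline (negative means overspend)."""
    return 1 - actual_compute / baseline_compute


# Hypothetical tiers: a cheap agent that only solves easy tasks, then an expensive one.
tiers = [("small", 50), ("large", 350)]
passes = lambda name, task: name == "large" or task == "easy"

easy_result = tiered_escalation("easy", tiers, passes)
hard_result = tiered_escalation("hard", tiers, passes)

# Compute totals from the table above.
occ_savings = round(savings(17_500, 8_350) * 100, 1)
retry_savings = round(savings(17_500, 26_600) * 100, 1)
```

Note that a hard task pays for both tiers, so escalation only saves compute when enough tasks are solved by the cheap tier.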

	### Code Compute Allocation (Real LLM - Qwen2.5-Coder-0.5B)

	Results pending: the GPU job is running on a T4. Script: `jobs/run_real_llm_standalone.py`

	### Retrieval QA (with real NLI - cross-encoder/nli-deberta-v3-xsmall)

	| Strategy | Accuracy | ECE | Retrievals |
	|----------|----------|-----|------------|
	| Direct answer | 0.580 | 0.226 | 0 |
	| RAG baseline | 0.750 | 0.167 | 338 |
	| RAG + verifier | 0.790 | 0.151 | 344 |
	| OCC baseline | 0.710 | 0.201 | 227 |
	| OCC + real NLI | needs calibration | n/a | 220 |

	Note: OCC + NLI shows stronger evidence quality, but the broker thresholds are too conservative on neutral evidence and need tuning before production use.
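For readers unfamiliar with the ECE column, expected calibration error is commonly computed by binning predictions by confidence and averaging the gap between confidence and accuracy per bin. The sketch below uses ten equal-width bins; whether the benchmark uses the same binning is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Four answers all stated at 0.9 confidence, of which three are right:
# one bin, mean confidence 0.9, accuracy 0.75.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

Lower is better: an ECE of 0 means stated confidence matches observed accuracy in every bin.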

	### Multi-Agent Debate

	With 50% adversarial agents:

	| Strategy | Accuracy | Quality/Compute |
	|----------|----------|-----------------|
	| Equal turns | 0.760 | 0.001275 |
	| Confidence-weighted | 0.560 | 0.000924 |
	| OCC credit allocation | 0.760 | 0.001196 |

	OCC contains adversarial agents and matches equal-turn accuracy (0.760), while confidence-weighted voting collapses to 0.560 because bad agents exploit high stated confidence.

	### Anti-Gaming

	| Attack | Detection | Containment |
	|--------|-----------|-------------|
	| Spam low-value actions | 100% credit exhaustion | Credits = 0 |
	| Hidden-test gaming | 100% oracle detection | Immediate penalty |
	| Over-abstention | 70% oracle penalization | Wrong abstentions punished |

	## Project Structure

	```
	/occ
	  /oracle      - Impact Oracle implementation
	  /ledger      - Credit Ledger with decay and provenance
	  /broker      - Capability-based Resource Broker
	  /rl          - GRPO reward hooks and offline comparator
	  /benchmarks  - Code, QA, and debate benchmarks
	  /jobs        - GPU job scripts for real LLM inference
	  /reports     - Evaluation results (JSON)
	  /configs     - Configuration files
	```

	## Limitations & Next Steps

	1. Retrieval QA needs better NLI calibration. Real NLI scores are strong, but the broker thresholds are too conservative on neutral evidence.
	2. All benchmarks use simulated agents for tractability. The real LLM inference script (`jobs/run_real_llm_standalone.py`) has been submitted as a GPU job.
	3. The GRPO training hook is implemented but has not been trained on real data. The offline comparator validates the reward design.
	4. Cost model is token-count only. Real cost should include model size, latency, and API pricing.
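As an illustration of item 4, a richer cost model could blend token count with model size, latency, and price. The sketch below is purely hypothetical: the field names, the weights, and the example numbers are placeholders, not measurements or repo code.

```python
from dataclasses import dataclass


@dataclass
class CallCost:
    tokens: int
    params_billions: float    # model size
    latency_s: float          # wall-clock latency of the call
    usd_per_1k_tokens: float  # API price; 0.0 for self-hosted models


def blended_cost(call: CallCost,
                 w_compute: float = 1.0,
                 w_latency: float = 10.0,
                 w_usd: float = 1000.0) -> float:
    """Weighted sum of size-scaled tokens, latency, and dollar cost.

    All three weights are placeholder values for the example.
    """
    compute = call.tokens * call.params_billions
    dollars = call.tokens / 1000 * call.usd_per_1k_tokens
    return w_compute * compute + w_latency * call.latency_s + w_usd * dollars


# Same token count, very different cost once size, latency, and price are included.
small_call = blended_cost(CallCost(tokens=500, params_billions=0.5,
                                   latency_s=0.4, usd_per_1k_tokens=0.0))
large_call = blended_cost(CallCost(tokens=500, params_billions=70.0,
                                   latency_s=3.0, usd_per_1k_tokens=0.01))
```

A token-only model would score these two calls identically, which is the gap this limitation describes.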

	## Citation

	```bibtex
	@software{occ_stack,
	  title = {OCC: Oracle-Credit-Compute System for Agentic Compute Allocation},
	  author = {narcolepticchicken},
	  year = {2026},
	  url = {https://huggingface.co/narcolepticchicken/occ-stack}
	}
	```

	## License

	Apache 2.0