---
title: CausalStream
emoji: 🌊
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
---
# CausalStream: High-Fidelity SRE Logic Environment
CausalStream is a specialized Reinforcement Learning (RL) environment designed to benchmark and train agents in **temporal causal reasoning** and **resource allocation under uncertainty**. It simulates a complex streaming data infrastructure (Kafka/Flink pattern) where agents must diagnose production incidents.
## Core Engine Architecture
### 1. The Stochastic World Clock (Ticked Execution)
Unlike traditional "static" debugging environments, CausalStream operates on a discrete **Tick** system.
- **Temporal Drift**: Every action that consumes compute or network resources (e.g., sampling a raw stream) advances the world clock.
- **Event Physics**: Events are generated with a stochastic latency model: `arrival_time = event_time + base_latency + jitter`.
- **Incident Injection**: Incidents are not just boolean flags; they modify the distribution of the stream. For example, a `LATENCY_SPIKE` incident increases the variance of `base_latency`, so more events arrive after their aggregation window has closed and drop out of the aggregate ("late arrival" data loss).
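The latency model above can be sketched in Python as a minimal illustration. It is not the environment's implementation: the Gaussian draw for `base_latency`, the uniform jitter, the tumbling windows with an allowed-lateness grace period, and the names `generate_events` and `late_fraction` are all assumptions. The sketch shows how raising the variance of `base_latency` (a crude stand-in for `LATENCY_SPIKE`) pushes more events past the point where their window stops accepting data.

```python
import random
from dataclasses import dataclass


@dataclass
class Event:
    event_time: float
    arrival_time: float


def generate_events(n, latency_mean, latency_std, jitter_max, rng):
    """Emit one event per time unit with
    arrival_time = event_time + base_latency + jitter.

    base_latency is drawn per event from a Gaussian clipped at 0, and jitter
    from U[0, jitter_max]; both distributions are illustrative assumptions.
    """
    events = []
    for i in range(n):
        event_time = float(i)
        base_latency = max(0.0, rng.gauss(latency_mean, latency_std))
        jitter = rng.uniform(0.0, jitter_max)
        events.append(Event(event_time, event_time + base_latency + jitter))
    return events


def late_fraction(events, window_size, allowed_lateness):
    """Share of events that arrive after their tumbling window has closed.

    The window containing event_time ends at the next multiple of window_size
    and stops accepting data allowed_lateness time units after that.
    """
    if not events:
        return 0.0
    late = 0
    for e in events:
        window_end = (e.event_time // window_size + 1) * window_size
        if e.arrival_time > window_end + allowed_lateness:
            late += 1
    return late / len(events)


# Same seed, higher latency variance: a crude LATENCY_SPIKE.
normal = generate_events(5000, 1.0, 0.5, 0.5, random.Random(1))
spike = generate_events(5000, 1.0, 10.0, 0.5, random.Random(1))
print(late_fraction(normal, 10.0, 2.0), late_fraction(spike, 10.0, 2.0))
```

Only the variance changes between the two runs, so any increase in the late fraction comes from the wider latency distribution rather than a shift in its mean.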
### 2. Tiered Observation Model
Information is not free. Agents interact with a hierarchical observation space:
- **Dashboard (L0)**: Free to read. Provides aggregated metrics (Revenue, Error Rate). High signal but low granularity.
- **Stream Samples (L1)**: Costs **1 Tick**. Provides raw JSON event snippets. Necessary for detecting jitter and schema mismatches.
- **Lineage Graph (L2)**: Costs **1 Tick**. Provides the SQL logic and dependency chain of the data pipeline.
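A minimal sketch of tier-based tick accounting follows, assuming each observation tool maps to a fixed tick cost. Apart from `sample_stream`, the tool names are hypothetical, as are the `WorldClock` class and its `max_ticks` budget. They only illustrate how free L0 reads differ from tick-consuming L1/L2 reads when an agent has a limited budget.

```python
from dataclasses import dataclass, field

# Tick cost per observation tier. Only `sample_stream` is named in this README;
# `read_dashboard` and `get_lineage` are hypothetical tool names.
TICK_COST = {
    "read_dashboard": 0,  # L0: free aggregated metrics
    "sample_stream": 1,   # L1: raw JSON event snippets
    "get_lineage": 1,     # L2: SQL logic and dependency chain
}


@dataclass
class WorldClock:
    max_ticks: int = 20  # hypothetical per-episode budget
    tick: int = 0
    history: list = field(default_factory=list)

    def observe(self, tool: str) -> int:
        """Advance the clock by the tool's cost and return the new tick."""
        cost = TICK_COST[tool]  # unknown tools raise KeyError
        if self.tick + cost > self.max_ticks:
            raise RuntimeError(f"tick budget exhausted before {tool!r}")
        self.tick += cost
        self.history.append((tool, self.tick))
        return self.tick


clock = WorldClock(max_ticks=2)
clock.observe("read_dashboard")  # free: clock stays at 0
clock.observe("sample_stream")   # clock advances to 1
```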
### 3. Counterfactual Query Engine (Causal Discovery)
To prove a hypothesis, agents can execute **Counterfactual Actions**.
- **Action**: `ask_counterfactual(window_offset=X)`
- **Mechanism**: The engine forks the internal state and simulates what the metrics *would have been* if a variable (such as the late-arrival window) were modified.
- **Significance**: This forces agents to move beyond pattern matching and perform true scientific experimentation.
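The fork-and-replay idea behind `ask_counterfactual` might look roughly like the sketch below. The `PipelineState` shape, the revenue metric, and treating `window_offset` as an extension of the allowed lateness are illustrative assumptions, not the engine's actual internals. The property worth copying is that the fork is a deep copy, so the real world state is never mutated by a hypothetical.

```python
import copy
from dataclasses import dataclass


@dataclass
class PipelineState:
    events: list  # (event_time, arrival_time, revenue) tuples
    window_size: float
    allowed_lateness: float


def windowed_revenue(state: PipelineState) -> float:
    """Sum revenue of events that reach their tumbling window before it closes."""
    total = 0.0
    for event_time, arrival_time, revenue in state.events:
        window_end = (event_time // state.window_size + 1) * state.window_size
        if arrival_time <= window_end + state.allowed_lateness:
            total += revenue  # late events are silently dropped
    return total


def ask_counterfactual(state: PipelineState, window_offset: float) -> dict:
    """Compare the observed metric with the metric under a widened window."""
    fork = copy.deepcopy(state)  # intervene on a copy, never on the real state
    fork.allowed_lateness += window_offset
    return {
        "observed": windowed_revenue(state),
        "counterfactual": windowed_revenue(fork),
    }


state = PipelineState(
    events=[(1.0, 2.0, 10.0), (8.0, 15.0, 5.0)],
    window_size=10.0,
    allowed_lateness=2.0,
)
result = ask_counterfactual(state, window_offset=5.0)
```

In this example the second event arrives 3 time units after its window closes, so widening the window by 5 recovers its revenue in the counterfactual while the observed metric stays at 10.0. A gap between the two numbers is direct evidence that late arrivals, rather than missing events, explain the revenue drop.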
## ⚖️ Deterministic Grading & F1 Evidence Scoring
CausalStream solves the "Subjective Reasoning" problem by requiring agents to submit a structured **Theory**. The environment enforces strict, deterministic grading based on the F1 score of evidence retrieval.
- **The Theory**: The agent submits a `RootCauseEnum` (e.g., `OUT_OF_ORDER`) and a `List[Evidence]` (specific JSON tracking keys or IDs).
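- **The Formula**: a true positive is a submitted evidence item that appears in the ground-truth evidence set, a false positive is a submitted item that does not, and a false negative is a ground-truth item the agent did not submit.
  - `Precision = True_Positives / (True_Positives + False_Positives)`
  - `Recall = True_Positives / (True_Positives + False_Negatives)`
  - `F1_Score = 2 * (Precision * Recall) / (Precision + Recall)`
- **Scoring Bounds**:
  - Incorrect cause: `0.0`.
  - Correct cause: the score is the evidence F1, which lies in `[0.0, 1.0]`. It is `0.0` when none of the submitted evidence is correct and `1.0` only when the submitted evidence exactly matches the ground truth.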
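The grading rule can be written out as a small function. This is a sketch, not the grader's code: `score_theory` is a hypothetical name, the evidence IDs (`evt-1`, ...) are made up, `RootCauseEnum` here contains only the two cause names that appear in this README, and deduplicating evidence with a set and returning `0.0` when there are no true positives (avoiding a 0/0 division) are assumptions.

```python
from enum import Enum


class RootCauseEnum(Enum):
    # Only these two names appear in the README; the real enum may differ.
    OUT_OF_ORDER = "OUT_OF_ORDER"
    LATENCY_SPIKE = "LATENCY_SPIKE"


def score_theory(pred_cause, pred_evidence, true_cause, true_evidence):
    """Return 0.0 for a wrong root cause, otherwise the evidence F1."""
    if pred_cause != true_cause:
        return 0.0
    predicted = set(pred_evidence)  # duplicates cannot inflate true positives
    gold = set(true_evidence)
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


score = score_theory(
    RootCauseEnum.OUT_OF_ORDER, ["evt-1", "evt-2", "evt-9"],
    RootCauseEnum.OUT_OF_ORDER, ["evt-1", "evt-2", "evt-3"],
)
```

Two of the three submitted items are correct and two of the three ground-truth items are found, so precision and recall are both 2/3 and the score is 2/3. Padding the submission with guesses lowers precision, which is what makes the metric resistant to "submit everything" strategies.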
### 🛡️ Anti-Reward Hacking (Exploit Guardrails)
To prevent agents from "loop farming" API endpoints (reward hacking), every tool call is recorded in a **Stateful Tracking Set**. Each semantic action type is rewarded only *once* per episode: if an agent calls `sample_stream` in a loop, every call after the first earns +0.00.
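The once-per-episode rule can be sketched with a plain Python set. The `RewardTracker` class, the `0.05` exploration reward, and keying purely on the action-type string are assumptions made for illustration; the environment may key on richer semantic signatures.

```python
class RewardTracker:
    """Pays a small exploration reward the first time each action type is used."""

    def __init__(self, exploration_reward: float = 0.05):  # hypothetical value
        self.exploration_reward = exploration_reward
        self.seen: set[str] = set()

    def reward(self, action_type: str) -> float:
        if action_type in self.seen:
            return 0.0  # repeats (loop farming) earn nothing
        self.seen.add(action_type)
        return self.exploration_reward

    def reset(self) -> None:
        """Call at the start of every episode."""
        self.seen.clear()


tracker = RewardTracker()
total = sum(tracker.reward("sample_stream") for _ in range(1000))
```

However many times the loop runs, `total` stays at the single first-call reward, so looping never beats trying a different action.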
## 🚀 Technical Setup
### Local Development
```bash
pip install -r requirements.txt
python server.py --port 7860
```
### Docker Deployment
The environment is sized to run within **2 vCPU / 8 GB RAM** and exposes a REST API via FastAPI.
```bash
docker build -t causal-stream .
docker run -p 7860:7860 causal-stream
```
---
*Developed for the Meta PyTorch OpenEnv Hackathon 2026. Focus: Causal Intelligence.*