title: CausalStream
emoji: 🌊
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false

CausalStream: High-Fidelity SRE Logic Environment

CausalStream is a specialized Reinforcement Learning (RL) environment designed to benchmark and train agents in temporal causal reasoning and resource allocation under uncertainty. It simulates a complex streaming data infrastructure (Kafka/Flink pattern) where agents must diagnose production incidents.

Core Engine Architecture

1. The Stochastic World Clock (Ticked Execution)

Unlike traditional "static" debugging environments, CausalStream operates on a discrete Tick system.

  • Temporal Drift: Every action that consumes compute or network resources (e.g., sampling a raw stream) advances the world clock.
  • Event Physics: Events are generated with a stochastic latency model: arrival_time = event_time + base_latency + jitter.
  • Incident Injection: Incidents are not just boolean flags; they modify the distribution of the stream. For example, a LATENCY_SPIKE incident increases the base_latency variance, causing "late arrival" data loss in aggregation windows.
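The latency model above can be sketched as follows. This is a minimal illustration, not the environment's actual code: the function names, the Gaussian jitter, the `window_size` and `allowed_lateness` parameters, and the choice to model a LATENCY_SPIKE as a larger `jitter_std` are all assumptions made for the example.

```python
import random


def simulate_arrivals(event_times, base_latency=1.0, jitter_std=0.2, seed=0):
    """Apply arrival_time = event_time + base_latency + jitter to each event.

    Jitter is drawn from a zero-mean Gaussian (an assumption); a LATENCY_SPIKE
    can be imitated by passing a larger jitter_std.
    """
    rng = random.Random(seed)
    arrivals = []
    for event_time in event_times:
        jitter = rng.gauss(0.0, jitter_std)
        # Clamp the delay so an event never arrives before it occurred.
        arrivals.append(event_time + max(0.0, base_latency + jitter))
    return arrivals


def count_late(event_times, arrivals, window_size=10.0, allowed_lateness=2.0):
    """Count events that arrive after their window closed plus the grace period."""
    late = 0
    for event_time, arrival_time in zip(event_times, arrivals):
        window_end = (event_time // window_size + 1) * window_size
        if arrival_time > window_end + allowed_lateness:
            late += 1
    return late
```

Running `count_late` on arrivals simulated with a small and a large `jitter_std` gives a rough feel for how wider latency variance translates into dropped events in an aggregation window.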

2. Tiered Observation Model

Information is not free. Agents interact with a hierarchical observation space:

  • Dashboard (L0): Free to read. Provides aggregated metrics (Revenue, Error Rate). High signal but low granularity.
  • Stream Samples (L1): Costs 1 Tick. Provides raw JSON event snippets. Necessary for detecting jitter and schema mismatches.
  • Lineage Graph (L2): Costs 1 Tick. Provides the SQL logic and dependency chain of the data pipeline.
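The cost structure above can be expressed as a small lookup table. The sketch below is illustrative only: `Tier`, `TICK_COST`, and `WorldClock` are hypothetical names, and only the tick costs themselves (0, 1, 1) come from the list above.

```python
from enum import Enum


class Tier(Enum):
    DASHBOARD = "L0"
    STREAM_SAMPLE = "L1"
    LINEAGE_GRAPH = "L2"


# Tick cost per observation tier, as listed above.
TICK_COST = {
    Tier.DASHBOARD: 0,
    Tier.STREAM_SAMPLE: 1,
    Tier.LINEAGE_GRAPH: 1,
}


class WorldClock:
    """Hypothetical clock that advances whenever a paid observation is made."""

    def __init__(self):
        self.tick = 0

    def observe(self, tier):
        # Reading the dashboard is free; the other tiers advance the clock.
        self.tick += TICK_COST[tier]
        return self.tick
```

An agent that reads the dashboard first and pays for stream samples only when the aggregates look anomalous spends fewer ticks per diagnosis.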

3. Counterfactual Query Engine (Causal Discovery)

To prove a hypothesis, agents can execute Counterfactual Actions.

  • Action: ask_counterfactual(window_offset=X)
  • Mechanism: The engine forks its internal state and simulates what the metrics would have been if a variable (such as the late-arrival window) had been set differently.
  • Significance: This pushes agents beyond pattern matching: a hypothesis has to be tested by intervening on a variable and observing the effect.
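The fork-and-simulate mechanism can be sketched as below. This is a toy model under stated assumptions: `PipelineState`, `windowed_revenue`, and the extra `state` parameter on `ask_counterfactual` are hypothetical, and the real engine's state and metrics are not documented here. The sketch treats `window_offset` as a change to the allowed lateness of the aggregation window.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    """Hypothetical pipeline state: windowing config plus buffered events."""
    allowed_lateness: float
    window_size: float = 10.0
    # Each event is an (event_time, arrival_time, revenue) tuple.
    events: list = field(default_factory=list)


def windowed_revenue(state):
    """Sum revenue for events that arrive before their window is finalized."""
    total = 0.0
    for event_time, arrival_time, revenue in state.events:
        window_end = (event_time // state.window_size + 1) * state.window_size
        if arrival_time <= window_end + state.allowed_lateness:
            total += revenue
    return total


def ask_counterfactual(state, window_offset):
    """Fork the state, widen the lateness window, and recompute the metric."""
    fork = copy.deepcopy(state)  # the live state is never mutated
    fork.allowed_lateness += window_offset
    return windowed_revenue(fork)
```

If the counterfactual revenue is higher than the observed revenue, late arrivals are a plausible cause of the loss; if it is unchanged, that hypothesis can be ruled out.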

⚖️ Deterministic Grading & F1 Evidence Scoring

CausalStream avoids subjective grading of free-form reasoning by requiring agents to submit a structured Theory. The environment grades each Theory deterministically, using the F1 score of the submitted evidence against the ground-truth evidence.

  • The Theory: The agent submits a RootCauseEnum (e.g., OUT_OF_ORDER) and a List[Evidence] (specific JSON tracking keys or IDs).
  • The Formula:
    • Precision = True_Positives / (True_Positives + False_Positives)
    • Recall = True_Positives / (True_Positives + False_Negatives)
    • F1_Score = 2 * (Precision * Recall) / (Precision + Recall)
  • Scoring Bounds:
    • Incorrect cause: 0.0.
    • Correct cause: the score equals the evidence F1 score, which lies in the range (0.0, 1.0].
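A minimal sketch of this grading rule, assuming evidence items are hashable IDs compared as sets. The function name `grade_theory` and its parameters are hypothetical, and returning 0.0 when no submitted evidence matches is an assumption made to avoid division by zero; the description above does not specify that case.

```python
def grade_theory(predicted_cause, true_cause, submitted_evidence, true_evidence):
    """Return 0.0 for a wrong root cause, otherwise the evidence F1 score."""
    if predicted_cause != true_cause:
        return 0.0

    submitted = set(submitted_evidence)
    truth = set(true_evidence)
    true_positives = len(submitted & truth)
    false_positives = len(submitted - truth)
    false_negatives = len(truth - submitted)

    # Assumption: with no matching evidence, precision and recall are both
    # zero and F1 is undefined, so the sketch returns 0.0.
    if true_positives == 0:
        return 0.0

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * (precision * recall) / (precision + recall)
```

For example, submitting three evidence IDs of which two are correct, against four ground-truth IDs, gives Precision = 2/3, Recall = 1/2, and F1 = 4/7 (about 0.571).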

🛡️ Anti-Reward Hacking (Exploit Guardrails)

To prevent agents from "loop farming" API endpoints (reward hacking), every tool call is recorded in a stateful tracking set. Each semantic action type is rewarded only once per episode, so if an agent calls sample_stream in a loop, every repeat call earns +0.00.
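The once-per-episode rule can be sketched with a plain set. `RewardTracker`, `reward_for`, and the `step_reward` value are hypothetical names and numbers chosen for illustration; the environment's actual reward magnitudes are not stated here.

```python
class RewardTracker:
    """Pays the exploration reward for each action type once per episode."""

    def __init__(self, step_reward=0.05):
        self.step_reward = step_reward  # illustrative value, not from the source
        self._rewarded = set()

    def reward_for(self, action_type):
        # Repeat calls of an already-rewarded action type earn nothing.
        if action_type in self._rewarded:
            return 0.0
        self._rewarded.add(action_type)
        return self.step_reward

    def reset(self):
        """Clear the tracking set at the start of a new episode."""
        self._rewarded.clear()
```

Because the set is keyed by action type rather than by individual call, repeating `sample_stream` with different arguments still yields no additional reward in this sketch.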

🚀 Technical Setup

Local Development

pip install -r requirements.txt
python server.py --port 7860

Docker Deployment

The environment is designed to run within 2 vCPU / 8 GB RAM and exposes a REST API via FastAPI.

docker build -t causal-stream .
docker run -p 7860:7860 causal-stream

Developed for the Meta PyTorch OpenEnv Hackathon 2026. Focus: Causal Intelligence.