---
title: ForensicShell OpenEnv
emoji: 🔎
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---

# ForensicShell — OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent investigates a pre-seeded "breached" Linux host using read-only structured actions (`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport` that is graded deterministically against hidden ground truth.

## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates something a junior SOC analyst actually does on day one: SSH into a compromised box, read the logs, find the modified files, hash the backdoor, and reconstruct the attacker's kill chain.

It is **not a game**, the grader is **fully deterministic**, and partial credit is awarded per subfield, so the reward function gives a real gradient instead of a 0/1 cliff.

## Tasks

The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | **Easy** | Compromised user + initial source IP |
| `t2_modified` | **Medium** | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | **Hard** | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs, attacker tools, and attack patterns. Ground truth is held inside the env and never exposed to the agent through the action API.
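The required report fields grow with the tier. A minimal sketch of what the per-tier payloads might look like: only `compromised_user` and `initial_ip` are confirmed field names (they appear in the client quick start); `modified_files`, `backdoor_sha256`, `timeline`, and the `TimelineEvent` shape are illustrative guesses, not the environment's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TimelineEvent:
    # Hypothetical shape: one kill-chain phase plus supporting detail.
    phase: str   # e.g. "login", "recon", "privesc", "persistence", "exfil"
    detail: str


@dataclass
class ForensicReport:
    # t1_login grades only these two fields:
    compromised_user: str
    initial_ip: str
    # t2_modified adds these (names assumed, not confirmed by the client API):
    modified_files: List[str] = field(default_factory=list)
    backdoor_sha256: Optional[str] = None
    # t3_timeline adds an ordered kill chain (name assumed):
    timeline: List[TimelineEvent] = field(default_factory=list)


# A t1-level report needs only user + IP:
t1_report = ForensicReport(compromised_user="alice", initial_ip="198.51.100.77")
```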
## Action space

A single discriminated action `ForensicShellAction` with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (ForensicReport) | Terminal — grades the report and ends the episode |

The agent has **30 steps per episode**. Failing to submit before the budget is exhausted ends the episode with reward 0.

## Observation space

```python
class ForensicShellObservation(Observation):
    output: str                  # human-readable result of the last action
    task_id: str                 # current task identifier
    task_description: str        # what the agent must determine
    steps_remaining: int         # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float                # 0.0 except on the terminal submit_report step
    metadata: dict
```

## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is deterministic and awards partial credit per subfield, so the reward signal has a meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5 * user_correct + 0.5 * ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards clip to `[0.0, 1.0]`. See `server/grader.py` for the full implementation.
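The `t2_modified` composition above can be sketched in plain Python. The dict keys below (`user`, `ip`, `files`, `sha256`) are illustrative; `server/grader.py` is the authoritative implementation.

```python
def jaccard(predicted, truth):
    """Set overlap in [0, 1]; empty vs. empty counts as a perfect match."""
    a, b = set(predicted), set(truth)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def grade_t2(report, truth):
    """Partial-credit score for t2_modified, mirroring the documented weights:
    0.2*user + 0.2*ip + 0.3*Jaccard(files) + 0.3*sha256, clipped to [0, 1]."""
    score = (
        0.2 * (report["user"] == truth["user"])
        + 0.2 * (report["ip"] == truth["ip"])
        + 0.3 * jaccard(report["files"], truth["files"])
        + 0.3 * (report["sha256"] == truth["sha256"])
    )
    return max(0.0, min(1.0, score))


truth = {"user": "alice", "ip": "198.51.100.77",
         "files": ["/usr/bin/sshd", "/etc/cron.d/backdoor"], "sha256": "deadbeef"}
# Correct user/IP/hash, but only one of the two modified files found:
report = {"user": "alice", "ip": "198.51.100.77",
          "files": ["/usr/bin/sshd"], "sha256": "deadbeef"}
# Jaccard is 1/2, so the score is 0.2 + 0.2 + 0.3*0.5 + 0.3 = 0.85
```

Because each subfield contributes independently, a partially correct report still earns a graded reward rather than falling off a 0/1 cliff.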
## Quick start (client)

```python
import asyncio

from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent


async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))

        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```

## Building locally

```bash
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

- `POST /reset` — start a new episode
- `POST /step` — execute an action
- `GET /state` — episode state
- `GET /health` — health check
- `GET /docs` — OpenAPI docs
- `WS /ws` — persistent WebSocket session (used by `EnvClient`)

## Running the baseline

The repo root contains `inference.py`, which runs an OpenAI-client-compatible LLM through all three tasks and emits hackathon-formatted `[START]/[STEP]/[END]` log lines to stdout.

```bash
export HF_TOKEN=
export API_BASE_URL=https://api.groq.com/openai/v1  # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```

A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | **1.000** |
| `t2_modified` (medium) | **0.500** |
| `t3_timeline` (hard) | **0.750** |

Scores are non-trivial without being saturated — exactly what an RL training signal needs.

## License

BSD-3-Clause (matches OpenEnv core).