---
title: ForensicShell OpenEnv
emoji: π
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---
# ForensicShell – OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.

## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates
something a junior SOC analyst actually does on day one: SSH into a compromised box,
read the logs, find the modified files, hash the backdoor, and reconstruct the
attacker's kill chain. It is **not a game**, the grader is **fully deterministic**,
and partial credit is awarded per subfield, so the reward function gives a real
gradient instead of a 0/1 cliff.
## Tasks

The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | **Easy** | Compromised user + initial source IP |
| `t2_modified` | **Medium** | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | **Hard** | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs,
attacker tools, and attack patterns. Ground truth is held inside the env and never
exposed to the agent through the action API.
## Action space

A single discriminated action, `ForensicShellAction`, with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (`ForensicReport`) | Terminal: grades the report and ends the episode |

The agent has **30 steps per episode**. Failing to submit before the budget is
exhausted ends the episode with reward 0.
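Given the field requirements in the table above, a minimal sketch of client-side payload validation (plain dicts stand in for `ForensicShellAction`; the grep pattern and file paths are illustrative, not taken from the actual scenarios):

```python
# Required fields per action_type, mirroring the table above.
REQUIRED_FIELDS = {
    "list_dir": {"path"},
    "read_file": {"path", "max_bytes"},
    "grep": {"pattern", "path"},
    "stat": {"path"},
    "submit_report": {"report"},
}

def missing_fields(action: dict) -> set:
    """Return any required fields the action payload is missing."""
    return REQUIRED_FIELDS[action["action_type"]] - action.keys()

# Illustrative payloads for a typical investigation step.
ok = {"action_type": "grep", "pattern": "Accepted password", "path": "/var/log/auth.log"}
bad = {"action_type": "read_file", "path": "/var/log/auth.log"}  # forgot max_bytes

print(missing_fields(ok))   # set()
print(missing_fields(bad))  # {'max_bytes'}
```

Catching a missing field client-side saves one of the 30 budgeted steps that a rejected action would otherwise burn.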
## Observation space

```python
class ForensicShellObservation(Observation):
    output: str                  # human-readable result of the last action
    task_id: str                 # current task identifier
    task_description: str        # what the agent must determine
    steps_remaining: int         # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float                # 0.0 except on the terminal submit_report step
    metadata: dict
```
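A minimal sketch of acting on these fields in a client loop, assuming the observation shape above (`Obs` is a local stand-in dataclass and `should_submit` a hypothetical helper, not part of the package):

```python
from dataclasses import dataclass, field
from typing import Optional

# Local stand-in mirroring ForensicShellObservation's fields.
@dataclass
class Obs:
    output: str
    task_id: str
    task_description: str
    steps_remaining: int
    action_error: Optional[str]
    done: bool
    reward: float
    metadata: dict = field(default_factory=dict)

def should_submit(obs: Obs, min_buffer: int = 2) -> bool:
    """Force a submit_report before the budget runs out:
    an unsubmitted episode scores 0, a partial report can still earn credit."""
    return not obs.done and obs.steps_remaining <= min_buffer
```

The key consequence of the budget rule: submitting a half-finished `ForensicReport` dominates never submitting at all, so a sensible policy reserves the last step or two for `submit_report`.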
## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal has
a meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5 * user_correct + 0.5 * ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards clip to `[0.0, 1.0]`. See `server/grader.py` for the full implementation.
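The table rows read as ordinary weighted sums. A sketch of the `t2_modified` composition (an illustration of the scoring shape only, not the actual code in `server/grader.py`):

```python
def jaccard(a: set, b: set) -> float:
    """Intersection over union; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def grade_t2(report: dict, truth: dict) -> float:
    """Weighted partial credit, per the t2_modified row above."""
    score = (
        0.2 * (report["user"] == truth["user"])
        + 0.2 * (report["ip"] == truth["ip"])
        + 0.3 * jaccard(set(report["files"]), set(truth["files"]))
        + 0.3 * (report["sha256"] == truth["sha256"])
    )
    return min(max(score, 0.0), 1.0)  # clip to [0.0, 1.0]
```

Because `modified_files` is scored by Jaccard similarity rather than exact match, listing one true file out of several still earns a fraction of the 0.3 weight, which is what keeps the reward landscape smooth instead of a cliff.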
## Quick start (client)

```python
import asyncio

from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))

        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```
## Building locally

```bash
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

- `POST /reset` – start a new episode
- `POST /step` – execute an action
- `GET /state` – episode state
- `GET /health` – health check
- `GET /docs` – OpenAPI docs
- `WS /ws` – persistent WebSocket session (used by `EnvClient`)
## Running the baseline

The repo root contains `inference.py`, which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]/[STEP]/[END]` log
lines to stdout.

```bash
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1  # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest

python inference.py
```
A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | **1.000** |
| `t2_modified` (medium) | **0.500** |
| `t3_timeline` (hard) | **0.750** |

Scores are non-trivial but far from saturated – exactly what an RL training
signal needs.
## License

BSD-3-Clause (matches OpenEnv core).