---
title: ForensicShell OpenEnv
emoji: π
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- forensics
- security
- rl
---
# ForensicShell – OpenEnv Environment
A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.
## Why this environment?
Most RL environments are toys (games, classification, echo). ForensicShell simulates
something a junior SOC analyst actually does on day one: SSH into a compromised box,
read the logs, find the modified files, hash the backdoor, and reconstruct the
attacker's kill chain. It is **not a game**, the grader is **fully deterministic**,
and partial credit is awarded per subfield so the reward function gives a real
gradient instead of a 0/1 cliff.
## Tasks
The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | **Easy** | Compromised user + initial source IP |
| `t2_modified` | **Medium** | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | **Hard** | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |
Each task ships with a different hand-authored scenario: different usernames, IPs,
attacker tools, and attack patterns. Ground truth is held inside the env and never
exposed to the agent through the action API.
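To make the report shape concrete, here is a hypothetical `t3`-style report. The field names on the stand-in `ForensicReport` and `TimelineEvent` classes below are illustrative assumptions, not the real schema from `forensic_shell/models.py`, and all values are fabricated sample data:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for forensic_shell.models; real field names may differ.
@dataclass
class TimelineEvent:
    phase: str         # e.g. "login", "recon", "privesc", "persistence", "exfil"
    description: str

@dataclass
class ForensicReport:
    compromised_user: str
    initial_ip: str
    timeline: list = field(default_factory=list)

report = ForensicReport(
    compromised_user="alice",
    initial_ip="198.51.100.77",
    timeline=[
        TimelineEvent("login", "SSH login from 198.51.100.77"),
        TimelineEvent("recon", "enumerated /etc and sudoers"),
        TimelineEvent("privesc", "exploited a misconfigured sudo rule"),
        TimelineEvent("persistence", "dropped a backdoor in /usr/local/bin"),
        TimelineEvent("exfil", "archived /home and sent it out"),
    ],
)
```

The ordering of `timeline` matters for the hard task, since `t3_timeline` grades the event sequence as well as its contents.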
## Action space
A single discriminated action `ForensicShellAction` with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (ForensicReport) | Terminal – grades the report and ends the episode |
The agent has **30 steps per episode**. Failing to submit before the budget is
exhausted ends the episode with reward 0.
## Observation space
```python
class ForensicShellObservation(Observation):
    output: str                  # human-readable result of the last action
    task_id: str                 # current task identifier
    task_description: str        # what the agent must determine
    steps_remaining: int         # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float                # 0.0 except on the terminal submit_report step
    metadata: dict
```
## Reward function
Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal has
meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5 * user_correct + 0.5 * ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |
All rewards clip to `[0.0, 1.0]`. See `server/grader.py` for the full implementation.
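For intuition, here is a minimal sketch of the `t2_modified` scoring scheme, assuming only the weights in the table above. The function names and dict-based report shape are illustrative; `server/grader.py` remains the authoritative implementation:

```python
def jaccard(predicted: set, truth: set) -> float:
    """Set overlap in [0, 1]; two empty sets count as a perfect match."""
    if not predicted and not truth:
        return 1.0
    return len(predicted & truth) / len(predicted | truth)

def grade_t2(report: dict, truth: dict) -> float:
    """0.2*user + 0.2*ip + 0.3*Jaccard(files) + 0.3*sha256, clipped to [0, 1]."""
    score = (
        0.2 * (report["user"] == truth["user"])
        + 0.2 * (report["ip"] == truth["ip"])
        + 0.3 * jaccard(set(report["files"]), set(truth["files"]))
        + 0.3 * (report["sha256"] == truth["sha256"])
    )
    return max(0.0, min(1.0, score))
```

Under this sketch, a report that names the right user but the wrong IP, finds one of two modified files, and hashes the wrong binary scores 0.2 + 0.3 × 0.5 = 0.35.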
## Quick start (client)
```python
import asyncio
from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent
async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))

        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```
## Building locally
```bash
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```
The server exposes:
- `POST /reset` – start a new episode
- `POST /step` – execute an action
- `GET /state` – episode state
- `GET /health` – health check
- `GET /docs` – OpenAPI docs
- `WS /ws` – persistent WebSocket session (used by `EnvClient`)
## Running the baseline
The repo root contains `inference.py` which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]/[STEP]/[END]` log
lines to stdout.
```bash
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1 # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```
A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | **1.000** |
| `t2_modified` (medium) | **0.500** |
| `t3_timeline` (hard) | **0.750** |
Scores sit well above the floor without saturating – exactly the spread an RL training
signal needs.
## License
BSD-3-Clause (matches OpenEnv core).