---
title: ForensicShell OpenEnv
emoji: 🔎
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- forensics
- security
- rl
---
# ForensicShell — OpenEnv Environment
A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.
## Why this environment?
Most RL environments are toys (games, classification, echo). ForensicShell simulates
something a junior SOC analyst actually does on day one: SSH into a compromised box,
read the logs, find the modified files, hash the backdoor, and reconstruct the
attacker's kill chain. It is **not a game**: the grader is **fully deterministic**,
and partial credit is awarded per subfield, so the reward function gives a real
gradient instead of a 0/1 cliff.
## Tasks
The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.
| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | **Easy** | Compromised user + initial source IP |
| `t2_modified` | **Medium** | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | **Hard** | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |
Each task ships with a different hand-authored scenario: different usernames, IPs,
attacker tools, and attack patterns. Ground truth is held inside the env and never
exposed to the agent through the action API.
## Action space
A single discriminated action `ForensicShellAction` with `action_type` selecting the verb:
| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (ForensicReport) | Terminal — grades the report and ends the episode |
The agent has **30 steps per episode**. Failing to submit before the budget is
exhausted ends the episode with reward 0.
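The discriminated-union shape described above can be sketched with stdlib dataclasses. This is a minimal illustration only: the real `ForensicShellAction` ships with the client package, and `ActionSketch` with its per-verb required-field table is hypothetical.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical stand-in mirroring the action table above: one dataclass,
# with `action_type` selecting the verb and per-verb required fields
# checked at construction time.
REQUIRED_FIELDS = {
    "list_dir": ("path",),
    "read_file": ("path", "max_bytes"),
    "grep": ("pattern", "path"),
    "stat": ("path",),
    "submit_report": ("report",),
}

@dataclass
class ActionSketch:
    action_type: Literal["list_dir", "read_file", "grep", "stat", "submit_report"]
    path: Optional[str] = None
    max_bytes: Optional[int] = None
    pattern: Optional[str] = None
    report: Optional[dict] = None

    def __post_init__(self):
        # Reject actions that are missing a field their verb requires.
        missing = [f for f in REQUIRED_FIELDS[self.action_type]
                   if getattr(self, f) is None]
        if missing:
            raise ValueError(f"{self.action_type} requires {missing}")

# A grep over auth.log, as in the table above:
a = ActionSketch(action_type="grep", pattern="Failed password",
                 path="/var/log/auth.log")
```

Modeling all verbs as one tagged type keeps the wire format to a single `/step` payload shape while still letting the server validate per-verb fields.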
## Observation space
```python
class ForensicShellObservation(Observation):
    output: str                  # human-readable result of the last action
    task_id: str                 # current task identifier
    task_description: str        # what the agent must determine
    steps_remaining: int         # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float                # 0.0 except on the terminal submit_report step
    metadata: dict
```
## Reward function
Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal carries
a meaningful gradient:
| Task | Grader composition |
|---|---|
| `t1_login` | `0.5 * user_correct + 0.5 * ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |
All rewards clip to `[0.0, 1.0]`. See `server/grader.py` for the full implementation.
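To make the partial-credit idea concrete, here is a rough sketch of the `t2_modified` composition. The authoritative code is `server/grader.py`; the report field names `modified_files` and `backdoor_sha256` below are assumptions for illustration.

```python
def jaccard(predicted: set, truth: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 1.0 when both sets are empty."""
    if not predicted and not truth:
        return 1.0
    return len(predicted & truth) / len(predicted | truth)

def grade_t2(report: dict, truth: dict) -> float:
    """Illustrative: 0.2*user + 0.2*ip + 0.3*Jaccard(files) + 0.3*sha256."""
    score = 0.0
    score += 0.2 * (report["compromised_user"] == truth["compromised_user"])
    score += 0.2 * (report["initial_ip"] == truth["initial_ip"])
    score += 0.3 * jaccard(set(report["modified_files"]),
                           set(truth["modified_files"]))
    score += 0.3 * (report["backdoor_sha256"] == truth["backdoor_sha256"])
    return min(max(score, 0.0), 1.0)  # clip to [0.0, 1.0]

truth = {"compromised_user": "alice", "initial_ip": "198.51.100.77",
         "modified_files": ["/usr/bin/sshd", "/etc/crontab"],
         "backdoor_sha256": "deadbeef"}
report = dict(truth, modified_files=["/usr/bin/sshd"])  # found 1 of 2 files
print(grade_t2(report, truth))  # ≈ 0.85: 0.2 + 0.2 + 0.3*0.5 + 0.3
```

Because a half-right file list still earns half of its 0.3 weight, policies get credit for incremental progress rather than only for a perfect report.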
## Quick start (client)
```python
import asyncio

from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))

        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```
## Building locally
```bash
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```
The server exposes:
- `POST /reset` — start a new episode
- `POST /step` — execute an action
- `GET /state` — episode state
- `GET /health` — health check
- `GET /docs` — OpenAPI docs
- `WS /ws` — persistent WebSocket session (used by `EnvClient`)
## Running the baseline
The repo root contains `inference.py` which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]/[STEP]/[END]` log
lines to stdout.
```bash
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1 # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```
A local baseline run (Llama-3.3-70B via Groq) scores roughly:
| Task | Score |
|---|---|
| `t1_login` (easy) | **1.000** |
| `t2_modified` (medium) | **0.500** |
| `t3_timeline` (hard) | **0.750** |
Scores are non-trivial without being saturated — exactly what an RL training
signal needs.
## License
BSD-3-Clause (matches OpenEnv core).