---
title: ForensicShell OpenEnv
emoji: 🔎
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---

# ForensicShell — OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.

## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates
something a junior SOC analyst actually does on day one: SSH into a compromised box,
read the logs, find the modified files, hash the backdoor, and reconstruct the
attacker's kill chain. It is **not a game**, the grader is **fully deterministic**,
and partial credit is awarded per subfield, so the reward function yields a real
gradient instead of a 0/1 cliff.

## Tasks

The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | **Easy** | Compromised user + initial source IP |
| `t2_modified` | **Medium** | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | **Hard** | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs,
attacker tools, and attack patterns. Ground truth is held inside the env and never
exposed to the agent through the action API.

## Action space

All verbs share a single discriminated-union action type, `ForensicShellAction`, with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (ForensicReport) | Terminal — grades the report and ends the episode |

The agent has **30 steps per episode**. Failing to submit before the budget is
exhausted ends the episode with reward 0.
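The required-fields contract in the table above can be sketched as a small validator. This is a hypothetical, dict-based stand-in for `ForensicShellAction`; the real model lives in the `forensic_shell` package and may validate differently:

```python
# Required fields per action_type, mirroring the table above.
REQUIRED_FIELDS = {
    "list_dir": {"path"},
    "read_file": {"path", "max_bytes"},
    "grep": {"pattern", "path"},
    "stat": {"path"},
    "submit_report": {"report"},
}

def validate_action(action: dict) -> None:
    """Raise ValueError if the action is missing its action_type-specific fields."""
    kind = action.get("action_type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - action.keys()
    if missing:
        raise ValueError(f"{kind} requires fields: {sorted(missing)}")
```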

## Observation space

```python
class ForensicShellObservation(Observation):
    output: str               # human-readable result of the last action
    task_id: str              # current task identifier
    task_description: str     # what the agent must determine
    steps_remaining: int      # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float             # 0.0 except on the terminal submit_report step
    metadata: dict
```

## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal carries
a meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5 * user_correct + 0.5 * ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards clip to `[0.0, 1.0]`. See `server/grader.py` for the full implementation.
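As an illustration of how the partial-credit composition works, here is a minimal sketch of the `t2_modified` formula and a Kendall-tau-style ordering score for the `t3` timeline. This is not `server/grader.py`; the `sha256` field name and the exact ordering normalization are assumptions:

```python
def jaccard(a, b) -> float:
    """Set overlap, used for the modified-files subfield."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def grade_t2(report: dict, truth: dict) -> float:
    """Sketch of the t2_modified composition from the table above."""
    score = (
        0.2 * (report.get("compromised_user") == truth["compromised_user"])
        + 0.2 * (report.get("initial_ip") == truth["initial_ip"])
        + 0.3 * jaccard(report.get("modified_files", []), truth["modified_files"])
        + 0.3 * (report.get("sha256") == truth["sha256"])
    )
    return min(max(score, 0.0), 1.0)  # clip to [0, 1]

def ordering_score(pred: list, truth: list) -> float:
    """Kendall-tau-style timeline credit: the fraction of phase pairs
    (among phases present in both lists) placed in the same relative
    order as the ground truth."""
    common = [p for p in pred if p in truth]
    pairs = [(a, b) for i, a in enumerate(common) for b in common[i + 1:]]
    if not pairs:
        return 0.0
    agree = sum(truth.index(a) < truth.index(b) for a, b in pairs)
    return agree / len(pairs)
```

A report with the right user, the wrong IP, half the modified files, and a wrong hash would score `0.2 + 0.3 * 0.5 = 0.35` under this sketch, which is the kind of intermediate gradient the table describes.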

## Quick start (client)

```python
import asyncio
from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))
        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```

## Building locally

```bash
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

- `POST /reset` — start a new episode
- `POST /step` — execute an action
- `GET /state` — episode state
- `GET /health` — health check
- `GET /docs` — OpenAPI docs
- `WS /ws` — persistent WebSocket session (used by `EnvClient`)

## Running the baseline

The repo root contains `inference.py` which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]/[STEP]/[END]` log
lines to stdout.

```bash
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1   # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```

A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | **1.000** |
| `t2_modified` (medium) | **0.500** |
| `t3_timeline` (hard) | **0.750** |

Scores sit well above zero without saturating — exactly the headroom an RL training
signal needs.

## License

BSD-3-Clause (matches OpenEnv core).