---
title: ForensicShell OpenEnv
emoji: 🔎
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---

# ForensicShell — OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.

## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates
something a junior SOC analyst actually does on day one: SSH into a compromised box,
read the logs, find the modified files, hash the backdoor, and reconstruct the
attacker's kill chain. It is **not a game**, the grader is **fully deterministic**,
and partial credit is awarded per subfield, so the reward function yields a real
gradient instead of a 0/1 cliff.

## Tasks

The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | **Easy** | Compromised user + initial source IP |
| `t2_modified` | **Medium** | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | **Hard** | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs,
attacker tools, and attack patterns. Ground truth is held inside the env and never
exposed to the agent through the action API.

## Action space

All verbs share a single discriminated-union action type, `ForensicShellAction`, with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (ForensicReport) | Terminal — grades the report and ends the episode |

The agent has **30 steps per episode**. Failing to submit before the budget is
exhausted ends the episode with reward 0.
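The required-fields contract in the table above can be sketched as a small validator. This is a hypothetical, dict-based stand-in for `ForensicShellAction`; the real model lives in the `forensic_shell` package and may validate differently:

```python
# Required fields per action_type, mirroring the table above.
REQUIRED_FIELDS = {
    "list_dir": {"path"},
    "read_file": {"path", "max_bytes"},
    "grep": {"pattern", "path"},
    "stat": {"path"},
    "submit_report": {"report"},
}

def validate_action(action: dict) -> None:
    """Raise ValueError if the action is missing its action_type-specific fields."""
    kind = action.get("action_type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - action.keys()
    if missing:
        raise ValueError(f"{kind} requires fields: {sorted(missing)}")
```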

## Observation space

```python
class ForensicShellObservation(Observation):
    output: str               # human-readable result of the last action
    task_id: str              # current task identifier
    task_description: str     # what the agent must determine
    steps_remaining: int      # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float             # 0.0 except on the terminal submit_report step
    metadata: dict
```

## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal carries
a meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5 * user_correct + 0.5 * ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards clip to `[0.0, 1.0]`. See `server/grader.py` for the full implementation.
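As an illustration of how the partial-credit composition works, here is a minimal sketch of the `t2_modified` formula and a Kendall-tau-style ordering score for the `t3` timeline. This is not `server/grader.py`; the `sha256` field name and the exact ordering normalization are assumptions:

```python
def jaccard(a, b) -> float:
    """Set overlap, used for the modified-files subfield."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def grade_t2(report: dict, truth: dict) -> float:
    """Sketch of the t2_modified composition from the table above."""
    score = (
        0.2 * (report.get("compromised_user") == truth["compromised_user"])
        + 0.2 * (report.get("initial_ip") == truth["initial_ip"])
        + 0.3 * jaccard(report.get("modified_files", []), truth["modified_files"])
        + 0.3 * (report.get("sha256") == truth["sha256"])
    )
    return min(max(score, 0.0), 1.0)  # clip to [0, 1]

def ordering_score(pred: list, truth: list) -> float:
    """Kendall-tau-style timeline credit: the fraction of phase pairs
    (among phases present in both lists) placed in the same relative
    order as the ground truth."""
    common = [p for p in pred if p in truth]
    pairs = [(a, b) for i, a in enumerate(common) for b in common[i + 1:]]
    if not pairs:
        return 0.0
    agree = sum(truth.index(a) < truth.index(b) for a, b in pairs)
    return agree / len(pairs)
```

A report with the right user, the wrong IP, half the modified files, and a wrong hash would score `0.2 + 0.3 * 0.5 = 0.35` under this sketch, which is the kind of intermediate gradient the table describes.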

## Quick start (client)

```python
import asyncio
from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))
        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```

## Building locally

```bash
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

- `POST /reset` — start a new episode
- `POST /step` — execute an action
- `GET /state` — episode state
- `GET /health` — health check
- `GET /docs` — OpenAPI docs
- `WS /ws` — persistent WebSocket session (used by `EnvClient`)

## Running the baseline

The repo root contains `inference.py` which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]/[STEP]/[END]` log
lines to stdout.

```bash
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1   # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```

A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | **1.000** |
| `t2_modified` (medium) | **0.500** |
| `t3_timeline` (hard) | **0.750** |

Scores sit well above zero without saturating — exactly the headroom an RL training
signal needs.

## License

BSD-3-Clause (matches OpenEnv core).