---
title: ForensicShell OpenEnv
emoji: 🔎
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---

# ForensicShell: OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent investigates a pre-seeded "breached" Linux host using read-only structured actions (`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport` that is graded deterministically against hidden ground truth.

## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates something a junior SOC analyst actually does on day one: SSH into a compromised box, read the logs, find the modified files, hash the backdoor, and reconstruct the attacker's kill chain. It is not a game: the grader is fully deterministic, and partial credit is awarded per subfield, so the reward function gives a real gradient instead of a 0/1 cliff.

## Tasks

The environment exposes three difficulty tiers, selectable via `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
| --- | --- | --- |
| `t1_login` | Easy | Compromised user + initial source IP |
| `t2_modified` | Medium | + list of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | Hard | + ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs, attacker tools, and attack patterns. Ground truth is held inside the env and never exposed to the agent through the action API.

## Action space

A single discriminated action type, `ForensicShellAction`, with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
| --- | --- | --- |
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (`ForensicReport`) | Terminal: grades the report and ends the episode |

The agent has 30 steps per episode. Failing to submit before the budget is exhausted ends the episode with reward 0.
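The required-fields contract above can be sketched as a small validator. The dataclass below is a hypothetical stand-in for illustration only; the real `ForensicShellAction` lives in the client package and may be implemented differently (e.g. as a Pydantic model).

```python
from dataclasses import dataclass
from typing import Optional

# Required fields per verb, taken from the table above.
REQUIRED = {
    "list_dir": ("path",),
    "read_file": ("path", "max_bytes"),
    "grep": ("pattern", "path"),
    "stat": ("path",),
    "submit_report": ("report",),
}

@dataclass
class SketchAction:
    """Hypothetical stand-in for ForensicShellAction."""
    action_type: str
    path: Optional[str] = None
    max_bytes: Optional[int] = None
    pattern: Optional[str] = None
    report: Optional[dict] = None

    def validate(self) -> None:
        # Reject unknown verbs and verbs missing their required fields.
        if self.action_type not in REQUIRED:
            raise ValueError(f"unknown action_type {self.action_type!r}")
        missing = [f for f in REQUIRED[self.action_type]
                   if getattr(self, f) is None]
        if missing:
            raise ValueError(f"{self.action_type} requires {missing}")
```

A malformed action (e.g. `read_file` without `max_bytes`) fails validation instead of reaching the server.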

## Observation space

```python
class ForensicShellObservation(Observation):
    output: str                  # human-readable result of the last action
    task_id: str                 # current task identifier
    task_description: str        # what the agent must determine
    steps_remaining: int         # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float                # 0.0 except on the terminal submit_report step
    metadata: dict
```
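A driver typically branches on `action_error` and `done` before deciding the next step. A minimal handling pattern might look like the sketch below; `Obs` is a hypothetical stand-in for `ForensicShellObservation`, and `next_prompt` is an illustrative helper, not part of the package.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Obs:
    """Hypothetical stand-in mirroring the observation fields above."""
    output: str = ""
    action_error: Optional[str] = None
    steps_remaining: int = 30
    done: bool = False
    reward: float = 0.0

def next_prompt(obs: Obs) -> str:
    """Turn the last observation into the text fed back to the policy."""
    if obs.done:
        return f"episode over, reward={obs.reward:.3f}"
    if obs.action_error is not None:
        # Surface the error so the policy can repair its next action.
        return f"ERROR: {obs.action_error} ({obs.steps_remaining} steps left)"
    return f"{obs.output}\n({obs.steps_remaining} steps left)"
```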

## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is deterministic and awards partial credit per subfield, so the reward signal has a meaningful gradient:

| Task | Grader composition |
| --- | --- |
| `t1_login` | `0.5*user_correct + 0.5*ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards are clipped to [0.0, 1.0]. See `server/grader.py` for the full implementation.
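For intuition, the set-overlap and ordering terms can be sketched as below. These helpers are illustrative approximations (the helper names and edge-case handling are assumptions); `server/grader.py` remains the authoritative implementation.

```python
def jaccard(predicted: set[str], truth: set[str]) -> float:
    """Set overlap for the modified_files term: |A ∩ B| / |A ∪ B|."""
    if not predicted and not truth:
        return 1.0
    return len(predicted & truth) / len(predicted | truth)

def ordering_score(predicted: list[str], truth: list[str]) -> float:
    """Kendall-tau-style score in [0, 1]: fraction of phase pairs that the
    predicted timeline puts in the same relative order as ground truth."""
    common = [p for p in predicted if p in truth]
    if len(common) < 2:
        return 1.0 if common == truth else 0.0
    concordant = total = 0
    for i in range(len(common)):
        for j in range(i + 1, len(common)):
            total += 1
            if truth.index(common[i]) < truth.index(common[j]):
                concordant += 1
    return concordant / total

def grade_t2(user_ok: bool, ip_ok: bool,
             files_pred: set[str], files_true: set[str],
             sha_ok: bool) -> float:
    """Example composition for t2_modified, weights from the table above."""
    return (0.2 * user_ok + 0.2 * ip_ok
            + 0.3 * jaccard(files_pred, files_true) + 0.3 * sha_ok)
```

Because the set term is a Jaccard ratio rather than exact match, listing two of three modified files still earns a partial score instead of zero.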

## Quick start (client)

```python
import asyncio
from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))
        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```

## Building locally

```shell
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

  • POST /reset β€” start a new episode
  • POST /step β€” execute an action
  • GET /state β€” episode state
  • GET /health β€” health check
  • GET /docs β€” OpenAPI docs
  • WS /ws β€” persistent WebSocket session (used by EnvClient)

## Running the baseline

The repo root contains `inference.py`, which runs an OpenAI-client-compatible LLM through all three tasks and emits hackathon-formatted `[START]`/`[STEP]`/`[END]` log lines to stdout.

```shell
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1   # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest
python inference.py
```

A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
| --- | --- |
| `t1_login` (easy) | 1.000 |
| `t2_modified` (medium) | 0.500 |
| `t3_timeline` (hard) | 0.750 |

The baseline neither fails outright nor saturates every task, which is exactly the spread an RL training signal needs.

## License

BSD-3-Clause (matches OpenEnv core).