---
title: ForensicShell OpenEnv
emoji: π
colorFrom: red
colorTo: gray
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - forensics
  - security
  - rl
---
# ForensicShell – OpenEnv Environment

A real-world digital forensics environment for the OpenEnv RL framework. The agent
investigates a pre-seeded "breached" Linux host using read-only structured actions
(`list_dir`, `read_file`, `grep`, `stat`) and submits a structured `ForensicReport`
that is graded deterministically against hidden ground truth.

## Why this environment?

Most RL environments are toys (games, classification, echo). ForensicShell simulates
something a junior SOC analyst actually does on day one: SSH into a compromised box,
read the logs, find the modified files, hash the backdoor, and reconstruct the
attacker's kill chain. It is **not a game**, the grader is **fully deterministic**,
and partial credit is awarded per subfield, so the reward function gives a real
gradient instead of a 0/1 cliff.
## Tasks

The environment exposes three difficulty tiers, selectable at `reset(task_id=...)`.

| `task_id` | Difficulty | What the agent must determine |
|---|---|---|
| `t1_login` | **Easy** | Compromised user + initial source IP |
| `t2_modified` | **Medium** | + List of modified system files + SHA256 of the backdoor binary |
| `t3_timeline` | **Hard** | + Ordered attacker kill-chain timeline (login → recon → privesc → persistence → exfil) |

Each task ships with a different hand-authored scenario: different usernames, IPs,
attacker tools, and attack patterns. Ground truth is held inside the env and never
exposed to the agent through the action API.
## Action space

A single discriminated action, `ForensicShellAction`, with `action_type` selecting the verb:

| `action_type` | Required fields | Effect |
|---|---|---|
| `list_dir` | `path` | List immediate children of a directory in the synthetic FS |
| `read_file` | `path`, `max_bytes` | Read the contents of a file (truncated to `max_bytes`) |
| `grep` | `pattern`, `path` | Return matching lines with line numbers (max 100 hits) |
| `stat` | `path` | Return size + SHA256 of the file's bytes |
| `submit_report` | `report` (`ForensicReport`) | Terminal: grades the report and ends the episode |

The agent has **30 steps per episode**. Failing to submit before the budget is
exhausted ends the episode with reward 0.
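Given the field requirements in the table above, a minimal sketch of client-side payload validation (plain dicts stand in for `ForensicShellAction`; the grep pattern and file paths are illustrative, not taken from the actual scenarios):

```python
# Required fields per action_type, mirroring the table above.
REQUIRED_FIELDS = {
    "list_dir": {"path"},
    "read_file": {"path", "max_bytes"},
    "grep": {"pattern", "path"},
    "stat": {"path"},
    "submit_report": {"report"},
}

def missing_fields(action: dict) -> set:
    """Return any required fields the action payload is missing."""
    return REQUIRED_FIELDS[action["action_type"]] - action.keys()

# Illustrative payloads for a typical investigation step.
ok = {"action_type": "grep", "pattern": "Accepted password", "path": "/var/log/auth.log"}
bad = {"action_type": "read_file", "path": "/var/log/auth.log"}  # forgot max_bytes

print(missing_fields(ok))   # set()
print(missing_fields(bad))  # {'max_bytes'}
```

Catching a missing field client-side saves one of the 30 budgeted steps that a rejected action would otherwise burn.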
## Observation space

```python
class ForensicShellObservation(Observation):
    output: str                  # human-readable result of the last action
    task_id: str                 # current task identifier
    task_description: str        # what the agent must determine
    steps_remaining: int         # remaining action budget
    action_error: Optional[str]  # error message if the last action failed, else None
    done: bool
    reward: float                # 0.0 except on the terminal submit_report step
    metadata: dict
```
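A minimal sketch of acting on these fields in a client loop, assuming the observation shape above (`Obs` is a local stand-in dataclass and `should_submit` a hypothetical helper, not part of the package):

```python
from dataclasses import dataclass, field
from typing import Optional

# Local stand-in mirroring ForensicShellObservation's fields.
@dataclass
class Obs:
    output: str
    task_id: str
    task_description: str
    steps_remaining: int
    action_error: Optional[str]
    done: bool
    reward: float
    metadata: dict = field(default_factory=dict)

def should_submit(obs: Obs, min_buffer: int = 2) -> bool:
    """Force a submit_report before the budget runs out:
    an unsubmitted episode scores 0, a partial report can still earn credit."""
    return not obs.done and obs.steps_remaining <= min_buffer
```

The key consequence of the budget rule: submitting a half-finished `ForensicReport` dominates never submitting at all, so a sensible policy reserves the last step or two for `submit_report`.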
## Reward function

Rewards are returned only on the terminal `submit_report` step. The grader is
deterministic and awards partial credit per subfield, so the reward signal has
a meaningful gradient:

| Task | Grader composition |
|---|---|
| `t1_login` | `0.5 * user_correct + 0.5 * ip_correct` |
| `t2_modified` | `0.2*user + 0.2*ip + 0.3*Jaccard(modified_files) + 0.3*sha256_correct` |
| `t3_timeline` | `0.15*user + 0.15*ip + 0.15*files + 0.15*sha + 0.20*phase_F1 + 0.20*Kendall_tau_ordering` |

All rewards clip to `[0.0, 1.0]`. See `server/grader.py` for the full implementation.
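The table rows read as ordinary weighted sums. A sketch of the `t2_modified` composition (an illustration of the scoring shape only, not the actual code in `server/grader.py`):

```python
def jaccard(a: set, b: set) -> float:
    """Intersection over union; defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def grade_t2(report: dict, truth: dict) -> float:
    """Weighted partial credit, per the t2_modified row above."""
    score = (
        0.2 * (report["user"] == truth["user"])
        + 0.2 * (report["ip"] == truth["ip"])
        + 0.3 * jaccard(set(report["files"]), set(truth["files"]))
        + 0.3 * (report["sha256"] == truth["sha256"])
    )
    return min(max(score, 0.0), 1.0)  # clip to [0.0, 1.0]
```

Because `modified_files` is scored by Jaccard similarity rather than exact match, listing one true file out of several still earns a fraction of the 0.3 weight, which is what keeps the reward landscape smooth instead of a cliff.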
## Quick start (client)

```python
import asyncio

from forensic_shell import ForensicShellAction, ForensicShellEnv
from forensic_shell.models import ForensicReport, TimelineEvent

async def main():
    async with ForensicShellEnv(base_url="https://YOUR-SPACE.hf.space") as env:
        result = await env.reset(task_id="t1_login")
        print(result.observation.task_description)

        result = await env.step(ForensicShellAction(action_type="list_dir", path="/var/log"))
        result = await env.step(ForensicShellAction(action_type="read_file", path="/var/log/auth.log"))

        result = await env.step(ForensicShellAction(
            action_type="submit_report",
            report=ForensicReport(compromised_user="alice", initial_ip="198.51.100.77"),
        ))
        print(f"reward={result.reward:.3f} done={result.done}")

asyncio.run(main())
```
## Building locally

```bash
docker build -t forensic-shell:latest -f server/Dockerfile .
docker run -p 8000:8000 forensic-shell:latest
```

The server exposes:

- `POST /reset` – start a new episode
- `POST /step` – execute an action
- `GET /state` – episode state
- `GET /health` – health check
- `GET /docs` – OpenAPI docs
- `WS /ws` – persistent WebSocket session (used by `EnvClient`)
## Running the baseline

The repo root contains `inference.py`, which runs an OpenAI-client-compatible LLM
through all three tasks and emits hackathon-formatted `[START]/[STEP]/[END]` log
lines to stdout.

```bash
export HF_TOKEN=<your-key>
export API_BASE_URL=https://api.groq.com/openai/v1  # or HF Router
export MODEL_NAME=llama-3.3-70b-versatile
export LOCAL_IMAGE_NAME=forensic-shell:latest

python inference.py
```
A local baseline run (Llama-3.3-70B via Groq) scores roughly:

| Task | Score |
|---|---|
| `t1_login` (easy) | **1.000** |
| `t2_modified` (medium) | **0.500** |
| `t3_timeline` (hard) | **0.750** |

Scores are non-trivial but far from saturated – exactly what an RL training
signal needs.
## License

BSD-3-Clause (matches OpenEnv core).