---
title: Text Adventure Agent Submission
emoji: 🗺
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
---

# Text Adventure Agent

## Overview

This agent plays Z-machine text adventure games (Zork, Lost Pig, etc.) via an MCP server built with FastMCP. The core design idea is to give the agent the best possible context at every step — not just the raw observation, but structured room history, graph memory, and synthesized narratives — so the underlying LLM can reason more reliably about what to do next.


## Architecture

### MCP Server (`mcp_server.py`)

The server wraps Jericho/Frotz and exposes four active tools:

| Tool | Purpose |
| --- | --- |
| `play_action` | Execute a game command and return the result with score/reward info |
| `room_state` | Return a rich volatile state block (see below) |
| `graph_memory_context` | Short semantic+episodic memory summary |
| `graph_memory_search` | Query memory for a specific item or room ("where is lamp") |

The key design decision was to make `room_state` the agent's primary context source. It aggregates everything relevant to the current step into a structured block: room identity, action history, valid actions, world objects, explored map, inventory, and the AriGraph belief state. This spares the agent from calling multiple tools to piece together the same information.

**Valid actions caching:** Jericho's `get_valid_actions()` is expensive — it runs the Z-machine on every candidate command to check validity. The server avoids calling it every step by caching the last result and refreshing only when the room actually changes (tracked via `get_exact_location()` before and after each `play_action`). This gave a significant speed improvement over the baseline.
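The caching scheme can be sketched as follows. This is a minimal illustration assuming a Jericho-style `env` object; the class and attribute names are hypothetical, not the server's actual code:

```python
# Illustrative sketch of room-keyed valid-action caching (names are hypothetical).
class ValidActionsCache:
    def __init__(self, env):
        self.env = env
        self._room = None
        self._actions = []

    def get(self):
        room = self.env.get_exact_location()  # cheap location probe
        if room != self._room:                # refresh only on room change
            self._actions = self.env.get_valid_actions()  # expensive Z-machine sweep
            self._room = room
        return self._actions
```

The point of the design is that the expensive call happens once per room visit rather than once per step.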

**Rich room history:** Each action taken in a room is stored with its full consequence: score before/after, reward, items gained/removed, and truncated result text. The bucket keeps the last 40 actions per room (up from the baseline 10), and the `room_state` output always includes a condensed `ALL_TRIED_IN_THIS_ROOM` line listing every action name ever attempted in the current room. This is critical for avoiding loops — the LLM can see at a glance that it has already asked the gnome about 30 different topics.
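A minimal sketch of this bookkeeping, with illustrative names (the actual fields in `mcp_server.py` may differ):

```python
from collections import defaultdict, deque

ROOM_HISTORY_LIMIT = 40  # last 40 actions kept per room, up from the baseline 10

room_history = defaultdict(lambda: deque(maxlen=ROOM_HISTORY_LIMIT))
tried_in_room = defaultdict(set)  # backs the ALL_TRIED_IN_THIS_ROOM line

def record_action(room, action, score_before, score_after, result_text):
    # Store each action with its full consequence for this room
    room_history[room].append({
        "action": action,
        "score_before": score_before,
        "score_after": score_after,
        "reward": score_after - score_before,
        "result": result_text[:200],  # truncated result text
    })
    tried_in_room[room].add(action)

def all_tried_line(room):
    # Condensed line listing every action name ever attempted in this room
    return "ALL_TRIED_IN_THIS_ROOM: " + ", ".join(sorted(tried_in_room[room]))
```

Note that `tried_in_room` is unbounded on purpose: the deque forgets old consequences, but the condensed line still lists every action name ever tried.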


Agent (agent.py) β€” REflAct Loop

Inspired from ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection https://arxiv.org/abs/2505.15182v2

The agent follows a REflAct (Reflect then Act) pattern rather than plain ReAct. At every step it generates:

```
REFLECTION: <brief belief state + goal progress + what to do next>
OBJECTIVE: <current short-term goal>
TOOL: <tool_name>
ARGS: <JSON arguments>
```

The OBJECTIVE line persists across steps through a `current_objective` field on the agent. When the LLM produces a new objective, it replaces the old one and stays visible in every subsequent prompt. This gives the agent a form of working memory for its goal — without it, the LLM tends to forget what it was trying to do when the context gets busy.
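One way this persistence could look, sketched with hypothetical names and assuming the four-line ReflAct output format:

```python
import re

class ObjectiveTracker:
    """Illustrative sketch: keep the last OBJECTIVE line across steps."""

    def __init__(self):
        self.current_objective = None

    def update(self, llm_output):
        # A new OBJECTIVE line replaces the old one; otherwise the old
        # objective survives and stays visible in the next prompt.
        match = re.search(r"^OBJECTIVE:\s*(.+)$", llm_output, re.MULTILINE)
        if match:
            self.current_objective = match.group(1).strip()
        return self.current_objective
```

The key behaviour is in the no-match branch: a step that produces no OBJECTIVE line does not erase the goal.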


## Multi-Agent Architecture

Four sub-agents run alongside the main action loop, each with a different trigger condition:

**1. Narrative Synthesis Agent (every 5 steps or on score change).** An LLM call that reads the last 30 history entries and writes a 4–6 sentence factual summary covering: current location, items found/used, obstacles encountered, what succeeded and what failed, and the next logical step. This is more useful than the raw history because the LLM summarizes patterns — e.g., "you have tried 12 gnome conversation topics and none yielded items" — rather than just repeating individual action results. `max_tokens=250` gives it room to be specific.
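The trigger condition is simple enough to sketch (illustrative names, not the agent's actual function):

```python
def should_synthesize(step, score, last_score, last_synth_step, interval=5):
    # Re-run narrative synthesis every `interval` steps, or immediately
    # whenever the score changes (a score change is always worth summarizing).
    return score != last_score or step - last_synth_step >= interval
```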

**2. Action Hints Agent (on room change only).** An LLM call that receives the current valid actions and the list of already-tried actions in this room. It returns 3–5 priority actions to try next. Caching it on room change rather than every step keeps the token cost low — the hints only become stale when the room context changes.

**3. Critic Agent (every step, no LLM call).** A fast heuristic evaluator that runs before each `play_action`. It returns a score in [−1.0, +1.0] based on:

  • Is this action in the valid actions list? β†’ +1.0
  • Has it been tried once in this room with no reward? β†’ βˆ’0.5
  • Has it been tried twice or more with no reward? β†’ βˆ’1.0
  • Was a movement direction blocked here before? β†’ βˆ’1.0
  • Has it previously earned reward? β†’ +0.5

When the critic gives −1.0, it substitutes the proposed action with the best untried valid action (or an untried cardinal direction as fallback). No LLM call is required, so it runs every step for free. In practice this prevents the agent from indefinitely repeating a blocked direction or an already-failed object interaction.
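The rules above could be combined like this. This is a sketch, not the actual implementation: the priority order when rules overlap is an assumption, and all names are illustrative:

```python
def critic_score(action, valid_actions, tried_no_reward, blocked_dirs, rewarded):
    # Heuristic critic: no LLM call, runs before every play_action.
    if action in rewarded:                     # previously earned reward
        return 0.5
    if action in blocked_dirs:                 # movement blocked here before
        return -1.0
    if tried_no_reward.get(action, 0) >= 2:    # tried twice+ with no reward
        return -1.0
    if tried_no_reward.get(action, 0) == 1:    # tried once with no reward
        return -0.5
    if action in valid_actions:                # fresh, known-valid action
        return 1.0
    return 0.0

def substitute(action, score, valid_actions, tried):
    # On a hard rejection (-1.0), fall back to the best untried valid action.
    if score <= -1.0:
        untried = [a for a in valid_actions if a not in tried]
        if untried:
            return untried[0]
    return action
```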

**4. NPC Puzzle-State Tracker (on NPC conversation stall).** An LLM call triggered when ≥5 conversation-style actions (ask, tell, say, answer, speak, talk) have been tried in the current room with no score change for ≥10 turns. It receives the last 10 actions taken in the room with their full NPC responses and reasons about what the NPC is still waiting for. It identifies exhausted topics, infers the NPC's current state from the pattern of responses, and outputs a structured hint in the form `NEEDS: <what NPC wants> | NEXT: <specific action to try>`. It fires at most every 5 steps to limit token cost, and re-triggers on room changes to reset state tracking. The output is injected into the prompt as NPC_PUZZLE_ANALYSIS, giving the main agent explicit guidance on which untried conversation actions are most likely to progress the puzzle.
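The stall trigger itself needs no LLM; a sketch using the thresholds from the description above (function and parameter names are illustrative):

```python
CONVO_VERBS = ("ask", "tell", "say", "answer", "speak", "talk")

def npc_stall(room_actions, turns_since_score, min_convo=5, min_stall=10):
    # Fire when enough conversation-style actions have been tried in this
    # room while the score has been flat for a long stretch of turns.
    convo = sum(1 for a in room_actions if a.split()[0] in CONVO_VERBS)
    return convo >= min_convo and turns_since_score >= min_stall
```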


### AriGraph Memory (`ari_graph_memory.py`)

Inspired by AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents (https://arxiv.org/abs/2407.04363).

A local, deterministic, graph-based memory backed by Kuzu, an embedded graph database with no network dependency. The design goals were: no LLM calls required for ingestion, fast enough to run synchronously after every game action, and able to answer questions like "where is the lamp" and "which exits this room has been found to have."

Schema:

  • SemanticEntity nodes: rooms, items, the player, directions
  • EpisodicTurn nodes: one per game action, storing action + observation
  • SemanticRel edges: at (player location), has (inventory), located_in (item in room), exit_DIRECTION (room connections), tried_DIRECTION (attempted but unknown directions)
  • Mentions edges: link episodic turns to the entities they refer to

**Temporal validity:** Every semantic relation carries `valid_from` (turn number) and `valid_to` (−1 meaning "still valid"). When the player moves, the old `at` edge is invalidated and a new one is created. When an item is picked up, its `located_in` edge is invalidated. This gives the agent correct, up-to-date beliefs without needing to replay all history.
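The `valid_from`/`valid_to` bookkeeping can be illustrated without Kuzu. This pure-Python sketch shows the invalidate-then-create pattern; the real store issues equivalent graph updates:

```python
STILL_VALID = -1  # valid_to sentinel: edge is currently true

class EdgeStore:
    """Sketch of temporally-scoped edges (the real store is Kuzu-backed)."""

    def __init__(self):
        self.edges = []  # tuples of (src, rel, dst, valid_from, valid_to)

    def assert_rel(self, src, rel, dst, turn):
        # Close any currently-valid edge of the same type from src,
        # then open a new edge valid from this turn onward.
        self.edges = [
            (s, r, d, f, turn if (s, r) == (src, rel) and t == STILL_VALID else t)
            for (s, r, d, f, t) in self.edges
        ]
        self.edges.append((src, rel, dst, turn, STILL_VALID))

    def active(self, src, rel):
        # Current beliefs: only edges whose validity is still open.
        return [d for (s, r, d, f, t) in self.edges
                if (s, r) == (src, rel) and t == STILL_VALID]
```

Moving the player is then a single `assert_rel("player", "at", new_room, turn)`: the old location edge keeps its history but stops being an active belief.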

Search patterns handled:

  • "Where am I" β†’ looks up active at edge for player
  • "What do I have" β†’ lists active has edges
  • "Exits from [room]" β†’ lists active exit_* edges from that room
  • "Unexplored exits from [room]" β†’ returns directions not yet in exit_* or tried_*
  • "Where is [item]" β†’ looks up active located_in edge for that item

The AriGraph is always active alongside the optional Graphiti backend. Its `context_summary()` is embedded directly in `room_state` output as the ARIGRAPH_BELIEF_STATE block, giving the agent a concise belief statement every step at zero extra network cost.


## Loop and Stuck Detection

Several independent mechanisms guard against repetitive behaviour:

  • Hard loop: If the last 3 game actions are identical, force look.
  • Frequency loop: If any action appears 4+ times in the last 10, add a visible WARNING to the prompt.
  • Consecutive non-action guard: If the agent calls memory/info tools 3 times in a row without taking a game action, force play_action {"action": "look"}.
  • Stuck warning (escalating): Track turns since last score change. At 10 turns: note suggesting exploration. At 20: urgent warning. At 30: critical alert with specific suggestions (try unexplored exits, open containers, use lamp).
  • ALL_TRIED line: Every step the full list of actions ever taken in the current room is visible at the top of VOLATILE_STATE. This is the main guard against the agent re-asking about things in rooms with many interactions (e.g., asking a character about the same topic repeatedly).
  • Z-machine state hash: After every action, hashlib.md5(env.get_state()) is recorded. If the current machine state has been seen before, room_state emits STATE_REVISIT_COUNT: N. The critic tightens rejection thresholds at count β‰₯ 3, and the LLM prompt receives a [Z-MACHINE LOOP DETECTED] warning urging it to try something genuinely new. This catches cross-room loops and NPC conversation cycles that room-local tracking alone cannot detect.

## Design Decisions and Trade-offs

**Why AriGraph instead of a simple list?** A graph structure makes "where is X" and "exits from room Y" queries natural and efficient. The Kuzu embedded database means no network dependency and deterministic, fast ingestion — important when ingestion runs after every single game action.

**Why keep narrative synthesis separate from PERMANENT_NARRATIVE?** PERMANENT_NARRATIVE (last 10 turns verbatim) gives the LLM the raw recent facts. NARRATIVE_SYNTHESIS gives a higher-level interpretation. Both are useful: the raw history has exact item names and results; the synthesis identifies patterns and failures across a longer window (up to 30 turns).

**Critic design: heuristic core with LLM escalation.** The per-step critic runs without an API call for common cases (blocked directions, repeated no-reward actions), keeping the per-step cost flat. When the Z-machine state hash signals a confirmed loop (`STATE_REVISIT_COUNT >= 3`), the critic tightens its rejection threshold and the `[Z-MACHINE LOOP DETECTED]` warning is injected into the LLM prompt. The LLM then reasons explicitly about breaking the cycle — combining the zero-cost heuristic for routine rejection with LLM reasoning for the harder cases where simple pattern-matching is not sufficient.

**Handling deep NPC puzzles.** The agent addresses the Lost Pig gnome problem through layered signals: ALL_TRIED_IN_THIS_ROOM gives the LLM a full inventory of every conversation topic already attempted; the Narrative Synthesis Agent identifies patterns across the last 30 turns (e.g., "12 gnome topics tried, none scored") and synthesises the next logical step; the Action Hints Agent filters the valid-action list down to untried priority actions on each room entry; the NPC Puzzle-State Tracker fires when a conversation stall is detected (≥5 ask/tell actions, no score change for ≥10 turns) and produces a structured `NEEDS: … | NEXT: …` directive by reasoning over the NPC's past responses; and the Z-machine state hash detects when the gnome conversation has cycled back to an identical interpreter state, triggering an explicit loop-break prompt. Together these signals give the LLM the information it needs to reason about NPC state rather than blindly retrying exhausted options.


## Files

| File | Description |
| --- | --- |
| `agent.py` | ReflAct agent with the StudentAgent class, sub-agents, loop/stuck detection, critic |
| `mcp_server.py` | FastMCP server: play_action, room_state, graph_memory_context, graph_memory_search |
| `ari_graph_memory.py` | Kuzu-backed semantic+episodic memory (deterministic, no LLM required) |
| `graphiti_game_memory.py` | Optional Graphiti-backed memory with HuggingFace embeddings |
| `games/zork_env.py` | Jericho/Frotz wrapper environment |
| `app.py` | Gradio interface for the HF Space |

## Local Testing

```bash
# Install dependencies
pip install -r requirements.txt

# Test the agent locally (runs mcp_server.py automatically via FastMCP)
python agent.py

# Run evaluation (from the Agentic-zork parent directory)
uv run evaluation/evaluate.py -s ../zorkclaw -t 5 --max-steps 100 -v
```