---
title: Text Adventure Agent Submission
emoji: πΊ
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: mit
---
# Text Adventure Agent

## Overview
This agent plays Z-machine text adventure games (Zork, Lost Pig, etc.) via an MCP server built with FastMCP. The core design idea is to give the agent the best possible context at every step: not just the raw observation, but structured room history, graph memory, and synthesized narratives, so the underlying LLM can reason more reliably about what to do next.
## Architecture

### MCP Server (`mcp_server.py`)
The server wraps Jericho/Frotz and exposes four active tools:
| Tool | Purpose |
|---|---|
| `play_action` | Execute a game command and return the result with score/reward info |
| `room_state` | Return a rich volatile state block (see below) |
| `graph_memory_context` | Short semantic + episodic memory summary |
| `graph_memory_search` | Query memory for a specific item or room ("where is lamp") |
The key design decision was to make `room_state` the agent's primary context source. It aggregates everything relevant to the current step into a structured block: room identity, action history, valid actions, world objects, explored map, inventory, and the AriGraph belief state. This avoids the agent having to call multiple tools to piece together the same information.
**Valid actions caching:** Jericho's `get_valid_actions()` is expensive: it calls the Z-machine for every possible command to check validity. The server avoids calling it every step by caching the last result and only refreshing when the room actually changes (tracked via `get_exact_location()` before and after each `play_action`). This gave a significant speed improvement over the baseline.
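The caching scheme described above can be sketched as follows. `FrotzEnv`'s `get_valid_actions()` and `get_exact_location()` are real Jericho methods; the wrapper class itself is illustrative, not the server's actual code.

```python
class ValidActionsCache:
    """Refresh the expensive valid-actions list only when the room changes."""

    def __init__(self, env):
        self.env = env          # a Jericho FrotzEnv (or compatible wrapper)
        self._room = None       # last known location identifier
        self._actions = []      # cached valid actions for that location

    def get(self):
        room = self.env.get_exact_location()
        if room != self._room:
            # Room changed: the cache is stale, pay the Z-machine cost once
            self._actions = self.env.get_valid_actions()
            self._room = room
        return self._actions
```

Because `get()` only hits the Z-machine on a location change, repeated calls within the same room are effectively free.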
**Rich room history:** Each action taken in a room is stored with its full consequence: score before/after, reward, items gained/removed, and truncated result text. The bucket keeps the last 40 actions per room (up from the baseline 10), and the `room_state` output always includes an `ALL_TRIED_IN_THIS_ROOM` condensed line listing every action name ever attempted in the current room. This is critical for avoiding loops: the LLM can see at a glance that it has already asked the gnome about 30 different topics.
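A minimal sketch of the per-room bucket described above. The field names and truncation length are assumptions; the key points are the bounded `deque(maxlen=40)` per room and the unbounded set backing the `ALL_TRIED_IN_THIS_ROOM` line.

```python
from collections import defaultdict, deque

class RoomHistory:
    def __init__(self, per_room=40):
        # Bounded consequence log: only the last 40 actions per room survive
        self.buckets = defaultdict(lambda: deque(maxlen=per_room))
        # Unbounded: every action name ever tried in the room
        self.all_tried = defaultdict(set)

    def record(self, room, action, score_before, score_after, result):
        self.buckets[room].append({
            "action": action,
            "reward": score_after - score_before,
            "result": result[:200],   # truncated result text
        })
        self.all_tried[room].add(action)

    def all_tried_line(self, room):
        """Condensed line emitted in the room_state block."""
        return "ALL_TRIED_IN_THIS_ROOM: " + ", ".join(sorted(self.all_tried[room]))
```

Keeping the tried-action set separate from the bounded deque means the loop guard still works even after old consequence entries have been evicted.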
### Agent (`agent.py`): ReflAct Loop

Inspired by [ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection](https://arxiv.org/abs/2505.15182v2).

The agent follows a ReflAct (reflect-then-act) pattern rather than plain ReAct. At every step it generates:
```
REFLECTION: <brief belief state + goal progress + what to do next>
OBJECTIVE: <current short-term goal>
TOOL: <tool_name>
ARGS: <JSON arguments>
```
The OBJECTIVE line persists across steps through a `current_objective` field on the agent. When the LLM produces a new objective, it replaces the old one and stays visible in every subsequent prompt. This gives the agent a form of working memory for its goal; without it, the LLM tends to forget what it was trying to do when the context gets busy.
### Multi-Agent Architecture
Four sub-agents run alongside the main action loop, each with a different trigger condition:
1. **Narrative Synthesis Agent** (every 5 steps or on score change). An LLM call that reads the last 30 history entries and writes a 4–6 sentence factual summary covering: current location, items found/used, obstacles encountered, what succeeded and what failed, and the next logical step. This is more useful than the raw history because the LLM summarizes patterns (e.g., "you have tried 12 gnome conversation topics and none yielded items") rather than just repeating individual action results. `max_tokens=250` gives it room to be specific.
2. **Action Hints Agent** (on room change only). An LLM call that receives the current valid actions and the list of already-tried actions in this room. It returns 3–5 priority actions to try next. Caching it on room change rather than every step keeps the token cost low; the hints only become stale when the room context changes.
3. **Critic Agent** (every step, no LLM call). A fast heuristic evaluator that runs before each `play_action`. It returns a score in [−1.0, +1.0] based on:
   - Is this action in the valid actions list? → +1.0
   - Has it been tried once in this room with no reward? → −0.5
   - Has it been tried twice or more with no reward? → −1.0
   - Was a movement direction blocked here before? → −1.0
   - Has it previously earned reward? → +0.5

   When the critic gives −1.0, it substitutes the proposed action with the best untried valid action (or an untried cardinal direction as fallback). No LLM call is required, so it runs every step for free. In practice this prevents the agent from repeating a blocked direction or an already-failed object interaction indefinitely.
4. **NPC Puzzle-State Tracker** (on NPC conversation stall). An LLM call triggered when ≥5 conversation-style actions (`ask`, `tell`, `say`, `answer`, `speak`, `talk`) have been tried in the current room with no score change for ≥10 turns. It receives the last 10 actions taken in the room with their full NPC responses and reasons about what the NPC is still waiting for. It identifies exhausted topics, infers the NPC's current state from the pattern of responses, and outputs a structured hint in the form `NEEDS: <what NPC wants> | NEXT: <specific action to try>`. It fires at most every 5 steps to limit token cost, and is re-triggered on room changes to reset state tracking. The output is injected into the prompt as `NPC_PUZZLE_ANALYSIS`, giving the main agent explicit guidance on which untried conversation actions are most likely to progress the puzzle.
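The critic's scoring rules can be sketched as below. The rule precedence when several conditions overlap, and the shape of the tracked state (a per-action no-reward counter, a blocked-direction set, a rewarded set), are my assumptions; the README lists the rules but not their combination order.

```python
def critic_score(action, valid_actions, tried_no_reward, blocked_dirs, rewarded):
    """Heuristic score in [-1.0, +1.0]; no LLM call. Precedence is assumed."""
    if action in blocked_dirs:
        return -1.0                 # movement blocked here before
    tries = tried_no_reward.get(action, 0)
    if tries >= 2:
        return -1.0                 # tried twice or more, no reward
    if tries == 1:
        return -0.5                 # tried once, no reward
    if action in rewarded:
        return 0.5                  # previously earned reward here
    if action in valid_actions:
        return 1.0                  # fresh action from the valid list
    return 0.0                      # unknown action, neutral

def pick_fallback(valid_actions, tried, dirs=("north", "south", "east", "west")):
    """Substitute for a -1.0 action: best untried valid action,
    else an untried cardinal direction."""
    for a in valid_actions:
        if a not in tried:
            return a
    for d in dirs:
        if d not in tried:
            return d
    return "look"
```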
### AriGraph Memory (`ari_graph_memory.py`)

Inspired by [AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents](https://arxiv.org/abs/2407.04363).
A local, deterministic, graph-based memory backed by Kuzu, an embedded graph database with no network dependency. The design goals were: no LLM calls required for ingestion, fast enough to run synchronously after every game action, and able to answer questions like "where is the lamp" and "what exits has this room been found to have."
Schema:
- `SemanticEntity` nodes: rooms, items, the player, directions
- `EpisodicTurn` nodes: one per game action, storing action + observation
- `SemanticRel` edges: `at` (player location), `has` (inventory), `located_in` (item in room), `exit_DIRECTION` (room connections), `tried_DIRECTION` (attempted but unknown directions)
- `Mentions` edges: link episodic turns to the entities they refer to
**Temporal validity:** Every semantic relation carries `valid_from` (turn number) and `valid_to` (−1 meaning "still valid"). When the player moves, the old `at` edge is invalidated and a new one is created. When an item is picked up, its `located_in` edge is invalidated. This gives the agent correct, up-to-date beliefs without needing to replay all history.
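The valid_from/valid_to scheme can be illustrated with a minimal in-memory version. The real implementation stores these as edge properties in Kuzu; this sketch only shows the invalidation logic.

```python
class TemporalEdges:
    """Toy edge store with valid_from/valid_to temporal validity."""

    def __init__(self):
        # Each edge: (src, rel, dst, valid_from, valid_to); -1 = still valid
        self.edges = []

    def assert_rel(self, src, rel, dst, turn):
        # Close any currently-valid edge of the same kind, then open a new one
        for i, (s, r, d, vf, vt) in enumerate(self.edges):
            if s == src and r == rel and vt == -1:
                self.edges[i] = (s, r, d, vf, turn)
        self.edges.append((src, rel, dst, turn, -1))

    def current(self, src, rel):
        """Answer "where is X now" without replaying history."""
        for s, r, d, vf, vt in self.edges:
            if s == src and r == rel and vt == -1:
                return d
        return None
```

Moving the player simply asserts a new `at` relation: the old belief is closed, and `current()` always reflects the latest state.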
Search patterns handled:
- "Where am I" → looks up the active `at` edge for the player
- "What do I have" → lists active `has` edges
- "Exits from [room]" → lists active `exit_*` edges from that room
- "Unexplored exits from [room]" → returns directions not yet in `exit_*` or `tried_*`
- "Where is [item]" → looks up the active `located_in` edge for that item
The AriGraph is always active alongside the optional Graphiti backend. Its `context_summary()` is embedded directly in the `room_state` output as the `ARIGRAPH_BELIEF_STATE` block, giving the agent a concise belief statement every step at zero extra network cost.
### Loop and Stuck Detection
Several independent mechanisms guard against repetitive behaviour:
- **Hard loop:** If the last 3 game actions are identical, force `look`.
- **Frequency loop:** If any action appears 4+ times in the last 10, add a visible WARNING to the prompt.
- **Consecutive non-action guard:** If the agent calls memory/info tools 3 times in a row without taking a game action, force `play_action {"action": "look"}`.
- **Stuck warning (escalating):** Track turns since the last score change. At 10 turns: a note suggesting exploration. At 20: an urgent warning. At 30: a critical alert with specific suggestions (try unexplored exits, open containers, use lamp).
- **ALL_TRIED line:** Every step, the full list of actions ever taken in the current room is visible at the top of `VOLATILE_STATE`. This is the main guard against re-asking about things in rooms with many interactions (e.g., asking a character about the same topic repeatedly).
- **Z-machine state hash:** After every action, `hashlib.md5(env.get_state())` is recorded. If the current machine state has been seen before, `room_state` emits `STATE_REVISIT_COUNT: N`. The critic tightens rejection thresholds at count ≥ 3, and the LLM prompt receives a `[Z-MACHINE LOOP DETECTED]` warning urging it to try something genuinely new. This catches cross-room loops and NPC conversation cycles that room-local tracking alone cannot detect.
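The state-hash detector can be sketched as follows. Jericho's `env.get_state()` returns the serialized interpreter state; the tracker class and the exact warning text are illustrative.

```python
import hashlib
from collections import Counter

class StateRevisitTracker:
    """Count revisits to identical Z-machine states via an MD5 digest."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, state_bytes):
        """Record one serialized state; return (visit count, prompt lines)."""
        h = hashlib.md5(bytes(state_bytes)).hexdigest()
        self.counts[h] += 1
        n = self.counts[h]
        lines = []
        if n > 1:
            lines.append(f"STATE_REVISIT_COUNT: {n}")
        if n >= 3:
            # Confirmed loop: critic tightens thresholds, prompt gets a warning
            lines.append("[Z-MACHINE LOOP DETECTED] try something genuinely new")
        return n, lines
```

Because the hash covers the full interpreter state, two visits to the same room with different inventory or NPC state hash differently, so only true loops trigger the warning.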
## Design Decisions and Trade-offs
**Why AriGraph instead of a simple list?** A graph structure makes "where is X" and "exits from room Y" queries natural and efficient. The Kuzu embedded database means no network dependency and deterministic, fast ingestion, which matters when ingestion runs after every single game action.
**Why keep narrative synthesis separate from PERMANENT_NARRATIVE?** `PERMANENT_NARRATIVE` (last 10 turns verbatim) gives the LLM the raw recent facts. `NARRATIVE_SYNTHESIS` gives a higher-level interpretation. Both are useful: the raw history has exact item names and results; the synthesis identifies patterns and failures across a longer window (up to 30 turns).
**Critic design: heuristic core with LLM escalation.** The per-step critic runs without an API call for common cases (blocked directions, repeated no-reward actions), keeping the per-step cost flat. When the Z-machine state hash signals a confirmed loop (`STATE_REVISIT_COUNT >= 3`), the critic tightens its rejection threshold and the `[Z-MACHINE LOOP DETECTED]` warning is injected into the LLM prompt. The LLM then reasons explicitly about breaking the cycle, combining the zero-cost heuristic for routine rejection with LLM reasoning for the harder cases where simple pattern matching is not sufficient.
**Handling deep NPC puzzles.** The agent addresses the Lost Pig gnome problem through layered signals: `ALL_TRIED_IN_THIS_ROOM` gives the LLM a full inventory of every conversation topic already attempted; the Narrative Synthesis Agent identifies patterns across the last 30 turns (e.g., "12 gnome topics tried, none scored") and synthesises the next logical step; the Action Hints Agent filters the valid-action list down to untried priority actions on each room entry; the NPC Puzzle-State Tracker fires when a conversation stall is detected (≥5 ask/tell actions, no score change for ≥10 turns) and produces a structured `NEEDS: … | NEXT: …` directive by reasoning over the NPC's past responses; and the Z-machine state hash detects when the gnome conversation has cycled back to an identical interpreter state, triggering an explicit loop-break prompt. Together these signals give the LLM the information it needs to reason about NPC state rather than blindly retrying exhausted options.
## Files

| File | Description |
|---|---|
| `agent.py` | ReflAct agent with the `StudentAgent` class, sub-agents, loop/stuck detection, critic |
| `mcp_server.py` | FastMCP server: `play_action`, `room_state`, `graph_memory_context`, `graph_memory_search` |
| `ari_graph_memory.py` | Kuzu-backed semantic + episodic memory (deterministic, no LLM required) |
| `graphiti_game_memory.py` | Optional Graphiti-backed memory with HuggingFace embeddings |
| `games/zork_env.py` | Jericho/Frotz wrapper environment |
| `app.py` | Gradio interface for the HF Space |
## Local Testing

```bash
# Install dependencies
pip install -r requirements.txt

# Test the agent locally (runs mcp_server.py automatically via FastMCP)
python agent.py

# Run evaluation (from the Agentic-zork parent directory)
uv run evaluation/evaluate.py -s ../zorkclaw -t 5 --max-steps 100 -v
```