Building a Reliable MCP Agent for Zork-Style Text Adventures
Text adventures sound trivial: you read a paragraph, type a command, get a new paragraph.
But once you put an LLM in that loop, you learn quickly that the hardest enemies aren’t in the dungeon—they’re in the interface.
What kills most LLM agents in Zork-like games is a predictable set of failure modes:
- Parser brittleness: the game rejects slightly-wrong phrasing.
- Looping: the model repeats actions, rooms, or “no-op” moves.
- Move budget waste: doing “admin” actions that consume moves.
- Prompt bloat: raw history gets too long and too noisy.
- Goal drift: the model forgets what it was trying to do.
Several of these challenges are also discussed in the paper "TextQuests: How Good are LLMs at Text-Based Video Games?" (https://arxiv.org/pdf/2507.23701), notably memory, coherence, and planning.
So we didn’t build “a prompt.” We built a system with two main components:
- an MCP server that exposes the game through robust tools and instrumentation
- and an agent that treats the LLM as one component among others (memory, planning, recovery policies)
Our focus was on the previous failure modes, and how to design around them with tools and guardrails.
This is a high-level tour of the approach, focusing on the big ideas, without getting into implementation details.
The code is available in the HuggingFace space: https://huggingface.co/spaces/LLM-course/Text-game-agent-EILLES
The Setup: Two Pieces, One Loop
1) mcp_server.py — the game adapter + instrumentation layer
The MCP server acts like the game interface for the agent. It:
- owns the environment (TextAdventureEnv)
- runs commands (play_action)
- tracks exploration metadata (rooms, transitions, tried actions)
- exposes tools that help reasoning without spending moves
- provides safety mechanisms like checkpoints and action simulation
2) agent.py — the policy engine + ReAct decision-maker
The agent:
- outputs strict ReAct steps (THOUGHT -> TOOL -> ARGS)
- can only interact via MCP tools (never “talks to the game” directly)
- uses guardrails to keep the LLM from hallucinating tools/commands, looping, spamming, etc.
- uses two additional LLM calls as specialized modules:
- memory compression (long-term, high-signal memory)
- objective planning (goal updates + suggested next actions)
We treat the LLM as a reasoning module, not as a reliable system controller. All safety, consistency, and state tracking are enforced outside the model.
Why “Tooling” is important: The MCP Server as a Game Interface
A Zork parser is not a friendly API. If the model invents commands like “look around carefully”, the game will often respond with something like:
“That sentence isn’t one I recognize.”
If you only expose play_action, the agent becomes a guessing machine.
So the MCP server provides a richer interface that makes the world “legible”:
- Structured state (score, moves, inventory, room, “done”, a stable hash)
- Inventory without spending a move
- Valid actions (best-effort list) for recovery
- A map/graph of explored rooms and transitions
- Actions tried per room to avoid repeating
- Checkpoints to rollback after loops or risky moves
- Action probing (simulate before committing)
This set of tools is what turns the text game into something the agent can navigate reliably.
Part 1 — The MCP Server: Turning a Game into a Usable API
The Server’s Core Idea: Track More Than the Game Tracks
The environment gives you:
- observation text
- score/moves (usually)
- maybe inventory (depending on wrapper)
But it doesn’t give you the extra structure an agent needs to be efficient:
- Where have I been?
- What did I already try here?
- How do rooms connect?
- Am I stuck in a loop?
So the server maintains that meta-state itself:
- a short history of actions and results
- a set of locations (rooms) discovered
- a transition graph (room --action--> room)
- an index of actions tried per location
- checkpoint snapshots for rollback
- a stable-ish state hash used to detect loops
This is not just logging. It becomes actionable tool output the agent can rely on.
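As a minimal sketch (class and field names here are illustrative, not the actual server code), this meta-state can be a handful of dictionaries keyed by room, plus a stable-ish hash for loop detection:

```python
import hashlib
from collections import defaultdict

class MetaState:
    """Exploration metadata the environment itself doesn't track (sketch)."""
    def __init__(self):
        self.history = []               # (action, observation head) pairs, bounded
        self.locations = set()          # rooms discovered so far
        self.transitions = {}           # (room, action) -> destination room
        self.tried = defaultdict(set)   # room -> actions attempted there

    def record(self, room, action, new_room, observation):
        # Keep only a short rolling window of history to bound prompt size.
        self.history = (self.history + [(action, observation[:80])])[-50:]
        self.locations.update({room, new_room})
        self.transitions[(room, action)] = new_room
        self.tried[room].add(action)

    @staticmethod
    def state_hash(observation, inventory, room):
        """Stable-ish fingerprint of the current state, used to detect loops."""
        key = f"{room}|{sorted(inventory)}|{observation.strip()}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]
```

Because the hash depends only on room, inventory, and observation text, revisiting the same situation yields the same fingerprint, which is exactly what loop detection needs.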
Room Awareness: The Small Heuristic That Makes Everything Work
Most downstream reasoning depends on “what room am I in?”
The server uses a heuristic to extract the room title from the observation:
- pick the first plausible “header-like” line
- ignore copyright/revision boilerplate
- ignore long narrative sentences
This matters because room identity powers:
- mapping
- “tried actions” grouping
- loop detection context
- objective tracking (“return to grating”, “open mailbox”, etc.)
If you don’t have stable room identity, the agent’s memory becomes confused.
The Minimal but Critical Tools
play_action(action)
The main interaction tool:
- runs the command
- returns the observation
- appends optional “+points” signals and “GAME OVER”
- never crashes the tool (so the run doesn’t die on edge cases)
This tool is deliberately boring—but highly reliable.
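A sketch of that "boring but reliable" shape, assuming a step-style environment interface (the `env.step` signature here is an assumption, not the actual wrapper API):

```python
def play_action(env, action: str) -> str:
    """Run one game command and return an annotated observation.
    Never raises: a tool that crashes kills the whole run."""
    try:
        obs, score_delta, done = env.step(action)  # assumed env interface
    except Exception as exc:
        return f"[tool error: {exc!r}] Try a simpler command."
    if score_delta > 0:
        obs += f"\n[+{score_delta} points]"   # make scoring gains explicit
    if done:
        obs += "\nGAME OVER"                  # make termination explicit
    return obs
```

Surfacing score deltas and termination directly in the text means the LLM never has to infer them from prose.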
inventory()
A huge move-saver: it returns inventory without advancing the game.
In text adventures, calling inventory as a game command costs a move in many setups, so treating inventory as a tool query is a big advantage.
memory()
A compact summary tool that provides “authoritative state”:
- location
- score/moves
- recent action heads
- last observation
It’s a sanity anchor when the agent gets confused.
valid_actions()
A helpful tool when stuck:
- tries to fetch the actual valid actions if the environment exposes them
- otherwise falls back to a canonical action menu
The agent uses it sparingly—only when stuck or after parser failures.
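The try-then-fall-back pattern can be sketched like this (the `get_valid_actions` hook and the canonical menu contents are assumptions for illustration):

```python
CANONICAL_ACTIONS = [
    "north", "south", "east", "west", "up", "down", "in", "out",
    "look", "open door", "take all", "examine room",
]

def valid_actions(env) -> list[str]:
    """Prefer the environment's own action list; fall back to a canonical menu."""
    getter = getattr(env, "get_valid_actions", None)  # assumed optional hook
    if callable(getter):
        try:
            actions = getter()
            if actions:
                return list(actions)
        except Exception:
            pass  # fall through to the canonical menu
    return list(CANONICAL_ACTIONS)
```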
tried_actions()
The anti-loop tool:
- returns actions already attempted in each room
- helps the agent choose new high-value actions instead of repeating open mailbox 10 times
get_map() and graph()
These expose exploration as:
- a human-readable map (for prompts)
- a structured JSON graph (for future logic/visualization)
Mapping gives the agent an explicit “where have I been?” memory that the LLM doesn’t have to hallucinate.
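Rendering the transition graph into a prompt-friendly map is straightforward; a sketch (assuming the `(room, action) -> room` dictionary described above):

```python
def render_map(transitions: dict[tuple[str, str], str]) -> str:
    """Turn the (room, action) -> room graph into a human-readable map
    suitable for dropping directly into a prompt."""
    lines = []
    for (src, action), dst in sorted(transitions.items()):
        lines.append(f"{src} --{action}--> {dst}")
    return "\n".join(lines) if lines else "(nothing explored yet)"
```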
Guardrail Tools That Make the System Feel "Serious"
Checkpoints (checkpoint_save, checkpoint_restore)
Checkpoints are a reliability hack with real impact:
- if the agent detects a loop or makes a catastrophic move, it can rollback
- we keep at least one “loop” checkpoint as a stable anchor
- we can also maintain a “best” checkpoint after scoring gains
This transforms the exploration strategy:
- you can take risks, because you can recover
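A checkpoint store can be sketched as named snapshots, assuming the environment exposes `get_state()`/`set_state()` (that interface is an assumption here, common in text-game wrappers):

```python
import copy

class CheckpointStore:
    """Named snapshots of environment state plus tracking metadata (sketch)."""
    def __init__(self, env):
        self.env = env
        self.slots = {}

    def save(self, name: str, meta: dict):
        # Snapshot both the game and the server's own tracking metadata,
        # so a restore doesn't leave the map/history out of sync.
        self.slots[name] = (self.env.get_state(), copy.deepcopy(meta))

    def restore(self, name: str) -> dict:
        state, meta = self.slots[name]
        self.env.set_state(state)
        return copy.deepcopy(meta)  # caller reinstates the tracking metadata
```

Keeping a "loop" slot and a "best" slot is then just two named saves.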
action_probe(action) — action simulation without commitment
This is one of the more original parts of the server.
The idea:
- save a snapshot
- perform the action
- record deltas (score, moves, hash, location changes)
- restore the snapshot
- restore tracking metadata too (so probing doesn’t poison history/map)
It returns a compact JSON “what would happen if…?” report.
This enables a strong behavior: evaluating candidate actions via simulation and rollback, without committing a move (when snapshot/restore succeeds).
We keep it cheap (probe only a couple of actions) but it’s an excellent tie-breaker when stuck.
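The snapshot-act-restore cycle can be sketched as follows (again assuming a `get_state()`/`set_state()`/`step()` environment interface, which is an assumption for illustration):

```python
import copy

def action_probe(env, meta: dict, action: str) -> dict:
    """'What would happen if?' report: run the action against a snapshot,
    record the deltas, then restore both game state and tracking metadata."""
    snapshot = env.get_state()
    meta_snapshot = copy.deepcopy(meta)
    obs, score_delta, done = env.step(action)   # assumed env interface
    report = {
        "action": action,
        "observation_head": obs[:120],
        "score_delta": score_delta,
        "done": done,
    }
    env.set_state(snapshot)      # roll the game back
    meta.clear()
    meta.update(meta_snapshot)   # probing must not poison history/map
    return report
```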
Part 2 — The Agent: ReAct, But Constrained and Safe
Strict ReAct as a Contract (Not a Style)
The agent uses a strict format:
- THOUGHT: one short sentence
- TOOL: one of the allowed tool names
- ARGS: valid JSON
That format is useful for stability:
- the agent becomes machine-parseable
- tool calls are consistent
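Because the format is a contract, parsing it is a small exercise in strictness; a sketch (the regex and rejection policy are illustrative, not the agent's exact code):

```python
import json
import re

STEP_RE = re.compile(
    r"THOUGHT:\s*(?P<thought>.+?)\s*"
    r"TOOL:\s*(?P<tool>\w+)\s*"
    r"ARGS:\s*(?P<args>\{.*\})",
    re.S,
)

def parse_react_step(text: str, allowed_tools: set[str]):
    """Parse one strict THOUGHT/TOOL/ARGS step; None means 'reject and retry'."""
    m = STEP_RE.search(text)
    if not m:
        return None                      # format violation
    tool = m.group("tool")
    if tool not in allowed_tools:
        return None                      # hallucinated tool
    try:
        args = json.loads(m.group("args"))
    except json.JSONDecodeError:
        return None                      # malformed ARGS
    return m.group("thought"), tool, args
```

Anything that fails to parse is rejected outright, and the agent asks for a fresh step rather than guessing what the model meant.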
Important Policy: Command Grammar Discipline
Text adventure parsers punish creativity.
So the agent enforces a tight grammar:
- movement is single-word: north, in, up, …
- interaction is short verb+noun: open mailbox, take lamp, …
- exotic multiword commands are allowed only if they appear exactly in valid_actions
That last rule is a big deal:
- it prevents the LLM from inventing fancy commands
- it converts “language” into “API calls”
- it makes the agent much more robust across seeds
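The whole grammar check fits in a few lines; a sketch (the verb list is illustrative, not the agent's exact allowlist):

```python
MOVEMENT = {"north", "south", "east", "west", "up", "down", "in", "out"}
VERBS = {"open", "take", "drop", "read", "push", "pull", "turn", "examine", "light"}

def is_allowed_command(cmd: str, valid_actions: set[str]) -> bool:
    """Command grammar discipline: single-word movement, short verb+noun,
    and anything longer only if it appears verbatim in valid_actions."""
    words = cmd.lower().split()
    if len(words) == 1:
        return words[0] in MOVEMENT
    if len(words) == 2:
        return words[0] in VERBS
    return cmd.lower() in valid_actions
```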
The Agent’s Guardrails: How We Stop Thrashing
Here are the big guardrail categories (conceptually, not line-by-line):
1) Tool validation
If the model requests an unknown tool:
- we don’t execute it
- we inject feedback listing allowed tools
- we force recovery behavior next
2) Parser failure detection
If the observation looks like a parser error (“I don’t know the word…”, “sentence isn’t recognized”):
- we switch into recovery mode
- we fetch valid actions (once)
- we force a simpler action selection
3) Anti-repeat behavior (local)
We track:
- the last action
- actions blocked in the current room
- actions tried in the current room
If the model repeats a no-progress action:
- we refuse it
- we force a new choice
4) Loop detection (global)
The agent uses the server’s state_hash:
- if the same hash repeats several times, we’re looping
Then we can:
- restore a checkpoint
- re-orient with look
- switch strategy
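The detector itself is tiny: count how often the same state hash appears in a recent window. A sketch (window size and threshold are illustrative values):

```python
from collections import deque

class LoopDetector:
    """Flag a loop when the same state hash shows up too often recently."""
    def __init__(self, window: int = 8, threshold: int = 3):
        self.recent = deque(maxlen=window)  # rolling window of state hashes
        self.threshold = threshold

    def observe(self, state_hash: str) -> bool:
        self.recent.append(state_hash)
        # Looping = the same fingerprint seen `threshold` times in the window.
        return self.recent.count(state_hash) >= self.threshold
```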
5) Movement bias (Zork-specific optimization)
When multiple movement options exist:
- “in / up / down” tend to unlock deeper progress
- cardinal directions tend to be broad exploration
So we bias toward in/up/down (especially after seeing them in valid actions).
It’s a small heuristic that often pays off.
Two Specialized LLM Modules: Memory and Planning
This is where the project becomes more than a typical ReAct agent.
Specialized module #1: Memory Compression (Long-Term Memory)
Raw history is short-term memory. It’s verbose, expensive, and noisy.
So we maintain a synthesized memory JSON, updated periodically by an LLM whose only job is to compress experience into decision-useful facts:
- durable facts learned
- obstacles + what is needed
- what items/tools to search for
- open threads worth returning to
- important visited places
We keep it:
- short
- deduplicated
- structured
- bounded (so it doesn’t explode)
If that LLM call fails or returns invalid JSON:
- we simply skip the update
- the run continues safely
The goal is to make the agent stay coherent over long runs.
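The merge-and-bound step around that LLM call can be sketched like this (the memory keys and bounds are illustrative, not the actual schema):

```python
import json

MEMORY_KEYS = ("facts", "obstacles", "items_to_find", "open_threads", "key_places")

def update_memory(current: dict, llm_reply: str, max_items: int = 8) -> dict:
    """Merge an LLM's compressed-memory reply into the stored memory JSON.
    Invalid or non-JSON replies are skipped so the run continues safely."""
    try:
        proposed = json.loads(llm_reply)
    except json.JSONDecodeError:
        return current  # skip the update, keep the old memory
    if not isinstance(proposed, dict):
        return current
    merged = {}
    for key in MEMORY_KEYS:
        # Deduplicate while preserving order, then bound the list size.
        items = list(dict.fromkeys(current.get(key, []) + proposed.get(key, [])))
        merged[key] = items[-max_items:]
    return merged
```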
In addition to this long-term synthesized memory, the agent retrieves an authoritative short-term memory summary every 10 steps via the memory() tool, ensuring local consistency and correcting possible drift in recent reasoning.
Specialized module #2: Objective Planning (Goal Management)
Action selection is short-horizon. But Zork requires long-horizon intent.
So we run a separate “planner” LLM that:
- updates objectives (explore, open, unlock, acquire key/lamp, return somewhere)
- proposes up to a few suggested next actions
- provides short evidence
Crucially:
- planner suggestions are not auto-executed
- they are injected into the prompt as guidance
- the main ReAct decision still chooses the next tool/action
This separation reduces goal drift:
- the agent behaves like it has a mental TODO list
- and doesn’t wander aimlessly as often
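Turning planner output into prompt guidance, rather than actions, can be sketched as plain string assembly (section labels and bounds are illustrative):

```python
def build_guidance(objectives: list[str], suggestions: list[str], evidence: str) -> str:
    """Planner output is injected as prompt guidance, never auto-executed:
    the main ReAct step still makes the final tool choice."""
    lines = ["CURRENT OBJECTIVES:"]
    lines += [f"- {o}" for o in objectives[:5]]        # bounded list
    lines.append("SUGGESTED NEXT ACTIONS (advisory only):")
    lines += [f"- {s}" for s in suggestions[:3]]       # bounded list
    if evidence:
        lines.append(f"EVIDENCE: {evidence}")
    return "\n".join(lines)
```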
Deterministic Overrides: Sometimes We Don’t Ask the LLM
Some policies are too important to leave to “model mood.”
Example: treasure acquisition
If we see obvious treasure nouns in visible objects:
- we immediately take <item>
- no debate, no planning, no cleverness
Checkpoints as a Strategy, Not Just a Feature
The agent uses checkpoints like a game speedrunner would:
- keep a “loop” checkpoint as a stable anchor
- save a “best” checkpoint after scoring gains
That means:
- progress is protected
- exploration can be more aggressive
- loop recovery is fast
It’s a pragmatic way to make the system resilient under a move budget.
What You Get From This Approach
Compared to a vanilla “LLM + play_action” loop, this system is:
- more reliable (fewer parser deaths, fewer infinite loops)
- more efficient (less move waste, fewer repeated actions)
- more scalable (memory doesn’t balloon)
- more coherent (objectives keep the agent on track)
- more intentional (action_probe and valid_actions are used strategically)
Final Takeaway
Text adventures punish the exact things LLMs love:
- improvisation in language
- repetition
- vague intent
- verbose context
So we respond with the opposite:
- strict grammar
- structured state
- explicit recovery
- bounded but long-term memory
- deliberate planning
Evaluations
The evaluation was run over 100 steps and 3 seeds, using lostpig as the test game.
The agent showed improved stability: there are fewer loops and parser errors. The tools are used more strategically, especially valid_actions and action_probe, which are called mostly when the agent is stuck. The agent also appears more intentional, with a better sense of direction and progress, likely thanks to the planning module and the memory compression that keeps track of important facts and objectives.
However, the score progression compared to a vanilla ReAct baseline is not as big as expected: a mean of 2 points for our approach versus 1 point for the vanilla one.
We can hypothesize that the agent is still not using the tools as effectively as it could, and that the planning module is not providing useful guidance. We can also hypothesize that the evaluation budget (100 steps) is too low to see the benefits of the approach, which is designed to be more effective in longer runs where reliability and coherence matter more.
Here are the results of the evaluation:
Evaluation Results: Text Adventure Agent Submission
Game: lostpig
Trials: 3/3 successful
Max steps per trial: 100
Score statistics: mean 2.00, std 0.00, min 2, max 2
Exploration: mean moves 65.7, mean locations 14.3
Per-trial scores: [2, 2, 2]
Potential Improvements
- Navigation tool: a go_to(location) tool that uses the transition graph to find a sequence of moves from the current location to the target location (with a BFS algorithm, for example) and applies them automatically instead of letting the LLM guess the path. This could reduce move waste and improve reliability.