Building a Reliable MCP Agent for Zork-Style Text Adventures
Text adventures sound trivial: you read a paragraph, type a command, get a new paragraph.
But once you put an LLM in that loop, you learn quickly that the hardest enemies aren’t in the dungeon—they’re in the interface.
What kills most LLM agents in Zork-like games is a predictable set of failure modes:
- Parser brittleness: the game rejects slightly-wrong phrasing.
- Looping: the model repeats actions, rooms, or “no-op” moves.
- Move budget waste: doing “admin” actions that consume moves.
- Prompt bloat: raw history gets too long and too noisy.
- Goal drift: the model forgets what it was trying to do.
Several of these challenges are also discussed in the paper "TextQuests: How Good are LLMs at Text-Based Video Games?" (https://arxiv.org/pdf/2507.23701), notably memory, coherence, and planning.
So we didn’t build “a prompt.” We built a system with two main components:
- an MCP server that exposes the game through robust tools and instrumentation
- and an agent that treats the LLM as one component among others (memory, planning, recovery policies)
Our focus was on the previous failure modes, and how to design around them with tools and guardrails.
This is a high-level tour of the approach, focusing on the big ideas, without getting into implementation details.
The code is available in the HuggingFace space: https://huggingface.co/spaces/LLM-course/Text-game-agent-EILLES
The Setup: Two Pieces, One Loop
1) mcp_server.py — the game adapter + instrumentation layer
The MCP server acts like the game interface for the agent. It:
- owns the environment (TextAdventureEnv)
- runs commands (play_action)
- tracks exploration metadata (rooms, transitions, tried actions)
- exposes tools that help reasoning without spending moves
- provides safety mechanisms like checkpoints and action simulation
2) agent.py — the policy engine + ReAct decision-maker
The agent:
- outputs strict ReAct steps (THOUGHT -> TOOL -> ARGS)
- can only interact via MCP tools (never “talks to the game” directly)
- uses guardrails to keep the LLM from hallucinating tools/commands, looping, spamming, etc.
- uses two additional LLM calls as specialized modules:
- memory compression (long-term, high-signal memory)
- objective planning (goal updates + suggested next actions)
We treat the LLM as a reasoning module, not as a reliable system controller. All safety, consistency, and state tracking are enforced outside the model.
Why “Tooling” is important: The MCP Server as a Game Interface
A Zork parser is not a friendly API. If the model invents commands like “look around carefully”, the game will often respond with something like:
“That sentence isn’t one I recognize.”
If you only expose play_action, the agent becomes a guessing machine.
So the MCP server provides a richer interface that makes the world “legible”:
- Structured state (score, moves, inventory, room, “done”, a stable hash)
- Inventory without spending a move
- Valid actions (best-effort list) for recovery
- A map/graph of explored rooms and transitions
- Actions tried per room to avoid repeating
- Checkpoints to rollback after loops or risky moves
- Action probing (simulate before committing)
This set of tools is what turns the text game into something the agent can navigate reliably.
Part 1 — The MCP Server: Turning a Game into a Usable API
The Server’s Core Idea: Track More Than the Game Tracks
The environment gives you:
- observation text
- score/moves (usually)
- maybe inventory (depending on wrapper)
But it doesn’t give you the extra structure an agent needs to be efficient:
- Where have I been?
- What did I already try here?
- How do rooms connect?
- Am I stuck in a loop?
So the server maintains that meta-state itself:
- a short history of actions and results
- a set of locations (rooms) discovered
- a transition graph (room --action--> room)
- an index of actions tried per location
- checkpoint snapshots for rollback
- a stable-ish state hash used to detect loops
This is not just logging. It becomes actionable tool output the agent can rely on.
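As a minimal sketch (class and field names here are illustrative, not the actual server code), this meta-state can be a handful of dictionaries keyed by room, plus a stable-ish hash for loop detection:

```python
import hashlib
from collections import defaultdict

class MetaState:
    """Exploration metadata the environment itself doesn't track (sketch)."""
    def __init__(self):
        self.history = []               # (action, observation head) pairs, bounded
        self.locations = set()          # rooms discovered so far
        self.transitions = {}           # (room, action) -> destination room
        self.tried = defaultdict(set)   # room -> actions attempted there

    def record(self, room, action, new_room, observation):
        # Keep only a short rolling window of history to bound prompt size.
        self.history = (self.history + [(action, observation[:80])])[-50:]
        self.locations.update({room, new_room})
        self.transitions[(room, action)] = new_room
        self.tried[room].add(action)

    @staticmethod
    def state_hash(observation, inventory, room):
        """Stable-ish fingerprint of the current state, used to detect loops."""
        key = f"{room}|{sorted(inventory)}|{observation.strip()}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]
```

Because the hash depends only on room, inventory, and observation text, revisiting the same situation yields the same fingerprint, which is exactly what loop detection needs.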
Room Awareness: The Small Heuristic That Makes Everything Work
Most downstream reasoning depends on “what room am I in?”
The server uses a heuristic to extract the room title from the observation:
- pick the first plausible “header-like” line
- ignore copyright/revision boilerplate
- ignore long narrative sentences
This matters because room identity powers:
- mapping
- “tried actions” grouping
- loop detection context
- objective tracking (“return to grating”, “open mailbox”, etc.)
If you don’t have stable room identity, the agent’s memory becomes confused.
The Minimal but Critical Tools
play_action(action)
The main interaction tool:
- runs the command
- returns the observation
- appends optional “+points” signals and “GAME OVER”
- never crashes the tool (so the run doesn’t die on edge cases)
This tool is deliberately boring—but highly reliable.
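A sketch of that "boring but reliable" shape, assuming a step-style environment interface (the `env.step` signature here is an assumption, not the actual wrapper API):

```python
def play_action(env, action: str) -> str:
    """Run one game command and return an annotated observation.
    Never raises: a tool that crashes kills the whole run."""
    try:
        obs, score_delta, done = env.step(action)  # assumed env interface
    except Exception as exc:
        return f"[tool error: {exc!r}] Try a simpler command."
    if score_delta > 0:
        obs += f"\n[+{score_delta} points]"   # make scoring gains explicit
    if done:
        obs += "\nGAME OVER"                  # make termination explicit
    return obs
```

Surfacing score deltas and termination directly in the text means the LLM never has to infer them from prose.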
inventory()
A huge move-saver: it returns inventory without advancing the game.
In text adventures, calling inventory as a game command costs a move in many setups, so treating inventory as a tool query is a big advantage.
memory()
A compact summary tool that provides “authoritative state”:
- location
- score/moves
- recent action heads
- last observation
It’s a sanity anchor when the agent gets confused.
valid_actions()
A helpful tool when stuck:
- tries to fetch the actual valid actions if the environment exposes them
- otherwise falls back to a canonical action menu
The agent uses it sparingly—only when stuck or after parser failures.
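The try-then-fall-back pattern can be sketched like this (the `get_valid_actions` hook and the canonical menu contents are assumptions for illustration):

```python
CANONICAL_ACTIONS = [
    "north", "south", "east", "west", "up", "down", "in", "out",
    "look", "open door", "take all", "examine room",
]

def valid_actions(env) -> list[str]:
    """Prefer the environment's own action list; fall back to a canonical menu."""
    getter = getattr(env, "get_valid_actions", None)  # assumed optional hook
    if callable(getter):
        try:
            actions = getter()
            if actions:
                return list(actions)
        except Exception:
            pass  # fall through to the canonical menu
    return list(CANONICAL_ACTIONS)
```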
tried_actions()
The anti-loop tool:
- returns actions already attempted in each room
- helps the agent choose new high-value actions instead of repeating open mailbox 10 times
get_map() and graph()
These expose exploration as:
- a human-readable map (for prompts)
- a structured JSON graph (for future logic/visualization)
Mapping gives the agent an explicit “where have I been?” memory that the LLM doesn’t have to hallucinate.
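Rendering the transition graph into a prompt-friendly map is straightforward; a sketch (assuming the `(room, action) -> room` dictionary described above):

```python
def render_map(transitions: dict[tuple[str, str], str]) -> str:
    """Turn the (room, action) -> room graph into a human-readable map
    suitable for dropping directly into a prompt."""
    lines = []
    for (src, action), dst in sorted(transitions.items()):
        lines.append(f"{src} --{action}--> {dst}")
    return "\n".join(lines) if lines else "(nothing explored yet)"
```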
Guardrail Tools That Make the System Feel "Serious"
Checkpoints (checkpoint_save, checkpoint_restore)
Checkpoints are a reliability hack with real impact:
- if the agent detects a loop or makes a catastrophic move, it can rollback
- we keep at least one “loop” checkpoint as a stable anchor
- we can also maintain a “best” checkpoint after scoring gains
This transforms the exploration strategy:
- you can take risks, because you can recover
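A checkpoint store can be sketched as named snapshots, assuming the environment exposes `get_state()`/`set_state()` (that interface is an assumption here, common in text-game wrappers):

```python
import copy

class CheckpointStore:
    """Named snapshots of environment state plus tracking metadata (sketch)."""
    def __init__(self, env):
        self.env = env
        self.slots = {}

    def save(self, name: str, meta: dict):
        # Snapshot both the game and the server's own tracking metadata,
        # so a restore doesn't leave the map/history out of sync.
        self.slots[name] = (self.env.get_state(), copy.deepcopy(meta))

    def restore(self, name: str) -> dict:
        state, meta = self.slots[name]
        self.env.set_state(state)
        return copy.deepcopy(meta)  # caller reinstates the tracking metadata
```

Keeping a "loop" slot and a "best" slot is then just two named saves.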
action_probe(action) — action simulation without commitment
This is one of the more original parts of the server.
The idea:
- save a snapshot
- perform the action
- record deltas (score, moves, hash, location changes)
- restore the snapshot
- restore tracking metadata too (so probing doesn’t poison history/map)
It returns a compact JSON “what would happen if…?” report.
This enables a strong behavior: evaluating candidate actions via simulation and rollback, without committing a move (when snapshot/restore succeeds).
We keep it cheap (probe only a couple of actions) but it’s an excellent tie-breaker when stuck.
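The snapshot-act-restore cycle can be sketched as follows (again assuming a `get_state()`/`set_state()`/`step()` environment interface, which is an assumption for illustration):

```python
import copy

def action_probe(env, meta: dict, action: str) -> dict:
    """'What would happen if?' report: run the action against a snapshot,
    record the deltas, then restore both game state and tracking metadata."""
    snapshot = env.get_state()
    meta_snapshot = copy.deepcopy(meta)
    obs, score_delta, done = env.step(action)   # assumed env interface
    report = {
        "action": action,
        "observation_head": obs[:120],
        "score_delta": score_delta,
        "done": done,
    }
    env.set_state(snapshot)      # roll the game back
    meta.clear()
    meta.update(meta_snapshot)   # probing must not poison history/map
    return report
```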
Part 2 — The Agent: ReAct, But Constrained and Safe
Strict ReAct as a Contract (Not a Style)
The agent uses a strict format:
- THOUGHT: one short sentence
- TOOL: one of the allowed tool names
- ARGS: valid JSON
That format is useful for stability:
- the agent becomes machine-parseable
- tool calls are consistent
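Because the format is a contract, parsing it is a small exercise in strictness; a sketch (the regex and rejection policy are illustrative, not the agent's exact code):

```python
import json
import re

STEP_RE = re.compile(
    r"THOUGHT:\s*(?P<thought>.+?)\s*"
    r"TOOL:\s*(?P<tool>\w+)\s*"
    r"ARGS:\s*(?P<args>\{.*\})",
    re.S,
)

def parse_react_step(text: str, allowed_tools: set[str]):
    """Parse one strict THOUGHT/TOOL/ARGS step; None means 'reject and retry'."""
    m = STEP_RE.search(text)
    if not m:
        return None                      # format violation
    tool = m.group("tool")
    if tool not in allowed_tools:
        return None                      # hallucinated tool
    try:
        args = json.loads(m.group("args"))
    except json.JSONDecodeError:
        return None                      # malformed ARGS
    return m.group("thought"), tool, args
```

Anything that fails to parse is rejected outright, and the agent asks for a fresh step rather than guessing what the model meant.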
Important Policy: Command Grammar Discipline
Text adventure parsers punish creativity.
So the agent enforces a tight grammar:
- movement is single-word: north, in, up, …
- interaction is short verb+noun: open mailbox, take lamp, …
- exotic multiword commands are allowed only if they appear exactly in valid_actions
That last rule is a big deal:
- it prevents the LLM from inventing fancy commands
- it converts “language” into “API calls”
- it makes the agent much more robust across seeds
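The whole grammar check fits in a few lines; a sketch (the verb list is illustrative, not the agent's exact allowlist):

```python
MOVEMENT = {"north", "south", "east", "west", "up", "down", "in", "out"}
VERBS = {"open", "take", "drop", "read", "push", "pull", "turn", "examine", "light"}

def is_allowed_command(cmd: str, valid_actions: set[str]) -> bool:
    """Command grammar discipline: single-word movement, short verb+noun,
    and anything longer only if it appears verbatim in valid_actions."""
    words = cmd.lower().split()
    if len(words) == 1:
        return words[0] in MOVEMENT
    if len(words) == 2:
        return words[0] in VERBS
    return cmd.lower() in valid_actions
```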
The Agent’s Guardrails: How We Stop Thrashing
Here are the big guardrail categories (conceptually, not line-by-line):
1) Tool validation
If the model requests an unknown tool:
- we don’t execute it
- we inject feedback listing allowed tools
- we force recovery behavior next
2) Parser failure detection
If the observation looks like a parser error (“I don’t know the word…”, “sentence isn’t recognized”):
- we switch into recovery mode
- we fetch valid actions (once)
- we force a simpler action selection
3) Anti-repeat behavior (local)
We track:
- the last action
- actions blocked in the current room
- actions tried in the current room
If the model repeats a no-progress action:
- we refuse it
- we force a new choice
4) Loop detection (global)
The agent uses the server’s state_hash:
- if the same hash repeats several times, we’re looping
Then we can:
- restore a checkpoint
- re-orient with look
- switch strategy
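The detector itself is tiny: count how often the same state hash appears in a recent window. A sketch (window size and threshold are illustrative values):

```python
from collections import deque

class LoopDetector:
    """Flag a loop when the same state hash shows up too often recently."""
    def __init__(self, window: int = 8, threshold: int = 3):
        self.recent = deque(maxlen=window)  # rolling window of state hashes
        self.threshold = threshold

    def observe(self, state_hash: str) -> bool:
        self.recent.append(state_hash)
        # Looping = the same fingerprint seen `threshold` times in the window.
        return self.recent.count(state_hash) >= self.threshold
```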
5) Movement bias (Zork-specific optimization)
When multiple movement options exist:
- “in / up / down” tend to unlock deeper progress
- cardinal directions tend to be broad exploration
So we bias toward in/up/down (especially after seeing them in valid actions).
It’s a small heuristic that often pays off.
Two Specialized LLM Modules: Memory and Planning
This is where the project becomes more than a typical ReAct agent.
Specialized module #1: Memory Compression (Long-Term Memory)
Raw history is short-term memory. It’s verbose, expensive, and noisy.
So we maintain a synthesized memory JSON, updated periodically by an LLM whose only job is to compress experience into decision-useful facts:
- durable facts learned
- obstacles + what is needed
- what items/tools to search for
- open threads worth returning to
- important visited places
We keep it:
- short
- deduplicated
- structured
- bounded (so it doesn’t explode)
If that LLM call fails or returns invalid JSON:
- we simply skip the update
- the run continues safely
The goal is to make the agent stay coherent over long runs.
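The merge-and-bound step around that LLM call can be sketched like this (the memory keys and bounds are illustrative, not the actual schema):

```python
import json

MEMORY_KEYS = ("facts", "obstacles", "items_to_find", "open_threads", "key_places")

def update_memory(current: dict, llm_reply: str, max_items: int = 8) -> dict:
    """Merge an LLM's compressed-memory reply into the stored memory JSON.
    Invalid or non-JSON replies are skipped so the run continues safely."""
    try:
        proposed = json.loads(llm_reply)
    except json.JSONDecodeError:
        return current  # skip the update, keep the old memory
    if not isinstance(proposed, dict):
        return current
    merged = {}
    for key in MEMORY_KEYS:
        # Deduplicate while preserving order, then bound the list size.
        items = list(dict.fromkeys(current.get(key, []) + proposed.get(key, [])))
        merged[key] = items[-max_items:]
    return merged
```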
In addition to this long-term synthesized memory, the agent retrieves an authoritative short-term memory summary every 10 steps via the memory() tool, ensuring local consistency and correcting possible drift in recent reasoning.
Specialized module #2: Objective Planning (Goal Management)
Action selection is short-horizon. But Zork requires long-horizon intent.
So we run a separate “planner” LLM that:
- updates objectives (explore, open, unlock, acquire key/lamp, return somewhere)
- proposes up to a few suggested next actions
- provides short evidence
Crucially:
- planner suggestions are not auto-executed
- they are injected into the prompt as guidance
- the main ReAct decision still chooses the next tool/action
This separation reduces goal drift:
- the agent behaves like it has a mental TODO list
- and doesn’t wander aimlessly as often
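Turning planner output into prompt guidance, rather than actions, can be sketched as plain string assembly (section labels and bounds are illustrative):

```python
def build_guidance(objectives: list[str], suggestions: list[str], evidence: str) -> str:
    """Planner output is injected as prompt guidance, never auto-executed:
    the main ReAct step still makes the final tool choice."""
    lines = ["CURRENT OBJECTIVES:"]
    lines += [f"- {o}" for o in objectives[:5]]        # bounded list
    lines.append("SUGGESTED NEXT ACTIONS (advisory only):")
    lines += [f"- {s}" for s in suggestions[:3]]       # bounded list
    if evidence:
        lines.append(f"EVIDENCE: {evidence}")
    return "\n".join(lines)
```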
Deterministic Overrides: Sometimes We Don’t Ask the LLM
Some policies are too important to leave to “model mood.”
Example: treasure acquisition
If we see obvious treasure nouns in visible objects:
- we immediately take <item>
- no debate, no planning, no cleverness
Checkpoints as a Strategy, Not Just a Feature
The agent uses checkpoints like a game speedrunner would:
- keep a “loop” checkpoint as a stable anchor
- save a “best” checkpoint after scoring gains
That means:
- progress is protected
- exploration can be more aggressive
- loop recovery is fast
It’s a pragmatic way to make the system resilient under a move budget.
What You Get From This Approach
Compared to a vanilla “LLM + play_action” loop, this system is:
- more reliable (fewer parser deaths, fewer infinite loops)
- more efficient (less move waste, fewer repeated actions)
- more scalable (memory doesn’t balloon)
- more coherent (objectives keep the agent on track)
- more intentional (action_probe and valid_actions are used strategically)
Final Takeaway
Text adventures punish the exact things LLMs love:
- improvisation in language
- repetition
- vague intent
- verbose context
So we respond with the opposite:
- strict grammar
- structured state
- explicit recovery
- bounded but long-term memory
- deliberate planning
Evaluations
The evaluation was run over 100 steps and 3 seeds, using lostpig as the test game.
The agent showed improved stability: there are fewer loops and parser errors. The tools are used more strategically, especially valid_actions and action_probe, which are called mostly when the agent is stuck. The agent also appears more intentional, with a better sense of direction and progress, likely thanks to the planning module and the memory compression that keeps track of important facts and objectives.
However, the score progression compared to a vanilla ReAct baseline is not as big as expected: a mean of 2 points for our approach versus 1 point for the vanilla one.
We can hypothesize that the agent is still not using the tools as effectively as it could, and that the planning module is not providing useful guidance. We can also hypothesize that the evaluation budget (100 steps) is too low to see the benefits of the approach, which is designed to be more effective in longer runs where reliability and coherence matter more.
Here are the results of the evaluation:
Evaluation Results: Text Adventure Agent Submission
Game: lostpig
Trials: 3/3 successful
Max steps per trial: 100
Score statistics: mean 2.00, std 0.00, min 2, max 2
Exploration: mean moves 65.7, mean locations 14.3
Per-trial scores: [2, 2, 2]
Potential Improvements
- Navigation tool: a go_to(location) tool that uses the transition graph to find a sequence of moves from the current location to the target location (with a BFS algorithm, for example) and applies them automatically instead of letting the LLM guess the path. This could reduce move waste and improve reliability.