
ROUVROY Clément - Agentic Zork

Homework for LLM for Code and Proof course.

AI Disclaimer: Most of the code is AI-written. I preferred to read papers and articles about the subject to understand it deeply rather than spending five hours putting everything together myself. This report is entirely human-written; no AI was involved in its making.

This report goes through every addition I made to this project compared to the base implementation.

Jericho Integration

In my original testing, the agent hallucinated moves based on its location. For example, the Fountain Room led it to try to put water in a glass... while it had no glass in its inventory.

Following the assignment's indications, I integrated the Jericho framework to provide rich observations to the agent. For example, here is how I get the objects available in the current room:

def get_objects_in_room(self) -> list[str]:
    # Walk the Z-machine object tree: a room's children are the objects
    # it contains, linked together through their sibling pointers.
    room = self.env.env.get_player_location()
    names = []
    child = room.child
    while child > 0:
        obj = self.env.env.get_object(child)
        names.append(obj.name)
        child = obj.sibling
    return names

On zork1, my agent kept dying from actions that were not clearly signposted as lethal. Since Jericho exposes a set_state function, I let my agent take a special Guardian action that allows it to test an action without dying (as if we stopped in the forest, thought "hmm, is this going to kill me?", and asked Google). This is done easily by saving the state, stepping, and rolling back:

def check_action(self, action: str) -> str:
    # Simulate the action on the current state, then restore it.
    saved = self.env.env.get_state()
    observation, reward, done, info = self.env.step(action)
    self.env.env.set_state(saved)
    return observation
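The Guardian's output can then be scanned for death text before committing to the real move. A minimal sketch; the function name and marker strings below are my own illustration, not the project's actual code:

```python
# Hypothetical helper: decide whether a probed observation indicates death.
# The marker strings are examples of Zork-style death text, not exhaustive.
DEATH_MARKERS = ("you have died", "you are dead")

def looks_deadly(outcome: str) -> bool:
    text = outcome.lower()
    return any(marker in text for marker in DEATH_MARKERS)
```

The agent would call `check_action(action)` first and only execute the action for real if `looks_deadly` returns False.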

Memory system

The base StudentAgent has no memory mechanism apart from the history list. This was a huge issue. For example, it is well known that coins should be thrown into fountains, and it is also well known that coins should not be left on the ground, hence all my agents were:

  1. Taking the coin
  2. Throwing it into the fountain
  3. "Oh no, we should not leave a coin on the ground" -> take the coin.
  4. "Oh, a fountain, let's toss it inside" -> throw the coin.
  5. "Oh no, we should not leave a coin on the ground" -> ...

This led me to the first memory kind: ActionMemory, which stores knowledge about ineffective actions (and also the rooms discovered so far).
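A minimal sketch of what such an ActionMemory could look like; the class and method names are my own assumptions, not the project's exact implementation:

```python
from collections import defaultdict

class ActionMemory:
    """Illustrative sketch: remember which actions failed where."""

    def __init__(self):
        # (location, action) -> number of times it produced no effect
        self.ineffective = defaultdict(int)
        self.known_rooms = set()

    def record(self, location, action, effective):
        self.known_rooms.add(location)
        if not effective:
            self.ineffective[(location, action)] += 1

    def is_ineffective(self, location, action, threshold=2):
        return self.ineffective[(location, action)] >= threshold
```

With this, the coin/fountain loop above breaks: after "take the coin" is recorded as ineffective twice in the Fountain Room, the heuristic can steer the agent away from it.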

Also, the model (especially the 3B one) was unable to "switch objectives". For example, in lostpig you need to move a chair to another room and climb onto it. The agent thinks "oh, I saw a chair", moves to the room with the chair, then thinks "oh, what is in the book?": it lacks memory of what it set out to do. For this I added WorkingMemory, which holds the current goal and short-term goals (the LLM can write into it).
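Sketched out, a WorkingMemory could be as simple as a current goal plus a small stack of deferred sub-goals; again, these names are illustrative assumptions:

```python
class WorkingMemory:
    """Illustrative sketch: current goal + stack of deferred sub-goals."""

    def __init__(self):
        self.current_goal = None
        self.short_term = []  # deferred goals, most recent last

    def push_goal(self, goal):
        # Interrupting goal: park the current one and focus on the new one.
        if self.current_goal is not None:
            self.short_term.append(self.current_goal)
        self.current_goal = goal

    def complete_goal(self):
        # Resume the most recently deferred goal, if any.
        self.current_goal = self.short_term.pop() if self.short_term else None
```

In the lostpig example, "climb onto the chair" stays on the stack while the agent detours to "move the chair to the other room", so the detour no longer erases the objective.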

Finally, the agent was cycling a lot (for example in the Fountain Room), so I added a MapMemory. It also supports reasoning like "this is too high for me, maybe I can get X at the library?".
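A MapMemory along these lines could track a room graph with visit counts and where objects were last seen; this is a hedged sketch under assumed names, not the actual code:

```python
from collections import defaultdict

class MapMemory:
    """Illustrative sketch: room graph, visit counts, object sightings."""

    def __init__(self):
        self.edges = defaultdict(dict)         # room -> {direction: room}
        self.visit_count = defaultdict(int)    # room -> times visited
        self.objects_seen = defaultdict(set)   # room -> object names

    def record_move(self, src, direction, dst):
        self.edges[src][direction] = dst
        self.visit_count[dst] += 1

    def where_is(self, obj):
        # Supports "maybe I can get X at the library?" style recall.
        return [room for room, objs in self.objects_seen.items() if obj in objs]
```

High visit counts flag cycling, and `where_is` lets the agent plan a detour to a room where it previously saw a useful object.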

Action heuristic

Instead of having the LLM guess actions from a raw list or from its own imagination, which leads to weird actions (like "put fire on pants"?!), I do the following:

  1. Get all valid actions using Jericho,
  2. Associate an index with each of them, plus a heuristic based on memory,
  3. Ask the LLM to output an index,
  4. Return the corresponding action.

This drastically reduces hallucinations, though the model is then "capped" by the heuristics. Based on my experiments (limited to one Qwen model) this works slightly better. Moreover, since instruct models have seen a lot of this kind of multiple-choice data during training, they are used to reasoning over this input format.
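The steps above can be sketched as follows; the prompt wording, function name, and fallback policy are my own illustration of the technique, not the project's exact code:

```python
def choose_action(llm, valid_actions, penalized=()):
    """Illustrative sketch: indexed action selection with memory hints.

    `llm` is any callable taking a prompt string and returning a reply;
    `penalized` holds actions flagged as ineffective by memory.
    """
    lines = []
    for i, action in enumerate(valid_actions, start=1):
        note = " [previously ineffective here]" if action in penalized else ""
        lines.append(f"{i}. {action}{note}")
    prompt = "Pick exactly one action by its index:\n" + "\n".join(lines)
    reply = llm(prompt)
    try:
        index = int(reply.strip().split()[0])
    except (ValueError, IndexError):
        index = 1  # malformed reply: fall back to the first action
    index = max(1, min(index, len(valid_actions)))  # clamp out-of-range picks
    return valid_actions[index - 1]
```

Because the model can only emit an index, it physically cannot invent "put fire on pants"; the clamp and fallback guarantee some valid action is always returned.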

Fallback for cycles

The LLM often cycled without gaining any new score. For this I added an unproductive_streak counter, which triggers a special warning if the score has not changed in more than K turns:

if self.unproductive_streak >= K:
    reflection = (
        f"Step {step}: {self.unproductive_streak} steps without progress at {location}. "
        f"Last action '{action}' had no effect. Need to try a fundamentally different approach."
    )

This is combined with a hard check in the prompt. If the agent has repeated the exact same action at least twice in the last 10 turns, an annotation is appended directly next to that action in the available choices to discourage it from looping:

"  1. action [Already tried {count} times in the last 10 steps!]"
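The counting logic behind that annotation can be sketched like this; the function name and window handling are my own assumptions:

```python
from collections import Counter

def annotate_actions(actions, action_history, window=10):
    """Illustrative sketch: flag actions repeated in the recent history."""
    counts = Counter(action_history[-window:])
    annotated = []
    for i, action in enumerate(actions, start=1):
        if counts[action] >= 2:
            annotated.append(
                f"  {i}. {action} "
                f"[Already tried {counts[action]} times in the last 10 steps!]"
            )
        else:
            annotated.append(f"  {i}. {action}")
    return annotated
```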

Goal Management

As explained before, the LLM lacks a global goal or direction. I added a GoalManager that uses the LLM to generate high-level goals when the agent stagnates. The context passed to it is deliberately concise:

def update_goals_with_llm(self, seed: int, step: int, score: int, max_score: int, ...):
    context_parts = [
        f"Step {step}, Score {score}/{max_score}, Location: {location}",
        f"Inventory: {', '.join(inventory) if inventory else 'nothing'}",
        f"Rooms visited: {len(kg_map.visit_count)}",
    ]
    # ask the LLM to generate a new goal from this context.

This gives the agent a renewed sense of direction after it has finished exploring a large branch of the map.