---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 4.43.0
app_file: app.py
python_version: "3.10"
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---

# StoryWeaver

StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.

This README is written for teammates who need to:

- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first

## What This Repository Contains

At a high level, the project has five responsibilities:

1. parse player input into structured intent
2. keep the world state consistent across turns
3. generate the next story response and options
4. expose the system through a Gradio UI
5. export logs and run reproducible evaluation

This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Create `.env`

Create a `.env` file in the project root:

```env
QWEN_API_KEY=your_api_key_here
```

Optional:

```env
STORYWEAVER_LOG_DIR=logs/interactions
```

### 3. Run the app

```bash
python app.py
```

Default local URL:

- `http://localhost:7860`

### 4. Run evaluation

```bash
python evaluation/run_evaluations.py --task all --repeats 3
```

Useful variants:

```bash
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
```

## Recommended Reading Order

If you are new to the repo, read files in this order:

1. [state_manager.py](./state_manager.py)
   Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
2. [nlu_engine.py](./nlu_engine.py)
   Why: this shows how raw player text becomes structured intent.
3. [story_engine.py](./story_engine.py)
   Why: this is the main generation pipeline and fallback logic.
4. [app.py](./app.py)
   Why: this connects the UI with the engines and now also writes interaction logs.
5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
   Why: this shows how we measure the system for the report.

If you only have 10 minutes, start with:

- `GameState.pre_validate_action`
- `GameState.check_consistency`
- `GameState.apply_changes`
- `NLUEngine.parse_intent`
- `StoryEngine.generate_story_stream`
- `process_user_input` in [app.py](./app.py)

## Repository Map

```text
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/
```

Core responsibilities by file:

- [app.py](./app.py)
  Gradio app, session lifecycle, UI callbacks, per-turn logging.
- [state_manager.py](./state_manager.py)
  Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- [nlu_engine.py](./nlu_engine.py)
  Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- [story_engine.py](./story_engine.py)
  Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- [telemetry.py](./telemetry.py)
  Session metadata and JSONL interaction log export.
- [utils.py](./utils.py)
  API client setup, Qwen calls, JSON extraction, retry helpers.
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
  Reproducible experiment runner for the report.

## System Architecture

The main runtime path is:

`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`

There are two ideas that matter most in this codebase:

### 1. `GameState` is the source of truth

Almost everything meaningful lives in [state_manager.py](./state_manager.py):

- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history

When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.

### 2. The app is a coordinator, not the game logic

[app.py](./app.py) should mostly:

- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs

If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.

## Runtime Flow

### Text input flow

For normal text input, the path is:

1. `process_user_input` receives raw text from the UI
2. `NLUEngine.parse_intent` converts it into a structured intent dict
3. `GameState.pre_validate_action` blocks clearly invalid actions early
4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
5. `GameState.check_consistency` and `apply_changes` update state
6. the UI is refreshed with story text, options, and status panel
7. `_record_interaction_log` writes a JSONL record to disk

### Option click flow

Button clicks do not go through full free-text parsing. Instead:

1. the selected option is converted to an intent-like dict
2. the story engine processes it the same way as text input
3. the result is rendered and logged

This is useful because option interactions and free-text interactions now share the same evaluation and observability format.

## Main Modules in More Detail

### `state_manager.py`

This file defines:

- `PlayerState`
- `WorldState`
- `GameEvent`
- `GameState`

Important methods:

- `pre_validate_action`
  Rejects obviously invalid actions before calling the model.
- `check_consistency`
  Detects contradictions in proposed state changes.
- `apply_changes`
  Applies state changes and returns a readable change log.
- `validate`
  Makes sure the resulting state is legal.
- `to_prompt`
  Serializes the current game state into prompt-ready text.

When to edit this file:

- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts

### `nlu_engine.py`

This file is responsible for intent recognition.

Current behavior:

- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict with `parser_source`

Current intent labels include:

- `ATTACK`
- `TALK`
- `MOVE`
- `EXPLORE`
- `USE_ITEM`
- `TRADE`
- `EQUIP`
- `REST`
- `QUEST`
- `SKILL`
- `PICKUP`
- `FLEE`
- `CUSTOM`

When to edit this file:

- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling

### `story_engine.py`

This is the main generation module. It currently handles:

- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode

Important methods:

- `generate_opening_stream`
- `generate_story`
- `generate_story_stream`
- `process_option_selection_stream`
- `_fallback_response`

When to edit this file:

- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry

### `app.py`

This file is the UI entry point and interaction orchestrator.

Important responsibilities:

- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs

When to edit this file:

- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed

### `telemetry.py`

This file handles structured log export. It is intentionally simple and file-based:

- one session gets one JSONL file
- one turn becomes one JSON object line

This is useful for:

- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation

## Logging and Observability

Interaction logs are written under:

- [logs/interactions](./logs/interactions)

Each turn record includes at least:

- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot

Example shape:

```json
{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}
```

If you need to debug a bad interaction, the fastest path is:

1. check the log file
2. inspect `nlu_result`
3. inspect `telemetry.used_fallback`
4. inspect `state_changes`
5. inspect the post-turn snapshot

## Evaluation Pipeline

Evaluation entry point:

- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)

Datasets:

- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)

Results:

- [evaluation/results](./evaluation/results)

### What each task measures

#### Intent

- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency

#### Consistency

- action guard correctness via `pre_validate_action`
- contradiction detection via `check_consistency`

#### Latency

- NLU latency
- generation latency
- total latency
- fallback rate

#### Branch divergence

- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences

## Common Development Tasks

### Add a new intent

You will usually need to touch:

- [nlu_engine.py](./nlu_engine.py)
- [state_manager.py](./state_manager.py)
- [story_engine.py](./story_engine.py)
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)

Suggested checklist:

1. add the label to the NLU logic
2. decide whether it needs pre-validation
3. make sure story prompts know how to handle it
4. add at least a few evaluation examples

### Add a new location, NPC, quest, or item

Most of the time you only need:

- [state_manager.py](./state_manager.py)

That file contains the initial world setup and registry-style data.
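As a rough sketch of what "registry-style data" means here (the actual structure and field names in `state_manager.py` may differ; `ITEM_REGISTRY` and `register_item` are illustrative names, not the real API), adding an item is typically one new entry plus a duplicate-id guard:

```python
# Hypothetical sketch of registry-style item data; the real registry in
# state_manager.py may use different field names or dataclasses.
ITEM_REGISTRY = {
    "healing_herb": {"type": "consumable", "effect": {"hp": 15}},
}

def register_item(item_id: str, entry: dict) -> dict:
    """Add a new item entry; reject duplicate ids so state stays consistent."""
    if item_id in ITEM_REGISTRY:
        raise ValueError(f"item id already registered: {item_id}")
    ITEM_REGISTRY[item_id] = entry
    return entry

# Example: a new equippable weapon becomes one registry entry.
register_item("iron_sword", {"type": "weapon", "slot": "hand", "attack": 7})
```

Keeping additions in one place like this is what lets `pre_validate_action` and `check_consistency` see new content without any prompt or UI changes.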
### Add more evaluation cases

Edit files under:

- [evaluation/datasets](./evaluation/datasets)

This is the easiest way to improve the report without changing runtime logic.

### Investigate a strange game turn

Check in this order:

1. the interaction log under `logs/interactions`
2. `parser_source` in the NLU result
3. `telemetry` in the final story result
4. whether `pre_validate_action` rejected or allowed the turn
5. whether `check_consistency` flagged anything

### Change UI behavior without touching gameplay

Edit:

- [app.py](./app.py)

Try not to put game rules in the UI layer.

## Environment Notes

### If `QWEN_API_KEY` is missing

- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful

### If `openai` is not installed

- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior

### If `gradio` is not installed

- the app cannot launch
- offline evaluation scripts can still be useful

## Current Known Limitations

These are the main gaps we still know about:

- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available

## Suggested Team Workflow

If multiple teammates are working in parallel, this split is usually clean:

- gameplay/state teammate
  Focus on [state_manager.py](./state_manager.py)
- prompt/generation teammate
  Focus on [story_engine.py](./story_engine.py)
- NLU/evaluation teammate
  Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
- UI/demo teammate
  Focus on [app.py](./app.py)
- report teammate
  Focus on `evaluation/results`, `logs/interactions`, and case-study collection

## What To Use in the Final Report

For the course report, the most useful artifacts from this repo are:

- evaluation JSON outputs under `evaluation/results`
- interaction logs under `logs/interactions`
- dataset files under `evaluation/datasets`
- readable state transitions from `change_log`
- fallback metadata from `telemetry`

These can directly support:

- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis

## License

MIT