---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 4.43.0
app_file: app.py
python_version: "3.10"
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---
# StoryWeaver
StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.
This README is written for teammates who need to:
- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first
## What This Repository Contains
At a high level, the project has five responsibilities:
1. parse player input into structured intent
2. keep the world state consistent across turns
3. generate the next story response and options
4. expose the system through a Gradio UI
5. export logs and run reproducible evaluation
This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.
## Quick Start
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
### 2. Create `.env`
Create a `.env` file in the project root:
```env
QWEN_API_KEY=your_api_key_here
```
Optional:
```env
STORYWEAVER_LOG_DIR=logs/interactions
```
### 3. Run the app
```bash
python app.py
```
Default local URL:
- `http://localhost:7860`
### 4. Run evaluation
```bash
python evaluation/run_evaluations.py --task all --repeats 3
```
Useful variants:
```bash
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
```
## Recommended Reading Order
If you are new to the repo, read files in this order:
1. [state_manager.py](./state_manager.py)
Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
2. [nlu_engine.py](./nlu_engine.py)
Why: this shows how raw player text becomes structured intent.
3. [story_engine.py](./story_engine.py)
Why: this is the main generation pipeline and fallback logic.
4. [app.py](./app.py)
Why: this connects the UI with the engines and writes interaction logs.
5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
Why: this shows how we measure the system for the report.
If you only have 10 minutes, start with:
- `GameState.pre_validate_action`
- `GameState.check_consistency`
- `GameState.apply_changes`
- `NLUEngine.parse_intent`
- `StoryEngine.generate_story_stream`
- `process_user_input` in [app.py](./app.py)
## Repository Map
```text
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/
```
Core responsibilities by file:
- [app.py](./app.py)
Gradio app, session lifecycle, UI callbacks, per-turn logging.
- [state_manager.py](./state_manager.py)
Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- [nlu_engine.py](./nlu_engine.py)
Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- [story_engine.py](./story_engine.py)
Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- [telemetry.py](./telemetry.py)
Session metadata and JSONL interaction log export.
- [utils.py](./utils.py)
API client setup, Qwen calls, JSON extraction, retry helpers.
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
Reproducible experiment runner for the report.
## System Architecture
The main runtime path is:
`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`
There are two ideas that matter most in this codebase:
### 1. `GameState` is the source of truth
Almost everything meaningful lives in [state_manager.py](./state_manager.py):
- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history
When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.
### 2. The app is a coordinator, not the game logic
[app.py](./app.py) should mostly:
- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs
If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.
## Runtime Flow
### Text input flow
For normal text input, the path is:
1. `process_user_input` receives raw text from the UI
2. `NLUEngine.parse_intent` converts it into a structured intent dict
3. `GameState.pre_validate_action` blocks clearly invalid actions early
4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
5. `GameState.check_consistency` and `apply_changes` update state
6. UI is refreshed with story text, options, and status panel
7. `_record_interaction_log` writes a JSONL record to disk
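The seven steps above can be sketched as one orchestrator function. This is a simplified, non-streaming illustration: the method names match the ones documented in this README, but the glue code, signatures, and record fields are assumptions, not the actual `process_user_input` implementation.

```python
import time

def process_turn(text, nlu, state, engine, log):
    """Illustrative per-turn flow mirroring steps 1-7 (no streaming)."""
    start = time.perf_counter()
    intent = nlu.parse_intent(text)                      # step 2: text -> intent dict
    ok, reason = state.pre_validate_action(intent)       # step 3: cheap guard
    if not ok:
        return {"output": reason, "used_fallback": True}
    result = engine.generate_story(intent, state)        # step 4: narrative pipeline
    issues = state.check_consistency(result["changes"])  # step 5: contradiction check...
    change_log = state.apply_changes(result["changes"])  # ...then apply changes
    record = {                                           # step 7: one JSONL record
        "user_input": text,
        "nlu_result": intent,
        "latency_ms": (time.perf_counter() - start) * 1000,
        "consistency_issues": issues,
        "state_changes": result["changes"],
        "output_text": result["text"],
    }
    log.append(record)
    return {"output": result["text"], "changes": change_log}
```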
### Option click flow
Button clicks do not go through full free-text parsing. Instead:
1. the selected option is converted to an intent-like dict
2. the story engine processes it the same way as text input
3. the result is rendered and logged
This is useful because option interactions and free-text interactions share the same evaluation and observability format.
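A minimal sketch of that conversion; every field name here is an assumption modeled on the log record shown later, not the repo's actual shape:

```python
def option_to_intent(option_text: str, option_index: int) -> dict:
    """Illustrative: wrap a clicked option in the same intent-like dict shape
    that free-text parsing produces, so both paths log identically."""
    return {
        "intent": "CUSTOM",               # the option text is taken as-is, not re-parsed
        "target": None,
        "raw_text": option_text,
        "parser_source": "option_click",  # distinguishes clicks from free text in logs
        "option_index": option_index,
    }
```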
## Main Modules in More Detail
### `state_manager.py`
This file defines:
- `PlayerState`
- `WorldState`
- `GameEvent`
- `GameState`
Important methods:
- `pre_validate_action`
Rejects obviously invalid actions before calling the model.
- `check_consistency`
Detects contradictions in proposed state changes.
- `apply_changes`
Applies state changes and returns a readable change log.
- `validate`
Makes sure the resulting state is legal.
- `to_prompt`
Serializes the current game state into prompt-ready text.
When to edit this file:
- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts
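A deterministic rule in the spirit of `pre_validate_action` might look like this. The real method lives on `GameState`; the signature, state fields, and rules below are assumptions for illustration only.

```python
def pre_validate_action(intent: dict, state: dict) -> tuple[bool, str]:
    """Illustrative guard: reject clearly impossible actions before any model call."""
    if intent["intent"] == "USE_ITEM" and intent.get("target") not in state.get("inventory", []):
        # Cheap deterministic check; no LLM tokens spent on an impossible action
        return False, f"You don't have {intent.get('target')}."
    if intent["intent"] == "MOVE" and intent.get("target") not in state.get("exits", []):
        return False, "You can't go that way."
    return True, ""
```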
### `nlu_engine.py`
This file is responsible for intent recognition.
Current behavior:
- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict with `parser_source`
Current intent labels include:
- `ATTACK`
- `TALK`
- `MOVE`
- `EXPLORE`
- `USE_ITEM`
- `TRADE`
- `EQUIP`
- `REST`
- `QUEST`
- `SKILL`
- `PICKUP`
- `FLEE`
- `CUSTOM`
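The keyword fallback can be pictured like this. These rules are illustrative, not the ones in `nlu_engine.py` (and they use English keywords for readability; the real fallback presumably also handles Chinese input, which does not split on spaces):

```python
# Illustrative keyword table mapping surface words to the labels above.
KEYWORD_RULES = [
    ("attack", "ATTACK"), ("fight", "ATTACK"),
    ("talk", "TALK"), ("ask", "TALK"),
    ("go", "MOVE"), ("move", "MOVE"),
    ("look", "EXPLORE"), ("search", "EXPLORE"),
    ("use", "USE_ITEM"),
    ("rest", "REST"),
]

def keyword_fallback(text: str) -> dict:
    """Rule-based backup for when LLM parsing fails or is unavailable."""
    words = text.lower().split()
    for keyword, label in KEYWORD_RULES:
        if keyword in words:
            return {"intent": label, "parser_source": "keyword", "raw_text": text}
    # Nothing matched: hand the raw text through as a free-form action
    return {"intent": "CUSTOM", "parser_source": "keyword", "raw_text": text}
```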
When to edit this file:
- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling
### `story_engine.py`
This is the main generation module.
It currently handles:
- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode
Important methods:
- `generate_opening_stream`
- `generate_story`
- `generate_story_stream`
- `process_option_selection_stream`
- `_fallback_response`
When to edit this file:
- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry
### `app.py`
This file is the UI entry point and interaction orchestrator.
Important responsibilities:
- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs
When to edit this file:
- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed
### `telemetry.py`
This file handles structured log export.
It is intentionally simple and file-based:
- one session gets one JSONL file
- one turn becomes one JSON object line
This is useful for:
- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation
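The "one session, one file; one turn, one line" idea fits in a few lines of append-only writing. This is a sketch of the pattern, not the actual `telemetry.py` code (function name and path layout are assumptions):

```python
import json
import os

def append_turn(log_dir: str, session_id: str, record: dict) -> str:
    """One session -> one JSONL file; one turn -> one JSON object line."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{session_id}.jsonl")
    # Append mode: a crash mid-session never corrupts earlier turns
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

`ensure_ascii=False` keeps Chinese story text readable in the raw log files.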
## Logging and Observability
Interaction logs are written under:
- [logs/interactions](./logs/interactions)
Each turn record includes at least:
- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot
Example shape:
```json
{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}
```
If you need to debug a bad interaction, the fastest path is:
1. check the log file
2. inspect `nlu_result`
3. inspect the fallback metadata (`used_fallback`)
4. inspect `state_changes`
5. inspect the post-turn snapshot
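Since each session is plain JSONL, steps 1 through 5 usually start with a few lines of ad-hoc Python. A sketch, assuming the record fields shown in the example above:

```python
import json

def load_turns(path: str) -> list[dict]:
    """Read a session's JSONL log back into a list of turn records."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def suspicious_turns(turns: list[dict]) -> list[dict]:
    # Turns that used a fallback or tripped a consistency check are the
    # usual starting point for case-study debugging
    return [t for t in turns
            if t.get("used_fallback") or t.get("consistency_issues")]
```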
## Evaluation Pipeline
Evaluation entry point:
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
Datasets:
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)
Results:
- [evaluation/results](./evaluation/results)
### What each task measures
#### Intent
- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency
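The core metric reduces to a labeled-example loop; this sketch assumes dataset fields named `input` and `intent`, which may not match the actual schema in `run_evaluations.py`:

```python
def intent_accuracy(examples: list, predict) -> float:
    """Illustrative scorer: fraction of examples whose predicted intent
    label matches the gold label."""
    if not examples:
        return 0.0
    hits = sum(1 for ex in examples
               if predict(ex["input"])["intent"] == ex["intent"])
    return hits / len(examples)
```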
#### Consistency
- action guard correctness via `pre_validate_action`
- contradiction detection via `check_consistency`
#### Latency
- NLU latency
- generation latency
- total latency
- fallback rate
#### Branch divergence
- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences
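One simple way to score "how different did the stories get" is set overlap; the sketch below uses character-level Jaccard distance so it works on Chinese text without tokenization. The metric actually used by `run_evaluations.py` may well differ:

```python
def text_divergence(a: str, b: str) -> float:
    """Illustrative divergence score: 1 - Jaccard overlap of character sets.
    0.0 means identical character inventories, 1.0 means fully disjoint."""
    ca, cb = set(a), set(b)
    if not ca and not cb:
        return 0.0
    return 1 - len(ca & cb) / len(ca | cb)
```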
## Common Development Tasks
### Add a new intent
You will usually need to touch:
- [nlu_engine.py](./nlu_engine.py)
- [state_manager.py](./state_manager.py)
- [story_engine.py](./story_engine.py)
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
Suggested checklist:
1. add the label to the NLU logic
2. decide whether it needs pre-validation
3. make sure story prompts know how to handle it
4. add at least a few evaluation examples
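For step 4, a new evaluation example might look like the entry below. The field names are guesses modeled on the interaction-log format, not the dataset's actual schema; check an existing entry in `intent_accuracy.json` before copying this shape.

```json
{
  "input": "向铁匠打听城里的传闻",
  "intent": "TALK",
  "target": "铁匠"
}
```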
### Add a new location, NPC, quest, or item
Most of the time you only need:
- [state_manager.py](./state_manager.py)
That file contains the initial world setup and registry-style data.
### Add more evaluation cases
Edit files under:
- [evaluation/datasets](./evaluation/datasets)
This is the easiest way to improve the report without changing runtime logic.
### Investigate a strange game turn
Check in this order:
1. interaction log under `logs/interactions`
2. `parser_source` in the NLU result
3. `telemetry` in the final story result
4. whether `pre_validate_action` rejected or allowed the turn
5. whether `check_consistency` flagged anything
### Change UI behavior without touching gameplay
Edit:
- [app.py](./app.py)
Try not to put game rules in the UI layer.
## Environment Notes
### If `QWEN_API_KEY` is missing
- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful
### If `openai` is not installed
- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior
### If `gradio` is not installed
- the app cannot launch
- offline evaluation scripts can still be useful
## Current Known Limitations
These are the main gaps we still know about:
- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available
## Suggested Team Workflow
If multiple teammates are working in parallel, this split is usually clean:
- gameplay/state teammate
Focus on [state_manager.py](./state_manager.py)
- prompt/generation teammate
Focus on [story_engine.py](./story_engine.py)
- NLU/evaluation teammate
Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
- UI/demo teammate
Focus on [app.py](./app.py)
- report teammate
Focus on `evaluation/results`, `logs/interactions`, and case-study collection
## What To Use in the Final Report
For the course report, the most useful artifacts from this repo are:
- evaluation JSON outputs under `evaluation/results`
- interaction logs under `logs/interactions`
- dataset files under `evaluation/datasets`
- readable state transitions from `change_log`
- fallback metadata from `telemetry`
These can directly support:
- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis
## License
MIT