---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 4.43.0
app_file: app.py
python_version: '3.10'
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---
StoryWeaver
StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.
This README is written for teammates who need to:
- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first
What This Repository Contains
At a high level, the project has five responsibilities:
- parse player input into structured intent
- keep the world state consistent across turns
- generate the next story response and options
- expose the system through a Gradio UI
- export logs and run reproducible evaluation
This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.
Quick Start
1. Install dependencies
pip install -r requirements.txt
2. Create .env
Create a .env file in the project root:
QWEN_API_KEY=your_api_key_here
Optional:
STORYWEAVER_LOG_DIR=logs/interactions
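The exact loading code lives in utils.py and app.py; as a sketch, the two variables above might be read like this (the load_config helper and its warning message are hypothetical, not the repo's actual code):

```python
import os

def load_config():
    """Read StoryWeaver settings from the environment (hypothetical helper)."""
    api_key = os.environ.get("QWEN_API_KEY")  # required for real model calls
    log_dir = os.environ.get("STORYWEAVER_LOG_DIR", "logs/interactions")  # optional override
    if not api_key:
        print("warning: QWEN_API_KEY is not set; fallback logic will be used")
    return {"api_key": api_key, "log_dir": log_dir}
```
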
3. Run the app
python app.py
Default local URL:
http://localhost:7860
4. Run evaluation
python evaluation/run_evaluations.py --task all --repeats 3
Useful variants:
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
Recommended Reading Order
If you are new to the repo, read files in this order:
- state_manager.py Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
- nlu_engine.py Why: this shows how raw player text becomes structured intent.
- story_engine.py Why: this is the main generation pipeline and fallback logic.
- app.py Why: this connects the UI with the engines and now also writes interaction logs.
- evaluation/run_evaluations.py Why: this shows how we measure the system for the report.
If you only have 10 minutes, start with:
- GameState.pre_validate_action
- GameState.check_consistency
- GameState.apply_changes
- NLUEngine.parse_intent
- StoryEngine.generate_story_stream
- process_user_input in app.py
Repository Map
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
| |-- run_evaluations.py
| |-- datasets/
| `-- results/
`-- logs/
`-- interactions/
Core responsibilities by file:
- app.py Gradio app, session lifecycle, UI callbacks, per-turn logging.
- state_manager.py Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- nlu_engine.py Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- story_engine.py Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- telemetry.py Session metadata and JSONL interaction log export.
- utils.py API client setup, Qwen calls, JSON extraction, retry helpers.
- evaluation/run_evaluations.py Reproducible experiment runner for the report.
System Architecture
The main runtime path is:
Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log
There are two ideas that matter most in this codebase:
1. GameState is the source of truth
Almost everything meaningful lives in state_manager.py:
- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history
When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.
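The serialization side of this idea is GameState.to_prompt, which turns the fields above into prompt-ready text. A minimal sketch of that pattern, assuming a plain dict state (the state_to_prompt function and the exact field names are hypothetical; the real method lives on GameState):

```python
def state_to_prompt(state: dict) -> str:
    # Hypothetical sketch in the spirit of GameState.to_prompt: flatten the
    # state fields into short, prompt-ready lines for the story engine.
    lines = [
        f"Location: {state['location']}",
        f"Time/Weather: {state['time']}, {state['weather']}",
        f"Inventory: {', '.join(state['inventory']) or 'empty'}",
        f"Active quests: {', '.join(state['quests']) or 'none'}",
    ]
    return "\n".join(lines)
```

Keeping serialization in one place means a prompt change never has to touch the UI or NLU layers.
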
2. The app is a coordinator, not the game logic
app.py should mostly:
- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs
If a new feature changes game rules, it probably belongs in state_manager.py or story_engine.py, not in the UI layer.
Runtime Flow
Text input flow
For normal text input, the path is:
- process_user_input receives raw text from the UI
- NLUEngine.parse_intent converts it into a structured intent dict
- GameState.pre_validate_action blocks clearly invalid actions early
- StoryEngine.generate_story_stream runs the main narrative pipeline
- GameState.check_consistency and apply_changes update state
- UI is refreshed with story text, options, and status panel
- _record_interaction_log writes a JSONL record to disk
Option click flow
Button clicks do not go through full free-text parsing. Instead:
- the selected option is converted to an intent-like dict
- the story engine processes it the same way as text input
- the result is rendered and logged
This is useful because option interactions and free-text interactions now share the same evaluation and observability format.
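The conversion step can be sketched like this (option_to_intent is a hypothetical name, and the exact dict keys are assumptions based on the intent shape described in this README):

```python
def option_to_intent(option_text: str) -> dict:
    # Hypothetical sketch: option clicks bypass free-text parsing and become
    # an intent-like dict that the story engine consumes the same way as
    # parsed text input.
    return {
        "intent": "CUSTOM",
        "target": None,
        "raw_text": option_text,
        "parser_source": "option_click",  # distinguishes clicks from typed input in logs
    }
```
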
Main Modules in More Detail
state_manager.py
This file defines:
- PlayerState
- WorldState
- GameEvent
- GameState
Important methods:
- pre_validate_action: rejects obviously invalid actions before calling the model.
- check_consistency: detects contradictions in proposed state changes.
- apply_changes: applies state changes and returns a readable change log.
- validate: makes sure the resulting state is legal.
- to_prompt: serializes the current game state into prompt-ready text.
When to edit this file:
- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts
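To illustrate the kind of deterministic rule pre_validate_action enforces, here is a minimal sketch assuming a plain dict state (the real method lives on GameState and its signature may differ; the specific checks here are illustrative):

```python
def pre_validate_action(state: dict, intent: dict) -> tuple[bool, str]:
    # Hypothetical guard in the spirit of GameState.pre_validate_action:
    # reject clearly impossible actions before spending a model call.
    target = intent.get("target")
    inventory = state.get("inventory", [])
    if intent["intent"] == "USE_ITEM" and target not in inventory:
        return False, f"Player does not carry {target!r}"
    if intent["intent"] == "EQUIP" and target not in inventory:
        return False, "Cannot equip an item that is not in the inventory"
    return True, "ok"
```

Guards like this keep obviously invalid turns cheap and make the consistency checks downstream easier to reason about.
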
nlu_engine.py
This file is responsible for intent recognition.
Current behavior:
- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict that includes a parser_source field
Current intent labels include:
ATTACK, TALK, MOVE, EXPLORE, USE_ITEM, TRADE, EQUIP, REST, QUEST, SKILL, PICKUP, FLEE, CUSTOM
When to edit this file:
- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling
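The keyword fallback might look roughly like this (the KEYWORD_RULES table and the naive substring matching are assumptions for illustration; the real rules live in nlu_engine.py):

```python
# Hypothetical keyword fallback in the spirit of nlu_engine.py: first rule
# whose keyword appears in the text wins; otherwise fall back to CUSTOM.
KEYWORD_RULES = {
    "attack": "ATTACK",
    "fight": "ATTACK",
    "talk": "TALK",
    "rest": "REST",
    "flee": "FLEE",
}

def keyword_fallback(text: str) -> dict:
    lowered = text.lower()
    for keyword, label in KEYWORD_RULES.items():
        if keyword in lowered:
            return {"intent": label, "parser_source": "keyword"}
    return {"intent": "CUSTOM", "parser_source": "keyword"}
```

Naive substring matching is easy to fool, which is one reason the README suggests improving the fallback as a development task.
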
story_engine.py
This is the main generation module.
It currently handles:
- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode
Important methods:
- generate_opening_stream
- generate_story
- generate_story_stream
- process_option_selection_stream
- _fallback_response
When to edit this file:
- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry
app.py
This file is the UI entry point and interaction orchestrator.
Important responsibilities:
- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs
When to edit this file:
- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed
telemetry.py
This file handles structured log export.
It is intentionally simple and file-based:
- one session gets one JSONL file
- one turn becomes one JSON object line
This is useful for:
- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation
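The one-session-one-file, one-turn-one-line scheme can be sketched as follows (append_turn_record is a hypothetical name; the real exporter lives in telemetry.py):

```python
import json
from pathlib import Path

def append_turn_record(session_id: str, record: dict, log_dir: str = "logs/interactions") -> Path:
    # Hypothetical sketch of the JSONL export described above: one file per
    # session, one JSON object per line per turn.
    path = Path(log_dir) / f"{session_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Appending one line per turn means a crashed session still leaves every completed turn on disk.
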
Logging and Observability
Interaction logs are written under logs/interactions/ by default (override with STORYWEAVER_LOG_DIR).
Each turn record includes at least:
- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot
Example shape:
{
"timestamp": "2026-03-14T18:55:00",
"session_id": "sw-20260314-185500-ab12cd34",
"turn_index": 3,
"input_source": "text_input",
"user_input": "和村长老伯谈谈最近森林里的怪事",
"nlu_result": {
"intent": "TALK",
"target": "村长老伯",
"parser_source": "llm"
},
"latency_ms": 842.13,
"used_fallback": false,
"state_changes": {},
"output_text": "...",
"post_turn_snapshot": {
"location": "村庄广场"
}
}
If you need to debug a bad interaction, the fastest path is:
- check the log file
- inspect nlu_result
- inspect telemetry.used_fallback
- inspect state_changes
- inspect the post-turn snapshot
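A small triage script can do the first pass for you. This sketch assumes the record shape shown above (the suspicious_turns helper and the consistency_issues key name are assumptions):

```python
import json
from pathlib import Path

def suspicious_turns(log_path: str) -> list[dict]:
    # Hypothetical triage helper for the debugging steps above: surface turns
    # that used a fallback or tripped consistency checks.
    records = [
        json.loads(line)
        for line in Path(log_path).read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return [r for r in records if r.get("used_fallback") or r.get("consistency_issues")]
```
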
Evaluation Pipeline
Evaluation entry point: evaluation/run_evaluations.py
Datasets:
- evaluation/datasets/intent_accuracy.json
- evaluation/datasets/consistency.json
- evaluation/datasets/latency.json
- evaluation/datasets/branch_divergence.json
Results are written under evaluation/results/.
What each task measures
Intent
- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency
Consistency
- action guard correctness via pre_validate_action
- contradiction detection via check_consistency
Latency
- NLU latency
- generation latency
- total latency
- fallback rate
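Fallback rate is straightforward to compute from the JSONL records (fallback_rate is a hypothetical helper name; the real computation lives in the evaluation runner):

```python
def fallback_rate(records: list[dict]) -> float:
    # Hypothetical metric helper: fraction of turns whose log record has
    # used_fallback set to true.
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("used_fallback")) / len(records)
```
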
Branch divergence
- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences
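The snapshot comparison can be done with a simple key-wise diff (snapshot_diff is a hypothetical helper; the real comparison logic lives in the evaluation runner):

```python
def snapshot_diff(a: dict, b: dict) -> dict:
    # Hypothetical helper for the branch-divergence comparison: report each
    # key whose value differs between two post-turn state snapshots.
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```
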
Common Development Tasks
Add a new intent
You will usually need to touch nlu_engine.py, and often state_manager.py, story_engine.py, and the files under evaluation/datasets/ as well.
Suggested checklist:
- add the label to the NLU logic
- decide whether it needs pre-validation
- make sure story prompts know how to handle it
- add at least a few evaluation examples
Add a new location, NPC, quest, or item
Most of the time you only need state_manager.py. That file contains the initial world setup and registry-style data.
Add more evaluation cases
Edit files under evaluation/datasets/.
This is the easiest way to improve the report without changing runtime logic.
Investigate a strange game turn
Check in this order:
- the interaction log under logs/interactions
- parser_source in the NLU result
- telemetry in the final story result
- whether pre_validate_action rejected or allowed the turn
- whether check_consistency flagged anything
Change UI behavior without touching gameplay
Edit app.py.
Try not to put game rules in the UI layer.
Environment Notes
If QWEN_API_KEY is missing
- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful
If openai is not installed
- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior
If gradio is not installed
- the app cannot launch
- offline evaluation scripts can still be useful
Current Known Limitations
These are the main gaps we still know about:
- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available
Suggested Team Workflow
If multiple teammates are working in parallel, this split is usually clean:
- gameplay/state teammate Focus on state_manager.py
- prompt/generation teammate Focus on story_engine.py
- NLU/evaluation teammate Focus on nlu_engine.py and evaluation
- UI/demo teammate Focus on app.py
- report teammate Focus on evaluation/results, logs/interactions, and case-study collection
What To Use in the Final Report
For the course report, the most useful artifacts from this repo are:
- evaluation JSON outputs under evaluation/results
- interaction logs under logs/interactions
- dataset files under evaluation/datasets
- readable state transitions from change_log
- fallback metadata from telemetry
These can directly support:
- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis
License
MIT