wzh0617 committed on
Commit c0abd39 · verified · 1 Parent(s): 7d50051

Upload 3 files

Files changed (3)
  1. README.md +557 -558
  2. app.py +0 -0
  3. requirements.txt +4 -5
README.md CHANGED
@@ -1,558 +1,557 @@
- ---
- title: StoryWeaver
- emoji: 📖
- colorFrom: red
- colorTo: purple
- sdk: gradio
- sdk_version: 4.44.0
- python_version: "3.10"
- app_file: app.py
- pinned: false
- license: mit
- short_description: Interactive NLP story engine with evaluation and logging
- ---
- [... remainder of the previous README body: identical to the new version below except for the front matter above ...]
 
---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 6.7.0
app_file: app.py
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---

# StoryWeaver

StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.

This README is written for teammates who need to:

- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first

## What This Repository Contains

At a high level, the project has five responsibilities:

1. parse player input into structured intent
2. keep the world state consistent across turns
3. generate the next story response and options
4. expose the system through a Gradio UI
5. export logs and run reproducible evaluation

This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Create `.env`

Create a `.env` file in the project root:

```env
QWEN_API_KEY=your_api_key_here
```

Optional:

```env
STORYWEAVER_LOG_DIR=logs/interactions
```
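
At startup the app reads these values from the environment. A minimal sketch of that pattern, assuming `python-dotenv` from `requirements.txt` is used for loading; the `load_config` helper and its return shape are illustrative, not the repo's actual API:

```python
import os

try:
    from dotenv import load_dotenv  # python-dotenv, listed in requirements.txt
except ImportError:  # keep the sketch importable without the package
    def load_dotenv() -> None:
        pass

def load_config() -> dict:
    """Read StoryWeaver settings from .env / the process environment."""
    load_dotenv()  # no-op if there is no .env file
    return {
        "api_key": os.getenv("QWEN_API_KEY"),  # required for real generation
        "log_dir": os.getenv("STORYWEAVER_LOG_DIR", "logs/interactions"),
    }
```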

### 3. Run the app

```bash
python app.py
```

Default local URL:

- `http://localhost:7860`

### 4. Run evaluation

```bash
python evaluation/run_evaluations.py --task all --repeats 3
```

Useful variants:

```bash
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
```

## Deploy to Hugging Face Spaces (Web Upload)

If you want to deploy quickly without using git commands, use this checklist:

1. Create a new Space on Hugging Face:
   - SDK: `Gradio`
   - Python version: default is fine (3.10+)
2. Upload project files from this repository root.
3. Do **not** upload local-only files/directories:
   - `venv/`, `.venv/`, `.env`, `__pycache__/`, `.gradio/`, `logs/`, `evaluation/results/`
4. In Space settings, add the secret:
   - `QWEN_API_KEY`
5. Wait for the build to finish, then open the Space URL.

This repository already uses the standard Gradio entrypoint in `app.py`, so Spaces will start the app automatically.

## Recommended Reading Order

If you are new to the repo, read files in this order:

1. [state_manager.py](./state_manager.py)
   Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
2. [nlu_engine.py](./nlu_engine.py)
   Why: this shows how raw player text becomes structured intent.
3. [story_engine.py](./story_engine.py)
   Why: this is the main generation pipeline and fallback logic.
4. [app.py](./app.py)
   Why: this connects the UI with the engines and now also writes interaction logs.
5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
   Why: this shows how we measure the system for the report.

If you only have 10 minutes, start with:

- `GameState.pre_validate_action`
- `GameState.check_consistency`
- `GameState.apply_changes`
- `NLUEngine.parse_intent`
- `StoryEngine.generate_story_stream`
- `process_user_input` in [app.py](./app.py)

## Repository Map

```text
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/
```

Core responsibilities by file:

- [app.py](./app.py)
  Gradio app, session lifecycle, UI callbacks, per-turn logging.
- [state_manager.py](./state_manager.py)
  Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- [nlu_engine.py](./nlu_engine.py)
  Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- [story_engine.py](./story_engine.py)
  Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- [telemetry.py](./telemetry.py)
  Session metadata and JSONL interaction log export.
- [utils.py](./utils.py)
  API client setup, Qwen calls, JSON extraction, retry helpers.
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
  Reproducible experiment runner for the report.

## System Architecture

The main runtime path is:

`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`

There are two ideas that matter most in this codebase:

### 1. `GameState` is the source of truth

Almost everything meaningful lives in [state_manager.py](./state_manager.py):

- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history

When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.

### 2. The app is a coordinator, not the game logic

[app.py](./app.py) should mostly:

- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs

If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.

## Runtime Flow

### Text input flow

For normal text input, the path is:

1. `process_user_input` receives raw text from the UI
2. `NLUEngine.parse_intent` converts it into a structured intent dict
3. `GameState.pre_validate_action` blocks clearly invalid actions early
4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
5. `GameState.check_consistency` and `apply_changes` update state
6. the UI is refreshed with story text, options, and the status panel
7. `_record_interaction_log` writes a JSONL record to disk
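
The steps above can be sketched as one coordinator function. Everything here is a stand-in for illustration; the real signatures live in `app.py`, `nlu_engine.py`, `story_engine.py`, and `state_manager.py`:

```python
def run_turn(text, parse_intent, pre_validate, generate, apply_changes, write_log):
    """One turn of the text-input path (steps 1-7), with engines passed as callables."""
    intent = parse_intent(text)                # step 2: structured intent dict
    allowed, reason = pre_validate(intent)     # step 3: early rejection
    if not allowed:
        record = {"user_input": text, "nlu_result": intent,
                  "blocked": True, "output_text": reason}
    else:
        output, changes = generate(intent)     # step 4: narrative pipeline
        change_log = apply_changes(changes)    # step 5: state update
        record = {"user_input": text, "nlu_result": intent, "blocked": False,
                  "state_changes": changes, "change_log": change_log,
                  "output_text": output}
    write_log(record)                          # step 7: one JSONL record per turn
    return record
```

The point of the shape is that the coordinator owns no game rules: every decision is delegated to the engine callables.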

### Option click flow

Button clicks do not go through full free-text parsing. Instead:

1. the selected option is converted to an intent-like dict
2. the story engine processes it the same way as text input
3. the result is rendered and logged

This is useful because option interactions and free-text interactions now share the same evaluation and observability format.

## Main Modules in More Detail

### `state_manager.py`

This file defines:

- `PlayerState`
- `WorldState`
- `GameEvent`
- `GameState`

Important methods:

- `pre_validate_action`
  Rejects obviously invalid actions before calling the model.
- `check_consistency`
  Detects contradictions in proposed state changes.
- `apply_changes`
  Applies state changes and returns a readable change log.
- `validate`
  Makes sure the resulting state is legal.
- `to_prompt`
  Serializes the current game state into prompt-ready text.

When to edit this file:

- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts
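
As an illustration of the kind of deterministic rule that belongs here, a hypothetical pre-validation check; the real `GameState.pre_validate_action` has its own rules and signature:

```python
def pre_validate_action(intent: dict, inventory: set) -> tuple:
    """Reject clearly invalid actions before any model call (illustrative rules only)."""
    target = intent.get("target")
    if intent.get("intent") == "USE_ITEM" and target not in inventory:
        return False, f"You are not carrying: {target}"
    if intent.get("intent") == "EQUIP" and target not in inventory:
        return False, f"You cannot equip an item you do not have: {target}"
    return True, ""
```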

### `nlu_engine.py`

This file is responsible for intent recognition.

Current behavior:

- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict with `parser_source`

Current intent labels include:

- `ATTACK`
- `TALK`
- `MOVE`
- `EXPLORE`
- `USE_ITEM`
- `TRADE`
- `EQUIP`
- `REST`
- `QUEST`
- `SKILL`
- `PICKUP`
- `FLEE`
- `CUSTOM`

When to edit this file:

- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling
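
The keyword fallback path might look roughly like this. The keyword lists are made up for illustration; only the intent labels and the `parser_source` field come from this repo:

```python
# Assumed keyword table; the real rules live in nlu_engine.py.
KEYWORD_RULES = [
    ("ATTACK", ("attack", "fight", "strike")),
    ("TALK", ("talk", "speak", "ask ")),
    ("FLEE", ("flee", "run away", "escape")),
    ("USE_ITEM", ("use ", "drink")),
]

def keyword_fallback(text: str) -> dict:
    """Map raw text to an intent dict when LLM parsing is unavailable."""
    lowered = text.lower()
    for intent, keywords in KEYWORD_RULES:
        if any(k in lowered for k in keywords):
            return {"intent": intent, "parser_source": "keyword"}
    return {"intent": "CUSTOM", "parser_source": "keyword"}  # nothing matched
```

First match wins, so rule order doubles as a crude priority scheme.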

### `story_engine.py`

This is the main generation module.

It currently handles:

- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode

Important methods:

- `generate_opening_stream`
- `generate_story`
- `generate_story_stream`
- `process_option_selection_stream`
- `_fallback_response`

When to edit this file:

- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry

### `app.py`

This file is the UI entry point and interaction orchestrator.

Important responsibilities:

- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs

When to edit this file:

- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed

### `telemetry.py`

This file handles structured log export.

It is intentionally simple and file-based:

- one session gets one JSONL file
- one turn becomes one JSON object line

This is useful for:

- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation
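
The one-session-one-file, one-turn-one-line scheme can be sketched like this; the helper name and path layout are assumptions, so see `telemetry.py` for the real implementation:

```python
import json
from pathlib import Path

def append_turn(log_dir: str, session_id: str, record: dict) -> Path:
    """Append one turn as one JSON line to the session's JSONL file."""
    path = Path(log_dir) / f"{session_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)  # create the log dir lazily
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # keep CJK text readable
    return path
```

Appending line-by-line means a crashed session still leaves every completed turn on disk.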

## Logging and Observability

Interaction logs are written under:

- [logs/interactions](./logs/interactions)

Each turn record includes at least:

- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot

Example shape:

```json
{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}
```

If you need to debug a bad interaction, the fastest path is:

1. check the log file
2. inspect `nlu_result`
3. inspect `telemetry.used_fallback`
4. inspect `state_changes`
5. inspect the post-turn snapshot
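
For the fallback check in that debug path, a quick way to score a whole session file, assuming the flat `used_fallback` field shown in the example record:

```python
import json

def fallback_rate(jsonl_lines) -> float:
    """Fraction of turns that used fallback, computed from raw JSONL lines."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("used_fallback")) / len(records)
```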

## Evaluation Pipeline

Evaluation entry point:

- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)

Datasets:

- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)

Results:

- [evaluation/results](./evaluation/results)

### What each task measures

#### Intent

- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency

#### Consistency

- action guard correctness via `pre_validate_action`
- contradiction detection via `check_consistency`

#### Latency

- NLU latency
- generation latency
- total latency
- fallback rate

#### Branch divergence

- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences
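
The core of the intent task is a labeled-examples-versus-predictions comparison. A sketch of that scoring loop; the `text`/`intent` field names are assumptions about the dataset schema, not its documented format:

```python
def intent_accuracy(examples, predict) -> float:
    """Share of labeled examples whose predicted intent matches the gold label."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples if predict(ex["text"])["intent"] == ex["intent"]
    )
    return correct / len(examples)
```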

## Common Development Tasks

### Add a new intent

You will usually need to touch:

- [nlu_engine.py](./nlu_engine.py)
- [state_manager.py](./state_manager.py)
- [story_engine.py](./story_engine.py)
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)

Suggested checklist:

1. add the label to the NLU logic
2. decide whether it needs pre-validation
3. make sure story prompts know how to handle it
4. add at least a few evaluation examples

### Add a new location, NPC, quest, or item

Most of the time you only need:

- [state_manager.py](./state_manager.py)

That file contains the initial world setup and registry-style data.

### Add more evaluation cases

Edit files under:

- [evaluation/datasets](./evaluation/datasets)

This is the easiest way to improve the report without changing runtime logic.

### Investigate a strange game turn

Check in this order:

1. the interaction log under `logs/interactions`
2. `parser_source` in the NLU result
3. `telemetry` in the final story result
4. whether `pre_validate_action` rejected or allowed the turn
5. whether `check_consistency` flagged anything

### Change UI behavior without touching gameplay

Edit:

- [app.py](./app.py)

Try not to put game rules in the UI layer.

## Environment Notes

### If `QWEN_API_KEY` is missing

- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful

### If `openai` is not installed

- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior

### If `gradio` is not installed

- the app cannot launch
- offline evaluation scripts can still be useful

## Current Known Limitations

These are the main gaps we still know about:

- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available

## Suggested Team Workflow

If multiple teammates are working in parallel, this split is usually clean:

- gameplay/state teammate
  Focus on [state_manager.py](./state_manager.py)
- prompt/generation teammate
  Focus on [story_engine.py](./story_engine.py)
- NLU/evaluation teammate
  Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
- UI/demo teammate
  Focus on [app.py](./app.py)
- report teammate
  Focus on `evaluation/results`, `logs/interactions`, and case-study collection

## What To Use in the Final Report

For the course report, the most useful artifacts from this repo are:

- evaluation JSON outputs under `evaluation/results`
- interaction logs under `logs/interactions`
- dataset files under `evaluation/datasets`
- readable state transitions from `change_log`
- fallback metadata from `telemetry`

These can directly support:

- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis

## License

MIT
 
app.py CHANGED
The diff for this file is too large to render. See raw diff
 
requirements.txt CHANGED
@@ -1,5 +1,4 @@
- gradio==4.44.0
- huggingface_hub==0.25.2
- openai==1.51.2
- python-dotenv==1.0.1
- pydantic==2.9.2
+ openai>=1.0.0
+ gradio==6.7.0
+ python-dotenv>=1.0.0
+ pydantic>=2.0.0