Spaces:
Sleeping
Sleeping
| title: StateBench Explorer | |
| emoji: ๐ | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 5.9.1 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| datasets: | |
| - parslee/statebench | |
| # StateBench Explorer | |
| Interactive inspection tool for the [StateBench](https://huggingface.co/datasets/parslee/statebench) benchmark. | |
| ## Features | |
| - **Browse timelines** from train/validation/test splits | |
| - **Filter by track** (13 evaluation tracks) | |
| - **View events**: conversation turns, state writes, supersessions, queries | |
| - **Inspect ground truth**: expected decisions, must mention, must not mention | |
| - **Compare baselines**: see context built by different memory strategies | |
| ## Usage | |
| 1. Select a **split** (test, validation, train) | |
| 2. Optionally filter by **track** | |
| 3. Choose a **timeline ID** | |
| 4. Select a **baseline** to see how it builds context | |
| 5. Click **Inspect Timeline** | |
| ## Tracks | |
| | Track | Description | | |
| |-------|-------------| | |
| | supersession | Facts invalidated by newer information | | |
| | commitment_durability | Commitments survive interruptions | | |
| | scope_permission | Role-based access control | | |
| | causality | Multi-constraint dependencies | | |
| | ... | See dataset card for full list | | |
| ## Links | |
| - [Dataset](https://huggingface.co/datasets/parslee/statebench) | |
| - [GitHub](https://github.com/Parslee-ai/statebench) | |
| - [Evaluation Guide](https://github.com/Parslee-ai/statebench#evaluation) | |