title: LLM Assessment Explorer
emoji: 🫣
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: LLM moderation profiles and judges classification
datasets:
- PITTI/speechmap-questions
- PITTI/speechmap-responses-v3
- PITTI/speechmap-assessments-v3
LLM Assessment Explorer
An interactive TypeScript app for exploring and comparing Large Language Model (LLM) assessments. This tool visualizes how different "judge" models classify the same LLM-generated responses, making it easier to examine inter-rater reliability and model behavior.
Core Features
- Compare Any Two Judges: Select any two LLM judges from the dataset to compare their assessments side-by-side.
- Filter by Theme: Narrow down the analysis to specific topics or domains by filtering by question theme.
- Sankey Chart: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2.
- Transition Matrix (Heatmap): Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges.
- Drill-Down to Details: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges.
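To make the Sankey chart and transition matrix features above more concrete, here is a minimal sketch of how paired assessments could be tallied. It is not the app's actual code: the `PairedAssessment` shape, the function names, the theme values, and the category labels in the sample are all assumptions made for illustration.

```typescript
// Hypothetical shape: one row per response that both selected judges assessed.
interface PairedAssessment {
  responseId: string;
  theme: string;
  judge1Label: string;
  judge2Label: string;
}

// matrix[judge1Label][judge2Label] = number of items with that pair of labels.
type TransitionMatrix = Record<string, Record<string, number>>;

// One flow in a Sankey chart: Judge 1 category -> Judge 2 category.
interface SankeyLink {
  source: string;
  target: string;
  value: number;
}

// Tally label pairs, optionally restricted to a single question theme.
function buildTransitionMatrix(
  items: PairedAssessment[],
  theme?: string,
): TransitionMatrix {
  const matrix: TransitionMatrix = {};
  for (const item of items) {
    if (theme !== undefined && item.theme !== theme) continue;
    const row = matrix[item.judge1Label] ?? (matrix[item.judge1Label] = {});
    row[item.judge2Label] = (row[item.judge2Label] ?? 0) + 1;
  }
  return matrix;
}

// Flatten the matrix into source -> target links for a Sankey chart.
function toSankeyLinks(matrix: TransitionMatrix): SankeyLink[] {
  const links: SankeyLink[] = [];
  for (const source of Object.keys(matrix)) {
    for (const target of Object.keys(matrix[source])) {
      links.push({ source, target, value: matrix[source][target] });
    }
  }
  return links;
}

// Share of items on which both judges chose the same category.
function agreementRate(links: SankeyLink[]): number {
  let total = 0;
  let agree = 0;
  for (const link of links) {
    total += link.value;
    if (link.source === link.target) agree += link.value;
  }
  return total === 0 ? 0 : agree / total;
}

// Illustrative data only; the labels and themes are placeholders.
const sample: PairedAssessment[] = [
  { responseId: "r1", theme: "t1", judge1Label: "COMPLETE", judge2Label: "COMPLETE" },
  { responseId: "r2", theme: "t1", judge1Label: "COMPLETE", judge2Label: "EVASIVE" },
  { responseId: "r3", theme: "t2", judge1Label: "DENIAL", judge2Label: "DENIAL" },
  { responseId: "r4", theme: "t2", judge1Label: "EVASIVE", judge2Label: "DENIAL" },
];

const allThemes = buildTransitionMatrix(sample);
console.log(agreementRate(toSankeyLinks(allThemes))); // 0.5
```

The same counts feed both views: the nested record maps directly onto heatmap cells, while the flattened links are the form most Sankey chart libraries expect. Raw agreement is only a rough reliability indicator, since it does not correct for agreement expected by chance.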
Speechmap Data
This application explores datasets derived from xlr8harder's Speechmap and llm-compliance projects. The data has been indexed and aggregated for efficient exploration.
The underlying datasets on Hugging Face include:
- 2.4k questions: speechmap-questions
- 369k responses: speechmap-responses
- 2.07k LLM-judge assessments: speechmap-assessments
- The assessment dataset combines the original assessments from the Speechmap project by gpt-4o, assessments by mistral-small-3.1-2503, mistral-small-3.2-2506, gemma3-27b-it, deepseek-v3.2, qwen3-next-80B-A3B-instruct, and manual annotations.
Quick Start
Prerequisites
You need Node.js (which includes npm) installed on your machine, version 20.15.1 or later.
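If you want to verify the version requirement programmatically, the following is a minimal sketch of a version comparison. The helper names are hypothetical and not part of this project; pre-release tags are ignored for simplicity.

```typescript
// Minimum version stated in the prerequisites above.
const MIN_NODE_VERSION = "20.15.1";

// Parse "v20.15.1" or "20.15.1" into numeric parts [major, minor, patch].
function parseVersion(version: string): number[] {
  return version
    .replace(/^v/, "")
    .split(".")
    .map((part) => parseInt(part, 10) || 0);
}

// True when `current` is greater than or equal to `minimum`.
function meetsMinimum(current: string, minimum: string): boolean {
  const a = parseVersion(current);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const x = a[i] ?? 0; // missing parts count as 0
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true;
}
```

In a Node script, `meetsMinimum(process.version, MIN_NODE_VERSION)` would perform the check against the running interpreter. From a terminal, `node --version` gives the same information.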
Installation & Setup
1. Clone the repository:
   git clone https://github.com/pappitti/speechmap-judges.git
   cd speechmap-judges

Vite Dev Mode

2. Install dependencies:
   npm install
3. Fetch data and build the database:
   This command downloads the Parquet datasets from Hugging Face and creates a local database.duckdb file at the root of the project.
   npm run db:rebuild
   Note: this project includes a branch running on duckdb-wasm. That branch does not require step 3: you can run npm run dev directly after npm install (or npm run build and then npm run preview for production). However, that branch was never merged into the main branch because database persistence is tricky with duckdb-wasm. For now, the database must be rebuilt each time the app starts, which makes for poor UX. IndexedDB is not an option; more work is required on that branch. Also, duckdb-wasm is not as fast as expected for a database of this size.
4. Run the application:
   This command starts the React frontend development server.
   npm run dev
   Open http://localhost:5173 (or the URL provided in your terminal) to view it in your browser.
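For readers curious about what the database build step (npm run db:rebuild) involves conceptually, here is a minimal sketch of generating DuckDB statements that load Parquet files into tables. This is not the project's actual build script: the table names, the local file paths, and the helper functions are assumptions for illustration. The only DuckDB features relied on are `CREATE OR REPLACE TABLE ... AS` and the `read_parquet` table function.

```typescript
// Hypothetical mapping from a table name to a downloaded Parquet file.
interface ParquetSource {
  table: string;
  path: string;
}

// Quote a SQL identifier, doubling any embedded double quotes.
function quoteIdentifier(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Quote a SQL string literal, doubling any embedded single quotes.
function quoteLiteral(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// Build one statement per source that materializes the Parquet file as a table.
function buildImportStatements(sources: ParquetSource[]): string[] {
  return sources.map(
    (s) =>
      `CREATE OR REPLACE TABLE ${quoteIdentifier(s.table)} ` +
      `AS SELECT * FROM read_parquet(${quoteLiteral(s.path)});`,
  );
}

// Placeholder paths; the real dataset file layout is not shown in this README.
const statements = buildImportStatements([
  { table: "questions", path: "./data/questions.parquet" },
  { table: "responses", path: "./data/responses.parquet" },
  { table: "assessments", path: "./data/assessments.parquet" },
]);

for (const sql of statements) {
  console.log(sql);
}
```

The generated statements could then be executed with any DuckDB client against the local database file. Materializing the Parquet data into tables once, rather than querying the files on every request, is one plausible reason a rebuild step like this exists.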
Production Build (Docker)
Build the image:
   docker build -t speechmap-judges-prod .
Run the application:
   docker run -p 7860:7860 --rm --name speechmap-judges-container speechmap-judges-prod
Open http://localhost:7860 to view it in your browser.
Acknowledgments
Whether you want to promote free speech or moderation, understanding biases in LLMs—and in the case of this project, biases in LLM-judges—is critical. Against this backdrop, the Speechmap project by xlr8harder is a very important initiative.