---
title: LLM Assessment Explorer
emoji: 🫣
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: LLM moderation profiles and judges classification
datasets:
- PITTI/speechmap-questions
- PITTI/speechmap-responses-v3
- PITTI/speechmap-assessments-v3
---
# LLM Assessment Explorer
[Speechmap-judges Demo](https://github.com/user-attachments/assets/f94f0ef9-7ad6-419d-823a-56e828061092)
An interactive TypeScript app for exploring and comparing Large Language Model (LLM) assessments. It visualizes how different "judge" models classify the same LLM-generated responses, surfacing inter-rater agreement, disagreement, and judge behavior.
### Core Features
* **Compare Any Two Judges**: Select any two LLM judges from the dataset to compare their assessments side-by-side.
* **Filter by Theme**: Narrow down the analysis to specific topics or domains by filtering by question theme.
* **Sankey Chart**: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2.
* **Transition Matrix (Heatmap)**: Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges.
* **Drill-Down to Details**: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges.
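To illustrate what the transition matrix view computes, here is a minimal Python sketch that tallies how one judge's labels map onto another's for the same responses. The labels below (`COMPLETE`, `EVASIVE`, `DENIAL`) are placeholders and may not match the project's actual label set.

```python
from collections import Counter

# Hypothetical label set; the actual labels used by the project may differ.
LABELS = ["COMPLETE", "EVASIVE", "DENIAL"]

def transition_matrix(judge1, judge2):
    """Count how Judge 1's labels map onto Judge 2's labels
    for the same ordered list of responses."""
    counts = Counter(zip(judge1, judge2))
    return {a: {b: counts[(a, b)] for b in LABELS} for a in LABELS}

# Paired assessments of the same five responses.
j1 = ["COMPLETE", "COMPLETE", "EVASIVE", "DENIAL", "COMPLETE"]
j2 = ["COMPLETE", "EVASIVE", "EVASIVE", "DENIAL", "COMPLETE"]

matrix = transition_matrix(j1, j2)
# Diagonal cells are agreements; off-diagonal cells are reclassifications.
agreement = sum(matrix[label][label] for label in LABELS) / len(j1)
print(matrix["COMPLETE"])  # {'COMPLETE': 2, 'EVASIVE': 1, 'DENIAL': 0}
print(agreement)           # 0.8
```

The Sankey chart in the app shows the same counts as flows from Judge 1's categories into Judge 2's.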
## Speechmap Data
This application explores datasets derived from xlr8harder's [Speechmap](https://speechmap.ai/) and [llm-compliance](https://github.com/xlr8harder/llm-compliance) projects. The data has been indexed and aggregated for efficient exploration.
The underlying dataset from HuggingFace includes:
* **2.4k questions**: [speechmap-questions](https://huggingface.co/datasets/PITTI/speechmap-questions)
* **369k responses**: [speechmap-responses](https://huggingface.co/datasets/PITTI/speechmap-responses-v3)
* **2.07k LLM-judge assessments**: [speechmap-assessments](https://huggingface.co/datasets/PITTI/speechmap-assessments-v3)
* The assessment dataset combines the Speechmap project's original `gpt-4o` assessments with assessments by `mistral-small-3.1-2503`, `mistral-small-3.2-2506`, `gemma3-27b-it`, `deepseek-v3.2`, and `qwen3-next-80B-A3B-instruct`, plus manual annotations.
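Conceptually, the three datasets link questions to responses to assessments. The sketch below illustrates that relationship in Python with hypothetical field names; the actual Parquet schemas may differ.

```python
# Hypothetical record shapes showing how the three datasets relate;
# the real column names in the Parquet files may differ.
question = {"question_id": "q1", "theme": "politics", "text": "..."}
response = {"response_id": "r1", "question_id": "q1",
            "model": "some-model", "text": "..."}
assessment = {"response_id": "r1", "judge": "gpt-4o", "label": "COMPLETE"}

def assessments_for_question(qid, responses, assessments):
    """Collect every judge assessment attached to responses for one question."""
    rids = {r["response_id"] for r in responses if r["question_id"] == qid}
    return [a for a in assessments if a["response_id"] in rids]

print(assessments_for_question("q1", [response], [assessment]))
```

Each question has many responses (one per answering model), and each response can have several assessments (one per judge), which is what makes judge-vs-judge comparison possible.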
## Quick Start
### Prerequisites
You need [Node.js](https://nodejs.org/) (which includes npm) installed on your machine, version >=20.15.1.
### Installation & Setup
1. **Clone the repository:**
```sh
git clone https://github.com/pappitti/speechmap-judges.git
cd speechmap-judges
```
2. **Vite Dev Mode**
**Install Dependencies:**
```sh
npm install
```
**Fetch Data and Build the Database:**
This command downloads the Parquet datasets from Hugging Face and creates a local `database.duckdb` file at the root of the project.
```sh
npm run db:rebuild
```
This project includes a branch that runs on duckdb-wasm and does not require this database-building step: you can run `npm run dev` directly after `npm install` (or `npm run build` followed by `npm run preview` for production). That branch was never merged into main because database persistence is tricky with duckdb-wasm: as it stands, the database must be rebuilt every time the app starts, which is poor UX, and IndexedDB is not an option. More work is required on that branch.
_Also, duckdb-wasm is not as fast as expected for a database of this size._
**Run the application:**
This command starts the React frontend development server.
```sh
npm run dev
```
Open [http://localhost:5173](http://localhost:5173) (or the URL provided in your terminal) to view it in your browser.
3. **Production Build (Docker)**
```sh
docker build -t speechmap-judges-prod .
```
**Run the application:**
```sh
docker run -p 7860:7860 --rm --name speechmap-judges-container speechmap-judges-prod
```
Open [http://localhost:7860](http://localhost:7860) to view it in your browser.
## Acknowledgments
Whether you want to promote free speech or moderation, understanding biases in LLMs—and in the case of this project, biases in LLM-judges—is critical. Against this backdrop, the Speechmap project by xlr8harder is a very important initiative.