---
title: LLM Assessment Explorer
emoji: 🫣
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: LLM moderation profiles and judges classification
datasets:
  - PITTI/speechmap-questions
  - PITTI/speechmap-responses-v3
  - PITTI/speechmap-assessments-v3
---

# LLM Assessment Explorer

[Speechmap-judges Demo](https://github.com/user-attachments/assets/f94f0ef9-7ad6-419d-823a-56e828061092)

An interactive TypeScript app for exploring and comparing differences in Large Language Model (LLM) assessments. This tool helps visualize how different "judge" models classify the same LLM-generated responses, providing deep insights into inter-rater reliability and model behavior.

### Core Features

* **Compare Any Two Judges**: Select any two LLM judges from the dataset to compare their assessments side-by-side.
* **Filter by Theme**: Narrow down the analysis to specific topics or domains by filtering by question theme.
* **Sankey Chart**: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2.
* **Transition Matrix (Heatmap)**: Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges.
* **Drill-Down to Details**: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges.

## Speechmap Data

This application explores datasets derived from xlr8harder's [Speechmap](https://speechmap.ai/) and [llm-compliance](https://github.com/xlr8harder/llm-compliance) projects. The data has been indexed and aggregated for efficient exploration.
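
To illustrate the kind of aggregation behind the Sankey chart and the transition matrix, here is a minimal TypeScript sketch that counts how responses given one label by a first judge are labeled by a second. The `Assessment` shape, its field names, and the `transitionMatrix` function are assumptions made for illustration; they are not taken from the app's code or from the dataset schema.

```typescript
// Hypothetical record shape: the real dataset's column names may differ.
interface Assessment {
  responseId: string; // identifier of the assessed LLM response
  judge: string;      // name of the judge model (or human annotator)
  label: string;      // category assigned by that judge
}

// Count how responses labeled `from` by judge1 were labeled `to` by judge2.
// Only responses assessed by both judges contribute to the counts.
function transitionMatrix(
  rows: Assessment[],
  judge1: string,
  judge2: string,
): Map<string, Map<string, number>> {
  // Index judge1's label for each response.
  const byJudge1 = new Map<string, string>();
  for (const r of rows) {
    if (r.judge === judge1) byJudge1.set(r.responseId, r.label);
  }

  // Walk judge2's assessments and increment the matching cell.
  const matrix = new Map<string, Map<string, number>>();
  for (const r of rows) {
    if (r.judge !== judge2) continue;
    const from = byJudge1.get(r.responseId);
    if (from === undefined) continue; // judge1 never assessed this response
    const row = matrix.get(from) ?? new Map<string, number>();
    row.set(r.label, (row.get(r.label) ?? 0) + 1);
    matrix.set(from, row);
  }
  return matrix;
}
```

Under these assumptions, each inner entry corresponds to one heatmap cell (or one Sankey link): the outer key is Judge 1's label, the inner key is Judge 2's label, and the value is the number of shared responses.
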

The underlying datasets on Hugging Face include:

* **2.4k questions**: [speechmap-questions](https://huggingface.co/datasets/PITTI/speechmap-questions)
* **369k responses**: [speechmap-responses](https://huggingface.co/datasets/PITTI/speechmap-responses-v3)
* **2.07k LLM-judge assessments**: [speechmap-assessments](https://huggingface.co/datasets/PITTI/speechmap-assessments-v3)
    * The assessment dataset combines the original assessments from the Speechmap project by `gpt-4o`, assessments by `mistral-small-3.1-2503`, `mistral-small-3.2-2506`, `gemma3-27b-it`, `deepseek-v3.2`, and `qwen3-next-80B-A3B-instruct`, and manual annotations.

## Quick Start

### Prerequisites

You need [Node.js](https://nodejs.org/) (which includes npm) installed on your machine, version 20.15.1 or later.

### Installation & Setup

1. **Clone the repository:**
    ```sh
    git clone https://github.com/pappitti/speechmap-judges.git
    cd speechmap-judges
    ```

2. **Vite Dev Mode**

    **Install Dependencies:**
    ```sh
    npm install
    ```

    **Fetch Data and Build the Database:**
    This command downloads the Parquet datasets from Hugging Face and creates a local `database.duckdb` file at the root of the project.
    ```sh
    npm run db:rebuild
    ```
    This project includes a branch running on duckdb-wasm. That branch does not require this database step: you can run `npm run dev` directly after `npm install` (or `npm run build` and then `npm run preview` for production). However, that branch was never merged into the main branch because database persistence is tricky with duckdb-wasm. Right now, the database must be rebuilt each time the app is started, which is really bad UX. IndexedDB is not an option; more work is required on that branch. _Also, duckdb-wasm is not as fast as expected for a database of this size._

    **Run the application:**
    This command starts the React frontend development server.
    ```sh
    npm run dev
    ```
    Open [http://localhost:5173](http://localhost:5173) (or the URL provided in your terminal) to view it in your browser.

3. **Production Build (Docker)**
    ```sh
    docker build -t speechmap-judges-prod .
    ```

    **Run the application:**
    ```sh
    docker run -p 7860:7860 --rm --name speechmap-judges-container speechmap-judges-prod
    ```
    Open [http://localhost:7860](http://localhost:7860) to view it in your browser.

## Acknowledgments

Whether you want to promote free speech or moderation, understanding biases in LLMs (and, in the case of this project, biases in LLM judges) is critical. Against this backdrop, the Speechmap project by xlr8harder is a very important initiative.