| title: LLM Assessment Explorer | |
| emoji: 🫣 | |
| colorFrom: purple | |
| colorTo: indigo | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| license: apache-2.0 | |
| short_description: LLM moderation profiles and judges classification | |
| datasets: | |
| - PITTI/speechmap-questions | |
| - PITTI/speechmap-responses-v3 | |
| - PITTI/speechmap-assessments-v3 | |
| # LLM Assessment Explorer | |
| [Speechmap-judges Demo](https://github.com/user-attachments/assets/f94f0ef9-7ad6-419d-823a-56e828061092) | |
| An interactive TypeScript app for exploring and comparing differences in Large Language Model (LLM) assessments. This tool helps visualize how different "judge" models classify the same LLM-generated responses, providing deep insights into inter-rater reliability and model behavior. | |
| ### Core Features | |
| * **Compare Any Two Judges**: Select any two LLM judges from the dataset to compare their assessments side-by-side. | |
| * **Filter by Theme**: Narrow down the analysis to specific topics or domains by filtering by question theme. | |
| * **Sankey Chart**: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2. | |
| * **Transition Matrix (Heatmap)**: Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges. | |
| * **Drill-Down to Details**: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges. | |
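The data behind the Sankey chart and the transition matrix can be pictured as a table of counts: for each label assigned by Judge 1, how many of those items received each label from Judge 2. The sketch below is only an illustration of that idea, not the app's actual code. The `PairedAssessment` type, the function names, and the label strings in the sample are all hypothetical.

```typescript
// Hypothetical shape: one entry per response, holding the label each judge assigned.
type PairedAssessment = { judge1: string; judge2: string };

// Count how often each (Judge 1 label -> Judge 2 label) pair occurs.
// Each count is one heatmap cell, or the width of one Sankey link.
function transitionMatrix(
  pairs: PairedAssessment[]
): Map<string, Map<string, number>> {
  const matrix = new Map<string, Map<string, number>>();
  for (const { judge1, judge2 } of pairs) {
    const row = matrix.get(judge1) ?? new Map<string, number>();
    row.set(judge2, (row.get(judge2) ?? 0) + 1);
    matrix.set(judge1, row);
  }
  return matrix;
}

// Share of items on which both judges chose the same label (the matrix diagonal).
function agreementRate(pairs: PairedAssessment[]): number {
  if (pairs.length === 0) return 0;
  const same = pairs.filter((p) => p.judge1 === p.judge2).length;
  return same / pairs.length;
}

// Illustrative sample; the label names are placeholders.
const sample: PairedAssessment[] = [
  { judge1: "COMPLETE", judge2: "COMPLETE" },
  { judge1: "COMPLETE", judge2: "EVASIVE" },
  { judge1: "DENIAL", judge2: "DENIAL" },
  { judge1: "EVASIVE", judge2: "DENIAL" },
];

const matrix = transitionMatrix(sample);
const rate = agreementRate(sample); // 0.5: two of the four items match
```

Off-diagonal cells such as `COMPLETE -> EVASIVE` are the reclassifications that the drill-down view lets you inspect item by item.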
| ## Speechmap Data | |
| This application explores datasets derived from xlr8harder's [Speechmap](https://speechmap.ai/) and [llm-compliance](https://github.com/xlr8harder/llm-compliance) projects. The data has been indexed and aggregated for efficient exploration. | |
| The underlying datasets on Hugging Face include: | |
| * **2.4k questions**: [speechmap-questions](https://huggingface.co/datasets/PITTI/speechmap-questions) | |
| * **369k responses**: [speechmap-responses](https://huggingface.co/datasets/PITTI/speechmap-responses-v3) | |
| * **2.07k LLM-judge assessments**: [speechmap-assessments](https://huggingface.co/datasets/PITTI/speechmap-assessments-v3) | |
| * The assessment dataset combines the original `gpt-4o` assessments from the Speechmap project with assessments by `mistral-small-3.1-2503`, `mistral-small-3.2-2506`, `gemma3-27b-it`, `deepseek-v3.2`, and `qwen3-next-80B-A3B-instruct`, plus manual annotations. | |
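As a rough illustration of how the three datasets listed above could be pulled into DuckDB tables, the sketch below builds one `CREATE TABLE` statement per dataset using DuckDB's `read_parquet` function and its `hf://` protocol. It is not the project's build script, which may work differently. The table names, the `buildLoadStatements` helper, the `@~parquet` revision, and the glob pattern are all assumptions.

```typescript
// Dataset ids come from the list above; the table names are hypothetical.
const datasets: Record<string, string> = {
  questions: "PITTI/speechmap-questions",
  responses: "PITTI/speechmap-responses-v3",
  assessments: "PITTI/speechmap-assessments-v3",
};

// Build one CREATE TABLE statement per dataset, reading Parquet through
// DuckDB's hf:// protocol. The "@~parquet" revision and the glob are
// assumptions about where the Parquet files live in each repository.
function buildLoadStatements(sources: Record<string, string>): string[] {
  return Object.entries(sources).map(
    ([table, repo]) =>
      `CREATE OR REPLACE TABLE ${table} AS ` +
      `SELECT * FROM read_parquet('hf://datasets/${repo}@~parquet/**/*.parquet');`
  );
}

const statements = buildLoadStatements(datasets);
```

Each resulting string can be run in any DuckDB client; keeping the SQL generation separate from execution makes it easy to inspect the statements before touching the database.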
| ## Quick Start | |
| ### Prerequisites | |
| You need [Node.js](https://nodejs.org/) (which includes npm) installed on your machine, version 20.15.1 or later. | |
| ### Installation & Setup | |
| 1. **Clone the repository:** | |
| ```sh | |
| git clone https://github.com/pappitti/speechmap-judges.git | |
| cd speechmap-judges | |
| ``` | |
| 2. **Vite Dev Mode** | |
| **Install Dependencies:** | |
| ```sh | |
| npm install | |
| ``` | |
| **Fetch Data and Build the Database:** | |
| This command downloads the Parquet datasets from Hugging Face and creates a local `database.duckdb` file at the root of the project. | |
| ```sh | |
| npm run db:rebuild | |
| ``` | |
| This project includes a branch running on duckdb-wasm. That branch does not require this database build step: you can run `npm run dev` directly after `npm install` (or `npm run build` and then `npm run preview` for production). However, that branch was never merged into the main branch because database persistence is tricky with duckdb-wasm. For now, the database must be rebuilt each time the app starts, which is really bad UX. IndexedDB is not an option; more work is required on that branch. | |
| _Also, duckdb-wasm is not as fast as expected for a database of this size._ | |
| **Run the application:** | |
| This command starts the React frontend development server. | |
| ```sh | |
| npm run dev | |
| ``` | |
| Open [http://localhost:5173](http://localhost:5173) (or the URL provided in your terminal) to view it in your browser. | |
| 3. **Production Build (Docker)** | |
| ```sh | |
| docker build -t speechmap-judges-prod . | |
| ``` | |
| **Run the application:** | |
| ```sh | |
| docker run -p 7860:7860 --rm --name speechmap-judges-container speechmap-judges-prod | |
| ``` | |
| Open [http://localhost:7860](http://localhost:7860) to view it in your browser. | |
| ## Acknowledgments | |
| Whether you want to promote free speech or moderation, understanding biases in LLMs (and, in the case of this project, biases in LLM judges) is critical. Against this backdrop, the Speechmap project by xlr8harder is a very important initiative. | |