metadata
title: LLM Assessment Explorer
emoji: 🫣
colorFrom: purple
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
short_description: LLM moderation profiles and judges classification
datasets:
  - PITTI/speechmap-questions
  - PITTI/speechmap-responses-v3
  - PITTI/speechmap-assessments-v3

LLM Assessment Explorer

Speechmap-judges Demo

An interactive TypeScript app for exploring and comparing Large Language Model (LLM) assessments. It visualizes how different "judge" models classify the same LLM-generated responses, giving insight into inter-rater reliability and model behavior.

Core Features

  • Compare Any Two Judges: Select any two LLM judges from the dataset to compare their assessments side-by-side.
  • Filter by Theme: Narrow down the analysis to specific topics or domains by filtering by question theme.
  • Sankey Chart: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2.
  • Transition Matrix (Heatmap): Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges.
  • Drill-Down to Details: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges.
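
The transition matrix and the Sankey chart both rest on the same counts: how many items Judge 1 put in one category and Judge 2 put in another. The following is a minimal sketch of that aggregation, not the app's actual code. The `PairedAssessment` shape, the function names, and the category labels in the sample are assumptions made for illustration.

```typescript
// Hypothetical record shape: one response labelled by both selected judges.
interface PairedAssessment {
  judge1Label: string;
  judge2Label: string;
}

// matrix.get(a)?.get(b) is the number of items Judge 1 labelled `a`
// and Judge 2 labelled `b`.
type TransitionMatrix = Map<string, Map<string, number>>;

// One Sankey link per non-empty matrix cell.
interface SankeyLink {
  source: string; // Judge 1 category
  target: string; // Judge 2 category
  value: number;  // number of items flowing from source to target
}

function buildTransitionMatrix(pairs: PairedAssessment[]): TransitionMatrix {
  const matrix: TransitionMatrix = new Map();
  pairs.forEach(({ judge1Label, judge2Label }) => {
    const row = matrix.get(judge1Label) ?? new Map<string, number>();
    row.set(judge2Label, (row.get(judge2Label) ?? 0) + 1);
    matrix.set(judge1Label, row);
  });
  return matrix;
}

function toSankeyLinks(matrix: TransitionMatrix): SankeyLink[] {
  const links: SankeyLink[] = [];
  matrix.forEach((row, source) => {
    row.forEach((value, target) => {
      links.push({ source, target, value });
    });
  });
  return links;
}

// Share of items on the diagonal, i.e. where both judges chose the same label.
function agreementRate(matrix: TransitionMatrix): number {
  let total = 0;
  let agreed = 0;
  matrix.forEach((row, from) => {
    row.forEach((count, to) => {
      total += count;
      if (from === to) agreed += count;
    });
  });
  return total === 0 ? 0 : agreed / total;
}

// Illustrative labels only; the real category names come from the dataset.
const samplePairs: PairedAssessment[] = [
  { judge1Label: "COMPLETE", judge2Label: "COMPLETE" },
  { judge1Label: "COMPLETE", judge2Label: "EVASIVE" },
  { judge1Label: "EVASIVE", judge2Label: "EVASIVE" },
  { judge1Label: "EVASIVE", judge2Label: "DENIAL" },
  { judge1Label: "DENIAL", judge2Label: "DENIAL" },
];

const sampleMatrix = buildTransitionMatrix(samplePairs);
console.log(toSankeyLinks(sampleMatrix).length); // 5 non-empty cells
console.log(agreementRate(sampleMatrix)); // 0.6 (3 of 5 items on the diagonal)
```

Filtering by theme would amount to restricting `pairs` before calling `buildTransitionMatrix`; the aggregation itself stays the same.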

Speechmap Data

This application explores datasets derived from xlr8harder's Speechmap and llm-compliance projects. The data has been indexed and aggregated for efficient exploration.

The underlying dataset from HuggingFace includes:

  • 2.4k questions: speechmap-questions
  • 369k responses: speechmap-responses
  • 2.07k LLM-judge assessments: speechmap-assessments
    • The assessment dataset combines the original assessments from the Speechmap project by gpt-4o, assessments by mistral-small-3.1-2503, mistral-small-3.2-2506, gemma3-27b-it, deepseek-v3.2, qwen3-next-80B-A3B-instruct and manual annotations.

Quick Start

Prerequisites

You need Node.js (which includes npm) installed on your machine, version 20.15.1 or later.
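
If you want to check the version requirement programmatically, a comparison along these lines would do. This is a sketch only: `parseVersion` and `meetsMinimum` are hypothetical helper names, not part of this project, and pre-release suffixes are ignored.

```typescript
// Parse "v20.15.1" or "20.15.1" into [20, 15, 1].
// Missing or non-numeric parts count as 0.
function parseVersion(version: string): number[] {
  return version
    .replace(/^v/, "")
    .split(".")
    .map((part) => parseInt(part, 10) || 0);
}

// True when `current` is greater than or equal to `minimum`,
// comparing major, then minor, then patch.
function meetsMinimum(current: string, minimum: string): boolean {
  const cur = parseVersion(current);
  const min = parseVersion(minimum);
  for (let i = 0; i < 3; i++) {
    const c = cur[i] ?? 0;
    const m = min[i] ?? 0;
    if (c !== m) return c > m;
  }
  return true;
}

console.log(meetsMinimum("20.15.1", "20.15.1")); // true
console.log(meetsMinimum("18.19.0", "20.15.1")); // false
```

Under Node.js, the running version is available as `process.versions.node`, so `meetsMinimum(process.versions.node, "20.15.1")` would check the current runtime.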

Installation & Setup

  1. Clone the repository:

    git clone https://github.com/pappitti/speechmap-judges.git
    cd speechmap-judges
    
  2. Vite Dev Mode
    Install Dependencies:

    npm install
    

    Fetch Data and Build the Database:
    This command downloads the Parquet datasets from Hugging Face and creates a local database.duckdb file at the root of the project.

    npm run db:rebuild
    

    This project includes a branch running on duckdb-wasm. That branch does not require this step: you can run npm run dev directly after npm install (or npm run build and then npm run preview for production). However, that branch was never merged into the main branch because database persistence is tricky with duckdb-wasm. For now, the database must be rebuilt each time the app starts, which makes for poor UX. IndexedDB is not an option; more work is required on that branch.
    Also, duckdb-wasm is not as fast as expected for a database of this size.

    Run the application:
    This command starts the React frontend development server.

    npm run dev
    

    Open http://localhost:5173 (or the URL provided in your terminal) to view it in your browser.

  3. Production Build (Docker)
    Build the image:

    docker build -t speechmap-judges-prod .
    

    Run the application:

    docker run -p 7860:7860 --rm --name speechmap-judges-container speechmap-judges-prod
    

    Open http://localhost:7860 to view it in your browser.

Acknowledgments

Whether you want to promote free speech or moderation, understanding biases in LLMs (and, in the case of this project, biases in LLM judges) is critical. Against this backdrop, the Speechmap project by xlr8harder is a very important initiative.