Space: mlfoundations/OpenThoughts_data_explorer

jmercat committed on Jun 2, 2025

Commit: 3a9cbd7
Parent: 8daa4df

Fix HuggingFace Space configuration - add proper SDK settings and clean requirements

Files changed (2)

README.md +22 -63
requirements.txt +0 -9

README.md CHANGED

@@ -1,82 +1,41 @@
 ---
-title: OpenThoughts Model Benchmark Explorer
 emoji: 📊
 colorFrom: blue
 colorTo: red
 sdk: streamlit
 sdk_version: 1.28.0
-app_file: benchmark_explorer_app.py
 pinned: false
-license: mit
 ---
-# 🔬 OpenThoughts Evalchemy Benchmark Explorer
-Exploring correlations and relationships between LLMs performance across different reasoning benchmarks.
-This explorer is built on top of the [OpenThoughts](https://github.com/open-thoughts/open-thoughts) project to explore the model that we have trained and evaluated as well as external models that we have evaluated.
-All evaluation results were produced and logged using [Evalchemy](https://github.com/mlfoundations/evalchemy).
 ## Features
-### 📊 Overview Dashboard
-- Key metrics and dataset statistics
-- Benchmark coverage visualization
-- Quick correlation insights
-- Category-based analysis
-### 🔥 Interactive Heatmap
-- Multiple correlation methods (Pearson, Spearman, Kendall)
-- Interactive hover tooltips
-- Real-time correlation statistics
-- Distribution analysis
-### 📈 Scatter Plot Explorer
-- Dynamic benchmark selection
-- Interactive scatter plots with regression lines
-- Multiple correlation coefficients
-- Data point exploration
-### 🎯 Model Performance Analysis
-- Model search and filtering
-- Performance rankings
-- Radar chart comparisons
-- Side-by-side model analysis
-### 📋 Statistical Summary
-- Comprehensive dataset statistics
-- Benchmark-wise analysis
-- Export capabilities
-- Correlation summaries
-### 🔬 Uncertainty Analysis
-- Measurement precision analysis
-- Error bar visualizations with 95% CI
-- Signal-to-noise ratios
-- Uncertainty-aware correlations
-## Benchmark Categories
-- **Math** (red): AIME24, AIME25, AMC23, MATH500
-- **Code** (blue): CodeElo, CodeForces, LiveCodeBench v2 & v5
-- **Science** (green): GPQADiamond, JEEBench
-- **General** (orange): MMLUPro, HLE
-## Data Filtering Options
-- Category-based filtering
-- Zero-value filtering with threshold
-- Minimum coverage requirements
-- Dynamic slider ranges based on actual data
-## Architecture
-- **Frontend**: Streamlit with Plotly interactive visualizations
-- **Backend**: Pandas/NumPy for data processing, SciPy for statistics
-- **Caching**: Smart caching for performance optimization
-- **Real-time**: On-the-fly correlation computation for dynamic filtering
-## Usage
-The application automatically loads benchmark data and provides six specialized analysis modules. Use the sidebar controls to filter data and customize the analysis based on your needs.
-Perfect for researchers, practitioners, and anyone interested in understanding the relationships between different AI evaluation benchmarks.

 ---
+title: OpenThoughts Benchmark Explorer
 emoji: 📊
 colorFrom: blue
 colorTo: red
 sdk: streamlit
 sdk_version: 1.28.0
+app_file: app.py
 pinned: false
+license: apache-2.0
 ---
+# OpenThoughts Evalchemy Benchmark Explorer
+A comprehensive web application for exploring OpenThoughts benchmark correlations and model performance.
 ## Features
+- Interactive correlation heatmaps
+- Scatter plot explorer with uncertainty analysis
+- Model performance comparisons
+- Statistical summaries and uncertainty analysis
+## Usage
+The app automatically loads benchmark data and provides multiple views for analysis:
+1. **Overview Dashboard**: High-level summary of benchmarks and correlations
+2. **Interactive Heatmap**: Correlation matrix visualization
+3. **Scatter Explorer**: Detailed pairwise benchmark comparisons
+4. **Model Performance**: Individual model analysis
+5. **Statistical Summary**: Correlation statistics across methods
+6. **Uncertainty Analysis**: Measurement reliability analysis
+## Data Files
+The app requires two CSV files:
+- `comprehensive_benchmark_scores.csv`: Main benchmark scores
+- `benchmark_standard_errors.csv`: Standard error estimates (optional)
+These files should be in the root directory of the repository.
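
The "Data Files" section added above names one required CSV and one optional CSV. A minimal sketch of how such loading might look, assuming pandas (which is listed in requirements.txt). The function name `load_benchmark_data` and the choice of the first column as the model index are illustrative assumptions, not taken from the Space's actual `app.py`:

```python
from pathlib import Path
from typing import Optional, Tuple

import pandas as pd

# File names as given in the README's "Data Files" section.
SCORES_FILE = "comprehensive_benchmark_scores.csv"
ERRORS_FILE = "benchmark_standard_errors.csv"


def load_benchmark_data(root: Path) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
    """Load the required scores file and, if present, the optional standard errors.

    Hypothetical helper: the real app may load its data differently.
    """
    scores_path = root / SCORES_FILE
    if not scores_path.exists():
        # The scores file is required, so fail loudly when it is missing.
        raise FileNotFoundError(f"Required data file not found: {scores_path}")
    # Assumption: the first column identifies the model (not stated in the README).
    scores = pd.read_csv(scores_path, index_col=0)

    # The standard-error file is optional, so fall back to None when absent.
    errors_path = root / ERRORS_FILE
    errors = pd.read_csv(errors_path, index_col=0) if errors_path.exists() else None
    return scores, errors
```

Returning `None` for the missing optional file lets callers skip the uncertainty views rather than fail at startup. Calling `load_benchmark_data(Path("."))` would read both files from the repository root, where the README says they should live.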

requirements.txt CHANGED

@@ -1,12 +1,3 @@
-fastapi
-uvicorn
-requests
-sqlalchemy
-asyncpg
-aiohttp
-python-json-logger
-psycopg2-binary
-antlr4-python3-runtime==4.11
 streamlit>=1.28.0
 pandas>=2.0.0
 numpy>=1.24.0
