---
title: OpenThoughts Benchmark Explorer
emoji: π
colorFrom: blue
colorTo: red
sdk: streamlit
sdk_version: 1.28.0
app_file: app.py
pinned: false
license: apache-2.0
---

# OpenThoughts Evalchemy Benchmark Explorer

A Streamlit web application for exploring correlations between OpenThoughts benchmarks and comparing model performance.

## Features

- Interactive correlation heatmaps
- Scatter plot explorer with uncertainty analysis
- Model performance comparisons
- Statistical summaries and uncertainty analysis

## Usage

The app automatically loads benchmark data and provides multiple views for analysis:

1. **Overview Dashboard**: High-level summary of benchmarks and correlations
2. **Interactive Heatmap**: Correlation matrix visualization
3. **Scatter Explorer**: Detailed pairwise benchmark comparisons
4. **Model Performance**: Individual model analysis
5. **Statistical Summary**: Correlation statistics across methods
6. **Uncertainty Analysis**: Measurement reliability analysis

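
As a rough illustration of what the heatmap and statistical summary views compute, the sketch below builds one benchmark-by-benchmark correlation matrix per method with pandas. It is not taken from `app.py`: the function name `correlation_matrices`, the choice of Pearson and Spearman as the methods, and the toy scores are all assumptions made for the example.

```python
import pandas as pd


def correlation_matrices(scores: pd.DataFrame, methods=("pearson", "spearman")) -> dict:
    """Return one benchmark-by-benchmark correlation matrix per method.

    Hypothetical helper: assumes rows are models and columns are benchmarks.
    """
    # Keep only numeric benchmark columns so stray text columns are ignored.
    numeric = scores.select_dtypes(include="number")
    return {method: numeric.corr(method=method) for method in methods}


# Toy data for illustration only; these are not real benchmark results.
scores = pd.DataFrame(
    {
        "bench_a": [0.2, 0.4, 0.6, 0.8],
        "bench_b": [0.1, 0.3, 0.5, 0.9],
        "bench_c": [0.9, 0.7, 0.4, 0.2],
    },
    index=["model_1", "model_2", "model_3", "model_4"],
)

matrices = correlation_matrices(scores)
print(matrices["spearman"].round(2))
```

Comparing the Pearson and Spearman matrices side by side is one way to spot benchmark pairs whose relationship is monotonic but not linear.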
## Data Files

The app reads two CSV files, one required and one optional:

- `comprehensive_benchmark_scores.csv`: Main benchmark scores (required)
- `benchmark_standard_errors.csv`: Standard error estimates (optional)

Both files should be in the root directory of the repository.
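
A minimal sketch of how the two files could be loaded, treating the standard-error file as optional. The helper name `load_benchmark_data` is hypothetical, and the assumption that the first CSV column holds the model identifier is not confirmed by this README; adjust `index_col` to match the actual files.

```python
from pathlib import Path
from typing import Optional, Tuple

import pandas as pd

SCORES_FILE = "comprehensive_benchmark_scores.csv"
STDERR_FILE = "benchmark_standard_errors.csv"


def load_benchmark_data(root: str = ".") -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
    """Load scores (required) and standard errors (optional) from `root`.

    Hypothetical helper, not the app's actual loader.
    """
    root_dir = Path(root)

    # Assumption: the first column identifies the model and becomes the index.
    scores = pd.read_csv(root_dir / SCORES_FILE, index_col=0)

    # The standard-error file is optional, so return None when it is missing.
    stderr_path = root_dir / STDERR_FILE
    stderr = pd.read_csv(stderr_path, index_col=0) if stderr_path.exists() else None

    return scores, stderr
```

Calling `load_benchmark_data(".")` from the repository root returns the scores table and either the standard-error table or `None`, so views that depend on uncertainty estimates can check for `None` and degrade gracefully.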