---
title: DataEngEval
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
- leaderboard
- evaluation
- sql
- code-generation
- data-engineering
---
# DataEngEval
An evaluation platform for benchmarking models on data engineering tasks across several use cases and languages: SQL generation, Python data processing, and technical documentation.
## Features
- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos
## Current Use Cases
### SQL Generation
- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance
### Code Generation
- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality
### Documentation Generation
- **Technical Documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality
## Project Structure
```
dataeng-leaderboard/
├── app.py                    # Main Gradio application
├── requirements.txt          # Dependencies for Hugging Face Spaces
├── config/                   # Configuration files
│   ├── app.yaml              # App settings
│   ├── models.yaml           # Model configurations
│   ├── metrics.yaml          # Scoring weights
│   └── use_cases.yaml        # Use case definitions
├── src/                      # Source code modules
│   ├── evaluator.py          # Dataset management and evaluation
│   ├── models_registry.py    # Model configuration and interfaces
│   ├── scoring.py            # Metrics computation
│   └── utils/                # Utility functions
├── tasks/                    # Multi-use-case datasets
│   ├── sql_generation/       # SQL generation tasks
│   ├── code_generation/      # Python data processing tasks
│   └── documentation/        # Technical documentation tasks
├── prompts/                  # SQL generation templates
└── test/                     # Test files
```
## Quick Start
### Running on Hugging Face Spaces
1. **Fork this Space**: Click "Fork" on the Hugging Face Space
2. **Configure**: Add your `HF_TOKEN` as a secret in Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models
### Running Locally
1. Clone this repository:
```bash
git clone <repository-url>
cd dataeng-leaderboard
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables (optional):
```bash
export HF_TOKEN="your_huggingface_token" # For Hugging Face models
```
4. Run the application:
```bash
gradio app.py
```
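Because `HF_TOKEN` is optional and the platform offers a mock mode, one plausible way to choose between remote inference and mock mode at startup is sketched below. The function names `select_backend` and `mock_generate` and the canned response are hypothetical; the actual logic in `app.py` may differ.

```python
import os


def select_backend(environ=None):
    """Return "remote" when a token is available, otherwise "mock" (assumed rule)."""
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN", "").strip()
    return "remote" if token else "mock"


def mock_generate(question):
    """Return a canned SQL string so the UI can be demoed without an API key."""
    return f"SELECT 1 -- mock response for: {question}"


backend = select_backend({})  # no token set, so this falls back to mock mode
print(backend)
print(mock_generate("How many trips were taken?"))
```

Passing the environment as an argument keeps the selection rule easy to test without touching real environment variables.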
## Usage
### Evaluating Models
1. **Select Dataset**: Choose from available datasets (NYC Taxi)
2. **Choose Dialect**: Select target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and updated leaderboard
### Understanding Metrics
The platform computes several metrics for each evaluation:
- **Correctness (Exact)**: Binary score (0/1) for exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation
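
As a rough illustration of how the result-based metrics above could be computed, the sketch below compares the rows returned by a generated query with the rows returned by the reference query. The function names, the treatment of results as unordered multisets of rows, and the toy rows are assumptions for illustration; they are not taken from `src/scoring.py`.

```python
from collections import Counter


def exact_match(predicted_rows, reference_rows):
    """Return 1.0 if both results contain the same rows (order ignored), else 0.0."""
    return 1.0 if Counter(predicted_rows) == Counter(reference_rows) else 0.0


def result_match_f1(predicted_rows, reference_rows):
    """F1 over rows, counting duplicates (an assumed definition of partial matching)."""
    predicted, reference = Counter(predicted_rows), Counter(reference_rows)
    if not predicted and not reference:
        return 1.0  # two empty results agree completely
    overlap = sum((predicted & reference).values())  # rows present in both results
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted.values())
    recall = overlap / sum(reference.values())
    return 2 * precision * recall / (precision + recall)


# Toy rows for illustration only; each row must be hashable, e.g. a tuple.
reference = [("Manhattan", 120), ("Queens", 45), ("Brooklyn", 80)]
predicted = [("Manhattan", 120), ("Queens", 45), ("Bronx", 10)]
print(exact_match(predicted, reference))
print(round(result_match_f1(predicted, reference), 3))
```

Two of the three predicted rows match here, so precision and recall are both 2/3 and the F1 score gives partial credit where the exact-match score gives none.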
The **Composite Score** combines all metrics using the following weights:
- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
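
The weighting above amounts to a weighted sum, which the following sketch works through. It assumes every metric has already been scaled to the range 0 to 1. Because latency is reported in milliseconds, the sketch converts it with a hypothetical `latency_score` helper and an assumed 10-second budget; the platform's real normalization is configured in `config/metrics.yaml` and may differ.

```python
# Weights copied from the list above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,
}


def latency_score(latency_ms, budget_ms=10_000.0):
    """Map latency to [0, 1], where faster is better (assumed normalization)."""
    return max(0.0, 1.0 - latency_ms / budget_ms)


def composite_score(metrics):
    """Weighted sum of metric values, each expected to be in [0, 1]."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


# Made-up metric values for one evaluation run.
metrics = {
    "correctness": 1.0,
    "execution_success": 1.0,
    "result_match_f1": 1.0,
    "dialect_compliance": 1.0,
    "readability": 0.8,
    "latency": latency_score(2_000.0),
}
print(round(composite_score(metrics), 3))  # 0.98
```

In this example the four fully satisfied metrics contribute 0.90, and readability and latency contribute 0.04 each.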
## Configuration
### Adding New Models
Edit `config/models.yaml` to add new models:
```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```
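Before adding an entry, it can help to sanity-check it. The sketch below validates one entry after it has been parsed into a Python dictionary; parsing the YAML file itself is left to whatever YAML library the project uses. Which fields are required and the accepted value ranges are assumptions for illustration, and `validate_model_entry` is a hypothetical helper rather than part of `src/models_registry.py`.

```python
# Assumed set of mandatory fields; the real registry may require others.
REQUIRED_FIELDS = ("name", "provider", "model_id")


def validate_model_entry(entry):
    """Return a list of problems found in one parsed `models` entry."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if not entry.get(field)]
    params = entry.get("params", {})
    if not isinstance(params, dict):
        problems.append("params must be a mapping")
        return problems
    max_new_tokens = params.get("max_new_tokens")
    if max_new_tokens is not None and (not isinstance(max_new_tokens, int) or max_new_tokens <= 0):
        problems.append("params.max_new_tokens must be a positive integer")
    temperature = params.get("temperature")
    if temperature is not None and (not isinstance(temperature, (int, float)) or temperature < 0):
        problems.append("params.temperature must be a non-negative number")
    return problems


# The example entry from the YAML above, as it would look after parsing.
entry = {
    "name": "Your Model Name",
    "provider": "huggingface",
    "model_id": "your/model-id",
    "params": {"max_new_tokens": 512, "temperature": 0.1},
    "description": "Description of your model",
}
print(validate_model_entry(entry))  # []
```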
### Adding New Datasets
1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:
   - **`schema.sql`**: Database schema definition
   - **`loader.py`**: Database creation script
   - **`cases.yaml`**: Test cases with questions and reference SQL
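
To give a feel for how the three files fit together, here is a minimal sketch of what a `loader.py` might do: create a database from a schema, load rows, and run a reference query of the kind `cases.yaml` would hold. It uses Python's built-in `sqlite3` as a dependency-free stand-in for DuckDB, and the `trips` table, its columns, the sample rows, and the `create_database` function are all hypothetical.

```python
import sqlite3

# Hypothetical schema standing in for the contents of `schema.sql`.
SCHEMA_SQL = """
CREATE TABLE trips (
    trip_id INTEGER PRIMARY KEY,
    pickup_zone TEXT NOT NULL,
    fare_amount REAL NOT NULL
);
"""

# Hypothetical sample rows; a real loader would read the dataset's own data.
SAMPLE_ROWS = [(1, "Midtown", 12.5), (2, "Harlem", 8.0), (3, "Midtown", 20.0)]


def create_database(path=":memory:"):
    """Create the database from the schema and load the sample rows."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA_SQL)
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", SAMPLE_ROWS)
    conn.commit()
    return conn


conn = create_database()
# A reference query of the kind `cases.yaml` would pair with a question.
reference_sql = (
    "SELECT pickup_zone, COUNT(*) FROM trips "
    "GROUP BY pickup_zone ORDER BY pickup_zone"
)
print(conn.execute(reference_sql).fetchall())  # [('Harlem', 1), ('Midtown', 2)]
```

Keeping the schema in its own file lets the same definition be shown to the model in the prompt and used to build the database that generated queries run against.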
## Contributing
### Adding New Features
1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request
### Testing
Run the test suite:
```bash
python run_tests.py
```
## License
This project is licensed under the Apache-2.0 License.
## Acknowledgments
- Built with [Gradio](https://gradio.app/)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org/)
- Model APIs from [Hugging Face](https://huggingface.co/)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)