---
title: DataEngEval
emoji: πŸ₯‡
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
- leaderboard
- evaluation
- sql
- code-generation
- data-engineering
---

# DataEngEval

An evaluation platform for systematically benchmarking model performance on data engineering tasks, spanning multiple use cases, programming languages, and SQL dialects.

## πŸš€ Features

- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos

## 🎯 Current Use Cases

### SQL Generation
- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance

### Code Generation
- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality

### Documentation Generation
- **Technical Documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality

## πŸ—οΈ Project Structure

```
dataeng-leaderboard/
β”œβ”€β”€ app.py                     # Main Gradio application
β”œβ”€β”€ requirements.txt           # Dependencies for Hugging Face Spaces
β”œβ”€β”€ config/                    # Configuration files
β”‚   β”œβ”€β”€ app.yaml              # App settings
β”‚   β”œβ”€β”€ models.yaml           # Model configurations
β”‚   β”œβ”€β”€ metrics.yaml          # Scoring weights
β”‚   └── use_cases.yaml        # Use case definitions
β”œβ”€β”€ src/                      # Source code modules
β”‚   β”œβ”€β”€ evaluator.py          # Dataset management and evaluation
β”‚   β”œβ”€β”€ models_registry.py    # Model configuration and interfaces
β”‚   β”œβ”€β”€ scoring.py            # Metrics computation
β”‚   └── utils/                # Utility functions
β”œβ”€β”€ tasks/                    # Multi-use-case datasets
β”‚   β”œβ”€β”€ sql_generation/      # SQL generation tasks
β”‚   β”œβ”€β”€ code_generation/     # Python data processing tasks
β”‚   └── documentation/       # Technical documentation tasks
β”œβ”€β”€ prompts/                  # SQL generation templates
└── test/                     # Test files
```

## πŸš€ Quick Start

### Running on Hugging Face Spaces

1. **Fork this Space**: Click "Fork" on the Hugging Face Space
2. **Configure**: Add your `HF_TOKEN` as a secret in Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models

### Running Locally

1. Clone this repository:
```bash
git clone <repository-url>
cd dataeng-leaderboard
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up environment variables (optional):
```bash
export HF_TOKEN="your_huggingface_token"  # For Hugging Face models
```

4. Run the application:
```bash
gradio app.py
```

## πŸ“Š Usage

### Evaluating Models

1. **Select Dataset**: Choose from available datasets (NYC Taxi)
2. **Choose Dialect**: Select target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and updated leaderboard

### Understanding Metrics

The platform computes several metrics for each evaluation:

- **Correctness (Exact)**: Binary score (0/1) for exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation
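As an illustration of the Result Match F1 metric, the sketch below compares query results as sets of rows; this is an assumption for demonstration, and the actual implementation in `src/scoring.py` may differ (e.g. handling duplicates or column order):

```python
# Sketch of a result-match F1 score. Assumes query results are
# hashable row tuples compared as sets; the project's real scoring
# logic lives in src/scoring.py and may handle duplicates differently.

def result_match_f1(reference_rows, generated_rows):
    ref, gen = set(reference_rows), set(generated_rows)
    if not ref and not gen:
        return 1.0  # both queries returned nothing: perfect match
    overlap = len(ref & gen)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)   # fraction of generated rows that are correct
    recall = overlap / len(ref)      # fraction of reference rows that were produced
    return 2 * precision * recall / (precision + recall)

ref = [("2023-01-01", 120), ("2023-01-02", 98)]
gen = [("2023-01-01", 120), ("2023-01-02", 97)]
print(result_match_f1(ref, gen))  # 0.5
```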

**Composite Score** combines all metrics with weights:
- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
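The weighting above amounts to a weighted sum. A minimal sketch, assuming each metric has already been normalized to [0, 1] (e.g. latency converted to a score where faster is higher); the authoritative weights live in `config/metrics.yaml`:

```python
# Composite score sketch using the weights listed above.
# Assumes all metric values are pre-normalized to [0, 1];
# the real weights are configured in config/metrics.yaml.

WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency_score": 0.05,
}

def composite_score(metrics: dict) -> float:
    """Weighted sum of normalized metric values; missing metrics score 0."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)

example = {
    "correctness": 1.0,
    "execution_success": 1.0,
    "result_match_f1": 0.8,
    "dialect_compliance": 1.0,
    "readability": 0.9,
    "latency_score": 0.7,
}
print(round(composite_score(example), 3))  # 0.95
```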

## βš™οΈ Configuration

### Adding New Models

Edit `config/models.yaml` to add new models:

```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```

### Adding New Datasets

1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:

**`schema.sql`**: Database schema definition
**`loader.py`**: Database creation script
**`cases.yaml`**: Test cases with questions and reference SQL
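For orientation, a hypothetical `cases.yaml` might look like the sketch below; the field names and table are illustrative only, not the project's actual schema:

```yaml
# Hypothetical cases.yaml sketch -- field names are illustrative.
cases:
  - id: trips_per_day
    question: "How many taxi trips were recorded on each day?"
    reference_sql: |
      SELECT pickup_date, COUNT(*) AS trip_count
      FROM trips
      GROUP BY pickup_date
      ORDER BY pickup_date;
```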

## 🀝 Contributing

### Adding New Features

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request

### Testing

Run the test suite:
```bash
python run_tests.py
```

## πŸ“„ License

This project is licensed under the Apache-2.0 License.

## πŸ™ Acknowledgments

- Built with [Gradio](https://gradio.app/)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org/)
- Model APIs from [Hugging Face](https://huggingface.co/)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)