---
title: DataEngEval
emoji: πŸ₯‡
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
- leaderboard
- evaluation
- sql
- code-generation
- data-engineering
---

# DataEngEval

An evaluation platform for systematically benchmarking model performance on data engineering tasks, spanning multiple use cases, programming languages, and SQL dialects.

## πŸš€ Features

- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos

## 🎯 Current Use Cases

### SQL Generation
- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance

### Code Generation
- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality

### Documentation Generation
- **Technical Documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality

## πŸ—οΈ Project Structure

```
dataeng-leaderboard/
β”œβ”€β”€ app.py                     # Main Gradio application
β”œβ”€β”€ requirements.txt           # Dependencies for Hugging Face Spaces
β”œβ”€β”€ config/                    # Configuration files
β”‚   β”œβ”€β”€ app.yaml              # App settings
β”‚   β”œβ”€β”€ models.yaml           # Model configurations
β”‚   β”œβ”€β”€ metrics.yaml          # Scoring weights
β”‚   └── use_cases.yaml        # Use case definitions
β”œβ”€β”€ src/                      # Source code modules
β”‚   β”œβ”€β”€ evaluator.py          # Dataset management and evaluation
β”‚   β”œβ”€β”€ models_registry.py    # Model configuration and interfaces
β”‚   β”œβ”€β”€ scoring.py            # Metrics computation
β”‚   └── utils/                # Utility functions
β”œβ”€β”€ tasks/                    # Multi-use-case datasets
β”‚   β”œβ”€β”€ sql_generation/      # SQL generation tasks
β”‚   β”œβ”€β”€ code_generation/     # Python data processing tasks
β”‚   └── documentation/       # Technical documentation tasks
β”œβ”€β”€ prompts/                  # SQL generation templates
└── test/                     # Test files
```

## πŸš€ Quick Start

### Running on Hugging Face Spaces

1. **Fork this Space**: Click "Fork" on the Hugging Face Space
2. **Configure**: Add your `HF_TOKEN` as a secret in Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models

### Running Locally

1. Clone this repository:
```bash
git clone <repository-url>
cd dataeng-leaderboard
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Set up environment variables (optional):
```bash
export HF_TOKEN="your_huggingface_token"  # For Hugging Face models
```

4. Run the application:
```bash
gradio app.py
```

## πŸ“Š Usage

### Evaluating Models

1. **Select Dataset**: Choose from available datasets (NYC Taxi)
2. **Choose Dialect**: Select target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and updated leaderboard

### Understanding Metrics

The platform computes several metrics for each evaluation:

- **Correctness (Exact)**: Binary score (0/1) for exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation
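As an illustration of the Result Match F1 metric, the sketch below compares query results as sets of rows; this is an assumption for demonstration, and the actual implementation in `src/scoring.py` may differ (e.g. handling duplicates or column order):

```python
# Sketch of a result-match F1 score. Assumes query results are
# hashable row tuples compared as sets; the project's real scoring
# logic lives in src/scoring.py and may handle duplicates differently.

def result_match_f1(reference_rows, generated_rows):
    ref, gen = set(reference_rows), set(generated_rows)
    if not ref and not gen:
        return 1.0  # both queries returned nothing: perfect match
    overlap = len(ref & gen)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)   # fraction of generated rows that are correct
    recall = overlap / len(ref)      # fraction of reference rows that were produced
    return 2 * precision * recall / (precision + recall)

ref = [("2023-01-01", 120), ("2023-01-02", 98)]
gen = [("2023-01-01", 120), ("2023-01-02", 97)]
print(result_match_f1(ref, gen))  # 0.5
```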

**Composite Score** combines all metrics with weights:
- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
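The weighting above amounts to a weighted sum. A minimal sketch, assuming each metric has already been normalized to [0, 1] (e.g. latency converted to a score where faster is higher); the authoritative weights live in `config/metrics.yaml`:

```python
# Composite score sketch using the weights listed above.
# Assumes all metric values are pre-normalized to [0, 1];
# the real weights are configured in config/metrics.yaml.

WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency_score": 0.05,
}

def composite_score(metrics: dict) -> float:
    """Weighted sum of normalized metric values; missing metrics score 0."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)

example = {
    "correctness": 1.0,
    "execution_success": 1.0,
    "result_match_f1": 0.8,
    "dialect_compliance": 1.0,
    "readability": 0.9,
    "latency_score": 0.7,
}
print(round(composite_score(example), 3))  # 0.95
```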

## βš™οΈ Configuration

### Adding New Models

Edit `config/models.yaml` to add new models:

```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```

### Adding New Datasets

1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:

**`schema.sql`**: Database schema definition
**`loader.py`**: Database creation script
**`cases.yaml`**: Test cases with questions and reference SQL
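For orientation, a hypothetical `cases.yaml` might look like the sketch below; the field names and table are illustrative only, not the project's actual schema:

```yaml
# Hypothetical cases.yaml sketch -- field names are illustrative.
cases:
  - id: trips_per_day
    question: "How many taxi trips were recorded on each day?"
    reference_sql: |
      SELECT pickup_date, COUNT(*) AS trip_count
      FROM trips
      GROUP BY pickup_date
      ORDER BY pickup_date;
```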

## 🀝 Contributing

### Adding New Features

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request

### Testing

Run the test suite:
```bash
python run_tests.py
```

## πŸ“„ License

This project is licensed under the Apache-2.0 License.

## πŸ™ Acknowledgments

- Built with [Gradio](https://gradio.app/)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org/)
- Model APIs from [Hugging Face](https://huggingface.co/)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)