---
title: DataEngEval
emoji: πŸ₯‡
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
- leaderboard
- evaluation
- sql
- code-generation
- data-engineering
---
# DataEngEval
An evaluation platform for systematically benchmarking model performance across use cases and programming languages, with a focus on data engineering tools and technologies.
## πŸš€ Features
- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos
## 🎯 Current Use Cases
### SQL Generation
- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance
### Code Generation
- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality
### Documentation Generation
- **Technical Documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality
## πŸ—οΈ Project Structure
```
dataeng-leaderboard/
β”œβ”€β”€ app.py                  # Main Gradio application
β”œβ”€β”€ requirements.txt        # Dependencies for Hugging Face Spaces
β”œβ”€β”€ config/                 # Configuration files
β”‚   β”œβ”€β”€ app.yaml            # App settings
β”‚   β”œβ”€β”€ models.yaml         # Model configurations
β”‚   β”œβ”€β”€ metrics.yaml        # Scoring weights
β”‚   └── use_cases.yaml      # Use case definitions
β”œβ”€β”€ src/                    # Source code modules
β”‚   β”œβ”€β”€ evaluator.py        # Dataset management and evaluation
β”‚   β”œβ”€β”€ models_registry.py  # Model configuration and interfaces
β”‚   β”œβ”€β”€ scoring.py          # Metrics computation
β”‚   └── utils/              # Utility functions
β”œβ”€β”€ tasks/                  # Multi-use-case datasets
β”‚   β”œβ”€β”€ sql_generation/     # SQL generation tasks
β”‚   β”œβ”€β”€ code_generation/    # Python data processing tasks
β”‚   └── documentation/      # Technical documentation tasks
β”œβ”€β”€ prompts/                # SQL generation templates
└── test/                   # Test files
```
## πŸš€ Quick Start
### Running on Hugging Face Spaces
1. **Duplicate this Space**: Click "Duplicate this Space" on the Hugging Face Space page
2. **Configure**: Add your `HF_TOKEN` as a secret in Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models
### Running Locally
1. Clone this repository:
```bash
git clone <repository-url>
cd dataeng-leaderboard
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables (optional):
```bash
export HF_TOKEN="your_huggingface_token" # For Hugging Face models
```
4. Run the application:
```bash
gradio app.py
```
## πŸ“Š Usage
### Evaluating Models
1. **Select Dataset**: Choose from available datasets (NYC Taxi)
2. **Choose Dialect**: Select target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and updated leaderboard
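The evaluation flow above can be sketched as follows. This is a minimal illustration, not the real `src/evaluator.py` interface: `generate_sql` and `execute_sql` are hypothetical stand-ins for the remote inference call and the database execution step.

```python
from time import perf_counter

def evaluate_case(model, question, reference_rows, generate_sql, execute_sql):
    """Run one model on one test case and collect per-metric scores.

    `generate_sql` and `execute_sql` are illustrative stand-ins for the
    model call and the query-execution step; names are not the real API.
    """
    start = perf_counter()
    sql = generate_sql(model, question)           # remote inference call
    latency_ms = (perf_counter() - start) * 1000

    try:
        rows = execute_sql(sql)                   # run against the dataset DB
        executed = 1
    except Exception:
        rows, executed = [], 0

    return {
        "correctness_exact": int(rows == reference_rows),
        "execution_success": executed,
        "latency_ms": latency_ms,
    }

# Mock-mode style usage: canned callables instead of a live API.
result = evaluate_case(
    model="demo",
    question="How many trips were taken?",
    reference_rows=[(42,)],
    generate_sql=lambda m, q: "SELECT COUNT(*) FROM trips",
    execute_sql=lambda sql: [(42,)],
)
```

In mock mode the generator and executor are stubbed like this, which is why the platform works without an `HF_TOKEN`.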
### Understanding Metrics
The platform computes several metrics for each evaluation:
- **Correctness (Exact)**: Binary score (0/1) for exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation
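To illustrate the Result Match F1 metric, here is a minimal sketch that treats query results as sets of rows and gives partial credit for overlap. The actual computation in `src/scoring.py` may differ (e.g. in how duplicate rows or column order are handled).

```python
def result_match_f1(predicted_rows, reference_rows):
    """F1 over result rows: partial credit when only some rows match."""
    pred, ref = set(predicted_rows), set(reference_rows)
    if not pred or not ref:
        # Both empty is a perfect match; one empty is a total miss.
        return 1.0 if pred == ref else 0.0
    tp = len(pred & ref)                # rows present in both results
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Three predicted rows, two of which appear in the reference:
score = result_match_f1([(1,), (2,), (3,)], [(1,), (2,)])  # 0.8
```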
**Composite Score** combines all metrics with weights:
- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
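The weighted combination above amounts to a simple weighted sum. The sketch below assumes each metric has already been normalized to [0, 1] (with latency mapped so that faster responses score higher); the authoritative weights live in `config/metrics.yaml`.

```python
# Weights mirror the list above; source of truth is config/metrics.yaml.
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,
}

def composite_score(metrics):
    """Weighted sum of metric scores, each assumed normalized to [0, 1]."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)

# A query that is correct and executes, but scores zero elsewhere:
partial = composite_score({"correctness": 1.0, "execution_success": 1.0})  # 0.65
```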
## βš™οΈ Configuration
### Adding New Models
Edit `config/models.yaml` to add new models:
```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```
### Adding New Datasets
1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:
   - **`schema.sql`**: Database schema definition
   - **`loader.py`**: Database creation script
   - **`cases.yaml`**: Test cases with questions and reference SQL
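A minimal `cases.yaml` might look like the fragment below. The field names are illustrative; check an existing dataset under `tasks/` for the exact schema the evaluator expects.

```yaml
cases:
  - id: trip_count
    question: "How many taxi trips are in the dataset?"
    reference_sql: "SELECT COUNT(*) AS trip_count FROM trips"
```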
## 🀝 Contributing
### Adding New Features
1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request
### Testing
Run the test suite:
```bash
python run_tests.py
```
## πŸ“„ License
This project is licensed under the Apache-2.0 License.
## πŸ™ Acknowledgments
- Built with [Gradio](https://gradio.app/)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org/)
- Model APIs from [Hugging Face](https://huggingface.co/)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)