---
title: DataEngEval
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
  - leaderboard
  - evaluation
  - sql
  - code-generation
  - data-engineering
---

# DataEngEval

A comprehensive evaluation platform for systematically benchmarking model performance across programming languages, with a focus on data engineering tools and technologies.

## 🚀 Features

- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses the Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos

## 🎯 Current Use Cases

### SQL Generation

- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance

### Code Generation

- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality

### Documentation Generation

- **Technical documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality

## 🏗️ Project Structure

```
dataeng-leaderboard/
├── app.py                 # Main Gradio application
├── requirements.txt       # Dependencies for Hugging Face Spaces
├── config/                # Configuration files
│   ├── app.yaml           # App settings
│   ├── models.yaml        # Model configurations
│   ├── metrics.yaml       # Scoring weights
│   └── use_cases.yaml     # Use case definitions
├── src/                   # Source code modules
│   ├── evaluator.py       # Dataset management and evaluation
│   ├── models_registry.py # Model configuration and interfaces
│   ├── scoring.py         # Metrics computation
│   └── utils/             # Utility functions
├── tasks/                 # Multi-use-case datasets
│   ├── sql_generation/    # SQL generation tasks
│   ├── code_generation/   # Python data processing tasks
│   └── documentation/     # Technical documentation tasks
├── prompts/               # SQL generation templates
└── test/                  # Test files
```

## 🚀 Quick Start

### Running on Hugging Face Spaces

1. **Fork this Space**: Click "Fork" on the Hugging Face Space
2. **Configure**: Add your `HF_TOKEN` as a secret in the Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models

### Running Locally

1. Clone this repository:
   ```bash
   git clone <repository-url>  # replace with this Space's repository URL
   cd dataeng-leaderboard
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables (optional):
   ```bash
   export HF_TOKEN="your_huggingface_token"  # For Hugging Face models
   ```

4. Run the application:
   ```bash
   gradio app.py
   ```

## 📊 Usage

### Evaluating Models

1. **Select Dataset**: Choose from the available datasets (NYC Taxi)
2. **Choose Dialect**: Select the target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and the updated leaderboard

### Understanding Metrics

The platform computes several metrics for each evaluation:

- **Correctness (Exact)**: Binary score (0/1) for an exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation

The **composite score** combines all metrics with the following weights:

- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%

## ⚙️ Configuration

### Adding New Models

Edit `config/models.yaml` to add new models:

```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```

### Adding New Datasets

1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:
   - **`schema.sql`**: Database schema definition
   - **`loader.py`**: Database creation script
   - **`cases.yaml`**: Test cases with questions and reference SQL

## 🤝 Contributing

### Adding New Features

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request

### Testing

Run the test suite:

```bash
python run_tests.py
```

## 📄 License

This project is licensed under the Apache-2.0 License.

## 🙏 Acknowledgments

- Built with [Gradio](https://gradio.app/)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org/)
- Model APIs from [Hugging Face](https://huggingface.co/)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)
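## 📎 Example: Dataset Test Cases

For the "Adding New Datasets" steps, a minimal `cases.yaml` might look like the fragment below. The field names and the `trips` table are illustrative guesses, not the exact schema the evaluator expects; check `src/evaluator.py` and the existing files under `tasks/sql_generation/` for the authoritative format.

```yaml
# Hypothetical cases.yaml sketch; field names are illustrative.
cases:
  - id: trips_per_day
    question: "How many taxi trips were taken on each day?"
    reference_sql: |
      SELECT trip_date, COUNT(*) AS trip_count
      FROM trips
      GROUP BY trip_date
      ORDER BY trip_date
```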
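## 📎 Example: Composite Score

The composite score described under "Understanding Metrics" is a weighted sum of the per-metric scores. The sketch below is illustrative only: the actual computation lives in `src/scoring.py` with weights in `config/metrics.yaml`, and the `composite_score` helper here assumes every metric has already been normalized to the [0, 1] range (in particular, raw latency in milliseconds would first need to be mapped to a score where faster is higher).

```python
# Hypothetical sketch of the weighted composite score; the real
# implementation in src/scoring.py may differ in names and details.
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,
}

def composite_score(metrics: dict) -> float:
    """Weighted sum of per-metric scores, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)

example = {
    "correctness": 1.0,        # exact result match
    "execution_success": 1.0,  # query ran without error
    "result_match_f1": 0.8,
    "dialect_compliance": 1.0,
    "readability": 0.9,
    "latency": 0.7,            # already normalized: faster -> higher
}
print(round(composite_score(example), 3))  # → 0.95
```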