---
title: DataEngEval
emoji: 🔥
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
  - leaderboard
  - evaluation
  - sql
  - code-generation
  - data-engineering
---
# DataEngEval

An evaluation platform for systematically benchmarking model performance on data engineering tasks: SQL generation across dialects, Python data processing, and technical documentation.
## Features

- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: correctness, execution success, syntax validation, performance
- **Remote inference**: uses the Hugging Face Inference API (no local model downloads)
- **Mock mode**: works without API keys for demos
## Current Use Cases

### SQL Generation

- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: correctness, execution, result matching, dialect compliance
### Code Generation

- **Python**: data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: syntax correctness, execution success, data processing accuracy, code quality

### Documentation Generation

- **Technical documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: content accuracy, completeness, technical clarity, formatting quality
## Project Structure

```
dataeng-leaderboard/
├── app.py                   # Main Gradio application
├── requirements.txt         # Dependencies for Hugging Face Spaces
├── config/                  # Configuration files
│   ├── app.yaml             # App settings
│   ├── models.yaml          # Model configurations
│   ├── metrics.yaml         # Scoring weights
│   └── use_cases.yaml       # Use case definitions
├── src/                     # Source code modules
│   ├── evaluator.py         # Dataset management and evaluation
│   ├── models_registry.py   # Model configuration and interfaces
│   ├── scoring.py           # Metrics computation
│   └── utils/               # Utility functions
├── tasks/                   # Multi-use-case datasets
│   ├── sql_generation/      # SQL generation tasks
│   ├── code_generation/     # Python data processing tasks
│   └── documentation/       # Technical documentation tasks
├── prompts/                 # SQL generation templates
└── test/                    # Test files
```
## Quick Start

### Running on Hugging Face Spaces

1. **Fork this Space**: click "Fork" on the Hugging Face Space
2. **Configure**: add your `HF_TOKEN` as a secret in the Space settings (optional)
3. **Deploy**: the Space will build and deploy automatically
4. **Use**: open the Space URL to start evaluating models

### Running Locally

1. Clone this repository:
   ```bash
   git clone <repository-url>
   cd dataeng-leaderboard
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Set up environment variables (optional):
   ```bash
   export HF_TOKEN="your_huggingface_token"  # for Hugging Face models
   ```
4. Run the application:
   ```bash
   gradio app.py
   ```
## Usage

### Evaluating Models

1. **Select Dataset**: choose from available datasets (NYC Taxi)
2. **Choose Dialect**: select the target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: select a specific natural language question to evaluate
4. **Select Models**: choose one or more models to evaluate
5. **Run Evaluation**: click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: see individual results and the updated leaderboard
### Understanding Metrics

The platform computes several metrics for each evaluation:

- **Correctness (Exact)**: binary score (0/1) for an exact result match
- **Execution Success**: binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: response time in milliseconds
- **Readability**: score based on SQL structure and formatting
- **Dialect Compliance**: binary score (0/1) for successful SQL transpilation
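For intuition, Result Match F1 can be computed by treating the reference and generated result sets as multisets of rows. This is a minimal stdlib sketch; the authoritative implementation lives in `src/scoring.py` and may differ:

```python
from collections import Counter

def result_match_f1(reference_rows, generated_rows) -> float:
    """F1 over result rows: precision/recall on the multiset row overlap."""
    ref = Counter(map(tuple, reference_rows))
    gen = Counter(map(tuple, generated_rows))
    overlap = sum((ref & gen).values())  # rows present in both results
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Two of three generated rows match the reference -> precision 2/3, recall 1
print(result_match_f1([(1, "a"), (2, "b")], [(1, "a"), (2, "b"), (3, "c")]))  # -> 0.8
```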
**Composite Score** combines all metrics with the following weights:

- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
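Under these weights, the composite score is a weighted sum of the individual metrics. The sketch below assumes each metric has been normalized to [0, 1] (latency, in milliseconds, would need to be inverted and normalized first); metric key names are illustrative:

```python
# Weights from the list above; they sum to 1.0, so a perfect run scores 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,
}

def composite_score(metrics: dict) -> float:
    """Weighted sum of normalized metric scores; missing metrics count as 0."""
    return sum(weight * metrics.get(name, 0.0) for name, weight in WEIGHTS.items())

perfect = {name: 1.0 for name in WEIGHTS}
print(round(composite_score(perfect), 2))  # -> 1.0
```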
## Configuration

### Adding New Models

Edit `config/models.yaml` to add new models:

```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```
### Adding New Datasets

1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:
   - **`schema.sql`**: database schema definition
   - **`loader.py`**: database creation script
   - **`cases.yaml`**: test cases with questions and reference SQL
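As an illustration, a `cases.yaml` entry might look like the following. The field names here are a plausible sketch, not the exact schema — check an existing dataset under `tasks/` for the authoritative format:

```yaml
cases:
  - id: trips_per_day
    question: "How many taxi trips were recorded on each day?"
    reference_sql: |
      SELECT pickup_date, COUNT(*) AS trip_count
      FROM trips
      GROUP BY pickup_date
      ORDER BY pickup_date
```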
## Contributing

### Adding New Features

1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request

### Testing

Run the test suite:

```bash
python run_tests.py
```
## License

This project is licensed under the Apache-2.0 License.

## Acknowledgments

- Built with [Gradio](https://gradio.app/)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org/)
- Model APIs from [Hugging Face](https://huggingface.co/)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)