---
title: DataEngEval
emoji: πŸ₯‡
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: false
license: apache-2.0
short_description: The Benchmarking Hub for Data Engineering + AI
tags:
- leaderboard
- evaluation
- sql
- code-generation
- data-engineering
---
# DataEngEval
An evaluation platform for systematically benchmarking model performance across use cases and programming languages, with a focus on data engineering tools and technologies.
## πŸš€ Features
- **Multi-use-case evaluation**: SQL generation, Python data processing, documentation generation
- **Real-world datasets**: NYC Taxi queries, data transformation algorithms, technical documentation
- **Comprehensive metrics**: Correctness, execution success, syntax validation, performance
- **Remote inference**: Uses Hugging Face Inference API (no local model downloads)
- **Mock mode**: Works without API keys for demos
## 🎯 Current Use Cases
### SQL Generation
- **Dataset**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution, result matching, dialect compliance
### Code Generation
- **Python**: Data processing algorithms, ETL pipelines, data transformation functions
- **Metrics**: Syntax correctness, execution success, data processing accuracy, code quality
### Documentation Generation
- **Technical Documentation**: API documentation, system architecture, data pipeline documentation
- **Metrics**: Content accuracy, completeness, technical clarity, formatting quality
## πŸ—οΈ Project Structure
```
dataeng-leaderboard/
β”œβ”€β”€ app.py                  # Main Gradio application
β”œβ”€β”€ requirements.txt        # Dependencies for Hugging Face Spaces
β”œβ”€β”€ config/                 # Configuration files
β”‚   β”œβ”€β”€ app.yaml            # App settings
β”‚   β”œβ”€β”€ models.yaml         # Model configurations
β”‚   β”œβ”€β”€ metrics.yaml        # Scoring weights
β”‚   └── use_cases.yaml      # Use case definitions
β”œβ”€β”€ src/                    # Source code modules
β”‚   β”œβ”€β”€ evaluator.py        # Dataset management and evaluation
β”‚   β”œβ”€β”€ models_registry.py  # Model configuration and interfaces
β”‚   β”œβ”€β”€ scoring.py          # Metrics computation
β”‚   └── utils/              # Utility functions
β”œβ”€β”€ tasks/                  # Multi-use-case datasets
β”‚   β”œβ”€β”€ sql_generation/     # SQL generation tasks
β”‚   β”œβ”€β”€ code_generation/    # Python data processing tasks
β”‚   └── documentation/      # Technical documentation tasks
β”œβ”€β”€ prompts/                # SQL generation templates
└── test/                   # Test files
```
## πŸš€ Quick Start
### Running on Hugging Face Spaces
1. **Duplicate this Space**: Click "Duplicate this Space" on the Hugging Face Space page
2. **Configure**: Add your `HF_TOKEN` as a secret in Space settings (optional)
3. **Deploy**: The Space will automatically build and deploy
4. **Use**: Access the Space URL to start evaluating models
### Running Locally
1. Clone this repository:
```bash
git clone <repository-url>
cd dataeng-leaderboard
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables (optional):
```bash
export HF_TOKEN="your_huggingface_token" # For Hugging Face models
```
4. Run the application:
```bash
gradio app.py
```
## πŸ“Š Usage
### Evaluating Models
1. **Select Dataset**: Choose from available datasets (NYC Taxi)
2. **Choose Dialect**: Select target SQL dialect (Presto, BigQuery, Snowflake)
3. **Pick Test Case**: Select a specific natural language question to evaluate
4. **Select Models**: Choose one or more models to evaluate
5. **Run Evaluation**: Click "Run Evaluation" to generate SQL and compute metrics
6. **View Results**: See individual results and updated leaderboard
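The evaluation flow above can be sketched as follows. This is a minimal illustration, not the real `src/evaluator.py` interface: `generate_sql` and `execute_sql` are hypothetical stand-ins for the remote inference call and the database execution step.

```python
from time import perf_counter

def evaluate_case(model, question, reference_rows, generate_sql, execute_sql):
    """Run one model on one test case and collect per-metric scores.

    `generate_sql` and `execute_sql` are illustrative stand-ins for the
    model call and the query-execution step; names are not the real API.
    """
    start = perf_counter()
    sql = generate_sql(model, question)           # remote inference call
    latency_ms = (perf_counter() - start) * 1000

    try:
        rows = execute_sql(sql)                   # run against the dataset DB
        executed = 1
    except Exception:
        rows, executed = [], 0

    return {
        "correctness_exact": int(rows == reference_rows),
        "execution_success": executed,
        "latency_ms": latency_ms,
    }

# Mock-mode style usage: canned callables instead of a live API.
result = evaluate_case(
    model="demo",
    question="How many trips were taken?",
    reference_rows=[(42,)],
    generate_sql=lambda m, q: "SELECT COUNT(*) FROM trips",
    execute_sql=lambda sql: [(42,)],
)
```

In mock mode the generator and executor are stubbed like this, which is why the platform works without an `HF_TOKEN`.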
### Understanding Metrics
The platform computes several metrics for each evaluation:
- **Correctness (Exact)**: Binary score (0/1) for exact result match
- **Execution Success**: Binary score (0/1) for successful SQL execution
- **Result Match F1**: F1 score for partial result matching
- **Latency**: Response time in milliseconds
- **Readability**: Score based on SQL structure and formatting
- **Dialect Compliance**: Binary score (0/1) for successful SQL transpilation
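To illustrate the Result Match F1 metric, here is a minimal sketch that treats query results as sets of rows and gives partial credit for overlap. The actual computation in `src/scoring.py` may differ (e.g. in how duplicate rows or column order are handled).

```python
def result_match_f1(predicted_rows, reference_rows):
    """F1 over result rows: partial credit when only some rows match."""
    pred, ref = set(predicted_rows), set(reference_rows)
    if not pred or not ref:
        # Both empty is a perfect match; one empty is a total miss.
        return 1.0 if pred == ref else 0.0
    tp = len(pred & ref)                # rows present in both results
    precision = tp / len(pred)
    recall = tp / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Three predicted rows, two of which appear in the reference:
score = result_match_f1([(1,), (2,), (3,)], [(1,), (2,)])  # 0.8
```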
**Composite Score** combines all metrics with weights:
- Correctness: 40%
- Execution Success: 25%
- Result Match F1: 15%
- Dialect Compliance: 10%
- Readability: 5%
- Latency: 5%
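The weighted combination above amounts to a simple weighted sum. The sketch below assumes each metric has already been normalized to [0, 1] (with latency mapped so that faster responses score higher); the authoritative weights live in `config/metrics.yaml`.

```python
# Weights mirror the list above; source of truth is config/metrics.yaml.
WEIGHTS = {
    "correctness": 0.40,
    "execution_success": 0.25,
    "result_match_f1": 0.15,
    "dialect_compliance": 0.10,
    "readability": 0.05,
    "latency": 0.05,
}

def composite_score(metrics):
    """Weighted sum of metric scores, each assumed normalized to [0, 1]."""
    return sum(WEIGHTS[name] * metrics.get(name, 0.0) for name in WEIGHTS)

# A query that is correct and executes, but scores zero elsewhere:
partial = composite_score({"correctness": 1.0, "execution_success": 1.0})  # 0.65
```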
## βš™οΈ Configuration
### Adding New Models
Edit `config/models.yaml` to add new models:
```yaml
models:
  - name: "Your Model Name"
    provider: "huggingface"
    model_id: "your/model-id"
    params:
      max_new_tokens: 512
      temperature: 0.1
    description: "Description of your model"
```
### Adding New Datasets
1. Create a new folder under `tasks/` (e.g., `tasks/my_dataset/`)
2. Add three required files:
   - **`schema.sql`**: Database schema definition
   - **`loader.py`**: Database creation script
   - **`cases.yaml`**: Test cases with questions and reference SQL
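A minimal `cases.yaml` might look like the fragment below. The field names are illustrative; check an existing dataset under `tasks/` for the exact schema the evaluator expects.

```yaml
cases:
  - id: trip_count
    question: "How many taxi trips are in the dataset?"
    reference_sql: "SELECT COUNT(*) AS trip_count FROM trips"
```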
## 🀝 Contributing
### Adding New Features
1. Fork the repository
2. Create a feature branch
3. Implement your changes
4. Test thoroughly
5. Submit a pull request
### Testing
Run the test suite:
```bash
python run_tests.py
```
## πŸ“„ License
This project is licensed under the Apache-2.0 License.
## πŸ™ Acknowledgments
- Built with [Gradio](https://gradio.app/)
- SQL transpilation powered by [sqlglot](https://github.com/tobymao/sqlglot)
- Database execution using [DuckDB](https://duckdb.org/)
- Model APIs from [Hugging Face](https://huggingface.co/)
- Deployed on [Hugging Face Spaces](https://huggingface.co/spaces)