# Evaluation Tasks

This directory contains evaluation tasks organized by use case.

## Structure

```
tasks/
├── sql_generation/          # SQL generation tasks
│   └── nyc_taxi_small/      # NYC Taxi dataset
├── code_generation/         # Code generation tasks
│   ├── python_algorithms/   # Python algorithm tasks
│   └── go_algorithms/       # Go algorithm tasks
└── documentation/           # Documentation generation tasks
    ├── technical_docs/      # Technical documentation tasks
    └── api_documentation/   # API documentation tasks
```

## Use Cases

### 1. SQL Generation

- **Purpose**: Evaluate models on natural-language-to-SQL query generation
- **Datasets**: NYC Taxi Small
- **Dialects**: Presto, BigQuery, Snowflake
- **Metrics**: Correctness, execution success, result matching, dialect compliance

### 2. Code Generation

- **Purpose**: Evaluate models on natural-language-to-source-code generation
- **Languages**: Python, Go, JavaScript, Java
- **Datasets**: Algorithm implementations, web services, data structures
- **Metrics**: Syntax correctness, compilation success, execution success, code quality

### 3. Documentation Generation

- **Purpose**: Evaluate models on natural-language-to-technical-documentation generation
- **Formats**: Markdown, HTML, JSON, YAML
- **Datasets**: API docs, technical guides, installation instructions
- **Metrics**: Accuracy, completeness, clarity, format compliance

## Task Structure

Each task directory contains:

### Required Files

- `cases.yaml` - Test cases with questions and reference outputs
- `loader.py` - Data loading and test execution utilities
- `schema.sql` - Database schema (SQL tasks only)
- `test_data.json` - Test data for evaluation (code and documentation tasks only)

### Optional Files

- `README.md` - Task-specific documentation
- `requirements.txt` - Task-specific dependencies
- `config.yaml` - Task-specific configuration

## Adding New Tasks

1. Create a new directory under the appropriate use case
2. Add the required files (`cases.yaml`, `loader.py`)
3. Define test cases with questions and reference outputs
4. Implement data loading and evaluation logic
5. Update the main configuration files

## Evaluation Metrics

### SQL Generation

- **Correctness**: Exact match with the reference SQL
- **Execution Success**: SQL executes without errors
- **Result Matching**: F1 score comparing query results
- **Dialect Compliance**: Proper SQL transpilation
- **Readability**: SQL structure and formatting

### Code Generation

- **Syntax Correctness**: Code parses without syntax errors
- **Compilation Success**: Code builds successfully
- **Execution Success**: Code runs and produces the expected output
- **Code Quality**: Follows language best practices
- **Performance**: Code efficiency and optimization

### Documentation Generation

- **Accuracy**: Content matches the reference documentation
- **Completeness**: Covers all required information
- **Clarity**: Easy to understand and follow
- **Format Compliance**: Follows the specified documentation format
- **Technical Correctness**: Technically accurate information
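To make the "Result Matching" metric concrete, the sketch below shows one plausible way to score it: an F1 over the multiset of result rows, ignoring row order. The function name `result_f1` and its exact semantics are assumptions for illustration, not the evaluation harness's actual scoring code.

```python
from collections import Counter

def result_f1(predicted_rows, reference_rows):
    """F1 score between two query result sets.

    Rows are compared as a multiset: order is ignored (SQL results are
    unordered without ORDER BY), but duplicate rows are counted.
    NOTE: illustrative sketch only; the real harness may score differently.
    """
    pred = Counter(tuple(row) for row in predicted_rows)
    ref = Counter(tuple(row) for row in reference_rows)
    # Counter & Counter keeps the minimum count per row: the overlap.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `result_f1([(1, "a"), (2, "b")], [(1, "a"), (3, "c")])` gives precision 0.5 and recall 0.5, hence F1 0.5, while identical result sets score 1.0.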