# DATA3: Programming Problems Generation Dataset

## Dataset Overview

DATA3 is a large-scale programming problems generation dataset that contains AI-generated programming problems inspired by real scientific computing code snippets. The dataset consists of 22,532 programming problems, each paired with a comprehensive solution. The problems focus on scientific computing concepts such as numerical algorithms, data analysis, mathematical modeling, and computational methods in chemistry, biology, and physics.

## Dataset Statistics

- **Total Samples**: 22,532 programming problems
- **Total Data Size**: ~496 MB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8
- **Primary Language**: Python (dominant in source code)
- **Average Input Tokens**: ~697 tokens per prompt
- **Average Output Tokens**: ~5,378 tokens per response

## Dataset Structure

The dataset is stored in JSONL format, where each line contains a complete JSON object representing one programming problem with its solution.
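For illustration, a single record might look like the following sketch. All field values here are invented for demonstration and are not taken from the dataset; the field layout follows the description below.

```python
import json

# A hypothetical record illustrating the schema (values are invented).
record = {
    "metadata": {
        "original_index": "1024",
        "function_name": "compute_gradient",
        "repo_name": "",                   # may be empty
        "path": "",                        # may be empty
        "language": "Python",
        "relevance_score": 85,
        "function_start_line": "42",
        "function_end_line": "78",
    },
    "prompt": "Design a programming problem inspired by ...",
    "response": "## Problem Description\n...\n## Solution\n...",
    "usage": {
        "input_tokens": 697,
        "output_tokens": 5378,
        "total_tokens": 6075,
        "input_cost": 0.001,
        "output_cost": 0.01,
        "request_cost": 0.011,
    },
    "timestamp": "2024-01-01T00:00:00",
    "row_number": 0,
}

# Each dataset line is exactly one such object serialized as JSON;
# embedded newlines in string values are escaped, so the line stays flat.
line = json.dumps(record)
print(line[:40])
```

Because JSON escapes newlines inside string values, one record always occupies exactly one physical line of the file.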
### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `metadata` | Object | Metadata about the source code that inspired the problem |
| `metadata.original_index` | String | Original index of the source function |
| `metadata.function_name` | String | Name of the source function |
| `metadata.repo_name` | String | Repository name (may be empty) |
| `metadata.path` | String | File path (may be empty) |
| `metadata.language` | String | Programming language of the source code |
| `metadata.relevance_score` | Integer | Relevance score of the source function |
| `metadata.function_start_line` | String | Starting line number of the function |
| `metadata.function_end_line` | String | Ending line number of the function |
| `prompt` | String | The prompt used to generate the programming problem |
| `response` | String | Generated response containing the problem description and solution |
| `usage` | Object | API usage statistics for generation |
| `usage.input_tokens` | Integer | Number of input tokens used |
| `usage.output_tokens` | Integer | Number of output tokens generated |
| `usage.total_tokens` | Integer | Total tokens (input + output) |
| `usage.input_cost` | Float | Cost for input tokens |
| `usage.output_cost` | Float | Cost for output tokens |
| `usage.request_cost` | Float | Total cost for the request |
| `timestamp` | String | ISO-format timestamp of generation |
| `row_number` | Integer | Row number in the dataset |

### Response Structure

The `response` field contains a structured markdown document with two main sections:

1. **Problem Description**: A self-contained problem description that:
   - Provides all necessary context and background
   - Clearly states what needs to be implemented
   - Specifies input/output format and constraints
   - Explains domain-specific concepts
   - Does NOT directly reference the original code snippet
2. **Solution**: A comprehensive Python solution that:
   - Accurately solves the problem
   - Includes clear comments explaining the approach
   - Uses appropriate scientific computing libraries (numpy, scipy, etc.)
   - Is complete and runnable
   - Follows best practices for scientific computing

## Problem Categories

The programming problems in this dataset focus on scientific computing concepts:

- **Numerical Algorithms and Simulations**: Gradient descent, optimization, numerical integration
- **Data Analysis and Visualization**: Statistical analysis, plotting, data processing
- **Mathematical Modeling**: Linear regression, differential equations, statistical models
- **Scientific Data Processing**: Molecular data, biological data, chemical data processing
- **Computational Methods**: Methods in chemistry, biology, physics, and materials science

## Generation Process

The programming problems were generated through the following process:

1. **Source Code Selection**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Source code snippets were prepared with project context
3. **Prompt Engineering**: A structured prompt was used to guide the generation of programming problems
4. **Problem Generation**: AI models generated self-contained problems inspired by (but not directly copying) the source code
5. **Solution Generation**: Comprehensive solutions were generated for each problem
6. **Quality Control**: Problems and solutions were validated for correctness and completeness

### Key Characteristics

- **Self-Contained**: Each problem includes all necessary context without requiring the original code
- **Inspired, Not Copied**: Problems are inspired by source code but create new, interesting scenarios
- **Complete Solutions**: Every problem includes a working, well-commented solution
- **Domain-Specific**: Problems focus on scientific and technical domains
- **Code-Inspired**: Problems are generated from real scientific computing code snippets

## Usage Guidelines

### Data Loading

```python
import jsonlines

# Load the dataset
problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        problems.append(obj)

print(f"Total problems: {len(problems)}")
```

### Accessing Problem and Solution

```python
# Access a specific problem
problem = problems[0]

# Extract problem description and solution from the response
response = problem['response']

# The response contains markdown with "## Problem Description" and
# "## Solution" sections; parse it to extract the two parts separately.
```

### Extracting Problem Descriptions

```python
import re

def extract_problem_description(response):
    """Extract the problem description from a response."""
    # Capture everything between the Problem Description heading
    # and the Solution heading (or the end of the string).
    pattern = r'## Problem Description(.*?)(?=## Solution|$)'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

def extract_solution(response):
    """Extract the solution code from a response."""
    # Capture the first Python code block in the Solution section.
    pattern = r'## Solution.*?```python\s*(.*?)```'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

# Extract problem and solution for the first 5 problems
for problem in problems[:5]:
    problem_desc = extract_problem_description(problem['response'])
    solution = extract_solution(problem['response'])
    print(f"Problem: {problem['metadata']['function_name']}")
    print(f"Description length: {len(problem_desc) if problem_desc else 0} chars")
    print(f"Solution length: {len(solution) if solution else 0} chars")
```

### Filtering by Language

```python
# Filter problems based on source language
python_problems = [
    p for p in problems
    if p['metadata'].get('language', '').lower() == 'python'
]
print(f"Python-based problems: {len(python_problems)}")
```

### Filtering by Relevance Score

```python
# Keep only problems whose source function scored highly
high_relevance = [
    p for p in problems
    if p['metadata'].get('relevance_score', 0) >= 80
]
print(f"High-relevance problems: {len(high_relevance)}")
```

### Analyzing Token Usage

```python
# Aggregate API usage statistics across the dataset
total_input_tokens = sum(p['usage']['input_tokens'] for p in problems)
total_output_tokens = sum(p['usage']['output_tokens'] for p in problems)
total_cost = sum(p['usage']['request_cost'] for p in problems)

print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```

## Use Cases

This dataset is suitable for:

1. **Content Generation**: Creating programming exercises and problem sets
2. **Code-to-Problem Generation**: Training models to generate problems from code
3. **Problem-Solution Pairing**: Studying the relationship between problems and solutions
4. **Scientific Computing Education**: Teaching numerical methods and scientific programming
5. **Dataset Augmentation**: Expanding programming problem datasets
6. **Code Understanding**: Training models to understand code semantics through problem generation
7. **Automated Tutoring**: Building systems that generate practice problems

## Important Notes

1. **File Size**: The dataset file is moderately large (~496 MB); ensure sufficient memory when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; process the file line by line for memory efficiency
3. **Response Format**: The `response` field contains markdown-formatted text with problem and solution sections
4. **Code Extraction**: Solutions are embedded in markdown code blocks; parsing may be needed to extract clean code
5. **Metadata Completeness**: Some metadata fields (`repo_name`, `path`, `language`) may be empty for certain samples
6. **Problem Independence**: Each problem is self-contained and does not require the original source code
7. **Solution Correctness**: Solutions are AI-generated; validation may be needed for production use

## Data Processing Example

```python
import jsonlines
import re

def parse_problem_response(response):
    """Parse a response into a structured problem and solution."""
    # Extract the problem description between the two section headings
    problem_match = re.search(
        r'## Problem Description\s*\n(.*?)(?=\n## Solution|\Z)',
        response,
        re.DOTALL
    )
    problem_desc = problem_match.group(1).strip() if problem_match else None

    # Extract the first Python code block as the solution
    solution_match = re.search(
        r'```python\s*(.*?)```',
        response,
        re.DOTALL
    )
    solution_code = solution_match.group(1).strip() if solution_match else None

    return {
        'problem_description': problem_desc,
        'solution_code': solution_code
    }

# Process the dataset
processed_problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        parsed = parse_problem_response(obj['response'])
        processed_problems.append({
            'function_name': obj['metadata']['function_name'],
            'language': obj['metadata'].get('language', ''),
            'relevance_score': obj['metadata'].get('relevance_score', 0),
            'problem': parsed['problem_description'],
            'solution': parsed['solution_code'],
            'timestamp': obj['timestamp']
        })

print(f"Processed {len(processed_problems)} problems")
```
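Since the file is ~496 MB, the line-by-line processing recommended above can also be done with only the standard library, using a generator so that at most one record is in memory at a time. This is a minimal sketch; the filename and the two tiny sample records are invented for demonstration.

```python
import json

def iter_problems(path):
    """Yield one parsed record per JSONL line without loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)

# Demonstration with two tiny hypothetical records (invented values):
sample = [
    {"metadata": {"language": "Python", "relevance_score": 90}},
    {"metadata": {"language": "", "relevance_score": 40}},
]
with open('sample_problems.jsonl', 'w', encoding='utf-8') as f:
    for rec in sample:
        f.write(json.dumps(rec) + '\n')

# Single-pass filter: high-relevance Python problems
count = sum(
    1 for rec in iter_problems('sample_problems.jsonl')
    if rec['metadata'].get('language', '').lower() == 'python'
    and rec['metadata'].get('relevance_score', 0) >= 80
)
print(f"High-relevance Python problems: {count}")  # 1 for the sample above
```

The same generator works on the full `programming_problems.jsonl` file, keeping memory usage flat regardless of dataset size.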