# DATA3: Programming Problems Generation Dataset
## Dataset Overview
DATA3 is a large-scale dataset of AI-generated programming problems inspired by real scientific computing code snippets. It contains 22,532 problems, each paired with a comprehensive solution. The problems focus on scientific computing concepts such as numerical algorithms, data analysis, mathematical modeling, and computational methods in chemistry, biology, and physics.
## Dataset Statistics
- **Total Samples**: 22,532 programming problems
- **Total Data Size**: ~496 MB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8
- **Primary Language**: Python (dominant in source code)
- **Average Input Tokens**: ~697 tokens per prompt
- **Average Output Tokens**: ~5,378 tokens per response
## Dataset Structure
The dataset is stored in JSONL format, where each line contains a complete JSON object representing one programming problem with its solution.
### Data Field Description
Each JSON object contains the following fields:
| Field Name | Type | Description |
|------------|------|-------------|
| `metadata` | Object | Metadata about the source code that inspired the problem |
| `metadata.original_index` | String | Original index of the source function |
| `metadata.function_name` | String | Name of the source function |
| `metadata.repo_name` | String | Repository name (may be empty) |
| `metadata.path` | String | File path (may be empty) |
| `metadata.language` | String | Programming language of the source code |
| `metadata.relevance_score` | Integer | Relevance score of the source function |
| `metadata.function_start_line` | String | Starting line number of the function |
| `metadata.function_end_line` | String | Ending line number of the function |
| `prompt` | String | The prompt used to generate the programming problem |
| `response` | String | Generated response containing the problem description and solution |
| `usage` | Object | API usage statistics for generation |
| `usage.input_tokens` | Integer | Number of input tokens used |
| `usage.output_tokens` | Integer | Number of output tokens generated |
| `usage.total_tokens` | Integer | Total tokens (input + output) |
| `usage.input_cost` | Float | Cost for input tokens |
| `usage.output_cost` | Float | Cost for output tokens |
| `usage.request_cost` | Float | Total cost for the request |
| `timestamp` | String | ISO-format timestamp of generation |
| `row_number` | Integer | Row number in the dataset |
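As a quick orientation, the schema above can be sketched as a single record. All values in this sketch are invented for illustration; they are not taken from the dataset itself.

```python
# A hypothetical record matching the field table above.
# Every value is invented for illustration only.
sample = {
    "metadata": {
        "original_index": "12345",
        "function_name": "compute_gradient",
        "repo_name": "",              # may be empty
        "path": "",                   # may be empty
        "language": "Python",
        "relevance_score": 85,
        "function_start_line": "10",
        "function_end_line": "42",
    },
    "prompt": "<prompt text>",
    "response": "## Problem Description\n...\n## Solution\n...",
    "usage": {
        "input_tokens": 697,
        "output_tokens": 5378,
        "total_tokens": 6075,
        "input_cost": 0.001,
        "output_cost": 0.005,
        "request_cost": 0.006,
    },
    "timestamp": "2024-01-01T00:00:00",
    "row_number": 0,
}

# Sanity check implied by the schema: total = input + output.
assert sample["usage"]["total_tokens"] == (
    sample["usage"]["input_tokens"] + sample["usage"]["output_tokens"]
)
```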
### Response Structure
The `response` field contains a structured markdown document with two main sections:
1. **Problem Description**: A self-contained problem description that:
   - Provides all necessary context and background
   - Clearly states what needs to be implemented
   - Specifies input/output format and constraints
   - Explains domain-specific concepts
   - Does NOT directly reference the original code snippet
2. **Solution**: A comprehensive Python solution that:
   - Accurately solves the problem
   - Includes clear comments explaining the approach
   - Uses appropriate scientific computing libraries (numpy, scipy, etc.)
   - Is complete and runnable
   - Follows best practices for scientific computing
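The layout of a `response` value can be sketched as a string. The heading names (`## Problem Description`, `## Solution`) are an assumption consistent with the extraction helpers elsewhere in this README; verify them against your copy of the data.

```python
# A minimal sketch of the assumed markdown layout of `response`.
# A backtick fence is built programmatically so this example can live
# inside a markdown code block itself.
FENCE = "`" * 3

example_response = (
    "## Problem Description\n"
    "A self-contained task statement with background, input/output\n"
    "format, and constraints.\n\n"
    "## Solution\n"
    f"{FENCE}python\n"
    "# complete, runnable solution code\n"
    f"{FENCE}\n"
)

# Both assumed section headings are present.
assert "## Problem Description" in example_response
assert "## Solution" in example_response
```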
## Problem Categories
The programming problems in this dataset focus on scientific computing concepts:
- **Numerical Algorithms and Simulations**: Gradient descent, optimization, numerical integration
- **Data Analysis and Visualization**: Statistical analysis, plotting, data processing
- **Mathematical Modeling**: Linear regression, differential equations, statistical models
- **Scientific Data Processing**: Molecular, biological, and chemical data processing
- **Computational Methods**: Methods in chemistry, biology, physics, and materials science
## Generation Process
The programming problems were generated through the following process:
1. **Source Code Selection**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Source code snippets were prepared with project context
3. **Prompt Engineering**: A structured prompt was used to guide the generation of programming problems
4. **Problem Generation**: AI models generated self-contained problems inspired by (but not directly copying) the source code
5. **Solution Generation**: Comprehensive solutions were generated for each problem
6. **Quality Control**: Problems and solutions were validated for correctness and completeness
### Key Characteristics
- **Self-Contained**: Each problem includes all necessary context without requiring the original code
- **Inspired, Not Copied**: Problems are inspired by source code but create new, interesting scenarios
- **Complete Solutions**: Every problem includes a working, well-commented solution
- **Domain-Specific**: Problems focus on scientific and technical domains
- **Code-Inspired**: Problems are generated from real scientific computing code snippets
## Usage Guidelines
### Data Loading
```python
import jsonlines

# Load the full dataset into memory
problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        problems.append(obj)

print(f"Total problems: {len(problems)}")
```
### Accessing Problem and Solution
```python
# Access a specific problem
problem = problems[0]

# Extract problem description and solution from the response
response = problem['response']

# The response contains markdown with Problem Description and Solution
# sections; parse it to extract the problem and solution separately
```
### Extracting Problem Descriptions
```python
import re

def extract_problem_description(response):
    """Extract the problem description from a response."""
    # Look for the Problem Description section
    pattern = r'## Problem Description(.*?)(?=## Solution|$)'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

def extract_solution(response):
    """Extract solution code from a response."""
    # Look for code blocks in the Solution section
    pattern = r'## Solution.*?```python\s*(.*?)```'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

# Extract problem and solution
for problem in problems[:5]:  # First 5 problems
    problem_desc = extract_problem_description(problem['response'])
    solution = extract_solution(problem['response'])
    print(f"Problem: {problem['metadata']['function_name']}")
    print(f"Description length: {len(problem_desc) if problem_desc else 0} chars")
    print(f"Solution length: {len(solution) if solution else 0} chars")
```
### Filtering by Language
```python
# Filter problems based on source language
python_problems = [
    p for p in problems
    if p['metadata'].get('language', '').lower() == 'python'
]
print(f"Python-based problems: {len(python_problems)}")
```
### Filtering by Relevance Score
```python
# Filter high-relevance problems
high_relevance = [
    p for p in problems
    if p['metadata'].get('relevance_score', 0) >= 80
]
print(f"High-relevance problems: {len(high_relevance)}")
```
### Analyzing Token Usage
```python
# Analyze API usage statistics
total_input_tokens = sum(p['usage']['input_tokens'] for p in problems)
total_output_tokens = sum(p['usage']['output_tokens'] for p in problems)
total_cost = sum(p['usage']['request_cost'] for p in problems)
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```
## Use Cases
This dataset is suitable for:
1. **Content Generation**: Creating programming exercises and problem sets
2. **Code-to-Problem Generation**: Training models to generate problems from code
3. **Problem-Solution Pairing**: Studying the relationship between problems and solutions
4. **Scientific Computing Education**: Teaching numerical methods and scientific programming
5. **Dataset Augmentation**: Expanding programming problem datasets
6. **Code Understanding**: Training models to understand code semantics through problem generation
7. **Automated Tutoring**: Building systems that generate practice problems
## Important Notes
1. **File Size**: The dataset file is moderately large (~496 MB); ensure sufficient memory when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; process line by line for memory efficiency
3. **Response Format**: The `response` field contains markdown-formatted text with problem and solution sections
4. **Code Extraction**: Solutions are embedded in markdown code blocks; parsing may be needed to extract clean code
5. **Metadata Completeness**: Some metadata fields (`repo_name`, `path`, `language`) may be empty for certain samples
6. **Problem Independence**: Each problem is self-contained and does not require the original source code
7. **Solution Correctness**: Solutions are AI-generated; validation may be needed for production use
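For larger-than-memory workflows, the notes above suggest processing line by line. A minimal sketch using only the standard library (no `jsonlines` dependency); the throwaway demo file stands in for the real dataset:

```python
import json
import os
import tempfile

def iter_problems(path):
    """Yield one JSON record per line without loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Demonstrate on a tiny throwaway file (stand-in for the real dataset).
demo = tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False)
demo.write('{"row_number": 0}\n{"row_number": 1}\n')
demo.close()

rows = [rec['row_number'] for rec in iter_problems(demo.name)]
os.unlink(demo.name)
```

Because the generator holds only one record at a time, filters and aggregations can run over the full ~496 MB file with constant memory.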
## Data Processing Example
```python
import jsonlines
import re

def parse_problem_response(response):
    """Parse a response into a structured problem and solution."""
    # Extract the problem description
    problem_match = re.search(
        r'## Problem Description\s*\n(.*?)(?=\n## Solution|\Z)',
        response,
        re.DOTALL
    )
    problem_desc = problem_match.group(1).strip() if problem_match else None

    # Extract solution code, restricted to the Solution section so a code
    # block inside the problem description is not picked up by mistake
    solution_match = re.search(
        r'## Solution.*?```python\s*(.*?)```',
        response,
        re.DOTALL
    )
    solution_code = solution_match.group(1).strip() if solution_match else None

    return {
        'problem_description': problem_desc,
        'solution_code': solution_code
    }

# Process the dataset
processed_problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        parsed = parse_problem_response(obj['response'])
        processed_problems.append({
            'function_name': obj['metadata']['function_name'],
            'language': obj['metadata'].get('language', ''),
            'relevance_score': obj['metadata'].get('relevance_score', 0),
            'problem': parsed['problem_description'],
            'solution': parsed['solution_code'],
            'timestamp': obj['timestamp']
        })

print(f"Processed {len(processed_problems)} problems")
```