# DATA3: Programming Problems Generation Dataset

## Dataset Overview

DATA3 is a large-scale programming problems generation dataset that contains AI-generated programming problems inspired by real scientific computing code snippets. The dataset consists of 22,532 programming problems, each paired with a comprehensive solution. The problems focus on scientific computing concepts such as numerical algorithms, data analysis, mathematical modeling, and computational methods in chemistry, biology, and physics.

## Dataset Statistics

- **Total Samples**: 22,532 programming problems
- **Total Data Size**: ~496 MB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8
- **Primary Language**: Python (dominant in source code)
- **Average Input Tokens**: ~697 tokens per prompt
- **Average Output Tokens**: ~5,378 tokens per response

## Dataset Structure

The dataset is stored in JSONL format, where each line contains a complete JSON object representing one programming problem with its solution.
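For illustration, a single record might look like the following sketch. All field values here are invented for demonstration and are not taken from the dataset; the field layout follows the description below.

```python
import json

# A hypothetical record illustrating the schema (values are invented).
record = {
    "metadata": {
        "original_index": "1024",
        "function_name": "compute_gradient",
        "repo_name": "",                   # may be empty
        "path": "",                        # may be empty
        "language": "Python",
        "relevance_score": 85,
        "function_start_line": "42",
        "function_end_line": "78",
    },
    "prompt": "Design a programming problem inspired by ...",
    "response": "## Problem Description\n...\n## Solution\n...",
    "usage": {
        "input_tokens": 697,
        "output_tokens": 5378,
        "total_tokens": 6075,
        "input_cost": 0.001,
        "output_cost": 0.01,
        "request_cost": 0.011,
    },
    "timestamp": "2024-01-01T00:00:00",
    "row_number": 0,
}

# Each dataset line is exactly one such object serialized as JSON;
# embedded newlines in string values are escaped, so the line stays flat.
line = json.dumps(record)
print(line[:40])
```

Because JSON escapes newlines inside string values, one record always occupies exactly one physical line of the file.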
### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `metadata` | Object | Metadata about the source code that inspired the problem |
| `metadata.original_index` | String | Original index of the source function |
| `metadata.function_name` | String | Name of the source function |
| `metadata.repo_name` | String | Repository name (may be empty) |
| `metadata.path` | String | File path (may be empty) |
| `metadata.language` | String | Programming language of the source code |
| `metadata.relevance_score` | Integer | Relevance score of the source function |
| `metadata.function_start_line` | String | Starting line number of the function |
| `metadata.function_end_line` | String | Ending line number of the function |
| `prompt` | String | The prompt used to generate the programming problem |
| `response` | String | Generated response containing the problem description and solution |
| `usage` | Object | API usage statistics for generation |
| `usage.input_tokens` | Integer | Number of input tokens used |
| `usage.output_tokens` | Integer | Number of output tokens generated |
| `usage.total_tokens` | Integer | Total tokens (input + output) |
| `usage.input_cost` | Float | Cost for input tokens |
| `usage.output_cost` | Float | Cost for output tokens |
| `usage.request_cost` | Float | Total cost for the request |
| `timestamp` | String | ISO-format timestamp of generation |
| `row_number` | Integer | Row number in the dataset |

### Response Structure

The `response` field contains a structured markdown document with two main sections:

1. **Problem Description**: A self-contained problem description that:
   - Provides all necessary context and background
   - Clearly states what needs to be implemented
   - Specifies input/output format and constraints
   - Explains domain-specific concepts
   - Does NOT directly reference the original code snippet
2. **Solution**: A comprehensive Python solution that:
   - Accurately solves the problem
   - Includes clear comments explaining the approach
   - Uses appropriate scientific computing libraries (numpy, scipy, etc.)
   - Is complete and runnable
   - Follows best practices for scientific computing

## Problem Categories

The programming problems in this dataset focus on scientific computing concepts:

- **Numerical Algorithms and Simulations**: Gradient descent, optimization, numerical integration
- **Data Analysis and Visualization**: Statistical analysis, plotting, data processing
- **Mathematical Modeling**: Linear regression, differential equations, statistical models
- **Scientific Data Processing**: Molecular data, biological data, chemical data processing
- **Computational Methods**: Methods in chemistry, biology, physics, and materials science

## Generation Process

The programming problems were generated through the following process:

1. **Source Code Selection**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Source code snippets were prepared with project context
3. **Prompt Engineering**: A structured prompt was used to guide the generation of programming problems
4. **Problem Generation**: AI models generated self-contained problems inspired by (but not directly copying) the source code
5. **Solution Generation**: Comprehensive solutions were generated for each problem
6. **Quality Control**: Problems and solutions were validated for correctness and completeness

### Key Characteristics

- **Self-Contained**: Each problem includes all necessary context without requiring the original code
- **Inspired, Not Copied**: Problems are inspired by source code but create new, interesting scenarios
- **Complete Solutions**: Every problem includes a working, well-commented solution
- **Domain-Specific**: Problems focus on scientific and technical domains
- **Code-Inspired**: Problems are generated from real scientific computing code snippets

## Usage Guidelines

### Data Loading

```python
import jsonlines

# Load the dataset
problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        problems.append(obj)

print(f"Total problems: {len(problems)}")
```

### Accessing Problem and Solution

```python
# Access a specific problem
problem = problems[0]

# Extract problem description and solution from the response
response = problem['response']

# The response contains markdown with "## Problem Description" and
# "## Solution" sections; parse it to extract the two parts separately.
```

### Extracting Problem Descriptions

```python
import re

def extract_problem_description(response):
    """Extract the problem description from a response."""
    # Capture everything between the Problem Description heading
    # and the Solution heading (or the end of the string).
    pattern = r'## Problem Description(.*?)(?=## Solution|$)'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

def extract_solution(response):
    """Extract the solution code from a response."""
    # Capture the first Python code block in the Solution section.
    pattern = r'## Solution.*?```python\s*(.*?)```'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

# Extract problem and solution for the first 5 problems
for problem in problems[:5]:
    problem_desc = extract_problem_description(problem['response'])
    solution = extract_solution(problem['response'])
    print(f"Problem: {problem['metadata']['function_name']}")
    print(f"Description length: {len(problem_desc) if problem_desc else 0} chars")
    print(f"Solution length: {len(solution) if solution else 0} chars")
```

### Filtering by Language

```python
# Filter problems based on source language
python_problems = [
    p for p in problems
    if p['metadata'].get('language', '').lower() == 'python'
]
print(f"Python-based problems: {len(python_problems)}")
```

### Filtering by Relevance Score

```python
# Keep only problems whose source function scored highly
high_relevance = [
    p for p in problems
    if p['metadata'].get('relevance_score', 0) >= 80
]
print(f"High-relevance problems: {len(high_relevance)}")
```

### Analyzing Token Usage

```python
# Aggregate API usage statistics across the dataset
total_input_tokens = sum(p['usage']['input_tokens'] for p in problems)
total_output_tokens = sum(p['usage']['output_tokens'] for p in problems)
total_cost = sum(p['usage']['request_cost'] for p in problems)

print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```

## Use Cases

This dataset is suitable for:

1. **Content Generation**: Creating programming exercises and problem sets
2. **Code-to-Problem Generation**: Training models to generate problems from code
3. **Problem-Solution Pairing**: Studying the relationship between problems and solutions
4. **Scientific Computing Education**: Teaching numerical methods and scientific programming
5. **Dataset Augmentation**: Expanding programming problem datasets
6. **Code Understanding**: Training models to understand code semantics through problem generation
7. **Automated Tutoring**: Building systems that generate practice problems

## Important Notes

1. **File Size**: The dataset file is moderately large (~496 MB); ensure sufficient memory when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; process the file line by line for memory efficiency
3. **Response Format**: The `response` field contains markdown-formatted text with problem and solution sections
4. **Code Extraction**: Solutions are embedded in markdown code blocks; parsing may be needed to extract clean code
5. **Metadata Completeness**: Some metadata fields (`repo_name`, `path`, `language`) may be empty for certain samples
6. **Problem Independence**: Each problem is self-contained and does not require the original source code
7. **Solution Correctness**: Solutions are AI-generated; validation may be needed for production use

## Data Processing Example

```python
import jsonlines
import re

def parse_problem_response(response):
    """Parse a response into a structured problem and solution."""
    # Extract the problem description between the two section headings
    problem_match = re.search(
        r'## Problem Description\s*\n(.*?)(?=\n## Solution|\Z)',
        response,
        re.DOTALL
    )
    problem_desc = problem_match.group(1).strip() if problem_match else None

    # Extract the first Python code block as the solution
    solution_match = re.search(
        r'```python\s*(.*?)```',
        response,
        re.DOTALL
    )
    solution_code = solution_match.group(1).strip() if solution_match else None

    return {
        'problem_description': problem_desc,
        'solution_code': solution_code
    }

# Process the dataset
processed_problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        parsed = parse_problem_response(obj['response'])
        processed_problems.append({
            'function_name': obj['metadata']['function_name'],
            'language': obj['metadata'].get('language', ''),
            'relevance_score': obj['metadata'].get('relevance_score', 0),
            'problem': parsed['problem_description'],
            'solution': parsed['solution_code'],
            'timestamp': obj['timestamp']
        })

print(f"Processed {len(processed_problems)} problems")
```
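Since the file is ~496 MB, the line-by-line processing recommended above can also be done with only the standard library, using a generator so that at most one record is in memory at a time. This is a minimal sketch; the filename and the two tiny sample records are invented for demonstration.

```python
import json

def iter_problems(path):
    """Yield one parsed record per JSONL line without loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)

# Demonstration with two tiny hypothetical records (invented values):
sample = [
    {"metadata": {"language": "Python", "relevance_score": 90}},
    {"metadata": {"language": "", "relevance_score": 40}},
]
with open('sample_problems.jsonl', 'w', encoding='utf-8') as f:
    for rec in sample:
        f.write(json.dumps(rec) + '\n')

# Single-pass filter: high-relevance Python problems
count = sum(
    1 for rec in iter_problems('sample_problems.jsonl')
    if rec['metadata'].get('language', '').lower() == 'python'
    and rec['metadata'].get('relevance_score', 0) >= 80
)
print(f"High-relevance Python problems: {count}")  # 1 for the sample above
```

The same generator works on the full `programming_problems.jsonl` file, keeping memory usage flat regardless of dataset size.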