# DATA3: Programming Problems Generation Dataset
## Dataset Overview
DATA3 is a large-scale dataset of AI-generated programming problems inspired by real scientific computing code snippets. It contains 22,532 problems, each paired with a comprehensive solution. The problems focus on scientific computing concepts such as numerical algorithms, data analysis, mathematical modeling, and computational methods in chemistry, biology, and physics.
## Dataset Statistics
- **Total Samples**: 22,532 programming problems
- **Total Data Size**: ~496 MB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8
- **Primary Language**: Python (dominant in source code)
- **Average Input Tokens**: ~697 tokens per prompt
- **Average Output Tokens**: ~5,378 tokens per response
## Dataset Structure
The dataset is stored in JSONL format, where each line contains a complete JSON object representing one programming problem with its solution.
### Data Field Description
Each JSON object contains the following fields:
| Field Name | Type | Description |
|------------|------|-------------|
| `metadata` | Object | Metadata about the source code that inspired the problem |
| `metadata.original_index` | String | Original index of the source function |
| `metadata.function_name` | String | Name of the source function |
| `metadata.repo_name` | String | Repository name (may be empty) |
| `metadata.path` | String | File path (may be empty) |
| `metadata.language` | String | Programming language of the source code |
| `metadata.relevance_score` | Integer | Relevance score of the source function |
| `metadata.function_start_line` | String | Starting line number of the function |
| `metadata.function_end_line` | String | Ending line number of the function |
| `prompt` | String | The prompt used to generate the programming problem |
| `response` | String | Generated response containing the problem description and solution |
| `usage` | Object | API usage statistics for generation |
| `usage.input_tokens` | Integer | Number of input tokens used |
| `usage.output_tokens` | Integer | Number of output tokens generated |
| `usage.total_tokens` | Integer | Total tokens (input + output) |
| `usage.input_cost` | Float | Cost for input tokens |
| `usage.output_cost` | Float | Cost for output tokens |
| `usage.request_cost` | Float | Total cost for the request |
| `timestamp` | String | ISO-format timestamp of generation |
| `row_number` | Integer | Row number in the dataset |
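As a quick orientation, the schema above can be sketched as a single record. All values in this sketch are invented for illustration; they are not taken from the dataset itself.

```python
# A hypothetical record matching the field table above.
# Every value is invented for illustration only.
sample = {
    "metadata": {
        "original_index": "12345",
        "function_name": "compute_gradient",
        "repo_name": "",              # may be empty
        "path": "",                   # may be empty
        "language": "Python",
        "relevance_score": 85,
        "function_start_line": "10",
        "function_end_line": "42",
    },
    "prompt": "<prompt text>",
    "response": "## Problem Description\n...\n## Solution\n...",
    "usage": {
        "input_tokens": 697,
        "output_tokens": 5378,
        "total_tokens": 6075,
        "input_cost": 0.001,
        "output_cost": 0.005,
        "request_cost": 0.006,
    },
    "timestamp": "2024-01-01T00:00:00",
    "row_number": 0,
}

# Sanity check implied by the schema: total = input + output.
assert sample["usage"]["total_tokens"] == (
    sample["usage"]["input_tokens"] + sample["usage"]["output_tokens"]
)
```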
### Response Structure
The `response` field contains a structured markdown document with two main sections:
1. **Problem Description**: A self-contained problem description that:
   - Provides all necessary context and background
   - Clearly states what needs to be implemented
   - Specifies input/output format and constraints
   - Explains domain-specific concepts
   - Does NOT directly reference the original code snippet
2. **Solution**: A comprehensive Python solution that:
   - Accurately solves the problem
   - Includes clear comments explaining the approach
   - Uses appropriate scientific computing libraries (numpy, scipy, etc.)
   - Is complete and runnable
   - Follows best practices for scientific computing
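The layout of a `response` value can be sketched as a string. The heading names (`## Problem Description`, `## Solution`) are an assumption consistent with the extraction helpers elsewhere in this README; verify them against your copy of the data.

```python
# A minimal sketch of the assumed markdown layout of `response`.
# A backtick fence is built programmatically so this example can live
# inside a markdown code block itself.
FENCE = "`" * 3

example_response = (
    "## Problem Description\n"
    "A self-contained task statement with background, input/output\n"
    "format, and constraints.\n\n"
    "## Solution\n"
    f"{FENCE}python\n"
    "# complete, runnable solution code\n"
    f"{FENCE}\n"
)

# Both assumed section headings are present.
assert "## Problem Description" in example_response
assert "## Solution" in example_response
```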
## Problem Categories
The programming problems in this dataset focus on scientific computing concepts:
- **Numerical Algorithms and Simulations**: Gradient descent, optimization, numerical integration
- **Data Analysis and Visualization**: Statistical analysis, plotting, data processing
- **Mathematical Modeling**: Linear regression, differential equations, statistical models
- **Scientific Data Processing**: Molecular, biological, and chemical data processing
- **Computational Methods**: Methods in chemistry, biology, physics, and materials science
## Generation Process
The programming problems were generated through the following process:
1. **Source Code Selection**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Source code snippets were prepared with project context
3. **Prompt Engineering**: A structured prompt was used to guide the generation of programming problems
4. **Problem Generation**: AI models generated self-contained problems inspired by (but not directly copying) the source code
5. **Solution Generation**: Comprehensive solutions were generated for each problem
6. **Quality Control**: Problems and solutions were validated for correctness and completeness
### Key Characteristics
- **Self-Contained**: Each problem includes all necessary context without requiring the original code
- **Inspired, Not Copied**: Problems are inspired by source code but create new, interesting scenarios
- **Complete Solutions**: Every problem includes a working, well-commented solution
- **Domain-Specific**: Problems focus on scientific and technical domains
- **Code-Inspired**: Problems are generated from real scientific computing code snippets
## Usage Guidelines
### Data Loading
```python
import jsonlines

# Load the full dataset into memory
problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        problems.append(obj)

print(f"Total problems: {len(problems)}")
```
### Accessing Problem and Solution
```python
# Access a specific problem
problem = problems[0]

# Extract problem description and solution from the response
response = problem['response']

# The response contains markdown with Problem Description and Solution
# sections; parse it to extract the problem and solution separately
```
### Extracting Problem Descriptions
```python
import re

def extract_problem_description(response):
    """Extract the problem description from a response."""
    # Look for the Problem Description section
    pattern = r'## Problem Description(.*?)(?=## Solution|$)'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

def extract_solution(response):
    """Extract solution code from a response."""
    # Look for code blocks in the Solution section
    pattern = r'## Solution.*?```python\s*(.*?)```'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

# Extract problem and solution
for problem in problems[:5]:  # First 5 problems
    problem_desc = extract_problem_description(problem['response'])
    solution = extract_solution(problem['response'])
    print(f"Problem: {problem['metadata']['function_name']}")
    print(f"Description length: {len(problem_desc) if problem_desc else 0} chars")
    print(f"Solution length: {len(solution) if solution else 0} chars")
```
### Filtering by Language
```python
# Filter problems based on source language
python_problems = [
    p for p in problems
    if p['metadata'].get('language', '').lower() == 'python'
]
print(f"Python-based problems: {len(python_problems)}")
```
### Filtering by Relevance Score
```python
# Filter high-relevance problems
high_relevance = [
    p for p in problems
    if p['metadata'].get('relevance_score', 0) >= 80
]
print(f"High-relevance problems: {len(high_relevance)}")
```
### Analyzing Token Usage
```python
# Analyze API usage statistics
total_input_tokens = sum(p['usage']['input_tokens'] for p in problems)
total_output_tokens = sum(p['usage']['output_tokens'] for p in problems)
total_cost = sum(p['usage']['request_cost'] for p in problems)
print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```
## Use Cases
This dataset is suitable for:
1. **Content Generation**: Creating programming exercises and problem sets
2. **Code-to-Problem Generation**: Training models to generate problems from code
3. **Problem-Solution Pairing**: Studying the relationship between problems and solutions
4. **Scientific Computing Education**: Teaching numerical methods and scientific programming
5. **Dataset Augmentation**: Expanding programming problem datasets
6. **Code Understanding**: Training models to understand code semantics through problem generation
7. **Automated Tutoring**: Building systems that generate practice problems
## Important Notes
1. **File Size**: The dataset file is moderately large (~496 MB); ensure sufficient memory when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; process line by line for memory efficiency
3. **Response Format**: The `response` field contains markdown-formatted text with problem and solution sections
4. **Code Extraction**: Solutions are embedded in markdown code blocks; parsing may be needed to extract clean code
5. **Metadata Completeness**: Some metadata fields (`repo_name`, `path`, `language`) may be empty for certain samples
6. **Problem Independence**: Each problem is self-contained and does not require the original source code
7. **Solution Correctness**: Solutions are AI-generated; validation may be needed for production use
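For larger-than-memory workflows, the notes above suggest processing line by line. A minimal sketch using only the standard library (no `jsonlines` dependency); the throwaway demo file stands in for the real dataset:

```python
import json
import os
import tempfile

def iter_problems(path):
    """Yield one JSON record per line without loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Demonstrate on a tiny throwaway file (stand-in for the real dataset).
demo = tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False)
demo.write('{"row_number": 0}\n{"row_number": 1}\n')
demo.close()

rows = [rec['row_number'] for rec in iter_problems(demo.name)]
os.unlink(demo.name)
```

Because the generator holds only one record at a time, filters and aggregations can run over the full ~496 MB file with constant memory.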
## Data Processing Example
```python
import jsonlines
import re

def parse_problem_response(response):
    """Parse a response into a structured problem and solution."""
    # Extract the problem description
    problem_match = re.search(
        r'## Problem Description\s*\n(.*?)(?=\n## Solution|\Z)',
        response,
        re.DOTALL
    )
    problem_desc = problem_match.group(1).strip() if problem_match else None

    # Extract solution code, restricted to the Solution section so a code
    # block inside the problem description is not picked up by mistake
    solution_match = re.search(
        r'## Solution.*?```python\s*(.*?)```',
        response,
        re.DOTALL
    )
    solution_code = solution_match.group(1).strip() if solution_match else None

    return {
        'problem_description': problem_desc,
        'solution_code': solution_code
    }

# Process the dataset
processed_problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        parsed = parse_problem_response(obj['response'])
        processed_problems.append({
            'function_name': obj['metadata']['function_name'],
            'language': obj['metadata'].get('language', ''),
            'relevance_score': obj['metadata'].get('relevance_score', 0),
            'problem': parsed['problem_description'],
            'solution': parsed['solution_code'],
            'timestamp': obj['timestamp']
        })

print(f"Processed {len(processed_problems)} problems")
```