# DATA2: Code-Documentation Alignment Dataset

## Dataset Overview

DATA2 is a large-scale code-documentation alignment dataset that pairs function-level code samples with AI-generated documentation strings (docstrings). The dataset contains 500,000 function-level code samples extracted from domain-specific repositories, each paired with a comprehensive docstring generated using Google's Gemini model. This dataset is designed for training and evaluating code documentation generation models, code understanding systems, and documentation quality assessment tools.

## Dataset Statistics

- **Total Samples**: 500,000 function-level code samples
- **Total Data Size**: ~2.9 GB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8

## Dataset Structure
The dataset is stored in JSONL format, where each line contains a complete JSON object representing one function sample with its associated documentation.

### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `language` | String | Programming language of the code (e.g., "python", "java", "rust", "cpp") |
| `name` | String | Function/method name |
| `qualified_name` | String | Fully qualified name of the function (e.g., "ClassName.method_name") |
| `file` | String | Absolute file path in the source repository |
| `start_line` | Integer | Starting line number of the function in the source file |
| `end_line` | Integer | Ending line number of the function in the source file |
| `score` | Float | Relevance score for the function (0.0 to 1.0) |
| `md_summary` | String | Markdown-formatted project summary/README content |
| `md_score` | Float | Quality score for the project summary (0.0 to 1.0) |
| `final_score` | Float | Combined final score (score × md_score) |
| `code_content` | String | Complete function code content (from start_line to end_line) |
| `results` | Object | Documentation generation results, containing: |
| `results.idx` | Integer | Index of the sample in the generation queue |
| `results.status` | String | Generation status: "ok" (success), "error" (failed), or "stopped" |
| `results.output` | String | Generated docstring/documentation (in code block format) |
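As an illustration, a single line of the JSONL file could look like the record below. All field values here are hypothetical, shown as a Python dict before serialization; only the field names and types come from the table above.

```python
import json

# Hypothetical example of one record; every value is illustrative only.
sample = {
    "language": "python",
    "name": "normalize_counts",
    "qualified_name": "ExpressionMatrix.normalize_counts",
    "file": "/repo/src/expression.py",
    "start_line": 42,
    "end_line": 57,
    "score": 0.82,
    "md_summary": "# ExpressionMatrix\nTools for RNA-seq count normalization.",
    "md_score": 0.6,
    "final_score": 0.82 * 0.6,  # combined score = score × md_score
    "code_content": "def normalize_counts(self, counts):\n    ...",
    "results": {
        "idx": 0,
        "status": "ok",
        "output": "```python\n\"\"\"Normalize raw counts...\"\"\"\n```",
    },
}

# Each record occupies exactly one line of the JSONL file.
line = json.dumps(sample)
print(line[:60])
```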
### Programming Language Distribution

Based on a sample analysis, the dataset is primarily composed of:

- **Python**: ~90.6% (dominant language)
- **Java**: ~5.2%
- **Rust**: ~2.5%
- **C++**: ~1.3%
- **C**: ~0.5%
- **Go**: <0.1%
- **Other languages**: <0.1%
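A distribution like the one above can be recomputed with a single pass over the parsed records. The sketch below assumes the records have already been loaded into dicts; the `demo` list is a tiny stand-in for the real 500,000-sample dataset.

```python
from collections import Counter

def language_distribution(samples):
    """Return the percentage share of each `language` value, largest first."""
    counts = Counter(s["language"] for s in samples)
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}

# Tiny illustrative input; replace with the full parsed dataset in practice.
demo = [
    {"language": "python"},
    {"language": "python"},
    {"language": "java"},
    {"language": "rust"},
]
print(language_distribution(demo))
```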
## Documentation Generation Process

The documentation strings in this dataset were generated with an LLM (Google's Gemini) through the following process:

1. **Function Extraction**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Each function was paired with its project's README/summary for context
3. **Prompt Engineering**: A structured prompt was used to guide the model in generating comprehensive docstrings
4. **Generation**: The LLM generated detailed docstrings following Python docstring conventions
5. **Quality Control**: Generated documentation was validated and aligned with the original code
### Documentation Format

The generated docstrings follow a structured format including:

- **Function Purpose**: Clear explanation of what the function does
- **Parameters**: Detailed parameter descriptions with types and meanings
- **Return Values**: Return type and value descriptions
- **Side Effects**: Important side effects or state changes
- **Exceptions**: Potential exceptions and error conditions
- **Assumptions**: Constraints and assumptions about inputs
- **Notes**: Additional context and implementation details
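Under these conventions, a generated docstring might resemble the hypothetical example below. The function it documents and all of its details are invented; only the section structure follows the list above.

```python
# Hypothetical docstring illustrating the structured sections listed above.
example_docstring = '''
Compute the molar mass of a chemical formula.

Parameters:
    formula (str): Chemical formula, e.g. "H2O" or "C6H12O6".

Returns:
    float: Molar mass in g/mol.

Side Effects:
    None; the function is pure.

Exceptions:
    ValueError: If the formula contains an unknown element symbol.

Assumptions:
    The formula uses standard element symbols and integer counts.

Notes:
    Atomic masses are taken from a built-in lookup table.
'''

# Verify that every expected section header appears.
for section in ("Parameters:", "Returns:", "Side Effects:",
                "Exceptions:", "Assumptions:", "Notes:"):
    assert section in example_docstring
print("all sections present")
```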
## Data Source

The dataset is derived from domain-specific code repositories, specifically:

- **Source**: GitHub repositories filtered from a large-scale domain-specific code collection
- **Selection Criteria**: Functions were selected based on:
  - Relevance scores (function-level and project-level)
  - Code quality indicators
  - Domain specificity
- **Coverage**: Functions span multiple domains including biology, chemistry, materials science, medicine, and computational methods

## Dataset Characteristics

1. **High-Quality Documentation**: Each function is paired with comprehensive, AI-generated documentation that follows professional standards
2. **Rich Context**: Documentation is generated with access to both the function code and project-level context (README summaries)
3. **Diverse Code Types**: Covers various programming languages and coding styles
4. **Domain-Specific**: Focuses on scientific and technical domains, providing specialized terminology and use cases
5. **Structured Format**: Consistent JSONL format enables easy parsing and batch processing
6. **Complete Metadata**: Includes file paths, line numbers, and scoring information for traceability
## Usage Guidelines

### Data Loading

```python
import jsonlines

# Load the entire dataset into memory
samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        samples.append(obj)
print(f"Total samples: {len(samples)}")
```
### Accessing Code and Documentation

```python
# Extract code and documentation for a sample
sample = samples[0]
code = sample['code_content']
function_name = sample['name']
language = sample['language']

# Access the generated documentation
if sample['results']['status'] == 'ok':
    docstring = sample['results']['output']
    print(f"Function: {function_name}")
    print(f"Documentation:\n{docstring}")
```
### Filtering by Language

```python
# Keep only Python functions with successfully generated documentation
python_samples = [
    s for s in samples
    if s['language'] == 'python' and s['results']['status'] == 'ok'
]
print(f"Python samples with documentation: {len(python_samples)}")
```
### Filtering by Quality Score

```python
# Keep only high-quality samples with successful documentation
high_quality = [
    s for s in samples
    if s['final_score'] > 0.15 and s['results']['status'] == 'ok'
]
print(f"High-quality samples: {len(high_quality)}")
```
### Extracting Documentation Only

```python
# Extract all successfully generated documentation strings
documentations = []
for sample in samples:
    if sample['results']['status'] == 'ok':
        doc = {
            'function_name': sample['name'],
            'qualified_name': sample['qualified_name'],
            'language': sample['language'],
            'code': sample['code_content'],
            'docstring': sample['results']['output'],
        }
        documentations.append(doc)
```
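If the extracted pairs are needed by downstream tooling, they can be written back out as JSONL using only the standard library. The helper and output filename below are illustrative, not part of the dataset's tooling.

```python
import json

def write_jsonl(records, path):
    """Write a list of dicts to a JSONL file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Two hypothetical extracted records standing in for `documentations`.
records = [
    {"function_name": "f", "language": "python", "docstring": "Does f."},
    {"function_name": "g", "language": "java", "docstring": "Does g."},
]
write_jsonl(records, "docstrings.jsonl")

with open("docstrings.jsonl", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))  # one line per record
```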
## Use Cases

This dataset is suitable for:

1. **Code Documentation Generation**: Training models to generate docstrings from code
2. **Documentation Quality Assessment**: Evaluating the quality of generated documentation
3. **Code Understanding**: Training models to understand code semantics
4. **Documentation Completion**: Fine-tuning models for automatic documentation generation
5. **Code-to-Documentation Alignment**: Studying the relationship between code and documentation
6. **Domain-Specific NLP**: Training models on scientific and technical terminology
## Important Notes

1. **File Size**: The dataset file is large (~2.9 GB); ensure sufficient memory and storage when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; the file can be processed line by line for memory efficiency
3. **Status Field**: Always check `results.status` before using `results.output`; only an "ok" status indicates successful generation
4. **Code Content**: The `code_content` field contains the complete function code, which may include long implementations
5. **Documentation Format**: Generated documentation is wrapped in a markdown code block (```python ... ```); you may need to extract the content
6. **Context Dependency**: Documentation quality may vary with the availability and quality of project README summaries
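Notes 1–3 suggest streaming the file rather than loading it fully. A minimal standard-library sketch is below; the `demo.jsonl` file is created only to make the example self-contained, and in practice the path would be `alignment.jsonl`.

```python
import json

def iter_ok_samples(path):
    """Yield parsed records with status 'ok', one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            if obj["results"]["status"] == "ok":
                yield obj

# Small demo file standing in for alignment.jsonl.
demo = [
    {"name": "f", "results": {"status": "ok", "output": "doc"}},
    {"name": "g", "results": {"status": "error", "output": ""}},
]
with open("demo.jsonl", "w", encoding="utf-8") as f:
    for rec in demo:
        f.write(json.dumps(rec) + "\n")

ok = list(iter_ok_samples("demo.jsonl"))
print(len(ok))  # only the 'ok' record survives
```

Because the generator never materializes the whole file, peak memory stays roughly one record regardless of dataset size.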
## Data Processing Example

```python
import jsonlines
import re

def extract_docstring_content(docstring_block):
    """Extract docstring content from a markdown code block."""
    # Remove the markdown code block markers, if present
    pattern = r'```(?:python|code)?\s*(.*?)```'
    match = re.search(pattern, docstring_block, re.DOTALL)
    if match:
        return match.group(1).strip()
    return docstring_block.strip()

# Process the dataset and extract clean docstrings
processed_samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        if obj['results']['status'] == 'ok':
            clean_docstring = extract_docstring_content(obj['results']['output'])
            processed_samples.append({
                'function': obj['name'],
                'code': obj['code_content'],
                'docstring': clean_docstring,
                'language': obj['language'],
            })
```
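A quick standalone check of the extraction helper on a hypothetical `results.output` value (the helper is repeated here so the snippet runs on its own):

```python
import re

def extract_docstring_content(docstring_block):
    """Extract docstring content from a markdown code block."""
    pattern = r'```(?:python|code)?\s*(.*?)```'
    match = re.search(pattern, docstring_block, re.DOTALL)
    if match:
        return match.group(1).strip()
    return docstring_block.strip()

# Hypothetical generated output, wrapped in a markdown code block.
raw = "```python\n\"\"\"Return the sum of two numbers.\"\"\"\n```"
print(extract_docstring_content(raw))  # → """Return the sum of two numbers."""

# Output without code-block markers is passed through unchanged (just stripped).
print(extract_docstring_content("  plain docstring  "))  # → plain docstring
```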