# DATA2: Code-Documentation Alignment Dataset

## Dataset Overview

DATA2 is a large-scale code-documentation alignment dataset that pairs function-level code samples with AI-generated documentation strings (docstrings). The dataset contains 500,000 function-level code samples extracted from domain-specific repositories, each paired with a comprehensive docstring generated using Google's Gemini model. This dataset is designed for training and evaluating code documentation generation models, code understanding systems, and documentation quality assessment tools.

## Dataset Statistics

- **Total Samples**: 500,000 function-level code samples
- **Total Data Size**: ~2.9 GB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8

## Dataset Structure
The dataset is stored in JSONL format, where each line contains a complete JSON object representing one function sample with its associated documentation.

### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `language` | String | Programming language of the code (e.g., "python", "java", "rust", "cpp") |
| `name` | String | Function/method name |
| `qualified_name` | String | Fully qualified name of the function (e.g., "ClassName.method_name") |
| `file` | String | Absolute file path in the source repository |
| `start_line` | Integer | Starting line number of the function in the source file |
| `end_line` | Integer | Ending line number of the function in the source file |
| `score` | Float | Relevance score for the function (0.0 to 1.0) |
| `md_summary` | String | Markdown-formatted project summary/README content |
| `md_score` | Float | Quality score for the project summary (0.0 to 1.0) |
| `final_score` | Float | Combined final score (score × md_score) |
| `code_content` | String | Complete function code content (from start_line to end_line) |
| `results` | Object | Documentation generation results, containing: |
| `results.idx` | Integer | Index of the sample in the generation queue |
| `results.status` | String | Generation status: "ok" (success), "error" (failed), or "stopped" |
| `results.output` | String | Generated docstring/documentation (in code block format) |
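As an illustration, a single line of the JSONL file could look like the record below. All field values here are hypothetical, shown as a Python dict before serialization; only the field names and types come from the table above.

```python
import json

# Hypothetical example of one record; every value is illustrative only.
sample = {
    "language": "python",
    "name": "normalize_counts",
    "qualified_name": "ExpressionMatrix.normalize_counts",
    "file": "/repo/src/expression.py",
    "start_line": 42,
    "end_line": 57,
    "score": 0.82,
    "md_summary": "# ExpressionMatrix\nTools for RNA-seq count normalization.",
    "md_score": 0.6,
    "final_score": 0.82 * 0.6,  # combined score = score × md_score
    "code_content": "def normalize_counts(self, counts):\n    ...",
    "results": {
        "idx": 0,
        "status": "ok",
        "output": "```python\n\"\"\"Normalize raw counts...\"\"\"\n```",
    },
}

# Each record occupies exactly one line of the JSONL file.
line = json.dumps(sample)
print(line[:60])
```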
### Programming Language Distribution

Based on a sample analysis, the dataset is primarily composed of:

- **Python**: ~90.6% (dominant language)
- **Java**: ~5.2%
- **Rust**: ~2.5%
- **C++**: ~1.3%
- **C**: ~0.5%
- **Go**: <0.1%
- **Other languages**: <0.1%
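A distribution like the one above can be recomputed with a single pass over the parsed records. The sketch below assumes the records have already been loaded into dicts; the `demo` list is a tiny stand-in for the real 500,000-sample dataset.

```python
from collections import Counter

def language_distribution(samples):
    """Return the percentage share of each `language` value, largest first."""
    counts = Counter(s["language"] for s in samples)
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}

# Tiny illustrative input; replace with the full parsed dataset in practice.
demo = [
    {"language": "python"},
    {"language": "python"},
    {"language": "java"},
    {"language": "rust"},
]
print(language_distribution(demo))
```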
## Documentation Generation Process

The documentation strings in this dataset were generated with an LLM (Google's Gemini) through the following process:

1. **Function Extraction**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Each function was paired with its project's README/summary for context
3. **Prompt Engineering**: A structured prompt was used to guide the model in generating comprehensive docstrings
4. **Generation**: The LLM generated detailed docstrings following Python docstring conventions
5. **Quality Control**: Generated documentation was validated and aligned with the original code
### Documentation Format

The generated docstrings follow a structured format including:

- **Function Purpose**: Clear explanation of what the function does
- **Parameters**: Detailed parameter descriptions with types and meanings
- **Return Values**: Return type and value descriptions
- **Side Effects**: Important side effects or state changes
- **Exceptions**: Potential exceptions and error conditions
- **Assumptions**: Constraints and assumptions about inputs
- **Notes**: Additional context and implementation details
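Under these conventions, a generated docstring might resemble the hypothetical example below. The function it documents and all of its details are invented; only the section structure follows the list above.

```python
# Hypothetical docstring illustrating the structured sections listed above.
example_docstring = '''
Compute the molar mass of a chemical formula.

Parameters:
    formula (str): Chemical formula, e.g. "H2O" or "C6H12O6".

Returns:
    float: Molar mass in g/mol.

Side Effects:
    None; the function is pure.

Exceptions:
    ValueError: If the formula contains an unknown element symbol.

Assumptions:
    The formula uses standard element symbols and integer counts.

Notes:
    Atomic masses are taken from a built-in lookup table.
'''

# Verify that every expected section header appears.
for section in ("Parameters:", "Returns:", "Side Effects:",
                "Exceptions:", "Assumptions:", "Notes:"):
    assert section in example_docstring
print("all sections present")
```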
## Data Source

The dataset is derived from domain-specific code repositories, specifically:

- **Source**: GitHub repositories filtered from a large-scale domain-specific code collection
- **Selection Criteria**: Functions were selected based on:
  - Relevance scores (function-level and project-level)
  - Code quality indicators
  - Domain specificity
- **Coverage**: Functions span multiple domains including biology, chemistry, materials science, medicine, and computational methods

## Dataset Characteristics

1. **High-Quality Documentation**: Each function is paired with comprehensive, AI-generated documentation that follows professional standards
2. **Rich Context**: Documentation is generated with access to both the function code and project-level context (README summaries)
3. **Diverse Code Types**: Covers various programming languages and coding styles
4. **Domain-Specific**: Focuses on scientific and technical domains, providing specialized terminology and use cases
5. **Structured Format**: Consistent JSONL format enables easy parsing and batch processing
6. **Complete Metadata**: Includes file paths, line numbers, and scoring information for traceability
## Usage Guidelines

### Data Loading

```python
import jsonlines

# Load the entire dataset into memory
samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        samples.append(obj)
print(f"Total samples: {len(samples)}")
```
### Accessing Code and Documentation

```python
# Extract code and documentation for a sample
sample = samples[0]
code = sample['code_content']
function_name = sample['name']
language = sample['language']

# Access the generated documentation
if sample['results']['status'] == 'ok':
    docstring = sample['results']['output']
    print(f"Function: {function_name}")
    print(f"Documentation:\n{docstring}")
```
### Filtering by Language

```python
# Keep only Python functions with successfully generated documentation
python_samples = [
    s for s in samples
    if s['language'] == 'python' and s['results']['status'] == 'ok'
]
print(f"Python samples with documentation: {len(python_samples)}")
```
### Filtering by Quality Score

```python
# Keep only high-quality samples with successful documentation
high_quality = [
    s for s in samples
    if s['final_score'] > 0.15 and s['results']['status'] == 'ok'
]
print(f"High-quality samples: {len(high_quality)}")
```
### Extracting Documentation Only

```python
# Extract all successfully generated documentation strings
documentations = []
for sample in samples:
    if sample['results']['status'] == 'ok':
        doc = {
            'function_name': sample['name'],
            'qualified_name': sample['qualified_name'],
            'language': sample['language'],
            'code': sample['code_content'],
            'docstring': sample['results']['output'],
        }
        documentations.append(doc)
```
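If the extracted pairs are needed by downstream tooling, they can be written back out as JSONL using only the standard library. The helper and output filename below are illustrative, not part of the dataset's tooling.

```python
import json

def write_jsonl(records, path):
    """Write a list of dicts to a JSONL file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Two hypothetical extracted records standing in for `documentations`.
records = [
    {"function_name": "f", "language": "python", "docstring": "Does f."},
    {"function_name": "g", "language": "java", "docstring": "Does g."},
]
write_jsonl(records, "docstrings.jsonl")

with open("docstrings.jsonl", encoding="utf-8") as f:
    lines = f.readlines()
print(len(lines))  # one line per record
```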
## Use Cases

This dataset is suitable for:

1. **Code Documentation Generation**: Training models to generate docstrings from code
2. **Documentation Quality Assessment**: Evaluating the quality of generated documentation
3. **Code Understanding**: Training models to understand code semantics
4. **Documentation Completion**: Fine-tuning models for automatic documentation generation
5. **Code-to-Documentation Alignment**: Studying the relationship between code and documentation
6. **Domain-Specific NLP**: Training models on scientific and technical terminology
## Important Notes

1. **File Size**: The dataset file is large (~2.9 GB); ensure sufficient memory and storage when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; the file can be processed line by line for memory efficiency
3. **Status Field**: Always check `results.status` before using `results.output`; only an "ok" status indicates successful generation
4. **Code Content**: The `code_content` field contains the complete function code, which may include long implementations
5. **Documentation Format**: Generated documentation is wrapped in a markdown code block (```python ... ```); you may need to extract the content
6. **Context Dependency**: Documentation quality may vary with the availability and quality of project README summaries
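Notes 1–3 suggest streaming the file rather than loading it fully. A minimal standard-library sketch is below; the `demo.jsonl` file is created only to make the example self-contained, and in practice the path would be `alignment.jsonl`.

```python
import json

def iter_ok_samples(path):
    """Yield parsed records with status 'ok', one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            if obj["results"]["status"] == "ok":
                yield obj

# Small demo file standing in for alignment.jsonl.
demo = [
    {"name": "f", "results": {"status": "ok", "output": "doc"}},
    {"name": "g", "results": {"status": "error", "output": ""}},
]
with open("demo.jsonl", "w", encoding="utf-8") as f:
    for rec in demo:
        f.write(json.dumps(rec) + "\n")

ok = list(iter_ok_samples("demo.jsonl"))
print(len(ok))  # only the 'ok' record survives
```

Because the generator never materializes the whole file, peak memory stays roughly one record regardless of dataset size.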
## Data Processing Example

```python
import jsonlines
import re

def extract_docstring_content(docstring_block):
    """Extract docstring content from a markdown code block."""
    # Remove the markdown code block markers, if present
    pattern = r'```(?:python|code)?\s*(.*?)```'
    match = re.search(pattern, docstring_block, re.DOTALL)
    if match:
        return match.group(1).strip()
    return docstring_block.strip()

# Process the dataset and extract clean docstrings
processed_samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        if obj['results']['status'] == 'ok':
            clean_docstring = extract_docstring_content(obj['results']['output'])
            processed_samples.append({
                'function': obj['name'],
                'code': obj['code_content'],
                'docstring': clean_docstring,
                'language': obj['language'],
            })
```
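A quick standalone check of the extraction helper on a hypothetical `results.output` value (the helper is repeated here so the snippet runs on its own):

```python
import re

def extract_docstring_content(docstring_block):
    """Extract docstring content from a markdown code block."""
    pattern = r'```(?:python|code)?\s*(.*?)```'
    match = re.search(pattern, docstring_block, re.DOTALL)
    if match:
        return match.group(1).strip()
    return docstring_block.strip()

# Hypothetical generated output, wrapped in a markdown code block.
raw = "```python\n\"\"\"Return the sum of two numbers.\"\"\"\n```"
print(extract_docstring_content(raw))  # → """Return the sum of two numbers."""

# Output without code-block markers is passed through unchanged (just stripped).
print(extract_docstring_content("  plain docstring  "))  # → plain docstring
```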