---
tags:
- code
- programming
- dataset
pretty_name: "Coding Dataset"
---

# Coding Dataset

Production-grade dataset for training AI coding agents. Each example pairs a source code snippet with a natural-language description and quality metadata.

## Dataset Summary

- **Total Examples**: 6 (demo)
- **Languages**: Python, JavaScript, Java
- **Task Types**: Code Generation
- **License**: CC0-1.0

## Dataset Structure

### Data Splits

- train: 70% of data
- validation: 15% of data
- test: 15% of data

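
The card does not say how the 70/15/15 proportions were produced. As a minimal sketch, assuming a simple seeded shuffle followed by slicing, a comparable partition could be built as below. `split_indices` and its parameters are hypothetical names, not part of the dataset's tooling.

```python
import random


def split_indices(n_examples, seed=0, train_frac=0.70, val_frac=0.15):
    """Partition range(n_examples) into train/validation/test index lists.

    Hypothetical helper for illustration; the dataset's real split
    procedure is not documented in this card.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed

    n_train = round(n_examples * train_frac)
    # Clamp so train + validation never exceeds the number of examples.
    n_val = min(round(n_examples * val_frac), n_examples - n_train)

    # Whatever remains after train and validation goes to test.
    return {
        "train": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }


splits = split_indices(100)
print({name: len(idx) for name, idx in splits.items()})
```

Seeding a local `random.Random` instance keeps the split reproducible without touching the global random state.
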
### Features

- `id` (string): Unique identifier
- `code` (string): Source code snippet
- `code_description` (string): Natural language description of the code
- `programming_language` (string): Language (python, javascript, java, etc.)
- `task_type` (string): Type of task
- `difficulty_level` (string): Difficulty (beginner, intermediate, advanced, expert)
- `quality_score` (float): Quality score from 0.0 to 1.0
- `is_tested` (bool): Whether the code has been tested
- `has_bugs` (bool): Whether the code has known bugs
- `lines_of_code` (int): Number of lines of code
- `collected_at` (string): Collection timestamp

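
The fields listed above can be used to select a higher-quality subset. The sketch below types one record as a `TypedDict` and keeps only tested, bug-free examples above a quality threshold. The `CodeExample` and `select_clean` names, the 0.8 threshold, and the sample records are all assumptions made for illustration; none of them come from the dataset.

```python
from typing import Iterable, TypedDict


class CodeExample(TypedDict):
    """One record, with field names and types as listed in the card."""
    id: str
    code: str
    code_description: str
    programming_language: str
    task_type: str
    difficulty_level: str
    quality_score: float
    is_tested: bool
    has_bugs: bool
    lines_of_code: int
    collected_at: str


def select_clean(examples: Iterable[CodeExample],
                 min_quality: float = 0.8) -> list[CodeExample]:
    """Keep tested, bug-free examples at or above a quality threshold.

    The 0.8 default is an arbitrary illustration, not a recommended value.
    """
    return [
        ex for ex in examples
        if ex["is_tested"] and not ex["has_bugs"]
        and ex["quality_score"] >= min_quality
    ]


# Hypothetical records for demonstration only; not taken from the dataset.
samples: list[CodeExample] = [
    {"id": "demo-1", "code": "print('hi')", "code_description": "Print a greeting",
     "programming_language": "python", "task_type": "code_generation",
     "difficulty_level": "beginner", "quality_score": 0.9, "is_tested": True,
     "has_bugs": False, "lines_of_code": 1, "collected_at": "2025-10-25"},
    {"id": "demo-2", "code": "console.log(x)", "code_description": "Log a value",
     "programming_language": "javascript", "task_type": "code_generation",
     "difficulty_level": "beginner", "quality_score": 0.95, "is_tested": True,
     "has_bugs": True, "lines_of_code": 1, "collected_at": "2025-10-25"},
]

for ex in select_clean(samples):
    print(ex["id"], ex["programming_language"])
```

When working with a split loaded through the `datasets` library, the same condition could instead be passed as a function to `Dataset.filter`.
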
## Usage

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("romcmu863/code-dataset")

# Access splits
train = dataset['train']
validation = dataset['validation']
test = dataset['test']

# Get first example
example = train[0]
print(example['code_description'])
print(example['code'])
```

## License

CC0-1.0

## Created

2025-10-25