---
tags:
- code
- programming
- dataset
pretty_name: "Coding Dataset"
---


# Coding Dataset

A demo dataset of code snippets paired with natural-language descriptions, intended for training AI coding agents.

## Dataset Summary

- **Total Examples**: 6 (demo)
- **Languages**: Python, JavaScript, Java
- **Task Types**: Code Generation
- **License**: CC0-1.0

## Dataset Structure

### Data Splits

- train: 70% of data
- validation: 15% of data
- test: 15% of data
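
The card does not say how these percentages are rounded into whole example counts. As a minimal sketch, assuming floor rounding for train and validation with the remainder assigned to test (the helper `split_sizes` is hypothetical and not part of the dataset's tooling), the counts could be derived like this:

```python
def split_sizes(n_examples, train_pct=70, val_pct=15):
    """Return (train, validation, test) counts that sum to n_examples.

    Assumption: train and validation are rounded down, and the test split
    receives whatever is left. The dataset's real procedure is not documented.
    """
    # Integer arithmetic avoids floating-point surprises such as 0.07 * 100.
    n_train = n_examples * train_pct // 100
    n_val = n_examples * val_pct // 100
    n_test = n_examples - n_train - n_val
    return n_train, n_val, n_test


if __name__ == "__main__":
    for n in (20, 100, 1000):
        print(n, split_sizes(n))
```

For very small datasets floor rounding can leave a split empty, so it is worth checking the actual sizes on the loaded data, for example with `len(dataset['train'])`.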

### Features

- `id` (string): Unique identifier
- `code` (string): Source code snippet
- `code_description` (string): Natural language description
- `programming_language` (string): Language (python, javascript, java, etc.)
- `task_type` (string): Type of task (for example, code generation)
- `difficulty_level` (string): Difficulty (beginner, intermediate, advanced, expert)
- `quality_score` (float): Quality score from 0.0 to 1.0
- `is_tested` (bool): Whether the code has been tested
- `has_bugs` (bool): Whether the code has known bugs
- `lines_of_code` (int): Number of lines of code
- `collected_at` (string): Collection timestamp
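
The feature list above can be turned into a lightweight record check. The following is a sketch only: the field names, types, score range, and difficulty levels come from the list, while `validate_example` and the sample record are hypothetical illustrations rather than actual dataset content.

```python
# Field names and Python types taken from the feature list above.
EXPECTED_TYPES = {
    "id": str,
    "code": str,
    "code_description": str,
    "programming_language": str,
    "task_type": str,
    "difficulty_level": str,
    "quality_score": float,
    "is_tested": bool,
    "has_bugs": bool,
    "lines_of_code": int,
    "collected_at": str,
}
DIFFICULTY_LEVELS = {"beginner", "intermediate", "advanced", "expert"}


def validate_example(example):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in example:
            problems.append(f"missing field: {name}")
            continue
        value = example[name]
        # bool is a subclass of int, so reject it explicitly for non-bool fields.
        if isinstance(value, bool) and expected is not bool:
            problems.append(f"wrong type for {name}")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {name}")

    score = example.get("quality_score")
    if isinstance(score, float) and not 0.0 <= score <= 1.0:
        problems.append("quality_score outside 0.0-1.0")

    level = example.get("difficulty_level")
    if isinstance(level, str) and level not in DIFFICULTY_LEVELS:
        problems.append(f"unknown difficulty_level: {level}")
    return problems


# Hypothetical record for illustration; values are invented, not from the dataset.
sample = {
    "id": "example-0001",
    "code": "def add(a, b):\n    return a + b",
    "code_description": "Add two numbers.",
    "programming_language": "python",
    "task_type": "code_generation",
    "difficulty_level": "beginner",
    "quality_score": 0.9,
    "is_tested": True,
    "has_bugs": False,
    "lines_of_code": 2,
    "collected_at": "2025-10-25",
}

if __name__ == "__main__":
    print(validate_example(sample))
```

The `programming_language` field is left unchecked because the list describes it as open-ended ("etc.").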

## Usage

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("romcmu863/code-dataset")

# Access splits
train = dataset['train']
validation = dataset['validation']
test = dataset['test']

# Get first example
example = train[0]
print(example['code_description'])
print(example['code'])
```
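
The `quality_score`, `is_tested`, and `has_bugs` fields can be combined to select a cleaner subset. Below is a minimal sketch; the predicate name `is_clean` and the 0.8 threshold are arbitrary assumptions, not recommendations from the dataset authors.

```python
def is_clean(example, min_quality=0.8):
    """Keep examples that are tested, have no known bugs, and score well.

    Assumption: min_quality=0.8 is an illustrative cutoff only.
    """
    return (
        example["is_tested"]
        and not example["has_bugs"]
        and example["quality_score"] >= min_quality
    )


if __name__ == "__main__":
    # Hypothetical records standing in for dataset rows.
    rows = [
        {"is_tested": True, "has_bugs": False, "quality_score": 0.95},
        {"is_tested": True, "has_bugs": True, "quality_score": 0.95},
        {"is_tested": False, "has_bugs": False, "quality_score": 0.99},
        {"is_tested": True, "has_bugs": False, "quality_score": 0.50},
    ]
    print([is_clean(row) for row in rows])
```

With the splits loaded as shown earlier, the same predicate can be passed to the `datasets` library's filter method, for example `train.filter(is_clean)`.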

## License

CC0-1.0

## Created

2025-10-25