# OLMo Code Clean Dataset

This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.

## Dataset Description

- **Repository:** olmo-code-dataset
- **Type:** Code dataset
- **Languages:** Python 2, Python 3
- **Format:** JSONL (JSON Lines)
- **Purpose:** Fine-tuning language models for code generation

## Files

The dataset contains multiple JSONL files:
- `python2_chunk_*.jsonl`: Python 2 code chunks
- `python3_chunk_*.jsonl`: Python 3 code chunks
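
If the files have been downloaded locally, the naming patterns above can be used to group them by language. This is a minimal sketch; the helper name `list_chunks` and the idea of a single local directory holding all chunks are assumptions for illustration, not part of the dataset.

```python
from pathlib import Path


def list_chunks(data_dir):
    """Group chunk files by language using the filename patterns listed above.

    `data_dir` is a hypothetical local directory containing the JSONL files.
    """
    root = Path(data_dir)
    return {
        "python2": sorted(root.glob("python2_chunk_*.jsonl")),
        "python3": sorted(root.glob("python3_chunk_*.jsonl")),
    }
```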

## Data Format

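Each line in the JSONL files contains a JSON object with the following fields (the values shown are placeholders describing each field):
```json
{
    "text": "code content here",
    "metadata": {
        "extension": "python2",
        "source": "original source information",
        "length": "token length"
    }
}
```

The `extension` field is either `"python2"` or `"python3"`.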
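
A single chunk file can also be read without any extra libraries, since each line is an independent JSON object. The sketch below assumes a local copy of a chunk file; the function name `iter_records` is hypothetical.

```python
import json


def iter_records(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            yield json.loads(line)
```

Each yielded record exposes the code as `record["text"]` and the fields described above under `record["metadata"]`.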

## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("dipikakhullar/olmo-code-dataset")

# Access training data
train_data = dataset["train"]
```
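
To keep only one language, a predicate over the `metadata.extension` field may be enough. This is a sketch that assumes each example is exposed as a nested dict matching the data format above; `is_python3` is a hypothetical helper name.

```python
def is_python3(example):
    """Return True when the example's metadata marks it as Python 3 code.

    Assumes `example["metadata"]["extension"]` holds "python2" or "python3".
    """
    return example["metadata"]["extension"] == "python3"
```

With the `datasets` library, such a predicate can be passed to `train_data.filter(is_python3)`; it works equally well on plain dicts parsed from the JSONL files.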

## Citation

If you use this dataset, please cite the original sources and this repository.

## License

MIT License