# OLMo Code Clean Dataset
This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.
## Dataset Description
- **Repository:** olmo-code-dataset
- **Type:** Code dataset
- **Languages:** Python 2, Python 3
- **Format:** JSONL (JSON Lines)
- **Purpose:** Fine-tuning language models for code generation
## Files
The dataset contains multiple JSONL files:
- `python2_chunk_*.jsonl`: Python 2 code chunks
- `python3_chunk_*.jsonl`: Python 3 code chunks
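As a minimal sketch of how the naming pattern above could be used, the snippet below sorts a list of file names into Python 2 and Python 3 groups. The helper name `split_by_language` and the example file names are hypothetical; only the two glob patterns come from this README.

```python
from fnmatch import fnmatch

# Glob patterns taken from the file list above.
PATTERNS = {
    "python2": "python2_chunk_*.jsonl",
    "python3": "python3_chunk_*.jsonl",
}


def split_by_language(filenames):
    """Group file names by the language pattern they match (hypothetical helper)."""
    groups = {language: [] for language in PATTERNS}
    for name in filenames:
        for language, pattern in PATTERNS.items():
            if fnmatch(name, pattern):
                groups[language].append(name)
    return groups


# Example file names are made up for illustration.
example = ["python2_chunk_0.jsonl", "python3_chunk_0.jsonl", "README.md"]
groups = split_by_language(example)
```

Files that match neither pattern, such as `README.md` here, are simply left out of both groups.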
## Data Format
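Each line in the JSONL files is a JSON object with a `text` field and a `metadata` object. The values in the example below are placeholders; `extension` is either `"python2"` or `"python3"`:
```json
{
  "text": "code content here",
  "metadata": {
    "extension": "python3",
    "source": "original source information",
    "length": "token length"
  }
}
```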
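The record layout above can be read with the standard library alone. The following is a rough sketch, not part of the dataset tooling: `iter_records` and `has_expected_fields` are hypothetical helper names, the sample line is made up, and storing `length` as an integer is an assumption, since the README does not state its type.

```python
import json


def iter_records(lines):
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def has_expected_fields(record):
    """Check for the fields described in this README (hypothetical helper)."""
    metadata = record.get("metadata")
    return (
        isinstance(record.get("text"), str)
        and isinstance(metadata, dict)
        and metadata.get("extension") in {"python2", "python3"}
        and "source" in metadata
        and "length" in metadata
    )


# Made-up sample line; real records come from the chunk files.
# Assumption: "length" is shown as an integer here, the real type may differ.
sample_lines = [
    '{"text": "print(1)", "metadata": {"extension": "python3", "source": "example", "length": 3}}',
    "",
]
records = list(iter_records(sample_lines))
```

To read a real chunk, an open file handle can be passed in place of `sample_lines`, since iterating over a file yields its lines.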
## Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("dipikakhullar/olmo-code-dataset")
# Access training data
train_data = dataset["train"]
```
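Once the data is loaded, records can be narrowed to one language through the `extension` field. The sketch below shows the idea on plain dictionaries so it runs without downloading anything; `is_language` is a hypothetical helper and the two rows are made up.

```python
def is_language(example, language):
    """Return True when the record's metadata.extension equals `language`."""
    return example["metadata"]["extension"] == language


# Made-up records standing in for rows of the training split.
rows = [
    {"text": "print 'hi'", "metadata": {"extension": "python2"}},
    {"text": "print('hi')", "metadata": {"extension": "python3"}},
]
python3_rows = [row for row in rows if is_language(row, "python3")]
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter`, for example `train_data.filter(lambda ex: is_language(ex, "python3"))`, assuming the loaded split exposes `metadata` as a nested field.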
## Citation
If you use this dataset, please cite the original sources and this repository.
## License
MIT License