	# OLMo Code Clean Dataset

	This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.

	## Dataset Description

	- Repository: olmo-code-dataset
	- Type: Code dataset
	- Languages: Python 2, Python 3
	- Format: JSONL (JSON Lines)
	- Purpose: Fine-tuning language models for code generation

	## Files

	The dataset contains multiple JSONL files:
	- `python2_chunk_*.jsonl`: Python 2 code chunks
	- `python3_chunk_*.jsonl`: Python 3 code chunks
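
The naming pattern above can be used to separate the two languages once the files are on disk. The following is a minimal sketch, assuming the JSONL files have been downloaded into a single local directory; the `group_chunk_files` helper and the directory layout are hypothetical and not part of the dataset itself.

```python
from pathlib import Path


def group_chunk_files(root):
    """Group chunk files under `root` by language, using the filename prefix.

    `root` is an assumed local directory holding the downloaded JSONL files.
    """
    root = Path(root)
    return {
        "python2": sorted(root.glob("python2_chunk_*.jsonl")),
        "python3": sorted(root.glob("python3_chunk_*.jsonl")),
    }
```

Calling `group_chunk_files("path/to/download")` returns a dictionary with one sorted list of paths per language; files that do not match either pattern are ignored.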

	## Data Format

	Each line in the JSONL files is a JSON object with a `text` field and a `metadata` object. The values below are placeholders; `extension` is either `"python2"` or `"python3"`:
	```json
	{
	  "text": "code content here",
	  "metadata": {
	    "extension": "python2",
	    "source": "original source information",
	    "length": "token length"
	  }
	}
	```
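
Because every line is a standalone JSON object, a file can be read without any extra dependencies. The sketch below uses only the standard library and checks for the two top-level fields named in this section; the `read_jsonl` helper is illustrative, and the file path passed to it is up to the caller.

```python
import json


def read_jsonl(path):
    """Yield one record (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Check only the top-level fields documented in this README.
            if "text" not in record or "metadata" not in record:
                raise ValueError(f"line {line_number}: missing 'text' or 'metadata'")
            yield record
```

Since `read_jsonl` is a generator, records are parsed one at a time, so a large chunk file does not need to fit in memory.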

	## Usage

	```python
	from datasets import load_dataset

	# Load the dataset
	dataset = load_dataset("dipikakhullar/olmo-code-dataset")

	# Access training data
	train_data = dataset["train"]
	```
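
To work with a single language, records can be selected by their `metadata.extension` value. This is a sketch under the assumption that each loaded record is a dictionary with the nested `metadata` layout described in the Data Format section; `texts_for` is a hypothetical helper, and the sample records are made up for illustration.

```python
def texts_for(records, extension):
    """Yield the `text` of every record whose metadata.extension matches."""
    for record in records:
        if record["metadata"]["extension"] == extension:
            yield record["text"]


# Made-up sample records that mirror the documented layout.
sample_records = [
    {"text": "x = 1", "metadata": {"extension": "python3"}},
    {"text": "print 1", "metadata": {"extension": "python2"}},
]
python3_texts = list(texts_for(sample_records, "python3"))
```

The helper accepts any iterable of record dictionaries, so `texts_for(train_data, "python3")` should work the same way if the loaded split follows that layout.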

	## Citation

	If you use this dataset, please cite the original sources and this repository.

	## License

	MIT License