# OLMo Code Clean Dataset

This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.

## Dataset Description

- **Repository:** olmo-code-dataset
- **Type:** Code dataset
- **Languages:** Python 2, Python 3
- **Format:** JSONL (JSON Lines)
- **Purpose:** Fine-tuning language models for code generation

## Files

The dataset contains multiple JSONL files:

- `python2_chunk_*.jsonl`: Python 2 code chunks
- `python3_chunk_*.jsonl`: Python 3 code chunks

## Data Format

Each line in the JSONL files contains one JSON object with the following structure:

```json
{
  "text": "code content here",
  "metadata": {
    "extension": "python2",
    "source": "original source information",
    "length": "token length"
  }
}
```

The `extension` field is either `"python2"` or `"python3"`. The `source` and `length` values shown above are descriptive placeholders: `source` holds the original source information and `length` holds the token length of the chunk.

## Usage

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("dipikakhullar/olmo-code-dataset")

# Access the training split
train_data = dataset["train"]
```

## Citation

If you use this dataset, please cite the original sources and this repository.

## License

MIT License
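
## Example: Reading a Chunk File Directly

The record layout described under Data Format can also be read without the `datasets` library. The following is a minimal sketch, assuming a chunk file has already been downloaded locally and that every line follows the documented `text` / `metadata` structure. The helper names `iter_records` and `filter_by_extension` are hypothetical and not part of this repository.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def filter_by_extension(records: Iterable[dict], extension: str) -> List[dict]:
    """Keep records whose metadata.extension equals `extension`.

    Assumes the documented values "python2" and "python3"; records
    without a metadata block are dropped.
    """
    return [
        record
        for record in records
        if record.get("metadata", {}).get("extension") == extension
    ]
```

To use it, pass the path of any local file matching `python3_chunk_*.jsonl` to `iter_records`, then call `filter_by_extension(records, "python3")` on the result. Reading line by line keeps memory use low for large chunk files, since only the filtered records are held in a list.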