# OLMo Code Clean Dataset

This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.

## Dataset Description

- **Repository:** olmo-code-dataset
- **Type:** Code dataset
- **Languages:** Python 2, Python 3
- **Format:** JSONL (JSON Lines)
- **Purpose:** Fine-tuning language models for code generation

## Files

The dataset contains multiple JSONL files:

- `python2_chunk_*.jsonl`: Python 2 code chunks
- `python3_chunk_*.jsonl`: Python 3 code chunks

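
The naming patterns above can be used to separate the files by language. The following is a minimal sketch using only the standard library; the function name `group_by_language` and the example file names (including the `0001` suffix) are hypothetical and chosen for illustration, since the actual chunk suffixes are not specified here.

```python
from fnmatch import fnmatch


def group_by_language(filenames):
    """Group chunk file names by language using the documented name patterns."""
    groups = {"python2": [], "python3": []}
    for name in sorted(filenames):
        for language in groups:
            # Patterns follow the README: python2_chunk_*.jsonl, python3_chunk_*.jsonl
            if fnmatch(name, f"{language}_chunk_*.jsonl"):
                groups[language].append(name)
    return groups


# Hypothetical file names, used only to illustrate the grouping.
example_files = [
    "python3_chunk_0001.jsonl",
    "python2_chunk_0001.jsonl",
    "README.md",
]
print(group_by_language(example_files))
```

Files that match neither pattern (such as `README.md` in the example) are simply ignored.
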
## Data Format

Each line in the JSONL files contains a JSON object with:

```json
{
  "text": "code content here",
  "metadata": {
    "extension": "python2",
    "source": "original source information",
    "length": "token length"
  }
}
```

The `extension` field is either `"python2"` or `"python3"`. The other values shown above are placeholders describing what each field holds.

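
To illustrate the record layout, here is a small sketch that parses JSONL lines and optionally keeps only one language. The helper name `iter_records` and the sample records are hypothetical; in particular, the sample `source` and `length` values are made up, and the real type of `length` is an assumption.

```python
import json


def iter_records(lines, extension=None):
    """Yield parsed records, optionally keeping only one metadata.extension value."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if extension is None or record["metadata"]["extension"] == extension:
            yield record


# Hypothetical records built to match the documented structure.
sample_lines = [
    json.dumps({
        "text": "print 'hello'",
        "metadata": {"extension": "python2", "source": "example", "length": 3},
    }),
    json.dumps({
        "text": "print('hello')",
        "metadata": {"extension": "python3", "source": "example", "length": 4},
    }),
]

for record in iter_records(sample_lines, extension="python3"):
    print(record["text"])
```

The same function works on an open file handle, since iterating over a text file yields one line at a time.
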
## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("dipikakhullar/olmo-code-dataset")

# Access training data
train_data = dataset["train"]
```

## Citation

If you use this dataset, please cite the original sources and this repository.

## License

MIT License