dipikakhullar/olmo-code-dataset

dipikakhullar committed on Jun 24, 2025

Commit 08cbeb0 (verified) · 1 parent: 6942c24

Upload README.md with huggingface_hub

Files changed (1)

README.md ADDED

+# OLMo Code Clean Dataset
+This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.
+## Dataset Description
+- **Repository:** olmo-code-dataset
+- **Type:** Code dataset
+- **Languages:** Python 2, Python 3
+- **Format:** JSONL (JSON Lines)
+- **Purpose:** Fine-tuning language models for code generation
+## Files
+The dataset contains multiple JSONL files:
+- `python2_chunk_*.jsonl`: Python 2 code chunks
+- `python3_chunk_*.jsonl`: Python 3 code chunks
+## Data Format
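+Each line in the JSONL files is a JSON object with a `text` field holding the code and a `metadata` object. The `extension` field is either `"python2"` or `"python3"`; the other values below are placeholders:
+```json
+{
+    "text": "code content here",
+    "metadata": {
+        "extension": "python3",
+        "source": "original source information",
+        "length": "token length"
+    }
+}
+```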
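To illustrate the record layout described above, here is a minimal sketch that parses one JSONL line with the standard library and checks the `extension` field. The sample values and the `parse_record` helper are hypothetical, and `length` is assumed to be an integer, which the README does not confirm.

```python
import json

# Hypothetical sample line; the values are made up for illustration.
# Assumption: "length" is stored as an integer token count.
sample_line = json.dumps({
    "text": "print('hello')",
    "metadata": {
        "extension": "python3",
        "source": "example/source.py",
        "length": 4,
    },
})


def parse_record(line):
    """Parse one JSONL line and return (text, metadata).

    Raises ValueError if the extension is not one of the two
    values named in the README.
    """
    record = json.loads(line)
    metadata = record["metadata"]
    if metadata["extension"] not in ("python2", "python3"):
        raise ValueError(f"unexpected extension: {metadata['extension']!r}")
    return record["text"], metadata


text, metadata = parse_record(sample_line)
```

To process a whole chunk file, the same helper could be applied to each non-empty line.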
+## Usage
+```python
+from datasets import load_dataset
+# Load the dataset
+dataset = load_dataset("dipikakhullar/olmo-code-dataset")
+# Access training data
+train_data = dataset["train"]
+```
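Because Python 2 and Python 3 chunks share one schema, a common next step is to separate them by `metadata.extension`. The sketch below shows one way to do that on plain dictionaries shaped like the records described in this README. The `filter_by_extension` helper and the sample records are hypothetical, not part of the dataset or of `datasets`.

```python
def filter_by_extension(records, extension):
    """Return the records whose metadata.extension equals `extension`."""
    return [r for r in records if r["metadata"]["extension"] == extension]


# Hypothetical records following the documented layout.
sample_records = [
    {"text": "print 'hi'", "metadata": {"extension": "python2"}},
    {"text": "print('hi')", "metadata": {"extension": "python3"}},
    {"text": "x = 1", "metadata": {"extension": "python3"}},
]

python3_records = filter_by_extension(sample_records, "python3")
```

On a loaded split, the same predicate could likely be passed to `Dataset.filter`, for example `train_data.filter(lambda r: r["metadata"]["extension"] == "python3")`, assuming the loaded rows expose `metadata` as a nested field.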
+## Citation
+If you use this dataset, please cite the original sources and this repository.
+## License
+MIT License