Commit 08cbeb0 by dipikakhullar · verified · 1 parent: 6942c24

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +51 -0
README.md ADDED
@@ -0,0 +1,51 @@
+ # OLMo Code Clean Dataset
+
+ This dataset contains cleaned Python 2 and Python 3 code chunks for language model fine-tuning.
+
+ ## Dataset Description
+
+ - **Repository:** olmo-code-dataset
+ - **Type:** Code dataset
+ - **Languages:** Python 2, Python 3
+ - **Format:** JSONL (JSON Lines)
+ - **Purpose:** Fine-tuning language models for code generation
+
+ ## Files
+
+ The dataset contains multiple JSONL files:
+ - `python2_chunk_*.jsonl`: Python 2 code chunks
+ - `python3_chunk_*.jsonl`: Python 3 code chunks
+
+ ## Data Format
+
+ Each line in the JSONL files contains a JSON object with:
+ ```json
+ {
+   "text": "code content here",
+   "metadata": {
+     "extension": "python2" or "python3",
+     "source": "original source information",
+     "length": "token length"
+   }
+ }
+ ```
+
+ ## Usage
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the dataset
+ dataset = load_dataset("dipikakhullar/olmo-code-dataset")
+
+ # Access training data
+ train_data = dataset["train"]
+ ```
+
+ ## Citation
+
+ If you use this dataset, please cite the original sources and this repository.
+
+ ## License
+
+ MIT License
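
The Data Format section of the README added above describes one JSON object per line, with a `text` field and a nested `metadata` object. As a rough illustration of how such records might be read and split by `metadata.extension`, here is a minimal sketch using only the standard library. The two sample records, the helper names `iter_records` and `filter_by_extension`, and the integer type used for `length` are assumptions made for this example; the README does not specify them.

```python
import json

# Two made-up records shaped like the README's Data Format section.
# The values, and the integer type of "length", are assumptions.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "text": "print 'hello'",
        "metadata": {"extension": "python2", "source": "example-a", "length": 3},
    }),
    json.dumps({
        "text": "print('hello')",
        "metadata": {"extension": "python3", "source": "example-b", "length": 4},
    }),
])


def iter_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_extension(records, extension):
    """Keep records whose metadata.extension equals `extension`."""
    return [
        r for r in records
        if r.get("metadata", {}).get("extension") == extension
    ]


records = list(iter_records(SAMPLE_JSONL.splitlines()))
python3_records = filter_by_extension(records, "python3")
print(len(records), len(python3_records))
```

To apply the same idea to a real chunk file, the lines of an opened `python3_chunk_*.jsonl` file could be passed to `iter_records` in place of the sample string. Whether every record carries all three metadata keys is not stated in the README, which is why the filter uses `.get` rather than direct indexing.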