amkyawdev committed on
Commit
a5faf54
·
verified ·
1 Parent(s): 210ece5

Connect all storage bucket files

Files changed (1)
  1. README.md +114 -78
README.md CHANGED
@@ -1,17 +1,10 @@
1
  ---
2
- annotations_creators:
3
- - no-annotation
4
- language_creators:
5
- - found
6
- languages:
7
  - my
8
  - en
9
- licenses:
10
- - apache-2.0
11
- multilinguality:
12
- - multilingual
13
- size_categories:
14
- - n_1M_to_n_10M
15
  source_datasets:
16
  - amkyawdev/myanmar-llm-data
17
  - amkyawdev/mm-llm-coder-agent-dataset
@@ -40,6 +33,17 @@ This dataset combines three source datasets for training LLMs with Myanmar langu
40
 
41
  **Total Samples: 3,020,347**
42
 
43
  ### Features
44
 
45
  - **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations
@@ -54,43 +58,39 @@ Each sample contains:
54
 
55
  ```python
56
  {
57
- # Core Fields
58
- "messages": [
59
- {"role": "system", "content": "You are a helpful assistant."},
60
- {"role": "user", "content": "User input here"},
61
- {"role": "assistant", "content": "Response here"}
62
- ],
63
-
64
- # Task Definition
65
- "instruction": "Task instruction",
66
- "category": "Task category",
67
- "language": "en or my",
68
- "difficulty": "beginner, intermediate, or advanced",
69
- "response": "Expected response/output",
70
- "task_type": "Type of task",
71
-
72
- # Execution Feedback
73
- "execution_feedback": {
74
- "status": "completed or pending_validation",
75
- "result": "Execution result",
76
- "error_type": "runtime_error, syntax_error, etc.",
77
- "error_message": "Error details",
78
- "execution_time_ms": 1000,
79
- },
80
-
81
- # Extended Fields
82
- "framework": "React, Express, etc.",
83
- "runtime": "Node.js, Python, etc.",
84
- "database": "MongoDB, PostgreSQL, etc.",
85
- "validated": True/False,
86
- "rating": 0.0 to 1.0,
87
- "complexity_score": 1 to 10,
88
-
89
- # Metadata
90
- "metadata": {
91
- "created_at": "2024-01-01T00:00:00",
92
- "difficulty": "beginner/intermediate/advanced",
93
- }
94
  }
95
  ```
96
 
@@ -100,15 +100,35 @@ Each sample contains:
100
  from datasets import load_dataset
101
 
102
  # Load the entire dataset
103
- dataset = load_dataset("amkyawdev/combined-myanmar-llm-dataset")
104
 
105
  # Load specific split
106
- train_ds = load_dataset("amkyawdev/combined-myanmar-llm-dataset", split="train")
107
 
108
  # Access a single sample
109
  sample = train_ds[0]
110
  print(sample["messages"])
111
  print(sample["execution_feedback"])
112
  ```
113
 
114
  ### Use Cases
@@ -121,11 +141,15 @@ print(sample["execution_feedback"])
121
 
122
  ### Data Format
123
 
124
- This dataset is available in two formats:
125
 
126
  1. **HuggingFace Dataset**: Standard format with all fields
127
  2. **ADP Format**: JSONL format with execution feedback
128
  - Location: `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl`
129
 
130
  ### License
131
 
@@ -135,10 +159,10 @@ Apache 2.0 License
135
 
136
  ```bibtex
137
  @dataset{combined_myanmar_llm,
138
- title={Combined Myanmar LLM Dataset},
139
- author={amkyawdev},
140
- year={2024},
141
- url={https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset}
142
  }
143
  ```
144
 
@@ -158,6 +182,17 @@ Apache 2.0 License
158
 
159
  **Total Samples: 3,020,347**
160
 
161
  ### Features
162
 
163
  - **Myanmar Language Support**: Burmese conversations and translations
@@ -169,26 +204,23 @@ Apache 2.0 License
169
  ### Structure
170
 
171
  ```python
172
- {
173
- # Core fields
174
- "messages": [...],
175
- "instruction": "...",
176
- "category": "...",
177
- "language": "my or en",
178
- "difficulty": "beginner, intermediate, advanced",
179
- "response": "...",
180
- "task_type": "...",
181
-
182
- # Execution Feedback
183
- "execution_feedback": {
184
- "status": "completed or pending_validation",
185
- "result": "...",
186
- "error_type": "...",
187
- "error_message": "...",
188
- },
189
-
190
- # Additional fields
191
- "metadata": {...}
192
  }
193
  ```
194
 
@@ -196,10 +228,14 @@ Apache 2.0 License
196
 
197
  ```python
198
  from datasets import load_dataset
199
-
200
- dataset = load_dataset("amkyawdev/combined-myanmar-llm-dataset")
201
  train_ds = dataset["train"]
202
  sample = train_ds[0]
203
  ```
204
 
205
  ### Use Cases
@@ -216,4 +252,4 @@ Apache 2.0 License
216
 
217
  ### Dataset URL
218
 
219
- https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset
 
1
  ---
2
+ license: apache-2.0
3
+ language:
4
  - my
5
  - en
6
+ multilinguality: multilingual
7
+ size_categories: n_1M_to_n_10M
8
  source_datasets:
9
  - amkyawdev/myanmar-llm-data
10
  - amkyawdev/mm-llm-coder-agent-dataset
 
33
 
34
  **Total Samples: 3,020,347**
35
 
36
+ ### Storage Files
37
+
38
+ This dataset also connects to storage bucket files for additional data:
39
+
40
+ | File | Location | Size |
41
+ |------|----------|------|
42
+ | ADP Execution Feedback | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl` | 50.8 MB |
43
+ | Myanmar LLM Clean | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl` | 2.79 GB |
44
+ | Myanmar LLM Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl` | 24.8 MB |
45
+ | Myanmar LLM Data Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-data-formatted.jsonl` | 3.7 kB |
46
+
47
  ### Features
48
 
49
  - **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations
 
58
 
59
  ```python
60
  {
61
+ # Core Fields
62
+ "messages": [
63
+ {"role": "system", "content": "You are a helpful assistant."},
64
+ {"role": "user", "content": "User input here"},
65
+ {"role": "assistant", "content": "Response here"}
66
+ ],
67
+ # Task Definition
68
+ "instruction": "Task instruction",
69
+ "category": "Task category",
70
+ "language": "en or my",
71
+ "difficulty": "beginner, intermediate, or advanced",
72
+ "response": "Expected response/output",
73
+ "task_type": "Type of task",
74
+ # Execution Feedback
75
+ "execution_feedback": {
76
+ "status": "completed or pending_validation",
77
+ "result": "Execution result",
78
+ "error_type": "runtime_error, syntax_error, etc.",
79
+ "error_message": "Error details",
80
+ "execution_time_ms": 1000,
81
+ },
82
+ # Extended Fields
83
+ "framework": "React, Express, etc.",
84
+ "runtime": "Node.js, Python, etc.",
85
+ "database": "MongoDB, PostgreSQL, etc.",
86
+ "validated": True/False,
87
+ "rating": 0.0 to 1.0,
88
+ "complexity_score": 1 to 10,
89
+ # Metadata
90
+ "metadata": {
91
+ "created_at": "2024-01-01T00:00:00",
92
+ "difficulty": "beginner/intermediate/advanced",
93
+ }
94
  }
95
  ```
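
A minimal sketch, assuming the field names and value ranges shown in the sample structure above, of how a record could be sanity-checked before training. The helper `validate_sample` is hypothetical, and treating `rating` and `complexity_score` as optional is an assumption; neither comes from the dataset's own tooling.

```python
from __future__ import annotations

from typing import Any

# Allowed values are copied from the schema comments in the README;
# the validator itself is an illustrative assumption.
ALLOWED_LANGUAGES = {"en", "my"}
ALLOWED_DIFFICULTIES = {"beginner", "intermediate", "advanced"}
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the sample looks valid."""
    problems: list[str] = []

    # "messages" should be a non-empty list of {"role", "content"} dicts.
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{i}] has an unexpected role")
            elif not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}] content must be a string")

    if sample.get("language") not in ALLOWED_LANGUAGES:
        problems.append("language must be 'en' or 'my'")
    if sample.get("difficulty") not in ALLOWED_DIFFICULTIES:
        problems.append("difficulty must be beginner, intermediate, or advanced")

    # Numeric fields are only range-checked when present (assumed optional).
    rating = sample.get("rating")
    if rating is not None and not (
        isinstance(rating, (int, float)) and 0.0 <= rating <= 1.0
    ):
        problems.append("rating must be between 0.0 and 1.0")

    score = sample.get("complexity_score")
    if score is not None and not (isinstance(score, int) and 1 <= score <= 10):
        problems.append("complexity_score must be an integer from 1 to 10")

    return problems
```

Calling `validate_sample(sample)` on a record returns the list of detected problems, so a caller can log or skip records for which the list is non-empty.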
96
 
 
100
  from datasets import load_dataset
101
 
102
  # Load the entire dataset
103
+ dataset = load_dataset("amkyawdev/mm-llm-code-v3")
104
 
105
  # Load specific split
106
+ train_ds = load_dataset("amkyawdev/mm-llm-code-v3", split="train")
107
 
108
  # Access a single sample
109
  sample = train_ds[0]
110
  print(sample["messages"])
111
  print(sample["execution_feedback"])
112
+
113
+ # Load from storage bucket directly
114
+ from datasets import load_dataset
115
+ ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl")
116
+ ```
117
+
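
As a hedged illustration of working with the loaded samples, the sketch below selects Burmese records whose execution feedback is marked `completed`. The predicate name `is_completed_burmese` and the in-memory `rows` list are hypothetical stand-ins so the logic can be tried offline; the field names come from the sample structure in the README.

```python
from __future__ import annotations


def is_completed_burmese(sample: dict) -> bool:
    """True for Burmese samples whose execution feedback is marked completed."""
    # Treat a missing or null execution_feedback as "no feedback".
    feedback = sample.get("execution_feedback") or {}
    return sample.get("language") == "my" and feedback.get("status") == "completed"


# Tiny stand-in for the real split (hypothetical data, for illustration only).
rows = [
    {"language": "my", "execution_feedback": {"status": "completed"}},
    {"language": "my", "execution_feedback": {"status": "pending_validation"}},
    {"language": "en", "execution_feedback": {"status": "completed"}},
    {"language": "my", "execution_feedback": None},
]

selected = [row for row in rows if is_completed_burmese(row)]
print(len(selected))

# With the `datasets` library, the same predicate can be passed to
# Dataset.filter, for example: train_ds.filter(is_completed_burmese)
```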
118
+ ### Storage Bucket Access
119
+
120
+ Access data directly from storage bucket:
121
+
122
+ ```python
123
+ # ADP Execution Feedback (50.8 MB)
124
+ from datasets import load_dataset
125
+ adp_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl")
126
+
127
+ # Clean dataset (2.79 GB)
128
+ clean_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl")
129
+
130
+ # Formatted dataset (24.8 MB)
131
+ formatted_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl")
132
  ```
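
A note of caution on the bucket examples: in the `datasets` library, JSON Lines files are normally loaded through the `json` builder, as in `load_dataset("json", data_files=...)`. Whether `hf://buckets/...` paths resolve depends on the installed `huggingface_hub` version, which is not verified here. The sketch below assumes a local copy of one of the JSONL files and reads it lazily with the standard library, which avoids holding the 2.79 GB file in memory; `iter_jsonl` is a hypothetical helper name.

```python
from __future__ import annotations

import json
from collections.abc import Iterator
from pathlib import Path


def iter_jsonl(path: str | Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a damaged record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

Because the function is a generator, records can be processed one at a time, for example `for record in iter_jsonl("myanmar-llm-clean.jsonl"): ...` on a downloaded copy.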
133
 
134
  ### Use Cases
 
141
 
142
  ### Data Format
143
 
144
+ This dataset is available in multiple formats:
145
 
146
  1. **HuggingFace Dataset**: Standard format with all fields
147
  2. **ADP Format**: JSONL format with execution feedback
148
  - Location: `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl`
149
+ 3. **Clean Format**: Cleaned and processed data
150
+ - Location: `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl`
151
+ 4. **Formatted Format**: Pre-formatted for training
152
+ - Location: `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl`
153
 
154
  ### License
155
 
 
159
 
160
  ```bibtex
161
  @dataset{combined_myanmar_llm,
162
+ title={Combined Myanmar LLM Dataset},
163
+ author={amkyawdev},
164
+ year={2024},
165
+ url={https://huggingface.co/datasets/amkyawdev/mm-llm-code-v3}
166
  }
167
  ```
168
 
 
182
 
183
  **Total Samples: 3,020,347**
184
 
185
+ ### Storage Files
186
+
187
+ The files in the storage bucket are also included.
188
+
189
+ | File | Location | Size |
190
+ |------|----------|------|
191
+ | ADP Execution Feedback | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl` | 50.8 MB |
192
+ | Myanmar LLM Clean | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl` | 2.79 GB |
193
+ | Myanmar LLM Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl` | 24.8 MB |
194
+ | Myanmar LLM Data Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-data-formatted.jsonl` | 3.7 kB |
195
+
196
  ### Features
197
 
198
  - **Myanmar Language Support**: Burmese conversations and translations
 
204
  ### Structure
205
 
206
  ```python
207
+ # Core fields
208
+ "messages": [...],
209
+ "instruction": "...",
210
+ "category": "...",
211
+ "language": "my or en",
212
+ "difficulty": "beginner, intermediate, advanced",
213
+ "response": "...",
214
+ "task_type": "...",
215
+ # Execution Feedback
216
+ "execution_feedback": {
217
+ "status": "completed or pending_validation",
218
+ "result": "...",
219
+ "error_type": "...",
220
+ "error_message": "...",
221
+ },
222
+ # Additional fields
223
+ "metadata": {...}
224
  }
225
  ```
226
 
 
228
 
229
  ```python
230
  from datasets import load_dataset
231
+ dataset = load_dataset("amkyawdev/mm-llm-code-v3")
 
232
  train_ds = dataset["train"]
233
  sample = train_ds[0]
234
+
235
+ # To use the files in the storage bucket
236
+ from datasets import load_dataset
237
+ adp_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl")
238
+ clean_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl")
239
  ```
240
 
241
  ### Use Cases
 
252
 
253
  ### Dataset URL
254
 
255
+ https://huggingface.co/datasets/amkyawdev/mm-llm-code-v3