amkyawdev committed on
Commit b8716c5 · verified · 1 Parent(s): a5faf54

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +53 -182
README.md CHANGED
@@ -11,9 +11,9 @@ source_datasets:
11
  - amkyawdev/mm-llm-coder-dataset
12
  ---
13
 
14
- # Combined Myanmar LLM Dataset
15
 
16
- A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding. This dataset follows the ADP (Agent Data Protocol) format with execution feedback.
17
 
18
  [English](#english) | [မြန်မာဘာသာ](#myanmar)
19
 
@@ -25,231 +25,102 @@ A comprehensive dataset combining three Myanmar-related datasets for training la
25
 
26
  This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:
27
 
28
- | Dataset | Description | Samples |
29
- |---------|-------------|---------|
30
- | `amkyawdev/myanmar-llm-data` | Myanmar language conversations, translations, Q&A | 20,327 |
31
- | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow for coding tasks | 1,000,020 |
32
- | `amkyawdev/mm-llm-coder-dataset` | Code generation tasks | 2,000,000 |
33
 
34
- **Total Samples: 3,020,347**
35
 
36
- ### Storage Files
37
 
38
- This dataset also connects to storage bucket files for additional data:
39
 
40
- | File | Location | Size |
41
- |------|----------|------|
42
- | ADP Execution Feedback | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl` | 50.8 MB |
43
- | Myanmar LLM Clean | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl` | 2.79 GB |
44
- | Myanmar LLM Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl` | 24.8 MB |
45
- | Myanmar LLM Data Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-data-formatted.jsonl` | 3.7 kB |
46
 
47
  ### Features
48
 
49
  - **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations
50
- - **Code Generation**: Python, JavaScript, TypeScript, and other programming languages
51
  - **Agent Workflows**: Multi-step coding tasks with tool usage
52
- - **Execution Feedback**: Results from code execution including errors and test results
53
  - **Quality Metrics**: Ratings, validation status, and complexity scores
54
 
55
- ### Dataset Structure
56
-
57
- Each sample contains:
58
-
59
- ```python
60
- {
61
- # Core Fields
62
- "messages": [
63
- {"role": "system", "content": "You are a helpful assistant."},
64
- {"role": "user", "content": "User input here"},
65
- {"role": "assistant", "content": "Response here"}
66
- ],
67
- # Task Definition
68
- "instruction": "Task instruction",
69
- "category": "Task category",
70
- "language": "en or my",
71
- "difficulty": "beginner, intermediate, or advanced",
72
- "response": "Expected response/output",
73
- "task_type": "Type of task",
74
- # Execution Feedback
75
- "execution_feedback": {
76
- "status": "completed or pending_validation",
77
- "result": "Execution result",
78
- "error_type": "runtime_error, syntax_error, etc.",
79
- "error_message": "Error details",
80
- "execution_time_ms": 1000,
81
- },
82
- # Extended Fields
83
- "framework": "React, Express, etc.",
84
- "runtime": "Node.js, Python, etc.",
85
- "database": "MongoDB, PostgreSQL, etc.",
86
- "validated": True/False,
87
- "rating": 0.0 to 1.0,
88
- "complexity_score": 1 to 10,
89
- # Metadata
90
- "metadata": {
91
- "created_at": "2024-01-01T00:00:00",
92
- "difficulty": "beginner/intermediate/advanced",
93
- }
94
- }
95
- ```
96
-
97
  ### Usage
98
 
99
  ```python
100
  from datasets import load_dataset
101
 
102
- # Load the entire dataset
103
- dataset = load_dataset("amkyawdev/mm-llm-code-v3")
 
104
 
105
- # Load specific split
106
- train_ds = load_dataset("amkyawdev/mm-llm-code-v3", split="train")
 
107
 
108
- # Access a single sample
109
- sample = train_ds[0]
110
- print(sample["messages"])
111
- print(sample["execution_feedback"])
112
 
113
- # Load from storage bucket directly
114
- from datasets import load_dataset
115
- ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl")
116
- ```
117
-
118
- ### Storage Bucket Access
119
-
120
- Access data directly from storage bucket:
121
-
122
- ```python
123
- # ADP Execution Feedback (50.8 MB)
124
- from datasets import load_dataset
125
- adp_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl")
126
-
127
- # Clean dataset (2.79 GB)
128
- clean_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl")
129
-
130
- # Formatted dataset (24.8 MB)
131
- formatted_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl")
132
  ```
133
 
134
  ### Use Cases
135
 
136
- - **Myanmar Language Models**: Training LLMs that understand Burmese/Myanmar language
137
- - **Code Generation**: Training models for programming tasks
 
 
138
  - **Multilingual Tasks**: Translation between English and Myanmar
139
- - **Chatbots**: Building conversational AI for Myanmar speakers
140
- - **Agent Workflows**: Training coding agents with execution feedback
141
-
142
- ### Data Format
143
-
144
- This dataset is available in multiple formats:
145
-
146
- 1. **HuggingFace Dataset**: Standard format with all fields
147
- 2. **ADP Format**: JSONL format with execution feedback
148
- - Location: `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl`
149
- 3. **Clean Format**: Cleaned and processed data
150
- - Location: `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl`
151
- 4. **Formatted Format**: Pre-formatted for training
152
- - Location: `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl`
153
 
154
  ### License
155
 
156
  Apache 2.0 License
157
 
158
- ### Citation
159
-
160
- ```bibtex
161
- @dataset{combined_myanmar_llm,
162
- title={Combined Myanmar LLM Dataset},
163
- author={amkyawdev},
164
- year={2024},
165
- url={https://huggingface.co/datasets/amkyawdev/mm-llm-code-v3}
166
- }
167
- ```
168
-
169
  ---
170
 
171
  ## မြန်မာဘာသာ
172
 
173
  ### အနှစ်ချူပါ
174
 
175
- ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် သုံးခုကာင်း ဒေါင်းလုဒ်များကို ပေါင်းစပ်ထားပါ။ ADP (Agent Data Protocol) format နှင့် ပါဝင်ပပါ။
176
-
177
- | ဒေါင်းလုဒ် | ဖော်ပါ | ပါဝင်မှု |
178
- |---------|----------|----------|
179
- | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက်၊ ဘာသာပြန်၊ Q&A | ၂၀,၃၂၇ |
180
- | `amkyawdev/mm-llm-coder-agent-dataset` | ကုဒ်ရေးလုပ်တဲ့ agents များ | ၁,၀၀၀,၀၂၀ |
181
- | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်တဲ့ အလုပ်များ | ၂,၀၀၀,၀၀၀ |
182
-
183
- **ပါဝင်မှု စုစုပါး: ၃,၀၂၀,၃၄၇**
184
-
185
- ### Storage ဖိုင်များ
186
-
187
- Storage bucket မှာရှိတဲ့ ဖိုင်များနှင့်လည်း ပါဝင်ပပါး။
188
-
189
- | ဖိုင် | တည်နေရာ | အရွယ်အစား |
190
- |------|----------|------|
191
- | ADP Execution Feedback | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl` | ၅၀.၈ MB |
192
- | Myanmar LLM Clean | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl` | ၂.၇၉ GB |
193
- | Myanmar LLM Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-formatted.jsonl` | ၂၄.၈ MB |
194
- | Myanmar LLM Data Formatted | `hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-data-formatted.jsonl` | ၃.၇ kB |
195
 
196
- ### ပါဝင်တဲ့ အချက်များ
197
 
198
- - **မြန်မာစာ ပါဝင်မှု**: မြန်မာပါးဆက်များနှင့် ဘာသာပြန်များ
199
- - **ကုဒ်ထုတ်လုပ်ခြင်း**: Python, JavaScript, TypeScript နှင့် အခြားပရိုဂရမ်ဘာသာများ
200
- - **Agent Workflows**: အလုပ်အများအဆင့်ဆင့်လုပ်တဲ့ ကုဒ်ရေးလုပ်တဲ့ အလုပ်များ
201
- - **Execution Feedback**: ကုဒ်လုပ်ခါင်းရလာဒ်၊ အမှားများ၊ စမ်းသပ်ချက်များ
202
- - **အရည်အသွေး**: Rating၊ validation status၊ complexity scores
203
-
204
- ### ဖွဲ့စည်းပါ
205
-
206
- ```python
207
- # အခြေခံ ဖိုင်များ
208
- "messages": [...],
209
- "instruction": "...",
210
- "category": "...",
211
- "language": "my သို့မဟုတ် en",
212
- "difficulty": "beginner, intermediate, advanced",
213
- "response": "...",
214
- "task_type": "...",
215
- # Execution Feedback
216
- "execution_feedback": {
217
- "status": "completed သို့မဟုတ် pending_validation",
218
- "result": "...",
219
- "error_type": "...",
220
- "error_message": "...",
221
- },
222
- # နောက်ထပ် ဖိုင်များ
223
- "metadata": {...}
224
- }
225
- ```
226
 
227
  ### သုံးပါ
228
 
229
  ```python
230
  from datasets import load_dataset
231
- dataset = load_dataset("amkyawdev/mm-llm-code-v3")
232
- train_ds = dataset["train"]
233
- sample = train_ds[0]
234
 
235
- # Storage bucket မှာရှိတဲ့ ဖိုင်များကို သုံးဖို့
236
- from datasets import load_dataset
237
- adp_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-adp-execution-feedback.jsonl")
238
- clean_ds = load_dataset("hf://buckets/amkyawdev/mm-llm-storage/myanmar-llm-clean.jsonl")
239
  ```
240
 
241
- ### သုံးပြုနည်း အမျိုးမျိုး
242
-
243
- - **မြန်မာစာ LLM**: မြန်မာစာနားလည်တဲ့ LLM များကို လေ့ကျင့်ခြင်း
244
- - **ကုဒ်ထုတ်လုပ်ခြင်း**: ပရိုဂရမ်ရေးလုပ်တဲ့ မော်ဒယ်များကို လေ့ကျင့်ခြင်း
245
- - **ဘာသာပြန်**: အင်္ဂလိပ်နဲ့ မြန်မာပါးကြား ပြန်ဆိုခြင်း
246
- - **ခွန်းဖြေ**: မြန်မာစာပါးဆက်ပါ AI များကို ဆောက်လုပ်ခြင်း
247
- - **Agent Workflows**: Execution feedback နဲ့ ကုဒ် agents များကို လေ့ကျင့်ခြင်း
248
-
249
  ### License
250
 
251
  Apache 2.0 License
252
-
253
- ### Dataset URL
254
-
255
- https://huggingface.co/datasets/amkyawdev/mm-llm-code-v3
 
11
  - amkyawdev/mm-llm-coder-dataset
12
  ---
13
 
14
+ # Combined Myanmar LLM Code Dataset
15
 
16
+ A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding.
17
 
18
  [English](#english) | [မြန်မာဘာသာ](#myanmar)
19
 
 
25
 
26
  This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:
27
 
28
+ | Source | Dataset | Description | Type | Samples |
29
+ |--------|---------|-------------|------|---------|
30
+ | chat-skill.md | [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) | Myanmar conversations, translations, Q&A | Chat/Skill | ~54,553 |
31
+ | agent-skill.md | [amkyawdev/mm-llm-coder-agent-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) | Agent workflow for coding tasks | Agent/Skill | ~1,000,020 |
32
+ | code-skill.md | [amkyawdev/mm-llm-coder-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) | Code generation and debugging | Code/Skill | ~2,000,000 |
33
 
34
+ **Total Samples: ~3,054,573**
35
 
36
+ ### Data Sources
37
 
38
+ #### 1. chat-skill.md - Myanmar LLM Data (`amkyawdev/myanmar-llm-data`)
39
 
40
+ Multi-turn conversations in Burmese and English:
41
+ - **Format**: `messages` (role + content), `tags`
42
+ - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data)
43
+
44
+ #### 2. agent-skill.md - Coder Agent Dataset (`amkyawdev/mm-llm-coder-agent-dataset`)
45
+
46
+ Agent workflows with tool usage:
47
+ - **Format**: Agent workflows with `tools_used`, `code_snippets`, `execution_result`
48
+ - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset)
49
+
50
+ #### 3. code-skill.md - Coder Dataset (`amkyawdev/mm-llm-coder-dataset`)
51
+
52
+ Code generation and debugging tasks:
53
+ - **Format**: Code Q&A conversations
54
+ - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset)
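
The three sources use different record formats, so a combined training set needs some way to remember where each row came from. The sketch below shows one possible approach on plain Python dicts. The `SOURCES` labels and the `tag_source` and `combine` helpers are hypothetical names introduced here for illustration, not part of the dataset or of the `datasets` library, and the toy rows are invented.

```python
from itertools import chain
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical short labels for the three sources; not defined by the dataset card.
SOURCES = {
    "chat": "amkyawdev/myanmar-llm-data",
    "agent": "amkyawdev/mm-llm-coder-agent-dataset",
    "code": "amkyawdev/mm-llm-coder-dataset",
}


def tag_source(rows: Iterable[Dict[str, Any]], label: str) -> Iterator[Dict[str, Any]]:
    """Yield a shallow copy of each row with `source` and `source_repo` added."""
    repo = SOURCES[label]
    for row in rows:
        yield {**row, "source": label, "source_repo": repo}


def combine(**rows_by_label: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Concatenate rows from several sources, recording the origin of each row."""
    return list(chain.from_iterable(
        tag_source(rows, label) for label, rows in rows_by_label.items()
    ))


# Toy rows standing in for real samples. Field names follow the formats
# listed above; the values are invented for illustration.
chat_rows = [{"messages": [{"role": "user", "content": "မင်္ဂလာပါ"}], "tags": ["greeting"]}]
agent_rows = [{"tools_used": ["python"], "code_snippets": ["print(1)"], "execution_result": "1"}]

combined = combine(chat=chat_rows, agent=agent_rows)
print([row["source"] for row in combined])
```

Because the helpers only iterate over dict-like rows, the same functions should also accept a split returned by `load_dataset`, though that has not been checked against the actual schemas.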
55
 
56
  ### Features
57
 
58
  - **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations
59
+ - **Code Generation**: Python, JavaScript, TypeScript and other programming languages
60
  - **Agent Workflows**: Multi-step coding tasks with tool usage
 
61
  - **Quality Metrics**: Ratings, validation status, and complexity scores
62
 
63
  ### Usage
64
 
65
  ```python
66
  from datasets import load_dataset
67
 
68
+ # Load chat-skill dataset (Myanmar conversations)
69
+ chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
70
+ print("Chat data:", chat_ds)
71
 
72
+ # Load agent-skill dataset (Agent workflows)
73
+ agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
74
+ print("Agent data:", agent_ds)
75
 
76
+ # Load code-skill dataset (Code generation)
77
+ code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
78
+ print("Code data:", code_ds)
 
79
 
80
+ # Access specific samples
81
+ chat_sample = chat_ds["train"][0]
82
+ print("Messages:", chat_sample["messages"])
83
+ print("Tags:", chat_sample["tags"])
84
  ```
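
Printing `messages` directly gives a raw list of dicts. As a minimal sketch, assuming each message is a dict with `role` and `content` keys as described for the chat data, a small helper can render a sample as a readable transcript. The `format_messages` name and the toy sample are illustrative assumptions, not part of the dataset.

```python
from typing import Dict, List


def format_messages(messages: List[Dict[str, str]]) -> str:
    """Render a list of {"role", "content"} dicts as a readable transcript."""
    lines = []
    for message in messages:
        # .get() keeps the helper tolerant of rows with missing keys.
        role = message.get("role", "unknown")
        content = message.get("content", "")
        lines.append(f"{role}: {content}")
    return "\n".join(lines)


# Toy sample standing in for chat_ds["train"][0]; the values are invented.
sample = {
    "messages": [
        {"role": "user", "content": "Translate 'hello' to Burmese."},
        {"role": "assistant", "content": "မင်္ဂလာပါ"},
    ],
    "tags": ["translation"],
}

print(format_messages(sample["messages"]))
```

With a real split loaded, the call would be `format_messages(chat_sample["messages"])`, assuming the split is named `train` and the column layout matches the description above.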
85
 
86
  ### Use Cases
87
 
88
+ - **Myanmar Language Models**: Train LLMs that understand Burmese
89
+ - **Code Generation**: Train models for programming tasks
90
+ - **Agent Workflows**: Train coding agents with tool usage
91
+ - **Debugging**: Fix common coding errors
92
  - **Multilingual Tasks**: Translation between English and Myanmar
93
 
94
  ### License
95
 
96
  Apache 2.0 License
97
 
98
  ---
99
 
100
  ## မြန်မာဘာသာ
101
 
102
  ### အနှစ်ချူပါ
103
 
104
+ ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။
105
 
106
+ | Source | Dataset | Description | Samples |
107
+ |--------|---------|-------------|---------|
108
+ | chat-skill.md | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက် | ~54,553 |
109
+ | agent-skill.md | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow | ~1,000,020 |
110
+ | code-skill.md | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်ခြင်း | ~2,000,000 |
111
 
112
+ **ပါဝင်မှု စုစုပေါင်း: ~3,054,573**
113
 
114
  ### သုံးပါ
115
 
116
  ```python
117
  from datasets import load_dataset
118
 
119
+ chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
120
+ agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
121
+ code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
 
122
  ```
123
 
124
  ### License
125
 
126
  Apache 2.0 License