amkyawdev committed on
Commit d922d70 · verified · 1 Parent(s): 78b1259

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +196 -0
README.md ADDED
@@ -0,0 +1,196 @@
+ ---
+ annotations_creators:
+ - no-annotation
+ language_creators:
+ - found
+ language:
+ - my
+ - en
+ license:
+ - apache-2.0
+ multilinguality:
+ - multilingual
+ size_categories:
+ - 1M<n<10M
+ source_datasets:
+ - amkyawdev/myanmar-llm-data
+ - amkyawdev/mm-llm-coder-agent-dataset
+ - amkyawdev/mm-llm-coder-dataset
+ ---
+
+ # Combined Myanmar LLM Dataset
+
+ A combined dataset that merges three Myanmar-related source datasets for training large language models, with a focus on code generation and Myanmar (Burmese) language understanding.
+
+ [English](#english) | [မြန်မာဘာသာ](#myanmar)
+
+ ---
+
+ ## English
+
+ ### Overview
+
+ This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:
+
+ | Dataset | Description | Samples |
+ |---------|-------------|---------|
+ | `amkyawdev/myanmar-llm-data` | Myanmar language conversations, translations, Q&A | 20,327 |
+ | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow for coding tasks | 1,000,020 |
+ | `amkyawdev/mm-llm-coder-dataset` | Code generation tasks | 2,000,000 |
+
+ **Total Samples: 3,020,347**
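
The per-source counts in the table above can be cross-checked with a few lines of Python. This is a minimal sketch: the numbers are copied from the table, and the helper names `total_samples` and `source_shares` are illustrative, not part of the dataset or of any library.

```python
# Per-source sample counts, copied from the Overview table.
SOURCE_SIZES = {
    "amkyawdev/myanmar-llm-data": 20_327,
    "amkyawdev/mm-llm-coder-agent-dataset": 1_000_020,
    "amkyawdev/mm-llm-coder-dataset": 2_000_000,
}


def total_samples(sizes):
    """Sum the sample counts of all source datasets."""
    return sum(sizes.values())


def source_shares(sizes):
    """Return each source's fraction of the combined dataset."""
    total = total_samples(sizes)
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    print(f"total: {total_samples(SOURCE_SIZES):,}")
    for name, share in source_shares(SOURCE_SIZES).items():
        print(f"{name}: {share:.2%}")
```

The total matches the stated 3,020,347. The shares also show that the Myanmar conversational source makes up under 1% of all samples, which may matter when choosing sampling weights for training.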
+
+ ### Dataset Structure
+
+ Each sample contains:
+
+ ```python
+ {
+     "messages": [  # Chat messages (list of dicts with role/content)
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "User input here"},
+         {"role": "assistant", "content": "Response here"}
+     ],
+     "instruction": "Task instruction (for code datasets)",
+     "category": "Task category (greeting, translation, code_debugging, etc.)",
+     "language": "en or my",
+     "difficulty": "beginner, intermediate, or advanced",
+     "response": "Expected response/output",
+     "task_type": "Type of task (qa_conversation, agent_workflow, etc.)"
+ }
+ ```
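
A quick structural check against the layout above might look like the following. It is a sketch under assumptions: the card does not state whether every sample carries all seven fields, or whether roles other than `system`, `user`, and `assistant` occur, so `REQUIRED_FIELDS`, `KNOWN_ROLES`, and `check_sample` are hypothetical names that encode only what the example shows.

```python
# Field names taken from the sample layout; treating all of them as
# required is an assumption, not something the dataset card guarantees.
REQUIRED_FIELDS = (
    "messages", "instruction", "category", "language",
    "difficulty", "response", "task_type",
)
# Roles seen in the example; other roles may exist in the real data.
KNOWN_ROLES = {"system", "user", "assistant"}


def check_sample(sample):
    """Return a list of problems; an empty list means the sample looks well-formed."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in sample]
    messages = sample.get("messages")
    if not isinstance(messages, list):
        if "messages" in sample:
            problems.append("messages is not a list")
        return problems
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{index}] is not a dict")
            continue
        if message.get("role") not in KNOWN_ROLES:
            problems.append(f"messages[{index}] has unknown role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{index}] content is not a string")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "User input here"},
        {"role": "assistant", "content": "Response here"},
    ],
    "instruction": "Task instruction",
    "category": "greeting",
    "language": "en",
    "difficulty": "beginner",
    "response": "Expected response",
    "task_type": "qa_conversation",
}

if __name__ == "__main__":
    print(check_sample(example))  # prints []
```

Returning a list of problems rather than raising makes it easy to count how many samples deviate from the documented layout before deciding how to handle them.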
+
+ ### Extended Fields (from mm-llm-coder-agent-dataset)
+
+ Some samples include additional fields:
+
+ | Field | Description |
+ |-------|-------------|
+ | `framework` | Framework used (React, Express, etc.) |
+ | `runtime` | Runtime environment |
+ | `database` | Database system |
+ | `environment` | Development environment |
+ | `tools_used` | Tools used in the task |
+ | `code_snippets` | Code examples |
+ | `validated` | Whether the sample has been validated |
+ | `rating` | Quality rating |
+ | `complexity_score` | Task complexity score |
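
Because only some samples carry these fields, downstream code should not index them directly. The sketch below assumes an absent field shows up either as a missing key or as `None`; the card does not say which, and the helper names are illustrative.

```python
# Extended field names, copied from the table above.
EXTENDED_FIELDS = (
    "framework", "runtime", "database", "environment", "tools_used",
    "code_snippets", "validated", "rating", "complexity_score",
)


def extended_fields(sample):
    """Return only the extended fields that are present and not None."""
    return {
        name: sample[name]
        for name in EXTENDED_FIELDS
        if sample.get(name) is not None
    }


def has_extended_fields(sample):
    """True if the sample carries at least one extended field."""
    return bool(extended_fields(sample))
```

Comparing against `None` rather than testing truthiness keeps legitimate falsy values, such as `validated=False` or a `rating` of `0`.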
+
+ ### Usage
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the entire dataset (returns a DatasetDict keyed by split name)
+ dataset = load_dataset("amkyawdev/combined-myanmar-llm-dataset")
+
+ # Load a specific split
+ train_ds = load_dataset("amkyawdev/combined-myanmar-llm-dataset", split="train")
+
+ # Access a single sample
+ sample = train_ds[0]
+ print(sample["messages"])
+ ```
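
With roughly three million samples, it can be convenient to iterate lazily and keep only the samples of interest. The sketch below works on any iterable of sample dicts, so it runs without network access; the commented lines show how it could be fed from `load_dataset(..., streaming=True)`. The helpers `iter_matching` and `take` are hypothetical, and the field values used for filtering (`"my"`, `"agent_workflow"`) are taken from the examples in this card, not verified against the data.

```python
from itertools import islice


def iter_matching(samples, **criteria):
    """Lazily yield samples whose fields equal all the given criteria."""
    for sample in samples:
        if all(sample.get(field) == value for field, value in criteria.items()):
            yield sample


def take(samples, n):
    """Collect at most n samples from an iterable."""
    return list(islice(samples, n))


# Possible use with the streamed dataset (requires network access):
# from datasets import load_dataset
# stream = load_dataset(
#     "amkyawdev/combined-myanmar-llm-dataset", split="train", streaming=True
# )
# first_five = take(iter_matching(stream, language="my"), 5)
```

Streaming avoids downloading the full dataset before the first sample is available, at the cost of random access such as `train_ds[0]`.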
+
+ ### Use Cases
+
+ - **Myanmar Language Models**: Training LLMs that understand the Myanmar (Burmese) language
+ - **Code Generation**: Training models for programming tasks
+ - **Multilingual Tasks**: Translation between English and Myanmar
+ - **Chatbots**: Building conversational AI for Myanmar speakers
+ - **Agent Workflows**: Training coding agents
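
For the training-oriented use cases above, a common preparation step is to split each chat sample into a prompt and a target completion. The following is one possible sketch, assuming the final message of a sample is the assistant turn to learn from; `split_prompt_completion` is a hypothetical helper, not part of the dataset or of `datasets`.

```python
def split_prompt_completion(sample):
    """Split a chat sample into (prompt_messages, completion_text).

    Returns None when the sample has no messages or does not end with
    an assistant turn, so callers can skip it.
    """
    messages = sample.get("messages") or []
    if not messages or messages[-1].get("role") != "assistant":
        return None
    return messages[:-1], messages[-1]["content"]
```

Samples that return `None` can be dropped or inspected separately, depending on how strict the training pipeline needs to be.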
+
+ ### Dataset Card Citation
+
+ If you use this dataset, please cite:
+
+ ```bibtex
+ @dataset{combined_myanmar_llm,
+   title={Combined Myanmar LLM Dataset},
+   author={amkyawdev},
+   year={2024},
+   url={https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset}
+ }
+ ```
+
+ ---
+
+ ## Myanmar
+
+ ### Summary
+
+ This dataset combines three source datasets for training LLMs that handle the Myanmar language and write code.
+
+ | Dataset | Description | Samples |
+ |---------|----------|----------|
+ | `amkyawdev/myanmar-llm-data` | Myanmar-language conversations, translation, Q&A | 20,327 |
+ | `amkyawdev/mm-llm-coder-agent-dataset` | Coding agents | 1,000,020 |
+ | `amkyawdev/mm-llm-coder-dataset` | Code generation tasks | 2,000,000 |
+
+ **Total samples: 3,020,347**
+
+ ### Structure
+
+ Each sample contains:
+
+ ```python
+ {
+     "messages": [  # Conversation (dicts with role/content)
+         {"role": "system", "content": "You are an assistant."},
+         {"role": "user", "content": "User input"},
+         {"role": "assistant", "content": "Answer"}
+     ],
+     "instruction": "Task instruction (for code datasets)",
+     "category": "Task category (greeting, translation, code_debugging, etc.)",
+     "language": "en or my",
+     "difficulty": "beginner, intermediate, or advanced",
+     "response": "Expected answer/output",
+     "task_type": "Task type (qa_conversation, agent_workflow, etc.)"
+ }
+ ```
+
+ ### Usage
+
+ ```python
+ from datasets import load_dataset
+
+ # Download the dataset
+ dataset = load_dataset("amkyawdev/combined-myanmar-llm-dataset")
+
+ # Load the train split
+ train_ds = load_dataset("amkyawdev/combined-myanmar-llm-dataset", split="train")
+
+ # Take one sample
+ sample = train_ds[0]
+ print(sample["messages"])
+ ```
+
+ ### Various Uses
+
+ - **Myanmar-language LLMs**: Training LLMs that understand the Myanmar language
+ - **Code generation**: Training models that write programs
+ - **Translation**: Translating between English and Myanmar
+ - **Chatbots**: Building Myanmar-language conversational AI
+
+ ### Citation
+
+ If you use this dataset, please cite:
+
+ ```bibtex
+ @dataset{combined_myanmar_llm,
+   title={Combined Myanmar LLM Dataset},
+   author={amkyawdev},
+   year={2024},
+   url={https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset}
+ }
+ ```
+
+ ---
+
+ ### License
+
+ Apache 2.0 License
+
+ ### Dataset URL
+
+ https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset