File size: 4,506 Bytes
d922d70
a5faf54
 
d922d70
 
a5faf54
 
d922d70
 
 
 
 
 
bc42502
d922d70
b8716c5
d922d70
 
 
 
 
 
 
 
 
 
 
b8716c5
 
 
 
 
d922d70
b8716c5
d922d70
b8716c5
a5faf54
b8716c5
a5faf54
b8716c5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a5faf54
210ece5
 
 
b8716c5
210ece5
 
 
d922d70
 
 
 
 
b8716c5
 
 
d922d70
b8716c5
 
 
d922d70
b8716c5
 
 
a5faf54
b8716c5
 
 
 
d922d70
 
 
 
b8716c5
 
 
 
d922d70
 
210ece5
 
 
d922d70
 
 
 
 
 
 
b8716c5
a5faf54
b8716c5
 
 
 
 
d922d70
b8716c5
d922d70
 
 
 
 
a5faf54
b8716c5
 
 
d922d70
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
---
license: apache-2.0
language:
- my
- en
multilinguality: multilingual
size_categories: n_1M_to_n_10M
source_datasets:
- amkyawdev/myanmar-llm-data
- amkyawdev/mm-llm-coder-agent-dataset
- amkyawdev/mm-llm-coder-dataset
---

# Combined Myanmar LLM Model

A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding.

[English](#english) | [မြန်မာဘာသာ](#myanmar)

---

## English

### Overview

This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:

| Source | Dataset | Description | Type | Samples |
|--------|---------|-------------|------|---------|
| chat-skill.md | [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) | Myanmar conversations, translations, Q&A | Chat/Skill | ~54,553 |
| agent-skill.md | [amkyawdev/mm-llm-coder-agent-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) | Agent workflow for coding tasks | Agent/Skill | ~1,000,020 |
| code-skill.md | [amkyawdev/mm-llm-coder-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) | Code generation and debugging | Code/Skill | ~2,000,000 |

**Total Samples: ~3,020,347**

### Data Sources

#### 1. chat-skill.md - Myanmar LLM Data (`amkyawdev/myanmar-llm-data`)

Multi-turn conversations in Burmese and English:
- **Format**: `messages` (role + content), `tags`
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data)

#### 2. agent-skill.md - Coder Agent Dataset (`amkyawdev/mm-llm-coder-agent-dataset`)

Agent workflows with tool usage:
- **Format**: Agent workflows with `tools_used`, `code_snippets`, `execution_result`
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset)

#### 3. code-skill.md - Coder Dataset (`amkyawdev/mm-llm-coder-dataset`)

Code generation and debugging tasks:
- **Format**: Code Q&A conversations
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset)

### Features

- **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations
- **Code Generation**: Python, JavaScript, TypeScript and other programming languages
- **Agent Workflows**: Multi-step coding tasks with tool usage
- **Quality Metrics**: Ratings, validation status, and complexity scores

### Usage

```python
from datasets import load_dataset

# Load chat-skill dataset (Myanmar conversations)
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
print("Chat data:", chat_ds)

# Load agent-skill dataset (Agent workflows)
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
print("Agent data:", agent_ds)

# Load code-skill dataset (Code generation)
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
print("Code data:", code_ds)

# Access specific samples
chat_sample = chat_ds["train"][0]
print("Messages:", chat_sample["messages"])
print("Tags:", chat_sample["tags"])
```

### Use Cases

- **Myanmar Language Models**: Train LLMs that understand Burmese
- **Code Generation**: Train models for programming tasks
- **Agent Workflows**: Train coding agents with tool usage
- **Debugging**: Fix common coding errors
- **Multilingual Tasks**: Translation between English and Myanmar

### License

Apache 2.0 License

---

## မြန်မာဘာသာ

### အနှစ်ချူပါ

ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။

| Source | Dataset | Description | Samples |
|--------|---------|-------------|---------|
| chat-skill.md | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက် | ~54,553 |
| agent-skill.md | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow | ~1,000,020 |
| code-skill.md | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်ခြင်း | ~2,000,000 |

**ပါဝင်မှု စုစုပါး: ~3,020,347**

### သုံးပါ

```python
from datasets import load_dataset

chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
```

### License

Apache 2.0 License