File size: 4,506 Bytes
d922d70 a5faf54 d922d70 a5faf54 d922d70 bc42502 d922d70 b8716c5 d922d70 b8716c5 d922d70 b8716c5 d922d70 b8716c5 a5faf54 b8716c5 a5faf54 b8716c5 a5faf54 210ece5 b8716c5 210ece5 d922d70 b8716c5 d922d70 b8716c5 d922d70 b8716c5 a5faf54 b8716c5 d922d70 b8716c5 d922d70 210ece5 d922d70 b8716c5 a5faf54 b8716c5 d922d70 b8716c5 d922d70 a5faf54 b8716c5 d922d70 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 | ---
license: apache-2.0
language:
- my
- en
multilinguality: multilingual
size_categories: n_1M_to_n_10M
source_datasets:
- amkyawdev/myanmar-llm-data
- amkyawdev/mm-llm-coder-agent-dataset
- amkyawdev/mm-llm-coder-dataset
---
# Combined Myanmar LLM Model
A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding.
[English](#english) | [မြန်မာဘာသာ](#myanmar)
---
## English
### Overview
This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:
| Source | Dataset | Description | Type | Samples |
|--------|---------|-------------|------|---------|
| chat-skill.md | [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) | Myanmar conversations, translations, Q&A | Chat/Skill | ~54,553 |
| agent-skill.md | [amkyawdev/mm-llm-coder-agent-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) | Agent workflow for coding tasks | Agent/Skill | ~1,000,020 |
| code-skill.md | [amkyawdev/mm-llm-coder-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) | Code generation and debugging | Code/Skill | ~2,000,000 |
**Total Samples: ~3,020,347**
### Data Sources
#### 1. chat-skill.md - Myanmar LLM Data (`amkyawdev/myanmar-llm-data`)
Multi-turn conversations in Burmese and English:
- **Format**: `messages` (role + content), `tags`
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data)
#### 2. agent-skill.md - Coder Agent Dataset (`amkyawdev/mm-llm-coder-agent-dataset`)
Agent workflows with tool usage:
- **Format**: Agent workflows with `tools_used`, `code_snippets`, `execution_result`
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset)
#### 3. code-skill.md - Coder Dataset (`amkyawdev/mm-llm-coder-dataset`)
Code generation and debugging tasks:
- **Format**: Code Q&A conversations
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset)
### Features
- **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations
- **Code Generation**: Python, JavaScript, TypeScript and other programming languages
- **Agent Workflows**: Multi-step coding tasks with tool usage
- **Quality Metrics**: Ratings, validation status, and complexity scores
### Usage
```python
from datasets import load_dataset
# Load chat-skill dataset (Myanmar conversations)
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
print("Chat data:", chat_ds)
# Load agent-skill dataset (Agent workflows)
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
print("Agent data:", agent_ds)
# Load code-skill dataset (Code generation)
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
print("Code data:", code_ds)
# Access specific samples
chat_sample = chat_ds["train"][0]
print("Messages:", chat_sample["messages"])
print("Tags:", chat_sample["tags"])
```
### Use Cases
- **Myanmar Language Models**: Train LLMs that understand Burmese
- **Code Generation**: Train models for programming tasks
- **Agent Workflows**: Train coding agents with tool usage
- **Debugging**: Fix common coding errors
- **Multilingual Tasks**: Translation between English and Myanmar
### License
Apache 2.0 License
---
## မြန်မာဘာသာ
### အနှစ်ချူပါ
ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။
| Source | Dataset | Description | Samples |
|--------|---------|-------------|---------|
| chat-skill.md | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက် | ~54,553 |
| agent-skill.md | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow | ~1,000,020 |
| code-skill.md | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်ခြင်း | ~2,000,000 |
**ပါဝင်မှု စုစုပါး: ~3,020,347**
### သုံးပါ
```python
from datasets import load_dataset
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
```
### License
Apache 2.0 License
|