| --- |
| license: apache-2.0 |
| language: |
| - my |
| - en |
| multilinguality: multilingual |
| size_categories: n_1M_to_n_10M |
| source_datasets: |
| - amkyawdev/myanmar-llm-data |
| - amkyawdev/mm-llm-coder-agent-dataset |
| - amkyawdev/mm-llm-coder-dataset |
| --- |
| |
| # Combined Myanmar LLM Model |
|
|
| A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding. |
|
|
| [English](#english) | [မြန်မာဘာသာ](#myanmar) |
|
|
| --- |
|
|
| ## English |
|
|
| ### Overview |
|
|
| This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities: |
|
|
| | Source | Dataset | Description | Type | Samples | |
| |--------|---------|-------------|------|---------| |
| | chat-skill.md | [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) | Myanmar conversations, translations, Q&A | Chat/Skill | ~54,553 | |
| | agent-skill.md | [amkyawdev/mm-llm-coder-agent-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) | Agent workflow for coding tasks | Agent/Skill | ~1,000,020 | |
| | code-skill.md | [amkyawdev/mm-llm-coder-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) | Code generation and debugging | Code/Skill | ~2,000,000 | |
|
|
| **Total Samples: ~3,020,347** |
|
|
| ### Data Sources |
|
|
| #### 1. chat-skill.md - Myanmar LLM Data (`amkyawdev/myanmar-llm-data`) |
|
|
| Multi-turn conversations in Burmese and English: |
| - **Format**: `messages` (role + content), `tags` |
| - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) |
|
|
| #### 2. agent-skill.md - Coder Agent Dataset (`amkyawdev/mm-llm-coder-agent-dataset`) |
|
|
| Agent workflows with tool usage: |
| - **Format**: Agent workflows with `tools_used`, `code_snippets`, `execution_result` |
| - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) |
|
|
| #### 3. code-skill.md - Coder Dataset (`amkyawdev/mm-llm-coder-dataset`) |
|
|
| Code generation and debugging tasks: |
| - **Format**: Code Q&A conversations |
| - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) |
|
|
| ### Features |
|
|
| - **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations |
| - **Code Generation**: Python, JavaScript, TypeScript and other programming languages |
| - **Agent Workflows**: Multi-step coding tasks with tool usage |
| - **Quality Metrics**: Ratings, validation status, and complexity scores |
|
|
| ### Usage |
|
|
| ```python |
| from datasets import load_dataset |
| |
| # Load chat-skill dataset (Myanmar conversations) |
| chat_ds = load_dataset("amkyawdev/myanmar-llm-data") |
| print("Chat data:", chat_ds) |
| |
| # Load agent-skill dataset (Agent workflows) |
| agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset") |
| print("Agent data:", agent_ds) |
| |
| # Load code-skill dataset (Code generation) |
| code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset") |
| print("Code data:", code_ds) |
| |
| # Access specific samples |
| chat_sample = chat_ds["train"][0] |
| print("Messages:", chat_sample["messages"]) |
| print("Tags:", chat_sample["tags"]) |
| ``` |
|
|
| ### Use Cases |
|
|
| - **Myanmar Language Models**: Train LLMs that understand Burmese |
| - **Code Generation**: Train models for programming tasks |
| - **Agent Workflows**: Train coding agents with tool usage |
| - **Debugging**: Fix common coding errors |
| - **Multilingual Tasks**: Translation between English and Myanmar |
|
|
| ### License |
|
|
| Apache 2.0 License |
|
|
| --- |
|
|
| ## မြန်မာဘာသာ |
|
|
| ### အနှစ်ချူပါ |
|
|
| ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။ |
|
|
| | Source | Dataset | Description | Samples | |
| |--------|---------|-------------|---------| |
| | chat-skill.md | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက် | ~54,553 | |
| | agent-skill.md | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow | ~1,000,020 | |
| | code-skill.md | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်ခြင်း | ~2,000,000 | |
|
|
| **ပါဝင်မှု စုစုပါး: ~3,020,347** |
|
|
| ### သုံးပါ |
|
|
| ```python |
| from datasets import load_dataset |
| |
| chat_ds = load_dataset("amkyawdev/myanmar-llm-data") |
| agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset") |
| code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset") |
| ``` |
|
|
| ### License |
|
|
| Apache 2.0 License |
|
|