--- license: apache-2.0 language: - my - en multilinguality: multilingual size_categories: n_1M_to_n_10M source_datasets: - amkyawdev/myanmar-llm-data - amkyawdev/mm-llm-coder-agent-dataset - amkyawdev/mm-llm-coder-dataset --- # Combined Myanmar LLM Model A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding. [English](#english) | [မြန်မာဘာသာ](#myanmar) --- ## English ### Overview This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities: | Source | Dataset | Description | Type | Samples | |--------|---------|-------------|------|---------| | chat-skill.md | [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) | Myanmar conversations, translations, Q&A | Chat/Skill | ~54,553 | | agent-skill.md | [amkyawdev/mm-llm-coder-agent-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) | Agent workflow for coding tasks | Agent/Skill | ~1,000,020 | | code-skill.md | [amkyawdev/mm-llm-coder-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) | Code generation and debugging | Code/Skill | ~2,000,000 | **Total Samples: ~3,020,347** ### Data Sources #### 1. chat-skill.md - Myanmar LLM Data (`amkyawdev/myanmar-llm-data`) Multi-turn conversations in Burmese and English: - **Format**: `messages` (role + content), `tags` - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) #### 2. agent-skill.md - Coder Agent Dataset (`amkyawdev/mm-llm-coder-agent-dataset`) Agent workflows with tool usage: - **Format**: Agent workflows with `tools_used`, `code_snippets`, `execution_result` - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) #### 3. code-skill.md - Coder Dataset (`amkyawdev/mm-llm-coder-dataset`) Code generation and debugging tasks: - **Format**: Code Q&A conversations - **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) ### Features - **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations - **Code Generation**: Python, JavaScript, TypeScript and other programming languages - **Agent Workflows**: Multi-step coding tasks with tool usage - **Quality Metrics**: Ratings, validation status, and complexity scores ### Usage ```python from datasets import load_dataset # Load chat-skill dataset (Myanmar conversations) chat_ds = load_dataset("amkyawdev/myanmar-llm-data") print("Chat data:", chat_ds) # Load agent-skill dataset (Agent workflows) agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset") print("Agent data:", agent_ds) # Load code-skill dataset (Code generation) code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset") print("Code data:", code_ds) # Access specific samples chat_sample = chat_ds["train"][0] print("Messages:", chat_sample["messages"]) print("Tags:", chat_sample["tags"]) ``` ### Use Cases - **Myanmar Language Models**: Train LLMs that understand Burmese - **Code Generation**: Train models for programming tasks - **Agent Workflows**: Train coding agents with tool usage - **Debugging**: Fix common coding errors - **Multilingual Tasks**: Translation between English and Myanmar ### License Apache 2.0 License --- ## မြန်မာဘာသာ ### အနှစ်ချူပါ ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။ | Source | Dataset | Description | Samples | |--------|---------|-------------|---------| | chat-skill.md | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက် | ~54,553 | | agent-skill.md | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow | ~1,000,020 | | code-skill.md | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်ခြင်း | ~2,000,000 | **ပါဝင်မှု စုစုပါး: ~3,020,347** ### သုံးပါ ```python from datasets import load_dataset chat_ds = load_dataset("amkyawdev/myanmar-llm-data") agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset") code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset") ``` ### License Apache 2.0 License