| annotations_creators: | |
| - no-annotation | |
| language_creators: | |
| - found | |
| languages: | |
| - my | |
| - en | |
| licenses: | |
| - apache-2.0 | |
| multilinguality: | |
| - multilingual | |
| size_categories: | |
| - n_1M_to_n_10M | |
| source_datasets: | |
| - amkyawdev/myanmar-llm-data | |
| - amkyawdev/mm-llm-coder-agent-dataset | |
| - amkyawdev/mm-llm-coder-dataset | |
| # Combined Myanmar LLM Dataset | |
| Three Myanmar-related datasets combined into a single corpus for training large language models on code generation and Myanmar language understanding. | |
| [English](#english) | [မြန်မာဘာသာ](#မြန်မာဘာသာ) | |
| --- | |
| ## English | |
| ### Overview | |
| This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities: | |
| | Dataset | Description | Samples | | |
| |---------|-------------|---------| | |
| | `amkyawdev/myanmar-llm-data` | Myanmar language conversations, translations, Q&A | 20,327 | | |
| | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow for coding tasks | 1,000,020 | | |
| | `amkyawdev/mm-llm-coder-dataset` | Code generation tasks | 2,000,000 | | |
| **Total Samples: 3,020,347** | |
| ### Dataset Structure | |
| Each sample contains: | |
| ```python | |
| { | |
|     "messages": [  # Chat messages (list of dicts with role/content) | |
|         {"role": "system", "content": "You are a helpful assistant."}, | |
|         {"role": "user", "content": "User input here"}, | |
|         {"role": "assistant", "content": "Response here"} | |
|     ], | |
|     "instruction": "Task instruction (for code datasets)", | |
|     "category": "Task category (greeting, translation, code_debugging, etc.)", | |
|     "language": "en or my", | |
|     "difficulty": "beginner, intermediate, or advanced", | |
|     "response": "Expected response/output", | |
|     "task_type": "Type of task (qa_conversation, agent_workflow, etc.)" | |
| } | |
| ``` | |
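As a quick sanity check on the layout above, the sketch below tests whether a sample's `messages` list has the documented shape. The helper name `check_messages` and the set of allowed roles are assumptions drawn from the example above; they are not part of the dataset's own tooling, and the real data may contain other roles.

```python
from typing import Any

# Roles seen in the example above; the real data may use others (assumption).
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_messages(sample: dict[str, Any]) -> bool:
    """Return True if sample["messages"] is a non-empty list of role/content dicts."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            return False
        if not isinstance(message.get("content"), str):
            return False
    return True


# Hypothetical sample that mirrors the structure shown above.
example = {
    "messages": [
        {"role": "user", "content": "User input here"},
        {"role": "assistant", "content": "Response here"},
    ]
}
print(check_messages(example))
```

A check like this is most useful as a first pass over a small slice of the data, before committing to a chat template that assumes every sample follows the same shape.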
| ### Extended Fields (from mm-llm-coder-agent-dataset) | |
| Some samples include additional fields: | |
| | Field | Description | | |
| |-------|-------------| | |
| | `framework` | Framework used (React, Express, etc.) | | |
| | `runtime` | Runtime environment | | |
| | `database` | Database system | | |
| | `environment` | Development environment | | |
| | `tools_used` | Tools used in the task | | |
| | `code_snippets` | Code examples | | |
| | `validated` | Whether the sample has been validated | |
| | `rating` | Quality rating | | |
| | `complexity_score` | Task complexity score | | |
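Because only some samples carry these fields, code that reads them should tolerate their absence. The sketch below assumes that an absent field shows up either as a missing key or as `None`; that is an assumption about how the combined data is stored, and `present_extended_fields` is a hypothetical helper name.

```python
from typing import Any

# Field names taken from the table above.
EXTENDED_FIELDS = (
    "framework", "runtime", "database", "environment", "tools_used",
    "code_snippets", "validated", "rating", "complexity_score",
)


def present_extended_fields(sample: dict[str, Any]) -> list[str]:
    """List the extended fields that are set (neither missing nor None)."""
    return [name for name in EXTENDED_FIELDS if sample.get(name) is not None]


# Hypothetical samples: one with some extended fields, one without any.
agent_like = {"messages": [], "framework": "React", "rating": None}
plain = {"messages": []}
print(present_extended_fields(agent_like))
print(present_extended_fields(plain))
```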
| ### Usage | |
| ```python | |
| from datasets import load_dataset | |
| # Load the entire dataset | |
| dataset = load_dataset("amkyawdev/combined-myanmar-llm-dataset") | |
| # Load a specific split | |
| train_ds = load_dataset("amkyawdev/combined-myanmar-llm-dataset", split="train") | |
| # Access a single sample | |
| sample = train_ds[0] | |
| print(sample["messages"]) | |
| ``` | |
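With roughly three million samples, it can be convenient to inspect a few records without downloading everything first. The sketch below uses the `streaming=True` option of `load_dataset`, which yields samples lazily. The helper names `take` and `preview_stream` are hypothetical, and the code assumes the `train` split used above.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def take(samples: Iterable[dict[str, Any]], n: int) -> Iterator[dict[str, Any]]:
    """Return an iterator over at most n samples, leaving the rest unread."""
    return islice(samples, n)


def preview_stream(n: int = 3) -> None:
    """Print the messages of the first n samples using streaming mode."""
    # Imported here so that `take` stays usable without the datasets library.
    from datasets import load_dataset

    stream = load_dataset(
        "amkyawdev/combined-myanmar-llm-dataset",
        split="train",
        streaming=True,  # iterate lazily instead of downloading the full dataset
    )
    for sample in take(stream, n):
        print(sample["messages"])
```

Calling `preview_stream()` requires network access and the `datasets` package; `take` itself works on any iterable.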
| ### Use Cases | |
| - **Myanmar Language Models**: Training LLMs that understand the Burmese (Myanmar) language | |
| - **Code Generation**: Training models for programming tasks | |
| - **Multilingual Tasks**: Translation between English and Myanmar | |
| - **Chatbots**: Building conversational AI for Myanmar speakers | |
| - **Agent Workflows**: Training coding agents | |
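Most of these use cases need only part of the data. One way to select a subset is to filter on the `language` and `task_type` fields. The sketch below is illustrative: `make_filter` is a hypothetical helper, and it assumes the fields hold exactly the string values shown in the structure section (`"en"`, `"my"`, `"qa_conversation"`, `"agent_workflow"`).

```python
from typing import Any, Callable, Optional


def make_filter(
    language: Optional[str] = None,
    task_type: Optional[str] = None,
) -> Callable[[dict[str, Any]], bool]:
    """Build a predicate that keeps samples matching the given field values."""

    def keep(sample: dict[str, Any]) -> bool:
        if language is not None and sample.get("language") != language:
            return False
        if task_type is not None and sample.get("task_type") != task_type:
            return False
        return True

    return keep


# Hypothetical in-memory samples standing in for real records.
samples = [
    {"language": "my", "task_type": "qa_conversation"},
    {"language": "en", "task_type": "agent_workflow"},
    {"language": "en", "task_type": "qa_conversation"},
]
myanmar_only = [s for s in samples if make_filter(language="my")(s)]
agent_only = [s for s in samples if make_filter(task_type="agent_workflow")(s)]
print(len(myanmar_only), len(agent_only))
```

The same predicate can be passed to `Dataset.filter` from the `datasets` library, for example `train_ds.filter(make_filter(language="my"))`.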
| ### Dataset Card Citation | |
| If you use this dataset, please cite: | |
| ```bibtex | |
| @dataset{combined_myanmar_llm, | |
|   title={Combined Myanmar LLM Dataset}, | |
|   author={amkyawdev}, | |
|   year={2024}, | |
|   url={https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset} | |
| } | |
| ``` | |
| --- | |
| ## မြန်မာဘာသာ | |
| ### အနှစ်ချူပါ | |
| ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် သုံးခုေကာင်း ဒေါင်းလုဒ်များကို ပေါင်းစပ်ထားပပါ။ | |
| | ဒေါင်းလုဒ် | ဖော်ပါ | ပါဝင်မှု | | |
| |---------|----------|----------| | |
| | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက်ပါ၊ ဘာသာပြန်၊ Q&A | ၂၀,၃၂၇ | | |
| | `amkyawdev/mm-llm-coder-agent-dataset` | ကုဒ်ရေးလုပ်တဲ့ agents များ | ၁,၀၀၀,၀၂၀ | | |
| | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်တဲ့ အလုပ်များ | ၂,၀၀၀,၀၀၀ | | |
| **ပါဝင်မှု စုစုပါး: ၃,၀၂၀,၃၄၇** | |
| ### ဖွဲ့စည်းပါ | |
| နမူနာတစ်ခုခုမှာ ပါဝင်တာများ: | |
| ```python | |
| { | |
| "messages": [ # ပါးဆက်ပါ (role/content ရှိတဲ့ dict များ) | |
| {"role": "system", "content": "သင်သည် အကူအညီပါ။"}, | |
| {"role": "user", "content": "သုံးစွဲသူပါ"}, | |
| {"role": "assistant", "content": "အဖြေပါ"} | |
| ], | |
| "instruction": "အလုပ်ညွှန်ကိုးကါ (ကုဒ် dataset များအတွက်)", | |
| "category": "အလုပ်အမျိုးအစား (greeting, translation, code_debugging, etc.)", | |
| "language": "en သို့မဟုတ် my", | |
| "difficulty": "beginner, intermediate, သို့မဟုတ် advanced", | |
| "response": "မျှော်လင့်တဲ့ အဖြေ/ထွက်ပါ", | |
| "task_type": "အလုပ်အမျိုးအစား (qa_conversation, agent_workflow, etc.)" | |
| } | |
| ``` | |
| ### သုံးပါ | |
| ```python | |
| from datasets import load_dataset | |
| # ဒေါင်းလုဒ်လုပ်ချက် | |
| dataset = load_dataset("amkyawdev/combined-myanmar-llm-dataset") | |
| # ပါဝင်မှု | |
| train_ds = load_dataset("amkyawdev/combined-myanmar-llm-dataset", split="train") | |
| # နမူနာတစ်ခုယူပါ | |
| sample = train_ds[0] | |
| print(sample["messages"]) | |
| ``` | |
| ### သုံးပြုနည်း အမျိုးမျိုး | |
| - **မြန်မာစာ LLM**: မြန်မာစာနားလည်တဲ့ LLM များကို လေ့ကျင့်ခြင်း | |
| - **ကုဒ်ထုတ်လုပ်ခြင်း**: ပရိုဂရမ်ရေးလုပ်တဲ့ မော်ဒယ်များကို လေ့ကျင့်ခြင်း | |
| - **ဘာသာပြန်**: အင်္ဂလိပ်နဲ့ မြန်မာပါးကြား ပြန်ဆိုခြင်း | |
| - **ခွန်းဖြေ**: မြန်မာစာပါးဆက်ပါ AI များကို ဆောက်လုပ်ခြင်း | |
| ### ကိုးကားချက် | |
| ဒီဒေါင်းလုဒ်များကို သုံးပါက ကျေးဇူးပါ။: | |
| ```bibtex | |
| @dataset{combined_myanmar_llm, | |
| title={Combined Myanmar LLM Dataset}, | |
| author={amkyawdev}, | |
| year={2024}, | |
| url={https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset} | |
| } | |
| ``` | |
| --- | |
| ### License | |
| Apache 2.0 License | |
| ### Dataset URL | |
| https://huggingface.co/datasets/amkyawdev/combined-myanmar-llm-dataset |