mm-llm-code-v3 / README.md
amkyawdev's picture
Update README.md
bc42502 verified
metadata
license: apache-2.0
language:
  - my
  - en
multilinguality: multilingual
size_categories: n_1M_to_n_10M
source_datasets:
  - amkyawdev/myanmar-llm-data
  - amkyawdev/mm-llm-coder-agent-dataset
  - amkyawdev/mm-llm-coder-dataset

Combined Myanmar LLM Model

A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding.

English | မြန်မာဘာသာ


English

Overview

This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:

Source Dataset Description Type Samples
chat-skill.md amkyawdev/myanmar-llm-data Myanmar conversations, translations, Q&A Chat/Skill ~54,553
agent-skill.md amkyawdev/mm-llm-coder-agent-dataset Agent workflow for coding tasks Agent/Skill ~1,000,020
code-skill.md amkyawdev/mm-llm-coder-dataset Code generation and debugging Code/Skill ~2,000,000

Total Samples: ~3,020,347

Data Sources

1. chat-skill.md - Myanmar LLM Data (amkyawdev/myanmar-llm-data)

Multi-turn conversations in Burmese and English:

  • Format: messages (role + content), tags
  • Link: View Dataset

2. agent-skill.md - Coder Agent Dataset (amkyawdev/mm-llm-coder-agent-dataset)

Agent workflows with tool usage:

  • Format: Agent workflows with tools_used, code_snippets, execution_result
  • Link: View Dataset

3. code-skill.md - Coder Dataset (amkyawdev/mm-llm-coder-dataset)

Code generation and debugging tasks:

Features

  • Myanmar Language Support: Native Burmese (မြန်မာစာ) conversations and translations
  • Code Generation: Python, JavaScript, TypeScript and other programming languages
  • Agent Workflows: Multi-step coding tasks with tool usage
  • Quality Metrics: Ratings, validation status, and complexity scores

Usage

from datasets import load_dataset

# Load chat-skill dataset (Myanmar conversations)
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
print("Chat data:", chat_ds)

# Load agent-skill dataset (Agent workflows)
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
print("Agent data:", agent_ds)

# Load code-skill dataset (Code generation)
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
print("Code data:", code_ds)

# Access specific samples
chat_sample = chat_ds["train"][0]
print("Messages:", chat_sample["messages"])
print("Tags:", chat_sample["tags"])

Use Cases

  • Myanmar Language Models: Train LLMs that understand Burmese
  • Code Generation: Train models for programming tasks
  • Agent Workflows: Train coding agents with tool usage
  • Debugging: Fix common coding errors
  • Multilingual Tasks: Translation between English and Myanmar

License

Apache 2.0 License


မြန်မာဘာသာ

အနှစ်ချူပါ

ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။

Source Dataset Description Samples
chat-skill.md amkyawdev/myanmar-llm-data မြန်မာစာပါးဆက် ~54,553
agent-skill.md amkyawdev/mm-llm-coder-agent-dataset Agent workflow ~1,000,020
code-skill.md amkyawdev/mm-llm-coder-dataset ကုဒ်ထုတ်လုပ်ခြင်း ~2,000,000

ပါဝင်မှု စုစုပါး: ~3,020,347

သုံးပါ

from datasets import load_dataset

chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")

License

Apache 2.0 License