mm-llm-code-v3 / README.md
amkyawdev's picture
Update README.md
bc42502 verified
---
license: apache-2.0
language:
- my
- en
multilinguality: multilingual
size_categories: n_1M_to_n_10M
source_datasets:
- amkyawdev/myanmar-llm-data
- amkyawdev/mm-llm-coder-agent-dataset
- amkyawdev/mm-llm-coder-dataset
---
# Combined Myanmar LLM Model
A comprehensive dataset combining three Myanmar-related datasets for training large language models, optimized for code generation and Myanmar language understanding.
[English](#english) | [မြန်မာဘာသာ](#myanmar)
---
## English
### Overview
This dataset combines three source datasets for training LLMs with Myanmar language and coding capabilities:
| Source | Dataset | Description | Type | Samples |
|--------|---------|-------------|------|---------|
| chat-skill.md | [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data) | Myanmar conversations, translations, Q&A | Chat/Skill | ~54,553 |
| agent-skill.md | [amkyawdev/mm-llm-coder-agent-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset) | Agent workflow for coding tasks | Agent/Skill | ~1,000,020 |
| code-skill.md | [amkyawdev/mm-llm-coder-dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset) | Code generation and debugging | Code/Skill | ~2,000,000 |
**Total Samples: ~3,020,347**
### Data Sources
#### 1. chat-skill.md - Myanmar LLM Data (`amkyawdev/myanmar-llm-data`)
Multi-turn conversations in Burmese and English:
- **Format**: `messages` (role + content), `tags`
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data)
#### 2. agent-skill.md - Coder Agent Dataset (`amkyawdev/mm-llm-coder-agent-dataset`)
Agent workflows with tool usage:
- **Format**: Agent workflows with `tools_used`, `code_snippets`, `execution_result`
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-agent-dataset)
#### 3. code-skill.md - Coder Dataset (`amkyawdev/mm-llm-coder-dataset`)
Code generation and debugging tasks:
- **Format**: Code Q&A conversations
- **Link**: [View Dataset](https://huggingface.co/datasets/amkyawdev/mm-llm-coder-dataset)
### Features
- **Myanmar Language Support**: Native Burmese (မြန်မာစာ) conversations and translations
- **Code Generation**: Python, JavaScript, TypeScript and other programming languages
- **Agent Workflows**: Multi-step coding tasks with tool usage
- **Quality Metrics**: Ratings, validation status, and complexity scores
### Usage
```python
from datasets import load_dataset
# Load chat-skill dataset (Myanmar conversations)
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
print("Chat data:", chat_ds)
# Load agent-skill dataset (Agent workflows)
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
print("Agent data:", agent_ds)
# Load code-skill dataset (Code generation)
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
print("Code data:", code_ds)
# Access specific samples
chat_sample = chat_ds["train"][0]
print("Messages:", chat_sample["messages"])
print("Tags:", chat_sample["tags"])
```
### Use Cases
- **Myanmar Language Models**: Train LLMs that understand Burmese
- **Code Generation**: Train models for programming tasks
- **Agent Workflows**: Train coding agents with tool usage
- **Debugging**: Fix common coding errors
- **Multilingual Tasks**: Translation between English and Myanmar
### License
Apache 2.0 License
---
## မြန်မာဘာသာ
### အနှစ်ချူပါ
ဒီ dataset သည် မြန်မာစာ နှင့် ကုဒ်ရေးလုပ်တဲ့ LLM များကို လေ့ကျင့်ဖို့အတွက် dataset ၃ ခုကို ပေါင်းစပ်ထားပါ။
| Source | Dataset | Description | Samples |
|--------|---------|-------------|---------|
| chat-skill.md | `amkyawdev/myanmar-llm-data` | မြန်မာစာပါးဆက် | ~54,553 |
| agent-skill.md | `amkyawdev/mm-llm-coder-agent-dataset` | Agent workflow | ~1,000,020 |
| code-skill.md | `amkyawdev/mm-llm-coder-dataset` | ကုဒ်ထုတ်လုပ်ခြင်း | ~2,000,000 |
**ပါဝင်မှု စုစုပါး: ~3,020,347**
### သုံးပါ
```python
from datasets import load_dataset
chat_ds = load_dataset("amkyawdev/myanmar-llm-data")
agent_ds = load_dataset("amkyawdev/mm-llm-coder-agent-dataset")
code_ds = load_dataset("amkyawdev/mm-llm-coder-dataset")
```
### License
Apache 2.0 License