# mm-llm-coder-lite-v1

<p align="center">
  <img src="https://img.shields.io/badge/Myanmar-LLM-blue?style=for-the-badge&logo=huggingface" alt="Myanmar LLM">
  <img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License">
  <img src="https://img.shields.io/badge/Model-phi--2-orange?style=for-the-badge" alt="Base Model">
  <img src="https://img.shields.io/badge/Fine--tuned-LoRA-red?style=for-the-badge" alt="Method">
</p>

## 📌 Overview

**mm-llm-coder-lite-v1** is a Large Language Model (LLM) fine-tuned for Myanmar (Burmese) language understanding, code generation, and conversational tasks. It is based on Microsoft's `phi-2` and fine-tuned with the Low-Rank Adaptation (LoRA) technique.

### Key Features

- 🌍 **Myanmar Language Support**: Specialized in Burmese/Myanmar language processing
- 💻 **Code Generation**: Supports Python, JavaScript, and other programming languages
- 💬 **Conversational AI**: Can engage in natural dialogue in the Myanmar language
- ⚡ **Lightweight**: Fine-tuned with LoRA, so only a small set of adapter weights (~2.6M parameters) is trained on top of phi-2

## 🏗️ Architecture

| Component | Details |
|----------|---------|
| **Base Model** | microsoft/phi-2 |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Training Framework** | Hugging Face Transformers + PEFT + TRL |
| **Language** | Burmese (Myanmar) |
| **Parameters** | ~2.7B total (trainable: ~2.6M) |

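As a rough check on the trainable-parameter figure, the sketch below counts the parameters a LoRA adapter adds: a rank-`r` adapter on a `d_in × d_out` weight contributes `r × (d_in + d_out)` parameters. The hidden size (2560) and layer count (32) are phi-2's dimensions. This card does not state which projection layers the adapter targets, so adapting one square projection per layer is an assumption made only for illustration, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# phi-2 dimensions; the choice of target modules below is an assumption.
HIDDEN_SIZE = 2560
NUM_LAYERS = 32
RANK = 16  # LoRA rank used for this model

# One adapted hidden-to-hidden projection in a single transformer layer.
per_layer = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)

# Assumes exactly one adapted projection per layer (illustrative only).
total = per_layer * NUM_LAYERS

print(f"per adapted layer: {per_layer:,}")  # 81,920
print(f"total trainable:   {total:,}")      # 2,621,440
```

Under that assumption the total lands near the ~2.6M quoted in the table; adapting more projections per layer would scale the count proportionally.
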
## 📊 Training Details

| Parameter | Value |
|----------|-------|
| Base Model | microsoft/phi-2 |
| Training Epochs | 3 |
| Learning Rate | 2e-4 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Max Length | 512 |
| Batch Size | 4 |
| Gradient Accumulation | 4 |

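The batch size and gradient accumulation values combine into an effective batch size, which in turn determines roughly how many optimizer steps the run takes. The sketch below works this out, assuming a single device and the ~20,327-sample train split listed under Dataset Statistics. The exact step count a trainer reports can differ slightly depending on how it rounds the final partial batch.

```python
import math

# Values taken from the training table.
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
EPOCHS = 3
LORA_R = 16
LORA_ALPHA = 32

# Assumptions: approximate train split size and a single training device.
TRAIN_SAMPLES = 20_327
NUM_DEVICES = 1

# Samples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

# Approximate optimizer steps, rounding the last partial batch up.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

# Standard LoRA scales the adapter update by alpha / r.
lora_scaling = LORA_ALPHA / LORA_R

print(f"effective batch size: {effective_batch}")  # 16
print(f"steps per epoch:      {steps_per_epoch}")  # 1271
print(f"total steps:          {total_steps}")      # 3813
print(f"LoRA scaling factor:  {lora_scaling}")     # 2.0
```
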
## 📁 Dataset

Trained on [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data):

| Tag | Description | Percentage |
|-----|-------------|------------|
| coding | Programming conversations | 90% |
| translation | English-Myanmar translation | 1% |
| general | General knowledge Q&A | 1% |
| greeting | Burmese greetings | 1% |

### Dataset Statistics
- **Train**: ~20,327 samples
- **Test**: ~17,155 samples
- **Validation**: ~17,071 samples

## 🚀 Quick Start

### Installation

```bash
pip install torch transformers peft accelerate datasets
```

### Basic Inference

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "amkyawdev/mm-llm-coder-lite-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Set pad token
tokenizer.pad_token = tokenizer.eos_token

# Generate response
input_text = """System: သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။

User: Python နဲ့ Fibonacci စီးရီးထုတ်တဲ့ function ရေးပေးပါ။

Assistant:"""

inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Using Pipeline

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="amkyawdev/mm-llm-coder-lite-v1",
    tokenizer="amkyawdev/mm-llm-coder-lite-v1",
    device_map="auto",
    torch_dtype=torch.float16
)

prompt = """System: သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။

User: ဟိုင်း၊ နေကောင်းလား။

Assistant:"""

# do_sample=True is needed for temperature to take effect
result = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(result[0]['generated_text'])
```

## 📝 Prompt Template

This model uses the following prompt format:

```
System: <system_prompt>

User: <user_message>

Assistant: <assistant_response><eos>
```

### Example Prompt

```
System: သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။

User: မင်္ဂလာပါ။

Assistant: မင်္ဂလာပါရှင်း။ သင့်အား ကူညီပါသည်။<eos>
```

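The template can be assembled and unpacked with two small helpers, sketched below. `build_prompt` and `extract_reply` are hypothetical names, not part of any library; the default system prompt is the one used in the examples in this README. `extract_reply` assumes the generated text begins with the prompt (as `tokenizer.decode` on the full output does) and cuts the reply at the next `User:` turn if the model keeps generating.

```python
DEFAULT_SYSTEM = "သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။"


def build_prompt(user_message: str, system_prompt: str = DEFAULT_SYSTEM) -> str:
    """Format a single-turn prompt that ends right before the assistant's reply."""
    return f"System: {system_prompt}\n\nUser: {user_message}\n\nAssistant:"


def extract_reply(generated_text: str, prompt: str) -> str:
    """Strip the echoed prompt and any follow-on 'User:' turn from the output."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # Keep only the first assistant turn.
    reply = reply.split("\nUser:", 1)[0]
    return reply.strip()


prompt = build_prompt("မင်္ဂလာပါ။")
print(prompt)
```

The returned `prompt` can be passed to the tokenizer or pipeline shown in Quick Start, and the decoded output can then be passed to `extract_reply` together with the same `prompt`.
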
## 🖥️ Deployment

### GGUF Conversion (for LM Studio / Ollama)

```python
# Install required packages
# pip install transformers peft accelerate sentencepiece

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model with the LoRA adapter applied.
# Assumes the repository hosts PEFT adapter weights; if it already contains
# merged weights, load it with AutoModelForCausalLM and skip the merge step.
model_name = "amkyawdev/mm-llm-coder-lite-v1"
model = AutoPeftModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="cpu",
    low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Merge the LoRA weights into the base model and drop the adapter wrappers
merged_model = model.merge_and_unload()

# Save merged model
output_dir = "./mm-llm-merged"
merged_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Convert the saved directory to GGUF using llama.cpp
# See the conversion script (convert_hf_to_gguf.py) in https://github.com/ggerganov/llama.cpp
```

### Ollama Deployment

```bash
# Create Modelfile
cat > Modelfile <<'EOF'
FROM ./mm-llm-coder-lite-v1

PARAMETER temperature 0.7
PARAMETER top_p 0.95
PARAMETER top_k 40

TEMPLATE """System: {{ .System }}

User: {{ .Prompt }}

A: {{ .Response }}<eos>"""
EOF

# Create model in Ollama
ollama create mm-llm-coder -f Modelfile

# Run
ollama run mm-llm-coder
```

## 📈 Evaluation

### Myanmar Code Evaluation

```python
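# Example evaluation for Myanmar code generation
import ast

myanmar_prompts = [
    "Python နဲ့ list ကို sort လုပ်နည်းရေးပါ။",
    "JavaScript နဲ့ function ရေးပေးပါ။",
    "မြန်မာ Unicode ကို Zawgyi ပြောင်းတဲ့ code ရေးပါ။",
]

def generate(prompt):
    # Illustrative helper: reuses the `pipe` object from "Using Pipeline"
    # and returns only the newly generated text
    out = pipe(prompt, max_new_tokens=256, return_full_text=False)
    return out[0]["generated_text"]

def check_syntax(code):
    # Python-only check: True if the text parses as Python source.
    # Output in other languages, or wrapped in prose, counts as a failure.
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False

# Run generation and evaluate
def evaluate_model(prompts):
    results = []
    for prompt in prompts:
        # Generate code
        output = generate(prompt)
        results.append({
            "prompt": prompt,
            "generated": output,
            "success": check_syntax(output)
        })
    return results

# Calculate pass rate
results = evaluate_model(myanmar_prompts)
success_rate = sum(1 for r in results if r["success"]) / len(results)
print(f"Success Rate: {success_rate * 100:.2f}%")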
```

### Benchmark Adaptation

For Myanmar-specific evaluation, consider:
1. Translating MBPP/MathEval prompts to Myanmar
2. Creating Myanmar coding benchmarks
3. Using BLEU/ROUGE for translation quality

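Burmese is normally written without spaces between words, so word-level BLEU or ROUGE needs a word segmenter first. As a stand-in that avoids segmentation, the sketch below scores a translation by character n-gram overlap. This is a simplified F1 in the spirit of chrF, not BLEU or ROUGE themselves, and `char_ngrams` and `char_ngram_f1` are hypothetical helper names.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces.

    Operates on Unicode code points, so Burmese combining marks are
    treated as separate characters.
    """
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_ngram_f1(hypothesis: str, reference: str, n: int = 2) -> float:
    """F1 over clipped character n-gram matches between two strings."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = char_ngram_f1("မင်္ဂလာပါ", "မင်္ဂလာပါ")
print(f"identical strings: {score:.2f}")  # 1.00
```

For reported results, an established implementation with proper tokenization is preferable; this sketch is only meant to show the shape of the metric.
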
## 📋 Requirements

```
torch>=2.0.0
transformers>=4.35.0
peft>=0.7.0
trl>=0.7.0
accelerate>=0.25.0
datasets>=2.14.0
```

## 🔧 Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./mm-llm-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # matches the Training Details table
    learning_rate=2e-4,
    fp16=True,
    save_steps=500,
    eval_steps=500,  # only used when step-based evaluation is enabled
    save_total_limit=2,
)
```

## 📜 License

This project is licensed under the **MIT License**.

See [LICENSE](LICENSE) for details.

## 👤 Author

**Amkyaw Dev**
- GitHub: [@amkyawdev](https://github.com/amkyawdev)
- Hugging Face: [amkyawdev](https://huggingface.co/amkyawdev)

## 🙏 Acknowledgments

- Microsoft for the phi-2 model
- Hugging Face for Transformers and PEFT
- The Myanmar NLP community

---

<p align="center">
  Made with ❤️ for the Myanmar AI Community
</p>