# mm-llm-coder-lite-v1 Model Card


## 📌 Overview

**mm-llm-coder-lite-v1** is a Lite version of the Myanmar Large Language Model, optimized for **efficiency** on Myanmar (Burmese) programming tasks. It is designed for developers in Myanmar who need a lightweight, fast model for code generation and conversational AI.

### Key Design Goals

- 🚀 **Efficient**: Optimized for low-resource environments
- 💻 **Code-focused**: Specialized in programming tasks
- 🌏 **Myanmar-first**: Built for Myanmar developers

## 📊 Model Specifications

| Specification | Value |
|--------------|-------|
| **Parameters** | ~2.7B (base), ~2.6M (trainable with LoRA) |
| **Base Model** | microsoft/phi-2 |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Training Data Type** | Myanmar code + conversation dataset |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **Max Length** | 512 tokens |
| **Training Epochs** | 3 |
| **Learning Rate** | 2e-4 |

## 🚀 Quick Start

### Installation

```bash
pip install torch transformers peft accelerate
```

### Basic Usage (Python)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model
model_name = "amkyawdev/mm-llm-coder-lite-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Set pad token
tokenizer.pad_token = tokenizer.eos_token
```

### Generate Response

```python
# Create prompt in Myanmar format
# System: "You are an AI assistant proficient in the Myanmar language."
# User: "Please write a Fibonacci function in Python."
prompt = """System: သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။
User: Python နဲ့ Fibonacci function ရေးပေးပါ။
Assistant:"""

# Generate
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)

# The decoded text contains the prompt followed by the completion
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Using Gradio Space

```python
# Visit: https://huggingface.co/spaces/amkyawdev/mm-llm-coder-lite-v1
# Or use via API
from gradio_client import Client

client = Client("amkyawdev/mm-llm-coder-lite-v1")
result = client.predict(
    "Python နဲ့ list sort လုပ်နည်း",  # user message: "How to sort a list in Python"
    fn_index=0
)
print(result)
```

## 📝 Sample Prompts (Myanmar)

### Example 1: Code Generation

The user asks: "Please write a Fibonacci function in Python."

```
User: Python နဲ့ Fibonacci function ရေးပေးပါ။
Assistant:
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```

### Example 2: Translation

A prompt involving the English word "Hello". The assistant's reply reads roughly "Hello. I will help you."

```
User: Hello ပါတ်မှားပါ။
Assistant: မင်္ဂလာပါ။ သင့်အား ကူညီပါသည်။
```

### Example 3: Data Cleaning

The user asks: "Please fix the errors in this Myanmar text."

```
User: မြန်မာစာသားအမှားမှားပြင်ပါ။
Assistant:
import re

def clean_myanmar_text(text):
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text)
    # ... (more cleaning logic)
    return text
```

## ⚠️ Limitations (Lite Version)

This is a **Lite** version with intentional trade-offs:

### Performance Limitations

| Limitation | Description |
|-----------|------------|
| **Smaller Context** | Max 512 tokens (vs 2048+ in full version) |
| **Limited Knowledge** | Trained on ~20K samples |
| **Code Complexity** | Best for simple to intermediate tasks |
| **Language Coverage** | Primarily Myanmar, limited English |

### Expected Behavior

1. **Fast Inference**: Optimized for speed over quality
2. **Simple Tasks**: Good for basic code generation
3. **Complex Tasks**: May struggle with advanced algorithms
4. **Long Conversations**: Context may degrade after ~3-4 turns

### Recommendations for Developers

- Use for: Simple scripts, code translation, learning
- Avoid: Production-grade complex systems, long context tasks
- Fine-tune: For your specific use case if needed

## 📁 Training Data

- **Dataset**: [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data)
- **Training Samples**: ~20,327
- **Test Samples**: ~17,155
- **Categories**: Code (90%), Translation, General, Greetings

## 🏷️ Tags

`myanmar` `burmese` `llm` `code-generation` `fine-tuned` `lora` `phi-2` `transformers`

## 📜 License

MIT License - See [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Microsoft for phi-2 base model
- Hugging Face community
- Myanmar developers

---
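## 🧩 Prompt Helper Sketch

The `System:` / `User:` / `Assistant:` prompt layout used in Quick Start, together with the note that context may degrade after ~3-4 turns, suggests keeping prompts short and stripping the prompt from the decoded output, since `tokenizer.decode(outputs[0], ...)` returns the prompt followed by the completion. The following is a minimal sketch under stated assumptions, not part of the released model: `build_prompt` and `extract_reply` are hypothetical helper names, and the newline-separated turn layout and the default of three retained turns are assumptions.

```python
from typing import Iterable, Tuple


def build_prompt(
    user_message: str,
    system: str,
    history: Iterable[Tuple[str, str]] = (),
    max_turns: int = 3,
) -> str:
    """Build a System/User/Assistant prompt (hypothetical helper).

    Only the last `max_turns` (user, assistant) exchanges are kept,
    an assumed way to stay within the 512-token limit.
    """
    lines = [f"System: {system}"]
    # Guard against max_turns == 0, because list[-0:] would keep everything
    recent = list(history)[-max_turns:] if max_turns > 0 else []
    for user, assistant in recent:
        lines.append(f"User: {user}")
        lines.append(f"Assistant: {assistant}")
    lines.append(f"User: {user_message}")
    lines.append("Assistant:")
    return "\n".join(lines)


def extract_reply(decoded: str, prompt: str) -> str:
    """Return only the newly generated text from a decoded output string."""
    # Drop the echoed prompt when the decoded text starts with it
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    # Cut the reply if the model begins a new "User:" turn on its own
    reply = reply.split("\nUser:", 1)[0]
    return reply.strip()
```

With these helpers, `prompt = build_prompt(message, system_text, history)` would replace the hand-written prompt string, and `extract_reply(response, prompt)` would be applied to the decoded `response`. If the tokenizer does not reproduce the prompt exactly when decoding, `extract_reply` falls back to returning the full decoded text.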

🇲🇲 Made for Myanmar Developers