	# mm-llm-coder-lite-v1

	<p align="center">
	<img src="https://img.shields.io/badge/Myanmar-LLM-blue?style=for-the-badge&logo=huggingface" alt="Myanmar LLM">
	<img src="https://img.shields.io/badge/License-MIT-green?style=for-the-badge" alt="License">
	<img src="https://img.shields.io/badge/Model-phi--2-orange?style=for-the-badge" alt="Base Model">
	<img src="https://img.shields.io/badge/Fine--tuned-LoRA-red?style=for-the-badge" alt="Method">
	</p>

	## 📌 Overview

	mm-llm-coder-lite-v1 is a Large Language Model (LLM) fine-tuned for Myanmar (Burmese) language understanding, code generation, and conversational tasks. It is based on Microsoft's `phi-2` and was fine-tuned with Low-Rank Adaptation (LoRA).

	### Key Features

	- 🌍 Myanmar Language Support: Specialized in Burmese/Myanmar language processing
	- 💻 Code Generation: Supports Python, JavaScript, and other programming languages
	- 💬 Conversational AI: Can engage in natural dialogue in the Myanmar language
	- ⚡ Lightweight: A ~2.7B-parameter base model with a compact LoRA adapter (~2.6M trainable parameters)

	## 🏗️ Architecture

	| Component | Details |
	|-----------|---------|
	| Base Model | microsoft/phi-2 |
	| Fine-tuning Method | LoRA (Low-Rank Adaptation) |
	| Training Framework | Hugging Face Transformers + PEFT + TRL |
	| Language | Burmese (Myanmar) |
	| Parameters | ~2.7B total (trainable: ~2.6M) |
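
As a rough consistency check on the trainable-parameter figure, the sketch below counts the parameters a LoRA adapter adds. It assumes phi-2's hidden size of 2560 and 32 transformer layers, the LoRA rank of 16 and alpha of 32 listed under Training Details, and (an assumption, since this card does not state the target modules) that a single 2560x2560 projection is adapted in each layer. Under those assumptions the total lands close to the ~2.6M quoted above; other target-module choices would give different totals.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 2560  # phi-2 hidden dimension
NUM_LAYERS = 32     # phi-2 transformer blocks
RANK = 16           # LoRA rank from the Training Details table
ALPHA = 32          # LoRA alpha from the Training Details table

# Assumption: one square (hidden x hidden) projection per layer is adapted.
per_layer = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
print(f"scaling:   {scaling}")
```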

	## 📊 Training Details

	| Parameter | Value |
	|-----------|-------|
	| Base Model | microsoft/phi-2 |
	| Training Epochs | 3 |
	| Learning Rate | 2e-4 |
	| LoRA Rank (r) | 16 |
	| LoRA Alpha | 32 |
	| LoRA Dropout | 0.05 |
	| Max Length | 512 |
	| Batch Size | 4 |
	| Gradient Accumulation | 4 |
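
To make the batch settings concrete, the following sketch derives the effective batch size and an approximate optimizer step count from the values in the table and the train split size given under Dataset Statistics. The single-device setting is an assumption (the card does not state the hardware), and the step count is an estimate rather than the exact number a particular trainer would report.

```python
import math


def training_schedule(train_samples, batch_size, grad_accum, epochs, num_devices=1):
    """Return (effective_batch, steps_per_epoch, total_steps) for a simple setup."""
    effective_batch = batch_size * grad_accum * num_devices
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Values from this card; num_devices=1 is an assumption.
effective_batch, steps_per_epoch, total_steps = training_schedule(
    train_samples=20_327, batch_size=4, grad_accum=4, epochs=3
)
print(effective_batch, steps_per_epoch, total_steps)
```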

	## 📁 Dataset

	Trained on [amkyawdev/myanmar-llm-data](https://huggingface.co/datasets/amkyawdev/myanmar-llm-data):

	| Tag | Description | Percentage |
	|-----|-------------|------------|
	| coding | Programming conversations | 90% |
	| translation | English-Myanmar translation | 1% |
	| general | General knowledge Q&A | 1% |
	| greeting | Burmese greetings | 1% |

	### Dataset Statistics
	- Train: ~20,327 samples
	- Test: ~17,155 samples
	- Validation: ~17,071 samples

	## 🚀 Quick Start

	### Installation

	```bash
	pip install torch transformers peft accelerate datasets
	```

	### Basic Inference

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Load model and tokenizer
	model_name = "amkyawdev/mm-llm-coder-lite-v1"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	# Set pad token
	tokenizer.pad_token = tokenizer.eos_token

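	# Generate response
	# English gloss of the prompt below:
	#   System: "You are an AI assistant fluent in Burmese."
	#   User: "Write a Python function that generates the Fibonacci series."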
	input_text = """System: သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။

	User: Python နဲ့ Fibonacci စီးရီးထုတ်တဲ့ function ရေးပေးပါ။

	Assistant:"""

	inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
	inputs = {k: v.to(model.device) for k, v in inputs.items()}

	outputs = model.generate(
	    **inputs,
	    max_new_tokens=256,
	    temperature=0.7,
	    top_p=0.95,
	    do_sample=True,
	)

	response = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(response)
	```

	### Using Pipeline

	```python
	import torch
	from transformers import pipeline

	pipe = pipeline(
	    "text-generation",
	    model="amkyawdev/mm-llm-coder-lite-v1",
	    tokenizer="amkyawdev/mm-llm-coder-lite-v1",
	    device_map="auto",
	    torch_dtype=torch.float16,
	)

	prompt = """System: သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။

	User: ဟိုင်း၊ နေကောင်းလား။

	Assistant:"""

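	# The user turn above says "Hi, how are you?" in Burmese.
	# do_sample=True is required for temperature to take effect.
	result = pipe(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)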
	print(result[0]['generated_text'])
	```

	## 📝 Prompt Template

	This model uses the following prompt format:

	```
	System: <system_prompt>

	User: <user_message>

	Assistant: <assistant_response><eos>
	```

	### Example Prompt

	```
	System: သင်သည် မြန်မာစာကျွမ်းကျင်သော AI အကူအညီပေးသူဖြစ်သည်။

	User: မင်္ဂလာပါ။

	Assistant: မင်္ဂလာပါရှင်း။ သင့်အား ကူညီပါသည်။<eos>
	```
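
The template can be assembled programmatically. The helpers below are a minimal sketch, not part of the model's released code: `build_prompt`, `build_training_text`, and `extract_response` are hypothetical names, and `<eos>` is the placeholder from the template above (in practice it would be the tokenizer's `eos_token`).

```python
EOS_PLACEHOLDER = "<eos>"  # placeholder from the template; use tokenizer.eos_token in practice


def build_prompt(user_message, system_prompt):
    """Format an inference prompt that ends where the model should start writing."""
    return f"System: {system_prompt}\n\nUser: {user_message}\n\nAssistant:"


def build_training_text(user_message, assistant_response, system_prompt,
                        eos_token=EOS_PLACEHOLDER):
    """Format a full training example, including the response and the end marker."""
    prompt = build_prompt(user_message, system_prompt)
    return f"{prompt} {assistant_response}{eos_token}"


def extract_response(generated_text, prompt):
    """Strip the echoed prompt from decoded output, if it is present as a prefix."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()


prompt = build_prompt("Hello", "You are a helpful assistant.")
print(prompt)
print(extract_response(prompt + " Hi there!", prompt))
```

Stripping the prompt is useful because decoding the full output of `model.generate` returns the prompt followed by the completion.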

	## 🖥️ Deployment

	### GGUF Conversion (for LM Studio / Ollama)

	```python
	# Install required packages
	# pip install transformers peft accelerate sentencepiece

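	import torch
	from peft import AutoPeftModelForCausalLM
	from transformers import AutoTokenizer

	# Load the base model together with the LoRA adapter.
	# Assumes the repository hosts PEFT adapter weights (adapter_config.json);
	# if it already contains merged weights, load it with AutoModelForCausalLM
	# and skip the merge step.
	model_name = "amkyawdev/mm-llm-coder-lite-v1"
	model = AutoPeftModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.float16,
	    device_map="cpu",
	    low_cpu_mem_usage=True,
	)
	tokenizer = AutoTokenizer.from_pretrained(model_name)

	# Merge the LoRA weights into the base model and drop the adapter wrappers
	model = model.merge_and_unload()

	# Save merged model
	output_dir = "./mm-llm-merged"
	model.save_pretrained(output_dir)
	tokenizer.save_pretrained(output_dir)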

	# Convert to GGUF using llama.cpp
	# Follow: https://github.com/ggerganov/llama.cpp/tree/master/convert
	```

	### Ollama Deployment

	```bash
	# Create the Modelfile
	cat > Modelfile <<'EOF'
	FROM ./mm-llm-coder-lite-v1

	PARAMETER temperature 0.7
	PARAMETER top_p 0.95
	PARAMETER top_k 40

	TEMPLATE """System: {{ .System }}

	User: {{ .Prompt }}

	Assistant: {{ .Response }}<eos>"""
	EOF

	# Create model in Ollama
	ollama create mm-llm-coder -f Modelfile

	# Run
	ollama run mm-llm-coder
	```
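
Once the model is running, it can also be queried over Ollama's local HTTP API. The sketch below assumes the default endpoint `http://localhost:11434/api/generate` and the model name `mm-llm-coder` created above; `build_payload` and `ask` are hypothetical helper names, and error handling is omitted for brevity.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_payload(prompt, model="mm-llm-coder", system=None):
    """Build the JSON body for a single, non-streaming generation request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if system:
        payload["system"] = system
    return payload


def ask(prompt, **kwargs):
    """Send the prompt to a running Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Usage (requires `ollama serve` to be running locally):
#   print(ask("မင်္ဂလာပါ။"))
```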

	## 📈 Evaluation

	### Myanmar Code Evaluation

	```python
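	# Example evaluation for Myanmar code generation
	import ast

	# Prompts (in Burmese): sort a list in Python, write a JavaScript function,
	# and convert Myanmar Unicode text to Zawgyi.
	myanmar_prompts = [
	    "Python နဲ့ list ကို sort လုပ်နည်းရေးပါ။",
	    "JavaScript နဲ့ function ရေးပေးပါ။",
	    "မြန်မာ Unicode ကို Zawgyi ပြောင်းတဲ့ code ရေးပါ။",
	]

	def generate(prompt):
	    # Illustrative helper: reuses `pipe` from the "Using Pipeline" section
	    out = pipe(prompt, max_new_tokens=256, return_full_text=False)
	    return out[0]["generated_text"]

	def check_syntax(code):
	    # Illustrative check: True if the output parses as Python source.
	    # Output in other languages (e.g. JavaScript) needs its own checker.
	    try:
	        ast.parse(code)
	        return True
	    except (SyntaxError, ValueError):
	        return False

	# Run generation and evaluate
	def evaluate_model(prompts):
	    results = []
	    for prompt in prompts:
	        # Generate code
	        output = generate(prompt)
	        results.append({
	            "prompt": prompt,
	            "generated": output,
	            "success": check_syntax(output),
	        })
	    return results

	# Calculate pass rate
	results = evaluate_model(myanmar_prompts)
	success_rate = sum(1 for r in results if r["success"]) / len(results)
	print(f"Success Rate: {success_rate * 100:.2f}%")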
	```

	### Benchmark Adaptation

	For Myanmar-specific evaluation, consider:
	1. Translating MBPP/MathEval prompts to Myanmar
	2. Creating Myanmar coding benchmarks
	3. Using BLEU/ROUGE for translation quality
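
Burmese is written without spaces between words, so word-level BLEU or ROUGE needs a word or syllable segmenter first. As one segmentation-free alternative, the sketch below scores a translation with a simplified character n-gram F1. This is a stand-in in the spirit of chrF, not an implementation of BLEU, ROUGE, or the official chrF metric, and `char_ngram_f1` is a hypothetical helper name.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_ngram_f1(hypothesis, reference, max_n=3):
    """Average F1 over character n-gram orders 1..max_n (simplified, chrF-like)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matching n-gram count
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


print(char_ngram_f1("မင်္ဂလာပါ", "မင်္ဂလာပါ"))
print(char_ngram_f1("abcd", "abce"))
```

Scores range from 0 (no shared character n-grams) to 1 (identical text once whitespace is ignored).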

	## 📋 Requirements

	```
	torch>=2.0.0
	transformers>=4.35.0
	peft>=0.7.0
	trl>=0.7.0
	accelerate>=0.25.0
	datasets>=2.14.0
	```

	## 🔧 Configuration

	```python
	from transformers import TrainingArguments

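	training_args = TrainingArguments(
	    output_dir="./mm-llm-output",
	    num_train_epochs=3,
	    per_device_train_batch_size=4,
	    gradient_accumulation_steps=4,  # matches the Training Details table
	    learning_rate=2e-4,
	    fp16=True,
	    save_steps=500,
	    eval_steps=500,  # only used when step-based evaluation is enabled
	    save_total_limit=2,
	)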
	```

	## 📜 License

	This project is licensed under the MIT License.

	See [LICENSE](LICENSE) for details.

	## 👤 Author

	Amkyaw Dev
	- GitHub: [@amkyawdev](https://github.com/amkyawdev)
	- Hugging Face: [amkyawdev](https://huggingface.co/amkyawdev)

	## 🙏 Acknowledgments

	- Microsoft for the phi-2 model
	- Hugging Face for Transformers and PEFT
	- The Myanmar NLP community

	---

	<p align="center">
	Made with ❤️ for the Myanmar AI Community
	</p>