---
language:
- id
- ms
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- llama-3
- maluku
- melayu-maluku-utara
- fine-tuned
- unsloth
- lora
- gguf
- indonesian
license: llama3
pipeline_tag: text-generation
---

# 🌴 Utu — North Moluccan Malay Assistant (Llama-3.1-8B)

A model fine-tuned from `meta-llama/Meta-Llama-3.1-8B-Instruct` to understand and respond in **North Moluccan Malay** (Bahasa Pasar / Ternate Malay).

Fine-tuned with **Unsloth** for maximum efficiency on a T4 GPU.

## Local Vocabulary the Model Understands

| Word | Meaning | Example |
|------|---------|---------|
| ngana / nga | you | "Ngana mo pi mana?" |
| kita | I / me | "Kita tra tau" |
| torang / tong | we / all of us | "Torang mo pigi pasar" |
| dorang / dong | they | "Dorang su pulang" |
| pigi | to go | "Kita mo pigi" |
| tra / tara | not | "Kita tra mau" |
| su / so | already | "Kita su makan" |
| mo | want / will | "Ngana mo makan?" |
| deng | with | "Kita deng ngana" |
| bolom | not yet | "Kita bolom makan" |
| kobong | garden / field | "Pigi di kobong" |
| foya | to lie | "Jang foya" |
| pe / p | possessive marker | "Ini kita p buku" |
| bkiapa | why | "Bkiapa ngana sedih?" |
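
For simple preprocessing or evaluation scripts, the glossary above can be expressed as a plain lookup table. A minimal sketch (the `GLOSSARY` dict and `gloss` helper are illustrative and not shipped with the model):

```python
# Dialect-to-English glossary taken from the table above
# (illustrative only; not part of the released model).
GLOSSARY = {
    "ngana": "you", "nga": "you",
    "kita": "I",
    "torang": "we", "tong": "we",
    "dorang": "they", "dong": "they",
    "pigi": "to go",
    "tra": "not", "tara": "not",
    "su": "already", "so": "already",
    "mo": "want/will",
    "deng": "with",
    "bolom": "not yet",
    "kobong": "garden",
    "foya": "to lie",
    "pe": "possessive marker", "p": "possessive marker",
    "bkiapa": "why",
}

def gloss(sentence):
    """Return (word, meaning) pairs for known dialect words in a sentence."""
    words = [w.strip("?!.,").lower() for w in sentence.split()]
    return [(w, GLOSSARY[w]) for w in words if w in GLOSSARY]

print(gloss("Kita su makan"))  # [('kita', 'I'), ('su', 'already')]
```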
|
|
## Usage

### Python (Transformers)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "haidar038/utu-malut"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [
    {"role": "system", "content": "Ngana adalah Utu, asisten AI dari Ternate."},
    {"role": "user", "content": "Ngana mau pigi mana sore ini?"}
]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
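
On GPUs with limited VRAM, the checkpoint can also be loaded in 4-bit. A sketch assuming `bitsandbytes` is installed; the quantization settings below are illustrative choices, not settings recommended by this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "haidar038/utu-malut"

# NF4 4-bit quantization config (common QLoRA-style settings; an assumption)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```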
|
|
### GGUF (CPU, via llama-cpp-python)
```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="haidar038/utu-malut-GGUF",
    filename="*q4_k_m*.gguf",
    n_ctx=512,
)
output = llm.create_chat_completion(messages=[
    {"role": "system", "content": "Ngana adalah Utu, asisten AI dari Ternate."},
    {"role": "user", "content": "Ngana mo pigi mana?"},
])
print(output["choices"][0]["message"]["content"])
```
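
When a runtime does not apply the chat template automatically (e.g. the plain completion API, `llm(prompt=...)`), the raw prompt can be assembled by hand. A sketch assuming the standard Llama 3.x header tokens (`build_llama3_prompt` is a hypothetical helper):

```python
def build_llama3_prompt(system, user):
    """Assemble a raw Llama 3.x chat prompt (standard header tokens assumed)."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt(
    "Ngana adalah Utu, asisten AI dari Ternate.",
    "Ngana mo pigi mana?",
)
```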
|
|
## Training Details

| Parameter | Value |
|-----------|-------|
| Base model | meta-llama/Meta-Llama-3.1-8B-Instruct |
| Fine-tuning | QLoRA 4-bit (Unsloth) |
| LoRA r / alpha | 16 / 32 |
| Dataset | ~450 training samples |
| Epochs | 3 |
| Learning rate | 2e-4 |
| Max seq length | 512 |
| Eval loss | 1.1841 |
| Perplexity | 3.27 |
| Platform | Kaggle (T4 GPU) |
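
The reported perplexity is consistent with the eval loss, since perplexity = exp(cross-entropy loss):

```python
import math

eval_loss = 1.1841                    # eval loss from the table above
print(round(math.exp(eval_loss), 2))  # 3.27
```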
|
|
## Deployment

Available as a Hugging Face Space: [haidar038/utu-malut-chat](https://huggingface.co/spaces/haidar038/utu-malut-chat)
|
|
## Limitations

- The dataset is only ~500 rows and does not yet cover all dialect variation
- Intended for research and development of regional-language NLP
- Verify outputs before production use
|
|
## Credits

Fine-tuned with [Unsloth](https://github.com/unslothai/unsloth) + [TRL](https://github.com/huggingface/trl) on Kaggle.
|
|