| # Luma-base: A High-Performance Foundation Model for Haitian Creole (Kreyòl Ayisyen) |
|
|
| **Luma-base** is a 4-billion-parameter language model specialized for Haitian Creole. Built on the **Qwen3-4B** architecture, it has undergone continued, domain-specific pre-training to capture the grammar, nuances, and cultural context of Haitian Creole. |
|
|
| ## 🚀 Project Overview |
| The Luma project aims to close the gap in high-quality AI tools for Haitian Creole. Luma-base is the core model, intended as a backbone for speech-to-text (STT) correction, translation, and text generation. |
|
|
| - **Developer:** Frostie08 |
| - **Model Type:** Causal Language Model |
| - **Base Model:** Qwen3-4B |
| - **Language:** Haitian Creole (ht-HT) |
| - **License:** Apache-2.0 |
|
|
| ## 📊 Technical Specifications & Training |
| Luma-base was trained with the **Unsloth** library, which is designed for faster, memory-efficient training. |
|
|
| ### Training Details: |
| - **Dataset:** `kani-pretrain` (A curated, high-quality corpus of Haitian Creole literature, news, and formal texts). |
| - **Steps:** 3,591 steps (3 full epochs). |
| - **Batch Size:** 16 (Total). |
| - **Optimizer:** AdamW 8-bit. |
| - **Learning Rate:** 2e-4 with Cosine Scheduler. |
| - **Precision:** Mixed Precision (16-bit). |
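
To make the numbers above concrete, here is a minimal sketch of the arithmetic they imply, plus the shape of a cosine learning-rate schedule. It assumes every step consumes one full batch of 16 sequences, that there is no warmup, and that the rate decays to zero at the final step; the card states none of these, and `cosine_lr` is a hypothetical helper, not the training code.

```python
import math

# Values taken from the training details above
TOTAL_STEPS = 3591
EPOCHS = 3
BATCH_SIZE = 16
PEAK_LR = 2e-4

# Steps in one epoch, and the approximate number of sequences seen per epoch
# (assumes every batch is full)
steps_per_epoch = TOTAL_STEPS // EPOCHS
sequences_per_epoch = steps_per_epoch * BATCH_SIZE


def cosine_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR):
    """Cosine decay from peak_lr to 0 (assumption: no warmup, floor of 0)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(f"steps per epoch: {steps_per_epoch}")
print(f"sequences per epoch (approx.): {sequences_per_epoch}")
for step in (0, steps_per_epoch, 2 * steps_per_epoch, TOTAL_STEPS):
    print(f"step {step:>4}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions the run covers 1,197 steps and roughly 19,152 sequences per epoch, and the learning rate falls from 2e-4 at the start to 1.5e-4 after the first epoch and 5e-5 after the second.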
|
|
| ### Performance: |
| - **Final Validation Loss:** **1.9252** 🎯 |
| - **Final Training Loss:** 1.4520 |
| - **Perplexity:** ~6.86 on the validation set (the exponential of the validation loss; lower values mean the model assigns higher probability to the correct next token). |
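
The perplexity figure can be checked against the reported losses. The sketch below assumes both losses are mean per-token cross-entropy in nats, which is the usual convention but is not stated in the card.

```python
import math

# Losses reported in the performance section above
validation_loss = 1.9252
training_loss = 1.4520

# Perplexity is the exponential of the mean per-token cross-entropy
# (assumption: the losses are measured in nats)
validation_perplexity = math.exp(validation_loss)
training_perplexity = math.exp(training_loss)

print(f"validation perplexity: {validation_perplexity:.2f}")  # about 6.86
print(f"training perplexity:   {training_perplexity:.2f}")    # about 4.27
```

The gap between the two values reflects the difference between training and validation loss, so it is worth tracking both when comparing checkpoints.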
|
|
| ## 🛠️ Implementation & Usage |
|
|
| ### 1. For Direct Inference (Text Completion) |
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| import torch |
| |
| model_name = "Frostie08/Luma-base" |
| |
| # Load the tokenizer and the model weights in half precision. |
| # device_map="auto" requires the `accelerate` package. |
| tokenizer = AutoTokenizer.from_pretrained(model_name) |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_name, |
|     torch_dtype=torch.float16, |
|     device_map="auto", |
| ) |
| |
| # Example: Historical/Biblical context completion |
| text = "Nan konmansman, Bondye te kreye..." |
| # Move the inputs to whichever device the model was placed on |
| inputs = tokenizer(text, return_tensors="pt").to(model.device) |
| |
| with torch.no_grad(): |
|     # do_sample=True is needed for temperature to take effect |
|     outputs = model.generate( |
|         **inputs, max_new_tokens=100, do_sample=True, temperature=0.6 |
|     ) |
| |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| ``` |
| |