# Luma-base: A High-Performance Foundation Model for Haitian Creole (Kreyòl Ayisyen)
**Luma-base** is a state-of-the-art 4-billion-parameter language model specialized in Haitian Creole. Built on the **Qwen3-4B** architecture, it has undergone extensive domain-specific pre-training to capture the nuances, grammar, and cultural context of Haitian Creole.
## 🚀 Project Overview
The Luma project aims to bridge the gap in high-quality AI tools for Haitian Creole. Luma-base is the core engine designed to serve as a backbone for STT (Speech-to-Text) correction, translation, and text generation.
- **Developer:** Frostie08
- **Model Type:** Causal Language Model
- **Base Model:** Qwen3-4B
- **Language:** Haitian Creole (ht-HT)
- **License:** Apache-2.0
## 📊 Technical Specifications & Training
Luma-base was trained with the **Unsloth** library, which reduces memory use and speeds up training.
### Training Details:
- **Dataset:** `kani-pretrain` (a curated corpus of Haitian Creole literature, news, and formal texts).
- **Steps:** 3,591 steps (3 full epochs).
- **Batch Size:** 16 (Total).
- **Optimizer:** AdamW 8-bit.
- **Learning Rate:** 2e-4 with Cosine Scheduler.
- **Precision:** Mixed Precision (16-bit).
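
The figures above imply a few derived quantities. The following is a minimal sketch, not the actual training script. It assumes that "Batch Size: 16 (Total)" means the effective number of sequences per optimizer step, that the three epochs were of equal length, and that the cosine schedule decays from the peak learning rate to zero with no warmup. None of these details is stated in this card, and `cosine_lr` is a hypothetical helper written only for illustration.

```python
import math

TOTAL_STEPS = 3591   # from the training details above
EPOCHS = 3           # from the training details above
BATCH_SIZE = 16      # assumed: effective sequences per optimizer step
PEAK_LR = 2e-4       # from the training details above

# Derived quantities (approximate if the last batch of an epoch was partial).
steps_per_epoch = TOTAL_STEPS // EPOCHS
sequences_per_epoch = steps_per_epoch * BATCH_SIZE
sequences_seen = TOTAL_STEPS * BATCH_SIZE


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS, peak_lr: float = PEAK_LR) -> float:
    """Cosine decay from peak_lr to 0 (assumes no warmup, which is a simplification)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print("steps per epoch:", steps_per_epoch)
print("sequences per epoch:", sequences_per_epoch)
print("sequences seen in total:", sequences_seen)
print("learning rate at start, midpoint, end:",
      cosine_lr(0), cosine_lr(TOTAL_STEPS // 2), cosine_lr(TOTAL_STEPS))
```

Under these assumptions, each epoch covers 1,197 steps, or roughly 19,152 training sequences.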
### Performance:
- **Final Validation Loss:** **1.9252** 🎯
- **Final Training Loss:** 1.4520
- **Perplexity:** ~6.86 (the exponential of the final validation loss, exp(1.9252); lower is better).
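
Loss and perplexity are related by a simple exponential. As a small sketch, assuming the reported losses are mean per-token cross-entropy values in nats (the usual convention for causal language model training, though not stated explicitly here), the perplexity can be recomputed as follows. The `perplexity` helper is illustrative and not part of any library.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity is the exponential of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)


val_ppl = perplexity(1.9252)    # final validation loss reported above
train_ppl = perplexity(1.4520)  # final training loss reported above

print(f"validation perplexity: {val_ppl:.2f}")
print(f"training perplexity:   {train_ppl:.2f}")
```

The gap between the two values reflects the gap between training and validation loss.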
## 🛠️ Implementation & Usage
### 1. For Direct Inference (Text Completion)
```python
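import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Frostie08/Luma-base"

# Load the tokenizer and the model in half precision; device_map="auto"
# places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example: historical/biblical context completion
# (the prompt means "In the beginning, God created...")
text = "Nan konmansman, Bondye te kreye..."
# Move the inputs to the model's device instead of hard-coding "cuda".
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # do_sample=True makes sure the temperature setting is actually applied.
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.6)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))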