---
base_model: unsloth/llama-3.2-3b-instruct
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
license: apache-2.0
language:
- en
- sw
datasets:
- saillab/alpaca_swahili_taco
metrics:
- bleu
- accuracy
- cer
- rouge
pipeline_tag: text-generation
---
# SALAMA LLM: Swahili Instruction-Tuned Text Generation Model
**Developer:** AI4NNOV
**Authors:** AI4NNOV
**Version:** v1.0
**License:** Apache 2.0
**Model Type:** Instruction-Tuned Large Language Model
**Base Model:** `Jacaranda/UlizaLlama`
---
## Overview
**SALAMA LLM** is the **language understanding and generation engine** of the **SALAMA Framework**, a modular Speech-to-Speech (STS) AI pipeline built for African languages.
The model is fine-tuned on Swahili instruction datasets to enable natural, culturally relevant responses in text generation, summarization, question answering, and translation.
This model represents a major step in bridging the linguistic digital divide by providing **high-quality Swahili AI text generation** capabilities within an open, scalable framework.
---
## Model Architecture
SALAMA LLM is based on **Jacaranda/UlizaLlama**, fine-tuned using **Parameter-Efficient Fine-Tuning (PEFT)** via **LoRA/QLoRA**.
The architecture supports mixed Swahili-English text inputs while focusing on fluent Swahili text generation for both casual and formal domains.
| Parameter | Value |
|------------|--------|
| **Base Model** | `Jacaranda/UlizaLlama` |
| **Fine-Tuning** | QLoRA / LoRA (PEFT) |
| **Precision** | 4-bit quantization |
| **Optimizer** | AdamW |
| **Learning Rate** | 2e-5 |
| **Epochs** | 3β5 |
| **Frameworks** | Transformers, TRL, PEFT, Unsloth |
| **Languages** | Swahili (sw), English (en) |
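
For reference, the sketch below shows how a QLoRA setup with these hyperparameters could be assembled using Transformers, PEFT, and bitsandbytes. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions; they are not specified in this card.
```python
# Minimal QLoRA fine-tuning sketch using the hyperparameters listed above.
# LoRA rank/alpha/dropout and target_modules are assumptions, not values from this card.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model = "Jacaranda/UlizaLlama"

# 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)

# Attach LoRA adapters (attention projections are a common, but assumed, choice)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

train_data = load_dataset("saillab/alpaca_swahili_taco", split="train")
# ... train with TRL's SFTTrainer (or Unsloth) at lr=2e-5 for 3-5 epochs, per the table above.
```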
---
## Datasets
| Dataset | Description | Purpose |
|----------|--------------|----------|
| `saillab/alpaca_swahili_taco` | Swahili Alpaca-style instruction-response dataset | Instruction tuning |
| `Jacaranda/kiswallama-pretrained` | 321M Swahili tokens, custom tokenizer (20K vocab) | Base Swahili adaptation |
| Custom Swahili QA corpus | Curated Q&A and summarization samples | Conversational fine-tuning |
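
As an illustration, the snippet below loads the instruction dataset and formats one record into a training prompt. The column names (`instruction`, `input`, `output`) follow the usual Alpaca convention and the Swahili section headers are placeholders; check the dataset card for the actual schema.
```python
# Sketch: turn an Alpaca-style record into a single training prompt.
# Column names and the Swahili prompt headers are assumptions.
from datasets import load_dataset

ds = load_dataset("saillab/alpaca_swahili_taco", split="train")

def to_prompt(example):
    prompt = f"### Maelekezo:\n{example['instruction']}\n"   # instruction
    if example.get("input"):
        prompt += f"### Muktadha:\n{example['input']}\n"     # optional context
    prompt += f"### Jibu:\n{example['output']}"              # reference answer
    return {"text": prompt}

ds = ds.map(to_prompt)
print(ds[0]["text"])
```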
---
## Model Capabilities
- Text generation in **Swahili and English**
- Instruction-following, summarization, and dialogue
- Question answering and translation (EN ↔ SW)
- Sentiment and named-entity recognition
- Contextually and culturally aligned text generation
---
## Evaluation Metrics
| Metric | Score | Description |
|---------|-------|-------------|
| **BLEU** | 0.49 | Measures fluency and translation accuracy |
| **ROUGE-L** | 0.61 | Summarization recall and overlap |
| **Accuracy (QA)** | 95.5% | Accuracy on Swahili QA tasks |
| **CER** | 0.28 | Character Error Rate |
| **F1 (avg)** | 0.90+ | Weighted average across tasks |
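
Scores like these can be computed with the Hugging Face `evaluate` library; the snippet below shows the general recipe with placeholder predictions and references, not the actual test set.
```python
# Sketch: scoring generations with BLEU, ROUGE-L, and CER via `evaluate`.
# The prediction/reference pairs here are placeholders, not the real test set.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
cer = evaluate.load("cer")

predictions = ["Elimu ni msingi wa maendeleo ya jamii."]
references = ["Elimu ndiyo msingi wa maendeleo ya jamii."]

print("BLEU:   ", bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"])
print("ROUGE-L:", rouge.compute(predictions=predictions, references=references)["rougeL"])
print("CER:    ", cer.compute(predictions=predictions, references=references))
```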
---
## Usage (Python Example)
Below is a quick example to load and use **SALAMA LLM** for Swahili text generation:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load model and tokenizer
model_name = "EYEDOL/salama-llm" # Change to your Hugging Face repo name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Swahili text prompt
prompt = "Andika sentensi fupi kuhusu umuhimu wa elimu."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=120,
temperature=0.7,
top_p=0.9,
repetition_penalty=1.05
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Example Output:**
> "Elimu ni msingi wa maendeleo, humwezesha mtu kuelewa dunia na kuboresha maisha yake na jamii kwa ujumla."
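
The same generation can also be run through the high-level `pipeline` API; a short sketch (the repo name follows the example above and the Swahili prompt is arbitrary):
```python
# Equivalent generation via the high-level pipeline API.
from transformers import pipeline

generator = pipeline("text-generation", model="EYEDOL/salama-llm", device_map="auto")
result = generator(
    "Eleza kwa ufupi faida za kilimo cha kisasa.",  # "Briefly explain the benefits of modern farming."
    max_new_tokens=120,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(result[0]["generated_text"])
```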
---
## Key Features
- Optimized for African low-resource NLP contexts
- Instruction-following in Swahili and English
- Lightweight and efficient (QLoRA fine-tuned; runs on a single 24 GB GPU)
- Culturally aligned text generation
- Open-source and extendable to other African languages
---
## Limitations
- May underperform with heavy code-switching (Swahili-English mix)
- Not yet optimized for rare dialects or poetic forms
- Limited exposure to specialized (medical/legal) corpora
- Relies on accurate STT transcription in end-to-end speech-to-speech use
---
## Related Models
| Model | Description |
|--------|-------------|
| [`EYEDOL/salama-stt`](https://huggingface.co/EYEDOL/salama-stt) | Swahili Speech-to-Text model (Whisper-small fine-tuned) |
| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/salama-tts) | Swahili Text-to-Speech model (VITS architecture) |
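
Together these form the SALAMA speech-to-speech loop (STT → LLM → TTS). The sketch below wires them with standard Transformers pipelines; the pipeline tasks are inferred from the architectures listed above, so treat this as an assumption-laden outline rather than the official framework code, and check each model card for exact loading code and audio handling.
```python
# Assumption-laden sketch of the SALAMA speech-to-speech loop: STT -> LLM -> TTS.
# Pipeline tasks are inferred from the architectures above (Whisper STT, VITS TTS);
# see each model card for the exact usage and sampling rates.
from transformers import pipeline

stt = pipeline("automatic-speech-recognition", model="EYEDOL/salama-stt")
llm = pipeline("text-generation", model="EYEDOL/salama-llm", device_map="auto")
tts = pipeline("text-to-speech", model="EYEDOL/salama-tts")

def speech_to_speech(audio_path: str):
    swahili_text = stt(audio_path)["text"]                              # 1. transcribe speech
    reply = llm(swahili_text, max_new_tokens=120)[0]["generated_text"]  # 2. generate a reply
    return tts(reply)                                                   # 3. synthesize speech
```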
---
## Citation
If you use **SALAMA LLM**, please cite:
```bibtex
@misc{salama_llm_2025,
title={SALAMA LLM: Swahili Instruction-Tuned Text Generation Model},
author={AI4NNOV},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/EYEDOL/salama-llm}}
}
```
---
**"Elimu ni msingi wa maendeleo" (Knowledge is the foundation of progress.)**