---
base_model: unsloth/llama-3.2-3b-instruct
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
license: apache-2.0
language:
- en
- sw
datasets:
- saillab/alpaca_swahili_taco
metrics:
- bleu
- accuracy
- cer
- rouge
pipeline_tag: text-generation
---

# 🧠 SALAMA LLM β€” Swahili Instruction-Tuned Text Generation Model

**πŸ‘¨β€πŸ’» Developer:** AI4NNOV  
**✍️ Authors:** AI4NNOV  
**πŸ“¦ Version:** v1.0  
**πŸ“œ License:** Apache 2.0  
**πŸ› οΈ Model Type:** Instruction-Tuned Large Language Model  
**🧩 Base Model:** `Jacaranda/UlizaLlama`

---

## 🌍 Overview

**SALAMA LLM** is the **language understanding and generation engine** of the **SALAMA Framework** β€” a modular Speech-to-Speech (STS) AI pipeline built for African languages.  
The model is fine-tuned on Swahili instruction datasets to enable natural, culturally relevant responses in text generation, summarization, question answering, and translation.

This model represents a major step in bridging the linguistic digital divide by providing **high-quality Swahili AI text generation** capabilities within an open, scalable framework.

---

## 🧱️ Model Architecture

SALAMA LLM is based on **Jacaranda/UlizaLlama**, fine-tuned using **Parameter-Efficient Fine-Tuning (PEFT)** via **LoRA/QLoRA**.  
The architecture supports mixed Swahili-English text inputs while focusing on fluent Swahili text generation for both casual and formal domains.

| Parameter | Value |
|------------|--------|
| **Base Model** | `Jacaranda/UlizaLlama` |
| **Fine-Tuning** | QLoRA / LoRA (PEFT) |
| **Precision** | 4-bit quantization |
| **Optimizer** | AdamW |
| **Learning Rate** | 2e-5 |
| **Epochs** | 3–5 |
| **Frameworks** | Transformers, TRL, PEFT, Unsloth |
| **Languages** | Swahili (sw), English (en) |
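
The training script itself is not published in this card. As a rough guide, the following is a minimal QLoRA sketch consistent with the table above, using Transformers, PEFT, and TRL (Unsloth omitted for brevity). The LoRA rank, alpha, target modules, batch size, and the preformatted `"text"` column are illustrative assumptions, not documented values.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

base = "Jacaranda/UlizaLlama"  # base model named in this card

# 4-bit NF4 quantization (QLoRA), matching the "Precision" row above
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA adapter; r/alpha/target_modules are assumptions, not published values
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Instruction data; assumed to be flattened into a single "text" column,
# as sketched in the Datasets section below
train_ds = load_dataset("saillab/alpaca_swahili_taco", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_ds,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="salama-llm-qlora",
        learning_rate=2e-5,             # from the table above
        num_train_epochs=3,             # table gives 3-5
        per_device_train_batch_size=4,  # assumption
        dataset_text_field="text",      # assumes the preformatted column
    ),
)
trainer.train()
```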

---

## πŸ“š Datasets

| Dataset | Description | Purpose |
|----------|--------------|----------|
| `saillab/alpaca_swahili_taco` | Swahili Alpaca-style instruction-response dataset | Instruction tuning |
| `Jacaranda/kiswallama-pretrained` | 321M Swahili tokens, custom tokenizer (20K vocab) | Base Swahili adaptation |
| Custom Swahili QA corpus | Curated Q&A and summarization samples | Conversational fine-tuning |
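
Alpaca-style datasets usually carry `instruction`/`input`/`output` columns; the exact schema of `saillab/alpaca_swahili_taco` should be verified before use. Below is a minimal sketch that flattens each record into a single training string. The Swahili section headers are an assumed template, not necessarily the one used to train this model.

```python
from datasets import load_dataset

ds = load_dataset("saillab/alpaca_swahili_taco", split="train")

def to_text(example):
    # Column names follow the common Alpaca convention; verify them
    # against the actual dataset schema before training.
    instruction = example["instruction"]
    context = example["input"] if "input" in example else ""
    response = example["output"]
    if context:
        body = f"### Maelekezo:\n{instruction}\n\n### Muktadha:\n{context}\n\n"
    else:
        body = f"### Maelekezo:\n{instruction}\n\n"
    return {"text": body + f"### Jibu:\n{response}"}

ds = ds.map(to_text)
print(ds[0]["text"][:200])
```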

---

## 🧠 Model Capabilities

✅ Text generation in **Swahili and English**  
✅ Instruction-following, summarization, and dialogue  
✅ Question answering and translation (EN ↔ SW)  
✅ Sentiment analysis and named-entity recognition  
✅ Contextually and culturally aligned text generation  

---

## πŸ“Š Evaluation Metrics

| Metric | Score | Description |
|---------|-------|-------------|
| **BLEU** | 0.49 | N-gram overlap with reference translations |
| **ROUGE-L** | 0.61 | Longest-common-subsequence overlap for summarization |
| **Accuracy (QA)** | 95.5% | Accuracy on Swahili QA tasks |
| **CER** | 0.28 | Character Error Rate against reference text (lower is better) |
| **F1 (avg)** | 0.90+ | Weighted average F1 across evaluated tasks |
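
The table values come from the authors' evaluation; the snippet below only shows how such scores are typically computed with the Hugging Face `evaluate` library, using placeholder prediction/reference pairs.

```python
import evaluate

# Placeholder pairs; substitute real model outputs and gold references
predictions = ["Elimu ni msingi wa maendeleo."]
references = ["Elimu ndiyo msingi wa maendeleo."]

# BLEU expects a list of reference lists per prediction
bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)

print(f"BLEU={bleu['bleu']:.2f}  ROUGE-L={rouge['rougeL']:.2f}  CER={cer:.2f}")
```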

---

## βš™οΈ Usage (Python Example)

Below is a quick example to load and use **SALAMA LLM** for Swahili text generation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "EYEDOL/salama-llm"  # this model's repository ID on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Swahili prompt: "Write a short sentence about the importance of education."
prompt = "Andika sentensi fupi kuhusu umuhimu wa elimu."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,  # sampling must be enabled for temperature/top_p to apply
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.05
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**🦩 Example Output:**

> β€œElimu ni msingi wa maendeleo, humwezesha mtu kuelewa dunia na kuboresha maisha yake na jamii kwa ujumla.”

---

## ⚑ Key Features

- 🧩 Optimized for African low-resource NLP contexts  
- πŸ’¬ Instruction-following in Swahili and English  
- βš™οΈ Lightweight and efficient (QLoRA fine-tuned; runs on single 24 GB GPU)  
- 🌍 Culturally aligned text generation  
- 🦢 Open-source and extendable to other African languages  

---

## 🚫 Limitations

- ⚠️ May underperform with heavy code-switching (Swahili-English mix)  
- πŸ‘€ Not yet optimized for rare dialects or poetic forms  
- πŸ“š Limited exposure to specialized (medical/legal) corpora  
- πŸ”Š Relies on accurate STT transcription in end-to-end speech-to-speech use  

---

## πŸ”— Related Models

| Model | Description |
|--------|-------------|
| [`EYEDOL/salama-stt`](https://huggingface.co/EYEDOL/salama-stt) | Swahili Speech-to-Text model (Whisper-small fine-tuned) |
| [`EYEDOL/salama-tts`](https://huggingface.co/EYEDOL/salama-tts) | Swahili Text-to-Speech model (VITS architecture) |

---

## 🧾 Citation

If you use **SALAMA LLM**, please cite:

```bibtex
@misc{salama_llm_2025,
  title={SALAMA LLM: Swahili Instruction-Tuned Text Generation Model},
  author={AI4NNOV},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/EYEDOL/salama-llm}}
}
```

---

**πŸ’‘ β€œElimu ni msingi wa maendeleo β€” Knowledge is the foundation of progress.”**