---
library_name: transformers
license: cc-by-sa-4.0
base_model:
- openai/whisper-medium
- openai/whisper-small
tags:
- generated_from_trainer
metrics:
- wer
model-index:
- name: stt
  results: []
datasets:
- mozilla-foundation/common_voice_17_0
- fsicoli/common_voice_19_0
language:
- sw
- en
pipeline_tag: automatic-speech-recognition
---

# Swahili-English Speech-to-Text (STT) Model

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) optimized for Swahili and English speech recognition. It was trained on the Common Voice 17.0 dataset and achieves a substantially lower word error rate (WER) than the base model, particularly on Swahili.

## Model Performance

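The model achieves the following results on the evaluation set (see the Training Results table for the full history):
- **Loss**: 0.3390 (lowest validation loss, reached at step 1500)
- **WER**: 14.7% (final checkpoint, step 8000)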

## Usage

### Installation

First, install the required dependencies:

```bash
pip install transformers torch librosa
```

### Basic Usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the model and processor
processor = AutoProcessor.from_pretrained("Jacaranda-Health/ASR-STT")
model = AutoModelForSpeechSeq2Seq.from_pretrained("Jacaranda-Health/ASR-STT")
model.generation_config.forced_decoder_ids = None

def transcribe(filepath):
    """
    Transcribe audio file to text
    
    Args:
        filepath (str): Path to audio file
        
    Returns:
        str: Transcribed text
    """
    # Load audio file
    audio, sr = librosa.load(filepath, sr=16000)
    
    # Process audio
    inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
    
    # Generate transcription
    with torch.no_grad():
        generated_ids = model.generate(inputs["input_features"])
    
    # Decode the transcription
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    
    return transcription

# Example usage
transcription = transcribe("path/to/your/audio.wav")
print(f"Transcription: {transcription}")
```
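
Whisper feature extractors work on 30-second windows by default, so the `transcribe` function above is likely to see only the start of a longer recording. The following is a minimal sketch of one way to split a long waveform into fixed-size chunks first. `split_into_chunks` and its parameters are hypothetical names for illustration, not part of this model's API, and fixed-length cuts can split a word at a chunk boundary.

```python
def split_into_chunks(samples, sr=16000, chunk_s=30.0):
    """Split a 1-D waveform into consecutive chunks of at most chunk_s seconds.

    Works on any sliceable sequence (a list or a NumPy array from librosa).
    """
    if sr <= 0 or chunk_s <= 0:
        raise ValueError("sr and chunk_s must be positive")
    chunk_len = int(sr * chunk_s)  # samples per chunk
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# Illustrative input: 70 seconds of silence at 16 kHz
waveform = [0.0] * (70 * 16000)
chunks = split_into_chunks(waveform)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk could then be passed through `processor` and `model.generate` as in `transcribe`, with the resulting texts joined by spaces.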

### Batch Processing

```python
def transcribe_batch(audio_files):
    """
    Transcribe multiple audio files
    
    Args:
        audio_files (list): List of audio file paths
        
    Returns:
        list: One dict per file, with 'file' plus either 'transcription' or 'error'
    """
    transcriptions = []
    
    for filepath in audio_files:
        try:
            transcription = transcribe(filepath)
            transcriptions.append({
                'file': filepath,
                'transcription': transcription
            })
        except Exception as e:
            transcriptions.append({
                'file': filepath,
                'error': str(e)
            })
    
    return transcriptions

# Example usage
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = transcribe_batch(audio_files)
```
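
Because `transcribe_batch` records failures instead of raising, callers usually need to separate the two kinds of entries. A minimal sketch follows; `split_results` is a hypothetical helper, and the sample `results` list (including the error message) is made up to show the shape of the data.

```python
def split_results(results):
    """Separate successful transcriptions from failed ones."""
    succeeded = [r for r in results if "transcription" in r]
    failed = [r for r in results if "error" in r]
    return succeeded, failed


# Made-up entries in the shape returned by transcribe_batch
results = [
    {"file": "audio1.wav", "transcription": "Nchi ya maajabu."},
    {"file": "audio2.wav", "error": "file not found"},
]

succeeded, failed = split_results(results)
for r in failed:
    print(f"{r['file']}: {r['error']}")
```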

## Model Comparison

The fine-tuned model shows **dramatic improvements** over the base Whisper model, particularly in Swahili accuracy. In the comparison examples below, the base model often produces text in the wrong language or script, while the fine-tuned model matches the ground truth exactly or nearly so:

### Example 1: Complete Language Confusion
- **Ground Truth**: "Panya wengi huishi kati ya wanadamu."
- **Base Model**: "本来我以为是个铁网来的" *(Chinese characters!)*
- **Fine-tuned Model**: "Wanyawengi huishi kati ya wanadamu." *(close, but the first two words are misrecognized)* <br><br>

- **Ground Truth**: "Mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara."
- **Base Model**: "Nguni unia nzisho kwenye kisiwa kilichopo kariwu sana nabara"
- **Fine-tuned Model**: "Mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara." ✓ <br><br>


- **Ground Truth**: "Nchi ya maajabu."
- **Base Model**: "Um dia mais, diabo!" *(Portuguese)*
- **Fine-tuned Model**: "Nchi ya maajabu." ✓

### Example 2: Arabic Script Mix
- **Ground Truth**: "Alama yake ni µm."
- **Base Model**: "الله معاكي لأم" *(Arabic script)*
- **Fine-tuned Model**: "Alama yake ni µm." ✓

### Example 3: English Instead of Swahili
- **Ground Truth**: "Ni msimamizi wa mtandao na wa wanafunzi."
- **Base Model**: "You don't see no music on Tyndale? No, I don't see no music on Tyndale."
- **Fine-tuned Model**: "Ni msimamizi wa mtandao na wa wanafunzi." ✓


## Key Improvements

The fine-tuned model demonstrates superior performance in:

- **Swahili Grammar**: Better handling of Swahili sentence structure and grammar
- **Word Recognition**: More accurate recognition of Swahili vocabulary
- **Context Understanding**: Improved contextual understanding across different domains
- **Pronunciation Variants**: Better handling of different Swahili pronunciation patterns
- **Mixed Language**: Enhanced performance on code-switched Swahili-English content

## Training Visualizations

The following charts illustrate the model's training progress and performance improvements:

### Word Error Rate (WER) Progress
![WER Progress](./Media/wer.png)

The WER chart shows transcription accuracy improving over the course of training. WER falls from approximately 21.6% at step 500 to its lowest value of about 14.6% at step 5500, then plateaus near 14.7% through step 8000, which indicates that the model has converged.

### Learning Rate Schedule
![Learning Rate](./Media/lr.png)

The learning rate warms up over the first 50 steps to a peak of 1e-05, then decays along a cosine curve over the remaining steps up to step 8000. The small peak value and gradual decay help keep fine-tuning stable.
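
The shape of this schedule can be sketched in a few lines of plain Python. This is an approximation that assumes linear warmup over the 50 warmup steps listed under Training Hyperparameters and a half-cosine decay to zero at step 8000, which is how the `cosine` scheduler in Transformers behaves by default; `lr_at_step` is a hypothetical helper, not the training code itself.

```python
import math


def lr_at_step(step, peak_lr=1e-5, warmup_steps=50, total_steps=8000):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Start, mid-warmup, peak, halfway through the decay, and the final step
for step in (0, 25, 50, 4025, 8000):
    print(step, f"{lr_at_step(step):.2e}")
```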

## Training Details

### Training Procedure

The model was fine-tuned using the following approach:
- **Base Model**: OpenAI Whisper Medium
- **Dataset**: Mozilla Common Voice 17.0 (Swahili and English)
- **Training Steps**: 8,000 steps
- **Learning Rate**: 1e-05 with cosine scheduler
- **Batch Size**: 16 (train and eval)

### Training Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 50
- training_steps: 8000
- mixed_precision_training: Native AMP
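
For readers who want to reproduce a similar setup, the list above maps roughly onto keyword arguments of `Seq2SeqTrainingArguments` in Transformers. The sketch below is an assumption-laden reconstruction, not the original training script: the output path is hypothetical, the batch size is assumed to be per device on a single device, and "Native AMP" is assumed to mean `fp16` (it could also have been `bf16`). The optimizer betas and epsilon listed above match the library defaults, so they are not set explicitly.

```python
# Keyword arguments mirroring the hyperparameters listed above.
training_kwargs = {
    "output_dir": "./whisper-sw-en",      # hypothetical path
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,    # assumes a single device
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 50,
    "max_steps": 8000,
    "fp16": True,                         # assumed meaning of "Native AMP"
}

# With transformers installed, these could be passed as:
#   from transformers import Seq2SeqTrainingArguments
#   args = Seq2SeqTrainingArguments(**training_kwargs)
print(training_kwargs)
```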

### Training Results

| Training Loss | Epoch  | Step | Validation Loss | WER Ortho | WER     |
|:-------------:|:------:|:----:|:---------------:|:---------:|:-------:|
| 0.4135        | 0.6180 | 500  | 0.4069          | 29.9115   | 21.6319 |
| 0.2036        | 1.2361 | 1000 | 0.3584          | 25.8738   | 18.3552 |
| 0.1899        | 1.8541 | 1500 | 0.3390          | 24.0940   | 16.4814 |
| 0.0978        | 2.4722 | 2000 | 0.3406          | 24.1957   | 16.8982 |
| 0.0584        | 3.0902 | 2500 | 0.3589          | 22.7718   | 15.9189 |
| 0.0457        | 3.7083 | 3000 | 0.3660          | 23.3075   | 15.8580 |
| 0.0203        | 4.3263 | 3500 | 0.3762          | 22.9108   | 15.7394 |
| 0.0193        | 4.9444 | 4000 | 0.3683          | 22.0192   | 15.2616 |
| 0.0073        | 5.5624 | 4500 | 0.3926          | 22.5447   | 15.5801 |
| 0.0022        | 6.1805 | 5000 | 0.4065          | 21.5649   | 14.9092 |
| 0.0022        | 6.7985 | 5500 | 0.4080          | 21.2835   | 14.6313 |
| 0.0009        | 7.4166 | 6000 | 0.4180          | 21.2564   | 14.6415 |
| 0.0007        | 8.0346 | 6500 | 0.4244          | 21.2361   | 14.6551 |
| 0.0006        | 8.6527 | 7000 | 0.4283          | 21.3276   | 14.6957 |
| 0.0006        | 9.2707 | 7500 | 0.4297          | 21.3378   | 14.7059 |
| 0.0006        | 9.8888 | 8000 | 0.4300          | 21.3276   | 14.7093 |
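
The card does not define the two WER columns. A common convention, assumed here, is that "WER Ortho" is computed on the raw text (casing and punctuation count as errors) and "WER" on normalized text. The sketch below shows the difference using a word-level edit distance; the normalization is a toy stand-in (lowercasing and stripping ASCII punctuation), not necessarily the normalizer used for this model, and the hypothesis string is a made-up variant of one of the ground-truth sentences.

```python
import string


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, normalize=False):
    """WER in percent, optionally after a toy text normalization."""
    if normalize:
        table = str.maketrans("", "", string.punctuation)
        reference = reference.lower().translate(table)
        hypothesis = hypothesis.lower().translate(table)
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_errors(ref_words, hyp_words) / len(ref_words)


ref = "Mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara."
hyp = "mji ulianzishwa kwenye kisiwa kilichopo karibu sana na bara"
print(wer(ref, hyp))                  # orthographic: 2 of 9 words differ
print(wer(ref, hyp, normalize=True))  # normalized: 0.0
```

Under this convention the normalized score is never penalized for casing or punctuation alone, which is consistent with the WER column being lower than WER Ortho at every step in the table.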

## Supported Languages

- **Primary**: Swahili (sw)
- **Secondary**: English (en)


## Out-of-Scope Use

This Speech-to-Text (ASR) model is intended for research, social good, and internal use only. For commercial use and distribution, organizations and individuals are encouraged to contact **Jacaranda Health**. To support ethical and responsible use, the guidelines below group activities and practices into three areas: prohibited actions, high-risk activities, and deceptive practices. Users are expected to understand and follow them.

### 1. Prohibited Actions:
* **Illegal Activities:** Avoid using the model to transcribe content that promotes violence, child exploitation, human trafficking, and other crimes.
* **Harassment and Discrimination:** No transcription activities that facilitate bullying, threats, or discriminatory practices.
* **Unauthorized Surveillance:** No unlicensed monitoring or recording of individuals without proper consent.
* **Data Misuse:** Handle audio data and transcriptions with proper consents and privacy protections.
* **Rights Violations:** Respect third-party intellectual property and privacy rights in audio content.
* **Malicious Content Creation:** Avoid transcribing content intended for harmful software or malicious purposes.

### 2. High-Risk Activities:
* **Sensitive Industries:** Exercise extreme caution when using in military, nuclear, or intelligence domains.
* **Legal Proceedings:** Avoid using transcriptions as the sole evidence in critical legal or judicial processes without proper validation.
* **Critical Systems:** No deployment in safety-critical infrastructure or transport technologies without extensive testing.
* **Medical Diagnosis:** Avoid using transcriptions for direct medical diagnosis or treatment decisions.
* **Emergency Services:** Not suitable as the primary tool for emergency response systems.

### 3. Deceptive Practices:
* **Misinformation:** Refrain from using transcriptions to create or promote fraudulent or misleading audio content.
* **Deepfake Audio:** Avoid using transcriptions to facilitate creation of deceptive synthetic audio.
* **Impersonation:** No transcribing content intended to impersonate individuals without authorization.
* **Misrepresentation:** No false claims about transcription accuracy or model capabilities.
* **Fake Content Generation:** No promotion of false audio-text pairs or fabricated conversations.

## Bias, Risks, and Limitations

This Speech-to-Text model has inherent risks and limitations. Testing has focused predominantly on Swahili and English, so many linguistic variations and acoustic conditions remain untested.

### Key Limitations:

**Language and Dialect Variations**: The model's performance may vary significantly across different Swahili dialects, regional accents, and code-switching patterns not represented in the training data.

**Audio Quality Sensitivity**: Performance degrades with poor audio quality, background noise, multiple speakers, or non-standard recording conditions.

**Domain Specificity**: The model may struggle with highly technical terminology, proper names, or domain-specific vocabulary outside its training scope.

**Contextual Understanding**: While improved over the base model, contextual interpretation limitations may lead to incorrect transcriptions in ambiguous scenarios.

**Bias Considerations**: Like other AI models, this ASR system may exhibit biases present in the training data, potentially affecting transcription quality for underrepresented speaker groups or topics.

### Responsible Deployment:

Like other ASR systems, this model's output is not fully predictable, and it may occasionally generate transcriptions that are inaccurate, culturally insensitive, or otherwise problematic for certain audio inputs.

Before deploying this ASR model in any production application, developers must carry out thorough safety testing and evaluation tailored to their specific use cases. This includes testing across the speaker demographics, audio conditions, and content types relevant to the intended application.

## Contact Us

For any questions, feedback, or commercial inquiries, please reach out at **ai@jacarandahealth.org**


## Framework Versions

- Transformers 4.51.3
- PyTorch 2.5.1+cu121
- Datasets 3.6.0
- Tokenizers 0.21.1

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{jacaranda_asr_stt_2025,
  title={Swahili-English Speech-to-Text Model},
  author={Jacaranda Health},
  year={2025},
  howpublished={\url{https://huggingface.co/Jacaranda-Health/ASR-STT}}
}
```