---
title: Sahel-Voice-Lab — Minimal
emoji: 🌍
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: "5.25.0"
app_file: app_minimal.py
hardware: cpu-basic
pinned: false
license: mit
tags:
- bambara
- fula
- speech-recognition
- text-to-speech
- agriculture
- iot
- language-learning
- west-africa
- low-resource-nlp
- memory
---
# 🌍 Sahel-Voice-Lab
**A voice-first AI assistant for Bambara (Mali) and Fula/Pular (Guinea, Senegal).**
Two intertwined jobs:
1. **Memory loop**: users *teach* the assistant new words; it persists them to a Hugging Face dataset and uses them as the source of truth in future answers.
2. **Agricultural IoT voice interface**: Sahelian farmers query soil, weather, irrigation, and pest data in their own language and get short answers (≤ 6 words per sentence) for clean TTS.
The core stack is explicitly **100% non-Meta** (Whisper / Aya-Expanse / F5-TTS / VITS); MMS-TTS is only used as a baseline fallback.
---
## What this Space currently runs — the `ground-zero` minimal baseline
The deployed Space (`app_file: app_minimal.py`) is the **Month 1–3 rebuild**
baseline — a stripped-down Whisper → LLM → MMS-TTS pipeline used for field
testing and to build a real-user eval set. No LoRA adapters, no memory loop,
no speaker ID, no voice cloning, no IoT, no phrase matcher. Everything in
`app.py` still exists for the full production stack; it is just not what the
Space serves today.
Four stacked changes improve dialect fidelity without any training:
1. **Stage 1 — dialect-pinned system prompt** (`src/llm/minimal_client.py`).
Replaces the `GemmaClient` JSON/teacher flow with a plain-text client whose
system prompt pins the target dialect explicitly — *Bambara as spoken in
Bamako, Mali* and *Pular of Fuuta Jallon, as spoken in Guinea* — names the
languages the model must **not** drift into (Wolof, Hausa, Pulaar of
Senegal, Fulfulde of Nigeria, Jula of Côte d'Ivoire), and injects a 30-pair
bilingual gold list as few-shot anchoring
(`configs/dialect_anchors/{bambara_mali,pular_guinea}.json`).
2. **Stage 2 — curated phrasebook short-circuit** (`src/llm/phrasebook.py`).
Before calling the LLM, the user's input is normalised and fuzzy-matched
(threshold 0.88) against a curated English-keyed phrasebook
(`configs/dialect_anchors/{bambara,pular}_phrasebook.json` — 100 Bambara /
110 Pular entries across greetings, family, food, farming, health,
shopping, travel, clarity, time, parting). A hit returns the gold
translation directly, with no LLM call: no hallucination risk and negligible added latency.
3. **Stage 3 — better multilingual base LLM.**
Default `LLM_MODEL_ID` is now **`CohereLabs/aya-expanse-32b`**, a 23-language
multilingual model with much stronger West African coverage than Qwen
2.5-7B. Can be overridden via the `LLM_MODEL_ID` env var (e.g. to
`Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not
available on your HF account.
4. **Stage 4 — split translate / reply UI + per-turn telemetry + RAG few-shot.**
Both Voice and Text tabs use a 4-box layout: phrasebook translation (text
+ audio) is automatic on submit (no LLM), and a separate **Generate reply**
button calls the dialect-anchored LLM for a conversational response. On a
phrasebook miss the LLM is RAG-injected with the top-3 nearest curated
pairs as additional style anchoring. Every turn is appended to
`data/field_turns.jsonl` (`src/engine/turn_logger.py`) with phase, latency
breakdown, phrasebook hit, and reply — the substrate for hit-rate
measurement, A/B comparisons, and eventual Stage-5 LoRA training-data
curation. The system prompt now also explicitly tells the LLM to **reply,
not translate** — the few-shot pairs are framed as style/orthography
references only, fixing the "the LLM just echoes the phrasebook target"
regression.
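The Stage 2 short-circuit and the Stage 4 RAG fallback can be sketched roughly as follows. This is a minimal illustration, not the code in `src/llm/phrasebook.py`: it assumes the phrasebook is a flat English-keyed dict, uses the standard library's `difflib.SequenceMatcher` as the similarity measure (the real matcher may use something else), and the helper names `normalise`, `lookup`, and `nearest_pairs` as well as the placeholder target strings are hypothetical.

```python
import re
import unicodedata
from difflib import SequenceMatcher

THRESHOLD = 0.88  # match threshold quoted in this README

# Hypothetical stand-in for a phrasebook JSON file; targets are placeholders.
PHRASEBOOK = {
    "good morning": "<gold translation 1>",
    "thank you": "<gold translation 2>",
    "where is the market": "<gold translation 3>",
    "how is your family": "<gold translation 4>",
}


def normalise(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def score(a, b):
    """Similarity in [0, 1]; an assumed metric, not necessarily the real one."""
    return SequenceMatcher(None, a, b).ratio()


def lookup(user_text):
    """Return the gold translation on a hit, or None on a miss."""
    query = normalise(user_text)
    best_key = max(PHRASEBOOK, key=lambda key: score(query, key))
    if score(query, best_key) >= THRESHOLD:
        return PHRASEBOOK[best_key]
    return None


def nearest_pairs(user_text, k=3):
    """On a miss, pick the k closest curated pairs as few-shot anchors."""
    query = normalise(user_text)
    ranked = sorted(PHRASEBOOK, key=lambda key: score(query, key), reverse=True)
    return [(key, PHRASEBOOK[key]) for key in ranked[:k]]
```

In this sketch a caller would try `lookup()` first and only build an LLM prompt, seeded with `nearest_pairs()`, when it returns `None`.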
See `docs/baseline_rebuild.md` for the broader minimal-track plan.
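The per-turn telemetry from Stage 4 could look roughly like the sketch below. It is an assumption-laden illustration rather than the contents of `src/engine/turn_logger.py`: the field names (`ts`, `phase`, `latency_ms`, `phrasebook_hit`, `reply`) and the `log_turn` / `hit_rate` helpers are hypothetical, chosen only to mirror the fields listed above.

```python
import json
import time
from pathlib import Path


def log_turn(path, *, phase, latency_ms, phrasebook_hit, reply):
    """Append one turn as a single JSON line (field names are assumptions)."""
    record = {
        "ts": time.time(),
        "phase": phase,                  # e.g. "translate" or "reply"
        "latency_ms": latency_ms,        # per-stage breakdown, e.g. {"stt": 120.0}
        "phrasebook_hit": phrasebook_hit,
        "reply": reply,
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def hit_rate(path):
    """Fraction of logged turns answered by the phrasebook."""
    path = Path(path)
    if not path.exists():
        return 0.0
    lines = path.read_text(encoding="utf-8").splitlines()
    turns = [json.loads(line) for line in lines if line.strip()]
    if not turns:
        return 0.0
    return sum(1 for turn in turns if turn["phrasebook_hit"]) / len(turns)
```

Keeping one JSON object per line means the log can be appended to cheaply and read back line by line for hit-rate or A/B analysis.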
---
## Status
| Phase | Feature | State |
|------:|---------|-------|
| 1 | Memory loop (JSONL + HF Hub) | ✅ shipped |
| 2 | Waxal VITS TTS — Bambara | ✅ shipped |
| 2 | Waxal VITS TTS — Fula | ⏳ placeholder until `ous-sow/fula-tts` is trained |
| 3 | Voice-to-voice S2S (F5-TTS + CER) | 🚧 merged, stabilizing |
| — | Adlam ↔ Latin round-trip, per-language prompts | ✅ landed |
See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` for the parallel minimal-track strategy.
---
## Stack
| Layer | Tool |
|-------|------|
| STT | `openai/whisper-large-v3-turbo` + PEFT LoRA hot-swap (~50 MB adapter per language, ~50 ms switch) |
| LLM | `CohereLabs/aya-expanse-32b` (minimal-baseline default, strong African-language coverage) via HF Serverless InferenceClient — overridable to `Qwen/Qwen2.5-72B-Instruct`, `Qwen2.5-7B-Instruct`, Mistral, Zephyr |
| Dialect anchoring (minimal) | `src/llm/minimal_client.py` — pinned Bambara-Mali / Pular-Guinea system prompt with 30-pair bilingual few-shot + forbidden-drift guardrails |
| Phrasebook short-circuit (minimal) | `src/llm/phrasebook.py` — 100 Bambara + 110 Pular curated gold pairs, fuzzy-matched (0.88 threshold) before any LLM call |
| TTS (baseline) | `facebook/mms-tts-bam`, `facebook/mms-tts-ful` |
| TTS (Bambara) | `ynnov/ekodi-bambara-tts-female` (Waxal VITS) |
| TTS (Fula) | placeholder → `ous-sow/fula-tts` when published |
| Voice cloning | F5-TTS + OpenVoice V2 (Phase 3, GPU-only) |
| Speaker ID | SpeechBrain ECAPA-TDNN, 192-d embeddings, cosine ≥ 0.75 |
| Fast path | RapidFuzz over `data/phrases/{lang}.json` for greetings / thanks / farewells |
| Persistence | JSONL on disk + HF Hub datasets (no ORM) |
| Training | PEFT LoRA + `Seq2SeqTrainer` on FLEURS, Jeli-ASR, SLR 105/106 |
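As a rough illustration of the speaker-ID row, the sketch below matches an embedding against enrolled profiles with cosine similarity and the 0.75 cut-off from the table. It assumes embeddings are plain lists of floats (192-d in the real stack, shortened here) and that profiles live in a dict; `cosine` and `identify` are hypothetical helper names, not the SpeechBrain API.

```python
import math

SIMILARITY_THRESHOLD = 0.75  # cosine cut-off quoted in the Stack table


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(embedding, profiles, threshold=SIMILARITY_THRESHOLD):
    """Return the best-matching enrolled speaker, or None below the threshold."""
    best_name, best_score = None, -1.0
    for name, reference in profiles.items():
        similarity = cosine(embedding, reference)
        if similarity > best_score:
            best_name, best_score = name, similarity
    return best_name if best_score >= threshold else None
```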
---
## Four entry points (do not conflate)
| File | Purpose | Lifecycle |
|------|---------|-----------|
| `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit + RAG few-shot on miss + per-turn JSONL telemetry. Tabs: Voice / Text, each with split translation (phrasebook, automatic) and reply (LLM, on demand). | `python app_minimal.py` |
| `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
| `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
| `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
---
## Repository layout
```
app_minimal.py                  # Gradio (minimal baseline, currently served on the Space)
app.py                          # Gradio (full production UI)
app_lab.py                      # Gradio (experimental)
requirements.txt                # Spaces runtime; do NOT pin torch/torchaudio
packages.txt                    # apt deps (ffmpeg)
configs/
  base_config.yaml              # shared settings
  api_config.yaml               # FastAPI-specific
  lora_bambara.yaml             # Bambara LoRA hyperparams
  lora_fula.yaml                # Fula LoRA hyperparams
data/
  phrases/                      # RapidFuzz shortcut phrase JSONs per language
  vocabulary.jsonl              # local mirror of the HF Hub memory dataset
docs/
  roadmap_2026-04.md            # full architectural walkthrough + action plan
  baseline_rebuild.md           # parallel minimal-track plan (non-destructive)
  notebook_collaboration.md     # Kaggle push/pull workflow for contributors
  kaggle_mcp_setup.md           # optional Kaggle MCP for Claude Desktop
notebooks/
  kaggle_master_trainer/        # -> oussow/kaggle-master-trainer (LoRA fine-tune)
  train_fula_tts/               # -> oussow/sahel-voice-fula-tts-trainer (TBD)
  bootstrap_repos.ipynb
  train_colab.ipynb             # legacy Colab trainer
scripts/
  train_bambara.py              # LoRA fine-tune entrypoint (Kaggle/RunPod)
  train_fula.py                 # LoRA fine-tune entrypoint (Kaggle/RunPod)
  export_onnx.py                # merge LoRA -> ONNX -> TFLite
  verify_baseline.py            # eval harness
  run_server.py                 # FastAPI launcher
  run_data_pipeline.py          # dataset prep
  push_to_hf.sh                 # deploy helpers
  push_to_kaggle.sh             # deploy helpers
  runpod_setup.sh
src/
  api/                          # FastAPI app, schemas, routes, middleware
  conversation/                 # memory_manager, gemma_client, phrase_matcher, intent_parser
  data/                         # dataset loading + normalization (Adlam, Bambara)
  engine/                       # adapter_manager, transcriber, stt_processor, curiosity
  iot/                          # intent_parser, voice_responder, sensor_bridge
  llm/                          # LLM client wrappers
  memory/                       # vocabulary persistence
  optimization/                 # ONNX / quantization helpers
  training/                     # trainer, callbacks, augmenters
  tts/                          # mms_tts, waxal_tts, f5_tts, voice_cloner
  voice/                        # speaker_profiles (ECAPA-TDNN + OpenVoice SE)
tests/                          # pytest: api, data pipeline, engine, iot
```
---
## How the memory loop works
1. Press **Push-to-Talk** → speak in Bambara, Fula, French, or English.
2. **Whisper** transcribes. If the language has a LoRA adapter loaded, `AdapterManager` hot-swaps to it (~50 ms).
3. **Qwen** reads the vocabulary it has learned so far (`MemoryManager.get_vocabulary_context()`), then returns a structured JSON reply with `intent ∈ {teaching, question, conversation, error}`.
4. If `teaching`: the word pair is appended to `data/vocabulary.jsonl` and async-pushed to `ous-sow/sahel-agri-feedback → vocabulary.jsonl`.
5. If `question`: Qwen answers using the remembered vocabulary as source of truth.
6. If `conversation`: Qwen replies naturally.
7. TTS speaks the reply (Waxal VITS for Bambara, MMS-TTS fallback elsewhere).
The last 5 learned words are always visible in the UI.
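Steps 3 and 4 can be sketched as below: parse the LLM's JSON reply out of an optional markdown fence, validate the intent, and append a taught pair to the local JSONL file. This is a minimal sketch under stated assumptions, not the project's `MemoryManager`: the shape of `teaching_pair` (`word` / `meaning`) and the helper names `parse_reply` and `remember` are hypothetical, and the Hub push is omitted.

```python
import json
import re
from pathlib import Path

VALID_INTENTS = {"teaching", "question", "conversation", "error"}

FENCE = "`" * 3  # built at runtime so this example nests cleanly in markdown
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)

ERROR_REPLY = {"intent": "error", "reply": ""}


def parse_reply(raw):
    """Extract the JSON object from an LLM reply, fenced or not."""
    match = FENCE_RE.search(raw)
    payload = match.group(1) if match else raw.strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return dict(ERROR_REPLY)
    if not isinstance(data, dict) or data.get("intent") not in VALID_INTENTS:
        return dict(ERROR_REPLY)
    return data


def remember(parsed, path):
    """Append the taught pair to the JSONL file; return True if written."""
    pair = parsed.get("teaching_pair")
    if parsed.get("intent") != "teaching" or not pair:
        return False
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(pair, ensure_ascii=False) + "\n")
    return True
```

Falling back to an `error` intent on malformed output keeps the UI path simple: the caller only ever sees one of the four intents.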
---
## How the agricultural voice interface works
1. User asks, e.g., *"A bɛ di wa?"* ("Is it OK?") referring to their field.
2. `intent_parser.py` (keyword-based) classifies the request: `check_soil` / `check_weather` / `irrigation_status` / `pest_alert` / etc.
3. `SensorBridge` calls the configured `SENSOR_API_URL` and returns a typed `SensorData`.
4. `voice_responder.py` maps `(Intent, SensorData)` → a short (≤ 6 words/sentence) Bambara or Fula reply + English translation. Alert thresholds are encoded here (`SOIL_MOISTURE_LOW=30`, `TEMP_HIGH=38`, pH bounds).
5. TTS speaks the reply.
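The pipeline above can be approximated with the sketch below. It is illustrative only: the keyword lists, the `SensorData` fields, the units (percent moisture, degrees Celsius), the direction of each comparison, and the English placeholder replies are all assumptions; only the two threshold values and the six-word limit come from this README. The real `voice_responder.py` emits Bambara or Fula text.

```python
from dataclasses import dataclass

SOIL_MOISTURE_LOW = 30  # value from this README; unit assumed to be percent
TEMP_HIGH = 38          # value from this README; unit assumed to be Celsius
MAX_WORDS = 6           # per-sentence limit for clean TTS


@dataclass
class SensorData:
    soil_moisture: float
    temperature: float


# Hypothetical keyword table; the real parser's vocabulary is not shown here.
KEYWORDS = {
    "check_soil": ("soil", "moisture"),
    "check_weather": ("weather", "rain", "hot"),
}


def parse_intent(text):
    """Keyword-based classification: first intent with a matching keyword."""
    lowered = text.lower()
    for intent, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return intent
    return "unknown"


def respond(intent, data):
    """Map (intent, sensor reading) to short sentences, then join them."""
    if intent == "check_soil":
        dry = data.soil_moisture < SOIL_MOISTURE_LOW
        sentences = ["Soil is dry.", "Water the field today."] if dry else ["Soil moisture is fine."]
    elif intent == "check_weather":
        hot = data.temperature > TEMP_HIGH
        sentences = ["It is very hot.", "Protect young plants."] if hot else ["Weather is fine."]
    else:
        sentences = ["Please ask again."]
    for sentence in sentences:
        if len(sentence.split()) > MAX_WORDS:
            raise ValueError(f"sentence too long for TTS: {sentence!r}")
    return " ".join(sentences)
```

Checking the word limit at the point where replies are assembled means an over-long template fails loudly during development instead of producing garbled speech.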
---
## Environment variables
Every variable is optional (most have sensible defaults), so you can boot the Space without setting any of them, but without `HF_TOKEN` the memory loop cannot push to the Hub.
### Core
| Key | Default | Purpose |
|-----|---------|---------|
| `HF_TOKEN` | — | HF write token. Required for Hub push and gated models. |
| `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` | Memory-loop target dataset. |
| `ADAPTER_REPO_ID` | `ous-sow/sahel-agri-adapters` | Published LoRA adapters. |
| `WHISPER_MODEL_ID` | `openai/whisper-large-v3-turbo` | STT base model. |
| `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` | LLM via HF Serverless. Override to any HF Serverless-supported model. |
| `LOG_LEVEL` | `INFO` | Standard Python logging level. |
| `DEVICE` | `cuda` (FastAPI) | Torch device for inference. |
### Adapters & TTS
| Key | Default |
|-----|---------|
| `BAMBARA_ADAPTER_PATH` | `./adapters/bambara` |
| `FULA_ADAPTER_PATH` | `./adapters/fula` |
| `BAMBARA_TTS_REPO` | `ynnov/ekodi-bambara-tts-female` |
| `FULA_TTS_REPO` | `ous-sow/fula-tts` |
### IoT
| Key | Default |
|-----|---------|
| `SENSOR_API_URL` | *(unset → mock sensor)* |
### Self-Teaching tab (triggers Kaggle training runs)
| Key | Default |
|-----|---------|
| `KAGGLE_USERNAME` | — |
| `KAGGLE_KEY` | — |
| `KAGGLE_KERNEL_SLUG` | `ous-sow/sahel-voice-master-trainer` *(override in prod to `oussow/kaggle-master-trainer` — the actual Kaggle owner slug)* |
| `AUTO_TRAIN_THRESHOLD` | `50` |
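One way to read these variables with the documented defaults is sketched below. The `Settings` dataclass, `load_settings`, and the `can_push` property are hypothetical names for illustration; the apps may read `os.environ` directly. Only a subset of the variables is shown.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    feedback_repo_id: str
    whisper_model_id: str
    llm_model_id: str
    sensor_api_url: Optional[str]
    auto_train_threshold: int

    @property
    def can_push(self):
        """Without a token the memory loop can only write locally."""
        return bool(self.hf_token)


def load_settings(env=None):
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        feedback_repo_id=env.get("FEEDBACK_REPO_ID", "ous-sow/sahel-agri-feedback"),
        whisper_model_id=env.get("WHISPER_MODEL_ID", "openai/whisper-large-v3-turbo"),
        llm_model_id=env.get("LLM_MODEL_ID", "CohereLabs/aya-expanse-32b"),
        sensor_api_url=env.get("SENSOR_API_URL"),  # None means: use the mock sensor
        auto_train_threshold=int(env.get("AUTO_TRAIN_THRESHOLD", "50")),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to exercise in tests.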
---
## Run locally
```bash
# Minimal baseline (what the Space runs)
pip install -r requirements.txt
python app_minimal.py
# Full production UI (not currently on the Space)
python app.py
# FastAPI service
python scripts/run_server.py
# Experimental lab UI
python app_lab.py
```
System-level dependency: **ffmpeg** (see `packages.txt`).
---
## Training
LoRA fine-tuning runs on **Kaggle T4** or **RunPod** — not locally. Pick one entrypoint:
| Target | Script | Notebook |
|--------|--------|----------|
| Bambara LoRA | `scripts/train_bambara.py` | `notebooks/kaggle_master_trainer/` |
| Fula LoRA | `scripts/train_fula.py` | `notebooks/kaggle_master_trainer/` |
| Fula TTS | — | `notebooks/train_fula_tts/` *(planned)* |
**Contributor workflow:** edit notebooks locally in `notebooks/<slug>/`, commit them with `nbstripout` enabled to keep diffs clean, then run `cd notebooks/<slug> && kaggle kernels push` to execute on a Kaggle GPU. Full walkthrough in `docs/notebook_collaboration.md`.
`docs/kaggle_mcp_setup.md` documents the optional Kaggle MCP for Claude Desktop if you'd rather drive Kaggle from an LLM.
---
## Export for edge
```bash
python scripts/export_onnx.py # merges LoRA into the backbone, exports ONNX
# then onnx-tf → TFLite for Android
```
ONNX does not support LoRA hot-swap, so export one file per language. `bitsandbytes` NF4 / 8-bit quantization is available for GPU-constrained deploys but is a training-only dep (not in `requirements.txt`).
---
## Tests
```bash
pytest tests/
```
Covers: FastAPI routes, data pipeline, engine (adapter manager + transcriber), IoT (intent parser + voice responder).
---
## Space secrets (HF UI → Settings → Secrets)
At minimum:
| Key | Value |
|-----|-------|
| `HF_TOKEN` | write-scope token |
| `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` |
| `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` (or any HF Serverless-supported model) |
---
## Design constraints (deliberate — do not change without discussion)
- **Adapter hot-swap** via PEFT's multi-adapter API — one backbone in VRAM, ~50 MB adapters per language, `set_adapter` ≈ 50 ms.
- **Qwen "adult-child" JSON contract** — structured `intent`/`reply`/`english`/`teaching_pair` output, parsed out of optional markdown fences.
- **JSONL + Hub push memory** — no ORM, thread-safe `MemoryManager`, async push so UI never blocks.
- **≤ 6 words per sentence** in `voice_responder.py` for clean MMS-TTS.
- **Adlam ↔ Latin dual-script** handling in `adlam.py` + `bam_normalize.py`.
- **Single-file `app.py`** — intentional for now; do not split without a plan.
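The "JSONL + Hub push" constraint can be pictured with the sketch below: a lock serialises appends, and the push runs on a background thread so the caller returns immediately. This is an assumed shape, not the real `MemoryManager`; the class name `JsonlMemory` and the injected `push_fn` callback are hypothetical (in practice it would wrap whatever Hub upload call the project uses).

```python
import json
import threading
from pathlib import Path


class JsonlMemory:
    """Append-only JSONL store with a non-blocking push hook (illustrative)."""

    def __init__(self, path, push_fn=None):
        self._path = Path(path)
        self._lock = threading.Lock()
        self._push_fn = push_fn  # called with the file path after each append

    def add(self, record):
        """Append one record under the lock, then push in the background."""
        line = json.dumps(record, ensure_ascii=False)
        with self._lock:
            self._path.parent.mkdir(parents=True, exist_ok=True)
            with self._path.open("a", encoding="utf-8") as fh:
                fh.write(line + "\n")
        if self._push_fn is None:
            return None
        worker = threading.Thread(target=self._push_fn, args=(self._path,), daemon=True)
        worker.start()
        return worker  # callers may join() it, but the UI path does not have to

    def last(self, n=5):
        """Return the most recent n records, e.g. for the 'last 5 words' view."""
        with self._lock:
            if not self._path.exists():
                return []
            lines = self._path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines[-n:] if line.strip()]
```

Injecting the push function keeps the store testable offline and leaves retry or batching policy to the caller.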
---
## License
MIT.