Sahel-Voice-Lab — Roadmap & Starting-from-Scratch Plan

Last updated: 2026-04-19
Maintainer: Broulaye

This document has two parts:

Where the project stands today — what's built, what's missing, and what to do next.
If I were starting from zero today — a realistic, solo-maintainer, free-compute path from nothing to a usable Bambara voice assistant.

Part 1 — The four layers, in plain language

A voice assistant is four layers stacked on top of each other. Most of this project is about the fourth layer; the first three are mostly rented or borrowed.

Layer 1 — the ear: Speech-to-Text (STT / ASR)

Abbreviations:

STT = Speech-to-Text
ASR = Automatic Speech Recognition (same thing)
LoRA = Low-Rank Adaptation — a technique to "patch" a large model with a tiny file (~50 MB) instead of retraining all of it
PEFT = Parameter-Efficient Fine-Tuning — the HuggingFace library that implements LoRA

The model used is Whisper (OpenAI's open-source multilingual speech model). Out of the box, Whisper's Bambara is poor because it barely saw Bambara during training. The fix: train LoRA adapters per language. One ~1.5 GB Whisper backbone stays in memory; small Bambara and Fula patches swap in and out in ~50 ms.

Where it lives in the repo:

src/engine/whisper_base.py — loads the backbone
src/engine/adapter_manager.py — the hot-swap
src/engine/transcriber.py — what the app calls
src/training/trainer.py + notebooks/kaggle_master_trainer.ipynb — training
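To see why a LoRA patch lands in the tens of megabytes while the backbone stays above a gigabyte, it helps to do the arithmetic. The sketch below counts the parameters a LoRA pair adds to one weight matrix and scales that up to a whole adapter. The hidden size (1280), rank (32), and the choice of 192 adapted matrices are illustrative assumptions, not values read from this project's training config.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_size_mb(n_matrices: int, d_model: int, rank: int,
                    bytes_per_param: int = 4) -> float:
    """Approximate adapter size in megabytes (1 MB = 1e6 bytes)."""
    total_params = n_matrices * lora_param_count(d_model, d_model, rank)
    return total_params * bytes_per_param / 1e6


# Assumed setup: two adapted projections in each of 96 attention blocks.
n_matrices = 96 * 2
print(lora_param_count(1280, 1280, 32))                      # params per matrix
print(round(adapter_size_mb(n_matrices, 1280, 32), 1))       # fp32 size in MB
print(round(adapter_size_mb(n_matrices, 1280, 32, 2), 1))    # fp16 size in MB
```

Under these assumptions the adapter sits in the same range as the ~50 MB figure above; a different rank or a different set of target modules moves it proportionally.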

Layer 2 — the brain: Large Language Model (LLM)

Abbreviations:

LLM = Large Language Model
JSON = JavaScript Object Notation, a structured format

No one trains this from scratch. You rent one. The project calls Qwen (Alibaba's multilingual model) through HuggingFace's hosted inference service, with a custom "adult-child" prompt that forces structured JSON output (fields like intent, reply, translation).

Where it lives:

src/llm/gemma_client.py — named "gemma" for legacy reasons; now talks to Qwen.
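A structured-JSON contract only helps if the caller rejects replies that break it. The following is a minimal sketch of that check, assuming the three field names mentioned above (intent, reply, translation); the function name parse_llm_reply is hypothetical and is not taken from gemma_client.py.

```python
import json

# Field names come from the description above; the real contract may differ.
REQUIRED_FIELDS = ("intent", "reply", "translation")


def parse_llm_reply(raw: str) -> dict:
    """Extract and validate the JSON object inside a raw LLM completion.

    Hosted models often wrap JSON in prose or code fences, so this takes
    the span between the first '{' and the last '}' before parsing.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM output")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data
```

Raising ValueError for every failure mode keeps the caller simple: one except clause can trigger a retry or a canned fallback reply.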

Layer 3 — the mouth: Text-to-Speech (TTS)

Abbreviations:

TTS = Text-to-Speech
MMS = Massively Multilingual Speech (Meta's 1000+ language model, lower quality, used as fallback)
VITS = Variational Inference with adversarial learning for end-to-end Text-to-Speech (a specific architecture; higher quality, one speaker per trained model)
F5-TTS = a recent zero-shot voice-cloning TTS system

The hardest layer for low-resource languages. Needs hours of clean studio audio from a native speaker. Used in tiers:

MMS-TTS as fallback baseline
Waxal-VITS for trained Bambara quality
F5-TTS for voice cloning in Phase 3

Where it lives:

src/tts/mms_tts.py, src/tts/waxal_tts.py, src/tts/f5_tts.py, src/tts/voice_cloner.py
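The tiered setup above amounts to "try the best voice first, fall back to the baseline". A minimal sketch of such a dispatcher follows; the function name and the tier callables are hypothetical stand-ins, not the interfaces of the modules listed above.

```python
from typing import Callable, Sequence, Tuple

Synth = Callable[[str], bytes]  # text in, audio bytes out


def synthesize_with_fallback(text: str,
                             tiers: Sequence[Tuple[str, Synth]]) -> Tuple[str, bytes]:
    """Try each TTS tier in order; return (tier_name, audio) from the first success."""
    errors = []
    for name, synth in tiers:
        try:
            audio = synth(text)
            if audio:  # empty output counts as a failure too
                return name, audio
            errors.append(f"{name}: empty audio")
        except Exception as exc:  # one failing tier must not kill the reply
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS tiers failed: " + "; ".join(errors))
```

Returning the tier name alongside the audio makes it easy to log how often the fallback is actually serving users.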

Layer 4 — the glue

The real differentiator of the project: everything that turns the rented models into a product.

Abbreviations:

IoT = Internet of Things (networked sensors)
ECAPA-TDNN = Emphasized Channel Attention, Propagation and Aggregation in a Time-Delay Neural Network; a speaker-fingerprint model

Components:

Memory loop — src/memory/memory_manager.py
Normalization — src/data/bam_normalize.py, src/data/adlam.py
Fast-path phrases — src/conversation/phrase_matcher.py
Intent detection — src/iot/intent_parser.py
Voice responder (≤ 6-word replies) — src/iot/voice_responder.py
Sensor bridge — src/iot/sensor_bridge.py
Speaker ID — src/voice/speaker_profiles.py

Part 2 — What's present vs missing

Present

All four layers scaffolded; every module named in the project description exists in src/
Two entry points: app.py (Gradio, HF Space) and src/api/app.py (FastAPI)
Training infrastructure and Kaggle notebooks
Mobile export pipeline (ONNX, TFLite) in src/optimization/
Bambara Waxal-VITS TTS working
Memory loop wired into UI
Agricultural domain vocabulary and intent model

Missing or weak

data/vocabulary.jsonl is empty — no local snapshot of user-taught words
LoRA fine-tuning still crashes on Kaggle T4 (active blocker per project notes)
Fula TTS is a placeholder — no trained ous-sow/fula-tts yet
No real-user evaluation set (no data/eval/ folder with farmer recordings); all quality numbers currently come from FLEURS, which does not reflect real conditions
No documented tone-handling policy for TTS (Bambara tone is unmarked in writing but matters for pronunciation)

Part 3 — Actionable next steps (ordered by leverage)

Step 1 — Fix the LoRA training crash on Kaggle

Highest leverage. Unblocks every ASR quality gain downstream.

Reproduce the exact error on a Kaggle T4 runtime
Pin datasets to a known-good version (either pre-4.x, or the correct torchcodec pin for 4.x)
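If the AMP (Automatic Mixed Precision) scaler is the issue, either disable AMP and train in fp32, or keep fp16 with a working gradient scaler. bf16 is not a realistic escape hatch here: the T4 is a Turing-generation GPU without native bf16 support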
Validate with a tiny 100-sample training job before a full run
Commit a working Bambara adapter before moving on
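A version pin is only useful if the notebook fails loudly when the pin is not in effect. The guard below is a sketch of that idea; check_max_major is a hypothetical helper, and the choice of which major version to allow is yours to make after reproducing the crash.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '3.6.0' or '4.0.0rc1'."""
    return int(version.split(".")[0])


def check_max_major(package: str, max_major: int, lookup=metadata.version) -> str:
    """Raise early if `package` is installed at a major version above `max_major`.

    `lookup` defaults to importlib.metadata.version and exists so the check
    can be tested without installing anything.
    """
    installed = lookup(package)
    if major_version(installed) > max_major:
        raise RuntimeError(
            f"{package}=={installed} is newer than the pinned major {max_major}; "
            "reinstall with a pin before training"
        )
    return installed
```

Calling check_max_major("datasets", 3) in the first notebook cell would stop a run within seconds instead of after the data-loading step has already failed.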

Step 2 — Build a real-user evaluation set

Do this in parallel with Step 1.

Record 50-100 Bambara utterances from at least 3 native speakers
Include noisy conditions (wind, motorcycle, livestock — noise_samples/ already anticipates this)
Transcribe by hand; store under data/eval/bambara_field.jsonl
Run the current stack and record baseline WER (Word Error Rate) and CER (Character Error Rate)
From here on, measure all changes against this set, not FLEURS
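WER and CER are both edit distance divided by reference length, computed over words and characters respectively. The sketch below implements them in plain Python so the baseline can be recorded without extra dependencies; it does no text normalization, so apply the project's normalizer to both strings first.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is worth remembering when a zero-shot model hallucinates.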

Step 3 — Exercise the memory loop end-to-end

Run 10 live teaching sessions
Confirm local JSONL grows; confirm HuggingFace Hub push
Add a test under tests/ that mocks the Hub and validates the write path
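The test in the last bullet could look roughly like this. save_word is a hypothetical stand-in for the real write path in memory_manager.py, written here only so the example is self-contained; the point is the shape of the test, which replaces the Hub uploader with a Mock and asserts on the local file.

```python
import json
import tempfile
import threading
from pathlib import Path
from unittest.mock import Mock

_lock = threading.Lock()


def save_word(path: Path, record: dict, push_to_hub) -> None:
    """Stand-in write path: append one JSON line locally, then push the record."""
    with _lock:
        with path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    push_to_hub(record)


def test_write_path_grows_jsonl_and_pushes():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "vocabulary.jsonl"
        hub = Mock()  # replaces the real Hub client; no network access
        save_word(path, {"word": "w1", "translation": "t1"}, hub)
        save_word(path, {"word": "w2", "translation": "t2"}, hub)
        lines = path.read_text(encoding="utf-8").splitlines()
        assert [json.loads(line)["word"] for line in lines] == ["w1", "w2"]
        assert hub.call_count == 2


test_write_path_grows_jsonl_and_pushes()
```

In the real test suite the Mock would be injected wherever the memory manager creates its Hub client, for example through a constructor argument or unittest.mock.patch.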

Step 4 — Train `ous-sow/fula-tts`

Can run in parallel on RunPod
Need 1-3 hours of clean studio audio from a single Fula speaker
Same VITS recipe as Waxal Bambara

Step 5 — Close Phase 3 voice-to-voice parity

Once Fula TTS exists, test the full voice-in → voice-out pipeline for both languages
Measure round-trip CER: spoken sentence → transcript → response → synthesized speech → re-transcribe → compare
Catches compounding errors across layers

Step 6 — Small field test

Five Malian farmers. Cheap version: WhatsApp voice messages or a phone call with screen-shared Gradio
Log what they try to ask, whether the response is intelligible, whether they'd use it again
Success metric: do they ask a second question without being prompted?

Step 7 — Write a tone-handling policy

Pick a position: "accept tonally-wrong TTS on homographs as a known limitation" vs "invest in tone annotation for TTS training corpus in cycle N+1"
Either is defensible. The bad option is leaving it unspoken.

Part 4 — If I were starting from zero today

Realistic assumptions: solo maintainer, nights-and-weekends pace, free or cheap compute (Kaggle free T4, HuggingFace Spaces cpu-basic, occasional RunPod for bigger runs), access to native speakers (this one is non-negotiable — if you don't have them, stop and find them first).

The single most important lesson from Sahel-Voice-Lab as it exists today: it does four products' worth of things at once (agricultural IoT, self-teaching, multi-language, voice cloning). If starting over, I'd ship one at a time.

Month 0 — Before writing any code

Pick a narrow use case. Not "general Bambara assistant." Something like "voice queries for soil moisture" or "learn 100 agricultural words." One domain, one job.
Identify 3-5 native speakers willing to test throughout. Get their phone numbers. Ask now, not later.
Map the data landscape. Write a one-page doc listing every Bambara dataset you find: FLEURS (bam_ML), RobotsMali Jeli-ASR, OpenSLR, Masakhane resources, Common Voice. Note size, license, quality.
Decide: Bambara only for the first version. Fula comes later. Do not start bilingual.

Month 1 — Text-first prototype (no audio yet)

Wire the LLM (Qwen via HuggingFace Inference) with a carefully-written system prompt. Start in French or English; have it answer in Bambara.
Build a Gradio text-in / text-out demo. Deploy to a HuggingFace Space on cpu-basic.
Write the normalizer (the bam_normalize.py equivalent) with real tests. Spend real time on this; the audit you already did on the alphabet is the specification.
Show it to your native speakers. Is the Bambara intelligible? Are the answers right?
Do not add STT or TTS yet. This stage's only job is to learn what the LLM knows about Bambara and what it doesn't.

Month 2 — Add the ear, and the eval set

Build the evaluation set before training anything. 50 utterances, 3 speakers, hand-transcribed. This is the most "wish I'd done this earlier" advice in low-resource ASR.
Try Whisper-large-v3-turbo zero-shot on your eval set. Record the baseline WER. It will probably be 60-80%.
Only then start LoRA fine-tuning with FLEURS + Jeli-ASR on Kaggle T4. Target: WER from ~70% to ~30% within four weeks.
Wire the trained adapter into the Gradio app.

Month 3 — Add the mouth (baseline quality)

Use MMS-TTS Bambara. One API call. It sounds robotic but it speaks.
Ship this as the "Phase 1 complete" milestone on HuggingFace Spaces. This is a real product now: voice in, voice out.
Collect 50-100 field interactions. Log everything.

Month 4 — Memory loop

Build the teach-new-word flow.
JSONL on disk + HuggingFace dataset push.
Add the "curiosity" feature (system occasionally asks the user to teach it a word).
Exercise it with real users before declaring it done. An empty vocabulary.jsonl is a sign the loop was never really tested.

Month 5 — Upgrade TTS

Record 1-3 hours of studio audio with a single native speaker reading from a curated script that covers your domain vocabulary. This is the single biggest quality jump in the whole project.
Train a VITS model (Waxal-style). Swap MMS out for it.
Compare side-by-side with native listeners. Keep MMS as fallback.

Month 6 — Field test and iterate

Five farmers. Phone calls or WhatsApp. Real conditions.
Success metric: do they ask a second question unprompted? Do they come back tomorrow?
Expect this stage to reshape priorities. Follow the feedback; do not defend the roadmap.

Month 7+ — Everything else

Second language (Fula / Adlam): only after Bambara is stable
Voice cloning (F5-TTS)
Mobile / offline export (ONNX, TFLite)
IoT sensor integration
FastAPI service alongside the Gradio app

Things I would deliberately do differently

Ship the ugliest possible version at Month 3, not the polished pipeline at Month 9. Five farmers with a robotic voice tell you more than 500 hours of benchmark tuning.
Build the evaluation set in Month 2, not later. Every decision compounds; without an eval, you cannot tell which decisions to keep.
One language, one entry point, one framework at a time. The current project has FastAPI + Gradio + Kaggle + ONNX + TFLite + bitsandbytes + speaker ID + voice cloning. Each is a maintenance commitment. Add them only when the product's existence justifies them.
No training your own ASR adapter until the LLM/TTS product has been tested. Whisper zero-shot is good enough to validate the product idea. Training is expensive; you might end up optimizing a layer users don't care about.
Native speakers as collaborators, not testers at the end. Monthly review calls from Month 1, not Month 6.

One-sentence summary

If I were starting from zero today, I would ship a narrow, ugly, one-language, text-first version to five real native-speaker users in the first three months, and build everything else on top of the feedback from those five people.

Part 5 — Expanded walkthrough: why, how, and where Sahel-Voice-Lab fits

Each stage below has three sections: Why (the purpose — why this stage exists and what breaks if you skip it), How (concrete mechanics — files, commands, tools, decisions), and Current project status (what you have, what's missing, relative to this stage).

Stage A — Scoping and data audit (Month 0)

Why. The single biggest failure mode in low-resource voice AI is attempting a "general Bambara assistant." You cannot measure general; you cannot ship general; you cannot collect targeted data for general. You need one narrow domain so vocabulary is bounded, users can be found, failures are diagnosable, and every subsequent decision has a clear yes/no test: "does this help a farmer query soil moisture?" A bad scope locks in months of wasted work.

How. Write a one-page scoping document that answers: (1) what is the single first use case — one sentence, measurable; (2) who is the first user — names, phone numbers, what language variety they speak; (3) what does success look like in three months — one metric, not five. Then write a data audit: every public Bambara dataset with size, license, quality, and known issues. FLEURS (bam_ML), RobotsMali Jeli-ASR, OpenSLR, Masakhane, Common Voice. Note what's missing — domain vocabulary usually is.

Current project status. Stage A is implicitly done. The domain is "agricultural voice interface for Sahelian farmers." The data sources are identified and wired (src/data/waxal_loader.py, src/data/web_harvester.py, FLEURS referenced in training configs). The one thing weakly documented is the target user profile — which region, which dialect, what level of literacy, what phones they use. Writing this down explicitly (even as a one-paragraph persona in the README) tightens every downstream decision.

Stage B — Text-first prototype (Month 1)

Why. Before introducing audio, you need to know what the LLM actually knows about Bambara and what it doesn't. If the text-in/text-out experience is bad, adding voice will not save it; voice only adds more failure modes. Text prototyping is cheap — one deployment, no GPU, a few prompts — and teaches you the vocabulary gap you will spend the rest of the project closing.

How. Call a hosted multilingual LLM (Qwen, Mistral, Gemma) via HuggingFace Inference with huggingface-hub's InferenceClient. Write a careful system prompt — the "adult-child" contract: LLM acts like a patient teacher, returns structured JSON with fields {intent, reply_bm, reply_fr, confidence}. Deploy a Gradio text-in/text-out interface to a HuggingFace Space on cpu-basic. Show it to two native speakers; ask what sounds wrong. Spend real time on the normalizer at this stage — the orthography audit (ɛ ↔ e, ɔ ↔ o, ɲ ↔ ny/gn, ŋ ↔ ng, 1967 vs older forms, and the ny ambiguity between palatal nasal and nasal + palatal glide) is the specification.
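One low-risk piece of the normalizer is a comparison key: a lossy folding that maps the spelling variants listed above onto one form, for phrase matching or scoring only. Folding toward the digraphs never has to decide whether "ny" is one sound or two, so it sidesteps that ambiguity. This is a sketch under that assumption; match_key is a hypothetical name, not the bam_normalize.py implementation, and its output should never be shown to users.

```python
import unicodedata

# Lossy folding for comparison only; pairs follow the orthography audit above.
_FOLD = {"ɛ": "e", "ɔ": "o", "ɲ": "ny", "ŋ": "ng"}


def match_key(text: str) -> str:
    """Return a key that ignores case, extra spaces, and the listed spelling variants."""
    text = unicodedata.normalize("NFC", text).lower()
    # Assumption: "gn" is treated as the older digraph for the palatal nasal.
    text = text.replace("gn", "ny")
    for modern, folded in _FOLD.items():
        text = text.replace(modern, folded)
    return " ".join(text.split())
```

Two spellings that share a key are candidates for being the same word, not proof of it; distinct words can collide after folding, so keep the original text alongside the key.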

Current project status. Stage B is done. src/llm/gemma_client.py implements the adult-child JSON contract against Qwen. src/data/bam_normalize.py handles the orthographic cleanups. The Gradio app has been deployed. This stage is behind you.

Stage C — The ear: STT plus the evaluation set (Month 2)

Why. This is the stage with the highest "wish I'd done it earlier" rate in low-resource ASR. You need a real-user evaluation set before you train anything, because training without an eval is hill-climbing in the dark. FLEURS numbers do not predict field performance; field recordings do. Only after an eval exists is it worth investing Kaggle hours in fine-tuning.

How. First, the eval set. Ask three native speakers to each record 15-20 utterances covering your domain vocabulary. Use their actual phones, in their actual environments (not a quiet office). Transcribe by hand. Store under data/eval/bambara_field.jsonl as {audio_path, transcript, speaker_id, noise_conditions}. Run Whisper-large-v3-turbo zero-shot against it. Record the baseline WER (Word Error Rate) and CER (Character Error Rate) numbers in the repo somewhere durable (docs/metrics.md). Only then: start LoRA fine-tuning with FLEURS + Jeli-ASR on Kaggle T4. Each training run is measured against your eval set, not against FLEURS.
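Once the eval file exists, every run reduces to one loop: transcribe each clip, score it against the hand transcript, and average per noise condition. A minimal sketch follows, assuming the record shape given above; the transcribe and metric callables are placeholders for the project's transcriber and for whatever WER or CER function you use.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def load_eval(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines shaped like {audio_path, transcript, speaker_id, noise_conditions}."""
    return [json.loads(line) for line in lines if line.strip()]


def score_by_condition(records: List[dict],
                       transcribe: Callable[[str], str],
                       metric: Callable[[str, str], float]) -> Dict[str, float]:
    """Average metric(reference, hypothesis) for each noise condition."""
    scores = defaultdict(list)
    for rec in records:
        hypothesis = transcribe(rec["audio_path"])
        scores[rec["noise_conditions"]].append(metric(rec["transcript"], hypothesis))
    return {cond: sum(vals) / len(vals) for cond, vals in scores.items()}
```

Splitting by noise condition matters because a single averaged number can hide a model that is fine in a quiet room and unusable next to a motorcycle.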

Current project status. You are mostly here — with two important gaps. The Whisper + LoRA + adapter-swap pipeline is built (src/engine/whisper_base.py, src/engine/adapter_manager.py, src/engine/transcriber.py). Training infrastructure exists (src/training/trainer.py, notebooks/kaggle_master_trainer.ipynb). However: (1) there is no data/eval/ folder with real farmer recordings, and (2) the LoRA fine-tuning pipeline still crashes on Kaggle T4 per your project notes. These are your two most important current blockers. Until they resolve, every other ASR improvement is speculative.

Stage D — The mouth: baseline TTS and first ship (Month 3)

Why. Shipping an ugly working product beats polishing a pretty broken one. The first voice-in/voice-out deployment reveals failure modes no amount of offline testing catches — wake-word confusion, ambient noise you didn't model, users speaking too fast or too softly, compounding latency that makes the system feel dead. You cannot learn these from benchmarks; you learn them from users. Ship at the robotic-voice MMS-TTS baseline, then improve.

How. Wire MMS-TTS Bambara (facebook/mms-tts-bam) into the Gradio app: load it with VitsModel from transformers, then add audio post-processing. Return audio as a Gradio gr.Audio output. Deploy. Write a very short intro text explaining this is a prototype. Share the Space URL with two native-speaker testers, tell them nothing about how it works, ask them to try three things.
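The audio post-processing step is mostly about turning the model's float waveform into something a phone can play. The sketch below peak-normalizes a sequence of float samples and encodes it as 16-bit mono WAV using only the standard library. It assumes you have already converted the model output to a flat list of floats, and that you read the sample rate from the model config rather than hard-coding it; floats_to_wav_bytes is a hypothetical helper name.

```python
import io
import struct
import wave
from typing import Sequence


def floats_to_wav_bytes(samples: Sequence[float], sample_rate: int) -> bytes:
    """Peak-normalize float samples and encode them as 16-bit mono PCM WAV."""
    peak = max((abs(s) for s in samples), default=0.0)
    scale = 32767 / peak if peak > 0 else 0.0  # silent input stays silent
    pcm = struct.pack(f"<{len(samples)}h",
                      *(int(round(s * scale)) for s in samples))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Peak normalization is the simplest choice; it makes every reply as loud as possible without clipping, which tends to matter on small phone speakers.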

Current project status. Stage D is done. MMS-TTS is wired (src/tts/mms_tts.py), the Gradio Space is deployed, Phase 1 has shipped per your notes. Two things that might be worth auditing: whether the deployed Space is still on the MMS fallback or already on Waxal-VITS, and whether there is any logging/telemetry on usage to tell you whether real people are actually touching the deployed Space.

Stage E — The memory loop (Month 4)

Why. The model does not know most Bambara vocabulary; users do. Without a mechanism to capture and persist what they teach, every conversation's knowledge dies with the session. The memory loop is the product's data-collection engine — the thing that lets it get better over time without you personally labeling data. This is also the core differentiation of Sahel-Voice-Lab versus a generic Bambara ASR+TTS demo.

How. Three components. (1) A teach-new-word flow in the UI: the user says "this is how you say X," the system confirms, stores to data/vocabulary.jsonl as {word, translation, speaker_id, timestamp, audio_ref}. (2) An async push to a versioned HuggingFace dataset (ous-sow/sahel-agri-feedback). (3) A "curiosity" mechanism where every N turns the LLM is prompted to identify a vocabulary gap and ask the user — this inverts the teaching initiative and collects more data per session.
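Components (1) and (3) are small enough to sketch. The record builder below uses the field names given above, and the counter fires on every N-th user turn. Both are hypothetical illustrations and not the project's MemoryManager or CuriosityEngine; the value of N is a tuning decision.

```python
from datetime import datetime, timezone


def vocabulary_record(word: str, translation: str,
                      speaker_id: str, audio_ref: str) -> dict:
    """Build one vocabulary.jsonl entry with a UTC ISO-8601 timestamp."""
    return {
        "word": word,
        "translation": translation,
        "speaker_id": speaker_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "audio_ref": audio_ref,
    }


class CuriosityCounter:
    """Signals when the assistant should ask the user to teach it a word."""

    def __init__(self, every_n_turns: int):
        if every_n_turns < 1:
            raise ValueError("every_n_turns must be >= 1")
        self.every_n_turns = every_n_turns
        self.turns = 0

    def on_turn(self) -> bool:
        """Call once per user turn; returns True on every N-th turn."""
        self.turns += 1
        return self.turns % self.every_n_turns == 0
```

When on_turn() returns True, the app would add a "find a vocabulary gap and ask about it" instruction to the next LLM prompt instead of answering normally.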

Current project status. Stage E is structurally done but likely not exercised. src/memory/memory_manager.py implements the thread-safe JSONL + Hub push. src/engine/curiosity.py implements the CuriosityEngine. The Gradio app has a Teaching tab. However, your local data/vocabulary.jsonl is empty (0 lines). This means one of three things: (a) no one has used the teach flow yet, (b) the write path is broken and you haven't noticed because no one has used it, or (c) data goes only to the Hub and you've never pulled a snapshot locally. Worth a 20-minute investigation to confirm which. A test in tests/ that mocks the Hub and asserts the local JSONL write is cheap insurance.

Stage F — Upgraded TTS (Month 5)

Why. MMS-TTS works but sounds robotic, and users notice immediately. Moving to a single-speaker VITS model trained on 1-3 hours of clean studio audio is the single biggest perceived-quality jump in the entire pipeline. It also gives you something MMS cannot: a consistent, identifiable voice that users remember. For long-term adoption, voice identity matters as much as intelligibility.

How. Record 1-3 hours of studio audio with one native speaker reading a curated script that covers your domain vocabulary plus conversational filler. Target: quiet room, decent USB mic, 22050 or 44100 Hz, single take per sentence. Align transcripts, clean silence, normalize loudness. Train a VITS model on your RunPod GPU (a Kaggle T4 usually does not have enough memory for full VITS). Publish to HuggingFace as a private or public model repo. Swap out MMS in the TTS dispatcher, keep MMS as fallback.

Current project status. Stage F is done for Bambara, not for Fula. The Waxal VITS integration lives in src/tts/waxal_tts.py and per your notes is partially shipped for Bambara (ynnov/ekodi-bambara-tts-female). Fula TTS is a placeholder — ous-sow/fula-tts does not exist yet. Closing this is one of your active goals. The recording session is usually the bottleneck, not the training.

Stage G — Field test (Month 6)

Why. Everything before this stage is technical. This stage is where you find out whether the technical work produced something humans actually use. It's also where you discover that three of your prior assumptions were wrong — assumptions you could not have tested any other way. Every low-resource voice project that skips this stage ends up polished and unused.

How. Five native-speaker users. Cheapest version: WhatsApp voice messages or a phone call with screen-shared Gradio. Give them a small task ("ask about your soil moisture"), observe without coaching. Record what they try to ask, whether the transcript is right, whether the answer is intelligible to them, whether they would use it unprompted again. The success metric is not WER. It is: does the user ask a second question they came up with themselves?
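The success metric is easy to compute if each session is logged as an ordered list of questions with a flag for whether you suggested it. A minimal sketch, assuming a hypothetical log shape where each question is a dict with a boolean "prompted" field and the first entry is the task you assigned:

```python
from typing import Dict, List

Session = List[Dict]  # ordered questions, each like {"text": ..., "prompted": bool}


def asked_own_second_question(session: Session) -> bool:
    """True if any question after the first was the user's own idea."""
    return any(not q["prompted"] for q in session[1:])


def second_question_rate(sessions: List[Session]) -> float:
    """Fraction of sessions in which the user asked an unprompted follow-up."""
    if not sessions:
        raise ValueError("no sessions logged")
    hits = sum(1 for s in sessions if asked_own_second_question(s))
    return hits / len(sessions)
```

With five users the rate is a coarse signal (it moves in steps of 0.2), so read it together with the session notes, not as a number to optimize.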

Current project status. Stage G is not done. There is no field-test evidence in the repo, no usage logs, no session transcripts from actual farmers. This is, honestly, the single largest gap between where the project is and where it needs to be — more important than the Kaggle crash or the missing Fula TTS. You can ship a field test with what you have today and the feedback will reshape everything downstream.

Stage H — Expansion (Month 7+)

Why. Only once a single-language, single-domain product has real users do you earn the right to expand. Each added dimension (second language, voice cloning, mobile export, IoT integration) doubles surface area for bugs and maintenance. Adding them in parallel to the core product means you will ship nothing well; adding them after the core is stable means each addition builds on a known-good base.

How. Second language (Fula/Adlam): repeat stages B through G with the new language, reusing infrastructure but refitting normalization and TTS training. Voice cloning: F5-TTS or OpenVoice, keyed to a speaker embedding from the speaker-ID layer. Mobile export: ONNX per language, then TFLite via onnx-tf, then bundle into a thin Android app. IoT integration: FastAPI service in front of the sensor bridge, authenticated, cached.

Current project status. You are ahead of schedule here, which is the diagnostic. Phase 3 voice-to-voice is merged and stabilizing. F5-TTS is scaffolded (src/tts/f5_tts.py). OpenVoice-based voice cloning is scaffolded (src/tts/voice_cloner.py). Speaker ID with ECAPA-TDNN is in place (src/voice/speaker_profiles.py). Adlam/Pular integration has landed. ONNX and TFLite exporters exist (src/optimization/). A FastAPI service is scaffolded (src/api/). This is Month 7+ work already in the codebase. The issue is not that this work is wrong — it is that it was built before Stages C (eval set), E (loop exercised with real data), and G (field test) were actually completed. The risk is building a polished Stage H surface on an unmeasured Stage C-E foundation.

Where you actually are right now

The honest diagnosis of Sahel-Voice-Lab as of 2026-04-19, mapped onto the staged plan:

Done: Stages A, B, D. Bambara text and audio pipeline ships to users via Gradio on HF Spaces. The LLM contract is stable. Normalization is implemented.

Partially done: Stage C (ASR pipeline built but no field eval set, training still crashes on Kaggle), Stage E (memory loop built but vocabulary.jsonl empty — not yet exercised with real users), Stage F (Bambara TTS upgraded, Fula TTS still placeholder).

Not done: Stage G (no field test with real farmers).

Ahead of schedule: Stage H (Phase 3 voice-to-voice, voice cloning, Adlam/Pular, ONNX/TFLite, FastAPI — all built in parallel with, or before, completing C/E/G).

The path forward, ordered by leverage: (1) fix the Kaggle LoRA crash so Stage C can continue; (2) build the real-user eval set so Stage C has a measurement foundation; (3) exercise the memory loop with three real users so Stage E is confirmed; (4) run a small field test so Stage G is unblocked; (5) train ous-sow/fula-tts so Stage F closes for Fula; (6) return to Stage H work with actual user signal guiding priorities.

Everything the project is missing is measurement. Everything the project has is implementation. That is a recoverable position, but only if the measurement work now gets the same weight the implementation work has had.
