ground-zero / project-context.txt
Broulaye Doumbia
================================================================================
PROJECT CONTEXT — sahel-agri-voice
Generated: 2026-04-17
================================================================================
PROJECT NAME
------------
Sahel-Voice-Lab / Sahel-Agri Voice AI
(HuggingFace Space title: "Sahel-Voice-Lab", Phase 1: "The Memory Loop")
PURPOSE
-------
A voice-first, self-learning AI assistant for two West African languages —
Bambara (bam, spoken in Mali) and Fula/Pular (ful, spoken in Guinea and
Senegal) — targeted at farmers in the Sahel region.
The system has two complementary capabilities:
1. LANGUAGE-LEARNING MEMORY LOOP (Phase 1)
The assistant behaves like an "eager child learner." Users teach it
Bambara/Fula words ("I ni ce means hello") via voice or text; an LLM
detects the teaching intent and the word pair is persisted to a
HuggingFace Hub dataset (ous-sow/sahel-agri-feedback → vocabulary.jsonl)
so knowledge accumulates across sessions and users. The vocabulary is
then injected into the LLM's system prompt as its source of truth for
answering questions.
2. AGRICULTURAL IoT VOICE INTERFACE
Farmers speak questions in their own language ("how is the soil?",
"is it going to rain?"). Whisper transcribes, an intent parser keyword-
matches Bambara/Fula agricultural terms (soil, rain, irrigation, pest),
a sensor bridge fetches data from an IoT backend (or mock data), and
VoiceResponder + a TTS engine reply in short Bambara/Fula sentences
with alert thresholds (e.g. "Bunding ji dɔgɔ. I ka foro ji." =
"Soil moisture is low. Irrigate your field.").
The project is deployed as a HuggingFace Space (Gradio frontend) with an
optional FastAPI service. The core stack (Whisper / Qwen / F5-TTS / VITS) is
described as "100% non-Meta". The Meta models listed under TTS
(facebook/mms-tts-*) serve as the Phase 1 engine and as the fallback when
F5-TTS is missing; they are not part of the main loop.
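As a rough illustration of the alert-threshold replies in capability 2, the sketch below maps a soil-moisture reading to the Bambara/English pair quoted above. The value 30 is the SOIL_MOISTURE_LOW constant listed for voice_responder.py. The function names, the percent unit, and the strict "<" comparison are assumptions, not the project's actual code.

```python
from typing import List, Optional, Tuple

SOIL_MOISTURE_LOW = 30  # from the voice_responder.py notes; percent unit is assumed
MAX_WORDS_PER_SENTENCE = 6  # short sentences are kept for clean MMS-TTS output


def words_per_sentence(text: str) -> List[int]:
    """Count the words in each period-terminated sentence."""
    return [len(s.split()) for s in text.split(".") if s.strip()]


def soil_reply(moisture: float) -> Optional[Tuple[str, str]]:
    """Return (bambara, english) when the reading is low, else None."""
    if moisture < SOIL_MOISTURE_LOW:  # strict comparison is an assumption
        return (
            "Bunding ji dɔgɔ. I ka foro ji.",
            "Soil moisture is low. Irrigate your field.",
        )
    return None
```

Calling soil_reply(22.0) yields the alert pair, while a reading at or above the threshold yields None, which leaves the choice of a normal-status sentence to the caller.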
FULL TECH STACK
---------------
Deployment / hosting
- HuggingFace Spaces (Gradio SDK 5.25.0, hardware: cpu-basic)
- Kaggle notebooks (T4 GPU) for training runs
- RunPod as an alternative training environment
- HF Hub datasets as persistent vocabulary + feedback store
Frontend
- Gradio 5.25.0 (app.py — main UI; app_lab.py — experimental lab UI)
Backend API
- FastAPI (src/api/app.py via create_app() + lifespan)
- Pydantic v2 (schemas)
- httpx (async calls to IoT sensor backend)
Speech-to-text (STT)
- openai/whisper-large-v3-turbo (default backbone)
- transformers 5.5.0 (WhisperForConditionalGeneration, WhisperProcessor)
- PEFT (LoRA adapters, hot-swappable per language)
- accelerate 1.13.0
- librosa 0.10.2, soundfile 0.12.1, torchaudio
LLM (reasoning / teaching-intent detection)
- Qwen/Qwen2.5-72B-Instruct (default, via HF Serverless Inference)
- Qwen/Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Zephyr-7b-beta
as faster alternatives
- huggingface-hub 1.9.0 InferenceClient
Text-to-speech (TTS)
- Phase 1: facebook/mms-tts-bam, mms-tts-ful, mms-tts-fra, mms-tts-eng
- Phase 2: ynnov/ekodi-bambara-tts-female (VITS)
+ placeholder ous-sow/fula-tts
- F5-TTS (SWivid/F5-TTS) for GPU voice cloning (optional, ~2GB)
- OpenVoice V2 (myshell-ai/openvoice-v2) for tone-color conversion
- SpeechBrain ECAPA-TDNN for speaker identification (per-user profiles)
Data / datasets
- google/fleurs (bam_ML, ff_SN) as STT training corpus
- RobotsMali/jeli-asr, google/fleurs ff_SN, and bm/ff Wikipedia text,
harvested via src/data/web_harvester.py
- datasets 4.8.4 (+ torchcodec for 4.x audio decoding)
- Adlam ↔ Latin transliteration for Guinea Pular
Training / fine-tuning
- PEFT LoRA + Seq2SeqTrainer
- jiwer 3.0.4 (WER / CER metrics)
- Custom callbacks: EarlyStoppingOnWER, AdapterCheckpointCallback
- FieldNoiseAugmenter (tractor / wind / livestock noise mixing)
Optimization / edge deploy
- optimum[onnxruntime] → per-language ONNX export
- onnx-tf / TensorFlow → TFLite for Android
- bitsandbytes NF4 / 8-bit quantization (training environments)
Utilities / runtime
- PyYAML 6.0.2, python-dotenv 1.1.0
- NumPy 2.2.4, SciPy 1.15.2
- rapidfuzz 3.13.0 (fuzzy phrase matching)
- pypdf, python-docx (Knowledge Base upload → vocabulary.jsonl)
- Kaggle API (Self-Teaching tab triggers training runs)
- ffmpeg (packages.txt — sole system-level dep)
Environment variables
HF_TOKEN, FEEDBACK_REPO_ID (ous-sow/sahel-agri-feedback),
LLM_MODEL_ID, BAMBARA_ADAPTER_PATH, FULA_ADAPTER_PATH,
SENSOR_API_URL, BAMBARA_TTS_REPO, FULA_TTS_REPO, DEVICE, LOG_LEVEL
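A minimal sketch of how these variables might be read at start-up. The Settings class and load_settings function are hypothetical names. The FEEDBACK_REPO_ID and LLM_MODEL_ID defaults come from the values quoted in this document, while the "cpu" and "INFO" defaults are assumptions. Only a subset of the variables is shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    feedback_repo_id: str
    llm_model_id: str
    sensor_api_url: Optional[str]
    device: str
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        feedback_repo_id=env.get("FEEDBACK_REPO_ID", "ous-sow/sahel-agri-feedback"),
        llm_model_id=env.get("LLM_MODEL_ID", "Qwen/Qwen2.5-72B-Instruct"),
        sensor_api_url=env.get("SENSOR_API_URL"),  # None could mean "use mock data"
        device=env.get("DEVICE", "cpu"),            # assumed default
        log_level=env.get("LOG_LEVEL", "INFO"),     # assumed default
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests without touching the real environment.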
KEY SOURCE FILES AND WHAT THEY DO
---------------------------------
Top-level entry points
app.py
Gradio UI (~99 KB). Main user-facing application running on the HF Space.
Wires STT → LLM → memory → TTS, exposes the Conversation / Teaching /
Knowledge Base / Self-Teaching tabs.
app_lab.py
Experimental/lab Gradio UI used to prototype new features
(e.g. CuriosityEngine integration) before folding into app.py.
setup.sh
Shell bootstrap for local + RunPod environments.
src/api/ — FastAPI service (alternative to Gradio-only deploy)
app.py FastAPI factory with async lifespan: loads Whisper backbone
once, registers bam/ful adapters, pre-loads 'bam', attaches
Transcriber + SensorBridge to app.state.
dependencies.py FastAPI DI helpers to pull shared objects off app.state.
middleware.py CORS / logging middleware registration.
schemas.py Pydantic v2 request/response models.
routes/health.py GET /health — model status + loaded adapters.
routes/transcribe.py POST /transcribe — audio → text, 10 MB cap,
wav/mp3/ogg/m4a/flac/webm.
routes/iot.py POST /query — full pipeline: audio → transcribe → intent
→ sensor → voice response (IoTQueryResponse).
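The upload limits listed for POST /transcribe could be enforced with a small pre-check like the one below. This is a sketch only: validate_upload is a hypothetical helper, and treating "10 MB" as 10 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB cap; binary megabytes assumed
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".m4a", ".flac", ".webm"}


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file type or size falls outside the limits."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"file too large: {size_bytes} bytes")
```

In a FastAPI route the ValueError would typically be translated into an HTTP 4xx response; which status codes the real service returns is not stated here.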
src/engine/ — STT core
whisper_base.py Singleton loader for WhisperForConditionalGeneration +
WhisperProcessor. FP16 on CUDA, FP32 on CPU. free()
releases VRAM.
adapter_manager.py Hot-swap LoRA adapters via PEFT's multi-adapter API:
first load ~2s, subsequent set_adapter ~50ms.
Keeps one backbone in VRAM and swaps ~50MB adapters.
transcriber.py Public inference API. Handles ≤30s chunks directly,
>30s by slicing into 30s windows. Returns
TranscriptionResult (text, language, duration_s,
processing_time_ms, confidence).
stt_processor.py Extracts a confidence score from Whisper's avg_logprob; a
score below the -1.0 threshold is treated as "confused", and the caller
should ask the user to repeat.
curiosity.py CuriosityEngine — every N interactions, prompts the
LLM to spot a vocabulary gap and ask the user how to
say a missing agricultural term.
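Two of the behaviours described for src/engine/ lend themselves to a short sketch: slicing long audio into 30 s windows and flagging low-confidence transcripts. The function names are hypothetical, the windows are assumed to be non-overlapping, and treating an empty token list as "confused" is an assumption. The 16 kHz rate is the sample rate Whisper models expect.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000       # Whisper operates on 16 kHz audio
WINDOW_S = 30              # window length from the transcriber.py notes
CONFUSED_THRESHOLD = -1.0  # avg_logprob threshold from the stt_processor.py notes


def split_into_windows(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    window_s: int = WINDOW_S,
) -> List[Sequence[float]]:
    """Return the audio unchanged if it fits in one window, else 30 s slices."""
    window = sample_rate * window_s
    if len(samples) <= window:
        return [samples]
    return [samples[i:i + window] for i in range(0, len(samples), window)]


def is_confused(token_logprobs: Sequence[float]) -> bool:
    """True when the mean token log-probability falls below the threshold."""
    if not token_logprobs:
        return True  # nothing decoded; assumed to count as "confused"
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return avg_logprob < CONFUSED_THRESHOLD
```

A 70 s clip would produce windows of 30 s, 30 s and 10 s. The last, shorter window is left unpadded here, since Whisper's feature extractor pads inputs to 30 s itself.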
src/llm/
gemma_client.py Wraps HF Serverless InferenceClient. Implements the
"adult-child" system prompt that returns structured
JSON with intent ∈ {teaching, question, conversation,
error}. Parses JSON out of optional markdown fences.
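Parsing JSON out of optional markdown fences can be sketched as follows. parse_llm_reply is a hypothetical name, and the "word" and "meaning" keys in the usage note are illustrative; the real response schema is not shown in this document. Mapping any malformed reply to intent "error" is also an assumption.

```python
import json
import re

VALID_INTENTS = {"teaching", "question", "conversation", "error"}

# The fence marker is built programmatically so this example contains
# no literal triple backticks of its own.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_llm_reply(raw: str) -> dict:
    """Extract the JSON object from an LLM reply, fenced or not."""
    match = _FENCE_RE.search(raw)
    payload = match.group(1) if match else raw.strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return {"intent": "error", "raw": raw}
    if not isinstance(data, dict) or data.get("intent") not in VALID_INTENTS:
        return {"intent": "error", "raw": raw}
    return data
```

A fenced reply such as {"intent": "teaching", "word": "I ni ce", "meaning": "hello"} and the same object without fences both come back as a dict with intent "teaching"; free text that is not JSON comes back with intent "error".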
src/memory/
memory_manager.py Thread-safe vocabulary store. Persists to
data/vocabulary.jsonl locally and pushes asynchronously
to HF Hub dataset. Provides get_recent() and a
formatted get_vocabulary_context() for the LLM prompt.
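A minimal sketch of a thread-safe JSONL vocabulary store in the spirit of memory_manager.py. The class name, the entry fields (word, meaning, lang), and the prompt wording are assumptions; the asynchronous push to the HF Hub dataset is deliberately left out.

```python
import json
import threading
from pathlib import Path
from typing import Dict, List


class VocabularyStore:
    """Append-only vocabulary kept in memory and mirrored to a JSONL file."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._lock = threading.Lock()
        self._entries: List[Dict[str, str]] = []
        if self._path.exists():
            with self._path.open(encoding="utf-8") as fh:
                self._entries = [json.loads(line) for line in fh if line.strip()]

    def add(self, word: str, meaning: str, lang: str) -> None:
        entry = {"word": word, "meaning": meaning, "lang": lang}
        with self._lock:  # serialise memory update and file append together
            self._entries.append(entry)
            self._path.parent.mkdir(parents=True, exist_ok=True)
            with self._path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(entry, ensure_ascii=False) + "\n")

    def get_recent(self, n: int = 5) -> List[Dict[str, str]]:
        with self._lock:
            return list(self._entries[-n:])

    def get_vocabulary_context(self) -> str:
        """Format all entries as a block of text for the LLM system prompt."""
        with self._lock:
            lines = [f'- {e["word"]} ({e["lang"]}) = {e["meaning"]}' for e in self._entries]
        return "Known vocabulary:\n" + "\n".join(lines)
```

One line per entry keeps appends cheap and lets a second process reload the file without parsing a large JSON array; ensure_ascii=False preserves characters such as ɔ and ɲ in readable form.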
src/conversation/
phrase_matcher.py RapidFuzz-based matcher over curated JSON phrase
libraries (data/phrases/{lang}.json + _additions.json).
Handles greetings / thanks / farewells without hitting
the LLM.
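The idea of answering fixed phrases without an LLM call can be sketched as below. Note the substitution: this example uses Python's standard-library difflib as a stand-in so that it runs without extra dependencies, whereas the project uses RapidFuzz. The phrase table, the 0.8 cutoff, and the function name are all hypothetical; only "I ni ce" comes from this document.

```python
import difflib
from typing import Optional

# Hypothetical mini phrase library; the real ones live in data/phrases/{lang}.json.
PHRASES = {
    "i ni ce": "greeting",
    "thank you": "thanks",
    "goodbye": "farewell",
}


def match_phrase(text: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the category of the closest known phrase, or None if none is close."""
    query = " ".join(text.lower().split())  # lowercase and collapse whitespace
    hits = difflib.get_close_matches(query, list(PHRASES), n=1, cutoff=cutoff)
    return PHRASES[hits[0]] if hits else None
```

When match_phrase returns None the caller would hand the utterance to the LLM; otherwise a canned reply for the category can be played immediately.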
src/iot/
intent_parser.py Keyword-based Intent classifier
(greeting/thanks/farewell/check_soil/check_weather/
irrigation_status/pest_alert) for bam, ful, fr, en.
Confidence = matched_keywords / total_keywords.
sensor_bridge.py Async bridge to an IoT backend (SENSOR_API_URL) for
soil / weather / irrigation / pest readings.
Falls back to mock random data.
voice_responder.py Maps (Intent, SensorData) → short Bambara/Fula reply
string (≤6 words per sentence for clean MMS-TTS) plus
English translation. Alert thresholds encoded here
(SOIL_MOISTURE_LOW=30, PH bounds, TEMP_HIGH=38, etc.).
Also has a verbose French-language path.
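The confidence rule quoted for intent_parser.py (matched keywords divided by total keywords) can be sketched as follows. The English keyword lists are illustrative placeholders, since the real Bambara/Fula/French/English lists are not reproduced in this document, and picking the highest-scoring intent is an assumption about how ties and misses are handled.

```python
import re
from dataclasses import dataclass

# Hypothetical English keywords per intent.
KEYWORDS = {
    "check_soil": ["soil", "moisture", "ground"],
    "check_weather": ["rain", "weather", "wind", "sun"],
}


@dataclass
class ParsedIntent:
    name: str
    confidence: float


def parse_intent(text: str) -> ParsedIntent:
    """Score each intent as matched_keywords / total_keywords; keep the best."""
    tokens = set(re.findall(r"\w+", text.lower()))
    best = ParsedIntent("unknown", 0.0)
    for name, keywords in KEYWORDS.items():
        matched = sum(1 for kw in keywords if kw in tokens)
        confidence = matched / len(keywords)
        if confidence > best.confidence:
            best = ParsedIntent(name, confidence)
    return best
```

With these lists, "is it going to rain?" matches one of four weather keywords, so the intent is check_weather with confidence 0.25. The ratio penalises intents with long keyword lists, which is worth keeping in mind when adding terms.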
src/data/
agri_dictionary.py Bambara + Fula domain vocab used to bias the Whisper
decoder prompt toward agricultural terms.
waxal_loader.py Streams google/fleurs (bam_ML, ff_SN) — the
replacement for the retired google/waxal dataset.
feature_extractor.py Log-mel spectrogram extraction and batched padding
collator for Whisper Seq2SeqTrainer.
augmentation.py FieldNoiseAugmenter — mixes clean speech with
tractor/wind/livestock samples; falls back to
Gaussian noise.
bam_normalize.py Bambara phonetic normalizer (ou→u, gn/ny→ɲ,
N'Ko-derived standard).
adlam.py Adlam (𞤀𞤣𞤤𞤢𞤥) ↔ Latin transliteration for Pular;
normalize_pular() for ASR preprocessing.
web_harvester.py Harvests RobotsMali/jeli-asr, google/fleurs ff_SN,
and bm/ff Wikipedia into the feedback Hub dataset.
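The two substitutions listed for bam_normalize.py (ou→u, gn/ny→ɲ) can be sketched with a small rule table. Lowercasing, NFC normalisation, and whitespace collapsing are assumptions added for the example, and the real normalizer presumably has more rules than these two.

```python
import re
import unicodedata

# Ordered (pattern, replacement) rules taken from the bam_normalize.py notes.
_RULES = [
    (re.compile("ou"), "u"),
    (re.compile("gn|ny"), "ɲ"),
]


def normalize_bambara(text: str) -> str:
    """Apply the orthographic rules to lowercased, NFC-normalised text."""
    text = unicodedata.normalize("NFC", text.lower())
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return " ".join(text.split())  # collapse runs of whitespace
```

Applying the same function to both training transcripts and model output keeps WER/CER from counting spelling variants as errors.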
src/training/
trainer.py WhisperLoRATrainer, which orchestrates LoRA fine-tuning end
to end (backbone + LoraConfig + WaxalDataLoader +
Seq2SeqTrainer).
metrics.py WER/CER for Seq2SeqTrainer eval loop (via jiwer).
callbacks.py EarlyStoppingOnWER, AdapterCheckpointCallback
(saves adapter-only, not full model).
src/tts/
waxal_tts.py VITS engine wrapping ynnov/ekodi-bambara-tts-female
for Bambara; Fula is a placeholder until
ous-sow/fula-tts is trained.
mms_tts.py Facebook MMS-TTS (bam/ful/fra/eng).
f5_tts.py F5-TTS voice cloning (optional, GPU-only, ~750MB);
gracefully falls back to MMS when missing.
voice_cloner.py OpenVoice V2 tone-color converter — reshapes VITS
audio to a target speaker's voice.
src/voice/
speaker_profiles.py SpeakerProfileManager with SpeechBrain ECAPA-TDNN
(192-d embeddings). Per-user running-average embeddings
for identification + OpenVoice SE for cloning; cosine
similarity ≥ 0.75 attributes to an existing user.
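The matching rule for speaker profiles (running-average embedding per user, cosine similarity of at least 0.75) can be sketched in plain Python. The class and method names are hypothetical, the example works with any embedding length rather than the 192-d ECAPA-TDNN vectors, and the OpenVoice speaker-embedding side is omitted.

```python
import math
from typing import Dict, List, Optional, Sequence

SIMILARITY_THRESHOLD = 0.75  # from the speaker_profiles.py notes


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SpeakerProfiles:
    def __init__(self) -> None:
        self._mean: Dict[str, List[float]] = {}  # user_id -> running mean embedding
        self._count: Dict[str, int] = {}

    def identify(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best-matching user at or above the threshold, else None."""
        best_id, best_sim = None, SIMILARITY_THRESHOLD
        for user_id, mean in self._mean.items():
            sim = cosine(embedding, mean)
            if sim >= best_sim:
                best_id, best_sim = user_id, sim
        return best_id

    def update(self, user_id: str, embedding: Sequence[float]) -> None:
        """Fold a new embedding into the user's running average."""
        n = self._count.get(user_id, 0)
        if n == 0:
            self._mean[user_id] = list(embedding)
        else:
            mean = self._mean[user_id]
            self._mean[user_id] = [(m * n + x) / (n + 1) for m, x in zip(mean, embedding)]
        self._count[user_id] = n + 1
```

A caller would run identify first and create a new user ID when it returns None, then call update with the resolved ID so that the profile drifts toward the speaker's typical voice over time.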
src/optimization/
onnx_exporter.py Merges LoRA into backbone and exports per-language
ONNX (ONNX can't hot-swap adapters at runtime).
quantizer.py BitsAndBytes NF4 / 8-bit quantization for GPU-
constrained deploys (turbo ~3GB → ~1GB VRAM).
tflite_converter.py ONNX → TFLite for offline Android; exports encoder
and decoder separately.
Config / data folders
configs/ base_config.yaml + per-language LoRA configs.
data/ vocabulary.jsonl, phrases/*.json, profiles/, etc.
notebooks/ Kaggle / RunPod fine-tune + TTS training notebooks.
noise_samples/ .wav clips for field-noise augmentation.
scripts/ utility scripts (bootstrap, harvest, eval).
tests/ pytest suite (not installed in HF Spaces runtime).
RECENT GIT COMMITS SUMMARY (last 20)
------------------------------------
The recent history is focused on three concurrent tracks:
1. STT / training stability
- bb78cbf Add torchcodec install for datasets 4.x audio decoding
- 9049ef3 Prepare training stack for RunPod: env-aware notebook +
bootstrap script
- cc50efb Align Whisper default to turbo-v3 + add document upload to
Knowledge Base tab
- c33a061 Fix WhisperProcessor import in reload + upgrade base to
large-v3-turbo
- 7fae91b Fix mel-bin mismatch: load per-language processor from
fine-tuned checkpoint
- 6682858 Fix jiwer crash on post-normalisation empty refs;
register SLR106/105 datasets
- 58f431a Fix SyntaxError in Cell 17: unterminated f-string literal
- 3632a23 Fix compute_metrics crash on empty eval references
in Fula training
- 71bb3bc Fix: add trust_remote_code=True for datasets 3.x compatibility
- cd017e2 Fix Cell 16 ValueError: load model fp32 so AMP gradient scaler
works
2. Language support / Adlam / Pular expansion
- ced078c Add Adlam/Pular Fula integration: transliterator +
3 new datasets + normalisation pipeline
- 40cf84d Fix language mixing: per-language prompts +
Mali Bambara / Guinea Pular context
- 33c3a5a Fix Self-Teaching language detection: parse code from
dropdown label
- 24b1617 Fix Self-Teaching tab: float sliders, deduplication,
Kaggle API fallback
3. Conversation / voice pipeline
- 8952fff Phase 3: Voice-to-Voice S2S pipeline —
F5-TTS, LLM brain, CER metric
- ad902c6 Add real conversational memory + live learning to
Conversation Mode
- 8d7d9d8 Fix conversation mode timeout: two-stage pipeline + faster LLM
- 1958814 Fix "Model loading" stuck state: block in _do_asr until
Whisper is ready
- 618eab5 Fix model loading stuck forever + unhandled TTS crash in
conversation mode
- bfe5b59 Fix slow build: strip runtime-irrelevant heavy packages from
requirements.txt
Overall trajectory: the project has moved past initial Phase 1 scaffolding
and is now focused on (a) stabilising fine-tuning on Kaggle/RunPod with
large-v3-turbo, (b) expanding to Guinea Pular with the native Adlam script,
and (c) finishing the Phase 3 voice-to-voice pipeline (F5-TTS + LLM brain).
14 of the 20 commits listed are fixes rather than new features, which
suggests the codebase is approaching a stable milestone.
================================================================================