| ================================================================================ | |
| PROJECT CONTEXT — sahel-agri-voice | |
| Generated: 2026-04-17 | |
| ================================================================================ | |
| PROJECT NAME | |
| ------------ | |
| Sahel-Voice-Lab / Sahel-Agri Voice AI | |
| (HuggingFace Space title: "Sahel-Voice-Lab", Phase 1: "The Memory Loop") | |
| PURPOSE | |
| ------- | |
| A voice-first, self-learning AI assistant for two West African languages — | |
| Bambara (bam, spoken in Mali) and Fula/Pular (ful, spoken in Guinea and | |
| Senegal) — targeted at farmers in the Sahel region. | |
| The system has two complementary capabilities: | |
| 1. LANGUAGE-LEARNING MEMORY LOOP (Phase 1) | |
| The assistant behaves like an "eager child learner." Users teach it | |
| Bambara/Fula words ("I ni ce means hello") via voice or text; an LLM | |
| detects the teaching intent and the word pair is persisted to a | |
| HuggingFace Hub dataset (ous-sow/sahel-agri-feedback → vocabulary.jsonl) | |
| so knowledge accumulates across sessions and users. The vocabulary is | |
| then injected into the LLM's system prompt as its source of truth for | |
| answering questions. | |
| 2. AGRICULTURAL IoT VOICE INTERFACE | |
| Farmers speak questions in their own language ("how is the soil?", | |
| "is it going to rain?"). Whisper transcribes, an intent parser keyword- | |
| matches Bambara/Fula agricultural terms (soil, rain, irrigation, pest), | |
| a sensor bridge fetches data from an IoT backend (or mock data), and | |
| VoiceResponder + a TTS engine reply in short Bambara/Fula sentences | |
| with alert thresholds (e.g. "Bunding ji dɔgɔ. I ka foro ji." = | |
| "Soil moisture is low. Irrigate your field."). | |
| The project is deployed as a HuggingFace Space (Gradio frontend) with an | |
| optional FastAPI service. The core stack (Whisper / Qwen / F5-TTS / VITS) is | |
| explicitly "100% non-Meta", avoiding Meta models for the main loop; the | |
| Phase 1 facebook/mms-tts voices listed under TTS are the exception. | |
| FULL TECH STACK | |
| --------------- | |
| Deployment / hosting | |
| - HuggingFace Spaces (Gradio SDK 5.25.0, hardware: cpu-basic) | |
| - Kaggle notebooks (T4 GPU) for training runs | |
| - RunPod alternative training environment | |
| - HF Hub datasets as persistent vocabulary + feedback store | |
| Frontend | |
| - Gradio 5.25.0 (app.py — main UI; app_lab.py — experimental lab UI) | |
| Backend API | |
| - FastAPI (src/api/app.py via create_app() + lifespan) | |
| - Pydantic v2 (schemas) | |
| - httpx (async calls to IoT sensor backend) | |
| Speech-to-text (STT) | |
| - openai/whisper-large-v3-turbo (default backbone) | |
| - transformers 5.5.0 (WhisperForConditionalGeneration, WhisperProcessor) | |
| - PEFT (LoRA adapters, hot-swappable per language) | |
| - accelerate 1.13.0 | |
| - librosa 0.10.2, soundfile 0.12.1, torchaudio | |
| LLM (reasoning / teaching-intent detection) | |
| - Qwen/Qwen2.5-72B-Instruct (default, via HF Serverless Inference) | |
| - Qwen/Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Zephyr-7b-beta | |
| as faster alternatives | |
| - huggingface-hub 1.9.0 InferenceClient | |
| Text-to-speech (TTS) | |
| - Phase 1: facebook/mms-tts-bam, mms-tts-ful, mms-tts-fra, mms-tts-eng | |
| - Phase 2: ynnov/ekodi-bambara-tts-female (VITS) | |
| + placeholder ous-sow/fula-tts | |
| - F5-TTS (SWivid/F5-TTS) for GPU voice cloning (optional, ~2GB) | |
| - OpenVoice V2 (myshell-ai/openvoice-v2) for tone-color conversion | |
| - SpeechBrain ECAPA-TDNN for speaker identification (per-user profiles) | |
| Data / datasets | |
| - google/fleurs (bam_ML, ff_SN) as STT training corpus | |
| - RobotsMali/jeli-asr, google/fleurs (ff_SN), and Wikipedia (bm, ff) text, | |
| harvested via src/data/web_harvester.py | |
| - datasets 4.8.4 (+ torchcodec for 4.x audio decoding) | |
| - Adlam ↔ Latin transliteration for Guinea Pular | |
| Training / fine-tuning | |
| - PEFT LoRA + Seq2SeqTrainer | |
| - jiwer 3.0.4 (WER / CER metrics) | |
| - Custom callbacks: EarlyStoppingOnWER, AdapterCheckpointCallback | |
| - FieldNoiseAugmenter (tractor / wind / livestock noise mixing) | |
| Optimization / edge deploy | |
| - optimum[onnxruntime] → per-language ONNX export | |
| - onnx-tf / TensorFlow → TFLite for Android | |
| - bitsandbytes NF4 / 8-bit quantization (training environments) | |
| Utilities / runtime | |
| - PyYAML 6.0.2, python-dotenv 1.1.0 | |
| - NumPy 2.2.4, SciPy 1.15.2 | |
| - rapidfuzz 3.13.0 (fuzzy phrase matching) | |
| - pypdf, python-docx (Knowledge Base upload → vocabulary.jsonl) | |
| - Kaggle API (Self-Teaching tab triggers training runs) | |
| - ffmpeg (packages.txt — sole system-level dep) | |
| Environment variables | |
| HF_TOKEN, FEEDBACK_REPO_ID (ous-sow/sahel-agri-feedback), | |
| LLM_MODEL_ID, BAMBARA_ADAPTER_PATH, FULA_ADAPTER_PATH, | |
| SENSOR_API_URL, BAMBARA_TTS_REPO, FULA_TTS_REPO, DEVICE, LOG_LEVEL | |
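One way these variables could be read at startup is sketched below. The Settings dataclass and load_settings() helper are hypothetical names, not taken from the repository. The defaults for FEEDBACK_REPO_ID and LLM_MODEL_ID follow the values stated in these notes; the DEVICE and LOG_LEVEL defaults, and treating an unset SENSOR_API_URL as "use mock data", are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the environment variables."""
    hf_token: Optional[str]
    feedback_repo_id: str
    llm_model_id: str
    sensor_api_url: Optional[str]
    device: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (os.environ by default)."""
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        # Defaults below mirror the values named in the project notes.
        feedback_repo_id=env.get("FEEDBACK_REPO_ID", "ous-sow/sahel-agri-feedback"),
        llm_model_id=env.get("LLM_MODEL_ID", "Qwen/Qwen2.5-72B-Instruct"),
        # Assumption: an empty or missing URL means "no backend, use mock data".
        sensor_api_url=env.get("SENSOR_API_URL") or None,
        # Assumed defaults, not confirmed by the source.
        device=env.get("DEVICE", "cpu"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests, for example load_settings({"DEVICE": "cuda"}).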
| KEY SOURCE FILES AND WHAT THEY DO | |
| --------------------------------- | |
| Top-level entry points | |
| app.py | |
| Gradio UI (~99 KB). Main user-facing application running on the HF Space. | |
| Wires STT → LLM → memory → TTS, exposes the Conversation / Teaching / | |
| Knowledge Base / Self-Teaching tabs. | |
| app_lab.py | |
| Experimental/lab Gradio UI used to prototype new features | |
| (e.g. CuriosityEngine integration) before folding into app.py. | |
| setup.sh | |
| Shell bootstrap for local + RunPod environments. | |
| src/api/ — FastAPI service (alternative to Gradio-only deploy) | |
| app.py FastAPI factory with async lifespan: loads Whisper backbone | |
| once, registers bam/ful adapters, pre-loads 'bam', attaches | |
| Transcriber + SensorBridge to app.state. | |
| dependencies.py FastAPI DI helpers to pull shared objects off app.state. | |
| middleware.py CORS / logging middleware registration. | |
| schemas.py Pydantic v2 request/response models. | |
| routes/health.py GET /health — model status + loaded adapters. | |
| routes/transcribe.py POST /transcribe — audio → text, 10 MB cap, | |
| wav/mp3/ogg/m4a/flac/webm. | |
| routes/iot.py POST /query — full pipeline: audio → transcribe → intent | |
| → sensor → voice response (IoTQueryResponse). | |
| src/engine/ — STT core | |
| whisper_base.py Singleton loader for WhisperForConditionalGeneration + | |
| WhisperProcessor. FP16 on CUDA, FP32 on CPU. free() | |
| releases VRAM. | |
| adapter_manager.py Hot-swap LoRA adapters via PEFT's multi-adapter API: | |
| first load ~2s, subsequent set_adapter ~50ms. | |
| Keeps one backbone in VRAM and swaps ~50MB adapters. | |
| transcriber.py Public inference API. Handles ≤30s chunks directly, | |
| >30s by slicing into 30s windows. Returns | |
| TranscriptionResult (text, language, duration_s, | |
| processing_time_ms, confidence). | |
| stt_processor.py avg_logprob confidence extractor; a score below the -1.0 | |
| threshold counts as "confused" and the caller should ask the user to repeat. | |
| curiosity.py CuriosityEngine — every N interactions, prompts the | |
| LLM to spot a vocabulary gap and ask the user how to | |
| say a missing agricultural term. | |
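The 30 s windowing described for transcriber.py can be illustrated with a minimal sketch. It assumes consecutive, non-overlapping windows over a mono sample sequence; whether the real Transcriber overlaps or pads windows is not stated in these notes. The function name slice_into_windows is hypothetical.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz audio
WINDOW_SECONDS = 30    # window length stated in the notes


def slice_into_windows(samples: Sequence[float],
                       sample_rate: int = SAMPLE_RATE,
                       window_s: int = WINDOW_SECONDS) -> List[Sequence[float]]:
    """Split audio into consecutive windows of at most window_s seconds.

    Clips that already fit in one window are returned as a single chunk.
    Assumption: windows do not overlap and the last one may be shorter.
    """
    window_len = sample_rate * window_s
    if len(samples) <= window_len:
        return [samples]
    return [samples[start:start + window_len]
            for start in range(0, len(samples), window_len)]
```

With this scheme a 65 s clip yields three windows of 30 s, 30 s and 5 s, each of which would be transcribed separately and the texts concatenated.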
| src/llm/ | |
| gemma_client.py Wraps HF Serverless InferenceClient. Implements the | |
| "adult-child" system prompt that returns structured | |
| JSON with intent ∈ {teaching, question, conversation, | |
| error}. Parses JSON out of optional markdown fences. | |
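Parsing JSON out of optional markdown fences could look roughly like the sketch below. The function name parse_llm_reply, the "raw" key, and the choice to map any malformed reply to the error intent are assumptions for illustration; only the four intent values come from these notes.

```python
import json
import re
from typing import Any, Dict

VALID_INTENTS = {"teaching", "question", "conversation", "error"}

# Build the three-backtick fence programmatically so this snippet can itself
# sit inside a fenced block.
FENCE = "`" * 3
_FENCED_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE),
    re.DOTALL,
)


def parse_llm_reply(raw: str) -> Dict[str, Any]:
    """Extract the JSON object from an LLM reply, fenced or not."""
    match = _FENCED_RE.search(raw)
    payload = match.group(1) if match else raw.strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        # Assumption: unparseable replies are reported as the error intent.
        return {"intent": "error", "raw": raw}
    if not isinstance(data, dict) or data.get("intent") not in VALID_INTENTS:
        return {"intent": "error", "raw": raw}
    return data
```

Falling back to an explicit error intent keeps the caller's control flow to a single branch on the intent field instead of a mix of exceptions and return values.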
| src/memory/ | |
| memory_manager.py Thread-safe vocabulary store. Persists to | |
| data/vocabulary.jsonl locally and pushes asynchronously | |
| to HF Hub dataset. Provides get_recent() and a | |
| formatted get_vocabulary_context() for the LLM prompt. | |
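A minimal sketch of a thread-safe JSONL vocabulary store follows. The class name VocabularyStore, the entry fields (word, meaning, language) and the prompt format are assumptions; the asynchronous push to the HF Hub dataset is deliberately left out.

```python
import json
import threading
from pathlib import Path
from typing import Dict, List


class VocabularyStore:
    """Append-only vocabulary persisted as one JSON object per line."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._entries: List[Dict[str, str]] = []
        if path.exists():
            with path.open(encoding="utf-8") as fh:
                self._entries = [json.loads(line) for line in fh if line.strip()]

    def add(self, word: str, meaning: str, language: str) -> None:
        entry = {"word": word, "meaning": meaning, "language": language}
        # The lock covers both the in-memory list and the file append so
        # concurrent callers cannot interleave partial lines.
        with self._lock:
            self._entries.append(entry)
            with self._path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(entry, ensure_ascii=False) + "\n")

    def get_recent(self, n: int = 5) -> List[Dict[str, str]]:
        with self._lock:
            return list(self._entries[-n:]) if n > 0 else []

    def get_vocabulary_context(self) -> str:
        """Format all entries as a block of text for the LLM system prompt."""
        with self._lock:
            lines = [f"- {e['word']} ({e['language']}) = {e['meaning']}"
                     for e in self._entries]
        return "Known vocabulary:\n" + "\n".join(lines)
```

Writing with ensure_ascii=False keeps characters such as ɔ and ɲ readable in vocabulary.jsonl rather than escaping them.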
| src/conversation/ | |
| phrase_matcher.py RapidFuzz-based matcher over curated JSON phrase | |
| libraries (data/phrases/{lang}.json + _additions.json). | |
| Handles greetings / thanks / farewells without hitting | |
| the LLM. | |
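The idea of answering common phrases before calling the LLM can be sketched as follows. To stay dependency-free this example swaps in difflib.SequenceMatcher from the standard library in place of RapidFuzz, so the scores will differ from the real matcher. The phrase table, the 0.8 threshold and the function name match_phrase are hypothetical; only "I ni ce" (hello) comes from these notes.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

# Hypothetical phrase library; the real entries live in data/phrases/{lang}.json.
PHRASES: Dict[str, str] = {
    "i ni ce": "greeting",
    "thank you": "thanks",
    "goodbye": "farewell",
}


def match_phrase(text: str, threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return (category, score) for the closest phrase, or None if too distant."""
    query = text.strip().lower()
    best_category: Optional[str] = None
    best_score = 0.0
    for phrase, category in PHRASES.items():
        score = SequenceMatcher(None, query, phrase).ratio()
        if score > best_score:
            best_category, best_score = category, score
    if best_category is None or best_score < threshold:
        return None  # no confident match: the caller falls through to the LLM
    return best_category, best_score
```

A slightly mis-transcribed greeting still resolves to the greeting category, while an unrelated question such as "how is the soil today" returns None and goes to the LLM.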
| src/iot/ | |
| intent_parser.py Keyword-based Intent classifier | |
| (greeting/thanks/farewell/check_soil/check_weather/ | |
| irrigation_status/pest_alert) for bam, ful, fr, en. | |
| Confidence = matched_keywords / total_keywords. | |
| sensor_bridge.py Async bridge to an IoT backend (SENSOR_API_URL) for | |
| soil / weather / irrigation / pest readings. | |
| Falls back to mock random data. | |
| voice_responder.py Maps (Intent, SensorData) → short Bambara/Fula reply | |
| string (≤6 words per sentence for clean MMS-TTS) plus | |
| English translation. Alert thresholds encoded here | |
| (SOIL_MOISTURE_LOW=30, PH bounds, TEMP_HIGH=38, etc.). | |
| Also has a verbose French-language path. | |
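The keyword confidence rule and the soil-moisture alert can be sketched together. The English keyword lists are hypothetical stand-ins for the real bam/ful/fr/en lists, "total_keywords" is assumed to mean the size of each intent's keyword list, and the moisture unit is assumed to be percent. The alert sentence and the SOIL_MOISTURE_LOW value are taken from these notes.

```python
import re
from typing import Dict, FrozenSet, Optional, Tuple

# Hypothetical keyword lists, for illustration only.
KEYWORDS: Dict[str, FrozenSet[str]] = {
    "check_soil": frozenset({"soil", "moisture", "ground"}),
    "check_weather": frozenset({"rain", "weather", "wind"}),
    "irrigation_status": frozenset({"irrigation", "pump", "water"}),
    "pest_alert": frozenset({"pest", "insects", "locusts"}),
}

SOIL_MOISTURE_LOW = 30  # threshold from the notes; percent is an assumption


def parse_intent(text: str) -> Tuple[Optional[str], float]:
    """Pick the intent with the highest matched_keywords / total_keywords."""
    tokens = set(re.findall(r"\w+", text.lower()))
    best_intent: Optional[str] = None
    best_conf = 0.0
    for intent, words in KEYWORDS.items():
        conf = len(words & tokens) / len(words)
        if conf > best_conf:
            best_intent, best_conf = intent, conf
    return best_intent, best_conf


def low_moisture_alert(moisture: float) -> Optional[Tuple[str, str]]:
    """Return the (Bambara, English) alert when moisture is below threshold."""
    if moisture < SOIL_MOISTURE_LOW:
        return ("Bunding ji dɔgɔ. I ka foro ji.",
                "Soil moisture is low. Irrigate your field.")
    return None  # no alert; normal-reading replies are not sketched here
```

Under these assumptions "how is the soil?" matches one of three check_soil keywords, giving a confidence of about 0.33, and a reading of 22 triggers the irrigation alert.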
| src/data/ | |
| agri_dictionary.py Bambara + Fula domain vocab used to bias the Whisper | |
| decoder prompt toward agricultural terms. | |
| waxal_loader.py Streams google/fleurs (bam_ML, ff_SN) — the | |
| replacement for the retired google/waxal dataset. | |
| feature_extractor.py Log-mel spectrogram extraction and batched padding | |
| collator for Whisper Seq2SeqTrainer. | |
| augmentation.py FieldNoiseAugmenter — mixes clean speech with | |
| tractor/wind/livestock samples; falls back to | |
| Gaussian noise. | |
| bam_normalize.py Bambara phonetic normalizer (ou→u, gn/ny→ɲ, | |
| N'Ko-derived standard). | |
| adlam.py Adlam (𞤀𞤣𞤤𞤢𞤥) ↔ Latin transliteration for Pular; | |
| normalize_pular() for ASR preprocessing. | |
| web_harvester.py Harvests RobotsMali/jeli-asr, google/fleurs ff_SN, | |
| and bm/ff Wikipedia into the feedback Hub dataset. | |
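The Bambara normalisation rules named for bam_normalize.py (ou→u, gn/ny→ɲ) can be expressed as ordered regex substitutions. This is a minimal sketch: the function name normalize_bambara, the lowercasing, the Unicode NFC step and the whitespace collapsing are assumptions, and the real module may apply further rules.

```python
import re
import unicodedata

# Ordered rewrite rules taken from the notes: ou -> u, then gn / ny -> ɲ.
_RULES = (
    (re.compile(r"ou"), "u"),
    (re.compile(r"gn|ny"), "ɲ"),
)


def normalize_bambara(text: str) -> str:
    """Map French-influenced spellings onto one orthography for ASR targets."""
    # Assumption: compose characters and lowercase before applying the rules.
    text = unicodedata.normalize("NFC", text).lower()
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    # Assumption: collapse runs of whitespace so references compare cleanly.
    return re.sub(r"\s+", " ", text).strip()
```

With these rules the illustrative spellings "Gnouman" and "nyuman" both normalise to "ɲuman", so variant transcriptions no longer count as separate words during WER scoring.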
| src/training/ | |
| trainer.py WhisperLoRATrainer — end-to-end LoRA fine-tuning orchestration | |
| (backbone + LoraConfig + WaxalDataLoader + | |
| Seq2SeqTrainer). | |
| metrics.py WER/CER for Seq2SeqTrainer eval loop (via jiwer). | |
| callbacks.py EarlyStoppingOnWER, AdapterCheckpointCallback | |
| (saves adapter-only, not full model). | |
| src/tts/ | |
| waxal_tts.py VITS engine wrapping ynnov/ekodi-bambara-tts-female | |
| for Bambara; Fula is a placeholder until | |
| ous-sow/fula-tts is trained. | |
| mms_tts.py Facebook MMS-TTS (bam/ful/fra/eng). | |
| f5_tts.py F5-TTS voice cloning (optional, GPU-only, ~750MB); | |
| gracefully falls back to MMS when missing. | |
| voice_cloner.py OpenVoice V2 tone-color converter — reshapes VITS | |
| audio to a target speaker's voice. | |
| src/voice/ | |
| speaker_profiles.py SpeakerProfileManager with SpeechBrain ECAPA-TDNN | |
| (192-d embeddings). Per-user running-average embeddings | |
| for identification + OpenVoice SE for cloning; cosine | |
| similarity ≥ 0.75 attributes to an existing user. | |
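The attribution rule (cosine similarity ≥ 0.75 against per-user running-average embeddings) can be sketched in plain Python. The class and method names are hypothetical, the embeddings here are short toy vectors rather than 192-d ECAPA-TDNN outputs, and embedding extraction with SpeechBrain is out of scope.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.75  # value stated in the notes


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SpeakerProfiles:
    """Keeps one running-average embedding per known user."""

    def __init__(self) -> None:
        # user id -> (mean embedding, number of utterances averaged)
        self._profiles: Dict[str, Tuple[List[float], int]] = {}

    def identify(self, embedding: Sequence[float]) -> Optional[str]:
        """Return the best-matching user at or above the threshold, else None."""
        best_id: Optional[str] = None
        best_sim = SIMILARITY_THRESHOLD
        for user_id, (mean, _count) in self._profiles.items():
            sim = cosine(embedding, mean)
            if sim >= best_sim:
                best_id, best_sim = user_id, sim
        return best_id

    def update(self, user_id: str, embedding: Sequence[float]) -> None:
        """Fold a new utterance embedding into the user's running average."""
        mean, count = self._profiles.get(user_id, ([0.0] * len(embedding), 0))
        new_mean = [(m * count + e) / (count + 1) for m, e in zip(mean, embedding)]
        self._profiles[user_id] = (new_mean, count + 1)
```

A caller would typically run identify() first and, when it returns None, create a new profile with update() under a fresh user id.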
| src/optimization/ | |
| onnx_exporter.py Merges LoRA into backbone and exports per-language | |
| ONNX (ONNX can't hot-swap adapters at runtime). | |
| quantizer.py BitsAndBytes NF4 / 8-bit quantization for GPU- | |
| constrained deploys (turbo ~3GB → ~1GB VRAM). | |
| tflite_converter.py ONNX → TFLite for offline Android; exports encoder | |
| and decoder separately. | |
| Config / data folders | |
| configs/ base_config.yaml + per-language LoRA configs. | |
| data/ vocabulary.jsonl, phrases/*.json, profiles/, etc. | |
| notebooks/ Kaggle / RunPod fine-tune + TTS training notebooks. | |
| noise_samples/ .wav clips for field-noise augmentation. | |
| scripts/ utility scripts (bootstrap, harvest, eval). | |
| tests/ pytest suite (not installed in HF Spaces runtime). | |
| RECENT GIT COMMITS SUMMARY (last 20) | |
| ------------------------------------ | |
| The recent history is focused on three concurrent tracks: | |
| 1. STT / training stability | |
| - bb78cbf Add torchcodec install for datasets 4.x audio decoding | |
| - 9049ef3 Prepare training stack for RunPod: env-aware notebook + | |
| bootstrap script | |
| - cc50efb Align Whisper default to turbo-v3 + add document upload to | |
| Knowledge Base tab | |
| - c33a061 Fix WhisperProcessor import in reload + upgrade base to | |
| large-v3-turbo | |
| - 7fae91b Fix mel-bin mismatch: load per-language processor from | |
| fine-tuned checkpoint | |
| - 6682858 Fix jiwer crash on post-normalisation empty refs; | |
| register SLR106/105 datasets | |
| - 58f431a Fix SyntaxError in Cell 17: unterminated f-string literal | |
| - 3632a23 Fix compute_metrics crash on empty eval references | |
| in Fula training | |
| - 71bb3bc Fix: add trust_remote_code=True for datasets 3.x compatibility | |
| - cd017e2 Fix Cell 16 ValueError: load model fp32 so AMP gradient scaler | |
| works | |
| 2. Language support / Adlam / Pular expansion | |
| - ced078c Add Adlam/Pular Fula integration: transliterator + | |
| 3 new datasets + normalisation pipeline | |
| - 40cf84d Fix language mixing: per-language prompts + | |
| Mali Bambara / Guinea Pular context | |
| - 33c3a5a Fix Self-Teaching language detection: parse code from | |
| dropdown label | |
| - 24b1617 Fix Self-Teaching tab: float sliders, deduplication, | |
| Kaggle API fallback | |
| 3. Conversation / voice pipeline | |
| - 8952fff Phase 3: Voice-to-Voice S2S pipeline — | |
| F5-TTS, LLM brain, CER metric | |
| - ad902c6 Add real conversational memory + live learning to | |
| Conversation Mode | |
| - 8d7d9d8 Fix conversation mode timeout: two-stage pipeline + faster LLM | |
| - 1958814 Fix "Model loading" stuck state: block in _do_asr until | |
| Whisper is ready | |
| - 618eab5 Fix model loading stuck forever + unhandled TTS crash in | |
| conversation mode | |
| - bfe5b59 Fix slow build: strip runtime-irrelevant heavy packages from | |
| requirements.txt | |
| Overall trajectory: the project has moved past initial Phase 1 scaffolding | |
| and is iterating hard on (a) stabilising fine-tuning on Kaggle/RunPod with | |
| large-v3-turbo, (b) expanding to Guinea Pular with the native Adlam script, | |
| and (c) finishing the Phase 3 voice-to-voice pipeline (F5-TTS + LLM brain). | |
| Most of the recent commits are bug fixes rather than new features, which | |
| suggests the codebase is approaching a stable milestone. | |
| ================================================================================ | |