ground-zero / project-context.txt

Broulaye Doumbia

push docs and script

cc8b90c 2 months ago

14.5 kB

	================================================================================
	PROJECT CONTEXT — sahel-agri-voice
	Generated: 2026-04-17
	================================================================================

	PROJECT NAME
	------------
	Sahel-Voice-Lab / Sahel-Agri Voice AI
	(HuggingFace Space title: "Sahel-Voice-Lab", Phase 1: "The Memory Loop")

	PURPOSE
	-------
	A voice-first, self-learning AI assistant for two West African languages —
	Bambara (bam, spoken in Mali) and Fula/Pular (ful, spoken in Guinea and
	Senegal) — targeted at farmers in the Sahel region.

	The system has two complementary capabilities:

	1. LANGUAGE-LEARNING MEMORY LOOP (Phase 1)
	The assistant behaves like an "eager child learner." Users teach it
	Bambara/Fula words ("I ni ce means hello") via voice or text; an LLM
	detects the teaching intent and the word pair is persisted to a
	HuggingFace Hub dataset (ous-sow/sahel-agri-feedback → vocabulary.jsonl)
	so knowledge accumulates across sessions and users. The vocabulary is
	then injected into the LLM's system prompt as its source of truth for
	answering questions.

	2. AGRICULTURAL IoT VOICE INTERFACE
	Farmers speak questions in their own language ("how is the soil?",
	"is it going to rain?"). Whisper transcribes, an intent parser keyword-
	matches Bambara/Fula agricultural terms (soil, rain, irrigation, pest),
	a sensor bridge fetches data from an IoT backend (or mock data), and
	VoiceResponder + a TTS engine reply in short Bambara/Fula sentences
	with alert thresholds (e.g. "Bunding ji dɔgɔ. I ka foro ji." =
	"Soil moisture is low. Irrigate your field.").

	The project is deployed as a HuggingFace Space (Gradio frontend) with an
	optional FastAPI service. The system is explicitly "100% non-Meta" for its
	core stack (Whisper / Qwen / F5-TTS / VITS), avoiding Meta models for the
	main loop.

	FULL TECH STACK
	---------------
	Deployment / hosting
	- HuggingFace Spaces (Gradio SDK 5.25.0, hardware: cpu-basic)
	- Kaggle notebooks (T4 GPU) for training runs
	- RunPod (alternative training environment)
	- HF Hub datasets as persistent vocabulary + feedback store

	Frontend
	- Gradio 5.25.0 (app.py — main UI; app_lab.py — experimental lab UI)

	Backend API
	- FastAPI (src/api/app.py via create_app() + lifespan)
	- Pydantic v2 (schemas)
	- httpx (async calls to IoT sensor backend)

	Speech-to-text (STT)
	- openai/whisper-large-v3-turbo (default backbone)
	- transformers 5.5.0 (WhisperForConditionalGeneration, WhisperProcessor)
	- PEFT (LoRA adapters, hot-swappable per language)
	- accelerate 1.13.0
	- librosa 0.10.2, soundfile 0.12.1, torchaudio

	LLM (reasoning / teaching-intent detection)
	- Qwen/Qwen2.5-72B-Instruct (default, via HF Serverless Inference)
	- Qwen/Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Zephyr-7b-beta
	as faster alternatives
	- huggingface-hub 1.9.0 InferenceClient

	Text-to-speech (TTS)
	- Phase 1: facebook/mms-tts-bam, mms-tts-ful, mms-tts-fra, mms-tts-eng
	- Phase 2: ynnov/ekodi-bambara-tts-female (VITS)
	+ placeholder ous-sow/fula-tts
	- F5-TTS (SWivid/F5-TTS) for GPU voice cloning (optional, ~2GB)
	- OpenVoice V2 (myshell-ai/openvoice-v2) for tone-color conversion
	- SpeechBrain ECAPA-TDNN for speaker identification (per-user profiles)

	Data / datasets
	- google/fleurs (bam_ML, ff_SN) as STT training corpus
	- RobotsMali/jeli-asr, google/fleurs (Fula), and Wikipedia (bm, ff) text
	harvested via src/data/web_harvester.py
	- datasets 4.8.4 (+ torchcodec for 4.x audio decoding)
	- Adlam ↔ Latin transliteration for Guinea Pular

	Training / fine-tuning
	- PEFT LoRA + Seq2SeqTrainer
	- jiwer 3.0.4 (WER / CER metrics)
	- Custom callbacks: EarlyStoppingOnWER, AdapterCheckpointCallback
	- FieldNoiseAugmenter (tractor / wind / livestock noise mixing)

	Optimization / edge deploy
	- optimum[onnxruntime] → per-language ONNX export
	- onnx-tf / TensorFlow → TFLite for Android
	- bitsandbytes NF4 / 8-bit quantization (training environments)

	Utilities / runtime
	- PyYAML 6.0.2, python-dotenv 1.1.0
	- NumPy 2.2.4, SciPy 1.15.2
	- rapidfuzz 3.13.0 (fuzzy phrase matching)
	- pypdf, python-docx (Knowledge Base upload → vocabulary.jsonl)
	- Kaggle API (Self-Teaching tab triggers training runs)
	- ffmpeg (packages.txt — sole system-level dep)

	Environment variables
	HF_TOKEN, FEEDBACK_REPO_ID (ous-sow/sahel-agri-feedback),
	LLM_MODEL_ID, BAMBARA_ADAPTER_PATH, FULA_ADAPTER_PATH,
	SENSOR_API_URL, BAMBARA_TTS_REPO, FULA_TTS_REPO, DEVICE, LOG_LEVEL
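
The variables above could be read into one settings object at startup. The sketch below is illustrative only: the Settings class, load_settings(), and the DEVICE / LOG_LEVEL fallbacks are assumptions, while the FEEDBACK_REPO_ID and LLM_MODEL_ID fallbacks are the values this document names as defaults. Treating an unset SENSOR_API_URL as "use mock data" is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the documented variables."""
    hf_token: Optional[str]
    feedback_repo_id: str
    llm_model_id: str
    sensor_api_url: Optional[str]
    device: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (the process environment by default)."""
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        feedback_repo_id=env.get("FEEDBACK_REPO_ID", "ous-sow/sahel-agri-feedback"),
        llm_model_id=env.get("LLM_MODEL_ID", "Qwen/Qwen2.5-72B-Instruct"),
        sensor_api_url=env.get("SENSOR_API_URL"),  # None: assumed to mean mock data
        device=env.get("DEVICE", "cpu"),           # assumed fallback
        log_level=env.get("LOG_LEVEL", "INFO"),    # assumed fallback
    )
```

Passing a plain dict instead of os.environ, for example load_settings({"DEVICE": "cuda"}), keeps the function easy to exercise in tests.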

	KEY SOURCE FILES AND WHAT THEY DO
	---------------------------------
	Top-level entry points
	app.py
	Gradio UI (~99 KB). Main user-facing application running on the HF Space.
	Wires STT → LLM → memory → TTS, exposes the Conversation / Teaching /
	Knowledge Base / Self-Teaching tabs.
	app_lab.py
	Experimental/lab Gradio UI used to prototype new features
	(e.g. CuriosityEngine integration) before folding into app.py.
	setup.sh
	Shell bootstrap for local + RunPod environments.

	src/api/ — FastAPI service (alternative to Gradio-only deploy)
	app.py FastAPI factory with async lifespan: loads Whisper backbone
	once, registers bam/ful adapters, pre-loads 'bam', attaches
	Transcriber + SensorBridge to app.state.
	dependencies.py FastAPI DI helpers to pull shared objects off app.state.
	middleware.py CORS / logging middleware registration.
	schemas.py Pydantic v2 request/response models.
	routes/health.py GET /health — model status + loaded adapters.
	routes/transcribe.py POST /transcribe — audio → text, 10 MB cap,
	wav/mp3/ogg/m4a/flac/webm.
	routes/iot.py POST /query — full pipeline: audio → transcribe → intent
	→ sensor → voice response (IoTQueryResponse).
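
The upload limits listed for POST /transcribe can be checked before any audio is decoded. This is a minimal sketch, not the route's actual code: validate_upload() is a hypothetical helper, and reading "10 MB" as 10 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import PurePosixPath

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # "10 MB cap"; binary megabytes assumed
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".m4a", ".flac", ".webm"}


def validate_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError when an upload breaks the documented limits."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"file too large: {size_bytes} bytes")
```

In a FastAPI route the ValueError would typically be translated into a 4xx response; that mapping is left out here.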

	src/engine/ — STT core
	whisper_base.py Singleton loader for WhisperForConditionalGeneration +
	WhisperProcessor. FP16 on CUDA, FP32 on CPU. free()
	releases VRAM.
	adapter_manager.py Hot-swap LoRA adapters via PEFT's multi-adapter API:
	first load ~2s, subsequent set_adapter ~50ms.
	Keeps one backbone in VRAM and swaps ~50MB adapters.
	transcriber.py Public inference API. Transcribes audio of ≤30s directly
	and slices longer audio into 30s windows. Returns
	TranscriptionResult (text, language, duration_s,
	processing_time_ms, confidence).
	stt_processor.py Extracts an avg_logprob confidence score; a value below
	the -1.0 threshold means "confused", and the caller should ask the user
	to repeat.
	curiosity.py CuriosityEngine — every N interactions, prompts the
	LLM to spot a vocabulary gap and ask the user how to
	say a missing agricultural term.
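
Two of the rules above, the 30s windowing in transcriber.py and the -1.0 confidence threshold in stt_processor.py, can be sketched without loading any model. The function names are hypothetical, and the choice of non-overlapping windows is an assumption; the source only says long audio is sliced into 30s windows.

```python
from typing import List, Tuple

CHUNK_SECONDS = 30.0     # window length stated for transcriber.py
CONFUSED_LOGPROB = -1.0  # threshold stated for stt_processor.py


def chunk_bounds(duration_s: float,
                 chunk_s: float = CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover the whole clip."""
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]  # short clip: one pass, no slicing
    bounds = []
    start = 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += chunk_s
    return bounds


def is_confused(avg_logprob: float,
                threshold: float = CONFUSED_LOGPROB) -> bool:
    """True when the decoder's average log-probability is below the threshold."""
    return avg_logprob < threshold
```

For a 75s recording, chunk_bounds(75.0) yields three windows: 0 to 30, 30 to 60, and 60 to 75.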

	src/llm/
	gemma_client.py Wraps HF Serverless InferenceClient. Implements the
	"adult-child" system prompt that returns structured
	JSON with intent ∈ {teaching, question, conversation,
	error}. Parses JSON out of optional markdown fences.

	src/memory/
	memory_manager.py Thread-safe vocabulary store. Persists to
	data/vocabulary.jsonl locally and pushes asynchronously
	to HF Hub dataset. Provides get_recent() and a
	formatted get_vocabulary_context() for the LLM prompt.
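
A stripped-down version of such a store, covering only the local JSONL file and the prompt context, might look as follows. The class name, the entry fields (word, meaning, lang), and the context format are assumptions; the asynchronous push to the HF Hub dataset is deliberately omitted.

```python
import json
import threading
from pathlib import Path
from typing import Dict, List


class VocabularyStore:
    """Hypothetical thread-safe, append-only vocabulary store backed by JSONL."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._lock = threading.Lock()
        self._entries: List[Dict[str, str]] = []
        if path.exists():
            with path.open(encoding="utf-8") as fh:
                self._entries = [json.loads(line) for line in fh if line.strip()]

    def add(self, word: str, meaning: str, lang: str) -> None:
        """Record one taught word pair in memory and on disk."""
        entry = {"word": word, "meaning": meaning, "lang": lang}
        with self._lock:
            self._entries.append(entry)
            with self._path.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(entry, ensure_ascii=False) + "\n")

    def get_recent(self, n: int = 5) -> List[Dict[str, str]]:
        """Return the n most recently added entries (n >= 1), oldest first."""
        with self._lock:
            return list(self._entries[-n:])

    def get_vocabulary_context(self) -> str:
        """Format every known entry as a block of text for the system prompt."""
        with self._lock:
            lines = [f"- {e['word']} ({e['lang']}) = {e['meaning']}"
                     for e in self._entries]
        return "Known vocabulary:\n" + "\n".join(lines)
```

After store.add("I ni ce", "hello", "bam"), the context string contains the line "- I ni ce (bam) = hello", and a new VocabularyStore opened on the same path reloads it.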

	src/conversation/
	phrase_matcher.py RapidFuzz-based matcher over curated JSON phrase
	libraries (data/phrases/{lang}.json + _additions.json).
	Handles greetings / thanks / farewells without hitting
	the LLM.

	src/iot/
	intent_parser.py Keyword-based Intent classifier
	(greeting/thanks/farewell/check_soil/check_weather/
	irrigation_status/pest_alert) for bam, ful, fr, en.
	Confidence = matched_keywords / total_keywords.
	sensor_bridge.py Async bridge to an IoT backend (SENSOR_API_URL) for
	soil / weather / irrigation / pest readings.
	Falls back to mock random data.
	voice_responder.py Maps (Intent, SensorData) → short Bambara/Fula reply
	string (≤6 words per sentence for clean MMS-TTS) plus
	English translation. Alert thresholds encoded here
	(SOIL_MOISTURE_LOW=30, PH bounds, TEMP_HIGH=38, etc.).
	Also has a verbose French-language path.
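
The keyword-ratio confidence and the threshold-based reply can be illustrated together. Everything below is a sketch under stated assumptions: the English keyword lists, the "unknown" fallback intent, and the non-alert reply are invented for illustration; only the intent names, the matched/total confidence formula, the SOIL_MOISTURE_LOW=30 threshold, and the English alert text come from this document. The threshold is assumed to be a percentage.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical English keyword lists; the real parser also covers bam, ful, fr.
KEYWORDS: Dict[str, List[str]] = {
    "check_soil": ["soil", "moisture"],
    "check_weather": ["rain", "weather"],
    "irrigation_status": ["irrigation", "water"],
    "pest_alert": ["pest", "insect"],
}
SOIL_MOISTURE_LOW = 30  # threshold from voice_responder.py; percent assumed


def parse_intent(text: str) -> Tuple[str, float]:
    """Return (intent, confidence), confidence = matched / total keywords."""
    words = set(re.findall(r"\w+", text.lower()))
    best, best_conf = "unknown", 0.0
    for intent, keywords in KEYWORDS.items():
        conf = sum(1 for k in keywords if k in words) / len(keywords)
        if conf > best_conf:  # ties keep the first intent listed
            best, best_conf = intent, conf
    return best, best_conf


def soil_reply(moisture: float) -> str:
    """English side of the reply; the real responder answers in Bambara/Fula."""
    if moisture < SOIL_MOISTURE_LOW:
        return "Soil moisture is low. Irrigate your field."
    return "Soil moisture is fine."  # hypothetical non-alert wording
```

With these lists, parse_intent("how is the soil?") matches one of the two check_soil keywords and so reports a confidence of 0.5.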

	src/data/
	agri_dictionary.py Bambara + Fula domain vocab used to bias the Whisper
	decoder prompt toward agricultural terms.
	waxal_loader.py Streams google/fleurs (bam_ML, ff_SN) — the
	replacement for the retired google/waxal dataset.
	feature_extractor.py Log-mel spectrogram extraction and batched padding
	collator for Whisper Seq2SeqTrainer.
	augmentation.py FieldNoiseAugmenter — mixes clean speech with
	tractor/wind/livestock samples; falls back to
	Gaussian noise.
	bam_normalize.py Bambara phonetic normalizer (ou→u, gn/ny→ɲ,
	N'Ko-derived standard).
	adlam.py Adlam (𞤀𞤣𞤤𞤢𞤥) ↔ Latin transliteration for Pular;
	normalize_pular() for ASR preprocessing.
	web_harvester.py Harvests RobotsMali/jeli-asr, google/fleurs ff_SN,
	and bm/ff Wikipedia into the feedback Hub dataset.
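
The two substitutions named for bam_normalize.py (ou→u, gn/ny→ɲ) amount to an ordered list of string replacements. The sketch below applies just those rules; the function name, the lowercasing, and the NFC step are assumptions, and the real normalizer presumably covers more cases.

```python
import unicodedata

# Ordered (old, new) pairs taken from the description of bam_normalize.py.
RULES = (("ou", "u"), ("gn", "ɲ"), ("ny", "ɲ"))


def normalize_bambara(text: str) -> str:
    """Apply the documented spelling substitutions to lowercased NFC text."""
    out = unicodedata.normalize("NFC", text).lower()
    for old, new in RULES:
        out = out.replace(old, new)
    return out
```

Plain substring replacement ignores word and syllable boundaries, so a production normalizer would likely need more careful matching.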

	src/training/
	trainer.py WhisperLoRATrainer — full fine-tune orchestration
	(backbone + LoraConfig + WaxalDataLoader +
	Seq2SeqTrainer).
	metrics.py WER/CER for Seq2SeqTrainer eval loop (via jiwer).
	callbacks.py EarlyStoppingOnWER, AdapterCheckpointCallback
	(saves adapter-only, not full model).
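
The project computes WER/CER with jiwer. To show what the metric measures, and why empty references need special handling (two of the commits listed later fix crashes on empty references), here is a dependency-free stand-in. It is not jiwer's implementation; skipping empty references and returning 0.0 when none remain are assumptions about a reasonable guard.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(refs: List[str], hyps: List[str]) -> float:
    """Word error rate over a corpus, skipping pairs with an empty reference."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        ref_words = ref.split()
        if not ref_words:
            continue  # an empty reference would otherwise divide by zero
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return errors / words if words else 0.0
```

Passing plain strings to edit_distance() compares them character by character, which is the basis of CER.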

	src/tts/
	waxal_tts.py VITS engine wrapping ynnov/ekodi-bambara-tts-female
	for Bambara; Fula is a placeholder until
	ous-sow/fula-tts is trained.
	mms_tts.py Facebook MMS-TTS (bam/ful/fra/eng).
	f5_tts.py F5-TTS voice cloning (optional, GPU-only, ~750MB);
	gracefully falls back to MMS when missing.
	voice_cloner.py OpenVoice V2 tone-color converter — reshapes VITS
	audio to a target speaker's voice.

	src/voice/
	speaker_profiles.py SpeakerProfileManager with SpeechBrain ECAPA-TDNN
	(192-d embeddings). Per-user running-average embeddings
	for identification + OpenVoice SE for cloning; cosine
	similarity ≥ 0.75 attributes to an existing user.
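
The identification rule described above (running-average embedding per user, cosine similarity of at least 0.75 to match) can be sketched with plain lists. The class and method names are hypothetical, and the embeddings would come from the ECAPA-TDNN model, which is not loaded here.

```python
import math
from typing import Dict, List, Optional, Sequence

MATCH_THRESHOLD = 0.75  # cosine similarity cut-off from speaker_profiles.py


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SpeakerProfiles:
    """Hypothetical per-user running-average embedding store."""

    def __init__(self) -> None:
        self._mean: Dict[str, List[float]] = {}
        self._count: Dict[str, int] = {}

    def identify(self, emb: Sequence[float]) -> Optional[str]:
        """Return the best-matching user id, or None if below the threshold."""
        best_id, best_sim = None, MATCH_THRESHOLD
        for uid, mean in self._mean.items():
            sim = cosine(emb, mean)
            if sim >= best_sim:
                best_id, best_sim = uid, sim
        return best_id

    def update(self, uid: str, emb: Sequence[float]) -> None:
        """Fold a new embedding into the user's running average."""
        n = self._count.get(uid, 0)
        if n == 0:
            self._mean[uid] = list(emb)
        else:
            self._mean[uid] = [(m * n + x) / (n + 1)
                               for m, x in zip(self._mean[uid], emb)]
        self._count[uid] = n + 1
```

A caller would typically run identify() first and create a new user id when it returns None, then call update() with the resolved id.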

	src/optimization/
	onnx_exporter.py Merges LoRA into backbone and exports per-language
	ONNX (ONNX can't hot-swap adapters at runtime).
	quantizer.py BitsAndBytes NF4 / 8-bit quantization for GPU-
	constrained deploys (turbo ~3GB → ~1GB VRAM).
	tflite_converter.py ONNX → TFLite for offline Android; exports encoder
	and decoder separately.

	Config / data folders
	configs/ base_config.yaml + per-language LoRA configs.
	data/ vocabulary.jsonl, phrases/*.json, profiles/, etc.
	notebooks/ Kaggle / RunPod fine-tune + TTS training notebooks.
	noise_samples/ .wav clips for field-noise augmentation.
	scripts/ utility scripts (bootstrap, harvest, eval).
	tests/ pytest suite (not installed in HF Spaces runtime).

	RECENT GIT COMMITS SUMMARY (last 20)
	------------------------------------
	The recent history is focused on three concurrent tracks:

	1. STT / training stability
	- bb78cbf Add torchcodec install for datasets 4.x audio decoding
	- 9049ef3 Prepare training stack for RunPod: env-aware notebook +
	bootstrap script
	- cc50efb Align Whisper default to turbo-v3 + add document upload to
	Knowledge Base tab
	- c33a061 Fix WhisperProcessor import in reload + upgrade base to
	large-v3-turbo
	- 7fae91b Fix mel-bin mismatch: load per-language processor from
	fine-tuned checkpoint
	- 6682858 Fix jiwer crash on post-normalisation empty refs;
	register SLR106/105 datasets
	- 58f431a Fix SyntaxError in Cell 17: unterminated f-string literal
	- 3632a23 Fix compute_metrics crash on empty eval references
	in Fula training
	- 71bb3bc Fix: add trust_remote_code=True for datasets 3.x compatibility
	- cd017e2 Fix Cell 16 ValueError: load model fp32 so AMP gradient scaler
	works

	2. Language support / Adlam / Pular expansion
	- ced078c Add Adlam/Pular Fula integration: transliterator +
	3 new datasets + normalisation pipeline
	- 40cf84d Fix language mixing: per-language prompts +
	Mali Bambara / Guinea Pular context
	- 33c3a5a Fix Self-Teaching language detection: parse code from
	dropdown label
	- 24b1617 Fix Self-Teaching tab: float sliders, deduplication,
	Kaggle API fallback

	3. Conversation / voice pipeline
	- 8952fff Phase 3: Voice-to-Voice S2S pipeline —
	F5-TTS, LLM brain, CER metric
	- ad902c6 Add real conversational memory + live learning to
	Conversation Mode
	- 8d7d9d8 Fix conversation mode timeout: two-stage pipeline + faster LLM
	- 1958814 Fix "Model loading" stuck state: block in _do_asr until
	Whisper is ready
	- 618eab5 Fix model loading stuck forever + unhandled TTS crash in
	conversation mode
	- bfe5b59 Fix slow build: strip runtime-irrelevant heavy packages from
	requirements.txt

	Overall trajectory: the project has moved past initial Phase 1 scaffolding
	and is iterating hard on (a) stabilising fine-tuning on Kaggle/RunPod with
	large-v3-turbo, (b) expanding to Guinea Pular with the native Adlam script,
	and (c) finishing the Phase 3 voice-to-voice pipeline (F5-TTS + LLM brain).
	Most recent commits are bug-fixes rather than net-new features, suggesting
	the current codebase is approaching a stable milestone.

	================================================================================