
Baseline Rebuild Plan — Recovering Months 1-3 Without Losing Existing Work

Created: 2026-04-20 Maintainer: Broulaye

The framing

You are not restarting. You are backfilling the measurement foundation that was skipped the first time through Stages C (ASR + eval), E (memory loop with real users), and G (field test). Every existing file in src/ stays exactly where it is. The app.py Gradio Space on HuggingFace keeps running. Phase 3 voice-to-voice, Waxal VITS, Adlam/Pular, F5-TTS, ONNX exporters, the FastAPI service — all of it stays.

What you add is a parallel minimal track: a new, deliberately simple entry point that uses only the smallest slice of the existing codebase, runs a real field test against it, and collects the data that should have been collected in Months 1-3. Once the minimal track has produced field signal, you use that signal to guide which features in the main app are actually earning their keep.

Three principles govern this plan:

1. Never delete, never rewrite. If something is wrong in an existing module, fix it in place. The minimal track imports from src/; it does not fork it.
2. The existing app.py keeps shipping. Do not take down the production Space. The minimal version deploys as a separate Space.
3. The measurement artifacts (eval set, logs, field-test notes) merge back into main when done. Code stays isolated on a branch; data and docs come back.

Step-by-step

Step 1 — Protect main with a branch and a tag

Why. Every experiment has to be safely discardable. Tagging the current commit lets you return to a known-good state at any point; branching means nothing the rebuild does can touch the main deploy.

How.

cd /sessions/practical-intelligent-knuth/mnt/sahel-agri-voice
git status                                  # confirm clean working tree
git tag v0.3-pre-rebuild -m "Last state before baseline rebuild"
git push origin v0.3-pre-rebuild            # if you want the tag on GitHub
git checkout -b experimental/baseline-rebuild

From this point on, all rebuild work happens on experimental/baseline-rebuild. Main is frozen to rebuild work for the duration; the only changes it takes are production hotfixes, which still go through main as normal.

Step 2 — Create the minimal entry point

Why. You need to run Whisper + LLM + MMS-TTS in the simplest possible wiring, with nothing else in the critical path. This is what users will actually evaluate. Every extra component adds a failure mode you can't isolate. The minimal entry point doubles as a debugging tool and a field-test artifact.

How. Add a new file app_minimal.py at the repo root — a third entry point alongside app.py (full production) and app_lab.py (experimental). It should import only:

- src.llm.gemma_client: the Qwen LLM client (despite the module name), unchanged
- src.engine.whisper_base: Whisper backbone, used zero-shot (no adapter)
- src.tts.mms_tts: MMS-TTS Bambara fallback
- src.data.bam_normalize: the orthography normalizer
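
The four imports above chain into a single turn: audio in, normalized transcript, LLM reply, synthesized audio out. A minimal sketch of that wiring follows. This document does not give the real function names exposed by the four modules, so the stages are passed in as plain callables; `run_turn`, `TurnResult`, and the parameter names are hypothetical.

```python
from dataclasses import dataclass
from time import perf_counter
from typing import Any, Callable


@dataclass
class TurnResult:
    transcript: str    # normalized ASR output, shown in the UI for debugging
    reply: str         # text returned by the LLM
    audio: Any         # whatever the TTS stage returns (for example a wav path)
    latency_ms: float  # wall-clock time for the whole turn


def run_turn(
    audio_path: str,
    transcribe: Callable[[str], str],   # stand-in for src.engine.whisper_base
    normalize: Callable[[str], str],    # stand-in for src.data.bam_normalize
    generate: Callable[[str], str],     # stand-in for src.llm.gemma_client
    synthesize: Callable[[str], Any],   # stand-in for src.tts.mms_tts
) -> TurnResult:
    """Run one voice turn through the minimal slice: ASR -> normalize -> LLM -> TTS."""
    start = perf_counter()
    transcript = normalize(transcribe(audio_path))
    reply = generate(transcript)
    audio = synthesize(reply)
    latency_ms = (perf_counter() - start) * 1000.0
    return TurnResult(transcript, reply, audio, latency_ms)
```

Keeping the stages injectable means the same function can be exercised with stubs in a unit test and with the real modules inside the Gradio callback.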

It should not touch:

- src/engine/adapter_manager.py (skip LoRA entirely; zero-shot only)
- src/engine/transcriber.py (the adapter-aware wrapper; use whisper_base directly)
- src/memory/ (no memory loop in the minimal version yet)
- src/voice/speaker_profiles.py (no speaker ID)
- src/iot/ (no sensors, no intent parsing; the LLM handles it all)
- src/tts/waxal_tts.py, src/tts/f5_tts.py, src/tts/voice_cloner.py (no upgraded TTS)
- src/conversation/phrase_matcher.py (no fast-path shortcuts)
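
One way to keep that exclusion list from eroding is a small import guard run as a test. The sketch below uses the standard-library `ast` module to scan a source file for imports of the excluded modules. The module names come from the list above; the function name `forbidden_imports` and the idea of running it in CI are assumptions, not something the repo already has.

```python
import ast

# Modules the minimal entry point must not import (taken from the list above).
FORBIDDEN_PREFIXES = (
    "src.engine.adapter_manager",
    "src.engine.transcriber",
    "src.memory",
    "src.voice.speaker_profiles",
    "src.iot",
    "src.tts.waxal_tts",
    "src.tts.f5_tts",
    "src.tts.voice_cloner",
    "src.conversation.phrase_matcher",
)


def _is_forbidden(module: str) -> bool:
    """True if `module` is an excluded module or lives inside an excluded package."""
    return any(module == p or module.startswith(p + ".") for p in FORBIDDEN_PREFIXES)


def forbidden_imports(source: str) -> list[str]:
    """Return the sorted, de-duplicated excluded modules that `source` imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            if _is_forbidden(node.module):
                candidates = [node.module]
            else:
                # Catch `from src.tts import waxal_tts` style imports too.
                candidates = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        found.update(m for m in candidates if _is_forbidden(m))
    return sorted(found)
```

A test could then read app_minimal.py and assert that `forbidden_imports(...)` returns an empty list. Note that this only checks direct imports in the scanned file, not what the allowed modules import transitively.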

Build a single Gradio interface with one tab: microphone input, audio output, and the transcript visible for debugging. Aim for roughly 150-200 lines total. Open the file with a module docstring explaining what it is:

"""Minimal baseline Gradio entry point for the Month 1-3 rebuild.

Wires the simplest possible slice: Whisper (zero-shot) -> Qwen -> MMS-TTS.
No LoRA adapters, no memory loop, no speaker ID, no voice cloning.
Used for field testing and building a real-user eval set.
See docs/baseline_rebuild.md for the plan this fits into.
"""

Step 3 — Add the evaluation infrastructure

Why. This is the single most load-bearing deliverable of the rebuild. Without a real-user eval set, every subsequent decision is speculation. The eval set is what turns "I think this change helps" into "I measured this change helped." It also makes the LoRA Kaggle training work (Stage C continuation) scientifically valid whenever you get back to it.

How. Create the folder structure:

data/eval/
  bambara_field.jsonl        # the eval manifest — starts empty
  audio/                     # the actual wav files (gitignore large files; keep manifest in git)
  README.md                  # recording protocol
scripts/
  eval_baseline.py           # runs minimal stack against manifest, emits metrics
docs/
  eval_protocol.md           # how to add a new recording, quality criteria
  metrics.md                 # where baseline numbers are recorded

The JSONL manifest format:

{"audio_path": "audio/speaker01_001.wav", "transcript": "ji be min?", "speaker_id": "speaker01", "region": "Bamako", "noise": "quiet", "duration_s": 2.3}

scripts/eval_baseline.py loads the manifest, runs each audio file through Whisper-large-v3-turbo (zero-shot, no adapter), compares the output to the ground-truth transcript, and prints WER and CER per speaker and overall. It also prints a few failure cases for inspection. This script becomes your standard measurement harness: every future change gets compared against the same manifest.
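
A minimal sketch of the core of such a script, under stated assumptions: the ASR call is passed in as a `transcribe` callable rather than tied to a specific Whisper API, WER and CER are pooled (total edits divided by total reference length), CER counts spaces as characters, and no text normalization is applied before comparison. The names `parse_manifest`, `edit_distance`, and `evaluate` are hypothetical.

```python
import json
from collections import defaultdict

REQUIRED_FIELDS = ("audio_path", "transcript", "speaker_id", "region", "noise", "duration_s")


def parse_manifest(lines):
    """Parse JSONL manifest lines into dicts; raise ValueError on a malformed row."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines; the manifest starts empty
        row = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        records.append(row)
    return records


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def evaluate(records, transcribe):
    """Return (rates, failures); rates maps each speaker_id and "overall" to WER/CER."""
    totals = defaultdict(lambda: {"word_err": 0, "words": 0, "char_err": 0, "chars": 0})
    failures = []
    for rec in records:
        ref, hyp = rec["transcript"], transcribe(rec["audio_path"])
        ref_words = ref.split()
        word_err = edit_distance(ref_words, hyp.split())
        char_err = edit_distance(ref, hyp)
        if hyp != ref:
            failures.append((rec["audio_path"], ref, hyp))
        for key in (rec["speaker_id"], "overall"):
            t = totals[key]
            t["word_err"] += word_err
            t["words"] += len(ref_words)
            t["char_err"] += char_err
            t["chars"] += len(ref)
    rates = {
        key: {"wer": t["word_err"] / max(t["words"], 1),
              "cer": t["char_err"] / max(t["chars"], 1)}
        for key, t in totals.items()
    }
    return rates, failures
```

In the real script, both reference and hypothesis would probably go through the orthography normalizer before comparison so that spelling variants are not counted as recognition errors; that step is left out here.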

Step 4 — Collect real recordings (the only human-gated step)

Why. This is where the rebuild touches reality: three to five native speakers, using their actual phones, in their actual environments, recording fifteen to twenty utterances each in the agricultural domain you scoped. The recording conditions have to be real, or the eval set will give you FLEURS-like numbers that lie to you.

How. Write a recording script with 50-100 prompts covering:

- Greetings and politeness formulas (baseline; should be easy)
- Agricultural queries the product actually needs to handle ("how wet is the soil," "when should I water the tomatoes," "is there a pest alert")
- Vocabulary you know is underrepresented in FLEURS (crop names, tool names, regional agricultural terms)
- A few natural code-switch utterances (Bambara with French loanwords)

Share the script via WhatsApp and have speakers reply with voice messages, or have them record in a free mobile app that returns wav or m4a. Transcribe by hand (or by LLM with manual correction). Commit the JSONL manifest to the repo; upload the audio to a private HF dataset to avoid bloating git history.
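
Once a recording has been transcribed, its manifest row can be generated rather than typed by hand, which keeps `duration_s` accurate. The sketch below assumes the clip is already a wav file (m4a or WhatsApp voice notes would need converting first, for example with ffmpeg) and uses only the standard-library `wave` module. The helper name `manifest_row` is hypothetical.

```python
import json
import wave


def manifest_row(wav_path, rel_audio_path, transcript, speaker_id, region, noise):
    """Build one JSONL manifest line for a wav recording.

    wav_path is where the file lives locally; rel_audio_path is the path
    stored in the manifest (for example "audio/speaker01_001.wav").
    """
    with wave.open(wav_path, "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()
    row = {
        "audio_path": rel_audio_path,
        "transcript": transcript,
        "speaker_id": speaker_id,
        "region": region,
        "noise": noise,
        "duration_s": round(duration_s, 2),
    }
    # ensure_ascii=False keeps non-ASCII Bambara letters readable in the manifest.
    return json.dumps(row, ensure_ascii=False)
```

Appending the returned line to data/eval/bambara_field.jsonl keeps the manifest in the same shape as the example row in Step 3.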

Set a target: at least 50 utterances across at least 3 speakers before running your first baseline eval. More is better, but 50 is the usable floor.
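
That floor is easy to check mechanically before spending time on a baseline run. A small sketch, assuming the manifest rows have already been loaded as dicts with a `speaker_id` key; the function name `coverage_report` is hypothetical, and the defaults mirror the target above.

```python
from collections import Counter


def coverage_report(records, min_utterances=50, min_speakers=3):
    """Summarize eval-set coverage and say whether it meets the floor."""
    per_speaker = Counter(rec["speaker_id"] for rec in records)
    return {
        "utterances": len(records),
        "speakers": len(per_speaker),
        "per_speaker": dict(per_speaker),
        "ready": len(records) >= min_utterances and len(per_speaker) >= min_speakers,
    }
```

The per-speaker counts are worth a glance even when `ready` is true: 50 utterances dominated by one speaker meets the floor on paper while skewing the overall WER toward that voice.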

Step 5 — Deploy the minimal Space

Why. A second HF Space running app_minimal.py in parallel with the main Space gives testers a stripped-down version to react to. Comparing two Spaces teaches you which features in the main app are actually pulling weight — if minimal gets the same "I'd use this" reaction as the full version, most of the fancy work isn't load-bearing for first-use value (which doesn't mean it's wrong, just that adoption doesn't depend on it).

How. Create a new Space, e.g. ous-sow/sahel-voice-minimal. Set the Space entry point (the app_file field in the Space's README.md metadata) to app_minimal.py. Keep packages.txt unchanged (ffmpeg is still needed). In requirements.txt, consider a trimmed version that doesn't pull in voice cloning or training-only deps; this is a chance to get the minimal Space to cold-boot faster.

Add basic session logging: every interaction writes a row to a HF dataset ous-sow/sahel-agri-field-logs with fields {timestamp, speaker_opt_in_id, audio_hash, transcript, llm_reply, tts_audio_hash, latency_ms}. Show opt-in consent text in the UI, and log no PII. This logging is what will feed your future training data and answer "are users actually coming back."
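
A minimal sketch of building one such row, with several assumptions labeled: hashes are SHA-256 over the raw audio bytes, timestamps are ISO 8601 in UTC, and a turn without an opt-in ID is not logged at all. The function name `build_log_row` is hypothetical, and how rows get pushed to the dataset is deliberately left out.

```python
import hashlib
from datetime import datetime, timezone


def build_log_row(speaker_opt_in_id, audio_in, transcript, llm_reply,
                  audio_out, latency_ms, now=None):
    """Return a log row dict, or None when the user has not opted in.

    audio_in and audio_out are raw bytes; only their hashes are stored,
    never the audio itself.
    """
    if not speaker_opt_in_id:
        return None  # assumption: no consent means no row
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "timestamp": timestamp,
        "speaker_opt_in_id": speaker_opt_in_id,
        "audio_hash": hashlib.sha256(audio_in).hexdigest(),
        "transcript": transcript,
        "llm_reply": llm_reply,
        "tts_audio_hash": hashlib.sha256(audio_out).hexdigest(),
        "latency_ms": round(float(latency_ms), 1),
    }
```

The opt-in ID should be a random token issued at consent time rather than anything derived from a name or phone number, so that per-tester return rates can be computed without storing identity.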

Step 6 — Run the field test

Why. The whole rebuild exists to get this step done. Everything before it is scaffolding; everything after it is informed by what happens in it. The success metric is not WER. It is: do the testers ask a second question they came up with themselves? That is the fastest signal that tells you whether this is a product or a demo.

How. Five testers, two weeks. WhatsApp intro: here is the link, please try to ask about soil or weather in Bambara, tell me anything weird. No coaching on phrasing. At the end of week 1 and week 2, ask each tester three questions: what worked, what failed, would you come back tomorrow. Record the answers. Do not reduce them to spreadsheet metrics; write them up in plain language in a short note at docs/field_test_notes_YYYY-MM-DD.md.

In parallel, the session logs from Step 5 accumulate. At the end of two weeks, run a small analysis: median latency, distribution of utterance lengths, most common failure utterances, return rate per tester.
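
A sketch of that analysis over the logged rows, using only the fields from Step 5. Three assumptions: utterance length is measured in transcript words because the log schema has no duration field, "return rate" is approximated as the number of distinct calendar days each tester was active, and timestamps are ISO 8601 strings. Failure utterances are not covered, since the log schema carries no failure label; those would come from the interview notes or manual review. The name `summarize_logs` is hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median


def summarize_logs(rows):
    """Summarize field-test session logs (a list of dicts in the Step 5 schema)."""
    latencies = [row["latency_ms"] for row in rows]
    length_counts = Counter(len(row["transcript"].split()) for row in rows)
    active_days = defaultdict(set)
    for row in rows:
        # The first 10 characters of an ISO 8601 timestamp are YYYY-MM-DD.
        active_days[row["speaker_opt_in_id"]].add(row["timestamp"][:10])
    return {
        "median_latency_ms": median(latencies) if latencies else None,
        "utterance_length_words": dict(sorted(length_counts.items())),
        "active_days_per_tester": {t: len(d) for t, d in active_days.items()},
        "returning_testers": sorted(t for t, d in active_days.items() if len(d) > 1),
    }
```

With five testers the numbers are small enough to read directly; the point is to have the same summary available at the end of week 1 and week 2 so the two can be compared.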

Step 7 — Selective reintegration

Why. Now you have evidence. Some of the Stage H features the main app already has will turn out to be essential — users asked for speaker memory, or they wanted the IoT integration enough to keep trying. Other features will turn out to be polish no tester noticed. The rebuild ends not with a big merge but with a prioritized list: which features go back into the critical path immediately, which wait, which get deprecated.

How. Open a small PR from experimental/baseline-rebuild back into main that brings in only the measurement artifacts, meaning the data, the documentation, and the tooling that produces them:

- data/eval/bambara_field.jsonl and the audio reference
- scripts/eval_baseline.py
- docs/eval_protocol.md
- docs/metrics.md with baseline numbers recorded
- docs/field_test_notes_*.md
- The session-logging infrastructure (if you want it in the production Space too, which is usually yes)

Leave app_minimal.py on the branch as a long-lived tool — it's now your smoke-test harness. Don't merge it into main unless it's actively useful there.

From the field test notes, write a short follow-up roadmap document (docs/roadmap_post_field_test.md) that reorders the Month 7+ work based on what you actually learned. The features the testers needed get priority. The features that weren't missed drop in rank.

What NOT to touch during the rebuild

- Production app.py: stays as-is on main. Users continue to see it on the main HF Space.
- The HF dataset ous-sow/sahel-agri-feedback: keep accepting writes from the main app; the minimal Space can also write to it or to a separate one, your call.
- LoRA training infrastructure: fixing the Kaggle crash is important Stage C work but it is not part of this rebuild. Track it as a separate issue. The rebuild uses Whisper zero-shot deliberately, to decouple field testing from training progress.
- All src/ modules: use them, import them, fix bugs in-place if found, but do not rewrite.
- The FastAPI service: leave dormant for the duration. It comes back into focus post-rebuild.

Rough timeline

| Week | Work |
|------|------|
| 1 | Steps 1-2: branch, tag, `app_minimal.py` wired and locally runnable |
| 2 | Step 3: eval infrastructure + script scaffolded; Step 4 started (recording script sent to speakers) |
| 3 | Step 4 continues: first 50 utterances collected, transcribed, committed to eval manifest |
| 4 | Step 3 closed: first baseline eval run, numbers recorded in `docs/metrics.md`; Step 5: minimal Space deployed |
| 5-6 | Step 6: field test runs, logs accumulate, interviews at end of week 5 and week 6 |
| 7 | Step 7: reintegration PR, follow-up roadmap written |

Seven weeks to close the measurement gap, with production untouched the whole time.

One-line summary

The rebuild is a parallel minimal track that collects the real-user signal the project was built without — nothing gets deleted, production keeps shipping, and the reintegration at the end is a PR of data and docs, not code.