# Voice Command Status For Hugging Face Spaces

This document tracks the current voice command pipeline and the remaining work that matters for a public Hugging Face Space.

## Current Pipeline

Voice has two separate stages:

```text
audio
  -> POST /api/speech
  -> dukaan_saathi/integrations/speech.py
  -> MODAL_SPEECH_ENDPOINT or SPEECH_ASR_ENDPOINT
  -> transcript
  -> owner reviews/edits text
  -> _h_voice_command
  -> ReAct stock command tool
  -> pending stock action
  -> owner approves
  -> _h_voice_apply
  -> inventory write
```

The Modal ASR endpoint does speech-to-text only. ReAct starts after text exists.

## Completed

- Field names are normalized to the UI shape:
  - `action`
  - `product`
  - `product_id`
  - `quantity`
  - `unit`
  - `confidence`
  - `trace`
- `add_stock` and `set_stock` are both handled.
- The parser uses the returned `product_id`; it does not re-match blindly.
- Parsed commands no longer auto-apply.
- The UI shows a pending parsed action and requires **Approve stock change**.
- `_h_voice_apply` is the only custom FastAPI voice handler that writes stock.
- Modal cold-start copy is visible and `/api/warm` runs best-effort on page load.
- Safety tests cover parse-without-write and apply-with-write.

## Current Limitations

| Gap | Impact on HF Space |
|-----|--------------------|
| Deterministic command parser | Reliable for seeded/demo examples, weaker for natural Telugu/code-mix. |
| Limited product aliases | Commands such as "tamatar" need aliases or NLU to map to seeded products. |
| Modal ASR cold start | First request may take 10-30 seconds unless endpoint is warm. |
| Ephemeral SQLite | Approved stock changes may reset on Space rebuild unless persistent storage is enabled. |

## Recommended Next Steps

### 1. Keep deterministic parser as the default

For the hackathon/public Space, deterministic parsing is safer and easier to debug.
Continue using seeded examples that map to inventory:

```text
add Bun 12
set OBM stock 5
add Bingo 4
Happy Happy low
```

Do not bypass owner approval to make voice feel more automatic.

### 2. Add optional HF Inference voice NLU

If Telugu/code-mixed commands are important for the Space demo, add an optional HF Inference path behind a feature flag:

```text
VOICE_LLM_BACKEND=keyword | hf_inference
HF_VOICE_NLU_MODEL_REPO=...
```

The output contract should stay the same:

```json
{
  "action": "add_stock|set_stock|mark_out_of_stock|unknown",
  "product_name": "string or null",
  "product_id": "string or null",
  "quantity": "number or null",
  "unit": "string or null",
  "confidence": "low|medium|high"
}
```

Fall back to the deterministic parser on malformed JSON, low confidence, missing product match, timeout, or missing env vars.

### 3. Improve aliases before adding broad NLU

For a constrained demo, aliases often beat another model call:

- Add common transliterations for seeded products.
- Keep examples aligned to seeded inventory.
- Add parser tests for each new alias.

### 4. Preserve the approval gate

Any voice NLU path must still produce only a pending action:

```text
model/parser output -> pending action -> owner approval -> inventory write
```

No model, parser, or ReAct step may write inventory directly.

## Tests To Keep

- Voice parse does not change stock.
- Voice apply changes stock.
- Unknown/low-confidence commands do not expose an approval button.
- Malformed model output falls back or returns `unknown`.
- Missing Modal ASR endpoint produces a useful UI error, not a crash.
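
## Fallback Sketch

The fallback rule from step 2 (optional HF Inference voice NLU) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: `nlu_enabled`, `validate_nlu_output`, `parse_voice_command`, and the `nlu_call` / `keyword_parse` callables are hypothetical names. Only the env var names and the JSON contract fields come from this document. The sketch returns a command dict only; it never writes inventory, so the approval gate is untouched.

```python
import json
from typing import Any, Callable, Dict, Mapping, Optional

ALLOWED_ACTIONS = {"add_stock", "set_stock", "mark_out_of_stock", "unknown"}
ALLOWED_CONFIDENCE = {"low", "medium", "high"}

Command = Dict[str, Any]


def nlu_enabled(env: Mapping[str, str]) -> bool:
    """True only if the feature flag is on and a model repo is configured."""
    return (
        env.get("VOICE_LLM_BACKEND", "keyword") == "hf_inference"
        and bool(env.get("HF_VOICE_NLU_MODEL_REPO"))
    )


def validate_nlu_output(raw: Optional[str]) -> Optional[Command]:
    """Return the parsed command if it is usable, otherwise None."""
    try:
        data = json.loads(raw)  # raises on malformed JSON or a None input
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    if data.get("action") not in ALLOWED_ACTIONS:
        return None
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        return None
    if data["confidence"] == "low":
        return None  # low confidence: let the deterministic parser decide
    if not data.get("product_id"):
        return None  # missing product match
    return data


def parse_voice_command(
    text: str,
    nlu_call: Callable[[str], str],        # hypothetical HF Inference wrapper
    keyword_parse: Callable[[str], Command],  # hypothetical deterministic parser
    env: Mapping[str, str],
) -> Command:
    """Try the optional NLU path, then fall back to the deterministic parser."""
    if nlu_enabled(env):
        try:
            raw = nlu_call(text)
        except Exception:
            raw = None  # timeouts and transport errors fall through
        command = validate_nlu_output(raw)
        if command is not None:
            return command
    return keyword_parse(text)
```

In this sketch, passing `os.environ` as `env` would read the real flags. Taking the mapping as a parameter keeps the function easy to test without patching the process environment.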