| # Voice Command Status For Hugging Face Spaces |
|
|
| This document tracks the current voice command pipeline and the remaining work |
| that matters for a public Hugging Face Space. |
|
|
| ## Current Pipeline |
|
|
| Voice handling has two separate stages, speech-to-text and then command parsing with owner approval: |
|
|
| ```text |
| audio |
| -> POST /api/speech |
| -> dukaan_saathi/integrations/speech.py |
| -> MODAL_SPEECH_ENDPOINT or SPEECH_ASR_ENDPOINT |
| -> transcript |
| -> owner reviews/edits text |
| -> _h_voice_command |
| -> ReAct stock command tool |
| -> pending stock action |
| -> owner approves |
| -> _h_voice_apply |
| -> inventory write |
| ``` |
|
|
| The Modal ASR endpoint does speech-to-text only. The ReAct step starts only after the owner has reviewed the transcript text. |
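
As a rough illustration of the endpoint lookup in the diagram above, here is a minimal sketch. It assumes `MODAL_SPEECH_ENDPOINT` takes precedence over `SPEECH_ASR_ENDPOINT` (the diagram only says "or"), and both `resolve_asr_endpoint` and `SpeechConfigError` are hypothetical names, not the actual contents of `dukaan_saathi/integrations/speech.py`.

```python
import os
from typing import Mapping, Optional


class SpeechConfigError(RuntimeError):
    """Raised when no ASR endpoint is configured (hypothetical name)."""


def resolve_asr_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first configured ASR endpoint URL.

    Assumption: MODAL_SPEECH_ENDPOINT wins over SPEECH_ASR_ENDPOINT when
    both are set. Blank values are treated as unset.
    """
    env = os.environ if env is None else env
    for name in ("MODAL_SPEECH_ENDPOINT", "SPEECH_ASR_ENDPOINT"):
        value = (env.get(name) or "").strip()
        if value:
            return value
    # A clear message lets the UI show a useful error instead of crashing.
    raise SpeechConfigError(
        "Speech-to-text is not configured: set MODAL_SPEECH_ENDPOINT "
        "or SPEECH_ASR_ENDPOINT."
    )
```

A handler could catch `SpeechConfigError` and return its message to the UI, which would line up with the "useful UI error, not a crash" test listed at the end of this document.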
|
|
| ## Completed |
|
|
| - Field names are normalized to the UI shape: `action`, `product`, `product_id`, `quantity`, `unit`, `confidence`, `trace`. |
| - `add_stock` and `set_stock` are both handled. |
| - The parser uses the returned `product_id` directly; it does not blindly re-match the product by name. |
| - Parsed commands no longer auto-apply. |
| - The UI shows a pending parsed action and requires **Approve stock change**. |
| - `_h_voice_apply` is the only custom FastAPI voice handler that writes stock. |
| - Modal cold-start messaging is shown in the UI, and `/api/warm` is called best-effort on page load. |
| - Safety tests cover parse-without-write and apply-with-write. |
|
|
| ## Current Limitations |
|
|
| | Gap | Impact on HF Space | |
| |-----|--------------------| |
| | Deterministic command parser | Reliable for seeded/demo examples, weaker on natural Telugu or code-mixed speech. | |
| | Limited product aliases | Commands such as "tamatar" need aliases or NLU to map to seeded products. | |
| | Modal ASR cold start | The first request may take 10-30 seconds unless the endpoint is already warm. | |
| | Ephemeral SQLite | Approved stock changes may reset on Space rebuild unless persistent storage is enabled. | |
|
|
| ## Recommended Next Steps |
|
|
| ### 1. Keep the deterministic parser as the default |
|
|
| For the hackathon/public Space, deterministic parsing is safer and easier to |
| debug. Continue using seeded examples that map to inventory: |
|
|
| ```text |
| add Bun 12 |
| set OBM stock 5 |
| add Bingo 4 |
| Happy Happy low |
| ``` |
|
|
| Do not bypass owner approval to make voice feel more automatic. |
|
|
| ### 2. Add optional HF Inference voice NLU |
|
|
| If Telugu/code-mixed commands are important for the Space demo, add an optional |
| HF Inference path behind a feature flag: |
|
|
| ```text |
| VOICE_LLM_BACKEND=keyword | hf_inference |
| HF_VOICE_NLU_MODEL_REPO=... |
| ``` |
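
One possible way to read that flag is sketched below. It assumes `keyword` names the existing deterministic parser and that an unset or unrecognized value should land on it; `select_voice_backend` is a hypothetical helper, not existing project code.

```python
from typing import Mapping

VALID_BACKENDS = {"keyword", "hf_inference"}


def select_voice_backend(env: Mapping[str, str]) -> str:
    """Pick the voice NLU backend, defaulting to the deterministic one."""
    backend = env.get("VOICE_LLM_BACKEND", "keyword").strip().lower()
    if backend not in VALID_BACKENDS:
        # Unknown value: stay on the safe default rather than failing.
        return "keyword"
    if backend == "hf_inference" and not env.get("HF_VOICE_NLU_MODEL_REPO", "").strip():
        # The flag is on but no model repo is set: stay on the safe default.
        return "keyword"
    return backend
```

Passing the environment in as a mapping (for example `os.environ`) keeps the helper easy to test without touching process state.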
|
|
| The output contract should stay the same: |
|
|
| ```json |
| { |
| "action": "add_stock|set_stock|mark_out_of_stock|unknown", |
| "product_name": "string or null", |
| "product_id": "string or null", |
| "quantity": "number or null", |
| "unit": "string or null", |
| "confidence": "low|medium|high" |
| } |
| ``` |
|
|
| Fall back to the deterministic parser on malformed JSON, low confidence, a missing |
| product match, a timeout, or missing env vars. |
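
The fallback rules above could be sketched roughly as follows. The helper names (`validate_nlu_output`, `parse_command`) and the `call_model` / `deterministic_parse` callables are assumptions for illustration; the real model client and parser are not shown in this document. The sketch treats any exception from the model call, including a timeout, as a reason to fall back, and it does not validate `quantity` or `unit`.

```python
import json
from typing import Any, Callable, Dict, Optional

ALLOWED_ACTIONS = {"add_stock", "set_stock", "mark_out_of_stock", "unknown"}
ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def validate_nlu_output(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed command if usable, else None to signal fallback."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None  # malformed JSON (or a non-string payload)
    if not isinstance(data, dict):
        return None
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        return None
    confidence = data.get("confidence")
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        return None
    if confidence == "low":
        return None  # low confidence
    if not data.get("product_id"):
        return None  # missing product match
    return data


def parse_command(
    text: str,
    call_model: Callable[[str], str],
    deterministic_parse: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Try the model first; fall back to the deterministic parser."""
    try:
        raw = call_model(text)
    except Exception:
        # Timeouts, network errors, and missing configuration all land here.
        return deterministic_parse(text)
    return validate_nlu_output(raw) or deterministic_parse(text)
```

Either path returns only a parsed command; nothing in this sketch writes inventory.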
|
|
| ### 3. Improve aliases before adding broad NLU |
|
|
| For a constrained demo, aliases often beat another model call: |
|
|
| - Add common transliterations for seeded products. |
| - Keep examples aligned to seeded inventory. |
| - Add parser tests for each new alias. |
|
|
| ### 4. Preserve the approval gate |
|
|
| Any voice NLU path must still produce only a pending action: |
|
|
| ```text |
| model/parser output -> pending action -> owner approval -> inventory write |
| ``` |
|
|
| No model, parser, or ReAct step may write inventory directly. |
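
To make the gate concrete, here is an in-memory sketch of the parse, approve, apply flow. `VoiceStockGate` is a hypothetical stand-in: it is not the real `_h_voice_command` or `_h_voice_apply` handler, and a plain dict replaces the inventory database. It only covers `add_stock` and `set_stock`.

```python
import uuid
from typing import Dict


class VoiceStockGate:
    """In-memory stand-in for the parse -> approve -> apply flow."""

    def __init__(self, inventory: Dict[str, float]) -> None:
        self.inventory = inventory          # product_id -> quantity
        self.pending: Dict[str, dict] = {}  # pending_id -> parsed action

    def propose(self, action: dict) -> str:
        """Store a parsed action as pending. Never touches inventory."""
        pending_id = uuid.uuid4().hex
        self.pending[pending_id] = dict(action)
        return pending_id

    def apply(self, pending_id: str) -> float:
        """Apply a pending action after owner approval; the only write path."""
        action = self.pending.pop(pending_id)  # KeyError if unknown or reused
        pid, qty = action["product_id"], action["quantity"]
        if action["action"] == "add_stock":
            self.inventory[pid] = self.inventory.get(pid, 0) + qty
        elif action["action"] == "set_stock":
            self.inventory[pid] = qty
        else:
            raise ValueError(f"unsupported action: {action['action']}")
        return self.inventory[pid]
```

Because `apply` removes the pending entry before writing, approving the same action twice raises `KeyError` instead of adding the stock a second time.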
|
|
| ## Tests To Keep |
|
|
| - Voice parse does not change stock. |
| - Voice apply changes stock. |
| - Unknown/low-confidence commands do not expose an approval button. |
| - Malformed model output falls back or returns `unknown`. |
| - Missing Modal ASR endpoint produces a useful UI error, not a crash. |
|
|