Voice Command Status For Hugging Face Spaces
This document tracks the current voice command pipeline and the remaining work that matters for a public Hugging Face Space.
Current Pipeline
Voice has two separate stages, speech-to-text (ASR) followed by text command parsing:
audio
-> POST /api/speech
-> dukaan_saathi/integrations/speech.py
-> MODAL_SPEECH_ENDPOINT or SPEECH_ASR_ENDPOINT
-> transcript
-> owner reviews/edits text
-> _h_voice_command
-> ReAct stock command tool
-> pending stock action
-> owner approves
-> _h_voice_apply
-> inventory write
The Modal ASR endpoint does speech-to-text only. ReAct starts after text exists.
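As a minimal sketch of how the speech stage might pick its endpoint, the helper below reads the two environment variables named in the pipeline. The function name, the error type, and the precedence (`MODAL_SPEECH_ENDPOINT` before `SPEECH_ASR_ENDPOINT`) are assumptions for illustration, not the actual contents of `dukaan_saathi/integrations/speech.py`.

```python
from typing import Mapping


class SpeechEndpointMissing(RuntimeError):
    """Hypothetical error type: no ASR endpoint is configured."""


def resolve_asr_endpoint(env: Mapping[str, str]) -> str:
    """Return the first non-empty ASR endpoint URL found in env."""
    # Assumed precedence: the Modal endpoint wins when both are set.
    for name in ("MODAL_SPEECH_ENDPOINT", "SPEECH_ASR_ENDPOINT"):
        value = env.get(name, "").strip()
        if value:
            return value
    # A clear message lets the UI show a useful error instead of crashing.
    raise SpeechEndpointMissing(
        "No speech endpoint configured: set MODAL_SPEECH_ENDPOINT "
        "or SPEECH_ASR_ENDPOINT."
    )
```

In the app this would typically be called as `resolve_asr_endpoint(os.environ)`, with the `/api/speech` handler translating the exception into an error response for the UI.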
Completed
- Field names are normalized to the UI shape: `action`, `product`, `product_id`, `quantity`, `unit`, `confidence`, `trace`.
- `add_stock` and `set_stock` are both handled.
- The parser uses the returned `product_id`; it does not re-match blindly.
- Parsed commands no longer auto-apply.
- The UI shows a pending parsed action and requires "Approve stock change".
- `_h_voice_apply` is the only custom FastAPI voice handler that writes stock.
- Modal cold-start copy is visible and `/api/warm` runs best-effort on page load.
- Safety tests cover parse-without-write and apply-with-write.
Current Limitations
| Gap | Impact on HF Space |
|---|---|
| Deterministic command parser | Reliable for seeded/demo examples, weaker for natural Telugu/code-mix. |
| Limited product aliases | Commands such as "tamatar" need aliases or NLU to map to seeded products. |
| Modal ASR cold start | First request may take 10-30 seconds unless the endpoint is warm. |
| Ephemeral SQLite | Approved stock changes may reset on Space rebuild unless persistent storage is enabled. |
Recommended Next Steps
1. Keep deterministic parser as the default
For the hackathon/public Space, deterministic parsing is safer and easier to debug. Continue using seeded examples that map to inventory:
add Bun 12
set OBM stock 5
add Bingo 4
Happy Happy low
Do not bypass owner approval to make voice feel more automatic.
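A deterministic parser for the `add` and `set` examples above could look roughly like the sketch below. The product IDs, the regular expression, and the function name are hypothetical; the low-stock phrasing ("Happy Happy low") is left unhandled here because this document does not say which action it maps to.

```python
import re
from typing import Dict, Optional

# Hypothetical seeded inventory: lowercase product name -> product_id.
SEEDED_PRODUCTS: Dict[str, str] = {
    "bun": "p-bun",
    "obm": "p-obm",
    "bingo": "p-bingo",
}

# Matches "add Bun 12" and "set OBM stock 5":
# verb, product name, optional literal "stock", integer quantity.
COMMAND_RE = re.compile(
    r"^(add|set)\s+(.+?)(?:\s+stock)?\s+(\d+)$", re.IGNORECASE
)


def parse_command(text: str) -> Dict[str, Optional[object]]:
    """Parse one typed or transcribed command; never writes inventory."""
    unknown: Dict[str, Optional[object]] = {
        "action": "unknown", "product_name": None, "product_id": None,
        "quantity": None, "unit": None, "confidence": "low",
    }
    match = COMMAND_RE.match(text.strip())
    if not match:
        return unknown
    verb, name, quantity = match.groups()
    product_id = SEEDED_PRODUCTS.get(name.strip().lower())
    if product_id is None:
        # No seeded product matched, so do not guess.
        return unknown
    return {
        "action": f"{verb.lower()}_stock",
        "product_name": name.strip(),
        "product_id": product_id,
        "quantity": int(quantity),
        "unit": None,
        "confidence": "high",
    }
```

Because the parser only returns a dictionary, every result still has to pass through the owner approval step before anything is written.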
2. Add optional HF Inference voice NLU
If Telugu/code-mixed commands are important for the Space demo, add an optional HF Inference path behind a feature flag:
VOICE_LLM_BACKEND=keyword | hf_inference
HF_VOICE_NLU_MODEL_REPO=...
The output contract should stay the same:
{
"action": "add_stock|set_stock|mark_out_of_stock|unknown",
"product_name": "string or null",
"product_id": "string or null",
"quantity": "number or null",
"unit": "string or null",
"confidence": "low|medium|high"
}
Fall back to the deterministic parser on malformed JSON, low confidence, missing product match, timeout, or missing env vars.
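The fallback rules can be sketched as a thin wrapper around the two parsers. Everything here is an assumption for illustration: `nlu_call` stands in for whatever function would call HF Inference and return raw text, `keyword_parse` stands in for the deterministic parser, and a `None` `nlu_call` represents missing env vars.

```python
import json
from typing import Callable, Dict, Optional

ALLOWED_ACTIONS = ("add_stock", "set_stock", "mark_out_of_stock", "unknown")
ALLOWED_CONFIDENCE = ("low", "medium", "high")

Parsed = Dict[str, object]


def validate_nlu_output(raw: object) -> Optional[Parsed]:
    """Return the decoded dict if it is usable, otherwise None."""
    try:
        data = json.loads(raw)  # malformed JSON or non-string input
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    if data.get("action") not in ALLOWED_ACTIONS:
        return None
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        return None
    # Low confidence and a missing product match both trigger fallback.
    if data["confidence"] == "low" or not data.get("product_id"):
        return None
    return data


def parse_voice_text(
    text: str,
    backend: str,
    nlu_call: Optional[Callable[[str], str]],
    keyword_parse: Callable[[str], Parsed],
) -> Parsed:
    """Try the optional NLU path, then fall back to the keyword parser."""
    if backend == "hf_inference" and nlu_call is not None:
        try:
            parsed = validate_nlu_output(nlu_call(text))
        except Exception:
            # Timeouts and transport errors are treated like bad output.
            parsed = None
        if parsed is not None:
            return parsed
    return keyword_parse(text)
```

Keeping validation separate from the model call means the same checks can be unit-tested with canned strings, without network access.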
3. Improve aliases before adding broad NLU
For a constrained demo, aliases often beat another model call:
- Add common transliterations for seeded products.
- Keep examples aligned to seeded inventory.
- Add parser tests for each new alias.
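One possible shape for an alias table, shown only as a sketch: the product names and transliterations below are hypothetical placeholders, not the real seeded inventory, and `canonical_product` is an illustrative helper name.

```python
from typing import Dict, Optional

# Hypothetical seeded product names.
SEEDED_NAMES = ("Tomato", "Bun")

# Hypothetical aliases: lowercase spoken form -> seeded product name.
PRODUCT_ALIASES: Dict[str, str] = {
    "tamatar": "Tomato",
    "tamata": "Tomato",
    "ban": "Bun",
}


def canonical_product(spoken: str) -> Optional[str]:
    """Map a spoken product name to a seeded name, or None if unknown."""
    # Lowercase and collapse whitespace before any lookup.
    key = " ".join(spoken.lower().split())
    for name in SEEDED_NAMES:
        if name.lower() == key:
            return name
    return PRODUCT_ALIASES.get(key)
```

A parser test for each new alias can then be a one-line assertion, for example `assert canonical_product("tamatar") == "Tomato"`, plus a check that every alias value is still a seeded name.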
4. Preserve the approval gate
Any voice NLU path must still produce only a pending action:
model/parser output -> pending action -> owner approval -> inventory write
No model, parser, or ReAct step may write inventory directly.
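The gate can be illustrated with a small in-memory sketch in which proposing an action and applying it are separate calls. The class and method names are hypothetical and the dictionary stands in for the real inventory store; only `add_stock` and `set_stock` are modeled.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PendingAction:
    action: str
    product_id: str
    quantity: int


@dataclass
class VoiceStockGate:
    inventory: Dict[str, int]
    pending: Dict[str, PendingAction] = field(default_factory=dict)

    def propose(self, action: str, product_id: str, quantity: int) -> str:
        """Record a pending action and return its token. No stock write."""
        if action not in ("add_stock", "set_stock"):
            raise ValueError(f"unsupported action: {action}")
        token = uuid.uuid4().hex
        self.pending[token] = PendingAction(action, product_id, quantity)
        return token

    def approve(self, token: str) -> int:
        """Apply a pending action once; the only method that writes stock."""
        # pop() raises KeyError for unknown or already-applied tokens.
        act = self.pending.pop(token)
        current = self.inventory.get(act.product_id, 0)
        if act.action == "add_stock":
            new_quantity = current + act.quantity
        else:  # set_stock
            new_quantity = act.quantity
        self.inventory[act.product_id] = new_quantity
        return new_quantity
```

Removing the token on approval means a repeated approve request cannot apply the same change twice.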
Tests To Keep
- Voice parse does not change stock.
- Voice apply changes stock.
- Unknown/low-confidence commands do not expose an approval button.
- Malformed model output falls back or returns `unknown`.
- Missing Modal ASR endpoint produces a useful UI error, not a crash.
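The approval-button rule is easy to test if it lives in one predicate. The sketch below assumes the parsed result is a dictionary following the output contract; `can_offer_approval` is a hypothetical name, and treating a missing `confidence` or `product_id` as not approvable is an assumption.

```python
from typing import Mapping


def can_offer_approval(parsed: Mapping[str, object]) -> bool:
    """Decide whether the UI may show an approval button for this result."""
    if parsed.get("action") in (None, "unknown"):
        return False
    # Only medium or high confidence is approvable; missing counts as low.
    if parsed.get("confidence") not in ("medium", "high"):
        return False
    # Without a matched product there is nothing safe to approve.
    return bool(parsed.get("product_id"))
```

A test for the rule can then assert on plain dictionaries, for example `assert not can_offer_approval({"action": "unknown", "confidence": "high", "product_id": "p-bun"})`.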