# Voice Command Status For Hugging Face Spaces
This document tracks the current voice command pipeline and the remaining work
that matters for a public Hugging Face Space.
## Current Pipeline
Voice runs in two separate stages: speech-to-text (ASR), then command parsing and owner approval on the resulting text:
```text
audio
-> POST /api/speech
-> dukaan_saathi/integrations/speech.py
-> MODAL_SPEECH_ENDPOINT or SPEECH_ASR_ENDPOINT
-> transcript
-> owner reviews/edits text
-> _h_voice_command
-> ReAct stock command tool
-> pending stock action
-> owner approves
-> _h_voice_apply
-> inventory write
```
The Modal ASR endpoint does speech-to-text only. ReAct starts after text exists.
## Completed
- Field names are normalized to the UI shape:
  - `action`
  - `product`
  - `product_id`
  - `quantity`
  - `unit`
  - `confidence`
  - `trace`
- `add_stock` and `set_stock` are both handled.
- The parser uses the returned `product_id`; it does not re-match blindly.
- Parsed commands no longer auto-apply.
- The UI shows a pending parsed action and requires **Approve stock change**.
- `_h_voice_apply` is the only custom FastAPI voice handler that writes stock.
- Modal cold-start messaging is visible in the UI, and `/api/warm` runs best-effort on page load.
- Safety tests cover parse-without-write and apply-with-write.
## Current Limitations
| Gap | Impact on HF Space |
|-----|--------------------|
| Deterministic command parser | Reliable for seeded/demo examples, weaker for natural Telugu/code-mix. |
| Limited product aliases | Commands such as "tamatar" need aliases or NLU to map to seeded products. |
| Modal ASR cold start | The first request may take 10-30 seconds unless the endpoint is already warm. |
| Ephemeral SQLite | Approved stock changes may reset on Space rebuild unless persistent storage is enabled. |
## Recommended Next Steps
### 1. Keep deterministic parser as the default
For the hackathon/public Space, deterministic parsing is safer and easier to
debug. Continue using seeded examples that map to inventory:
```text
add Bun 12
set OBM stock 5
add Bingo 4
Happy Happy low
```
Do not bypass owner approval to make voice feel more automatic.
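
As an illustration of why the deterministic path is easy to debug, here is a minimal sketch of a pattern-based parser for the example phrases above. It is not the project's actual parser: the function name `parse_command`, the regular expressions, and the reduced output shape are assumptions made for this example. It recognizes only `add <product> <quantity>` and `set <product> stock <quantity>`, and it deliberately returns `unknown` for anything else (including "Happy Happy low") rather than guessing.

```python
import re

# Hypothetical patterns for the two seeded command shapes shown above.
ADD_RE = re.compile(r"^add\s+(?P<product>.+?)\s+(?P<qty>\d+)$", re.IGNORECASE)
SET_RE = re.compile(r"^set\s+(?P<product>.+?)\s+stock\s+(?P<qty>\d+)$", re.IGNORECASE)


def parse_command(text):
    """Parse a transcript into a command dict without touching inventory."""
    cleaned = " ".join(text.split())  # collapse repeated whitespace
    for action, pattern in (("add_stock", ADD_RE), ("set_stock", SET_RE)):
        match = pattern.match(cleaned)
        if match:
            return {
                "action": action,
                "product": match.group("product"),
                "quantity": int(match.group("qty")),
            }
    # Anything the patterns do not cover is reported as unknown.
    return {"action": "unknown", "product": None, "quantity": None}
```

Because every outcome traces back to one of two patterns, a failed command can be diagnosed by reading the transcript alone.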
### 2. Add optional HF Inference voice NLU
If Telugu/code-mixed commands are important for the Space demo, add an optional
HF Inference path behind a feature flag:
```text
VOICE_LLM_BACKEND=keyword | hf_inference
HF_VOICE_NLU_MODEL_REPO=...
```
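
One way the flag could be read is sketched below. The helper name `select_voice_backend` is hypothetical, and the sketch assumes that `keyword` denotes the deterministic parser and that an unrecognized value or a missing `HF_VOICE_NLU_MODEL_REPO` should quietly select it.

```python
import os

VALID_BACKENDS = ("keyword", "hf_inference")


def select_voice_backend(env=None):
    """Pick the voice NLU backend, defaulting to the deterministic parser."""
    env = os.environ if env is None else env
    backend = env.get("VOICE_LLM_BACKEND", "keyword").strip().lower()
    if backend not in VALID_BACKENDS:
        return "keyword"  # unrecognized value: stay on the safe default
    if backend == "hf_inference" and not env.get("HF_VOICE_NLU_MODEL_REPO", "").strip():
        return "keyword"  # model repo not configured: cannot call HF Inference
    return backend
```

Passing a plain dict as `env` keeps the function testable without mutating the process environment.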
The output contract should stay the same:
```json
{
  "action": "add_stock|set_stock|mark_out_of_stock|unknown",
  "product_name": "string or null",
  "product_id": "string or null",
  "quantity": "number or null",
  "unit": "string or null",
  "confidence": "low|medium|high"
}
```
Fall back to the deterministic parser on malformed JSON, low confidence, a
missing product match, a timeout, or missing env vars.
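
The fallback rule can be sketched as a thin wrapper around the model call. Everything named here (`validate_nlu_output`, `parse_with_fallback`, and the `nlu_call` and `deterministic_parse` callables) is hypothetical. The sketch assumes the model returns a JSON string, that a timeout or missing configuration surfaces as an exception from `nlu_call`, and that a null `product_id` means no product matched.

```python
import json

ALLOWED_ACTIONS = ("add_stock", "set_stock", "mark_out_of_stock", "unknown")
ALLOWED_CONFIDENCE = ("low", "medium", "high")


def validate_nlu_output(raw_text):
    """Return the command dict if it satisfies the contract, else None."""
    try:
        data = json.loads(raw_text)
    except (TypeError, ValueError):  # not a string, or malformed JSON
        return None
    if not isinstance(data, dict):
        return None
    if data.get("action") not in ALLOWED_ACTIONS:
        return None
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        return None
    quantity = data.get("quantity")
    if quantity is not None and (
        isinstance(quantity, bool) or not isinstance(quantity, (int, float))
    ):
        return None
    return data


def parse_with_fallback(text, nlu_call, deterministic_parse):
    """Use the model output only when it is valid, confident, and matched."""
    try:
        raw = nlu_call(text)
    except Exception:  # timeout, network error, missing configuration
        return deterministic_parse(text)
    command = validate_nlu_output(raw)
    if command is None or command["confidence"] == "low" or not command.get("product_id"):
        return deterministic_parse(text)
    return command
```

Either branch returns a parsed command only; neither writes stock, so the approval step still applies to whatever comes back.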
### 3. Improve aliases before adding broad NLU
For a constrained demo, aliases often beat another model call:
- Add common transliterations for seeded products.
- Keep examples aligned to seeded inventory.
- Add parser tests for each new alias.
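
A minimal sketch of an alias lookup follows, assuming aliases are stored as a plain mapping from a normalized spoken name to a product ID. The table contents, the product IDs, and the function names are hypothetical and are not taken from the seeded inventory.

```python
# Hypothetical alias table: normalized spoken form -> product ID.
ALIASES = {
    "tamatar": "prod_tomato",
    "tamata": "prod_tomato",
    "bun": "prod_bun",
}


def normalize(name):
    """Lower-case (Unicode-aware) and collapse whitespace."""
    return " ".join(name.casefold().split())


def resolve_product(spoken_name, aliases=ALIASES):
    """Return the product ID for a spoken name, or None if no alias matches."""
    return aliases.get(normalize(spoken_name))
```

Each new alias then needs only a one-line parser test, for example `assert resolve_product("Tamatar") == "prod_tomato"`.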
### 4. Preserve the approval gate
Any voice NLU path must still produce only a pending action:
```text
model/parser output -> pending action -> owner approval -> inventory write
```
No model, parser, or ReAct step may write inventory directly.
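
The gate can be illustrated with an in-memory stand-in for the inventory. This is a sketch only: in the app the write happens in `_h_voice_apply`, and the `PendingAction` and `Inventory` classes below are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class PendingAction:
    """A parsed command waiting for the owner's decision."""
    action: str
    product_id: str
    quantity: float
    approved: bool = False

    def approve(self):
        self.approved = True


class Inventory:
    """In-memory stand-in for the real stock store."""

    def __init__(self, stock):
        self._stock = dict(stock)

    def get(self, product_id):
        return self._stock.get(product_id, 0)

    def apply(self, pending):
        # The only write path: refuse anything the owner has not approved.
        if not pending.approved:
            raise PermissionError("owner approval required before writing stock")
        if pending.action == "add_stock":
            self._stock[pending.product_id] = self.get(pending.product_id) + pending.quantity
        elif pending.action == "set_stock":
            self._stock[pending.product_id] = pending.quantity
        else:
            raise ValueError(f"unsupported action: {pending.action}")
```

Keeping the approval check inside the single write method means a new parser or model path cannot skip it by accident.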
## Tests To Keep
- Voice parse does not change stock.
- Voice apply changes stock.
- Unknown/low-confidence commands do not expose an approval button.
- Malformed model output falls back or returns `unknown`.
- Missing Modal ASR endpoint produces a useful UI error, not a crash.
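
The approval-button rule in the list above could be pinned down with a small predicate and matching assertions. The function name `can_offer_approval` is hypothetical, and two points are assumptions rather than documented behavior: that `medium` confidence is enough to offer approval, and that a matched `product_id` is required.

```python
APPROVABLE_ACTIONS = ("add_stock", "set_stock", "mark_out_of_stock")
APPROVABLE_CONFIDENCE = ("medium", "high")  # assumption: "low" never qualifies


def can_offer_approval(command):
    """Return True only when a parsed command is safe to show for approval."""
    return (
        command.get("action") in APPROVABLE_ACTIONS
        and command.get("confidence") in APPROVABLE_CONFIDENCE
        and bool(command.get("product_id"))
    )


# Example checks in the style of the tests listed above.
assert can_offer_approval(
    {"action": "add_stock", "product_id": "p1", "confidence": "high"}
)
assert not can_offer_approval(
    {"action": "unknown", "product_id": None, "confidence": "high"}
)
assert not can_offer_approval(
    {"action": "add_stock", "product_id": "p1", "confidence": "low"}
)
```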