Kirana_AI / docs /plan_voice_command_agent.md


	# Voice Command Status For Hugging Face Spaces

	This document tracks the current voice command pipeline and the remaining work
	that matters for a public Hugging Face Space.

	## Current Pipeline

	Voice handling has two separate stages: speech-to-text (ASR), then command parsing with owner approval:

	```text
	audio
	-> POST /api/speech
	-> dukaan_saathi/integrations/speech.py
	-> MODAL_SPEECH_ENDPOINT or SPEECH_ASR_ENDPOINT
	-> transcript
	-> owner reviews/edits text
	-> _h_voice_command
	-> ReAct stock command tool
	-> pending stock action
	-> owner approves
	-> _h_voice_apply
	-> inventory write
	```

	The Modal ASR endpoint does speech-to-text only. ReAct starts after text exists.
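As a rough illustration of the first stage, the sketch below shows one way the speech integration could pick its ASR endpoint from the two environment variables named in the pipeline. The function name `resolve_asr_endpoint`, the `SpeechConfigError` class, and the precedence order (Modal first) are assumptions for illustration; the actual logic in `dukaan_saathi/integrations/speech.py` may differ.

```python
import os
from typing import Mapping, Optional


class SpeechConfigError(RuntimeError):
    """Raised when no ASR endpoint is configured (hypothetical name)."""


def resolve_asr_endpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty ASR endpoint URL.

    Assumption: MODAL_SPEECH_ENDPOINT wins over SPEECH_ASR_ENDPOINT.
    The plan lists both variables but does not state their order.
    """
    env = os.environ if env is None else env
    for name in ("MODAL_SPEECH_ENDPOINT", "SPEECH_ASR_ENDPOINT"):
        value = (env.get(name) or "").strip()
        if value:
            return value
    # A descriptive error lets the handler show a useful message in the UI.
    raise SpeechConfigError(
        "No speech endpoint configured: set MODAL_SPEECH_ENDPOINT "
        "or SPEECH_ASR_ENDPOINT."
    )
```

Raising a dedicated exception, rather than returning `None`, makes it harder for a missing endpoint to surface later as an unrelated crash.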

	## Completed

	- Field names are normalized to the UI shape:
	  - `action`
	  - `product`
	  - `product_id`
	  - `quantity`
	  - `unit`
	  - `confidence`
	  - `trace`
	- `add_stock` and `set_stock` are both handled.
	- The parser uses the returned `product_id`; it does not re-match blindly.
	- Parsed commands no longer auto-apply.
	- The UI shows a pending parsed action and requires the owner to click "Approve stock change".
	- `_h_voice_apply` is the only custom FastAPI voice handler that writes stock.
	- Modal cold-start messaging is visible in the UI, and `/api/warm` runs best-effort on page load.
	- Safety tests cover parse-without-write and apply-with-write.

	## Current Limitations

	| Gap | Impact on HF Space |
	|-----|--------------------|
	| Deterministic command parser | Reliable for seeded/demo examples, weaker for natural Telugu/code-mix. |
	| Limited product aliases | Commands such as "tamatar" need aliases or NLU to map to seeded products. |
	| Modal ASR cold start | First request may take 10-30 seconds unless endpoint is warm. |
	| Ephemeral SQLite | Approved stock changes may reset on Space rebuild unless persistent storage is enabled. |

	## Recommended Next Steps

	### 1. Keep deterministic parser as the default

	For the hackathon/public Space, deterministic parsing is safer and easier to
	debug. Continue using seeded examples that map to inventory:

	```text
	add Bun 12
	set OBM stock 5
	add Bingo 4
	Happy Happy low
	```

	Do not bypass owner approval to make voice feel more automatic.
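A minimal sketch of what a deterministic parser for the `add` and `set` examples could look like is shown below. The function name `parse_command` and the regex grammar are assumptions, not the project's actual parser. The sketch returns `unknown` for "Happy Happy low" because this plan does not say which action that phrase maps to.

```python
import re
from typing import Optional, TypedDict


class ParsedCommand(TypedDict):
    action: str
    product: Optional[str]
    quantity: Optional[int]


# Assumed grammar: "add <product> <qty>" and "set <product> [stock] <qty>".
_ADD = re.compile(r"^add\s+(?P<product>.+?)\s+(?P<qty>\d+)$", re.IGNORECASE)
_SET = re.compile(
    r"^set\s+(?P<product>.+?)(?:\s+stock)?\s+(?P<qty>\d+)$", re.IGNORECASE
)


def parse_command(text: str) -> ParsedCommand:
    """Parse a reviewed transcript into a command; never writes inventory."""
    cleaned = " ".join(text.split())  # collapse repeated whitespace
    for action, pattern in (("add_stock", _ADD), ("set_stock", _SET)):
        match = pattern.match(cleaned)
        if match:
            return {
                "action": action,
                "product": match["product"],
                "quantity": int(match["qty"]),
            }
    # Anything outside the assumed grammar is reported, not guessed.
    return {"action": "unknown", "product": None, "quantity": None}
```

Because every rule is an explicit pattern, a failed demo command can be traced to the exact rule that did or did not match.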

	### 2. Add optional HF Inference voice NLU

	If Telugu/code-mixed commands are important for the Space demo, add an optional
	HF Inference path behind a feature flag:

	```text
	VOICE_LLM_BACKEND=keyword | hf_inference
	HF_VOICE_NLU_MODEL_REPO=...
	```

	The output contract should stay the same:

	```json
	{
	  "action": "add_stock|set_stock|mark_out_of_stock|unknown",
	  "product_name": "string or null",
	  "product_id": "string or null",
	  "quantity": "number or null",
	  "unit": "string or null",
	  "confidence": "low|medium|high"
	}
	```

	Fall back to the deterministic parser on malformed JSON, low confidence, a
	missing product match, a timeout, or missing env vars.
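The fallback rules above could be sketched roughly as follows. The names `validate_nlu_output` and `parse_with_fallback` are hypothetical. `call_model` stands in for whatever wraps the HF Inference request and is assumed to raise on a timeout or missing env vars. Treating an empty `product_id` as "missing product match" is also an assumption.

```python
import json
from typing import Any, Callable, Dict, Optional

# Tuples (not sets) so that unhashable JSON values fail the check cleanly.
ALLOWED_ACTIONS = ("add_stock", "set_stock", "mark_out_of_stock", "unknown")
USABLE_CONFIDENCE = ("medium", "high")  # "low" triggers the fallback


def validate_nlu_output(raw: Any) -> Optional[Dict[str, Any]]:
    """Return the parsed dict if it is usable, else None to signal fallback."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return None
    if not isinstance(data, dict):
        return None
    if data.get("action") not in ALLOWED_ACTIONS:
        return None
    if data.get("confidence") not in USABLE_CONFIDENCE:
        return None
    if not data.get("product_id"):  # assumed meaning of "missing product match"
        return None
    return data


def parse_with_fallback(
    text: str,
    call_model: Callable[[str], str],
    deterministic_parse: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Try the model path first; use the deterministic parser on any failure."""
    try:
        raw = call_model(text)
    except Exception:
        # Deliberately broad: timeouts, missing env vars, and network errors
        # should all degrade to the deterministic path.
        return deterministic_parse(text)
    return validate_nlu_output(raw) or deterministic_parse(text)
```

Either branch returns only a parsed command, so the result still has to pass through owner approval before any stock changes.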

	### 3. Improve aliases before adding broad NLU

	For a constrained demo, aliases often beat another model call:

	- Add common transliterations for seeded products.
	- Keep examples aligned to seeded inventory.
	- Add parser tests for each new alias.
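An alias lookup along these lines can be very small. In the sketch below, the `ALIASES` table, its product ids, and the function names are hypothetical placeholders. Real entries would come from the seeded inventory.

```python
import unicodedata
from typing import Dict, Mapping, Optional

# Hypothetical alias table: normalized spoken form -> seeded product id.
ALIASES: Dict[str, str] = {
    "tamatar": "tomato",  # placeholder id, not taken from the real seed data
    "bun": "bun",
}


def normalize(term: str) -> str:
    """Casefold, apply Unicode NFKC, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", term).casefold()
    return " ".join(text.split())


def resolve_alias(term: str, aliases: Mapping[str, str] = ALIASES) -> Optional[str]:
    """Return the product id for a spoken term, or None if no alias matches."""
    return aliases.get(normalize(term))
```

A parser test per alias then reduces to a one-line assertion such as `assert resolve_alias("Tamatar") == "tomato"`.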

	### 4. Preserve the approval gate

	Any voice NLU path must still produce only a pending action:

	```text
	model/parser output -> pending action -> owner approval -> inventory write
	```

	No model, parser, or ReAct step may write inventory directly.
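One way to make this invariant hard to break is to keep proposing and applying as separate operations, with a single code path that writes. The sketch below uses an in-memory dict as a stand-in for the inventory store. `PendingAction` and `ApprovalGate` are hypothetical names, not the project's `_h_voice_command` / `_h_voice_apply` handlers.

```python
import uuid
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PendingAction:
    action: str       # "add_stock" or "set_stock"
    product_id: str
    quantity: int


class ApprovalGate:
    """Holds parsed actions until the owner approves them."""

    def __init__(self, inventory: Dict[str, int]) -> None:
        self.inventory = inventory  # stand-in for the real inventory store
        self._pending: Dict[str, PendingAction] = {}

    def propose(self, pending: PendingAction) -> str:
        """Record a parsed action without touching inventory; return a token."""
        token = uuid.uuid4().hex
        self._pending[token] = pending
        return token

    def approve(self, token: str) -> int:
        """Apply a proposed action exactly once and return the new stock level."""
        pending = self._pending.pop(token)  # KeyError if unknown or already used
        current = self.inventory.get(pending.product_id, 0)
        if pending.action == "add_stock":
            new_level = current + pending.quantity
        elif pending.action == "set_stock":
            new_level = pending.quantity
        else:
            raise ValueError(f"unsupported action: {pending.action}")
        self.inventory[pending.product_id] = new_level  # the only write
        return new_level
```

Popping the token before writing means a repeated approval request cannot apply the same change twice.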

	## Tests To Keep

	- Voice parse does not change stock.
	- Voice apply changes stock.
	- Unknown/low-confidence commands do not expose an approval button.
	- Malformed model output falls back or returns `unknown`.
	- Missing Modal ASR endpoint produces a useful UI error, not a crash.
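As an illustration of the approval-button rule, the check can be isolated in a small pure function and tested directly. The name `can_show_approve` and the choice to treat both `medium` and `high` confidence as approvable are assumptions. The tests are written in pytest style but only use plain `assert`.

```python
from typing import Any, Mapping

WRITE_ACTIONS = ("add_stock", "set_stock", "mark_out_of_stock")
APPROVABLE_CONFIDENCE = ("medium", "high")  # assumption: "low" is never approvable


def can_show_approve(parsed: Mapping[str, Any]) -> bool:
    """Return True only when the UI may offer the approval button."""
    return (
        parsed.get("action") in WRITE_ACTIONS
        and parsed.get("confidence") in APPROVABLE_CONFIDENCE
        and bool(parsed.get("product_id"))
    )


def test_unknown_command_has_no_approve_button() -> None:
    parsed = {"action": "unknown", "confidence": "high", "product_id": "p1"}
    assert not can_show_approve(parsed)


def test_low_confidence_has_no_approve_button() -> None:
    parsed = {"action": "add_stock", "confidence": "low", "product_id": "p1"}
    assert not can_show_approve(parsed)


def test_confident_matched_command_is_approvable() -> None:
    parsed = {"action": "set_stock", "confidence": "high", "product_id": "p1"}
    assert can_show_approve(parsed)
```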