
Voice Command Status For Hugging Face Spaces

This document tracks the current voice command pipeline and the remaining work that matters for a public Hugging Face Space.

Current Pipeline

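Voice commands pass through two separate stages: speech-to-text (ASR), then command parsing. The owner reviews the result of each stage before the next step runs: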

audio
-> POST /api/speech
-> dukaan_saathi/integrations/speech.py
-> MODAL_SPEECH_ENDPOINT or SPEECH_ASR_ENDPOINT
-> transcript
-> owner reviews/edits text
-> _h_voice_command
-> ReAct stock command tool
-> pending stock action
-> owner approves
-> _h_voice_apply
-> inventory write

The Modal ASR endpoint performs speech-to-text only. The ReAct step starts only after a transcript exists.

Completed

  • Field names are normalized to the UI shape:
    • action
    • product
    • product_id
    • quantity
    • unit
    • confidence
    • trace
  • add_stock and set_stock are both handled.
  • The parser uses the returned product_id; it does not re-match blindly.
  • Parsed commands no longer auto-apply.
  • The UI shows a pending parsed action and requires the owner to click "Approve stock change" before anything is written.
  • _h_voice_apply is the only custom FastAPI voice handler that writes stock.
  • A Modal cold-start notice is shown in the UI, and /api/warm runs best-effort on page load.
  • Safety tests cover parse-without-write and apply-with-write.

Current Limitations

| Gap | Impact on HF Space |
| --- | --- |
| Deterministic command parser | Reliable for seeded/demo examples, weaker for natural Telugu/code-mix. |
| Limited product aliases | Commands such as "tamatar" need aliases or NLU to map to seeded products. |
| Modal ASR cold start | First request may take 10-30 seconds unless endpoint is warm. |
| Ephemeral SQLite | Approved stock changes may reset on Space rebuild unless persistent storage is enabled. |

Recommended Next Steps

1. Keep deterministic parser as the default

For the hackathon/public Space, deterministic parsing is safer and easier to debug. Continue using seeded examples that map to inventory:

add Bun 12
set OBM stock 5
add Bingo 4
Happy Happy low

Do not bypass owner approval to make voice feel more automatic.
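As an illustration of why the seeded command forms above are easy to debug, here is a minimal sketch of a rule-based parser for the add and set forms. It is not the project's parser: `SEEDED`, the product IDs, `parse_command`, and the regular expressions are all hypothetical. The returned keys follow the UI shape listed under Completed. This page does not specify how a phrase like "Happy Happy low" should be interpreted, so the sketch leaves it as unknown.

```python
import re

# Hypothetical seeded inventory: lowercase spoken name -> product_id (IDs are made up).
SEEDED = {"bun": "p-bun", "obm": "p-obm", "bingo": "p-bingo"}

# One rule per supported action; the word "stock" is optional in the set form.
PATTERNS = (
    ("add_stock", re.compile(r"^add\s+(?P<name>.+?)\s+(?P<qty>\d+)$", re.IGNORECASE)),
    ("set_stock", re.compile(r"^set\s+(?P<name>.+?)(?:\s+stock)?\s+(?P<qty>\d+)$", re.IGNORECASE)),
)


def _unknown(text: str) -> dict:
    """Result for anything the rules cannot place; it never carries a product."""
    return {
        "action": "unknown", "product": None, "product_id": None,
        "quantity": None, "unit": None, "confidence": "low",
        "trace": [f"no rule matched: {text!r}"],
    }


def parse_command(text: str) -> dict:
    """Parse a reviewed transcript into a pending-action dict. Writes nothing."""
    cleaned = " ".join(text.split())  # collapse repeated whitespace
    for action, pattern in PATTERNS:
        match = pattern.match(cleaned)
        if match is None:
            continue
        name = match.group("name")
        product_id = SEEDED.get(name.lower())
        if product_id is None:
            return _unknown(cleaned)  # matched a rule, but not a seeded product
        return {
            "action": action, "product": name, "product_id": product_id,
            "quantity": int(match.group("qty")), "unit": None,
            "confidence": "high", "trace": [f"rule={action}"],
        }
    return _unknown(cleaned)
```

Because every result comes from an explicit rule, the `trace` field can state exactly which rule fired, which is harder to guarantee with a model-backed parser.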

2. Add optional HF Inference voice NLU

If Telugu/code-mixed commands are important for the Space demo, add an optional HF Inference path behind a feature flag:

VOICE_LLM_BACKEND=keyword | hf_inference
HF_VOICE_NLU_MODEL_REPO=...
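One way to read these two variables safely is sketched below. The variable names come from this document; the function name `resolve_voice_backend` and the rule that an unset or unrecognised value means `keyword` are assumptions made for the sketch.

```python
import os
from typing import Mapping, Optional

ALLOWED_BACKENDS = ("keyword", "hf_inference")


def resolve_voice_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the backend to use, defaulting to the deterministic parser."""
    env = os.environ if env is None else env
    backend = env.get("VOICE_LLM_BACKEND", "keyword").strip().lower()
    if backend not in ALLOWED_BACKENDS:
        return "keyword"  # assumption: unrecognised values are treated as the default
    if backend == "hf_inference" and not env.get("HF_VOICE_NLU_MODEL_REPO"):
        return "keyword"  # flag is on but no model repo is configured
    return backend
```

Resolving the flag in one place keeps the "missing env vars" fallback from being re-implemented in every handler.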

Whichever backend is selected, the output contract should stay the same:

{
  "action": "add_stock|set_stock|mark_out_of_stock|unknown",
  "product_name": "string or null",
  "product_id": "string or null",
  "quantity": "number or null",
  "unit": "string or null",
  "confidence": "low|medium|high"
}

Fall back to the deterministic parser on malformed JSON, low confidence, missing product match, timeout, or missing env vars.
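The fallback rule can be sketched as a validation step wrapped around the model call. Everything named here is hypothetical: `validate_nlu_output`, `parse_with_fallback`, and the two callables stand in for whatever the real integration provides, and treating a missing `product_id` as a "missing product match" is an assumption.

```python
import json
from typing import Callable, Optional

ALLOWED_ACTIONS = ("add_stock", "set_stock", "mark_out_of_stock", "unknown")
ALLOWED_CONFIDENCE = ("low", "medium", "high")


def validate_nlu_output(raw: str) -> Optional[dict]:
    """Return the parsed dict if it satisfies the contract, else None."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return None
    if not isinstance(data, dict):
        return None
    if data.get("action") not in ALLOWED_ACTIONS:
        return None
    if data.get("confidence") not in ALLOWED_CONFIDENCE:
        return None
    if data["confidence"] == "low":
        return None  # low confidence triggers the fallback
    if not data.get("product_id"):
        return None  # assumption: no product_id means no product match
    qty = data.get("quantity")
    if qty is not None and (isinstance(qty, bool) or not isinstance(qty, (int, float))):
        return None
    return data


def parse_with_fallback(
    text: str,
    call_model: Callable[[str], str],
    deterministic_parse: Callable[[str], dict],
) -> dict:
    """Try the model path; use the deterministic parser on any failure."""
    try:
        raw = call_model(text)
    except Exception:  # timeouts, network errors, missing configuration
        return deterministic_parse(text)
    validated = validate_nlu_output(raw)
    return validated if validated is not None else deterministic_parse(text)
```

Either branch returns a parsed command only, so the result still has to go through the pending-action and approval steps before any stock changes.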

3. Improve aliases before adding broad NLU

For a constrained demo, aliases often beat another model call:

  • Add common transliterations for seeded products.
  • Keep examples aligned to seeded inventory.
  • Add parser tests for each new alias.
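The alias approach above could look roughly like the following. The alias entries and the `resolve_product` helper are illustrative only; "tamatar" is taken from the limitations listed earlier, and whether the seeded inventory contains a matching product is an assumption.

```python
from typing import Dict

# Hypothetical alias table: transliteration or variant -> seeded product name.
ALIASES: Dict[str, str] = {
    "tamatar": "tomato",
    "tamata": "tomato",
    "ban": "bun",
}


def resolve_product(spoken: str, aliases: Dict[str, str] = ALIASES) -> str:
    """Normalise a spoken product name and map known aliases to the seeded name."""
    key = " ".join(spoken.lower().split())
    return aliases.get(key, key)


def check_aliases() -> None:
    """One assertion per alias, so adding an alias without a test stands out."""
    cases = {"Tamatar": "tomato", " tamata ": "tomato", "BAN": "bun", "bun": "bun"}
    for spoken, expected in cases.items():
        assert resolve_product(spoken) == expected, spoken


check_aliases()
```

A lookup like this runs before product matching, so no additional model call is needed for names the table already covers.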

4. Preserve the approval gate

Any voice NLU path must still produce only a pending action:

model/parser output -> pending action -> owner approval -> inventory write

No model, parser, or ReAct step may write inventory directly.
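A minimal in-memory sketch of this gate follows. It is not the `_h_voice_command` / `_h_voice_apply` implementation: `VoiceGate`, its token scheme, and the dict-based stock are assumptions chosen to keep the example self-contained. The point it illustrates is that proposing an action and writing stock are separate calls, and only the approval call mutates inventory.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional

WRITE_ACTIONS = ("add_stock", "set_stock")


@dataclass
class VoiceGate:
    stock: Dict[str, int]                                   # product_id -> quantity
    pending: Dict[str, dict] = field(default_factory=dict)  # token -> parsed action

    def propose(self, parsed: dict) -> Optional[str]:
        """Store a parsed command as pending. Never touches stock."""
        pid, qty = parsed.get("product_id"), parsed.get("quantity")
        ok = (
            parsed.get("action") in WRITE_ACTIONS
            and isinstance(pid, str) and pid in self.stock
            and isinstance(qty, int) and not isinstance(qty, bool) and qty >= 0
            and parsed.get("confidence") != "low"
        )
        if not ok:
            return None  # nothing to approve, so the UI shows no approval button
        token = uuid.uuid4().hex
        self.pending[token] = dict(parsed)
        return token

    def approve(self, token: str) -> int:
        """The only method that writes stock; each token can be applied once."""
        action = self.pending.pop(token)  # KeyError if unknown or already applied
        pid, qty = action["product_id"], action["quantity"]
        if action["action"] == "add_stock":
            self.stock[pid] += qty
        else:
            self.stock[pid] = qty
        return self.stock[pid]
```

For example, proposing `add_stock` of 12 for a product with 3 in stock leaves the stock at 3 until `approve` is called with the returned token, after which it becomes 15.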

Tests To Keep

  • Voice parse does not change stock.
  • Voice apply changes stock.
  • Unknown/low-confidence commands do not expose an approval button.
  • Malformed model output falls back or returns unknown.
  • Missing Modal ASR endpoint produces a useful UI error, not a crash.