# VoiceGate HF Space TODO
This is the execution checklist for bringing the VoiceGate Hugging Face Space
from scaffold to a working Gradio app.
## Current Status
- [x] Create the `VoiceGate-hf` Space wrapper repository.
- [x] Configure and push to `build-small-hackathon/VoiceGate`.
- [x] Replace the default HF template with a VoiceGate scaffold.
- [x] Keep the upstream `VoiceGate/` checkout local-only and ignored.
- [x] Preserve Hugging Face LFS rules in `.gitattributes`.
- [x] Confirm `VoiceGate/workflows/VoiceGate-Workflow_api.json` is valid JSON.
- [x] Copy the validated API workflow into `workflows/voicegate_api.json`.
- [x] Confirm SSH access to the running Space container and document the
runbook.
- [x] Confirm `DEEPSEEK_API_KEY` is visible in the Space without printing it.
- [x] Download VoxCPM2 and MelBand RoFormer to persistent Space storage.
- [x] Confirm ZeroGPU CUDA can be invoked from the normal Gradio runtime through
`@spaces.GPU`.
- [x] Confirm the full short-audio workflow can run on the duplicated personal
T4 Small Space.
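
The `DEEPSEEK_API_KEY` check above can be done without echoing the secret. The following is a minimal sketch, assuming the key is exposed as an environment variable; the helper name `describe_secret` is hypothetical and not part of the repository.

```python
import os


def describe_secret(name, env=None):
    """Report whether a secret is set without revealing its value."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value:
        return f"{name}: missing"
    # Only the length is reported, never the value itself.
    return f"{name}: present ({len(value)} chars)"


if __name__ == "__main__":
    print(describe_secret("DEEPSEEK_API_KEY"))
```

Reporting only presence and length keeps the value out of Space logs and SSH session history.
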
## Phase 1: Repository Hygiene
- [x] Copy `VoiceGate/workflows/VoiceGate-Workflow_api.json` to
`workflows/voicegate_api.json`.
- [x] Copy `VoiceGate/workflows/VoiceGate-Workflow.json` to
`workflows/voicegate_ui.json` for reference.
- [x] Update docs so `VoiceGate/` is clearly documented as a local upstream
checkout, not Space runtime content.
- [x] Commit and push the workflow files after confirming they contain no
secrets or large media.
## Phase 2: Dependency Inventory
- [x] Identify exact repositories and pinned commits for every required custom
  node package:
  - [x] `comfyui_voicebridge`
  - [x] RunningHub VoxCPM nodes
  - [x] MelBandRoFormer nodes
  - [x] rgthree nodes
  - [x] comfyui-easy-use
  - [x] Comfyroll text nodes or equivalent provider for `CR Text`
  - [x] `RH_LLMAPI_NODE` provider
  - [x] `ReplaceText` provider
  - [x] `MergeAudioMW` provider
- [x] Identify Python dependency constraints for ComfyUI and all custom nodes.
- [x] Identify system packages beyond `ffmpeg` and `git`, if any.
- [x] Decide whether custom nodes are vendored in `custom_nodes/` or installed
by pinned git URL during bootstrap.
- [x] Decide where model files are downloaded and cached in the Space.
## Phase 3: Runtime Bootstrap
- [x] Add `scripts/bootstrap_comfy.py`.
- [x] Add `scripts/run_comfy.py`.
- [x] Add `scripts/workflow_client.py`.
- [ ] Install or prepare ComfyUI on Space startup.
- [x] Add bootstrap support for installing custom node dependencies.
- [x] Add opt-in model directory preparation and download commands.
- [x] Verify ComfyUI can start locally in the Space in CPU mode.
- [x] Verify ComfyUI can start from a Gradio `@spaces.GPU` function and report
CUDA through `/system_stats`.
- [x] Verify `ComfyUI_AudioTools` imports successfully in the Space after
adding its missing system and Python dependencies.
- [ ] Verify ComfyUI API endpoints are reachable:
  - [x] `/system_stats`
  - [x] `/upload/image` or the audio upload equivalent used by `LoadAudio`
  - [x] `/prompt`
  - [x] `/history/{prompt_id}`
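
The `/prompt` and `/history/{prompt_id}` endpoints listed above are typically used as a submit-then-poll pair. The sketch below illustrates that pattern with the standard library only. It is not the contents of `scripts/workflow_client.py`; the base URL assumes ComfyUI's default local port `8188`, the `800` second default mirrors the app wait window mentioned under Open Questions, and the function names `http_json` and `run_workflow` are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:8188"  # assumption: ComfyUI's default local port


def http_json(url, payload=None):
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_workflow(workflow, fetch=http_json, base_url=BASE_URL,
                 timeout_s=800.0, poll_s=1.0):
    """Queue a workflow via /prompt, then poll /history/{prompt_id}."""
    prompt_id = fetch(f"{base_url}/prompt", {"prompt": workflow})["prompt_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        history = fetch(f"{base_url}/history/{prompt_id}")
        # Assumption: the history stays empty until the prompt has finished.
        if prompt_id in history:
            return history[prompt_id]
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout_s}s")
```

Passing `fetch` as a parameter lets the polling logic be exercised with a fake transport before ComfyUI is running.
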
## Phase 4: Workflow Parameterization
- [x] Add Python-side patching for node `16` `LoadAudio.inputs.audio`.
- [x] Add Python-side patching for node `105` `api_key` from
`DEEPSEEK_API_KEY`.
- [x] Add Python-side patching for node `105` `api_baseurl`.
- [x] Add Python-side patching for node `105` `model`.
- [x] Add Python-side patching for node `110` target language.
- [x] Add unique job-specific output prefixes for node `180` and node `214`.
- [x] Decide which user controls are exposed first:
  - [ ] source language
  - [x] target language
  - [ ] LLM model
  - [ ] max input duration
  - [ ] keep or drop background audio
- [ ] Remove or ignore display-only `easy showAnything` nodes if they are not
needed for API execution.
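
The patching steps in this phase can be pictured as one function over the API-format workflow, which is a dictionary keyed by node id with an `inputs` mapping per node. This is a rough sketch rather than the project's implementation. The node ids and the `audio`, `api_key`, `api_baseurl`, and `model` input names come from the checklist above; the `text` input on node `110` and the `filename_prefix` input on nodes `180` and `214` are assumptions, as is the function name `patch_workflow`.

```python
import copy
import os
import uuid


def patch_workflow(workflow, audio_name, target_language, api_baseurl, model,
                   env=None):
    """Return a patched copy of the workflow; the input is left untouched."""
    env = os.environ if env is None else env
    patched = copy.deepcopy(workflow)
    job_id = uuid.uuid4().hex[:8]  # short unique id for this job's outputs

    patched["16"]["inputs"]["audio"] = audio_name

    llm_inputs = patched["105"]["inputs"]
    llm_inputs["api_key"] = env["DEEPSEEK_API_KEY"]  # never log this value
    llm_inputs["api_baseurl"] = api_baseurl
    llm_inputs["model"] = model

    # Assumption: node 110 carries the target language in a "text" input.
    patched["110"]["inputs"]["text"] = target_language

    # Assumption: the output nodes accept a "filename_prefix" input.
    for node_id in ("180", "214"):
        prefix = f"voicegate_{job_id}_{node_id}"
        patched[node_id]["inputs"]["filename_prefix"] = prefix
    return patched
```

Working on a deep copy keeps the template loaded from `workflows/voicegate_api.json` reusable across jobs, and the per-job prefix prevents concurrent runs from overwriting each other's outputs.
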
## Phase 5: Gradio Interface
- [x] Replace the placeholder `app.py` with the first VoiceGate interface.
- [x] Add short audio upload input.
- [x] Add target language input.
- [x] Add status/log output.
- [x] Add generated audio output.
- [x] Add generated SRT file outputs.
- [ ] Wrap GPU-heavy execution with `@spaces.GPU(duration=...)`.
- [x] Add and run a minimal GPU smoke test button.
- [x] Keep a diagnostics tab for internal workflow tests.
- [x] Add automatic model-path preparation for the user-facing run path.
- [x] Add user-facing TTS segment trim control for node `268`
`TrimAudioDuration.start_index`.
- [ ] Add guardrails for file type and duration.
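
One possible shape for the still-open file type and duration guardrail is sketched below. It only handles WAV files through the standard-library `wave` module so that it stays self-contained. The allowed extensions, the `30` second limit, and the helper names are all hypothetical, since the maximum duration is still an open question in this document.

```python
import wave
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav"}  # hypothetical allowlist
MAX_DURATION_S = 30.0          # hypothetical limit; not yet decided


def wav_duration_seconds(path):
    """Duration of a WAV file computed from frame count and sample rate."""
    with wave.open(str(path), "rb") as handle:
        return handle.getnframes() / handle.getframerate()


def validate_upload(path, max_duration_s=MAX_DURATION_S):
    """Return an error message for a rejected upload, or None if accepted."""
    path = Path(path)
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return f"Unsupported file type: {path.suffix or '(none)'}"
    try:
        duration = wav_duration_seconds(path)
    except (wave.Error, EOFError, OSError) as exc:
        return f"Could not read audio: {exc}"
    if duration > max_duration_s:
        return f"Audio is {duration:.1f}s; the limit is {max_duration_s:g}s"
    return None
```

Running this check before the GPU-decorated function is called means rejected uploads do not consume ZeroGPU quota.
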
## Phase 6: Minimal Runtime Tests
- [x] Start Space and confirm Gradio loads.
- [x] Start ComfyUI from the Space process.
- [x] Submit a minimal prompt to ComfyUI and receive a response.
- [x] Submit a minimal prompt to ComfyUI from the Gradio GPU runtime and receive
a response.
- [x] Run a DeepSeek LLM node smoke test.
- [x] Run a MelBand RoFormer smoke test.
- [x] Run a MelBand RoFormer smoke test inside ZeroGPU.
- [x] Run a tiny TTS-only workflow inside ZeroGPU.
- [x] Run a short ASR-only workflow inside ZeroGPU.
- [x] Verify the Space API workflow connection graph against
`VoiceGate/workflows/VoiceGate-Workflow.json`.
- [ ] Run SRT split -> VoxCPM -> SRT merge.
- [x] Run the full short-audio VoiceGate workflow on the personal T4 Small
Space.
- [ ] Run the full short-audio VoiceGate workflow on the organization ZeroGPU
Space after quota recovers.
- [x] Confirm output audio is downloadable/playable from Gradio.
- [ ] Confirm SRT files are downloadable from Gradio after redeploy.
## Phase 7: Full VoiceGate Path
- [ ] Add video input support after the audio path is stable.
- [ ] Extract audio from video with `ffmpeg`.
- [ ] Merge generated audio back into the original video if required.
- [ ] Add subtitle download and optional subtitle burn-in.
- [ ] Add examples with small sample files, if allowed by Space storage limits.
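
The `ffmpeg` extraction and merge steps planned above could be driven from Python roughly as follows. This is a sketch under assumptions: the sample rate, channel count, and output container are illustrative choices, and the function names are hypothetical. The flags themselves (`-vn`, `-acodec`, `-ar`, `-ac`, `-map`, `-c:v copy`, `-shortest`) are standard `ffmpeg` options.

```python
import subprocess


def build_extract_audio_command(video_path, audio_path, sample_rate=44100):
    """ffmpeg arguments that drop the video stream and write 16-bit PCM."""
    return [
        "ffmpeg", "-y", "-i", str(video_path),
        "-vn", "-acodec", "pcm_s16le",
        "-ar", str(sample_rate), "-ac", "2",
        str(audio_path),
    ]


def build_mux_command(video_path, audio_path, output_path):
    """ffmpeg arguments that keep the original video and swap in new audio."""
    return [
        "ffmpeg", "-y", "-i", str(video_path), "-i", str(audio_path),
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "copy", "-shortest",
        str(output_path),
    ]


def run_ffmpeg(command):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True, capture_output=True)
```

Passing arguments as a list avoids shell quoting problems with user-supplied file names, and `-c:v copy` avoids re-encoding the video when only the audio track changes.
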
## Open Questions
- [x] Which exact node repository provides `RH_LLMAPI_NODE`?
- [x] Which exact node repository provides `RunningHub_VoxCPM_*`?
- [ ] Is `flash_attention_2` available and reliable in the ZeroGPU environment?
- [x] Can ASR run without `flash_attention_2`? Yes. The ASR-only smoke test
used `attention=sdpa`.
- [ ] Does VoxCPM2 fit comfortably in ZeroGPU memory with ASR and
MelBandRoFormer in the same run? A full run reached the heavy nodes but timed
out at the `800s` app wait window; the next shortened retry was blocked by
ZeroGPU quota before execution.
- [x] Can a full short-audio workflow run on T4 Small? Yes. On the duplicated
personal Space with a warm ComfyUI process, the test completed in `24.4s`.
- [x] How do we avoid user-facing failure after hardware changes? The user path
now checks required model paths and runs `bootstrap_comfy.py --with-models`
before submitting to ComfyUI.
- [x] Where should large model files live? `/data/voicegate_models`, with
symlinks into ComfyUI's expected model directories.
- [ ] Should the first public demo disable background separation to reduce
runtime and memory pressure?
- [ ] What maximum uploaded audio/video duration should the first version allow?
- [x] Can SSH access the ZeroGPU CUDA device directly? No. SSH enters the
normal running Space container without CUDA; GPU work must run inside a
`@spaces.GPU` function.
- [x] Has the Space successfully invoked ZeroGPU CUDA from Gradio? Yes. The
`gpu_smoke_test` button returned `cuda_available=True` on one CUDA device.
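
The storage answer above, with models under `/data/voicegate_models` and symlinks into ComfyUI's model directories, could be implemented along these lines. This is a sketch only: the function name `link_model_dirs` and the subdirectory names in the usage comment are hypothetical, and the actual layout used by `bootstrap_comfy.py` may differ.

```python
from pathlib import Path


def link_model_dirs(storage_root, comfy_models_root, subdirs):
    """Symlink storage subdirectories into ComfyUI's models directory.

    Returns the names that are missing from persistent storage.
    """
    missing = []
    for name in subdirs:
        source = Path(storage_root) / name
        target = Path(comfy_models_root) / name
        if not source.is_dir():
            missing.append(name)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.is_symlink():
            target.unlink()  # refresh stale links, e.g. after a hardware change
        if not target.exists():  # a real directory is left untouched
            target.symlink_to(source, target_is_directory=True)
    return missing


# Hypothetical usage; the subdirectory names are placeholders:
# missing = link_model_dirs("/data/voicegate_models", "ComfyUI/models",
#                           ["voxcpm", "melband_roformer"])
```

A non-empty return value is a natural trigger for the `bootstrap_comfy.py --with-models` step described above, before anything is submitted to ComfyUI.
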