VoiceGate / docs /todo.md

YanTianlong

Add TTS trim control and polish UI

1c552ae 27 days ago


	# VoiceGate HF Space TODO

	This is the execution checklist for bringing the VoiceGate Hugging Face Space
	from scaffold to a working Gradio app.

	## Current Status

	- [x] Create the `VoiceGate-hf` Space wrapper repository.
	- [x] Configure and push to `build-small-hackathon/VoiceGate`.
	- [x] Replace the default HF template with a VoiceGate scaffold.
	- [x] Keep the upstream `VoiceGate/` checkout local-only and ignored.
	- [x] Preserve Hugging Face LFS rules in `.gitattributes`.
	- [x] Confirm `VoiceGate/workflows/VoiceGate-Workflow_api.json` is valid JSON.
	- [x] Copy the validated API workflow into `workflows/voicegate_api.json`.
	- [x] Confirm SSH access to the running Space container and document the
	runbook.
	- [x] Confirm `DEEPSEEK_API_KEY` is visible in the Space without printing it.
	- [x] Download VoxCPM2 and MelBand RoFormer to persistent Space storage.
	- [x] Confirm ZeroGPU CUDA can be invoked from the normal Gradio runtime through
	`@spaces.GPU`.
	- [x] Confirm the full short-audio workflow can run on the duplicated personal
	T4 Small Space.
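
The `@spaces.GPU` confirmation above can be pictured with a small smoke-test function. This is only a sketch: the name `gpu_smoke_test` comes from this checklist, but its real body is not shown here, `duration=60` is an assumed value, and the no-op fallback decorator is added purely so the example also runs outside a Space.

```python
"""Hypothetical GPU smoke test; the Space's real implementation may differ."""

try:
    import spaces  # provided on Hugging Face Spaces

    gpu = spaces.GPU(duration=60)  # assumed duration, in seconds
except ImportError:
    def gpu(fn):
        """No-op stand-in so the sketch also runs outside a Space."""
        return fn


@gpu
def gpu_smoke_test() -> dict:
    """Report whether CUDA is visible from inside the GPU-decorated call."""
    try:
        import torch
    except ImportError:
        # Without torch there is nothing to probe; report no CUDA.
        return {"cuda_available": False, "device_count": 0}
    return {
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
```

Calling the same function from an SSH session would be expected to report no CUDA, since the GPU is only attached inside the decorated call.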

	## Phase 1: Repository Hygiene

	- [x] Copy `VoiceGate/workflows/VoiceGate-Workflow_api.json` to
	`workflows/voicegate_api.json`.
	- [x] Copy `VoiceGate/workflows/VoiceGate-Workflow.json` to
	`workflows/voicegate_ui.json` for reference.
	- [x] Update docs so `VoiceGate/` is clearly documented as a local upstream
	checkout, not Space runtime content.
	- [x] Commit and push the workflow files after confirming they contain no
	secrets or large media.

	## Phase 2: Dependency Inventory

	- [x] Identify exact repositories and pinned commits for every required custom
	node package:
	  - [x] `comfyui_voicebridge`
	  - [x] RunningHub VoxCPM nodes
	  - [x] MelBandRoFormer nodes
	  - [x] rgthree nodes
	  - [x] comfyui-easy-use
	  - [x] Comfyroll text nodes or equivalent provider for `CR Text`
	  - [x] `RH_LLMAPI_NODE` provider
	  - [x] `ReplaceText` provider
	  - [x] `MergeAudioMW` provider
	- [x] Identify Python dependency constraints for ComfyUI and all custom nodes.
	- [x] Identify system packages beyond `ffmpeg` and `git`, if any.
	- [x] Decide whether custom nodes are vendored in `custom_nodes/` or installed
	by pinned git URL during bootstrap.
	- [x] Decide where model files are downloaded and cached in the Space.
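
If the pinned-git-URL option is chosen, the bootstrap step could look roughly like the sketch below. The `PinnedNode` type, the `install_commands` helper, and the placeholder URL and commit in the usage note are all hypothetical; the real repositories and commits are the ones identified in the inventory above and are not reproduced here.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PinnedNode:
    """One custom node package pinned to an exact commit (hypothetical type)."""
    name: str
    repo_url: str
    commit: str


def install_commands(node: PinnedNode, custom_nodes_dir: Path) -> list[list[str]]:
    """Build the git commands that clone a node package and pin it to its commit."""
    target = custom_nodes_dir / node.name
    return [
        # Clone without checking out files, then check out the pinned commit only.
        ["git", "clone", "--no-checkout", node.repo_url, str(target)],
        ["git", "-C", str(target), "checkout", "--detach", node.commit],
    ]
```

For example, `install_commands(PinnedNode("comfyui_voicebridge", "https://example.invalid/comfyui_voicebridge.git", "0000000"), Path("custom_nodes"))` returns two argument lists that a bootstrap script could pass to `subprocess.run(..., check=True)`.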

	## Phase 3: Runtime Bootstrap

	- [x] Add `scripts/bootstrap_comfy.py`.
	- [x] Add `scripts/run_comfy.py`.
	- [x] Add `scripts/workflow_client.py`.
	- [ ] Install or prepare ComfyUI on Space startup.
	- [x] Add bootstrap support for installing custom node dependencies.
	- [x] Add opt-in model directory preparation and download commands.
	- [x] Verify ComfyUI can start locally in the Space in CPU mode.
	- [x] Verify ComfyUI can start from a Gradio `@spaces.GPU` function and report
	CUDA through `/system_stats`.
	- [x] Verify `ComfyUI_AudioTools` imports successfully in the Space after
	adding its missing system and Python dependencies.
	- [ ] Verify ComfyUI API endpoints are reachable:
	  - [x] `/system_stats`
	  - [x] `/upload/image` or the audio upload equivalent used by `LoadAudio`
	  - [x] `/prompt`
	  - [x] `/history/{prompt_id}`
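
The `/prompt` and `/history/{prompt_id}` endpoints above are typically used together: submit the workflow, then poll until the job appears in history. The sketch below is an assumption about how `scripts/workflow_client.py` might do this, not a copy of it. `BASE_URL`, `http_json`, and `run_workflow` are hypothetical names, and the default timeout mirrors the `800s` app wait window mentioned under Open Questions.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

BASE_URL = "http://127.0.0.1:8188"  # ComfyUI's default port; assumed for the Space


def http_json(url: str, payload: Optional[dict] = None) -> dict:
    """GET when payload is None, otherwise POST the payload as JSON."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_workflow(
    workflow: dict,
    fetch: Callable[..., dict] = http_json,
    timeout_s: float = 800.0,
    poll_s: float = 2.0,
) -> dict:
    """Submit an API-format workflow and wait for its history entry."""
    prompt_id = fetch(f"{BASE_URL}/prompt", {"prompt": workflow})["prompt_id"]
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        history = fetch(f"{BASE_URL}/history/{prompt_id}")
        if prompt_id in history:  # the entry appears once execution has finished
            return history[prompt_id]
        time.sleep(poll_s)
    raise TimeoutError(f"prompt {prompt_id} did not finish within {timeout_s}s")
```

Passing `fetch` as a parameter keeps the polling logic testable without a running ComfyUI process.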

	## Phase 4: Workflow Parameterization

	- [x] Add Python-side patching for node `16` `LoadAudio.inputs.audio`.
	- [x] Add Python-side patching for node `105` `api_key` from
	`DEEPSEEK_API_KEY`.
	- [x] Add Python-side patching for node `105` `api_baseurl`.
	- [x] Add Python-side patching for node `105` `model`.
	- [x] Add Python-side patching for node `110` target language.
	- [x] Add unique job-specific output prefixes for node `180` and node `214`.
	- [x] Decide which user controls are exposed first:
	  - [ ] source language
	  - [x] target language
	  - [ ] LLM model
	  - [ ] max input duration
	  - [ ] keep or drop background audio
	- [ ] Remove or ignore display-only `easy showAnything` nodes if they are not
	needed for API execution.
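
The patching items above amount to editing a few `inputs` fields in the API-format workflow before it is submitted. A minimal sketch follows, using the node IDs from this checklist. The function name `patch_workflow`, the `text` field on node `110`, and the `filename_prefix` field on nodes `180` and `214` are assumptions; the actual input names should be checked against `workflows/voicegate_api.json`.

```python
import copy
import uuid
from typing import Optional


def patch_workflow(
    workflow: dict,
    *,
    audio_name: str,
    api_key: str,
    api_baseurl: str,
    model: str,
    target_language: str,
    job_id: Optional[str] = None,
) -> dict:
    """Return a patched copy of the workflow; the input dict is left untouched."""
    patched = copy.deepcopy(workflow)
    job_id = job_id or uuid.uuid4().hex[:8]

    patched["16"]["inputs"]["audio"] = audio_name  # LoadAudio source file

    llm_inputs = patched["105"]["inputs"]  # LLM API node
    llm_inputs["api_key"] = api_key
    llm_inputs["api_baseurl"] = api_baseurl
    llm_inputs["model"] = model

    patched["110"]["inputs"]["text"] = target_language  # assumed field name

    for node_id in ("180", "214"):
        # Assumed field name; a per-job prefix keeps concurrent outputs apart.
        patched[node_id]["inputs"]["filename_prefix"] = f"voicegate_{job_id}_{node_id}"
    return patched
```

The caller would read the key from the `DEEPSEEK_API_KEY` environment variable and pass it in, so the key never needs to be written to the workflow file or to logs.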

	## Phase 5: Gradio Interface

	- [x] Replace the placeholder `app.py` with the first VoiceGate interface.
	- [x] Add short audio upload input.
	- [x] Add target language input.
	- [x] Add status/log output.
	- [x] Add generated audio output.
	- [x] Add generated SRT file outputs.
	- [ ] Wrap GPU-heavy execution with `@spaces.GPU(duration=...)`.
	- [x] Add and run a minimal GPU smoke test button.
	- [x] Keep a diagnostics tab for internal workflow tests.
	- [x] Add automatic model-path preparation for the user-facing run path.
	- [x] Add user-facing TTS segment trim control for node `268`
	`TrimAudioDuration.start_index`.
	- [ ] Add guardrails for file type and duration.
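
The outstanding guardrail item could start as a check that runs before any GPU time is requested. The sketch below handles WAV files only, using the standard-library `wave` module. The allowed suffixes and the 30-second limit are placeholder assumptions, since the maximum duration is still an open question, and `check_upload` is a hypothetical name.

```python
import wave
from pathlib import Path

ALLOWED_SUFFIXES = {".wav"}  # assumption: other formats would need ffprobe or similar
MAX_DURATION_S = 30.0  # assumption: the real limit is still undecided


def wav_duration_seconds(path: Path) -> float:
    """Compute the duration of a WAV file from its frame count and sample rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_upload(path, max_duration_s: float = MAX_DURATION_S) -> float:
    """Return the duration in seconds, or raise ValueError if the upload is rejected."""
    path = Path(path)
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.suffix or '(none)'}")
    duration = wav_duration_seconds(path)
    if duration > max_duration_s:
        raise ValueError(f"audio is {duration:.1f}s; limit is {max_duration_s:.1f}s")
    return duration
```

In a Gradio handler, the `ValueError` message could be surfaced in the status/log output instead of submitting the job.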

	## Phase 6: Minimal Runtime Tests

	- [x] Start Space and confirm Gradio loads.
	- [x] Start ComfyUI from the Space process.
	- [x] Submit a minimal prompt to ComfyUI and receive a response.
	- [x] Submit a minimal prompt to ComfyUI from the Gradio GPU runtime and receive
	a response.
	- [x] Run a DeepSeek LLM node smoke test.
	- [x] Run a MelBand RoFormer smoke test.
	- [x] Run a MelBand RoFormer smoke test inside ZeroGPU.
	- [x] Run a tiny TTS-only workflow inside ZeroGPU.
	- [x] Run a short ASR-only workflow inside ZeroGPU.
	- [x] Verify the Space API workflow connection graph against
	`VoiceGate/workflows/VoiceGate-Workflow.json`.
	- [ ] Run SRT split -> VoxCPM -> SRT merge.
	- [x] Run the full short-audio VoiceGate workflow on the personal T4 Small
	Space.
	- [ ] Run the full short-audio VoiceGate workflow on the organization ZeroGPU
	Space after quota recovers.
	- [x] Confirm output audio is downloadable/playable from Gradio.
	- [ ] Confirm SRT files are downloadable from Gradio after redeploy.

	## Phase 7: Full VoiceGate Path

	- [ ] Add video input support after the audio path is stable.
	- [ ] Extract audio from video with `ffmpeg`.
	- [ ] Merge generated audio back into the original video if required.
	- [ ] Add subtitle download and optional subtitle burn-in.
	- [ ] Add examples with small sample files, if allowed by Space storage limits.
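
For the `ffmpeg` extraction step, one possible shape is a small command builder plus a thin runner. The function names are hypothetical, and the mono 16 kHz 16-bit PCM output is an assumption; the format the workflow's `LoadAudio` node actually expects should be confirmed first.

```python
import subprocess
from pathlib import Path


def ffmpeg_extract_audio_cmd(video_path, audio_path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that drops the video stream and writes PCM WAV."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(video_path),
        "-vn",                     # ignore the video stream
        "-ac", "1",                # mono (assumed)
        "-ar", str(sample_rate),   # sample rate in Hz (assumed default)
        "-c:a", "pcm_s16le",       # 16-bit PCM
        str(audio_path),
    ]


def extract_audio(video_path, audio_path) -> Path:
    """Run ffmpeg and return the output path; raises CalledProcessError on failure."""
    subprocess.run(ffmpeg_extract_audio_cmd(video_path, audio_path), check=True)
    return Path(audio_path)
```

Keeping the command as a list avoids shell quoting problems with user-supplied file names.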

	## Open Questions

	- [x] Which exact node repository provides `RH_LLMAPI_NODE`?
	- [x] Which exact node repository provides `RunningHub_VoxCPM_*`?
	- [ ] Is `flash_attention_2` available and reliable in the ZeroGPU environment?
	- [x] Can ASR run without `flash_attention_2`? Yes. The ASR-only smoke test
	used `attention=sdpa`.
	- [ ] Does VoxCPM2 fit comfortably in ZeroGPU memory with ASR and
	MelBandRoFormer in the same run? A full run reached the heavy nodes but timed
	out at the `800s` app wait window; the next shortened retry was blocked by
	ZeroGPU quota before execution.
	- [x] Can a full short-audio workflow run on T4 Small? Yes. On the duplicated
	personal Space with a warm ComfyUI process, the test completed in `24.4s`.
	- [x] How do we avoid user-facing failure after hardware changes? The user path
	now checks required model paths and runs `bootstrap_comfy.py --with-models`
	before submitting to ComfyUI.
	- [x] Where should large model files live? `/data/voicegate_models`, with
	symlinks into ComfyUI's expected model directories.
	- [ ] Should the first public demo disable background separation to reduce
	runtime and memory pressure?
	- [ ] What maximum uploaded audio/video duration should the first version allow?
	- [x] Can SSH access the ZeroGPU CUDA device directly? No. SSH enters the
	normal running Space container without CUDA; GPU work must run inside a
	`@spaces.GPU` function.
	- [x] Has the Space successfully invoked ZeroGPU CUDA from Gradio? Yes. The
	`gpu_smoke_test` button returned `cuda_available=True` on one CUDA device.
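
The storage answer above (models under `/data/voicegate_models`, symlinked into ComfyUI's model directories) can be sketched as an idempotent linking helper. `link_model_dir` is a hypothetical name, and the sub-directory names to link are not listed in this checklist, so any name passed in is illustrative.

```python
from pathlib import Path


def link_model_dir(storage_dir: Path, comfy_models_dir: Path, name: str) -> Path:
    """Symlink comfy_models_dir/name to storage_dir/name; safe to call repeatedly."""
    source = storage_dir / name
    link = comfy_models_dir / name
    if not source.is_dir():
        raise FileNotFoundError(f"missing model directory: {source}")
    if link.is_symlink():
        if link.resolve() == source.resolve():
            return link  # already pointing at the right place
        link.unlink()  # stale link, for example after a hardware change
    elif link.exists():
        raise FileExistsError(f"refusing to replace real path: {link}")
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(source, target_is_directory=True)
    return link
```

A pre-submit check could call this for each required model directory and fall back to `bootstrap_comfy.py --with-models` when `FileNotFoundError` is raised.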