VoiceGate HF Space TODO

This is the execution checklist for bringing the VoiceGate Hugging Face Space from scaffold to a working Gradio app.

Current Status

  • Create the VoiceGate-hf Space wrapper repository.
  • Configure and push to build-small-hackathon/VoiceGate.
  • Replace the default HF template with a VoiceGate scaffold.
  • Keep the upstream VoiceGate/ checkout local-only and ignored.
  • Preserve Hugging Face LFS rules in .gitattributes.
  • Confirm VoiceGate/workflows/VoiceGate-Workflow_api.json is valid JSON.
  • Copy the validated API workflow into workflows/voicegate_api.json.
  • Confirm SSH access to the running Space container and document the runbook.
  • Confirm DEEPSEEK_API_KEY is visible in the Space without printing it.
  • Download VoxCPM2 and MelBand RoFormer to persistent Space storage.
  • Confirm ZeroGPU CUDA can be invoked from the normal Gradio runtime through @spaces.GPU.
  • Confirm the full short-audio workflow can run on the duplicated personal T4 Small Space.

Phase 1: Repository Hygiene

  • Copy VoiceGate/workflows/VoiceGate-Workflow_api.json to workflows/voicegate_api.json.
  • Copy VoiceGate/workflows/VoiceGate-Workflow.json to workflows/voicegate_ui.json for reference.
  • Update docs so VoiceGate/ is clearly documented as a local upstream checkout, not Space runtime content.
  • Commit and push the workflow files after confirming they contain no secrets or large media.
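
The pre-commit check in the last item could be sketched as below. This assumes the API-format workflow is a JSON object mapping node IDs to objects with an `inputs` dictionary. The hint list, the size limit, and both function names are illustrative assumptions, not project settings.

```python
import json
from pathlib import Path

# Hypothetical hints: input names that usually carry credentials.
SECRET_KEY_HINTS = ("api_key", "token", "secret", "password")
MAX_WORKFLOW_BYTES = 1_000_000  # assumed limit; workflow JSON should stay small


def find_secret_like_inputs(workflow: dict) -> list[str]:
    """Return 'node_id.input_name' for non-empty string inputs that look like secrets."""
    hits = []
    for node_id, node in workflow.items():
        inputs = node.get("inputs", {}) if isinstance(node, dict) else {}
        for name, value in inputs.items():
            looks_secret = any(hint in name.lower() for hint in SECRET_KEY_HINTS)
            # Linked inputs are lists and numeric inputs are numbers; only strings can leak a key.
            if looks_secret and isinstance(value, str) and value.strip():
                hits.append(f"{node_id}.{name}")
    return hits


def check_workflow_file(path: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file looks safe."""
    problems = []
    if path.stat().st_size > MAX_WORKFLOW_BYTES:
        problems.append(f"{path.name}: larger than {MAX_WORKFLOW_BYTES} bytes")
    workflow = json.loads(path.read_text(encoding="utf-8"))  # raises on invalid JSON
    problems += [f"{path.name}: possible secret in {hit}" for hit in find_secret_like_inputs(workflow)]
    return problems
```

The name matching is a heuristic, so a clean result is a prompt for a quick manual review rather than proof that the file is safe to push.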

Phase 2: Dependency Inventory

  • Identify exact repositories and pinned commits for every required custom node package:
    • comfyui_voicebridge
    • RunningHub VoxCPM nodes
    • MelBandRoFormer nodes
    • rgthree nodes
    • comfyui-easy-use
    • Comfyroll text nodes or equivalent provider for CR Text
    • RH_LLMAPI_NODE provider
    • ReplaceText provider
    • MergeAudioMW provider
  • Identify Python dependency constraints for ComfyUI and all custom nodes.
  • Identify system packages beyond ffmpeg and git, if any.
  • Decide whether custom nodes are vendored in custom_nodes/ or installed by pinned git URL during bootstrap.
  • Decide where model files are downloaded and cached in the Space.

Phase 3: Runtime Bootstrap

  • Add scripts/bootstrap_comfy.py.
  • Add scripts/run_comfy.py.
  • Add scripts/workflow_client.py.
  • Install or prepare ComfyUI on Space startup.
  • Add bootstrap support for installing custom node dependencies.
  • Add opt-in model directory preparation and download commands.
  • Verify ComfyUI can start locally in the Space in CPU mode.
  • Verify ComfyUI can start from a Gradio @spaces.GPU function and report CUDA through /system_stats.
  • Verify ComfyUI_AudioTools imports successfully in the Space after adding its missing system and Python dependencies.
  • Verify ComfyUI API endpoints are reachable:
    • /system_stats
    • /upload/image or the audio upload equivalent used by LoadAudio
    • /prompt
    • /history/{prompt_id}
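
The endpoints above are typically used in sequence: check `/system_stats`, submit to `/prompt`, then poll `/history/{prompt_id}`. The following is a minimal sketch of that loop, not the contents of `scripts/workflow_client.py`. It assumes ComfyUI listens on its default local port and that a history entry appears only after the prompt has finished. The names `http_fetch` and `run_prompt` are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

COMFY_URL = "http://127.0.0.1:8188"  # assumed: ComfyUI's default local address

# A fetcher takes (url, optional JSON body) and returns the decoded JSON response.
Fetch = Callable[[str, Optional[dict]], dict]


def http_fetch(url: str, body: Optional[dict] = None) -> dict:
    """GET when body is None, otherwise POST the body as JSON."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_prompt(workflow: dict, fetch: Fetch = http_fetch, base_url: str = COMFY_URL,
               poll_seconds: float = 1.0, max_polls: int = 600) -> dict:
    """Submit an API-format workflow and poll /history until it has an entry."""
    fetch(f"{base_url}/system_stats", None)  # fails fast if ComfyUI is not reachable
    prompt_id = fetch(f"{base_url}/prompt", {"prompt": workflow})["prompt_id"]
    for _ in range(max_polls):
        history = fetch(f"{base_url}/history/{prompt_id}", None)
        if prompt_id in history:  # assumed: present only once execution has ended
            return history[prompt_id]
        time.sleep(poll_seconds)
    raise TimeoutError(f"prompt {prompt_id} did not finish after {max_polls} polls")
```

Passing the fetcher as a parameter keeps the polling logic testable without a running ComfyUI process.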

Phase 4: Workflow Parameterization

  • Add Python-side patching for node 16 LoadAudio.inputs.audio.
  • Add Python-side patching for node 105 api_key from DEEPSEEK_API_KEY.
  • Add Python-side patching for node 105 api_baseurl.
  • Add Python-side patching for node 105 model.
  • Add Python-side patching for node 110 target language.
  • Add unique job-specific output prefixes for node 180 and node 214.
  • Decide which user controls are exposed first:
    • source language
    • target language
    • LLM model
    • max input duration
    • keep or drop background audio
  • Remove or ignore display-only easy showAnything nodes if they are not needed for API execution.
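
The patching items above could be combined into one function that copies the workflow template and overwrites the per-job inputs. This is a sketch under stated assumptions: the node IDs and the node 16 and node 105 input names come from this checklist, while the input names `text` (node 110) and `filename_prefix` (nodes 180 and 214) are guesses that must be checked against `workflows/voicegate_api.json`. The function name is hypothetical.

```python
import copy
import os
import uuid
from typing import Optional


def patch_workflow(template: dict, audio_name: str, target_language: str,
                   api_baseurl: str, model: str, job_id: Optional[str] = None) -> dict:
    """Return a patched copy of the API workflow; the template is left untouched."""
    workflow = copy.deepcopy(template)
    job_id = job_id or uuid.uuid4().hex[:8]

    workflow["16"]["inputs"]["audio"] = audio_name  # LoadAudio input file

    llm = workflow["105"]["inputs"]
    llm["api_key"] = os.environ["DEEPSEEK_API_KEY"]  # read at run time, never stored in JSON
    llm["api_baseurl"] = api_baseurl
    llm["model"] = model

    # ASSUMPTION: node 110 holds the target language in an input named "text".
    workflow["110"]["inputs"]["text"] = target_language

    # ASSUMPTION: nodes 180 and 214 take their output prefix as "filename_prefix".
    for node_id in ("180", "214"):
        workflow[node_id]["inputs"]["filename_prefix"] = f"voicegate_{job_id}"
    return workflow
```

Deep-copying the template means concurrent jobs cannot see each other's inputs, and the API key never ends up in the JSON file on disk.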

Phase 5: Gradio Interface

  • Replace the placeholder app.py with the first VoiceGate interface.
  • Add short audio upload input.
  • Add target language input.
  • Add status/log output.
  • Add generated audio output.
  • Add generated SRT file outputs.
  • Wrap GPU-heavy execution with @spaces.GPU(duration=...).
  • Add and run a minimal GPU smoke test button.
  • Keep a diagnostics tab for internal workflow tests.
  • Add automatic model-path preparation for the user-facing run path.
  • Add user-facing TTS segment trim control for node 268 TrimAudioDuration.start_index.
  • Add guardrails for file type and duration.
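
The file type and duration guardrails can be kept as a pure function that runs before any GPU time is requested. The sketch below assumes the duration has already been measured elsewhere (for example with ffprobe). The extension list and the 60-second cap are placeholders, since the real limit is still an open question, and `validate_upload` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}  # assumed allow-list
MAX_DURATION_SECONDS = 60.0  # assumed cap; the real limit is not decided yet


def validate_upload(path: str, duration_seconds: float,
                    max_seconds: float = MAX_DURATION_SECONDS) -> Optional[str]:
    """Return an error message for the status box, or None if the upload is acceptable."""
    suffix = Path(path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        allowed = ", ".join(sorted(ALLOWED_EXTENSIONS))
        return f"Unsupported file type '{suffix}'. Allowed: {allowed}."
    if duration_seconds <= 0:
        return "The uploaded audio appears to be empty."
    if duration_seconds > max_seconds:
        return f"Audio is {duration_seconds:.1f}s long; the limit is {max_seconds:.0f}s."
    return None
```

Returning a message instead of raising lets the Gradio handler show the reason in the status output and return early.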

Phase 6: Minimal Runtime Tests

  • Start Space and confirm Gradio loads.
  • Start ComfyUI from the Space process.
  • Submit a minimal prompt to ComfyUI and receive a response.
  • Submit a minimal prompt to ComfyUI from the Gradio GPU runtime and receive a response.
  • Run a DeepSeek LLM node smoke test.
  • Run a MelBand RoFormer smoke test.
  • Run a MelBand RoFormer smoke test inside ZeroGPU.
  • Run a tiny TTS-only workflow inside ZeroGPU.
  • Run a short ASR-only workflow inside ZeroGPU.
  • Verify the Space API workflow connection graph against VoiceGate/workflows/VoiceGate-Workflow.json.
  • Run SRT split -> VoxCPM -> SRT merge.
  • Run the full short-audio VoiceGate workflow on the personal T4 Small Space.
  • Run the full short-audio VoiceGate workflow on the organization ZeroGPU Space after quota recovers.
  • Confirm output audio is downloadable/playable from Gradio.
  • Confirm SRT files are downloadable from Gradio after redeploy.
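
The ZeroGPU items in this phase share one pattern: GPU work runs inside a function decorated with `@spaces.GPU`. Below is a minimal sketch of a smoke test in that style, not the Space's actual `gpu_smoke_test` implementation. The 60-second duration, the `device_count` field, and the import fallbacks for running outside a Space are assumptions.

```python
try:
    import spaces  # Hugging Face ZeroGPU helper, present on Spaces
    gpu = spaces.GPU(duration=60)  # assumed 60 s budget for a smoke test
except ImportError:
    def gpu(fn):  # local fallback: run the function undecorated
        return fn


@gpu
def gpu_smoke_test() -> dict:
    """Report whether CUDA is visible from inside the GPU-decorated call."""
    try:
        import torch
    except ImportError as exc:
        return {"cuda_available": False, "device_count": 0, "error": str(exc)}
    available = torch.cuda.is_available()
    return {
        "cuda_available": available,
        "device_count": torch.cuda.device_count() if available else 0,
    }
```

On ZeroGPU the result is only meaningful when the function is called from the Gradio runtime; the same code run over SSH would be expected to report no CUDA device.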

Phase 7: Full VoiceGate Path

  • Add video input support after the audio path is stable.
  • Extract audio from video with ffmpeg.
  • Merge generated audio back into the original video if required.
  • Add subtitle download and optional subtitle burn-in.
  • Add examples with small sample files, if allowed by Space storage limits.
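
The extract and merge steps above map onto two ffmpeg invocations. The following sketch builds the argument lists separately from running them, so they can be inspected or logged first. The codec choices (16-bit PCM WAV for extraction, AAC for the merged track) and the function names are assumptions, not decisions recorded in this checklist.

```python
import subprocess


def extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """ffmpeg command that drops the video stream and writes 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le", audio_path]


def merge_audio_cmd(video_path: str, audio_path: str, output_path: str) -> list[str]:
    """ffmpeg command that keeps the original video stream and swaps in the new audio."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-map", "0:v:0",   # video from the first input
        "-map", "1:a:0",   # audio from the second input
        "-c:v", "copy",    # do not re-encode video
        "-c:a", "aac",
        "-shortest",       # stop at the shorter of the two streams
        output_path,
    ]


def run(cmd: list[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True, capture_output=True)
```

Passing arguments as a list rather than a shell string avoids quoting problems with user-supplied file names.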

Open Questions

  • Which exact node repository provides RH_LLMAPI_NODE?
  • Which exact node repository provides RunningHub_VoxCPM_*?
  • Is flash_attention_2 available and reliable in the ZeroGPU environment?
  • Can ASR run without flash_attention_2? Yes. The ASR-only smoke test used attention=sdpa.
  • Does VoxCPM2 fit comfortably in ZeroGPU memory with ASR and MelBandRoFormer in the same run? Not yet confirmed. A full run reached the heavy nodes but timed out at the 800s app wait window, and the next shortened retry was blocked by ZeroGPU quota before execution.
  • Can a full short-audio workflow run on T4 Small? Yes. On the duplicated personal Space with a warm ComfyUI process, the test completed in 24.4s.
  • How do we avoid user-facing failure after hardware changes? The user path now checks required model paths and runs bootstrap_comfy.py --with-models before submitting to ComfyUI.
  • Where should large model files live? /data/voicegate_models, with symlinks into ComfyUI's expected model directories.
  • Should the first public demo disable background separation to reduce runtime and memory pressure?
  • What maximum uploaded audio/video duration should the first version allow?
  • Can SSH access the ZeroGPU CUDA device directly? No. SSH enters the normal running Space container without CUDA; GPU work must run inside a @spaces.GPU function.
  • Has the Space successfully invoked ZeroGPU CUDA from Gradio? Yes. The gpu_smoke_test button returned cuda_available=True on one CUDA device.
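
The model storage answers above (models under `/data/voicegate_models`, symlinked into ComfyUI's model directories, with a path check before each run) could be sketched as follows. The function name and the shape of the `required` mapping are hypothetical, and the actual subdirectory names depend on what each custom node expects.

```python
import os
from pathlib import Path


def link_models(storage_root: Path, comfy_models: Path, required: dict[str, str]) -> list[str]:
    """Symlink each stored model into ComfyUI's tree; return names missing from storage.

    `required` maps a path relative to `storage_root` to a path relative to `comfy_models`.
    """
    missing = []
    for stored_rel, comfy_rel in required.items():
        source = storage_root / stored_rel
        target = comfy_models / comfy_rel
        if not source.exists():
            missing.append(stored_rel)  # caller can then run bootstrap_comfy.py --with-models
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.is_symlink() or target.exists():
            continue  # already linked or provided some other way; leave it alone
        os.symlink(source, target, target_is_directory=source.is_dir())
    return missing
```

A non-empty return value is the signal to download models before submitting to ComfyUI, which is the case that arises after a hardware change.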