| # VoiceGate HF Space Deployment Plan | |
| ## Goal | |
| Deploy VoiceGate to a Hugging Face Space with a Gradio interface, using ZeroGPU | |
| for the GPU-heavy inference path. | |
| The initial target is a short-audio workflow that proves the full chain: | |
| audio input -> source separation -> ASR/SRT -> LLM translation -> VoxCPM TTS -> | |
| SRT-aligned audio merge -> audio and subtitle outputs. | |
| ## Repository Roles | |
| Use three clear ownership boundaries: | |
| - `VoiceGate`: upstream project assets, README, diagrams, and source workflows. | |
| - `comfyui_voicebridge`: the VoiceBridge ComfyUI custom node repository. | |
| - `VoiceGate-hf`: this repository, the Hugging Face Space deployment wrapper. | |
| The Space repository should not depend on nested git repositories at runtime. | |
| For deployment, copy or vendor only the required workflow files, custom nodes, | |
| bootstrap scripts, and Gradio application code into the Space layout. | |
| Current local state: | |
| - The outer `VoiceGate-hf` repository is connected to the Hugging Face Space | |
| remote `build-small-hackathon/VoiceGate`. | |
| - `VoiceGate/` is present as a local upstream checkout only. It is ignored by | |
| the Space repository and must not be treated as runtime content. | |
| - `VoiceGate/.gitmodules` references `comfyui_voicebridge`, but the local | |
| `VoiceGate/comfyui_voicebridge/` directory is currently empty. | |
| - `VoiceGate/workflows/VoiceGate-Workflow.json` is the UI workflow. | |
| - `VoiceGate/workflows/VoiceGate-Workflow_api.json` exists and has been | |
| confirmed as valid JSON. It still needs parameterization before Gradio can | |
| submit it to ComfyUI. | |
| - `workflows/voicegate_api.json` is the deployment copy of the API workflow. | |
| - `workflows/voicegate_ui.json` is the deployment reference copy of the UI | |
| workflow. | |
| ## Repository Hygiene | |
| The Space repository should stay small and deterministic: | |
| - Keep `VoiceGate/` as a local-only upstream checkout. | |
| - Copy deployment-ready workflow files into `workflows/`. | |
| - Copy or install custom nodes through an explicit bootstrap step. | |
| - Do not commit nested `.git` directories, model weights, API keys, uploaded | |
| media, generated audio, generated subtitles, or ComfyUI runtime caches. | |
| - Keep `.gitattributes` LFS rules for future model or binary assets, but prefer | |
| downloading model files at runtime instead of committing them. | |
| ## Hugging Face Space Constraints | |
| ZeroGPU is intended for Gradio SDK Spaces. The Gradio app should expose | |
| a normal `app.py`, and GPU-heavy functions should be wrapped with `@spaces.GPU`. | |
| This means the first implementation should prefer: | |
| - Gradio Space root files: `README.md`, `app.py`, `requirements.txt`, | |
| `packages.txt`. | |
| - A Python bootstrap that installs or prepares ComfyUI and custom nodes. | |
| - A workflow client that calls the local ComfyUI API from inside the Gradio | |
| handler. | |
| Avoid starting with a Docker Space for ZeroGPU, even though Docker would be a | |
| cleaner fit for a long-running ComfyUI service. | |
| ## Proposed Space Layout | |
| ```text | |
| VoiceGate-hf/ | |
| |-- README.md | |
| |-- app.py | |
| |-- requirements.txt | |
| |-- packages.txt | |
| |-- scripts/ | |
| | |-- bootstrap_comfy.py | |
| | |-- run_comfy.py | |
| | `-- workflow_client.py | |
| |-- workflows/ | |
| | |-- voicegate_api.json | |
| | `-- voicegate_ui.json | |
| |-- custom_nodes/ | |
| | `-- comfyui_voicebridge/ | |
| |-- assets/ | |
| `-- docs/ | |
| `-- deployment-plan.md | |
| ``` | |
| The current repository has the root scaffold, planning docs, deployment | |
| workflow copies, and first versions of the three scripts. Custom nodes are not | |
| vendored into this repository; per the Phase 2 decision, bootstrap installs | |
| them from pinned git repositories. | |
| ## Known Workflow Nodes | |
| The API workflow references these important node classes: | |
| - `LoadAudio` | |
| - `MelBandRoFormerModelLoader` | |
| - `MelBandRoFormerSampler` | |
| - `VoiceBridgeASRLoader` | |
| - `VoiceBridgeASRTranscribe` | |
| - `GenerateSRT` | |
| - `RH_LLMAPI_NODE` | |
| - `VoiceBridgeSRTSplitter` | |
| - `RunningHub_VoxCPM_LoadModel` | |
| - `RunningHub_VoxCPM_Generate` | |
| - `VoiceBridgeAudioListMergerBySRT` | |
| - `MergeAudioMW` | |
| - `SaveAudioMP3` | |
| - `SaveSRTFromString` | |
| - `TrimAudioDuration` | |
| - `Any Switch (rgthree)` | |
| - `easy showAnything` | |
| - `easy string` | |
| - `CR Text` | |
| - `ReplaceText` | |
| This implies dependencies on VoiceBridge, VoxCPM/RunningHub nodes, | |
| MelBandRoFormer nodes/models, rgthree, easy-use, and the LLM API node package. | |
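The node inventory above can be regenerated from the API workflow rather than maintained by hand. The following is a minimal sketch, assuming the file uses ComfyUI's API export format, where each top-level key is a node id mapped to an object with a `class_type` field. The helper name `count_node_classes` and the sample node ids and inputs are illustrative only.

```python
from collections import Counter


def count_node_classes(workflow: dict) -> Counter:
    """Count how often each class_type appears in an API-format workflow."""
    return Counter(
        node["class_type"]
        for node in workflow.values()
        if isinstance(node, dict) and "class_type" in node
    )


# Tiny stand-in for workflows/voicegate_api.json; ids and inputs are made up.
sample = {
    "1": {"class_type": "LoadAudio", "inputs": {"audio": "input.wav"}},
    "2": {"class_type": "easy string", "inputs": {"value": "hello"}},
    "3": {"class_type": "easy string", "inputs": {"value": "world"}},
}

counts = count_node_classes(sample)
for name in sorted(counts):
    print(f"{name}: {counts[name]}")
```

To run it against the real file, replace `sample` with `json.load(open("workflows/voicegate_api.json"))`.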
| ## Model and Secret Inventory | |
| Expected model assets: | |
| - `Qwen/Qwen3-ASR-1.7B` | |
| - `Qwen/Qwen3-ForcedAligner-0.6B` | |
| - `VoxCPM2` | |
| - `MelBandRoFormer_comfy/MelBandRoformer_fp32.safetensors` | |
| Expected Space secrets: | |
| - `HF_TOKEN`, if private or gated model downloads are needed. | |
| - `DEEPSEEK_API_KEY` or another LLM provider key. | |
| - Optional LLM base URL and model name configuration. | |
| Do not commit model weights, API keys, generated audio, or generated subtitles. | |
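One way to read these secrets at startup is sketched below. It assumes `DEEPSEEK_API_KEY` is the required key and treats everything else as optional. The variable names `LLM_BASE_URL` and `LLM_MODEL` and the function `read_llm_settings` are hypothetical, since the plan does not name the optional configuration variables.

```python
import os
from typing import Mapping, Optional


def read_llm_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect LLM settings from the environment and fail early if the key is missing."""
    env = os.environ if env is None else env
    api_key = env.get("DEEPSEEK_API_KEY", "")
    if not api_key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set; add it as a Space secret.")
    return {
        "api_key": api_key,
        "hf_token": env.get("HF_TOKEN"),      # only needed for private or gated models
        "base_url": env.get("LLM_BASE_URL"),  # hypothetical variable name
        "model": env.get("LLM_MODEL"),        # hypothetical variable name
    }


# Example with an explicit mapping instead of the real process environment.
settings = read_llm_settings({"DEEPSEEK_API_KEY": "placeholder-value"})
print(sorted(settings))
```

Passing the mapping explicitly keeps the function testable without touching real secrets.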
| ## Implementation Phases | |
| ### Phase 1: Scaffold and Repository Hygiene | |
| Done: | |
| - Add HF Space root files. | |
| - Add minimal Gradio placeholder. | |
| - Add deployment plan. | |
| - Add ignore rules for runtime and generated artifacts. | |
| - Add a TODO checklist. | |
| - Copy the API workflow to `workflows/voicegate_api.json`. | |
| - Copy the UI workflow to `workflows/voicegate_ui.json`. | |
| - Confirm the API workflow is valid JSON. | |
| - Confirm the workflow files do not contain real API keys. | |
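The last check in this list can be automated so it runs before every push. The sketch below walks a parsed workflow and reports string values that look like provider keys. The `sk-` pattern is an assumption about key format, not something the workflow defines, and the `api_key` input name in the sample is hypothetical.

```python
import re

# Assumed key shape: "sk-" followed by at least 20 token characters.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{20,}")


def find_suspect_strings(value, path="$"):
    """Yield (json_path, string) for every string value that looks like a key."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from find_suspect_strings(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_suspect_strings(child, f"{path}[{index}]")
    elif isinstance(value, str) and KEY_PATTERN.search(value):
        yield path, value


# Stand-in for a parsed workflow; the "api_key" input name is hypothetical.
clean_workflow = {"7": {"class_type": "RH_LLMAPI_NODE", "inputs": {"api_key": ""}}}
hits = list(find_suspect_strings(clean_workflow))
print("suspect values:", len(hits))
```

A non-empty result should block the commit until the value is replaced with an empty string or a placeholder.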
| ### Phase 2: Dependency Inventory | |
| Done: | |
| - Identify the ComfyUI and custom node repositories needed by the API workflow. | |
| - Pin the current candidate commits in `docs/dependency-inventory.md`. | |
| - Identify initial Python, system package, model, and secret requirements. | |
| - Decide to install custom nodes from pinned git URLs during bootstrap instead | |
| of vendoring them into this Space repo. | |
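Installing from pinned git URLs amounts to a clone followed by a checkout of a fixed commit. The sketch below only builds the command lists, so it can be inspected without network access. The function name, URL, and commit value are placeholders, not entries from `docs/dependency-inventory.md`.

```python
from pathlib import Path


def clone_commands(url: str, commit: str, dest: Path) -> list[list[str]]:
    """Build the git commands that place `dest` at exactly `commit`."""
    return [
        # Clone without populating the working tree; the checkout below does that.
        ["git", "clone", "--no-checkout", url, str(dest)],
        # Detach at the pinned commit so later upstream changes cannot leak in.
        ["git", "-C", str(dest), "checkout", "--detach", commit],
    ]


# Placeholder values for illustration only.
commands = clone_commands(
    "https://example.com/org/some_custom_node.git",
    "0123456789abcdef0123456789abcdef01234567",
    Path("custom_nodes") / "some_custom_node",
)
for command in commands:
    print(" ".join(command))
```

A bootstrap script could pass each list to `subprocess.run(command, check=True)` so that a failed clone or checkout stops the startup instead of continuing with a partial install.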
| ### Phase 3: Runtime Bootstrap | |
| Create scripts that can: | |
| - Clone or install ComfyUI. | |
| - Install Python dependencies. | |
| - Install required custom nodes at pinned commits. | |
| - Download or locate required model files. | |
| - Start ComfyUI locally inside the Space process. | |
| Current script status: | |
| - `scripts/bootstrap_comfy.py` clones ComfyUI and all pinned custom node | |
| repositories, installs their requirements, prepares model directories, and | |
| can optionally download the VoxCPM2 and MelBandRoFormer assets. | |
| - `scripts/run_comfy.py` starts ComfyUI and waits for `/system_stats`. | |
| - `scripts/workflow_client.py` uploads audio, patches the VoiceGate API | |
| workflow, submits it through `/prompt`, and waits on `/history/{prompt_id}`. | |
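The readiness wait described for `scripts/run_comfy.py` can be sketched as a polling loop with a timeout. This is an illustration, not the script's actual code. The probe is injected so the loop can be exercised without a server, and the address `127.0.0.1:8188` is an assumption based on ComfyUI's default port.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str = "http://127.0.0.1:8188/system_stats") -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status == 200


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it succeeds or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused and URLError both land here while the server boots.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)


# Demonstration with a fake probe that fails twice before succeeding.
attempts = []

def fake_probe() -> bool:
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionRefusedError("not up yet")
    return True

ready = wait_until_ready(fake_probe, timeout=5.0, sleep=lambda seconds: None)
print("ready:", ready, "after", len(attempts), "attempts")
```

Against a real process the call would be `wait_until_ready(http_probe)`.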
| Remaining runtime bootstrap work: | |
| - Wire bootstrap/startup behavior into `app.py`. | |
| - Validate the bootstrap and ComfyUI startup in the actual Space container. | |
| - Confirm the upload endpoint used by `LoadAudio` accepts the audio files we | |
| send from Gradio. | |
| ### Phase 4: Workflow Parameterization | |
| Parameterize `workflows/voicegate_api.json` before submitting it to ComfyUI. | |
| Required edits: | |
| - Patch hard-coded audio filenames with Gradio-uploaded input files. | |
| - Patch API keys from environment variables. | |
| - Patch target language, LLM model, and provider base URL. | |
| - Ensure output nodes produce deterministic job-specific file paths. | |
| These edits are implemented in `scripts/workflow_client.py`, but they still | |
| need to be connected to the Gradio UI and verified against a running ComfyUI | |
| process. | |
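The patching step can be illustrated with a helper that rewrites inputs on every node of a given class and leaves the template untouched. This is a sketch, not the code in `scripts/workflow_client.py`. The input names `audio` and `filename_prefix` are assumptions about the respective nodes, and the sample workflow is a stand-in for the real file.

```python
import copy


def patch_inputs(workflow: dict, class_type: str, updates: dict) -> dict:
    """Return a copy of `workflow` with `updates` applied to all nodes of `class_type`."""
    patched = copy.deepcopy(workflow)
    matched = 0
    for node in patched.values():
        if isinstance(node, dict) and node.get("class_type") == class_type:
            node.setdefault("inputs", {}).update(updates)
            matched += 1
    if matched == 0:
        # Fail loudly so a renamed or missing node does not go unnoticed.
        raise KeyError(f"no node with class_type {class_type!r}")
    return patched


# Stand-in template; node ids and input names are assumptions.
template = {
    "1": {"class_type": "LoadAudio", "inputs": {"audio": "demo.wav"}},
    "9": {"class_type": "SaveAudioMP3", "inputs": {"filename_prefix": "audio/out"}},
}

job_id = "job-0001"  # hypothetical job identifier
workflow = patch_inputs(template, "LoadAudio", {"audio": f"{job_id}.wav"})
workflow = patch_inputs(workflow, "SaveAudioMP3", {"filename_prefix": f"voicegate/{job_id}"})
print(workflow["1"]["inputs"], workflow["9"]["inputs"])
```

Working on a deep copy means concurrent requests can share one loaded template without seeing each other's filenames or keys.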
| ### Phase 5: Gradio Integration | |
| Build the first real interface: | |
| - Input audio file. | |
| - Target language selector/text input. | |
| - Source language, default `auto`. | |
| - Optional prompt override. | |
| - Output audio. | |
| - Output translated/adjusted SRT. | |
| - Runtime log. | |
| Wrap the end-to-end function with `@spaces.GPU(duration=...)` and start with a | |
| short maximum input duration. | |
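A possible shape for the wrapped handler is sketched below. It falls back to a no-op decorator when the `spaces` package is not installed, so the same file runs on a local machine. The 30-second limit, the `duration=120` value, and the handler body are placeholders chosen for illustration, not decisions recorded in this plan.

```python
try:
    import spaces  # available inside Hugging Face Spaces
    gpu = spaces.GPU
except ImportError:
    def gpu(func=None, *, duration=60):
        """Local stand-in that supports both @gpu and @gpu(duration=...)."""
        def wrap(inner):
            return inner
        return wrap(func) if callable(func) else wrap


MAX_INPUT_SECONDS = 30.0  # assumed limit for the first short-audio release


def check_duration(seconds: float, limit: float = MAX_INPUT_SECONDS) -> None:
    """Reject inputs that would exceed the GPU time budget."""
    if seconds > limit:
        raise ValueError(f"audio is {seconds:.1f}s long; the limit is {limit:.0f}s")


@gpu(duration=120)
def translate_audio(audio_seconds: float, target_language: str) -> str:
    """Placeholder for the end-to-end handler; the real one would call ComfyUI."""
    check_duration(audio_seconds)
    return f"would translate {audio_seconds:.1f}s of audio into {target_language}"


print(translate_audio(5.0, "en"))
```

Checking the duration inside the decorated function keeps the guard next to the GPU work. Moving it in front of the decorated call would avoid requesting a GPU for inputs that are going to be rejected anyway.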
| ### Phase 6: Verification | |
| Verify in this order: | |
| 1. ComfyUI starts and exposes its local API. | |
| 2. TTS-only minimal workflow runs. | |
| 3. ASR-only short audio workflow runs. | |
| 4. SRT splitter + VoxCPM + merger runs. | |
| 5. Full VoiceGate short-audio workflow runs. | |
| 6. Video input support is added after the audio path is stable. | |
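The ordering matters because each stage depends on the previous one, so a runner should stop at the first failure. A minimal sketch follows, with hypothetical stage callables standing in for the real checks.

```python
from typing import Callable, Sequence


def run_stages(stages: Sequence[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run checks in order and stop at the first one that fails."""
    passed = []
    for name, check in stages:
        if not check():
            print(f"FAILED: {name}")
            break
        print(f"ok: {name}")
        passed.append(name)
    return passed


# Placeholder checks; real ones would submit the corresponding workflows.
stages = [
    ("ComfyUI API reachable", lambda: True),
    ("TTS-only workflow", lambda: True),
    ("ASR-only workflow", lambda: False),  # simulated failure
    ("SRT splitter + VoxCPM + merger", lambda: True),
]
passed = run_stages(stages)
```

In this simulated run the fourth stage is never attempted, which keeps GPU time from being spent on a workflow whose prerequisites are already known to be broken.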
| ## Immediate Next Step | |
| Continue Phase 3 by wiring bootstrap/startup behavior into `app.py`, then test | |
| the scripts inside the running Hugging Face Space container. | |