| # VoiceGate HF Space Work Log | |
| This document records the effective work completed while preparing the | |
| `build-small-hackathon/VoiceGate` Hugging Face Space, plus the pitfalls found | |
| and how they were resolved. | |
| ## Current Snapshot | |
| - Space: `https://huggingface.co/spaces/build-small-hackathon/VoiceGate` | |
| - Space git remote: `https://huggingface.co/spaces/build-small-hackathon/VoiceGate` | |
| - Runtime hardware: ZeroGPU / `zero-a10g` | |
| - Space SDK: Gradio | |
| - Local Space wrapper repo: `VoiceGate-hf` | |
| - Local upstream reference checkout: `VoiceGate/` | |
| - Latest confirmed normal runtime commit: `316b35db739d74d05543d6c8c9dd9c16e0580b17` | |
| - Current expected Space secret: `DEEPSEEK_API_KEY` | |
| - Default persistent model root: `/data/voicegate_models` | |
| Do not commit API keys, model weights, uploaded media, generated outputs, or the | |
| local `VoiceGate/` upstream checkout. | |
| ## Executive Summary | |
| The Space is no longer just a blank scaffold. It can now run Gradio, invoke | |
| ZeroGPU, prepare a ComfyUI runtime, start ComfyUI from a GPU-backed Gradio | |
| function, and submit several segmented ComfyUI workflows. | |
| Confirmed working: | |
| - Hugging Face Space git push and normal rebuild flow. | |
| - Dev Mode SSH for CPU/container diagnostics. | |
| - ZeroGPU invocation from Gradio through `@spaces.GPU`. | |
| - ComfyUI startup from inside a `@spaces.GPU` function. | |
| - ComfyUI API calls from the Gradio process. | |
| - DeepSeek-compatible LLM node with the Space secret. | |
| - MelBand RoFormer smoke tests in CPU mode and ZeroGPU mode. | |
| - VoxCPM2 TTS-only smoke test in ZeroGPU mode. | |
| - VoiceBridge ASR-only smoke test in ZeroGPU mode. | |
| - Persistent model storage for VoxCPM2, MelBand, Qwen3-ASR, and Qwen3 forced | |
| aligner under `/data`. | |
| Not yet confirmed at the start of 2026-06-06: | |
| - SRT split -> VoxCPM -> SRT merge. | |
| - Full short-audio VoiceGate workflow. | |
| - Final user-facing Gradio upload/download UI. | |
| ## Repository Setup Completed | |
| - Created and pushed the Space wrapper repository. | |
| - Kept `VoiceGate/` as a local-only upstream reference and ignored it in git. | |
| - Preserved Hugging Face LFS rules. | |
| - Copied deployment workflows: | |
|   - `workflows/voicegate_api.json` | |
|   - `workflows/voicegate_ui.json` | |
| - Confirmed the API workflow JSON is valid. | |
| - Confirmed workflow files contain no committed API key. | |
| ## Dependency Inventory Completed | |
| Required workflow node providers were identified and pinned: | |
| - ComfyUI core: | |
| `comfyanonymous/ComfyUI` | |
| - VoiceBridge: | |
| `YanTianlong-01/comfyui_voicebridge` | |
| - RunningHub VoxCPM: | |
| `RH-RunningHub/ComfyUI_RH_VoxCPM` | |
| - MelBand RoFormer: | |
| `kijai/ComfyUI-MelBandRoFormer` | |
| - RunningHub LLM API: | |
| `HM-RunningHub/ComfyUI_RH_LLM_API` | |
| - rgthree: | |
| `rgthree/rgthree-comfy` | |
| - Easy Use: | |
| `yolain/ComfyUI-Easy-Use` | |
| - Comfyroll: | |
| `Suzie1/ComfyUI_Comfyroll_CustomNodes` | |
| - MW AudioTools: | |
| `billwuhao/ComfyUI_AudioTools` | |
| Important node source confirmations: | |
| - `ReplaceText` is provided by ComfyUI core extra nodes. | |
| - `MergeAudioMW` is provided by `ComfyUI_AudioTools`. | |
| - `RH_LLMAPI_NODE` is provided by `ComfyUI_RH_LLM_API`. | |
| ## Runtime Bootstrap Added | |
| The following scripts were added: | |
| - `scripts/bootstrap_comfy.py` | |
|   - Clones ComfyUI. | |
|   - Checks out pinned commits. | |
|   - Clones required custom node repositories. | |
|   - Installs ComfyUI and custom node Python requirements. | |
|   - Prepares expected model directories. | |
|   - Optionally downloads large model assets with `--with-models`. | |
| - `scripts/run_comfy.py` | |
|   - Starts ComfyUI. | |
|   - Waits for `/system_stats`. | |
|   - Supports `--cpu` for SSH diagnostics. | |
| - `scripts/workflow_client.py` | |
|   - Loads `workflows/voicegate_api.json`. | |
|   - Uploads audio through the ComfyUI API. | |
|   - Patches workflow inputs. | |
|   - Submits `/prompt`. | |
|   - Waits for `/history/{prompt_id}`. | |
| Workflow patching currently covers: | |
| - Node `16`: uploaded audio filename. | |
| - Node `105`: `DEEPSEEK_API_KEY`. | |
| - Node `105`: API base URL. | |
| - Node `105`: LLM model name. | |
| - Node `110`: target language. | |
| - Node `180`: job-specific audio output prefix. | |
| - Node `214`: job-specific SRT output prefix. | |
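The patching step listed above can be sketched as a pure function over the API-format workflow dictionary. The node IDs come from the list; every input field name (`audio`, `api_key`, `base_url`, `model`, `value`, `filename_prefix`) is an assumption made for illustration and may not match the real node schemas in `workflows/voicegate_api.json`.

```python
import copy

# Node IDs are taken from the work log. The input field names are
# hypothetical and must be checked against the real workflow JSON.
PATCH_TARGETS = {
    "audio_filename": ("16", "audio"),
    "api_key": ("105", "api_key"),
    "base_url": ("105", "base_url"),
    "model": ("105", "model"),
    "target_language": ("110", "value"),
    "audio_prefix": ("180", "filename_prefix"),
    "srt_prefix": ("214", "filename_prefix"),
}


def patch_workflow(workflow: dict, values: dict) -> dict:
    """Return a patched copy of an API-format workflow; the input is not mutated."""
    patched = copy.deepcopy(workflow)
    for key, value in values.items():
        node_id, field = PATCH_TARGETS[key]
        if node_id not in patched:
            raise KeyError(f"workflow has no node {node_id} for {key!r}")
        patched[node_id].setdefault("inputs", {})[field] = value
    return patched
```

Working on a deep copy keeps the loaded template reusable across jobs, so one job's API key or output prefix cannot leak into the next submission.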
| ## Hugging Face Space Runtime Findings | |
| ### Dev Mode and SSH | |
| SSH target: | |
| ```text | |
| build-small-hackathon-voicegate@ssh.hf.space | |
| ``` | |
| Local private key: | |
| ```text | |
| C:\Users\yantianlong\.ssh\codex_space_voicegate | |
| ``` | |
| SSH is only available while the Space is in Dev Mode. Normal running Spaces do | |
| not accept SSH and return: | |
| ```text | |
| Bad request: SSH in only allowed in Dev mode | |
| ``` | |
| Dev Mode can be toggled through the Hugging Face API endpoint: | |
| ```text | |
| POST /api/spaces/build-small-hackathon/VoiceGate/dev-mode | |
| ``` | |
| Use Dev Mode for diagnostics only. Persistent fixes must be committed locally | |
| and pushed. | |
| ### Dev Mode Stale Commit Pitfall | |
| The running container initially stayed on the original template commit: | |
| ```text | |
| a94117f35a42cb17f654ae70cbe619c15345d057 | |
| ``` | |
| even after newer commits were pushed. `restart_space` alone did not move it to | |
| the latest repository state while Dev Mode was enabled. | |
| Fix: | |
| - Disable Dev Mode. | |
| - Use `factory_reboot=True` or push a new commit to trigger a normal rebuild. | |
| - Confirm runtime metadata reports the latest commit. | |
| ### ZeroGPU Startup Requirement | |
| When Dev Mode was disabled, the Space entered `RUNTIME_ERROR` with: | |
| ```text | |
| No @spaces.GPU function detected during startup | |
| ``` | |
| Fix: | |
| - Import `spaces`. | |
| - Add at least one `@spaces.GPU(duration=...)` function in `app.py`. | |
| Initial placeholder fix: | |
| ```python | |
| import spaces | |
| @spaces.GPU(duration=30) | |
| def placeholder(): | |
|     ... | |
| ``` | |
| Later this placeholder was replaced by real diagnostic functions: | |
| ```python | |
| @spaces.GPU(duration=60) | |
| def gpu_smoke_test(): | |
|     ... | |
| @spaces.GPU(duration=900) | |
| def comfy_runtime_test(): | |
|     ... | |
| ``` | |
| ### SSH Does Not Expose ZeroGPU CUDA | |
| Starting ComfyUI normally through SSH failed with: | |
| ```text | |
| RuntimeError: No CUDA GPUs are available | |
| ``` | |
| Conclusion: | |
| - SSH is useful for CPU-mode diagnostics. | |
| - Real GPU work must run from the Gradio process inside a `@spaces.GPU` | |
| function. | |
| CPU diagnostic command: | |
| ```bash | |
| python scripts/run_comfy.py --cpu | |
| ``` | |
| ### Gradio Request Timeout During Bootstrap | |
| Long bootstrap work should not run synchronously inside a Gradio request. The | |
| first attempt did this: | |
| ```text | |
| Gradio click -> bootstrap_comfy.py -> clone repos -> pip install -> start ComfyUI | |
| ``` | |
| The request was interrupted by Gradio/ZeroGPU's outer queue after roughly 2.5 | |
| minutes and returned: | |
| ```text | |
| event: error | |
| data: {"error": null} | |
| ``` | |
| Fix: | |
| - Add a non-GPU `Prepare` action that starts `scripts/bootstrap_comfy.py` as a | |
| background process. | |
| - Add `Prepare Status` to poll `/tmp/voicegate_bootstrap.log`. | |
| - Keep GPU actions focused on starting ComfyUI and running actual CUDA work. | |
| This avoids wasting ZeroGPU time on clone/install steps and prevents the request | |
| from being killed before diagnostics can return useful logs. | |
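A minimal sketch of the background `Prepare` pattern described above, assuming the app keeps the returned process handle somewhere it can reach on later requests. The function names are hypothetical; only `subprocess` and `pathlib` from the standard library are used.

```python
import subprocess
import sys
from pathlib import Path


def start_background(command: list[str], log_path: Path) -> subprocess.Popen:
    """Start a long-running command without blocking the request; log to a file."""
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
    finally:
        # The child process holds its own copy of the descriptor.
        log_file.close()


def read_status(process: subprocess.Popen, log_path: Path, tail_lines: int = 20) -> dict:
    """Report whether the process is still running plus the last log lines."""
    return_code = process.poll()  # None while the process is still running
    text = log_path.read_text(errors="replace") if log_path.exists() else ""
    return {
        "running": return_code is None,
        "return_code": return_code,
        "log_tail": "\n".join(text.splitlines()[-tail_lines:]),
    }
```

In the Space this would correspond to something like `start_background([sys.executable, "scripts/bootstrap_comfy.py"], Path("/tmp/voicegate_bootstrap.log"))`, with the status function returning quickly enough to stay well inside the request limit.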
| ### Runtime Pip Install Pitfall | |
| The background bootstrap installed a large dependency set and upgraded the | |
| on-disk Torch package. The already-running Gradio process continued to report: | |
| ```text | |
| torch=2.11.0+cu130 | |
| ``` | |
| while the ComfyUI subprocess started afterwards reported: | |
| ```text | |
| pytorch_version=2.12.0+cu130 | |
| ``` | |
| This is workable for diagnostics, but final production should avoid heavy | |
| runtime `pip install` where possible. Prefer moving stable dependencies into | |
| Space build-time requirements or explicitly controlling pins. | |
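The mismatch above can be detected from inside the running process by comparing the version of the already-imported module with the version recorded on disk. This is a sketch with hypothetical helper names; it assumes the package exposes `__version__` and that build tags such as `+cu130` should be ignored when comparing.

```python
import importlib.metadata
import sys
from typing import Optional


def imported_version(module_name: str) -> Optional[str]:
    """Version of the module already loaded in this process, if any."""
    module = sys.modules.get(module_name)
    return getattr(module, "__version__", None) if module else None


def installed_version(distribution: str) -> Optional[str]:
    """Version currently recorded on disk for a distribution, if installed."""
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return None


def has_drifted(in_process: Optional[str], on_disk: Optional[str]) -> bool:
    """True when both versions are known and differ, ignoring local tags like +cu130."""
    if in_process is None or on_disk is None:
        return False
    return in_process.split("+")[0] != on_disk.split("+")[0]
```

A diagnostic action could call `has_drifted(imported_version("torch"), installed_version("torch"))` after the bootstrap finishes and warn that a restart is needed before the Gradio process sees the new package.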
| ### ZeroGPU Duration and Quota Pitfall | |
| The ASR diagnostic was first decorated with: | |
| ```python | |
| @spaces.GPU(duration=1800) | |
| ``` | |
| The Space rejected it before execution: | |
| ```text | |
| ZeroGPU illegal duration | |
| The requested GPU duration is larger than the maximum allowed | |
| ``` | |
| After the decorator was reduced to `duration=1200`, the Space still rejected the | |
| call because the quota precheck reported: | |
| ```text | |
| You have exceeded your Pro ZeroGPU quota (1800s requested vs. 1389s left) | |
| ``` | |
| The working diagnostic used: | |
| ```python | |
| @spaces.GPU(duration=900) | |
| ``` | |
| For future tests, keep diagnostic durations conservative and increase only when | |
| the workflow has already proven it needs more time. | |
| ## Dependency Pitfalls and Fixes | |
| `ComfyUI_AudioTools` initially failed to import. | |
| First failure: | |
| ```text | |
| SoX could not be found | |
| ModuleNotFoundError: No module named 'sounddevice' | |
| ``` | |
| Second failure after adding `sounddevice`: | |
| ```text | |
| OSError: PortAudio library not found | |
| ``` | |
| Third failure: | |
| ```text | |
| ModuleNotFoundError: No module named 'easydict' | |
| ``` | |
| Fourth failure: | |
| ```text | |
| ModuleNotFoundError: No module named 'pytorch_lightning' | |
| ``` | |
| Fixes added: | |
| - `packages.txt` | |
|   - `sox` | |
|   - `libportaudio2` | |
|   - `portaudio19-dev` | |
| - `requirements.txt` | |
|   - `sounddevice` | |
|   - `easydict` | |
|   - `pytorch-lightning` | |
| Final verification: | |
| ```text | |
| 0.4 seconds: /home/user/app/ComfyUI/custom_nodes/ComfyUI_AudioTools | |
| ``` | |
| with no `IMPORT FAILED` entry. | |
| ## ComfyUI API Smoke Test | |
| Test audio source: | |
| ```text | |
| D:\voicebridge-test-audio\test_audio\2-坤哥.MP3 | |
| ``` | |
| The first upload attempt used a plain PowerShell byte pipeline and corrupted the | |
| binary file. The remote file was identified as text instead of MP3, and | |
| `LoadAudio` failed with: | |
| ```text | |
| Invalid data found when processing input: 'avcodec_send_packet()' | |
| ``` | |
| Fix: | |
| - Upload binary test media through a binary-safe method. | |
| - Verify remote `sha256sum` before using the file. | |
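The checksum comparison in the fix above can be sketched as follows. The helper names are hypothetical; the code hashes the local file in binary mode and compares it with a line printed by `sha256sum` on the remote side.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in binary mode, chunk by chunk, in the hex form `sha256sum` prints."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_remote(path: Path, remote_sha256sum_line: str) -> bool:
    """Compare against a remote `sha256sum` output line of the form '<hex>  <name>'."""
    remote_hex = remote_sha256sum_line.split()[0].lower()
    return sha256_of_file(path) == remote_hex
```

A mismatch here would have caught the corrupted PowerShell upload before `LoadAudio` ever saw the file.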
| Successful upload result: | |
| ```text | |
| /tmp/voicegate_test_audio.mp3: Audio file with ID3 version 2.3.0 | |
| ``` | |
| ComfyUI API endpoints verified in Dev Mode: | |
| - `/system_stats` | |
| - `/upload/image` | |
| - `/prompt` | |
| - `/history/{prompt_id}` | |
| Minimal test workflow: | |
| ```text | |
| LoadAudio -> SaveAudioMP3 | |
| ``` | |
| Successful `/history/{prompt_id}` result: | |
| ```text | |
| status_str: success | |
| completed: true | |
| ``` | |
| Output reported by ComfyUI: | |
| ```text | |
| audio/api_smoke_voicegate_00001.mp3 | |
| ``` | |
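The submit-then-poll flow verified above can be sketched with the HTTP call injected as a function, which keeps the polling logic testable without a running ComfyUI. `fetch_history` is a hypothetical callable that would perform `GET /history/{prompt_id}` and return the decoded JSON; the response layout is assumed from the `status_str` and `completed` fields observed in this log.

```python
import time
from typing import Callable, Optional


def wait_for_history(
    fetch_history: Callable[[str], dict],
    prompt_id: str,
    timeout_sec: float = 800.0,
    poll_interval_sec: float = 2.0,
) -> Optional[dict]:
    """Poll until the prompt finishes or fails; return its entry, or None on timeout."""
    deadline = time.monotonic() + timeout_sec
    while True:
        entry = fetch_history(prompt_id).get(prompt_id)
        if entry:
            status = entry.get("status", {})
            # Assumed layout: {"status_str": "success" | "error", "completed": bool}
            if status.get("completed") or status.get("status_str") == "error":
                return entry
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval_sec)
```

Returning `None` on timeout rather than raising lets the caller report partial diagnostics, which matters when the run is still progressing on the GPU but the wait window has expired.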
| ## Segmented Workflow Smoke Tests | |
| ### ComfyUI From Gradio ZeroGPU | |
| On 2026-06-05, `app.py` was expanded with diagnostic Gradio actions: | |
| - `prepare_runtime`: starts `scripts/bootstrap_comfy.py` in the background and | |
| writes progress to `/tmp/voicegate_bootstrap.log`. | |
| - `prepare_status`: reports the background bootstrap status and log tail. | |
| - `comfy_runtime_test`: runs inside `@spaces.GPU`, starts ComfyUI, and calls | |
| `/system_stats`. | |
| - `melband_gpu_test`: runs a tiny MelBand workflow inside `@spaces.GPU`. | |
| - `voxcpm_tts_gpu_test`: runs a tiny VoxCPM2 TTS-only workflow inside | |
| `@spaces.GPU`. | |
| The first attempt ran the full bootstrap synchronously inside a Gradio request. | |
| After roughly 2.5 minutes the outer queue interrupted the request and returned | |
| `event: error` with no function payload. The fix was to start bootstrap as a | |
| background process and poll its status through `prepare_status`. | |
| The background prepare completed successfully. It installed a large dependency | |
| set and upgraded the on-disk Torch package from `2.11.0` to `2.12.0`. The | |
| already-running Gradio process still reported its originally imported | |
| `torch=2.11.0+cu130`, while the newly started ComfyUI subprocess reported: | |
| ```text | |
| pytorch_version=2.12.0+cu130 | |
| ``` | |
| This is acceptable for the smoke test, but runtime pip installs are not ideal | |
| for the final app. A later pass should move heavy Python dependencies into the | |
| Space build/install phase or pin the root requirements more deliberately. | |
| `comfy_runtime_test` result: | |
| ```text | |
| cuda_available=True | |
| comfy_ready=true | |
| comfy_elapsed_sec=16.0 | |
| ComfyUI version=0.24.0 | |
| device=cuda:0 NVIDIA RTX PRO 6000 Blackwell Server Edition MIG 2g.48gb | |
| vram_total=50868518912 | |
| ``` | |
| Observed behavior: separate `@spaces.GPU` calls may run in separate worker | |
| processes, so the ComfyUI subprocess should not be assumed to persist across | |
| different button/API calls. | |
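One way to act on this observation is to make every GPU-backed action check for a live ComfyUI instance before using it, instead of relying on an earlier call. The sketch below uses two hypothetical callables: `is_ready` would probe `/system_stats`, and `start` would launch ComfyUI and wait for it.

```python
from typing import Callable


def ensure_comfy(is_ready: Callable[[], bool], start: Callable[[], None]) -> bool:
    """Start ComfyUI only if this worker has no live instance.

    Returns True when a new instance had to be started.
    """
    if is_ready():
        return False
    start()
    if not is_ready():
        raise RuntimeError("ComfyUI did not become ready after start")
    return True
```

The boolean return value is convenient for timing reports, since it separates cold-start runs from runs that reused an existing instance.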
| ### ZeroGPU Gradio Invocation | |
| On 2026-06-05, the Space was tested in normal runtime, with Dev Mode off, using | |
| a Gradio button backed by: | |
| ```python | |
| @spaces.GPU(duration=60) | |
| def gpu_smoke_test(): | |
|     ... | |
| ``` | |
| The private Space API was called with the local Hugging Face token through: | |
| ```text | |
| POST /gradio_api/call/gpu_smoke_test | |
| GET /gradio_api/call/gpu_smoke_test/{event_id} | |
| ``` | |
| Result: | |
| ```text | |
| torch=2.11.0+cu130 | |
| cuda_available=True | |
| cuda_device_count=1 | |
| device_name=NVIDIA RTX PRO 6000 Blackwell Server Edition MIG 2g.48gb | |
| total_memory_gb=47.38 | |
| tensor_result=240.0 | |
| memory_reserved_mb=2.00 | |
| ``` | |
| This confirms ZeroGPU CUDA is available from the normal Gradio runtime when the | |
| work is executed inside a `@spaces.GPU` function. SSH still should be treated as | |
| CPU-only diagnostic access. | |
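The `GET /gradio_api/call/.../{event_id}` response is a server-sent event stream. A small parser for that stream, as a sketch: the `error` event shape matches the one recorded earlier in this log, while the `complete` and `heartbeat` event names are assumptions about Gradio's stream format and should be confirmed against the Gradio version in use.

```python
import json


def parse_sse_events(body: str) -> list[tuple[str, object]]:
    """Split an SSE response body into (event_name, decoded_data) pairs."""
    events = []
    name = None
    for line in body.splitlines():
        if line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:") and name is not None:
            events.append((name, json.loads(line[len("data:"):].strip())))
            name = None
    return events


def final_result(events: list[tuple[str, object]]) -> object:
    """Return the payload of the `complete` event, or raise on an `error` event."""
    for name, data in events:
        if name == "error":
            raise RuntimeError(f"Gradio call failed: {data!r}")
        if name == "complete":
            return data
    return None
```

Raising on `error` makes an interrupted request, such as the earlier `{"error": null}` case, fail loudly instead of looking like an empty result.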
| ### DeepSeek LLM Node | |
| On 2026-06-05, `RH_LLMAPI_NODE` was tested through ComfyUI in Dev Mode using | |
| the Space `DEEPSEEK_API_KEY` secret. The key was not printed. | |
| Minimal workflow: | |
| ```text | |
| RH_LLMAPI_NODE -> easy showAnything | |
| ``` | |
| Prompt: | |
| ```text | |
| Translate to Simplified Chinese: VoiceGate smoke test. | |
| ``` | |
| Result: | |
| ```text | |
| status_str: success | |
| output: VoiceGate 冒烟测试。 | |
| ``` | |
| This confirms the RunningHub LLM node can read the Space secret and call the | |
| DeepSeek-compatible API endpoint. | |
| ### MelBand RoFormer | |
| On 2026-06-05, `MelBandRoFormerModelLoader` and `MelBandRoFormerSampler` were | |
| tested through ComfyUI in CPU mode. | |
| Input: | |
| ```text | |
| 1 second synthetic 440 Hz WAV generated with ffmpeg | |
| ``` | |
| Minimal workflow: | |
| ```text | |
| LoadAudio -> MelBandRoFormerModelLoader -> MelBandRoFormerSampler | |
| -> SaveAudioMP3(vocals) | |
| -> SaveAudioMP3(instruments) | |
| ``` | |
| Result: | |
| ```text | |
| status_str: success | |
| audio/melband_smoke_vocals_00001.mp3 | |
| audio/melband_smoke_instruments_00001.mp3 | |
| ``` | |
| CPU-mode runtime for the 1 second smoke input was about 51 seconds. Real runs | |
| should execute inside a `@spaces.GPU` function. | |
| Later on 2026-06-05, the same kind of tiny MelBand smoke test was run from the | |
| normal Gradio runtime inside `@spaces.GPU`. | |
| Input: | |
| ```text | |
| 1 second synthetic 440 Hz WAV written to ComfyUI/input | |
| ``` | |
| Result: | |
| ```text | |
| status_str=success | |
| completed=True | |
| audio/melband_gpu_32459bea_instruments_00001.mp3 | |
| audio/melband_gpu_32459bea_vocals_00001.mp3 | |
| elapsed_sec=78.3 | |
| ``` | |
| This confirms the MelBand custom node and model can execute from the Space | |
| ZeroGPU path. | |
| ### VoxCPM2 TTS-only | |
| On 2026-06-05, a minimal VoxCPM2 TTS-only workflow was run from the normal | |
| Gradio runtime inside `@spaces.GPU`. | |
| Minimal workflow: | |
| ```text | |
| RunningHub_VoxCPM_LoadModel -> RunningHub_VoxCPM_Generate -> SaveAudioMP3 | |
| ``` | |
| Prompt text: | |
| ```text | |
| 你好,VoiceGate GPU 语音合成测试。 | |
| ``` | |
| Result: | |
| ```text | |
| status_str=success | |
| completed=True | |
| audio/voxcpm_tts_gpu_cda209ec_00001.mp3 | |
| elapsed_sec=766.2 | |
| ``` | |
| This confirms VoxCPM2 fits and executes in ZeroGPU, but the first cold TTS-only | |
| run was very slow. The final app should minimize cold starts, avoid repeated | |
| ComfyUI/model reloads where possible, and use shorter diagnostic prompts while | |
| tuning. | |
| ### VoiceBridge ASR-only | |
| On 2026-06-06, a minimal VoiceBridge ASR-only workflow was run from the normal | |
| Gradio runtime inside `@spaces.GPU`. | |
| Before running ASR, `scripts/bootstrap_comfy.py` was extended so Qwen ASR models | |
| also live on persistent storage: | |
| ```text | |
| /home/user/app/ComfyUI/models/Qwen3-ASR | |
| -> /data/voicegate_models/Qwen3-ASR | |
| ``` | |
| The model preparation downloads: | |
| ```text | |
| /data/voicegate_models/Qwen3-ASR/Qwen3-ASR-1.7B | |
| /data/voicegate_models/Qwen3-ASR/Qwen3-ForcedAligner-0.6B | |
| ``` | |
| Minimal workflow: | |
| ```text | |
| LoadAudio | |
| -> VoiceBridgeASRLoader(attention=sdpa, forced_aligner=Qwen/Qwen3-ForcedAligner-0.6B) | |
| -> VoiceBridgeASRTranscribe(return_timestamps=True) | |
| -> GenerateSRT | |
| -> easy showAnything | |
| ``` | |
| Input: | |
| ```text | |
| D:\voicebridge-test-audio\test_audio\2-坤哥.MP3 | |
| ``` | |
| Result: | |
| ```text | |
| status_str=success | |
| completed=True | |
| elapsed_sec=62.4 | |
| ``` | |
| Returned SRT text: | |
| ```text | |
| 1 | |
| 00:00:02,080 --> 00:00:03,200 | |
| 全民制作人们 大家好 | |
| 2 | |
| 00:00:03,439 --> 00:00:06,160 | |
| 我是练习时长两年半的个人练习生蔡徐坤 | |
| 3 | |
| 00:00:06,480 --> 00:00:09,359 | |
| 喜欢唱、跳、rap、篮球、music | |
| ``` | |
| This confirms the Qwen3-ASR model, forced aligner, VoiceBridge ASR nodes, and | |
| SRT generation can run in the Space ZeroGPU path. The smoke test intentionally | |
| used `attention=sdpa` instead of `flash_attention_2`; `flash_attention_2` | |
| availability remains unverified. | |
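For checking returned subtitles programmatically, the SRT text can be parsed into cues with start and end times in seconds. This is a minimal sketch with hypothetical helper names; it tolerates the missing blank lines seen in the output above and is not a full SRT implementation.

```python
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:00:02,080 to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(text: str) -> list[dict]:
    """Return cues as dicts with start, end (seconds) and text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    cues = []
    for position, line in enumerate(lines):
        if "-->" in line:
            start, end = (to_seconds(part) for part in line.split("-->"))
            cues.append({"start": start, "end": end, "text": []})
        elif cues:
            next_is_timing = position + 1 < len(lines) and "-->" in lines[position + 1]
            if line.isdigit() and next_is_timing:
                continue  # numeric index that belongs to the next cue
            cues[-1]["text"].append(line)
    for cue in cues:
        cue["text"] = "\n".join(cue["text"])
    return cues
```

A smoke test can then assert simple invariants, for example that every cue has `end > start` and that cues do not overlap.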
| ## Secrets and API Keys | |
| `DEEPSEEK_API_KEY` should be stored only as a Hugging Face Space Secret. | |
| Current expected secret: | |
| ```text | |
| DEEPSEEK_API_KEY | |
| ``` | |
| Optional variables: | |
| ```text | |
| DEEPSEEK_BASE_URL=https://api.deepseek.com | |
| DEEPSEEK_MODEL=deepseek-v4-flash | |
| ``` | |
| Never store these values in: | |
| - `app.py` | |
| - workflow JSON files | |
| - README files | |
| - docs | |
| - `.env` files committed to git | |
| `scripts/workflow_client.py` reads these from environment variables. | |
| `scripts/check_space_env.py` verifies whether these environment variables are | |
| present without printing their values. | |
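A presence check of this kind can be sketched as below. This is not the contents of `scripts/check_space_env.py`, only an illustration of reporting `set` or `missing` per variable so that no value ever reaches a log.

```python
import os

REQUIRED = ("DEEPSEEK_API_KEY",)
OPTIONAL = ("DEEPSEEK_BASE_URL", "DEEPSEEK_MODEL")


def describe_env(environ=os.environ) -> dict:
    """Map each variable name to 'set' or 'missing' without exposing any value."""
    return {
        name: "set" if environ.get(name) else "missing"
        for name in REQUIRED + OPTIONAL
    }


def missing_required(environ=os.environ) -> list:
    """Names of required variables that are absent or empty."""
    return [name for name in REQUIRED if not environ.get(name)]
```

Treating an empty string as missing avoids a confusing failure later, when the LLM node would send a blank key to the API.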
| ## Model Storage | |
| Large model files should live on the Space persistent storage volume instead of | |
| inside `/home/user/app`, because `/home/user/app` can be replaced during Space | |
| rebuilds. | |
| Default model root: | |
| ```text | |
| /data/voicegate_models | |
| ``` | |
| `scripts/bootstrap_comfy.py` creates symlinks from ComfyUI's expected paths to | |
| that persistent root: | |
| ```text | |
| ComfyUI/models/voxcpm/VoxCPM2 | |
| -> /data/voicegate_models/voxcpm/VoxCPM2 | |
| ComfyUI/models/diffusion_models/MelBandRoFormer_comfy | |
| -> /data/voicegate_models/diffusion_models/MelBandRoFormer_comfy | |
| ComfyUI/models/Qwen3-ASR | |
| -> /data/voicegate_models/Qwen3-ASR | |
| ``` | |
| Override the root with: | |
| ```text | |
| VOICEGATE_MODEL_ROOT | |
| ``` | |
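The symlink layout and the `VOICEGATE_MODEL_ROOT` override can be sketched as follows. The function names are hypothetical and this is not the actual code in `scripts/bootstrap_comfy.py`; it only illustrates an idempotent way to link a ComfyUI model directory to the persistent root.

```python
import os
from pathlib import Path

DEFAULT_MODEL_ROOT = "/data/voicegate_models"


def model_root() -> Path:
    """Persistent model root, overridable through VOICEGATE_MODEL_ROOT."""
    return Path(os.environ.get("VOICEGATE_MODEL_ROOT", DEFAULT_MODEL_ROOT))


def link_model_dir(comfy_models: Path, relative: str, root: Path) -> Path:
    """Point <comfy_models>/<relative> at <root>/<relative>, creating the target."""
    target = root / relative
    target.mkdir(parents=True, exist_ok=True)
    link = comfy_models / relative
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        if link.resolve() == target.resolve():
            return link  # already points at the right place
        link.unlink()  # stale link from an earlier root
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(target, target_is_directory=True)
    return link
```

Refusing to replace a real directory is a deliberate choice: silently deleting it could discard models that were downloaded into the app directory by an earlier run.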
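| On 2026-06-05, the first two explicit ComfyUI-path models (VoxCPM2 and MelBand | |
| RoFormer) were downloaded to persistent storage. The Qwen3-ASR entries were | |
| added once `scripts/bootstrap_comfy.py` was extended for the ASR test: | |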
| ```text | |
| /data/voicegate_models/voxcpm/VoxCPM2/model.safetensors | |
| /data/voicegate_models/voxcpm/VoxCPM2/audiovae.pth | |
| /data/voicegate_models/diffusion_models/MelBandRoFormer_comfy/MelBandRoformer_fp32.safetensors | |
| /data/voicegate_models/Qwen3-ASR/Qwen3-ASR-1.7B | |
| /data/voicegate_models/Qwen3-ASR/Qwen3-ForcedAligner-0.6B | |
| ``` | |
| Verified symlinks: | |
| ```text | |
| /home/user/app/ComfyUI/models/voxcpm/VoxCPM2 | |
| -> /data/voicegate_models/voxcpm/VoxCPM2 | |
| /home/user/app/ComfyUI/models/diffusion_models/MelBandRoFormer_comfy | |
| -> /data/voicegate_models/diffusion_models/MelBandRoFormer_comfy | |
| /home/user/app/ComfyUI/models/Qwen3-ASR | |
| -> /data/voicegate_models/Qwen3-ASR | |
| ``` | |
| `DEEPSEEK_API_KEY` was also verified as present in the Space environment without | |
| printing its value. | |
| Model download pitfall: | |
| - `huggingface-cli download` is deprecated and failed in the Space. | |
| - `hf download` also failed because of a CLI dependency compatibility issue. | |
| - `scripts/bootstrap_comfy.py` now uses the `huggingface_hub` Python API | |
| directly for model downloads. | |
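A download through the Python API might look like the sketch below. It uses `snapshot_download` from `huggingface_hub`; the wrapper name is hypothetical, and whether the bootstrap script uses this exact call or a different one from the same package is not stated in this log.

```python
from pathlib import Path


def download_model(repo_id: str, destination: Path) -> Path:
    """Download a full model repository into `destination` with the Python API."""
    # Imported lazily so that importing this module does not require the package.
    from huggingface_hub import snapshot_download

    destination.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(destination))
    return destination
```

For example, `download_model("Qwen/Qwen3-ForcedAligner-0.6B", Path("/data/voicegate_models/Qwen3-ASR/Qwen3-ForcedAligner-0.6B"))` would target one of the persistent paths listed above, bypassing both CLI entry points.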
| ## Current Known Good Commits | |
| - `683b147` Add ComfyUI runtime bootstrap scripts | |
| - `520334e` Record Space SSH runtime findings | |
| - `223ef10` Add ZeroGPU placeholder hook | |
| - `5dac213` Add missing AudioTools dependencies | |
| - `d849d03` Record ComfyUI API smoke test | |
| - `79b8b37` Add GPU smoke test button | |
| - `6e4cd3f` Run Space preparation in background | |
| - `b39ef30` Add ASR diagnostic workflow and deployment guide | |
| - `316b35d` Reduce ASR ZeroGPU duration | |
| - `90f8205` Reduce full workflow smoke test runtime | |
| - `b8ca809` Initialize matplotlib backend for Gradio | |
| ## Full Workflow Status | |
| On 2026-06-06, the copied Space workflows were checked against the upstream | |
| VoiceGate workflows: | |
| ```text | |
| workflows/voicegate_api.json == VoiceGate/workflows/VoiceGate-Workflow_api.json | |
| workflows/voicegate_ui.json == VoiceGate/workflows/VoiceGate-Workflow.json | |
| checked_connections=31 | |
| mismatches=0 | |
| ``` | |
| The connection validator resolves ComfyUI UI-only `SetNode` / `GetNode` helper | |
| pairs before comparing the API workflow to the UI workflow. With that resolution, | |
| the API graph used by the Space follows | |
| `VoiceGate/workflows/VoiceGate-Workflow.json`. | |
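The API side of such a comparison can be sketched as follows. In API-format workflows a linked input is a two-item list `[source_node_id, output_index]`, so the edges can be collected into a set and compared. The helper names are hypothetical, and the `SetNode` / `GetNode` resolution for the UI format is not covered here.

```python
def api_connections(workflow: dict) -> set:
    """Collect (source_node, output_index, target_node, input_name) edges."""
    edges = set()
    for node_id, node in workflow.items():
        for input_name, value in node.get("inputs", {}).items():
            # A link is a two-item list; any other value is a literal input.
            is_link = (
                isinstance(value, list)
                and len(value) == 2
                and isinstance(value[0], str)
                and isinstance(value[1], int)
            )
            if is_link:
                edges.add((value[0], value[1], node_id, input_name))
    return edges


def connection_mismatches(left: dict, right: dict) -> set:
    """Edges present in exactly one of the two workflows."""
    return api_connections(left) ^ api_connections(right)
```

Because only links are compared, parameter patches such as a changed filename or prefix do not register as mismatches.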
| Full workflow runtime attempts: | |
| - Initial full workflow submission failed at prompt validation because | |
| `VoiceBridgeASRLoader` required `source`; `scripts/workflow_client.py` now | |
| patches node `31` with `source=HuggingFace`. | |
| - The next run failed because node `31` requested `flash_attention_2`; the Space | |
| runtime did not have `flash_attn`, so the workflow patch now uses | |
| `attention=sdpa`. | |
| - A later run submitted prompt `16b45231-c2e3-4ded-aa38-ac6a3b6813d8` and | |
| reached the heavy nodes, including MelBand, VoxCPM, ASR, and forced aligner | |
| loading, but exceeded the app-side `800s` wait window before `/history` | |
| returned completion. | |
| - To reduce smoke-test runtime without changing graph connections, commit | |
| `90f8205` patches ASR `max_new_tokens=256` and VoxCPM inference | |
| `inference_steps=4`. | |
| - The next retry did not start execution because Hugging Face ZeroGPU quota was | |
| exhausted before scheduling: | |
| ```text | |
| ZeroGPU quota exceeded | |
| 1200s requested vs. 407s left | |
| try again in about 17:22:12 | |
| ``` | |
| Current conclusion: workflow connection fidelity is verified, and individual | |
| GPU smoke tests for CUDA, ComfyUI, MelBand, VoxCPM TTS, and ASR have passed. | |
| The remaining full end-to-end verification is blocked by available ZeroGPU | |
| quota/runtime, not by a known workflow connection mismatch. | |
| Follow-up result on the duplicated personal Space: | |
| - Space: `YanTianlong/VoiceGate-personal` | |
| - Hardware: Nvidia T4 Small | |
| - Input: short `2-坤哥.MP3` test audio | |
| - Target language: English | |
| - Result: success | |
| - Output: `audio/voicegate_full_23499a26_00001.mp3` | |
| - Total elapsed time reported by Gradio: `24.4s` | |
| - ComfyUI websocket elapsed time: `23.9s` | |
| Top measured node timings: | |
| ```text | |
| 206 RunningHub_VoxCPM_Generate: 8.2s | |
| 99 MelBandRoFormerSampler: 6.6s | |
| 105 RH_LLMAPI_NODE: 4.5s | |
| 33 VoiceBridgeASRTranscribe: 2.2s | |
| 45 VoiceBridgeASRTranscribe: 1.9s | |
| 180 SaveAudioMP3: 0.4s | |
| ``` | |
| This confirms the full VoiceGate workflow can run end-to-end on a warm personal | |
| T4 Small Space for very short audio. Longer audio remains untested on T4 Small | |
| and may still hit runtime or memory limits. | |
| ## Gradio Interface Status | |
| The first user-facing Gradio interface now has two tabs: | |
| - `Translate`: simple user flow with audio upload, target language dropdown, | |
| generated translated dubbing audio output, original/translated subtitle text, | |
| downloadable `.srt` subtitle files, and status. | |
| - `Diagnostics`: retained internal test controls for Prepare, GPU, ComfyUI, | |
| MelBand, VoxCPM TTS, ASR, and full workflow timing. | |
| Supported target-language dropdown values: | |
| ```text | |
| Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, | |
| Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, | |
| Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, | |
| Thai, Turkish, Vietnamese | |
| ``` | |
| The `Translate` tab also exposes an advanced cleanup slider: | |
| ```text | |
| TTS segment trim start | |
| range: 0.0 - 1.0 seconds | |
| default: 0.0 | |
| workflow node: 268 TrimAudioDuration.start_index | |
| ``` | |
| This controls the trim node between `RunningHub_VoxCPM_Generate` and | |
| `VoiceBridgeAudioListMergerBySRT`. It can skip the first `n` seconds of every | |
| generated TTS segment when a TTS segment begins with noise or an unstable | |
| attack. It changes only node input parameters, not workflow graph connections. | |
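The slider-to-node mapping can be sketched as a small patch function. Node `268` and the `start_index` input come from the description above; the clamping and the function name are assumptions about how the app might guard the value.

```python
import copy

TRIM_NODE_ID = "268"  # TrimAudioDuration, per the slider description


def apply_trim_start(workflow: dict, seconds: float) -> dict:
    """Return a copy with the trim start clamped to the slider range 0.0 to 1.0."""
    clamped = min(max(float(seconds), 0.0), 1.0)
    patched = copy.deepcopy(workflow)
    patched[TRIM_NODE_ID]["inputs"]["start_index"] = clamped
    return patched
```

Clamping on the server side as well as in the slider protects direct API callers from passing a value outside the tested range.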
| User-facing reliability pitfall found on the duplicated personal Space: | |
| - After switching/rebuilding hardware, ComfyUI started successfully but the | |
| MelBand model list was empty. | |
| - `/prompt` failed with: | |
| ```text | |
| MelBandRoFormerModelLoader 96 | |
| model_name: 'MelBandRoFormer_comfy/MelBandRoformer_fp32.safetensors' not in [] | |
| ``` | |
| Root cause: the runtime container had ComfyUI/custom nodes, but required model | |
| files were not present or linked under ComfyUI's model directories yet. Internal | |
| diagnostic usage can tolerate a manual `Prepare` step, but the user-facing | |
| `Translate` path must not require that. | |
| Mitigation added: | |
| - Before running full VoiceGate, the app checks required MelBand, VoxCPM, and | |
| Qwen3-ASR model paths. | |
| - If any are missing, it runs `scripts/bootstrap_comfy.py --with-models` | |
| synchronously and rechecks the paths. | |
| - If models still cannot be prepared, the app returns a clear preparation error | |
| instead of a raw ComfyUI prompt-validation failure. | |
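The check-prepare-recheck sequence above can be sketched as follows. The relative paths mirror the persistent model files listed under Model Storage; the function names and the injected `prepare` callable are hypothetical, standing in for the call to `scripts/bootstrap_comfy.py --with-models`.

```python
from pathlib import Path

# Relative to ComfyUI/models, mirroring the files listed under Model Storage.
REQUIRED_MODEL_PATHS = (
    "voxcpm/VoxCPM2/model.safetensors",
    "voxcpm/VoxCPM2/audiovae.pth",
    "diffusion_models/MelBandRoFormer_comfy/MelBandRoformer_fp32.safetensors",
    "Qwen3-ASR/Qwen3-ASR-1.7B",
    "Qwen3-ASR/Qwen3-ForcedAligner-0.6B",
)


def missing_models(comfy_models: Path) -> list:
    """Required model paths that do not exist (broken symlinks count as missing)."""
    return [rel for rel in REQUIRED_MODEL_PATHS if not (comfy_models / rel).exists()]


def ensure_models(comfy_models: Path, prepare) -> None:
    """Run `prepare` once if anything is missing, then fail with a clear message."""
    if not missing_models(comfy_models):
        return
    prepare()
    still_missing = missing_models(comfy_models)
    if still_missing:
        raise RuntimeError("Model preparation failed; missing: " + ", ".join(still_missing))
```

Since `Path.exists()` follows symlinks, a link into `/data` whose target was never downloaded is reported as missing, which is the situation that produced the empty MelBand model list.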
| ## Remaining Work | |
| Next recommended steps: | |
| 1. Run progressively larger workflows: | |
|    - SRT split and merge | |
|    - full short-audio VoiceGate workflow on the organization ZeroGPU Space after | |
|      quota recovers | |
| 2. Polish the first Gradio user interface and validate the automatic model | |
| preparation path after Space rebuilds/hardware changes. | |
| 3. Reduce runtime dependency installation and model reload overhead. | |