VoiceGate HF Space Work Log

This document records the work completed while preparing the build-small-hackathon/VoiceGate Hugging Face Space, along with the pitfalls encountered and how each was resolved.

Current Snapshot

  • Space: https://huggingface.co/spaces/build-small-hackathon/VoiceGate
  • Space git remote: https://huggingface.co/spaces/build-small-hackathon/VoiceGate
  • Runtime hardware: ZeroGPU / zero-a10g
  • Space SDK: Gradio
  • Local Space wrapper repo: VoiceGate-hf
  • Local upstream reference checkout: VoiceGate/
  • Latest confirmed normal runtime commit: 316b35db739d74d05543d6c8c9dd9c16e0580b17
  • Current expected Space secret: DEEPSEEK_API_KEY
  • Default persistent model root: /data/voicegate_models

Do not commit API keys, model weights, uploaded media, generated outputs, or the local VoiceGate/ upstream checkout.

Executive Summary

The Space is no longer just a blank scaffold. It can now run Gradio, invoke ZeroGPU, prepare a ComfyUI runtime, start ComfyUI from a GPU-backed Gradio function, and submit several segmented ComfyUI workflows.

Confirmed working:

  • Hugging Face Space git push and normal rebuild flow.
  • Dev Mode SSH for CPU/container diagnostics.
  • ZeroGPU invocation from Gradio through @spaces.GPU.
  • ComfyUI startup from inside a @spaces.GPU function.
  • ComfyUI API calls from the Gradio process.
  • DeepSeek-compatible LLM node with the Space secret.
  • MelBand RoFormer smoke tests in CPU mode and ZeroGPU mode.
  • VoxCPM2 TTS-only smoke test in ZeroGPU mode.
  • VoiceBridge ASR-only smoke test in ZeroGPU mode.
  • Persistent model storage for VoxCPM2, MelBand, Qwen3-ASR, and Qwen3 forced aligner under /data.

Not yet confirmed at the start of 2026-06-06:

  • SRT split -> VoxCPM -> SRT merge.
  • Full short-audio VoiceGate workflow.
  • Final user-facing Gradio upload/download UI.

Repository Setup Completed

  • Created and pushed the Space wrapper repository.
  • Kept VoiceGate/ as a local-only upstream reference and ignored it in git.
  • Preserved Hugging Face LFS rules.
  • Copied deployment workflows:
    • workflows/voicegate_api.json
    • workflows/voicegate_ui.json
  • Confirmed the API workflow JSON is valid.
  • Confirmed workflow files contain no committed API key.

Dependency Inventory Completed

Required workflow node providers were identified and pinned:

  • ComfyUI core: comfyanonymous/ComfyUI
  • VoiceBridge: YanTianlong-01/comfyui_voicebridge
  • RunningHub VoxCPM: RH-RunningHub/ComfyUI_RH_VoxCPM
  • MelBand RoFormer: kijai/ComfyUI-MelBandRoFormer
  • RunningHub LLM API: HM-RunningHub/ComfyUI_RH_LLM_API
  • rgthree: rgthree/rgthree-comfy
  • Easy Use: yolain/ComfyUI-Easy-Use
  • Comfyroll: Suzie1/ComfyUI_Comfyroll_CustomNodes
  • MW AudioTools: billwuhao/ComfyUI_AudioTools

Important node source confirmations:

  • ReplaceText is provided by ComfyUI core extra nodes.
  • MergeAudioMW is provided by ComfyUI_AudioTools.
  • RH_LLMAPI_NODE is provided by ComfyUI_RH_LLM_API.

Runtime Bootstrap Added

The following scripts were added:

  • scripts/bootstrap_comfy.py
    • Clones ComfyUI.
    • Checks out pinned commits.
    • Clones required custom node repositories.
    • Installs ComfyUI and custom node Python requirements.
    • Prepares expected model directories.
    • Optionally downloads large model assets with --with-models.
  • scripts/run_comfy.py
    • Starts ComfyUI.
    • Waits for /system_stats.
    • Supports --cpu for SSH diagnostics.
  • scripts/workflow_client.py
    • Loads workflows/voicegate_api.json.
    • Uploads audio through the ComfyUI API.
    • Patches workflow inputs.
    • Submits /prompt.
    • Waits for /history/{prompt_id}.

Workflow patching currently covers:

  • Node 16: uploaded audio filename.
  • Node 105: DEEPSEEK_API_KEY.
  • Node 105: API base URL.
  • Node 105: LLM model name.
  • Node 110: target language.
  • Node 180: job-specific audio output prefix.
  • Node 214: job-specific SRT output prefix.
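
The patching step can be sketched as a small pure function. This is a minimal illustration rather than the contents of scripts/workflow_client.py: it assumes the API workflow is a dict keyed by node ID strings with an `inputs` mapping per node, and the input names used in the example (`audio`, `filename_prefix`) are illustrative assumptions.

```python
import copy


def patch_workflow(workflow: dict, patches: dict) -> dict:
    """Return a copy of an API-format workflow with node inputs overridden.

    `patches` maps node ID -> {input_name: value}. Unknown node IDs raise
    KeyError so a stale patch table fails loudly instead of silently.
    """
    patched = copy.deepcopy(workflow)
    for node_id, inputs in patches.items():
        node = patched[str(node_id)]
        node.setdefault("inputs", {}).update(inputs)
    return patched


# Tiny stand-in graph; the real file is workflows/voicegate_api.json.
# The input names below are hypothetical.
example = {
    "16": {"class_type": "LoadAudio", "inputs": {"audio": "placeholder.mp3"}},
    "180": {"class_type": "SaveAudioMP3", "inputs": {"filename_prefix": "audio/out"}},
}
patched_example = patch_workflow(
    example,
    {16: {"audio": "job123.mp3"}, 180: {"filename_prefix": "audio/job123"}},
)
```

Working on a deep copy keeps the loaded template reusable across jobs, which matters when each job needs its own output prefix.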

Hugging Face Space Runtime Findings

Dev Mode and SSH

SSH target:

build-small-hackathon-voicegate@ssh.hf.space

Local private key:

C:\Users\yantianlong\.ssh\codex_space_voicegate

SSH is only available while the Space is in Dev Mode. Normal running Spaces do not accept SSH and return:

Bad request: SSH in only allowed in Dev mode

Dev Mode can be toggled through the Hugging Face API endpoint:

POST /api/spaces/build-small-hackathon/VoiceGate/dev-mode

Use Dev Mode for diagnostics only. Persistent fixes must be committed locally and pushed.

Dev Mode Stale Commit Pitfall

The running container initially stayed on the original template commit:

a94117f35a42cb17f654ae70cbe619c15345d057

even after newer commits were pushed. restart_space alone did not move it to the latest repository state while Dev Mode was enabled.

Fix:

  • Disable Dev Mode.
  • Use factory_reboot=True or push a new commit to trigger a normal rebuild.
  • Confirm runtime metadata reports the latest commit.

ZeroGPU Startup Requirement

When Dev Mode was disabled, the Space entered RUNTIME_ERROR with:

No @spaces.GPU function detected during startup

Fix:

  • Import spaces.
  • Add at least one @spaces.GPU(duration=...) function in app.py.

Initial placeholder fix:

@spaces.GPU(duration=30)
def placeholder():
    ...

Later this placeholder was replaced by real diagnostic functions:

@spaces.GPU(duration=60)
def gpu_smoke_test():
    ...

@spaces.GPU(duration=900)
def comfy_runtime_test():
    ...

SSH Does Not Expose ZeroGPU CUDA

Starting ComfyUI normally through SSH failed with:

RuntimeError: No CUDA GPUs are available

Conclusion:

  • SSH is useful for CPU-mode diagnostics.
  • Real GPU work must run from the Gradio process inside a @spaces.GPU function.

CPU diagnostic command:

python scripts/run_comfy.py --cpu

Gradio Request Timeout During Bootstrap

Long bootstrap work should not run synchronously inside a Gradio request. The first attempt did this:

Gradio click -> bootstrap_comfy.py -> clone repos -> pip install -> start ComfyUI

The request was interrupted by Gradio/ZeroGPU's outer queue after roughly 2.5 minutes and returned:

event: error
data: {"error": null}

Fix:

  • Add a non-GPU Prepare action that starts scripts/bootstrap_comfy.py as a background process.
  • Add Prepare Status to poll /tmp/voicegate_bootstrap.log.
  • Keep GPU actions focused on starting ComfyUI and running actual CUDA work.

This avoids wasting ZeroGPU time on clone/install steps and prevents the request from being killed before diagnostics can return useful logs.
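
A minimal sketch of the Prepare / Prepare Status split, assuming the bootstrap is an ordinary subprocess whose output goes to the log file named above. The function names are hypothetical and not taken from app.py.

```python
import subprocess
import sys
from pathlib import Path

LOG_PATH = Path("/tmp/voicegate_bootstrap.log")


def start_background(cmd: list[str], log_path: Path = LOG_PATH) -> subprocess.Popen:
    """Start `cmd` without waiting for it, sending output to a log file."""
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    finally:
        log_file.close()  # the child keeps its own copy of the descriptor


def log_tail(log_path: Path = LOG_PATH, max_lines: int = 40) -> str:
    """Return the last `max_lines` lines of the log, or a hint if absent."""
    if not log_path.exists():
        return "(no bootstrap log yet)"
    lines = log_path.read_text(errors="replace").splitlines()
    return "\n".join(lines[-max_lines:])
```

A Prepare handler could then call `start_background([sys.executable, "scripts/bootstrap_comfy.py"])` and return immediately, while Prepare Status returns `log_tail()`.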

Runtime Pip Install Pitfall

The background bootstrap installed a large dependency set and upgraded the on-disk Torch package. The already-running Gradio process continued to report:

torch=2.11.0+cu130

while the ComfyUI subprocess started afterwards reported:

pytorch_version=2.12.0+cu130

This is workable for diagnostics, but final production should avoid heavy runtime pip install where possible. Prefer moving stable dependencies into Space build-time requirements or explicitly controlling pins.

ZeroGPU Duration and Quota Pitfall

The ASR diagnostic was first decorated with:

@spaces.GPU(duration=1800)

The Space rejected it before execution:

ZeroGPU illegal duration
The requested GPU duration is larger than the maximum allowed

After reducing the function to duration=1200, the Space still rejected the call because the quota precheck reported:

You have exceeded your Pro ZeroGPU quota (1800s requested vs. 1389s left)

The working diagnostic used:

@spaces.GPU(duration=900)

For future tests, keep diagnostic durations conservative and increase only when the workflow has already proven it needs more time.
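
When a call is rejected, the requested and remaining seconds can be pulled out of the error text to decide whether a shorter duration is worth retrying. The sketch below assumes the message keeps the "Ns requested vs. Ms left" wording quoted above; that wording is not a documented contract and may change.

```python
import re
from typing import Optional

_QUOTA_RE = re.compile(r"(\d+)s requested vs\. (\d+)s left")


def parse_quota(message: str) -> Optional[tuple[int, int]]:
    """Extract (requested_sec, left_sec) from a quota error, if present."""
    match = _QUOTA_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```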

Dependency Pitfalls and Fixes

ComfyUI_AudioTools initially failed to import.

First failure:

SoX could not be found
ModuleNotFoundError: No module named 'sounddevice'

Second failure after adding sounddevice:

OSError: PortAudio library not found

Third failure:

ModuleNotFoundError: No module named 'easydict'

Fourth failure:

ModuleNotFoundError: No module named 'pytorch_lightning'

Fixes added:

  • packages.txt
    • sox
    • libportaudio2
    • portaudio19-dev
  • requirements.txt
    • sounddevice
    • easydict
    • pytorch-lightning

Final verification:

0.4 seconds: /home/user/app/ComfyUI/custom_nodes/ComfyUI_AudioTools

with no IMPORT FAILED entry.

ComfyUI API Smoke Test

Test audio source:

D:\voicebridge-test-audio\test_audio\2-坤哥.MP3

The first upload attempt used a plain PowerShell byte pipeline and corrupted the binary file. The remote file was identified as text instead of MP3, and LoadAudio failed with:

Invalid data found when processing input: 'avcodec_send_packet()'

Fix:

  • Upload binary test media through a binary-safe method.
  • Verify remote sha256sum before using the file.

Successful upload result:

/tmp/voicegate_test_audio.mp3: Audio file with ID3 version 2.3.0
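
The checksum comparison can also be done from Python on either side of the transfer. A minimal sketch, reading the file strictly in binary mode so no text decoding can alter the bytes:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in binary mode and in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is the same hex digest that sha256sum prints, so the local value can be compared directly with the remote one.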

ComfyUI API endpoints verified in Dev Mode:

  • /system_stats
  • /upload/image
  • /prompt
  • /history/{prompt_id}

Minimal test workflow:

LoadAudio -> SaveAudioMP3

Successful /history/{prompt_id} result:

status_str: success
completed: true

Output reported by ComfyUI:

audio/api_smoke_voicegate_00001.mp3
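
Waiting on /history/{prompt_id} can be sketched as a polling loop. This is an illustration under assumptions: `fetch_history` is a hypothetical callable that returns the decoded JSON body of the history request, and the response is assumed to stay empty for a prompt ID until that prompt has finished.

```python
import time
from typing import Callable, Optional


def wait_for_history(
    fetch_history: Callable[[str], dict],
    prompt_id: str,
    timeout_sec: float = 800.0,
    poll_sec: float = 2.0,
) -> Optional[dict]:
    """Poll until the history response contains the prompt, or time out.

    Returns the history entry for `prompt_id`, or None if the deadline
    passes first.
    """
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        entry = fetch_history(prompt_id).get(prompt_id)
        if entry is not None:
            return entry
        time.sleep(poll_sec)
    return None  # caller decides how to report the timeout
```

Injecting the fetch function keeps the loop independent of the HTTP client and makes the timeout path easy to exercise without a running ComfyUI.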

Segmented Workflow Smoke Tests

ComfyUI From Gradio ZeroGPU

On 2026-06-05, app.py was expanded with diagnostic Gradio actions:

  • prepare_runtime: starts scripts/bootstrap_comfy.py in the background and writes progress to /tmp/voicegate_bootstrap.log.
  • prepare_status: reports the background bootstrap status and log tail.
  • comfy_runtime_test: runs inside @spaces.GPU, starts ComfyUI, and calls /system_stats.
  • melband_gpu_test: runs a tiny MelBand workflow inside @spaces.GPU.
  • voxcpm_tts_gpu_test: runs a tiny VoxCPM2 TTS-only workflow inside @spaces.GPU.

The first attempt ran the full bootstrap synchronously inside a Gradio request and the request was interrupted by the outer queue with event: error and no function payload after roughly 2.5 minutes. The fix was to start bootstrap as a background process and poll a status endpoint.

The background prepare completed successfully. It installed a large dependency set and upgraded the on-disk Torch package from 2.11.0 to 2.12.0. The already-running Gradio process still reported its originally imported torch=2.11.0+cu130, while the newly started ComfyUI subprocess reported:

pytorch_version=2.12.0+cu130

This is acceptable for the smoke test, but runtime pip installs are not ideal for the final app. A later pass should move heavy Python dependencies into the Space build/install phase or pin the root requirements more deliberately.

comfy_runtime_test result:

cuda_available=True
comfy_ready=true
comfy_elapsed_sec=16.0
ComfyUI version=0.24.0
device=cuda:0 NVIDIA RTX PRO 6000 Blackwell Server Edition MIG 2g.48gb
vram_total=50868518912

Observed behavior: separate @spaces.GPU calls may run in separate worker processes, so the ComfyUI subprocess should not be assumed to persist across different button/API calls.
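
One way to cope with this is to make every GPU entry point check for ComfyUI before using it. The sketch below is an assumption about how that could look, not the app's actual code: `is_ready` would typically probe /system_stats and `start` would launch scripts/run_comfy.py, and both are hypothetical callables supplied by the caller.

```python
from typing import Callable


def ensure_comfy(is_ready: Callable[[], bool], start: Callable[[], None]) -> bool:
    """Start ComfyUI only if this worker cannot reach it yet.

    Returns True when a start was needed, False when it was already up.
    """
    if is_ready():
        return False
    start()
    if not is_ready():
        raise RuntimeError("ComfyUI did not become ready after start")
    return True
```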

ZeroGPU Gradio Invocation

On 2026-06-05, the Space was tested in normal runtime, with Dev Mode off, using a Gradio button backed by:

@spaces.GPU(duration=60)
def gpu_smoke_test():
    ...

The private Space API was called with the local Hugging Face token through:

POST /gradio_api/call/gpu_smoke_test
GET /gradio_api/call/gpu_smoke_test/{event_id}

Result:

torch=2.11.0+cu130
cuda_available=True
cuda_device_count=1
device_name=NVIDIA RTX PRO 6000 Blackwell Server Edition MIG 2g.48gb
total_memory_gb=47.38
tensor_result=240.0
memory_reserved_mb=2.00

This confirms ZeroGPU CUDA is available from the normal Gradio runtime when the work is executed inside a @spaces.GPU function. SSH still should be treated as CPU-only diagnostic access.

DeepSeek LLM Node

On 2026-06-05, RH_LLMAPI_NODE was tested through ComfyUI in Dev Mode using the Space DEEPSEEK_API_KEY secret. The key was not printed.

Minimal workflow:

RH_LLMAPI_NODE -> easy showAnything

Prompt:

Translate to Simplified Chinese: VoiceGate smoke test.

Result:

status_str: success
output: VoiceGate 冒烟测试。

This confirms the RunningHub LLM node can read the Space secret and call the DeepSeek-compatible API endpoint.

MelBand RoFormer

On 2026-06-05, MelBandRoFormerModelLoader and MelBandRoFormerSampler were tested through ComfyUI in CPU mode.

Input:

1 second synthetic 440 Hz WAV generated with ffmpeg

Minimal workflow:

LoadAudio -> MelBandRoFormerModelLoader -> MelBandRoFormerSampler
  -> SaveAudioMP3(vocals)
  -> SaveAudioMP3(instruments)

Result:

status_str: success
audio/melband_smoke_vocals_00001.mp3
audio/melband_smoke_instruments_00001.mp3

CPU-mode runtime for the 1 second smoke input was about 51 seconds. Real runs should execute inside a @spaces.GPU function.

Later on 2026-06-05, the same kind of tiny MelBand smoke test was run from the normal Gradio runtime inside @spaces.GPU.

Input:

1 second synthetic 440 Hz WAV written to ComfyUI/input

Result:

status_str=success
completed=True
audio/melband_gpu_32459bea_instruments_00001.mp3
audio/melband_gpu_32459bea_vocals_00001.mp3
elapsed_sec=78.3

This confirms the MelBand custom node and model can execute from the Space ZeroGPU path.

VoxCPM2 TTS-only

On 2026-06-05, a minimal VoxCPM2 TTS-only workflow was run from the normal Gradio runtime inside @spaces.GPU.

Minimal workflow:

RunningHub_VoxCPM_LoadModel -> RunningHub_VoxCPM_Generate -> SaveAudioMP3

Prompt text:

你好,VoiceGate GPU 语音合成测试。

Result:

status_str=success
completed=True
audio/voxcpm_tts_gpu_cda209ec_00001.mp3
elapsed_sec=766.2

This confirms VoxCPM2 fits and executes in ZeroGPU, but the first cold TTS-only run was very slow. The final app should minimize cold starts, avoid repeated ComfyUI/model reloads where possible, and use shorter diagnostic prompts while tuning.

VoiceBridge ASR-only

On 2026-06-06, a minimal VoiceBridge ASR-only workflow was run from the normal Gradio runtime inside @spaces.GPU.

Before running ASR, scripts/bootstrap_comfy.py was extended so Qwen ASR models also live on persistent storage:

/home/user/app/ComfyUI/models/Qwen3-ASR
  -> /data/voicegate_models/Qwen3-ASR

The model preparation downloads:

/data/voicegate_models/Qwen3-ASR/Qwen3-ASR-1.7B
/data/voicegate_models/Qwen3-ASR/Qwen3-ForcedAligner-0.6B

Minimal workflow:

LoadAudio
  -> VoiceBridgeASRLoader(attention=sdpa, forced_aligner=Qwen/Qwen3-ForcedAligner-0.6B)
  -> VoiceBridgeASRTranscribe(return_timestamps=True)
  -> GenerateSRT
  -> easy showAnything

Input:

D:\voicebridge-test-audio\test_audio\2-坤哥.MP3

Result:

status_str=success
completed=True
elapsed_sec=62.4

Returned SRT text:

1
00:00:02,080 --> 00:00:03,200
全民制作人们 大家好

2
00:00:03,439 --> 00:00:06,160
我是练习时长两年半的个人练习生蔡徐坤

3
00:00:06,480 --> 00:00:09,359
喜欢唱、跳、rap、篮球、music

This confirms the Qwen3-ASR model, forced aligner, VoiceBridge ASR nodes, and SRT generation can run in the Space ZeroGPU path. The smoke test intentionally used attention=sdpa instead of flash_attention_2; flash_attention_2 availability remains unverified.

Secrets and API Keys

DEEPSEEK_API_KEY should be stored only as a Hugging Face Space Secret.

Current expected secret:

DEEPSEEK_API_KEY

Optional variables:

DEEPSEEK_BASE_URL=https://api.deepseek.com
DEEPSEEK_MODEL=deepseek-v4-flash

Never store these values in:

  • app.py
  • workflow JSON files
  • README files
  • docs
  • .env files committed to git

scripts/workflow_client.py reads these from environment variables.

scripts/check_space_env.py verifies whether these environment variables are present without printing their values.
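
A presence check of this kind can be sketched as follows. It is a minimal illustration, not the contents of scripts/check_space_env.py; the grouping into required and optional names follows the lists above.

```python
import os

REQUIRED = ("DEEPSEEK_API_KEY",)
OPTIONAL = ("DEEPSEEK_BASE_URL", "DEEPSEEK_MODEL")


def env_report(environ=os.environ) -> dict:
    """Map each variable name to 'set' or 'missing', never to its value."""
    return {
        name: "set" if environ.get(name) else "missing"
        for name in REQUIRED + OPTIONAL
    }
```

Because only the words "set" and "missing" are returned, the report is safe to show in a Gradio textbox or a build log.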

Model Storage

Large model files should live on the Space persistent storage volume instead of inside /home/user/app, because /home/user/app can be replaced during Space rebuilds.

Default model root:

/data/voicegate_models

scripts/bootstrap_comfy.py creates symlinks from ComfyUI's expected paths to that persistent root:

ComfyUI/models/voxcpm/VoxCPM2
  -> /data/voicegate_models/voxcpm/VoxCPM2

ComfyUI/models/diffusion_models/MelBandRoFormer_comfy
  -> /data/voicegate_models/diffusion_models/MelBandRoFormer_comfy

ComfyUI/models/Qwen3-ASR
  -> /data/voicegate_models/Qwen3-ASR

Override the root with:

VOICEGATE_MODEL_ROOT
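
The linking step can be sketched with pathlib. This is an illustration under assumptions, not the code in scripts/bootstrap_comfy.py: the function names are hypothetical, and replacing a stale symlink while refusing to overwrite a real directory is a design choice made for the example.

```python
import os
from pathlib import Path


def model_root() -> Path:
    """Persistent model root, overridable through VOICEGATE_MODEL_ROOT."""
    return Path(os.environ.get("VOICEGATE_MODEL_ROOT", "/data/voicegate_models"))


def link_model_dir(comfy_models: Path, relative: str, root: Path) -> Path:
    """Point ComfyUI's expected model path at the persistent directory."""
    target = root / relative
    target.mkdir(parents=True, exist_ok=True)
    link = comfy_models / relative
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        link.unlink()  # replace a stale link from an earlier container
    elif link.exists():
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.symlink_to(target, target_is_directory=True)
    return link
```

For example, `link_model_dir(Path("ComfyUI/models"), "voxcpm/VoxCPM2", model_root())` would produce the first mapping listed above.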

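On 2026-06-05, the first two explicit ComfyUI-path models (VoxCPM2 and MelBand RoFormer) were downloaded to persistent storage. The Qwen3-ASR entries were added when scripts/bootstrap_comfy.py was extended for ASR. Current contents: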

/data/voicegate_models/voxcpm/VoxCPM2/model.safetensors
/data/voicegate_models/voxcpm/VoxCPM2/audiovae.pth
/data/voicegate_models/diffusion_models/MelBandRoFormer_comfy/MelBandRoformer_fp32.safetensors
/data/voicegate_models/Qwen3-ASR/Qwen3-ASR-1.7B
/data/voicegate_models/Qwen3-ASR/Qwen3-ForcedAligner-0.6B

Verified symlinks:

/home/user/app/ComfyUI/models/voxcpm/VoxCPM2
  -> /data/voicegate_models/voxcpm/VoxCPM2

/home/user/app/ComfyUI/models/diffusion_models/MelBandRoFormer_comfy
  -> /data/voicegate_models/diffusion_models/MelBandRoFormer_comfy

/home/user/app/ComfyUI/models/Qwen3-ASR
  -> /data/voicegate_models/Qwen3-ASR

DEEPSEEK_API_KEY was also verified as present in the Space environment without printing its value.

Model download pitfall:

  • huggingface-cli download is deprecated and failed in the Space.
  • hf download also failed because of a CLI dependency compatibility issue.
  • scripts/bootstrap_comfy.py now uses the huggingface_hub Python API directly for model downloads.
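
Downloading through the Python API can be sketched with `snapshot_download` from huggingface_hub. The repo IDs below are assumptions for illustration (only the forced aligner ID appears in the ASR loader settings recorded in this log), and the helper names are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping of local directory name -> Hub repo ID.
ASR_REPOS = {
    "Qwen3-ASR-1.7B": "Qwen/Qwen3-ASR-1.7B",
    "Qwen3-ForcedAligner-0.6B": "Qwen/Qwen3-ForcedAligner-0.6B",
}


def download_plan(root: Path) -> list[tuple[str, Path]]:
    """Pair each repo ID with its destination under the persistent root."""
    return [(repo, root / "Qwen3-ASR" / name) for name, repo in ASR_REPOS.items()]


def download_models(root: Path) -> None:
    """Fetch each repo snapshot into its destination directory."""
    # Imported lazily so the plan can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(root):
        local_dir.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Calling the library directly avoids depending on whichever CLI entry point happens to be installed in the Space image.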

Current Known Good Commits

  • 683b147 Add ComfyUI runtime bootstrap scripts
  • 520334e Record Space SSH runtime findings
  • 223ef10 Add ZeroGPU placeholder hook
  • 5dac213 Add missing AudioTools dependencies
  • d849d03 Record ComfyUI API smoke test
  • 79b8b37 Add GPU smoke test button
  • 6e4cd3f Run Space preparation in background
  • b39ef30 Add ASR diagnostic workflow and deployment guide
  • 316b35d Reduce ASR ZeroGPU duration
  • 90f8205 Reduce full workflow smoke test runtime
  • b8ca809 Initialize matplotlib backend for Gradio

Full Workflow Status

On 2026-06-06, the copied Space workflows were checked against the upstream VoiceGate workflows:

workflows/voicegate_api.json == VoiceGate/workflows/VoiceGate-Workflow_api.json
workflows/voicegate_ui.json == VoiceGate/workflows/VoiceGate-Workflow.json
checked_connections=31
mismatches=0

The connection validator resolves ComfyUI UI-only SetNode / GetNode helper pairs before comparing the API workflow to the UI workflow. With that resolution, the API graph used by the Space follows VoiceGate/workflows/VoiceGate-Workflow.json.
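
The API side of such a comparison can be sketched as edge extraction. This is a simplified illustration, not the validator itself: it handles only API-format graphs, where a linked input is a two-item list of source node ID and output index, and it does not parse the UI format or resolve SetNode / GetNode pairs. A widget value that happens to be a string-and-integer pair would be misread as a link.

```python
def api_connections(workflow: dict) -> set:
    """Collect (source_id, source_slot, target_id, input_name) edges.

    In the ComfyUI API format a linked input is a two-item list of
    [source_node_id, output_index]; plain values are widget settings.
    """
    edges = set()
    for target_id, node in workflow.items():
        for input_name, value in node.get("inputs", {}).items():
            is_link = (
                isinstance(value, list)
                and len(value) == 2
                and isinstance(value[0], str)
                and isinstance(value[1], int)
            )
            if is_link:
                edges.add((value[0], value[1], str(target_id), input_name))
    return edges


def connection_mismatches(left: dict, right: dict) -> set:
    """Edges present in exactly one of the two graphs."""
    return api_connections(left) ^ api_connections(right)
```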

Full workflow runtime attempts:

  • Initial full workflow submission failed at prompt validation because VoiceBridgeASRLoader required source; scripts/workflow_client.py now patches node 31 with source=HuggingFace.
  • The next run failed because node 31 requested flash_attention_2; the Space runtime did not have flash_attn, so the workflow patch now uses attention=sdpa.
  • A later run submitted prompt 16b45231-c2e3-4ded-aa38-ac6a3b6813d8 and reached the heavy nodes, including MelBand, VoxCPM, ASR, and forced aligner loading, but exceeded the app-side 800s wait window before /history returned completion.
  • To reduce smoke-test runtime without changing graph connections, commit 90f8205 patches ASR max_new_tokens=256 and VoxCPM inference_steps=4.
  • The next retry did not start execution because Hugging Face ZeroGPU quota was exhausted before scheduling:
ZeroGPU quota exceeded
1200s requested vs. 407s left
try again in about 17:22:12

Current conclusion: workflow connection fidelity is verified, and individual GPU smoke tests for CUDA, ComfyUI, MelBand, VoxCPM TTS, and ASR have passed. The remaining full end-to-end verification is blocked by available ZeroGPU quota/runtime, not by a known workflow connection mismatch.

Follow-up result on the duplicated personal Space:

  • Space: YanTianlong/VoiceGate-personal
  • Hardware: Nvidia T4 Small
  • Input: short 2-坤哥.MP3 test audio
  • Target language: English
  • Result: success
  • Output: audio/voicegate_full_23499a26_00001.mp3
  • Total elapsed time reported by Gradio: 24.4s
  • ComfyUI websocket elapsed time: 23.9s

Top measured node timings:

206 RunningHub_VoxCPM_Generate: 8.2s
99 MelBandRoFormerSampler: 6.6s
105 RH_LLMAPI_NODE: 4.5s
33 VoiceBridgeASRTranscribe: 2.2s
45 VoiceBridgeASRTranscribe: 1.9s
180 SaveAudioMP3: 0.4s

This confirms the full VoiceGate workflow can run end-to-end on a warm personal T4 Small Space for very short audio. Longer audio remains untested on T4 Small and may still hit runtime or memory limits.

Gradio Interface Status

The first user-facing Gradio interface now has two tabs:

  • Translate: simple user flow with audio upload, target language dropdown, generated translated dubbing audio output, original/translated subtitle text, downloadable .srt subtitle files, and status.
  • Diagnostics: retained internal test controls for Prepare, GPU, ComfyUI, MelBand, VoxCPM TTS, ASR, and full workflow timing.

Supported target-language dropdown values:

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German,
Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay,
Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog,
Thai, Turkish, Vietnamese

The Translate tab also exposes an advanced cleanup slider:

TTS segment trim start
range: 0.0 - 1.0 seconds
default: 0.0
workflow node: 268 TrimAudioDuration.start_index

This controls the trim node between RunningHub_VoxCPM_Generate and VoiceBridgeAudioListMergerBySRT. It can skip the first n seconds of every generated TTS segment when a TTS segment begins with noise or an unstable attack. It changes only node input parameters, not workflow graph connections.
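
Turning the slider value into a node-input patch can be sketched as below. The node ID and the `start_index` input name come from the settings listed above; the clamping and the function name are assumptions added for the example.

```python
def trim_patch(seconds: float, low: float = 0.0, high: float = 1.0) -> dict:
    """Clamp the slider value and express it as a node-input patch."""
    clamped = min(max(float(seconds), low), high)
    # Node 268 is the TrimAudioDuration node; only its input value changes.
    return {"268": {"start_index": clamped}}
```

Clamping on the server side guards against API callers that bypass the slider's range.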

User-facing reliability pitfall found on the duplicated personal Space:

  • After switching/rebuilding hardware, ComfyUI started successfully but the MelBand model list was empty.
  • /prompt failed with:
MelBandRoFormerModelLoader 96
model_name: 'MelBandRoFormer_comfy/MelBandRoformer_fp32.safetensors' not in []

Root cause: the runtime container had ComfyUI/custom nodes, but required model files were not present or linked under ComfyUI's model directories yet. Internal diagnostic usage can tolerate a manual Prepare step, but the user-facing Translate path must not require that.

Mitigation added:

  • Before running full VoiceGate, the app checks required MelBand, VoxCPM, and Qwen3-ASR model paths.
  • If any are missing, it runs scripts/bootstrap_comfy.py --with-models synchronously and rechecks the paths.
  • If models still cannot be prepared, the app returns a clear preparation error instead of a raw ComfyUI prompt-validation failure.
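
The check, bootstrap, recheck sequence can be sketched as follows. This is a minimal illustration, not the app's code: `bootstrap` stands in for running scripts/bootstrap_comfy.py --with-models, and the exception class and function names are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable


class ModelPreparationError(RuntimeError):
    """Raised when required model paths are still missing after bootstrap."""


def missing_paths(paths: Iterable[Path]) -> list[Path]:
    """Return the subset of `paths` that does not exist on disk."""
    return [path for path in paths if not path.exists()]


def ensure_models(paths: Iterable[Path], bootstrap: Callable[[], None]) -> bool:
    """Check model paths, bootstrap once if needed, then recheck.

    Returns True if a bootstrap run was required.
    """
    required = list(paths)
    if not missing_paths(required):
        return False
    bootstrap()
    still_missing = missing_paths(required)
    if still_missing:
        names = ", ".join(str(path) for path in still_missing)
        raise ModelPreparationError(f"Model preparation failed; missing: {names}")
    return True
```

Raising a dedicated exception lets the Translate handler show a preparation message instead of passing the workflow on to /prompt.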

Remaining Work

Next recommended steps:

  1. Run progressively larger workflows:
    • SRT split and merge
    • full short-audio VoiceGate workflow on the organization ZeroGPU Space after quota recovers
  2. Polish the first Gradio user interface and validate the automatic model preparation path after Space rebuilds/hardware changes.
  3. Reduce runtime dependency installation and model reload overhead.