# VoiceGate HF Space Deployment Plan

## Goal

Deploy VoiceGate to a Hugging Face Space with a Gradio interface, using ZeroGPU
for the GPU-heavy inference path.

The initial target is a short-audio workflow that proves the full chain:
audio input -> source separation -> ASR/SRT -> LLM translation -> VoxCPM TTS ->
SRT-aligned audio merge -> audio and subtitle outputs.

## Repository Roles

Use three clear ownership boundaries:

- `VoiceGate`: upstream project assets, README, diagrams, and source workflows.
- `comfyui_voicebridge`: the VoiceBridge ComfyUI custom node repository.
- `VoiceGate-hf`: this repository, the Hugging Face Space deployment wrapper.

The Space repository should not depend on nested git repositories at runtime.
For deployment, copy or vendor only the required workflow files, custom nodes,
bootstrap scripts, and Gradio application code into the Space layout.

Current local state:

- The outer `VoiceGate-hf` repository is connected to the Hugging Face Space
  remote `build-small-hackathon/VoiceGate`.
- `VoiceGate/` is present as a local upstream checkout only. It is ignored by
  the Space repository and must not be treated as runtime content.
- `VoiceGate/.gitmodules` references `comfyui_voicebridge`, but the local
  `VoiceGate/comfyui_voicebridge/` directory is currently empty.
- `VoiceGate/workflows/VoiceGate-Workflow.json` is the UI workflow.
- `VoiceGate/workflows/VoiceGate-Workflow_api.json` exists and has been
  confirmed as valid JSON. It still needs parameterization before Gradio can
  submit it to ComfyUI.
- `workflows/voicegate_api.json` is the deployment copy of the API workflow.
- `workflows/voicegate_ui.json` is the deployment reference copy of the UI
  workflow.

## Repository Hygiene

The Space repository should stay small and deterministic:

- Keep `VoiceGate/` as a local-only upstream checkout.
- Copy deployment-ready workflow files into `workflows/`.
- Copy or install custom nodes through an explicit bootstrap step.
- Do not commit nested `.git` directories, model weights, API keys, uploaded
  media, generated audio, generated subtitles, or ComfyUI runtime caches.
- Keep `.gitattributes` LFS rules for future model or binary assets, but prefer
  downloading model files at runtime instead of committing them.
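
The rules above could be enforced with a small check over the tracked file list. This is a minimal sketch, not an existing script in this repository: the deny-lists are illustrative assumptions and would need tuning (for example, if sample media is ever committed under `assets/`).

```python
from pathlib import PurePosixPath

# Hypothetical deny-lists derived from the hygiene rules; adjust as needed.
FORBIDDEN_SUFFIXES = {".safetensors", ".ckpt", ".pt", ".bin", ".mp3", ".wav", ".srt"}
FORBIDDEN_DIRS = {".git", "VoiceGate"}


def find_violations(paths):
    """Return the repo-relative paths that break the hygiene rules."""
    violations = []
    for raw in paths:
        path = PurePosixPath(raw)
        # parts[:-1] are the directories that contain the file.
        in_forbidden_dir = any(part in FORBIDDEN_DIRS for part in path.parts[:-1])
        if in_forbidden_dir or path.suffix.lower() in FORBIDDEN_SUFFIXES:
            violations.append(raw)
    return violations
```

One way to use it is to feed it the output of `git ls-files` before pushing and fail if the returned list is non-empty.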

## Hugging Face Space Constraints

ZeroGPU is currently available only for Gradio SDK Spaces. The Space should
expose a standard `app.py`, and GPU-heavy functions should be decorated with
`@spaces.GPU`.

This means the first implementation should prefer:

- Gradio Space root files: `README.md`, `app.py`, `requirements.txt`,
  `packages.txt`.
- A Python bootstrap that installs or prepares ComfyUI and custom nodes.
- A workflow client that calls the local ComfyUI API from inside the Gradio
  handler.

Do not start with a Docker Space, because ZeroGPU targets the Gradio SDK, even
though Docker would be a cleaner fit for a long-running ComfyUI service.

## Proposed Space Layout

```text
VoiceGate-hf/
|-- README.md
|-- app.py
|-- requirements.txt
|-- packages.txt
|-- scripts/
|   |-- bootstrap_comfy.py
|   |-- run_comfy.py
|   `-- workflow_client.py
|-- workflows/
|   |-- voicegate_api.json
|   `-- voicegate_ui.json
|-- custom_nodes/
|   `-- comfyui_voicebridge/
|-- assets/
`-- docs/
    `-- deployment-plan.md
```

The current repository has the root scaffold, planning docs, and deployment
workflow copies. Later steps should add bootstrap scripts and either copy
deployment-ready custom nodes into `custom_nodes/` or install pinned node
repositories during Space startup.

## Known Workflow Nodes

The API workflow references these important node classes:

- `LoadAudio`
- `MelBandRoFormerModelLoader`
- `MelBandRoFormerSampler`
- `VoiceBridgeASRLoader`
- `VoiceBridgeASRTranscribe`
- `GenerateSRT`
- `RH_LLMAPI_NODE`
- `VoiceBridgeSRTSplitter`
- `RunningHub_VoxCPM_LoadModel`
- `RunningHub_VoxCPM_Generate`
- `VoiceBridgeAudioListMergerBySRT`
- `MergeAudioMW`
- `SaveAudioMP3`
- `SaveSRTFromString`
- `TrimAudioDuration`
- `Any Switch (rgthree)`
- `easy showAnything`
- `easy string`
- `CR Text`
- `ReplaceText`

This implies dependencies on VoiceBridge, VoxCPM/RunningHub nodes,
MelBandRoFormer nodes/models, rgthree, easy-use, and the LLM API node package.
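
The node list can be regenerated from the workflow file rather than maintained by hand. The sketch below assumes the API-format layout, where each top-level key is a node id and each value carries a `class_type`; the helper names are hypothetical.

```python
from collections import Counter


def count_node_classes(workflow):
    """Count `class_type` values in an API-format ComfyUI workflow dict."""
    return Counter(
        node["class_type"]
        for node in workflow.values()
        if isinstance(node, dict) and "class_type" in node
    )


def missing_classes(workflow, available):
    """Return the sorted class names the workflow needs but `available` lacks."""
    return sorted(set(count_node_classes(workflow)) - set(available))
```

Passing the keys of ComfyUI's `/object_info` response as `available` would show which custom node packages failed to load, assuming that endpoint is reachable in the Space.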

## Model and Secret Inventory

Expected model assets:

- `Qwen/Qwen3-ASR-1.7B`
- `Qwen/Qwen3-ForcedAligner-0.6B`
- `VoxCPM2`
- `MelBandRoFormer_comfy/MelBandRoformer_fp32.safetensors`

Expected Space secrets:

- `HF_TOKEN`, if private or gated model downloads are needed.
- `DEEPSEEK_API_KEY` or another LLM provider key.
- Optional LLM base URL and model name configuration.

Do not commit model weights, API keys, generated audio, or generated subtitles.
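
A minimal sketch of reading these secrets at runtime follows. `DEEPSEEK_API_KEY` comes from the list above; `LLM_BASE_URL` and `LLM_MODEL` are hypothetical variable names for the optional settings, and the dataclass is illustrative only.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    base_url: Optional[str]
    model: Optional[str]


def load_llm_settings(env=None):
    """Read LLM settings from the environment without logging their values."""
    env = os.environ if env is None else env
    api_key = env.get("DEEPSEEK_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set; add it as a Space secret.")
    return LLMSettings(
        api_key=api_key,
        # Hypothetical names for the optional base URL and model settings.
        base_url=env.get("LLM_BASE_URL") or None,
        model=env.get("LLM_MODEL") or None,
    )
```

Failing early with a clear message keeps a missing secret from surfacing later as an opaque error inside the workflow.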

## Implementation Phases

### Phase 1: Scaffold and Repository Hygiene

Done:

- Add HF Space root files.
- Add minimal Gradio placeholder.
- Add deployment plan.
- Add ignore rules for runtime and generated artifacts.
- Add a TODO checklist.
- Copy the API workflow to `workflows/voicegate_api.json`.
- Copy the UI workflow to `workflows/voicegate_ui.json`.
- Confirm the API workflow is valid JSON.
- Confirm the workflow files do not contain real API keys.

### Phase 2: Dependency Inventory

Done:

- Identify the ComfyUI and custom node repositories needed by the API workflow.
- Pin the current candidate commits in `docs/dependency-inventory.md`.
- Identify initial Python, system package, model, and secret requirements.
- Decide to install custom nodes from pinned git URLs during bootstrap instead
  of vendoring them into this Space repo.
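
Installing from pinned git URLs might look like the sketch below. The entry in `PINNED_NODES` is a placeholder, not a real dependency; the actual URLs and commits are the ones recorded in `docs/dependency-inventory.md`. The function names are hypothetical and need not match `scripts/bootstrap_comfy.py`.

```python
import subprocess
from pathlib import Path

# Placeholder entry only; real URLs and commits live in the dependency inventory.
PINNED_NODES = [
    {"name": "example_node", "url": "https://example.com/example_node.git", "commit": "0123abc"},
]


def clone_commands(node, custom_nodes_dir):
    """Build the git commands that check out one node repo at its pinned commit."""
    target = str(Path(custom_nodes_dir) / node["name"])
    return [
        ["git", "clone", "--no-checkout", node["url"], target],
        ["git", "-C", target, "checkout", "--detach", node["commit"]],
    ]


def install_pinned_nodes(nodes, custom_nodes_dir, run=subprocess.run):
    """Clone each node that is not already present; `run` is injectable for dry runs."""
    for node in nodes:
        if (Path(custom_nodes_dir) / node["name"]).exists():
            continue  # already cloned on a previous start
        for cmd in clone_commands(node, custom_nodes_dir):
            run(cmd, check=True)
```

Checking out an exact commit rather than a branch keeps Space rebuilds reproducible even when an upstream node repository moves on.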

### Phase 3: Runtime Bootstrap

Create scripts that can:

- Clone or install ComfyUI.
- Install Python dependencies.
- Install required custom nodes at pinned commits.
- Download or locate required model files.
- Start ComfyUI locally inside the Space process.

Current script status:

- `scripts/bootstrap_comfy.py` clones ComfyUI and all pinned custom node
  repositories, installs their requirements, prepares model directories, and
  can optionally download the VoxCPM2 and MelBandRoFormer assets.
- `scripts/run_comfy.py` starts ComfyUI and waits for `/system_stats`.
- `scripts/workflow_client.py` uploads audio, patches the VoiceGate API
  workflow, submits it through `/prompt`, and waits on `/history/{prompt_id}`.
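
The readiness check and the submit-then-poll flow described above can be sketched with the standard library alone. This is not the code in `scripts/run_comfy.py` or `scripts/workflow_client.py`; `BASE_URL` assumes ComfyUI's default port, and the function names and timeouts are illustrative.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:8188"  # assumed: ComfyUI's default local address


def http_json(url, payload=None):
    """GET `url`, or POST `payload` as JSON, and decode the JSON response."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data else {}
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_ready(fetch=http_json, timeout=120.0, interval=1.0):
    """Poll /system_stats until ComfyUI answers or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            fetch(f"{BASE_URL}/system_stats")
            return True
        except OSError:  # connection refused while ComfyUI is still starting
            time.sleep(interval)
    return False


def run_workflow(workflow, fetch=http_json, timeout=600.0, interval=1.0):
    """Submit an API-format workflow and block until it appears in the history."""
    prompt_id = fetch(f"{BASE_URL}/prompt", {"prompt": workflow})["prompt_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        history = fetch(f"{BASE_URL}/history/{prompt_id}")
        if prompt_id in history:
            return history[prompt_id]
        time.sleep(interval)
    raise TimeoutError(f"workflow {prompt_id} did not finish within {timeout}s")
```

The `fetch` parameter exists so the polling logic can be exercised with a fake transport before a real ComfyUI process is available.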

Remaining runtime bootstrap work:

- Wire bootstrap/startup behavior into `app.py`.
- Validate the bootstrap and ComfyUI startup in the actual Space container.
- Confirm the upload endpoint used by `LoadAudio` accepts the audio files we
  send from Gradio.

### Phase 4: Workflow Parameterization

Parameterize `workflows/voicegate_api.json` before submitting it to ComfyUI.

Required edits:

- Patch hard-coded audio filenames with Gradio-uploaded input files.
- Patch API keys from environment variables.
- Patch target language, LLM model, and provider base URL.
- Ensure output nodes produce deterministic job-specific file paths.

These edits are implemented in `scripts/workflow_client.py`, but they still
need to be connected to the Gradio UI and verified against a running ComfyUI
process.
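
As an illustration of the patching step, the sketch below rewrites node inputs by `class_type` on a copy of the template. It is not the implementation in `scripts/workflow_client.py`. The input names used in the example (`audio`, `filename_prefix`) are assumptions that should be checked against the real workflow JSON.

```python
import copy


def patch_workflow(workflow, overrides):
    """Return a deep copy of `workflow` with inputs replaced per node class.

    `overrides` maps a class_type to {input_name: new_value}. A class that is
    missing from the workflow raises KeyError so template drift fails loudly.
    """
    patched = copy.deepcopy(workflow)
    unseen = set(overrides)
    for node in patched.values():
        class_type = node.get("class_type")
        if class_type in overrides:
            node.setdefault("inputs", {}).update(overrides[class_type])
            unseen.discard(class_type)
    if unseen:
        raise KeyError(f"workflow has no nodes of class: {sorted(unseen)}")
    return patched


# Tiny stand-in template; the real one is workflows/voicegate_api.json.
template = {
    "1": {"class_type": "LoadAudio", "inputs": {"audio": "sample.mp3"}},
    "2": {"class_type": "SaveAudioMP3", "inputs": {"filename_prefix": "audio/out"}},
}
job = patch_workflow(template, {
    "LoadAudio": {"audio": "upload_1234.wav"},
    "SaveAudioMP3": {"filename_prefix": "voicegate/job-1234"},
})
```

Patching by class is coarse: if the workflow contains several nodes of one class that need different values, patching by node id would be the safer choice.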

### Phase 5: Gradio Integration

Build the first real interface:

- Input audio file.
- Target language selector/text input.
- Source language, default `auto`.
- Optional prompt override.
- Output audio.
- Output translated/adjusted SRT.
- Runtime log.

Wrap the end-to-end function with `@spaces.GPU(duration=...)` and start with a
short maximum input duration.
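
One possible shape for the handler is sketched below. It checks the input length before any GPU time is requested, then calls the decorated function. The limits, the function names, and the WAV-only duration check are assumptions for illustration. The fallback decorator only exists so the module imports outside a Space.

```python
import wave

try:
    import spaces  # available on Hugging Face ZeroGPU Spaces
    gpu = spaces.GPU
except ImportError:
    def gpu(func=None, *, duration=None):
        """No-op stand-in used when the `spaces` package is not installed."""
        if func is None:
            return lambda inner: inner
        return func

MAX_INPUT_SECONDS = 30.0  # hypothetical starting limit
GPU_SECONDS = 120  # hypothetical GPU budget per call


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as handle:
        return handle.getnframes() / float(handle.getframerate())


@gpu(duration=GPU_SECONDS)
def run_pipeline(path, target_language):
    """Placeholder for the GPU-heavy call into the workflow client."""
    return {"input": path, "target_language": target_language}


def handle_request(path, target_language):
    """Reject over-long input on CPU, then hand off to the GPU function."""
    seconds = wav_duration_seconds(path)
    if seconds > MAX_INPUT_SECONDS:
        raise ValueError(f"input is {seconds:.1f}s; limit is {MAX_INPUT_SECONDS:.0f}s")
    result = run_pipeline(path, target_language)
    result["seconds"] = seconds
    return result
```

Keeping the length check outside the decorated function means a rejected upload never consumes GPU quota. Non-WAV uploads would need a different duration probe.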

### Phase 6: Verification

Verify in this order:

1. ComfyUI starts and exposes its local API.
2. TTS-only minimal workflow runs.
3. ASR-only short audio workflow runs.
4. SRT splitter + VoxCPM + merger runs.
5. Full VoiceGate short-audio workflow runs.
6. Video input support is added and verified only after the audio path is stable.
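
The ordering above can be made executable with a small runner that stops at the first failing stage, so later stages are never blamed for an earlier fault. This is a sketch with hypothetical names; each `check` would be a callable that raises on failure.

```python
def run_stages(stages):
    """Run (name, check) pairs in order and stop at the first failing stage."""
    results = []
    for name, check in stages:
        try:
            check()
        except Exception as error:  # record the failure instead of crashing
            results.append((name, f"failed: {error}"))
            break
        results.append((name, "ok"))
    return results
```

Stages that were never reached are simply absent from the result, which makes the first broken link in the chain easy to spot.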

## Immediate Next Step

Continue Phase 3 by wiring bootstrap/startup behavior into `app.py`, then test
the scripts inside the running Hugging Face Space container.