OnyxlMunkey and Cursor committed
Commit e961681 · Parent(s): e391f8d

Add ACE-Step 1.5 Docker app

Co-authored-by: Cursor <cursoragent@cursor.com>

(This view is limited to 50 files because the commit contains too many changes.)

Files changed (50)
  1. .claude/skills/acestep-docs/SKILL.md +60 -0
  2. .claude/skills/acestep-docs/api/API.md +746 -0
  3. .claude/skills/acestep-docs/api/Openrouter_API.md +517 -0
  4. .claude/skills/acestep-docs/getting-started/ABOUT.md +87 -0
  5. .claude/skills/acestep-docs/getting-started/README.md +232 -0
  6. .claude/skills/acestep-docs/getting-started/Tutorial.md +964 -0
  7. .claude/skills/acestep-docs/guides/ENVIRONMENT_SETUP.md +542 -0
  8. .claude/skills/acestep-docs/guides/GPU_COMPATIBILITY.md +134 -0
  9. .claude/skills/acestep-docs/guides/GRADIO_GUIDE.md +549 -0
  10. .claude/skills/acestep-docs/guides/INFERENCE.md +1191 -0
  11. .claude/skills/acestep-docs/guides/SCRIPT_CONFIGURATION.md +615 -0
  12. .claude/skills/acestep-docs/guides/UPDATE_AND_BACKUP.md +496 -0
  13. .claude/skills/acestep-lyrics-transcription/SKILL.md +173 -0
  14. .claude/skills/acestep-lyrics-transcription/scripts/acestep-lyrics-transcription.sh +584 -0
  15. .claude/skills/acestep-lyrics-transcription/scripts/config.example.json +14 -0
  16. .claude/skills/acestep-simplemv/SKILL.md +133 -0
  17. .claude/skills/acestep-simplemv/scripts/package-lock.json +0 -0
  18. .claude/skills/acestep-simplemv/scripts/package.json +27 -0
  19. .claude/skills/acestep-simplemv/scripts/remotion.config.ts +4 -0
  20. .claude/skills/acestep-simplemv/scripts/render-mv.sh +123 -0
  21. .claude/skills/acestep-simplemv/scripts/render.mjs +345 -0
  22. .claude/skills/acestep-simplemv/scripts/render.sh +12 -0
  23. .claude/skills/acestep-simplemv/scripts/src/AudioVisualization.tsx +314 -0
  24. .claude/skills/acestep-simplemv/scripts/src/Root.tsx +31 -0
  25. .claude/skills/acestep-simplemv/scripts/src/index.ts +4 -0
  26. .claude/skills/acestep-simplemv/scripts/src/parseLrc.ts +40 -0
  27. .claude/skills/acestep-simplemv/scripts/src/types.ts +32 -0
  28. .claude/skills/acestep-simplemv/scripts/tsconfig.json +18 -0
  29. .claude/skills/acestep-songwriting/SKILL.md +194 -0
  30. .claude/skills/acestep/SKILL.md +253 -0
  31. .claude/skills/acestep/api-reference.md +149 -0
  32. .claude/skills/acestep/scripts/acestep.sh +1093 -0
  33. .claude/skills/acestep/scripts/config.example.json +14 -0
  34. .dockerignore +42 -0
  35. .editorconfig +16 -0
  36. .env.example +78 -0
  37. .github/ISSUE_TEMPLATE/bug_report.md +38 -0
  38. .github/ISSUE_TEMPLATE/feature_request.md +20 -0
  39. .github/copilot-instructions.md +67 -0
  40. .github/workflows/codeql.yml +99 -0
  41. .gitignore +250 -0
  42. AGENTS.md +96 -0
  43. CONTRIBUTING.md +175 -0
  44. Dockerfile +28 -0
  45. README.md +9 -278
  46. SECURITY.md +27 -0
  47. app.py +18 -13
  48. check_update.bat +609 -0
  49. check_update.sh +330 -0
  50. cli.py +1998 -0
.claude/skills/acestep-docs/SKILL.md ADDED
---
name: acestep-docs
description: ACE-Step documentation and troubleshooting. Use when users ask about installing ACE-Step, GPU configuration, model download, Gradio UI usage, API integration, or troubleshooting issues like VRAM problems, CUDA errors, or model loading failures.
allowed-tools: Read, Glob, Grep
---

# ACE-Step Documentation

Documentation skill for the ACE-Step music generation system.

## Quick Reference

### Getting Started
| Document | Description |
|----------|-------------|
| [README.md](getting-started/README.md) | Installation, model download, startup commands |
| [Tutorial.md](getting-started/Tutorial.md) | Getting-started tutorial, best practices |
| [ABOUT.md](getting-started/ABOUT.md) | Project overview, architecture, model zoo |

### Guides
| Document | Description |
|----------|-------------|
| [GRADIO_GUIDE.md](guides/GRADIO_GUIDE.md) | Web UI usage guide |
| [INFERENCE.md](guides/INFERENCE.md) | Inference parameter tuning |
| [GPU_COMPATIBILITY.md](guides/GPU_COMPATIBILITY.md) | GPU/VRAM configuration, hardware recommendations |
| [ENVIRONMENT_SETUP.md](guides/ENVIRONMENT_SETUP.md) | Environment detection, uv installation, python_embeded setup (Windows/Linux/macOS) |
| [SCRIPT_CONFIGURATION.md](guides/SCRIPT_CONFIGURATION.md) | Configuring launch scripts: .bat (Windows) and .sh (Linux/macOS) |
| [UPDATE_AND_BACKUP.md](guides/UPDATE_AND_BACKUP.md) | Git updates, file backup, conflict resolution (all platforms) |

### API (for developers)
| Document | Description |
|----------|-------------|
| [API.md](api/API.md) | REST API documentation |
| [Openrouter_API.md](api/Openrouter_API.md) | OpenRouter API integration |

## Instructions

1. Installation questions → read [getting-started/README.md](getting-started/README.md)
2. General usage / best practices → read [getting-started/Tutorial.md](getting-started/Tutorial.md)
3. Project overview / architecture → read [getting-started/ABOUT.md](getting-started/ABOUT.md)
4. Web UI questions → read [guides/GRADIO_GUIDE.md](guides/GRADIO_GUIDE.md)
5. Inference parameter tuning → read [guides/INFERENCE.md](guides/INFERENCE.md)
6. GPU/VRAM issues → read [guides/GPU_COMPATIBILITY.md](guides/GPU_COMPATIBILITY.md)
7. Environment setup (uv, python_embeded) → read [guides/ENVIRONMENT_SETUP.md](guides/ENVIRONMENT_SETUP.md)
8. Launch script configuration (.bat/.sh) → read [guides/SCRIPT_CONFIGURATION.md](guides/SCRIPT_CONFIGURATION.md)
9. Updates and backup → read [guides/UPDATE_AND_BACKUP.md](guides/UPDATE_AND_BACKUP.md)
10. API development → read [api/API.md](api/API.md) or [api/Openrouter_API.md](api/Openrouter_API.md)

## Common Issues

- **Installation problems**: See getting-started/README.md
- **Insufficient VRAM**: See guides/GPU_COMPATIBILITY.md
- **Model download failures**: See getting-started/README.md or guides/SCRIPT_CONFIGURATION.md
- **uv not found**: See guides/ENVIRONMENT_SETUP.md
- **Environment detection issues**: See guides/ENVIRONMENT_SETUP.md
- **BAT/SH script configuration**: See guides/SCRIPT_CONFIGURATION.md
- **Updates and backups**: See guides/UPDATE_AND_BACKUP.md
- **Update conflicts**: See guides/UPDATE_AND_BACKUP.md
- **Inference quality issues**: See guides/INFERENCE.md
- **Gradio UI not starting**: See guides/GRADIO_GUIDE.md
.claude/skills/acestep-docs/api/API.md ADDED
# ACE-Step API Client Documentation

---

This service provides an HTTP-based asynchronous music generation API.

**Basic Workflow**:
1. Call `POST /release_task` to submit a task and obtain a `task_id`.
2. Call `POST /query_result` to batch query task status until `status` is `1` (succeeded) or `2` (failed).
3. Download audio files via the `GET /v1/audio?path=...` URLs returned in the result.

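The three steps above can be sketched as a minimal polling client. This is an illustrative sketch using only the standard library, not an official client; it assumes a server on `localhost:8001` and the request/response shapes documented below.

```python
import json
import time
import urllib.request

BASE = "http://localhost:8001"  # assumed server address

def post_json(path, payload):
    """POST a JSON payload and return the parsed response envelope."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def generate(prompt, lyrics="", poll_seconds=2.0):
    # Step 1: submit the task and keep the returned task_id.
    task_id = post_json("/release_task",
                        {"prompt": prompt, "lyrics": lyrics})["data"]["task_id"]
    # Step 2: poll until status is 1 (succeeded) or 2 (failed).
    while True:
        entry = post_json("/query_result", {"task_id_list": [task_id]})["data"][0]
        if entry["status"] == 1:
            # Step 3: `result` is a JSON string; its items carry /v1/audio URLs.
            return [item["file"] for item in json.loads(entry["result"])]
        if entry["status"] == 2:
            raise RuntimeError(f"task {task_id} failed")
        time.sleep(poll_seconds)
```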
---

## Table of Contents

- [Authentication](#1-authentication)
- [Response Format](#2-response-format)
- [Task Status Description](#3-task-status-description)
- [Create Generation Task](#4-create-generation-task)
- [Batch Query Task Results](#5-batch-query-task-results)
- [Format Input](#6-format-input)
- [Get Random Sample](#7-get-random-sample)
- [List Available Models](#8-list-available-models)
- [Server Statistics](#9-server-statistics)
- [Download Audio Files](#10-download-audio-files)
- [Health Check](#11-health-check)
- [Environment Variables](#12-environment-variables)

---

## 1. Authentication

The API supports optional API key authentication. When enabled, a valid key must be provided in requests.

### Authentication Methods

Two authentication methods are supported:

**Method A: ai_token in request body**

```json
{
  "ai_token": "your-api-key",
  "prompt": "upbeat pop song",
  ...
}
```

**Method B: Authorization header**

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Authorization: Bearer your-api-key' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "upbeat pop song"}'
```

### Configuring API Key

Set via environment variable or command-line argument:

```bash
# Environment variable
export ACESTEP_API_KEY=your-secret-key

# Or command-line argument
python -m acestep.api_server --api-key your-secret-key
```

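In client code, the header method can be attached per request; a small standard-library sketch (the key and prompt values are placeholders):

```python
import json
import urllib.request

def authed_request(url, payload, api_key=None):
    """Build a JSON POST request, optionally with Bearer authentication.

    Method B: pass the key in the Authorization header. Method A would
    instead place it in the body as {"ai_token": api_key, ...}.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers
    )

req = authed_request("http://localhost:8001/release_task",
                     {"prompt": "upbeat pop song"}, api_key="your-api-key")
```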
---

## 2. Response Format

All API responses use a unified wrapper format:

```json
{
  "data": { ... },
  "code": 200,
  "error": null,
  "timestamp": 1700000000000,
  "extra": null
}
```

| Field | Type | Description |
| :--- | :--- | :--- |
| `data` | any | Actual response data |
| `code` | int | Status code (200 = success) |
| `error` | string | Error message (null on success) |
| `timestamp` | int | Response timestamp (milliseconds) |
| `extra` | any | Extra information (usually null) |

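Since every endpoint shares this envelope, a client can unwrap it in one place. A sketch, using only the field names in the table above:

```python
def unwrap(envelope: dict):
    """Return `data` from the standard wrapper, raising on API-level errors."""
    if envelope.get("code") != 200 or envelope.get("error"):
        raise RuntimeError(
            f"API error {envelope.get('code')}: {envelope.get('error')}"
        )
    return envelope["data"]

ok = unwrap({"data": {"task_id": "abc"}, "code": 200, "error": None,
             "timestamp": 1700000000000, "extra": None})
# ok == {"task_id": "abc"}
```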
---

## 3. Task Status Description

Task status (`status`) is represented as an integer:

| Status Code | Status Name | Description |
| :--- | :--- | :--- |
| `0` | queued/running | Task is queued or in progress |
| `1` | succeeded | Generation succeeded, result is ready |
| `2` | failed | Generation failed |

---

## 4. Create Generation Task

### 4.1 API Definition

- **URL**: `/release_task`
- **Method**: `POST`
- **Content-Type**: `application/json`, `multipart/form-data`, or `application/x-www-form-urlencoded`

### 4.2 Request Parameters

#### Parameter Naming Convention

The API accepts both **snake_case** and **camelCase** names for most parameters. For example:
- `audio_duration` / `duration` / `audioDuration`
- `key_scale` / `keyscale` / `keyScale`
- `time_signature` / `timesignature` / `timeSignature`
- `sample_query` / `sampleQuery` / `description` / `desc`
- `use_format` / `useFormat` / `format`

Additionally, metadata can be passed in a nested object (`metas`, `metadata`, or `user_metadata`).

#### Method A: JSON Request (application/json)

Suitable for passing only text parameters, or for referencing audio file paths that already exist on the server.

**Basic Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `prompt` | string | `""` | Music description prompt (alias: `caption`) |
| `lyrics` | string | `""` | Lyrics content |
| `thinking` | bool | `false` | Whether to use the 5Hz LM to generate audio codes (lm-dit behavior) |
| `vocal_language` | string | `"en"` | Lyrics language (en, zh, ja, etc.) |
| `audio_format` | string | `"mp3"` | Output format (mp3, wav, flac) |

**Sample/Description Mode Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `sample_mode` | bool | `false` | Enable random sample generation mode (auto-generates caption/lyrics/metas via LM) |
| `sample_query` | string | `""` | Natural language description for sample generation (e.g., "a soft Bengali love song"). Aliases: `description`, `desc` |
| `use_format` | bool | `false` | Use the LM to enhance/format the provided caption and lyrics. Alias: `format` |

**Multi-Model Support**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `model` | string | null | Select which DiT model to use (e.g., `"acestep-v15-turbo"`, `"acestep-v15-turbo-shift3"`). Use `/v1/models` to list available models. If not specified, the default model is used. |

**thinking Semantics (Important)**:

- `thinking=false`:
  - The server will **NOT** use the 5Hz LM to generate `audio_code_string`.
  - DiT runs in **text2music** mode and **ignores** any provided `audio_code_string`.
- `thinking=true`:
  - The server will use the 5Hz LM to generate `audio_code_string` (lm-dit behavior).
  - DiT runs with LM-generated codes for enhanced music quality.

**Metadata Auto-Completion (Conditional)**:

When `use_cot_caption=true`, `use_cot_language=true`, or metadata fields are missing, the server may call the 5Hz LM to fill the missing fields based on `caption`/`lyrics`:

- `bpm`
- `key_scale`
- `time_signature`
- `audio_duration`

User-provided values always win; the LM only fills fields that are empty or missing.

**Music Attribute Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `bpm` | int | null | Specify tempo (BPM), range 30-300 |
| `key_scale` | string | `""` | Key/scale (e.g., "C Major", "Am"). Aliases: `keyscale`, `keyScale` |
| `time_signature` | string | `""` | Time signature (2, 3, 4, 6 for 2/4, 3/4, 4/4, 6/8). Aliases: `timesignature`, `timeSignature` |
| `audio_duration` | float | null | Generation duration (seconds), range 10-600. Aliases: `duration`, `target_duration` |

**Audio Codes (Optional)**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `audio_code_string` | string or string[] | `""` | Audio semantic tokens (5Hz) for `llm_dit`. Alias: `audioCodeString` |

**Generation Control Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `inference_steps` | int | `8` | Number of inference steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). |
| `guidance_scale` | float | `7.0` | Prompt guidance coefficient. Only effective for the base model. |
| `use_random_seed` | bool | `true` | Whether to use a random seed |
| `seed` | int | `-1` | Specify a seed (when `use_random_seed=false`) |
| `batch_size` | int | `2` | Batch generation count (max 8) |

**Advanced DiT Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `shift` | float | `3.0` | Timestep shift factor (range 1.0-5.0). Only effective for base models, not turbo models. |
| `infer_method` | string | `"ode"` | Diffusion inference method: `"ode"` (Euler, faster) or `"sde"` (stochastic). |
| `timesteps` | string | null | Custom timesteps as comma-separated values (e.g., `"0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0"`). Overrides `inference_steps` and `shift`. |
| `use_adg` | bool | `false` | Use Adaptive Dual Guidance (base model only) |
| `cfg_interval_start` | float | `0.0` | CFG application start ratio (0.0-1.0) |
| `cfg_interval_end` | float | `1.0` | CFG application end ratio (0.0-1.0) |

**5Hz LM Parameters (Optional, server-side)**:

These parameters control 5Hz LM sampling, which is used for metadata auto-completion and (when `thinking=true`) codes generation.

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `lm_model_path` | string | null | 5Hz LM checkpoint dir name (e.g. `acestep-5Hz-lm-0.6B`) |
| `lm_backend` | string | `"vllm"` | `vllm` or `pt` |
| `lm_temperature` | float | `0.85` | Sampling temperature |
| `lm_cfg_scale` | float | `2.5` | CFG scale (>1 enables CFG) |
| `lm_negative_prompt` | string | `"NO USER INPUT"` | Negative prompt used by CFG |
| `lm_top_k` | int | null | Top-k (0/null disables) |
| `lm_top_p` | float | `0.9` | Top-p (values >= 1 are treated as disabled) |
| `lm_repetition_penalty` | float | `1.0` | Repetition penalty |

**LM CoT (Chain-of-Thought) Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `use_cot_caption` | bool | `true` | Let the LM rewrite/enhance the input caption via CoT reasoning. Aliases: `cot_caption`, `cot-caption` |
| `use_cot_language` | bool | `true` | Let the LM detect vocal language via CoT. Aliases: `cot_language`, `cot-language` |
| `constrained_decoding` | bool | `true` | Enable FSM-based constrained decoding for structured LM output. Aliases: `constrainedDecoding`, `constrained` |
| `constrained_decoding_debug` | bool | `false` | Enable debug logging for constrained decoding |
| `allow_lm_batch` | bool | `true` | Allow LM batch processing for efficiency |

**Edit/Reference Audio Parameters** (require absolute paths on the server):

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `reference_audio_path` | string | null | Reference audio path (Style Transfer) |
| `src_audio_path` | string | null | Source audio path (Repainting/Cover) |
| `task_type` | string | `"text2music"` | Task type: `text2music`, `cover`, `repaint`, `lego`, `extract`, `complete` |
| `instruction` | string | auto | Edit instruction (auto-generated based on `task_type` if not provided) |
| `repainting_start` | float | `0.0` | Repainting start time (seconds) |
| `repainting_end` | float | null | Repainting end time (seconds), -1 for end of audio |
| `audio_cover_strength` | float | `1.0` | Cover strength (0.0-1.0). Lower values (0.2) for style transfer. |

#### Method B: File Upload (multipart/form-data)

Use this method when you need to upload local audio files as reference or source audio.

In addition to all the fields above (sent as form fields), the following file fields are supported:

- `reference_audio` or `ref_audio`: (file) reference audio upload
- `src_audio` or `ctx_audio`: (file) source audio upload

> **Note**: When files are uploaded, the corresponding `_path` parameters are ignored and the system uses the temporary path of the uploaded file.

### 4.3 Response Example

```json
{
  "data": {
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "queue_position": 1
  },
  "code": 200,
  "error": null,
  "timestamp": 1700000000000,
  "extra": null
}
```

### 4.4 Usage Examples (cURL)

**Basic JSON Method**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "upbeat pop song",
    "lyrics": "Hello world",
    "inference_steps": 8
  }'
```

**With thinking=true (LM generates codes + fills missing metas)**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "upbeat pop song",
    "lyrics": "Hello world",
    "thinking": true,
    "lm_temperature": 0.85,
    "lm_cfg_scale": 2.5
  }'
```

**Description-driven generation (sample_query)**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "sample_query": "a soft Bengali love song for a quiet evening",
    "thinking": true
  }'
```

**With format enhancement (use_format=true)**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "pop rock",
    "lyrics": "[Verse 1]\nWalking down the street...",
    "use_format": true,
    "thinking": true
  }'
```

**Select a specific model**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "electronic dance music",
    "model": "acestep-v15-turbo",
    "thinking": true
  }'
```

**With custom timesteps**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "jazz piano trio",
    "timesteps": "0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0",
    "thinking": true
  }'
```

**File Upload Method**:

```bash
curl -X POST http://localhost:8001/release_task \
  -F "prompt=remix this song" \
  -F "src_audio=@/path/to/local/song.mp3" \
  -F "task_type=repaint"
```

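The file-upload call above can also be made from Python. Building the multipart body by hand keeps the sketch standard-library only; the boundary format follows multipart/form-data conventions, and the field names (`src_audio`, `task_type`) are the ones documented above:

```python
import io
import urllib.request
import uuid

def multipart_body(fields: dict, files: dict):
    """Build a multipart/form-data body (stdlib only).

    fields: name -> text value; files: name -> (filename, bytes).
    Returns (boundary, body_bytes).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f'--{boundary}\r\nContent-Disposition: form-data; '
                  f'name="{name}"\r\n\r\n{value}\r\n'.encode())
    for name, (filename, data) in files.items():
        buf.write(f'--{boundary}\r\nContent-Disposition: form-data; '
                  f'name="{name}"; filename="{filename}"\r\n'
                  f'Content-Type: application/octet-stream\r\n\r\n'.encode())
        buf.write(data + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return boundary, buf.getvalue()

boundary, body = multipart_body(
    {"prompt": "remix this song", "task_type": "repaint"},
    {"src_audio": ("song.mp3", b"...audio bytes...")},  # placeholder bytes
)
req = urllib.request.Request(
    "http://localhost:8001/release_task", data=body,
    headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
)
```

In practice a third-party HTTP client (e.g. the `requests` package) does this encoding for you; the sketch only shows what goes over the wire.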
---

## 5. Batch Query Task Results

### 5.1 API Definition

- **URL**: `/query_result`
- **Method**: `POST`
- **Content-Type**: `application/json` or `application/x-www-form-urlencoded`

### 5.2 Request Parameters

| Parameter Name | Type | Description |
| :--- | :--- | :--- |
| `task_id_list` | string (JSON array) or array | List of task IDs to query |

### 5.3 Response Example

```json
{
  "data": [
    {
      "task_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": 1,
      "result": "[{\"file\": \"/v1/audio?path=...\", \"wave\": \"\", \"status\": 1, \"create_time\": 1700000000, \"env\": \"development\", \"prompt\": \"upbeat pop song\", \"lyrics\": \"Hello world\", \"metas\": {\"bpm\": 120, \"duration\": 30, \"genres\": \"\", \"keyscale\": \"C Major\", \"timesignature\": \"4\"}, \"generation_info\": \"...\", \"seed_value\": \"12345,67890\", \"lm_model\": \"acestep-5Hz-lm-0.6B\", \"dit_model\": \"acestep-v15-turbo\"}]"
    }
  ],
  "code": 200,
  "error": null,
  "timestamp": 1700000000000,
  "extra": null
}
```

**Result Field Description** (`result` is a JSON string; once parsed, it contains):

| Field | Type | Description |
| :--- | :--- | :--- |
| `file` | string | Audio file URL (use with the `/v1/audio` endpoint) |
| `wave` | string | Waveform data (usually empty) |
| `status` | int | Status code (0=in progress, 1=success, 2=failed) |
| `create_time` | int | Creation time (Unix timestamp) |
| `env` | string | Environment identifier |
| `prompt` | string | Prompt used |
| `lyrics` | string | Lyrics used |
| `metas` | object | Metadata (bpm, duration, genres, keyscale, timesignature) |
| `generation_info` | string | Generation info summary |
| `seed_value` | string | Seed values used (comma-separated) |
| `lm_model` | string | LM model name used |
| `dit_model` | string | DiT model name used |

### 5.4 Usage Example

```bash
curl -X POST http://localhost:8001/query_result \
  -H 'Content-Type: application/json' \
  -d '{
    "task_id_list": ["550e8400-e29b-41d4-a716-446655440000"]
  }'
```

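Note that `result` arrives as a JSON *string*, so it needs a second parse after the envelope itself. A small sketch with a trimmed-down entry shaped like the response example above:

```python
import json

# A trimmed /query_result entry (fields as documented above).
entry = {
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": 1,
    "result": json.dumps([{
        "file": "/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3",
        "status": 1,
        "metas": {"bpm": 120, "duration": 30, "keyscale": "C Major"},
        "seed_value": "12345,67890",
    }]),
}

urls, seeds = [], []
if entry["status"] == 1:
    items = json.loads(entry["result"])  # second parse: string -> list of dicts
    urls = [item["file"] for item in items]
    # seed_value is a comma-separated string, one seed per generated track
    seeds = [int(s) for s in items[0]["seed_value"].split(",")]
```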
424
+
425
+ ## 6. Format Input
426
+
427
+ ### 6.1 API Definition
428
+
429
+ - **URL**: `/format_input`
430
+ - **Method**: `POST`
431
+
432
+ This endpoint uses LLM to enhance and format user-provided caption and lyrics.
433
+
434
+ ### 6.2 Request Parameters
435
+
436
+ | Parameter Name | Type | Default | Description |
437
+ | :--- | :--- | :--- | :--- |
438
+ | `prompt` | string | `""` | Music description prompt |
439
+ | `lyrics` | string | `""` | Lyrics content |
440
+ | `temperature` | float | `0.85` | LM sampling temperature |
441
+ | `param_obj` | string (JSON) | `"{}"` | JSON object containing metadata (duration, bpm, key, time_signature, language) |
442
+
443
+ ### 6.3 Response Example
444
+
445
+ ```json
446
+ {
447
+ "data": {
448
+ "caption": "Enhanced music description",
449
+ "lyrics": "Formatted lyrics...",
450
+ "bpm": 120,
451
+ "key_scale": "C Major",
452
+ "time_signature": "4",
453
+ "duration": 180,
454
+ "vocal_language": "en"
455
+ },
456
+ "code": 200,
457
+ "error": null,
458
+ "timestamp": 1700000000000,
459
+ "extra": null
460
+ }
461
+ ```
462
+
463
+ ### 6.4 Usage Example
464
+
465
+ ```bash
466
+ curl -X POST http://localhost:8001/format_input \
467
+ -H 'Content-Type: application/json' \
468
+ -d '{
469
+ "prompt": "pop rock",
470
+ "lyrics": "Walking down the street",
471
+ "param_obj": "{\"duration\": 180, \"language\": \"en\"}"
472
+ }'
473
+ ```
474
+
475
+ ---
476
+
477
+ ## 7. Get Random Sample
478
+
479
+ ### 7.1 API Definition
480
+
481
+ - **URL**: `/create_random_sample`
482
+ - **Method**: `POST`
483
+
484
+ This endpoint returns random sample parameters from pre-loaded example data for form filling.
485
+
486
+ ### 7.2 Request Parameters
487
+
488
+ | Parameter Name | Type | Default | Description |
489
+ | :--- | :--- | :--- | :--- |
490
+ | `sample_type` | string | `"simple_mode"` | Sample type: `"simple_mode"` or `"custom_mode"` |
491
+
492
+ ### 7.3 Response Example
493
+
494
+ ```json
495
+ {
496
+ "data": {
497
+ "caption": "Upbeat pop song with guitar accompaniment",
498
+ "lyrics": "[Verse 1]\nSunshine on my face...",
499
+ "bpm": 120,
500
+ "key_scale": "G Major",
501
+ "time_signature": "4",
502
+ "duration": 180,
503
+ "vocal_language": "en"
504
+ },
505
+ "code": 200,
506
+ "error": null,
507
+ "timestamp": 1700000000000,
508
+ "extra": null
509
+ }
510
+ ```
511
+
512
+ ### 7.4 Usage Example
513
+
514
+ ```bash
515
+ curl -X POST http://localhost:8001/create_random_sample \
516
+ -H 'Content-Type: application/json' \
517
+ -d '{"sample_type": "simple_mode"}'
518
+ ```
519
+
520
+ ---
521
+
522
+ ## 8. List Available Models
523
+
524
+ ### 8.1 API Definition
525
+
526
+ - **URL**: `/v1/models`
527
+ - **Method**: `GET`
528
+
529
+ Returns a list of available DiT models loaded on the server.
530
+
531
+ ### 8.2 Response Example
532
+
533
+ ```json
534
+ {
535
+ "data": {
536
+ "models": [
537
+ {
538
+ "name": "acestep-v15-turbo",
539
+ "is_default": true
540
+ },
541
+ {
542
+ "name": "acestep-v15-turbo-shift3",
543
+ "is_default": false
544
+ }
545
+ ],
546
+ "default_model": "acestep-v15-turbo"
547
+ },
548
+ "code": 200,
549
+ "error": null,
550
+ "timestamp": 1700000000000,
551
+ "extra": null
552
+ }
553
+ ```
554
+
555
+ ### 8.3 Usage Example
556
+
557
+ ```bash
558
+ curl http://localhost:8001/v1/models
559
+ ```
560
+
561
+ ---
562
+
563
+ ## 9. Server Statistics
564
+
565
+ ### 9.1 API Definition
566
+
567
+ - **URL**: `/v1/stats`
568
+ - **Method**: `GET`
569
+
570
+ Returns server runtime statistics.
571
+
572
+ ### 9.2 Response Example
573
+
574
+ ```json
575
+ {
576
+ "data": {
577
+ "jobs": {
578
+ "total": 100,
579
+ "queued": 5,
580
+ "running": 1,
581
+ "succeeded": 90,
582
+ "failed": 4
583
+ },
584
+ "queue_size": 5,
585
+ "queue_maxsize": 200,
586
+ "avg_job_seconds": 8.5
587
+ },
588
+ "code": 200,
589
+ "error": null,
590
+ "timestamp": 1700000000000,
591
+ "extra": null
592
+ }
593
+ ```
594
+
595
+ ### 9.3 Usage Example
596
+
597
+ ```bash
598
+ curl http://localhost:8001/v1/stats
599
+ ```
600
+
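These statistics can drive simple client-side load decisions; for instance, a rough queue-wait estimate from the fields above (illustrative only, not a server-provided value):

```python
def estimated_wait_seconds(stats: dict) -> float:
    """Rough wait estimate: queued jobs times the rolling average job time."""
    return stats["queue_size"] * stats["avg_job_seconds"]

# With the example payload above: 5 queued jobs * 8.5 s/job
wait = estimated_wait_seconds({"queue_size": 5, "avg_job_seconds": 8.5})
# wait == 42.5
```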
---

## 10. Download Audio Files

### 10.1 API Definition

- **URL**: `/v1/audio`
- **Method**: `GET`

Download generated audio files by path.

### 10.2 Request Parameters

| Parameter Name | Type | Description |
| :--- | :--- | :--- |
| `path` | string | URL-encoded path to the audio file |

### 10.3 Usage Example

```bash
# Download using the URL from the task result
curl "http://localhost:8001/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3" -o output.mp3
```

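The `file` value returned by `/query_result` is already a server-relative `/v1/audio?...` URL, so a client only needs to prefix the base URL. A standard-library sketch using the example path above:

```python
import urllib.parse

BASE = "http://localhost:8001"
file_url = "/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3"  # from /query_result

# urljoin keeps the percent-encoded `path` query parameter intact.
full_url = urllib.parse.urljoin(BASE, file_url)
# full_url == "http://localhost:8001/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3"
# urllib.request.urlretrieve(full_url, "output.mp3") would then save the file.
```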
625
+ ---
626
+
627
+ ## 11. Health Check
628
+
629
+ ### 11.1 API Definition
630
+
631
+ - **URL**: `/health`
632
+ - **Method**: `GET`
633
+
634
+ Returns service health status.
635
+
636
+ ### 11.2 Response Example
637
+
638
+ ```json
639
+ {
640
+ "data": {
641
+ "status": "ok",
642
+ "service": "ACE-Step API",
643
+ "version": "1.0"
644
+ },
645
+ "code": 200,
646
+ "error": null,
647
+ "timestamp": 1700000000000,
648
+ "extra": null
649
+ }
650
+ ```
651
+
652
+ ---
653
+
654
+ ## 12. Environment Variables
655
+
656
The API server can be configured using environment variables:

### Server Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_API_HOST` | `127.0.0.1` | Server bind host |
| `ACESTEP_API_PORT` | `8001` | Server bind port |
| `ACESTEP_API_KEY` | (empty) | API authentication key (empty disables auth) |
| `ACESTEP_API_WORKERS` | `1` | API worker thread count |

### Model Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_CONFIG_PATH` | `acestep-v15-turbo` | Primary DiT model path |
| `ACESTEP_CONFIG_PATH2` | (empty) | Secondary DiT model path (optional) |
| `ACESTEP_CONFIG_PATH3` | (empty) | Third DiT model path (optional) |
| `ACESTEP_DEVICE` | `auto` | Device for model loading |
| `ACESTEP_USE_FLASH_ATTENTION` | `true` | Enable flash attention |
| `ACESTEP_OFFLOAD_TO_CPU` | `false` | Offload models to CPU when idle |
| `ACESTEP_OFFLOAD_DIT_TO_CPU` | `false` | Offload the DiT specifically to CPU |

### LM Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_INIT_LLM` | auto | Whether to initialize the LM at startup (auto-determined from the GPU) |
| `ACESTEP_LM_MODEL_PATH` | `acestep-5Hz-lm-0.6B` | Default 5Hz LM model |
| `ACESTEP_LM_BACKEND` | `vllm` | LM backend (`vllm` or `pt`) |
| `ACESTEP_LM_DEVICE` | (same as `ACESTEP_DEVICE`) | Device for the LM |
| `ACESTEP_LM_OFFLOAD_TO_CPU` | `false` | Offload the LM to CPU |

### Queue Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_QUEUE_MAXSIZE` | `200` | Maximum queue size |
| `ACESTEP_QUEUE_WORKERS` | `1` | Number of queue workers |
| `ACESTEP_AVG_JOB_SECONDS` | `5.0` | Initial estimate of the average job duration |
| `ACESTEP_AVG_WINDOW` | `50` | Window size for averaging job duration |

### Cache Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_TMPDIR` | `.cache/acestep/tmp` | Temporary file directory |
| `TRITON_CACHE_DIR` | `.cache/acestep/triton` | Triton cache directory |
| `TORCHINDUCTOR_CACHE_DIR` | `.cache/acestep/torchinductor` | TorchInductor cache directory |

---
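
As a worked example of the variable tables above, a low-VRAM deployment with authentication enabled might be configured like this before starting the server. The specific values are illustrative, not recommendations:

```shell
# Hypothetical example values combining the variables documented above.
export ACESTEP_API_HOST=0.0.0.0          # bind on all interfaces
export ACESTEP_API_PORT=8001
export ACESTEP_API_KEY=sk-example-key    # any non-empty value enables auth
export ACESTEP_OFFLOAD_TO_CPU=true       # low-VRAM: offload idle models
export ACESTEP_QUEUE_MAXSIZE=100         # reject new work earlier under load
# Then start the server as usual, e.g.: uv run acestep-api
echo "API will bind ${ACESTEP_API_HOST}:${ACESTEP_API_PORT}"
```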

## Error Handling

**HTTP Status Codes**:

- `200`: Success
- `400`: Invalid request (bad JSON, missing fields)
- `401`: Unauthorized (missing or invalid API key)
- `404`: Resource not found
- `415`: Unsupported Content-Type
- `429`: Server busy (queue is full)
- `500`: Internal server error

**Error Response Format**:

```json
{
  "detail": "Error message describing the issue"
}
```
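
Of the codes above, `429` deserves explicit client-side handling: it signals a full queue rather than a hard failure, so a client can back off and retry. A minimal sketch; the `send` callable and the retry/backoff values are illustrative, not part of the API:

```python
import time

def post_with_retry(send, payload, retries=3, backoff=1.0):
    """Retry a request while the server answers 429 (queue full).

    `send` is any callable that takes the payload and returns an object
    with a `status_code` attribute, e.g. a small wrapper around
    requests.post against your server URL.
    """
    resp = None
    for attempt in range(retries):
        resp = send(payload)
        if resp.status_code != 429:
            return resp
        # Exponential backoff between attempts while the queue drains
        time.sleep(backoff * (2 ** attempt))
    return resp
```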

---

## Best Practices

1. **Use `thinking=true`** for the best-quality results with LM-enhanced generation.

2. **Use `sample_query`/`description`** for quick generation from natural-language descriptions.

3. **Use `use_format=true`** when you already have a caption/lyrics but want the LM to enhance them.

4. **Batch task-status queries** with the `/query_result` endpoint, which can report on multiple tasks at once.

5. **Check `/v1/stats`** to understand server load and average job time.

6. **Use multi-model support** by setting the `ACESTEP_CONFIG_PATH2` and `ACESTEP_CONFIG_PATH3` environment variables, then selecting with the `model` parameter.

7. **For production**, set `ACESTEP_API_KEY` to enable authentication and secure your API.

8. **For low-VRAM environments**, enable `ACESTEP_OFFLOAD_TO_CPU=true` to support longer audio generation.
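
Practice 4 can be packaged as a tiny client helper. Note that the request schema is an assumption for illustration (this section only names the endpoint); verify the actual field name against the `/query_result` reference before relying on it:

```python
import json

def build_batch_query(task_ids):
    """Build a single batch status query for POST /query_result.

    NOTE: the "task_ids" field name is assumed for illustration;
    check the endpoint reference for the real schema.
    """
    # De-duplicate so each task is queried once per poll
    return {"task_ids": sorted(set(task_ids))}

payload = build_batch_query(["t2", "t1", "t2"])
print(json.dumps(payload))
```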
.claude/skills/acestep-docs/api/Openrouter_API.md ADDED
@@ -0,0 +1,517 @@
# ACE-Step OpenRouter API Documentation

> OpenAI Chat Completions-compatible API for AI music generation

**Base URL:** `http://{host}:{port}` (default `http://127.0.0.1:8002`)

---

## Table of Contents

- [Authentication](#authentication)
- [Endpoints](#endpoints)
  - [POST /v1/chat/completions - Generate Music](#1-generate-music)
  - [GET /api/v1/models - List Models](#2-list-models)
  - [GET /health - Health Check](#3-health-check)
- [Input Modes](#input-modes)
- [Streaming Responses](#streaming-responses)
- [Examples](#examples)
- [Error Codes](#error-codes)

---

## Authentication

If the server is configured with an API key (via the `OPENROUTER_API_KEY` environment variable or the `--api-key` CLI flag), all requests must include the following header:

```
Authorization: Bearer <your-api-key>
```

No authentication is required when no API key is configured.
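
Client code can mirror this optional-auth behaviour with a small header helper; the helper itself is a client-side convenience, not part of the API:

```python
import os

def auth_headers(api_key=None):
    """Build request headers, attaching Authorization only when a key exists.

    Mirrors the server behaviour above: auth is optional unless a key
    is configured (passed explicitly or via OPENROUTER_API_KEY).
    """
    headers = {"Content-Type": "application/json"}
    key = api_key or os.environ.get("OPENROUTER_API_KEY", "")
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return headers
```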

---

## Endpoints

### 1. Generate Music

**POST** `/v1/chat/completions`

Generates music from chat messages and returns audio data along with LM-generated metadata.

#### Request Parameters

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | No | `"acemusic/acestep-v1.5-turbo"` | Model ID |
| `messages` | array | **Yes** | - | Chat message list. See [Input Modes](#input-modes) |
| `stream` | boolean | No | `false` | Enable streaming response. See [Streaming Responses](#streaming-responses) |
| `temperature` | float | No | `0.85` | LM sampling temperature |
| `top_p` | float | No | `0.9` | LM nucleus sampling parameter |
| `lyrics` | string | No | `""` | Lyrics passed directly (takes priority over lyrics parsed from messages) |
| `duration` | float | No | `null` | Audio duration in seconds. If omitted, determined automatically by the LM |
| `bpm` | integer | No | `null` | Beats per minute. If omitted, determined automatically by the LM |
| `vocal_language` | string | No | `"en"` | Vocal language code (e.g. `"zh"`, `"en"`, `"ja"`) |
| `instrumental` | boolean | No | `false` | Whether to generate instrumental-only music (no vocals) |
| `thinking` | boolean | No | `false` | Enable LLM thinking mode for deeper reasoning |
| `use_cot_metas` | boolean | No | `true` | Auto-generate BPM, duration, key, time signature via Chain-of-Thought |
| `use_cot_caption` | boolean | No | `true` | Rewrite/enhance the music description via Chain-of-Thought |
| `use_cot_language` | boolean | No | `true` | Auto-detect vocal language via Chain-of-Thought |
| `use_format` | boolean | No | `true` | When prompt/lyrics are provided directly, enhance them via LLM formatting |

> **Note on LM parameters:** `use_format` applies when the user provides an explicit prompt/lyrics (tagged or lyrics mode) and enhances the description and lyrics formatting via the LLM. The `use_cot_*` parameters control Phase 1 CoT reasoning during the audio generation stage. When `use_format` or sample mode has already generated a duration, `use_cot_metas` is skipped automatically to avoid redundancy.

#### messages Format

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Your input content"
    }
  ]
}
```

Set `role` to `"user"` and `content` to the text input. The system automatically determines the input mode based on the content. See [Input Modes](#input-modes) for details.

---

#### Non-Streaming Response (`stream: false`)

```json
{
  "id": "chatcmpl-a1b2c3d4e5f6g7h8",
  "object": "chat.completion",
  "created": 1706688000,
  "model": "acemusic/acestep-v1.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "## Metadata\n**Caption:** Upbeat pop song...\n**BPM:** 120\n**Duration:** 30s\n**Key:** C major\n\n## Lyrics\n[Verse 1]\nHello world...",
        "audio": [
          {
            "type": "audio_url",
            "audio_url": {
              "url": "data:audio/mpeg;base64,SUQzBAAAAAAAI1RTU0UAAAA..."
            }
          }
        ]
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 100,
    "total_tokens": 110
  }
}
```

**Response Fields:**

| Field | Description |
|---|---|
| `choices[0].message.content` | Text information generated by the LM, including Metadata (Caption, BPM, Duration, Key, Time Signature, Language) and Lyrics. Returns `"Music generated successfully."` if the LM was not involved |
| `choices[0].message.audio` | Audio data array. Each item contains `type` (`"audio_url"`) and `audio_url.url` (a Base64 Data URL in the format `data:audio/mpeg;base64,...`) |
| `choices[0].finish_reason` | `"stop"` indicates normal completion |

**Decoding Audio:**

The `audio_url.url` value is a Data URL: `data:audio/mpeg;base64,<base64_data>`

Extract the base64 portion after the comma and decode it to get the MP3 file:

```python
import base64

url = response["choices"][0]["message"]["audio"][0]["audio_url"]["url"]
# Strip the "data:audio/mpeg;base64," prefix
b64_data = url.split(",", 1)[1]
audio_bytes = base64.b64decode(b64_data)

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```

```javascript
const url = response.choices[0].message.audio[0].audio_url.url;
const b64Data = url.split(",")[1];
const binaryString = atob(b64Data); // binary string; wrap in Uint8Array for raw bytes
// Or use the Data URL directly in an <audio> element
const audio = new Audio(url);
audio.play();
```

---

### 2. List Models

**GET** `/api/v1/models`

Returns available model information.

#### Response

```json
{
  "data": [
    {
      "id": "acemusic/acestep-v1.5-turbo",
      "name": "ACE-Step",
      "created": 1706688000,
      "description": "High-performance text-to-music generation model...",
      "input_modalities": ["text"],
      "output_modalities": ["audio"],
      "context_length": 4096,
      "pricing": {
        "prompt": "0",
        "completion": "0",
        "request": "0"
      },
      "supported_sampling_parameters": ["temperature", "top_p"]
    }
  ]
}
```

---

### 3. Health Check

**GET** `/health`

#### Response

```json
{
  "status": "ok",
  "service": "ACE-Step OpenRouter API",
  "version": "1.0"
}
```

---

## Input Modes

The system automatically selects the input mode based on the content of the last `user` message:

### Mode 1: Tagged Mode (Recommended)

Use `<prompt>` and `<lyrics>` tags to explicitly specify the music description and lyrics:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "<prompt>A gentle acoustic ballad in C major, 80 BPM, female vocal</prompt>\n<lyrics>[Verse 1]\nSunlight through the window\nA brand new day begins\n\n[Chorus]\nWe are the dreamers\nWe are the light</lyrics>"
    }
  ]
}
```

- `<prompt>...</prompt>` - Music style/scene description (caption)
- `<lyrics>...</lyrics>` - Lyrics content
- Either tag can be used alone
- When `use_format=true`, the LLM automatically enhances both prompt and lyrics
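
When composing tagged content programmatically, a small helper keeps the tag pairing correct; it is a client-side convenience, not part of the API:

```python
def tagged_content(prompt=None, lyrics=None):
    """Compose a tagged-mode message body; either part may be omitted."""
    parts = []
    if prompt:
        parts.append(f"<prompt>{prompt}</prompt>")
    if lyrics:
        parts.append(f"<lyrics>{lyrics}</lyrics>")
    return "\n".join(parts)

# Example message in tagged mode:
message = {
    "role": "user",
    "content": tagged_content(
        prompt="A gentle acoustic ballad in C major, 80 BPM, female vocal",
        lyrics="[Verse 1]\nSunlight through the window",
    ),
}
```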

### Mode 2: Natural Language Mode (Sample Mode)

Describe the desired music in natural language. The system uses the LLM to generate the prompt and lyrics automatically:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Generate an upbeat pop song about summer and travel"
    }
  ]
}
```

**Trigger condition:** Message content contains no tags and does not resemble lyrics (no `[Verse]`/`[Chorus]` markers, few lines, or long single lines).

### Mode 3: Lyrics-Only Mode

Pass in lyrics with structural markers directly. The system identifies them automatically:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "[Verse 1]\nWalking down the street\nFeeling the beat\n\n[Chorus]\nDance with me tonight\nUnder the moonlight"
    }
  ]
}
```

**Trigger condition:** Message content contains `[Verse]`, `[Chorus]`, or similar markers, or has a multi-line short-text structure.

### Instrumental Mode

Set `instrumental: true` or use `[inst]` as the lyrics:

```json
{
  "instrumental": true,
  "messages": [
    {
      "role": "user",
      "content": "<prompt>Epic orchestral cinematic score, dramatic and powerful</prompt>"
    }
  ]
}
```

---

## Streaming Responses

Set `"stream": true` to enable SSE (Server-Sent Events) streaming.

### Event Format

Each event starts with `data: `, followed by JSON, and ends with a double newline `\n\n`:

```
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{...},"finish_reason":null}]}

```

### Streaming Event Sequence

| Phase | Delta Content | Description |
|---|---|---|
| 1. Initialization | `{"role":"assistant","content":""}` | Establishes the connection |
| 2. LM Content (optional) | `{"content":"## Metadata\n..."}` | Metadata and lyrics generated by the LM |
| 3. Heartbeat | `{"content":"."}` | Sent every 2 seconds during audio generation to keep the connection alive |
| 4. Audio Data | `{"audio":[{"type":"audio_url","audio_url":{"url":"data:..."}}]}` | The generated audio |
| 5. Finish | `finish_reason: "stop"` | Generation complete |
| 6. Termination | `data: [DONE]` | End-of-stream marker |

### Streaming Response Example

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"content":"\n\n## Metadata\n**Caption:** Upbeat pop\n**BPM:** 120"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"content":"."},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"audio":[{"type":"audio_url","audio_url":{"url":"data:audio/mpeg;base64,..."}}]},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

```

### Client-Side Streaming Handling

```python
import json
import httpx

with httpx.stream("POST", "http://127.0.0.1:8002/v1/chat/completions", json={
    "messages": [{"role": "user", "content": "Generate a cheerful guitar piece"}],
    "stream": True
}) as response:
    content_parts = []
    audio_url = None

    for line in response.iter_lines():
        if not line or not line.startswith("data: "):
            continue
        if line == "data: [DONE]":
            break

        chunk = json.loads(line[6:])
        delta = chunk["choices"][0]["delta"]

        if "content" in delta and delta["content"]:
            # Note: the 2-second keep-alive heartbeats also arrive here as "." chunks
            content_parts.append(delta["content"])

        if "audio" in delta and delta["audio"]:
            audio_url = delta["audio"][0]["audio_url"]["url"]

        if chunk["choices"][0].get("finish_reason") == "stop":
            print("Generation complete!")

print("Content:", "".join(content_parts))
if audio_url:
    import base64
    b64_data = audio_url.split(",", 1)[1]
    with open("output.mp3", "wb") as f:
        f.write(base64.b64decode(b64_data))
```

```javascript
const response = await fetch("http://127.0.0.1:8002/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Generate a cheerful guitar piece" }],
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let audioUrl = null;
let content = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const text = decoder.decode(value);
  for (const line of text.split("\n")) {
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;

    const chunk = JSON.parse(line.slice(6));
    const delta = chunk.choices[0].delta;

    if (delta.content) content += delta.content; // includes "." heartbeats
    if (delta.audio) audioUrl = delta.audio[0].audio_url.url;
  }
}

// audioUrl can be used directly as <audio src="...">
```

---

## Examples

### Example 1: Natural Language Generation (Simplest Usage)

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "A soft folk song about hometown and memories"}
    ],
    "vocal_language": "en"
  }'
```

### Example 2: Tagged Mode with Specific Parameters

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Energetic EDM track with heavy bass drops and synth leads</prompt><lyrics>[Verse 1]\nFeel the rhythm in your soul\nLet the music take control\n\n[Drop]\n(instrumental break)</lyrics>"
      }
    ],
    "bpm": 128,
    "duration": 60,
    "vocal_language": "en"
  }'
```

### Example 3: Instrumental with LM Enhancement Disabled

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Peaceful piano solo, slow tempo, jazz harmony</prompt>"
      }
    ],
    "instrumental": true,
    "use_format": false,
    "use_cot_caption": false,
    "duration": 45
  }'
```

### Example 4: Streaming Request

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "messages": [
      {"role": "user", "content": "Generate a happy birthday song"}
    ],
    "stream": true
  }'
```

### Example 5: Full Control with All Parameters

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Dreamy lo-fi hip hop beat with vinyl crackle</prompt><lyrics>[inst]</lyrics>"
      }
    ],
    "temperature": 0.9,
    "top_p": 0.95,
    "bpm": 85,
    "duration": 30,
    "instrumental": true,
    "thinking": false,
    "use_cot_metas": true,
    "use_cot_caption": true,
    "use_cot_language": false,
    "use_format": true
  }'
```

---

## Error Codes

| HTTP Status | Description |
|---|---|
| 400 | Invalid request format or missing valid input |
| 401 | Missing or invalid API key |
| 500 | Internal error during music generation |
| 503 | Model not yet initialized |

Error response format:

```json
{
  "detail": "Error description message"
}
```

---

## Server Configuration (Environment Variables)

The following environment variables can be used to configure the server (for operations reference):

| Variable | Default | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | None | API authentication key |
| `OPENROUTER_HOST` | `127.0.0.1` | Listen address |
| `OPENROUTER_PORT` | `8002` | Listen port |
| `ACESTEP_CONFIG_PATH` | `acestep-v15-turbo` | DiT model configuration path |
| `ACESTEP_DEVICE` | `auto` | Inference device |
| `ACESTEP_LM_MODEL_PATH` | `acestep-5Hz-lm-0.6B` | LLM model path |
| `ACESTEP_LM_BACKEND` | `vllm` | LLM inference backend |
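
For example, exposing the server on a different host and port with authentication enabled might look like this; the values are illustrative, not recommendations:

```shell
# Hypothetical deployment values; adjust for your environment.
export OPENROUTER_HOST=0.0.0.0          # listen on all interfaces
export OPENROUTER_PORT=9000
export OPENROUTER_API_KEY=sk-example-key
# Then start the server process as usual for your install.
echo "OpenRouter API on ${OPENROUTER_HOST}:${OPENROUTER_PORT}"
```
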
.claude/skills/acestep-docs/getting-started/ABOUT.md ADDED
@@ -0,0 +1,87 @@
# ACE-Step Project Overview

> For installation instructions, see [README.md](README.md)

## Links

- [Project Page](https://ace-step.github.io/ace-step-v1.5.github.io/)
- [Hugging Face](https://huggingface.co/ACE-Step/Ace-Step1.5)
- [ModelScope](https://modelscope.cn/models/ACE-Step/Ace-Step1.5)
- [Space Demo](https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5)
- [Discord](https://discord.gg/PeWDxrkdj7)
- [Technical Report](https://arxiv.org/abs/2602.00744)

## Abstract

ACE-Step v1.5 is a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. Key highlights:

- Quality beyond most commercial music models
- Under 2 seconds per full song on A100, under 10 seconds on RTX 3090
- Runs locally with less than 4GB of VRAM
- Supports lightweight LoRA personalization from just a few songs

The architecture combines a Language Model (LM) as an omni-capable planner with a Diffusion Transformer (DiT). The LM transforms simple user queries into comprehensive song blueprints, scaling from short loops to 10-minute compositions.

## Features

### Performance
- **Ultra-Fast Generation** — Under 2s per full song on A100
- **Flexible Duration** — 10 seconds to 10 minutes (600s)
- **Batch Generation** — Up to 8 songs simultaneously

### Generation Quality
- **Commercial-Grade Output** — Between Suno v4.5 and Suno v5
- **Rich Style Support** — 1000+ instruments and styles
- **Multi-Language Lyrics** — 50+ languages

### Capabilities

| Feature | Description |
|---------|-------------|
| Reference Audio Input | Use reference audio to guide style |
| Cover Generation | Create covers from existing audio |
| Repaint & Edit | Selective local audio editing |
| Track Separation | Separate into individual stems |
| Vocal2BGM | Auto-generate accompaniment |
| Metadata Control | Duration, BPM, key/scale, time signature |
| Simple Mode | Full songs from simple descriptions |
| LoRA Training | 8 songs, 1 hour on 3090 (12GB VRAM) |

## Architecture

The system uses a hybrid LM + DiT architecture:
- **LM (Language Model)**: Plans metadata, lyrics, and captions via Chain-of-Thought
- **DiT (Diffusion Transformer)**: Generates audio from the LM's blueprint

## Model Zoo

### DiT Models

| Model | Steps | Quality | Diversity | HuggingFace |
|-------|:-----:|:-------:|:---------:|-------------|
| `acestep-v15-base` | 50 | Medium | High | [Link](https://huggingface.co/ACE-Step/acestep-v15-base) |
| `acestep-v15-sft` | 50 | High | Medium | [Link](https://huggingface.co/ACE-Step/acestep-v15-sft) |
| `acestep-v15-turbo` | 8 | Very High | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |

### LM Models

| Model | Audio Understanding | Composition | HuggingFace |
|-------|:------------------:|:-----------:|-------------|
| `acestep-5Hz-lm-0.6B` | Medium | Medium | [Link](https://huggingface.co/ACE-Step/acestep-5Hz-lm-0.6B) |
| `acestep-5Hz-lm-1.7B` | Medium | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |
| `acestep-5Hz-lm-4B` | Strong | Strong | [Link](https://huggingface.co/ACE-Step/acestep-5Hz-lm-4B) |

## License

This project is licensed under [MIT](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/LICENSE).

## Citation

```BibTeX
@misc{gong2026acestep,
  title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
  year={2026}
}
```
.claude/skills/acestep-docs/getting-started/README.md ADDED
@@ -0,0 +1,232 @@
# ACE-Step Installation Guide

## Requirements

- Python 3.11
- CUDA GPU recommended (works on CPU/MPS/MLX but slower)

## Installation

### Windows Portable Package (Recommended for Windows)

1. Download and extract: [ACE-Step-1.5.7z](https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z)
2. Requirements: CUDA 12.8
3. The package includes `python_embeded` with all dependencies pre-installed

**Quick Start:**
```bash
# Launch Gradio Web UI (CUDA)
start_gradio_ui.bat

# Launch REST API Server (CUDA)
start_api_server.bat

# Launch Gradio Web UI (AMD ROCm)
start_gradio_ui_rocm.bat

# Launch REST API Server (AMD ROCm)
start_api_server_rocm.bat
```

### Launch Scripts (All Platforms)

Ready-to-use launch scripts with auto environment detection, update checking, and uv auto-install.

**Windows (.bat):**
```bash
start_gradio_ui.bat        # Gradio Web UI (CUDA)
start_api_server.bat       # REST API Server (CUDA)
start_gradio_ui_rocm.bat   # Gradio Web UI (AMD ROCm)
start_api_server_rocm.bat  # REST API Server (AMD ROCm)
```

**Linux (.sh):**
```bash
chmod +x start_gradio_ui.sh start_api_server.sh  # First time only
./start_gradio_ui.sh   # Gradio Web UI (CUDA)
./start_api_server.sh  # REST API Server (CUDA)
```

**macOS Apple Silicon (.sh):**
```bash
chmod +x start_gradio_ui_macos.sh start_api_server_macos.sh  # First time only
./start_gradio_ui_macos.sh   # Gradio Web UI (MLX backend)
./start_api_server_macos.sh  # REST API Server (MLX backend)
```

All launch scripts support:
- Startup update check (enabled by default, configurable)
- Auto environment detection (`python_embeded` or `uv`)
- Auto install of `uv` if needed
- Configurable download source (HuggingFace/ModelScope)
- Customizable language, models, and parameters

See [SCRIPT_CONFIGURATION.md](../guides/SCRIPT_CONFIGURATION.md) for configuration details.

**Manual Launch (Using Python Directly):**
```bash
# Gradio Web UI
python_embeded\python.exe acestep\acestep_v15_pipeline.py  # Windows portable
python acestep/acestep_v15_pipeline.py                     # Linux/macOS

# REST API Server
python_embeded\python.exe acestep\api_server.py  # Windows portable
python acestep/api_server.py                     # Linux/macOS
```

### Standard Installation (All Platforms)

**1. Install uv (Package Manager)**
```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

**2. Clone & Install**
```bash
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync
```

**3. Launch**

**Using uv:**
```bash
# Gradio Web UI (http://localhost:7860)
uv run acestep

# REST API Server (http://localhost:8001)
uv run acestep-api
```

**Using Python directly:**

> **Note:** Make sure to activate your Python environment first:
> - **Conda environment**: Run `conda activate your_env_name` first
> - **venv**: Run `source venv/bin/activate` (Linux/Mac) or `venv\Scripts\activate` (Windows) first
> - **System Python**: Use `python` or `python3` directly

```bash
# Gradio Web UI
python acestep/acestep_v15_pipeline.py

# REST API Server
python acestep/api_server.py
```

## Model Download

Models are automatically downloaded on first run. Manual download options:

### Download Source Configuration

ACE-Step supports multiple download sources:

| Source | Description |
|--------|-------------|
| **auto** (default) | Auto-detect best source based on network |
| **modelscope** | Use ModelScope as download source |
| **huggingface** | Use HuggingFace Hub as download source |

**Using uv:**
```bash
# Download main model
uv run acestep-download

# Download from ModelScope
uv run acestep-download --download-source modelscope

# Download from HuggingFace Hub
uv run acestep-download --download-source huggingface

# Download all models
uv run acestep-download --all

# List available models
uv run acestep-download --list
```

**Using Python directly:**

> **Note:** Replace `python` with your environment's Python executable:
> - Windows portable package: `python_embeded\python.exe`
> - Conda/venv: Activate environment first, then use `python`
> - System: Use `python` or `python3`

```bash
# Download main model
python -m acestep.model_downloader

# Download from ModelScope
python -m acestep.model_downloader --download-source modelscope

# Download from HuggingFace Hub
python -m acestep.model_downloader --download-source huggingface

# Download all models
python -m acestep.model_downloader --all

# List available models
python -m acestep.model_downloader --list
```

### GPU VRAM Recommendations

| GPU VRAM | Recommended LM Model | Notes |
|----------|---------------------|-------|
| ≤6GB | None (DiT only) | LM disabled to save memory |
| 6-12GB | `acestep-5Hz-lm-0.6B` | Lightweight, good balance |
| 12-16GB | `acestep-5Hz-lm-1.7B` | Better quality |
| ≥16GB | `acestep-5Hz-lm-4B` | Best quality |
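
The table above pairs with the `--lm_model_path` option listed under Command Line Options; for example, a 12-16GB card might select the mid-sized LM explicitly (the model name comes straight from the table):

```shell
# Match the LM to available VRAM, e.g. for a 12-16GB GPU:
uv run acestep --lm_model_path acestep-5Hz-lm-1.7B
# On <=6GB cards, skip the LM and run the DiT alone.
```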

## Command Line Options

### Gradio UI (`acestep`)

| Option | Default | Description |
|--------|---------|-------------|
| `--port` | 7860 | Server port |
| `--server-name` | 127.0.0.1 | Server address (`0.0.0.0` for network) |
| `--share` | false | Create public Gradio link |
| `--language` | en | UI language: `en`, `zh`, `ja` |
| `--init_service` | false | Auto-initialize models on startup |
| `--config_path` | auto | DiT model name |
| `--lm_model_path` | auto | LM model name |
| `--offload_to_cpu` | auto | CPU offload (auto if VRAM < 16GB) |
| `--download-source` | auto | Model download source: `auto`, `huggingface`, or `modelscope` |
| `--enable-api` | false | Enable REST API endpoints |
| `--api-key` | none | API authentication key |

**Examples:**

> **Note for Python users:** Replace `python` with your environment's Python executable (see the note in the Launch section above).

```bash
# Public access with Chinese UI
uv run acestep --server-name 0.0.0.0 --share --language zh
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --server-name 0.0.0.0 --share --language zh

# Pre-initialize models
uv run acestep --init_service true --config_path acestep-v15-turbo
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --init_service true --config_path acestep-v15-turbo

# Enable API with authentication
uv run acestep --enable-api --api-key sk-your-secret-key
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --enable-api --api-key sk-your-secret-key

# Use ModelScope as download source
uv run acestep --download-source modelscope
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --download-source modelscope
```

### REST API Server (`acestep-api`)

Same options as Gradio UI. See [API documentation](../api/API.md) for endpoints.
.claude/skills/acestep-docs/getting-started/Tutorial.md ADDED
@@ -0,0 +1,964 @@
1
+ # ACE-Step 1.5 Ultimate Guide (Must Read)
2
+
3
+ ---
4
+
5
+ Hello everyone, I'm Gong Junmin, the developer of ACE-Step. Through this tutorial, I'll guide you through the design philosophy and usage of ACE-Step 1.5.
6
+
7
+ ## Mental Models
8
+
9
+ Before we begin, we need to establish the correct mental models to set proper expectations.
10
+
11
+ ### Human-Centered Design
12
+
13
+ This model is not designed for **one-click generation**, but for **human-centered generation**.
14
+
15
+ Understanding this distinction is crucial.
16
+
17
+ ### What is One-Click Generation?
18
+
19
+ You input a prompt, click generate, listen to a few versions, pick one that sounds good, and use it. If someone else inputs the same prompt, they'll likely get similar results.
20
+
21
+ In this mode, you and AI have a **client-vendor** relationship. You come with a clear purpose, with a vague expectation in mind, hoping AI delivers a product close to that expectation. Essentially, it's not much different from searching on Google or finding songs on Spotify—just with a bit more customization.
22
+
23
+ AI is a service, not a creative inspirer.
24
+
25
+ Suno, Udio, MiniMax, Mureka—these platforms are all designed with this philosophy. They can scale up models as services to ensure delivery. Your generated music is bound by their agreements; you can't run it locally, can't fine-tune for personalized exploration; if they secretly change models or terms, you can only accept it.
26
+
27
+ ### What is Human-Centered Generation?
28
+
29
+ If we weaken the AI layer and strengthen the human layer—letting more human will, creativity, and inspiration give life to AI—this is human-centered generation.
30
+
31
+ Unlike the strong purposefulness of one-click generation, human-centered generation has more of a **playful** nature. It's more like an interactive game where you and the model are **collaborators**.
32
+
33
+ The workflow is like this: you throw out some inspiration seeds, get a few songs, choose interesting directions from them to continue iterating—
34
+ - Adjust prompts to regenerate
35
+ - Use **Cover** to maintain structure and adjust details
36
+ - Use **Repaint** for local modifications
37
+ - Use **Add Layer** to add or remove instrument layers
38
+
39
+ At this point, AI is not a servant to you, but an **inspirer**.
40
+
41
+ ### What Conditions Must This Design Meet?
42
+
43
+ For human-centered generation to truly work, the model must meet several key conditions:
44
+
45
+ **First, it must be open-source, locally runnable, and trainable.**
46
+
47
+ This isn't technical purism, but a matter of ownership. When you use closed-source platforms, you don't own the model, and your generated works are bound by their agreements. Version updates, term changes, service shutdowns—none of these are under your control.
48
+
49
+ But when the model is open-source and locally runnable, everything changes: **You forever own this model, and you forever own all the creations you make with it.** No third-party agreement hassles, no platform risks, you can fine-tune, modify, and build your own creative system based on it. Your works will forever belong to you. It's like buying an instrument—you can use it anytime, anywhere, and adjust it anytime, anywhere.
50
+
51
+ **Second, it must be fast.**
52
+
53
+ Human time is precious, but more importantly—**slow generation breaks flow state**.
54
+
55
+ The core of human-centered workflow is the rapid cycle of "try, listen, adjust." If each generation takes minutes, your inspiration dissipates while waiting, and the "play" experience degrades into the "wait" ordeal.
56
+
57
+ Therefore, we specifically optimized ACE-Step for this: while ensuring quality, we made generation fast enough to support a smooth human-machine dialogue rhythm.
58
+
59
+ ### Finite Game vs Infinite Game
60
+
61
+ One-click generation is a **finite game**—clear goals, result-oriented, ends at the finish line. To some extent, it coldly hollows out the music industry, replacing many people's jobs.
62
+
63
+ Human-centered generation is an **infinite game**—because the fun lies in the process, and the process never ends.
64
+
65
+ Our vision is to democratize AI music generation. Let ACE-Step become a big toy in your pocket, let music return to **Play** itself—the creative "play," not just clicking play.
66
+
67
+ ---
68
+
69
+ ## The Elephant Rider Metaphor
70
+
71
+ > Recommended reading: [The Complete Guide to Mastering Suno](https://www.notion.so/The-Complete-Guide-to-Mastering-Suno-Advanced-Strategies-for-Professional-Music-Generation-2d6ae744ebdf8024be42f6645f884221)—this blog tutorial can help you establish the foundational understanding of AI music.
72
+
73
+ AI music generation is like the famous **elephant rider metaphor** in psychology.
74
+
75
+ Consciousness rides on the subconscious, humans ride on elephants. You can give directions, but you can't make the elephant precisely and instantly execute every command. It has its own inertia, its own temperament, its own will.
76
+
77
+ This elephant is the music generation model.
78
+
79
+ ### The Iceberg Model
80
+
81
+ Between audio and semantics lies a hidden iceberg.
82
+
83
+ What we can describe with language—style, instruments, timbre, emotion, scenes, progression, lyrics, vocal style—these are familiar words, the parts we can touch. But together, they're still just a tiny tip of the audio iceberg above the water.
84
+
85
+ What's the most precise control? You input the expected audio, and the model returns it unchanged.
86
+
87
+ But as long as you're using text descriptions, references, prompts—the model will have room to play. This isn't a bug, it's the nature of things.
88
+
89
+ ### What is the Elephant?
90
+
91
+ This elephant is a fusion of countless elements: data distribution, model scale, algorithm design, annotation bias, evaluation bias—**it's an abstract crystallization of human music history and engineering trade-offs.**
92
+
93
+ Any deviation in these elements will cause it to fail to accurately reflect your taste and expectations.
94
+
95
+ Of course, we can expand data scale, improve algorithm efficiency, increase annotation precision, expand model capacity, introduce more professional evaluation systems—these are all directions we can optimize as model developers.
96
+
97
+ But even if one day we achieve technical "perfection," there's still a fundamental problem we can't avoid: **taste.**
98
+
99
+ ### Taste and Expectations
100
+
101
+ Taste varies from person to person.
102
+
103
+ If a music generation model tries to please all listeners, its output will tend toward the popular average of human music history—**this will be extremely mediocre.**
104
+
105
+ It's humans who give sound meaning, emotion, experience, life, and cultural symbolic value. It's a small group of artists who create unique tastes, then drive ordinary people to consume and follow, turning niche into mainstream popularity. These pioneering minority artists become legends.
106
+
107
+ So when you find the model's output "not to your taste," this might not be the model's problem—**but rather your taste happens to be outside that "average."** This is a good thing.
108
+
109
+ This means: **You need to learn to guide this elephant, not expect it to automatically understand you.**
110
+
111
+ ---
112
+
113
+ ## Knowing the Elephant Herd: Model Architecture and Selection
114
+
115
+ Now you understand the "elephant" metaphor. But actually—
116
+
117
+ **This isn't one elephant, but an entire herd—elephants large and small, forming a family.** 🐘🐘🐘🐘
118
+
119
+ ### Architecture Principles: Two Brains
120
+
121
+ ACE-Step 1.5 uses a **hybrid architecture** with two core components working together:
122
+
123
+ ```
124
+ User Input → [5Hz LM] → Semantic Blueprint → [DiT] → Audio
125
+
126
+ Metadata Inference
127
+ Caption Optimization
128
+ Structure Planning
129
+ ```
130
+
131
+ **5Hz LM (Language Model) — Planner (Optional)**
132
+
133
+ The LM is an "omni-capable planner" responsible for understanding your intent and making plans:
134
+ - Infers music metadata (BPM, key, duration, etc.) through **Chain-of-Thought**
135
+ - Optimizes and expands your caption—understanding and supplementing your intent
136
+ - Generates **semantic codes**—implicitly containing composition melody, orchestration, and some timbre information
137
+
138
+ The LM learns **world knowledge** from training data. It's a planner that improves usability and helps you quickly generate prototypes.
139
+
140
+ **But the LM is not required.**
141
+
142
+ If you're very clear about what you want, or already have a clear planning goal—you can completely skip the LM planning step by not using `thinking` mode.
143
+
144
+ For example, in **Cover mode**, you use reference audio to constrain composition, chords, and structure, letting DiT generate directly. Here, **you replace the LM's work**—you become the planner yourself.
145
+
146
+ Another example: in **Repaint mode**, you use reference audio as context, constraining timbre, mixing, and details, letting DiT directly adjust locally. Here, DiT is more like your creative brainstorming partner, helping with creative ideation and fixing local disharmony.
147
+
148
+ **DiT (Diffusion Transformer) — Executor**
149
+
150
+ DiT is the "audio craftsman," responsible for turning plans into reality:
151
+ - Receives semantic codes and conditions generated by LM
152
+ - Gradually "carves" audio from noise through the **diffusion process**
153
+ - Decides final timbre, mixing, details
154
+
155
+ **Why this design?**
156
+
157
+ Traditional methods let diffusion models generate audio directly from text, but text-to-audio mapping is too vague. ACE-Step introduces LM as an intermediate layer:
158
+ - LM excels at understanding semantics and planning
159
+ - DiT excels at generating high-fidelity audio
160
+ - They work together, each doing their part
161
+
162
+ ### Choosing the Planner: LM Models
163
+
164
+ LM has four options: **No LM** (disable thinking mode), **0.6B**, **1.7B**, **4B**.
165
+
166
+ Their training data is completely identical; the difference is purely in **knowledge capacity**:
167
+ - Larger models have richer world knowledge
168
+ - Larger models have stronger memory (e.g., remembering reference audio melodies)
169
+ - Larger models perform relatively better on long-tail styles or instruments
170
+
171
+ | Choice | Speed | World Knowledge | Memory | Use Cases |
172
+ |--------|:-----:|:---------------:|:------:|-----------|
173
+ | No LM | ⚡⚡⚡⚡ | — | — | You do the planning (e.g., Cover mode) |
174
+ | `0.6B` | ⚡⚡⚡ | Basic | Weak | Low VRAM (< 8GB), rapid prototyping |
175
+ | `1.7B` | ⚡⚡ | Medium | Medium | **Default recommendation** |
176
+ | `4B` | ⚡ | Rich | Strong | Complex tasks, high-quality generation |
177
+
178
+ **How to choose?**
179
+
180
+ Based on your hardware:
181
+ - **VRAM < 8GB** → No LM or `0.6B`
182
+ - **VRAM 8–16GB** → `1.7B` (default)
183
+ - **VRAM > 16GB** → `1.7B` or `4B`
184
+
185
+ ### Choosing the Executor: DiT Models
186
+
187
+ With a planning scheme, you still need to choose an executor. DiT is the core of ACE-Step 1.5—it handles various tasks and decides how to interpret LM-generated codes.
188
+
189
+ We've open-sourced **4 Turbo models**, **1 SFT model**, and **1 Base model**.
190
+
191
+ #### Turbo Series (Recommended for Daily Use)
192
+
193
+ Turbo models are trained with distillation, generating high-quality audio in just 8 steps. The core difference between the four variants is the **shift hyperparameter configuration during distillation**.
194
+
195
+ **What is shift?**
196
+
197
+ Shift determines the "attention allocation" during DiT denoising:
198
+ - **Larger shift** → More effort spent on early denoising (building large structure from pure noise), **stronger semantics**, clearer overall framework
199
+ - **Smaller shift** → More even step distribution, **more details**, but details might also be noise
200
+
201
+ Simple understanding: high shift is like "draw outline first then fill details," low shift is like "draw and fix simultaneously."
202
+
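To make this concrete: one common way flow-matching samplers implement a shift factor is to warp a uniform timestep grid with `t' = shift * t / (1 + (shift - 1) * t)`. Whether ACE-Step uses exactly this formula is an assumption, but it illustrates why a larger shift concentrates the 8 steps at high noise levels, where the large structure is decided:

```python
def shifted_schedule(num_steps: int, shift: float) -> list[float]:
    """Warp a uniform (1 -> 0) timestep grid; t=1 is pure noise, t=0 is clean audio."""
    ts = [1 - i / num_steps for i in range(num_steps)]      # uniform: 1.0, 0.875, ...
    return [shift * t / (1 + (shift - 1) * t) for t in ts]  # shift=1 leaves it unchanged

uniform = shifted_schedule(8, shift=1)  # even spacing: equal effort at every noise level
shifted = shifted_schedule(8, shift=3)  # values pushed toward 1: structural steps dominate
```

With `shift=3`, the midpoint `t=0.5` maps to `0.75`, so more of the step budget is spent "drawing the outline" before any detail work.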
203
+ | Model | Distillation Config | Characteristics |
204
+ |-------|---------------------|-----------------|
205
+ | `turbo` (default) | Joint distillation on shift 1, 2, 3 | **Best balance of creativity and semantics**, thoroughly tested, recommended first choice |
206
+ | `turbo-shift1` | Distilled only on shift=1 | Richer details, but semantics weaker |
207
+ | `turbo-shift3` | Distilled only on shift=3 | Clearer, richer timbre, but may sound "dry," minimal orchestration |
208
+ | `turbo-continuous` | Experimental, supports continuous shift 1–5 | Most flexible tuning, but not thoroughly tested |
209
+
210
+ You can choose based on target music style—you might find you prefer a certain variant. **We recommend starting with default turbo**—it's the most balanced and proven choice.
211
+
212
+ #### SFT Model
213
+
214
+ Compared to Turbo, SFT model has two notable features:
215
+ - **Supports CFG** (Classifier-Free Guidance), allowing fine-tuning of prompt adherence
216
+ - **More steps** (50 steps), giving the model more time to "think"
217
+
218
+ The trade-off: more steps accumulate error, so audio clarity may be slightly inferior to Turbo. But its **detail expression and semantic parsing are better**.
219
+
220
+ If you don't care about inference time, like tuning CFG and steps, and prefer that rich detail feel—SFT is a good choice. LM-generated codes can also work with SFT models.
221
+
222
+ #### Base Model
223
+
224
+ Base is the **master of all tasks**, with three exclusive tasks beyond SFT and Turbo:
225
+
226
+ | Task | Description |
227
+ |------|-------------|
228
+ | `extract` | Extract single tracks from mixed audio (e.g., separate vocals) |
229
+ | `lego` | Add new tracks to existing tracks (e.g., add drums to guitar) |
230
+ | `complete` | Add mixed accompaniment to single track (e.g., add guitar+drums accompaniment to vocals) |
231
+
232
+ Additionally, Base has the **strongest plasticity**. If you have large-scale fine-tuning needs, we recommend starting experiments with Base to train your own SFT model.
233
+
234
+ #### Creating Your Custom Model
235
+
236
+ Beyond official models, you can also use **LoRA fine-tuning** to create your custom model.
237
+
238
+ We'll release an example LoRA model—trained on 20+ "Happy New Year" themed songs, specifically suited for expressing festive atmosphere. This is just a starting point.
239
+
240
+ **What does a custom model mean?**
241
+
242
+ You can reshape DiT's capabilities and preferences with your own data recipe:
243
+ - Like a specific timbre style? Train with that type of songs
244
+ - Want the model better at a certain genre? Collect related data for fine-tuning
245
+ - Have your own unique aesthetic taste? "Teach" it to the model
246
+
247
+ This greatly expands **customization and playability**—train a model unique to you with your aesthetic taste.
248
+
249
+ > For detailed LoRA training guide, see the "LoRA Training" tab in Gradio UI.
250
+
251
+ #### DiT Selection Summary
252
+
253
+ | Model | Steps | CFG | Speed | Exclusive Tasks | Recommended Scenarios |
254
+ |-------|:-----:|:---:|:-----:|-----------------|----------------------|
255
+ | `turbo` (default) | 8 | ❌ | ⚡⚡⚡ | — | Daily use, rapid iteration |
256
+ | `sft` | 50 | ✅ | ⚡ | — | Pursuing details, like tuning |
257
+ | `base` | 50 | ✅ | ⚡ | extract, lego, complete | Special tasks, large-scale fine-tuning |
258
+
259
+ ### Combination Strategies
260
+
261
+ Default configuration is **turbo + 1.7B LM**, suitable for most scenarios.
262
+
263
+ | Need | Recommended Combination |
264
+ |------|------------------------|
265
+ | Fastest speed | `turbo` + No LM or `0.6B` |
266
+ | Daily use | `turbo` + `1.7B` (default) |
267
+ | Pursuing details | `sft` + `1.7B` or `4B` |
268
+ | Special tasks | `base` |
269
+ | Large-scale fine-tuning | `base` |
270
+ | Low VRAM (< 4GB) | `turbo` + No LM + CPU offload |
271
+
272
+ ### Downloading Models
273
+
274
+ ```bash
275
+ # Download default models (turbo + 1.7B LM)
276
+ uv run acestep-download
277
+
278
+ # Download all models
279
+ uv run acestep-download --all
280
+
281
+ # Download specific model
282
+ uv run acestep-download --model acestep-v15-base
283
+ uv run acestep-download --model acestep-5Hz-lm-0.6B
284
+
285
+ # List available models
286
+ uv run acestep-download --list
287
+ ```
288
+
289
+ Download models into a `checkpoints` folder so they are easy to locate.
290
+
291
+ ---
292
+
293
+ ## Guiding the Elephant: What Can You Control?
294
+
295
+ Now that you know this herd of elephants, let's learn how to communicate with them.
296
+
297
+ Each generation is determined by three types of factors: **input control**, **inference hyperparameters**, and **random factors**.
298
+
299
+ ### I. Input Control: What Do You Want?
300
+
301
+ This is the part where you communicate "creative intent" with the model—what kind of music you want to generate.
302
+
303
+ | Category | Parameter | Function |
304
+ |----------|-----------|----------|
305
+ | **Task Type** | `task_type` | Determines generation mode: text2music, cover, repaint, lego, extract, complete |
306
+ | **Text Input** | `caption` | Description of overall music elements: style, instruments, emotion, atmosphere, timbre, vocal gender, progression, etc. |
307
+ | | `lyrics` | Temporal element description: lyric content, music structure evolution, vocal changes, vocal/instrument performance style, start/end style, articulation, etc. (use `[Instrumental]` for instrumental music) |
308
+ | **Music Metadata** | `bpm` | Tempo (30–300) |
309
+ | | `keyscale` | Key (e.g., C Major, Am) |
310
+ | | `timesignature` | Time signature (4/4, 3/4, 6/8) |
311
+ | | `vocal_language` | Vocal language |
312
+ | | `duration` | Target duration (seconds) |
313
+ | **Audio Reference** | `reference_audio` | Global reference for timbre or style (for cover, style transfer) |
314
+ | | `src_audio` | Source audio for non-text2music tasks (text2music defaults to silence, no input needed) |
315
+ | | `audio_codes` | Semantic codes input to model in Cover mode (advanced: reuse codes for variants, convert songs to codes for extension, combine like DJ mixing) |
316
+ | **Interval Control** | `repainting_start/end` | Time interval for operations (repaint redraw area / lego new track area) |
317
+
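Taken together, one generation call is just these parameters bundled into a request. The sketch below is illustrative rather than a confirmed API schema: field names follow the table, and the range checks follow the documented limits (BPM 30-300, `src_audio` only needed outside text2music):

```python
def validate_request(req: dict) -> dict:
    """Sanity-check a generation request against the documented parameter ranges."""
    bpm = req.get("bpm")
    if bpm is not None and not 30 <= bpm <= 300:
        raise ValueError(f"bpm must be in [30, 300], got {bpm}")
    if req.get("task_type") != "text2music" and "src_audio" not in req:
        raise ValueError("non-text2music tasks need src_audio")
    return req

request = validate_request({
    "task_type": "text2music",
    "caption": "female vocal, piano ballad, intimate atmosphere",
    "lyrics": "[Verse]\nI stand by the window",
    "bpm": 72,
    "keyscale": "C Major",
    "timesignature": "4/4",
    "duration": 180,
})
```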
318
+ ---
319
+
320
+ #### About Caption: The Most Important Input
321
+
322
+ **Caption is the most important factor affecting generated music.**
323
+
324
+ It supports multiple input formats: simple style words, comma-separated tags, complex natural language descriptions. We've trained to be compatible with various formats, ensuring text format doesn't significantly affect model performance.
325
+
326
+ **We provide at least 5 ways to help you write good captions:**
327
+
328
+ 1. **Random Dice** — Click the random button in the UI to see how example captions are written. You can use this standardized caption as a template and have an LLM rewrite it to your desired form.
329
+
330
+ 2. **Format Auto-Rewrite** — We support using the `format` feature to automatically expand your handwritten simple caption into complex descriptions.
331
+
332
+ 3. **CoT Rewrite** — If LM is initialized, whether `thinking` mode is enabled or not, we support rewriting and expanding captions through Chain-of-Thought (unless you actively disable it in settings, or LM is not initialized).
333
+
334
+ 4. **Audio to Caption** — Our LM supports converting your input audio to caption. While precision is limited, the vague direction is correct—enough as a starting point.
335
+
336
+ 5. **Simple Mode** — Just input a simple song description, and LM will automatically generate complete caption, lyrics, and metas samples—suitable for quick starts.
337
+
338
+ Whichever method you use, they all solve the same real problem: **as ordinary people, our music vocabulary is impoverished.**
339
+
340
+ If you want generated music to be more interesting and meet expectations, **Prompting is always the optimal option**—it brings the highest marginal returns and surprises.
341
+
342
+ **Common Dimensions for Caption Writing:**
343
+
344
+ | Dimension | Examples |
345
+ |-----------|----------|
346
+ | **Style/Genre** | pop, rock, jazz, electronic, hip-hop, R&B, folk, classical, lo-fi, synthwave |
347
+ | **Emotion/Atmosphere** | melancholic, uplifting, energetic, dreamy, dark, nostalgic, euphoric, intimate |
348
+ | **Instruments** | acoustic guitar, piano, synth pads, 808 drums, strings, brass, electric bass |
349
+ | **Timbre Texture** | warm, bright, crisp, muddy, airy, punchy, lush, raw, polished |
350
+ | **Era Reference** | 80s synth-pop, 90s grunge, 2010s EDM, vintage soul, modern trap |
351
+ | **Production Style** | lo-fi, high-fidelity, live recording, studio-polished, bedroom pop |
352
+ | **Vocal Characteristics** | female vocal, male vocal, breathy, powerful, falsetto, raspy, choir |
353
+ | **Speed/Rhythm** | slow tempo, mid-tempo, fast-paced, groovy, driving, laid-back |
354
+ | **Structure Hints** | building intro, catchy chorus, dramatic bridge, fade-out ending |
355
+
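If you find it easier to think dimension by dimension, a caption can be assembled mechanically. A small sketch: the dimension keys are taken from the table above, the function itself is illustrative:

```python
def build_caption(**dimensions: str) -> str:
    """Join dimension values into a comma-separated caption, in table order."""
    order = ["style", "emotion", "instruments", "timbre", "era",
             "production", "vocal", "tempo", "structure"]
    parts = [dimensions[key] for key in order if key in dimensions]
    return ", ".join(parts)

caption = build_caption(
    style="synthwave",
    emotion="nostalgic",
    instruments="analog synth pads, 808 drums",
    timbre="warm, airy",
    vocal="female vocal, breathy",
)
```

Omitted dimensions are simply left out, which (per principle 6 below the table) hands that freedom back to the model.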
356
+ **Some Practical Principles:**
357
+
358
+ 1. **Specific beats vague** — "sad piano ballad with female breathy vocal" works better than "a sad song."
359
+
360
+ 2. **Combine multiple dimensions** — Single-dimension descriptions give the model too much room to play; combining style+emotion+instruments+timbre can more precisely anchor your desired direction.
361
+
362
+ 3. **Use references well** — "in the style of 80s synthwave" or "reminiscent of Bon Iver" can quickly convey complex aesthetic preferences.
363
+
364
+ 4. **Texture words are useful** — Adjectives like warm, crisp, airy, punchy can influence mixing and timbre tendencies.
365
+
366
+ 5. **Don't pursue perfect descriptions** — Caption is a starting point, not an endpoint. Write a general direction first, then iterate based on results.
367
+
368
+ 6. **Description granularity determines freedom** — More omitted descriptions give the model more room to play, more random factor influence; more detailed descriptions constrain the model more. Decide specificity based on your needs—want surprises? Write less. Want control? Write more details.
369
+
370
+ 7. **Avoid conflicting words** — Conflicting style combinations easily lead to degraded output. For example, asking for both "classical strings" and "hardcore metal" at once: the model will try to fuse them, usually with unsatisfying results. This matters especially when `thinking` mode is enabled, because the LM generalizes over captions less well than DiT; when a prompt is unreasonable, the chance of pleasant surprises shrinks.
371
+
372
+ **Ways to resolve conflicts:**
373
+ - **Repetition reinforcement** — Strengthen the elements you want more in mixed styles by repeating certain words
374
+ - **Conflict to evolution** — Transform style conflicts into temporal style evolution. For example: "Start with soft strings, middle becomes noisy dynamic metal rock, end turns to hip-hop"—this gives the model clear guidance on how to handle different styles, rather than mixing them into a mess
375
+
376
+ > For more prompting tips, see: [The Complete Guide to Mastering Suno](https://www.notion.so/The-Complete-Guide-to-Mastering-Suno-Advanced-Strategies-for-Professional-Music-Generation-2d6ae744ebdf8024be42f6645f884221)—although it's a Suno tutorial, prompting ideas are universal.
377
+
378
+ ---
379
+
380
+ #### About Lyrics: The Temporal Script
381
+
382
+ If Caption describes the music's "overall portrait"—style, atmosphere, timbre—then **Lyrics is the music's "temporal script"**, controlling how music unfolds over time.
383
+
384
+ Lyrics is not just lyric content. It carries:
385
+ - The lyric text itself
386
+ - **Structure tags** ([Verse], [Chorus], [Bridge]...)
387
+ - **Vocal style hints** ([raspy vocal], [whispered]...)
388
+ - **Instrumental sections** ([guitar solo], [drum break]...)
389
+ - **Energy changes** ([building energy], [explosive drop]...)
390
+
391
+ **Structure Tags are Key**
392
+
393
+ Structure tags (Meta Tags) are the most powerful tool in Lyrics. They tell the model: "What is this section, how should it be performed?"
394
+
395
+ **Common Structure Tags:**
396
+
397
+ | Category | Tag | Description |
398
+ |----------|-----|-------------|
399
+ | **Basic Structure** | `[Intro]` | Opening, establish atmosphere |
400
+ | | `[Verse]` / `[Verse 1]` | Verse, narrative progression |
401
+ | | `[Pre-Chorus]` | Pre-chorus, build energy |
402
+ | | `[Chorus]` | Chorus, emotional climax |
403
+ | | `[Bridge]` | Bridge, transition or elevation |
404
+ | | `[Outro]` | Ending, conclusion |
405
+ | **Dynamic Sections** | `[Build]` | Energy gradually rising |
406
+ | | `[Drop]` | Electronic music energy release |
407
+ | | `[Breakdown]` | Reduced instrumentation, space |
408
+ | **Instrumental Sections** | `[Instrumental]` | Pure instrumental, no vocals |
409
+ | | `[Guitar Solo]` | Guitar solo |
410
+ | | `[Piano Interlude]` | Piano interlude |
411
+ | **Special Tags** | `[Fade Out]` | Fade out ending |
412
+ | | `[Silence]` | Silence |
413
+
414
+ **Combining Tags: Use Moderately**
415
+
416
+ Structure tags can be combined with `-` for finer control:
417
+
418
+ ```
419
+ [Chorus - anthemic]
420
+ This is the chorus lyrics
421
+ Dreams are burning
422
+
423
+ [Bridge - whispered]
424
+ Whisper those words softly
425
+ ```
426
+
427
+ This works better than writing `[Chorus]` alone—you're telling the model both what this section is (Chorus) and how to sing it (anthemic).
428
+
429
+ **⚠️ Note: Don't stack too many tags.**
430
+
431
+ ```
432
+ ❌ Not recommended:
433
+ [Chorus - anthemic - stacked harmonies - high energy - powerful - epic]
434
+
435
+ ✅ Recommended:
436
+ [Chorus - anthemic]
437
+ ```
438
+
439
+ Stacking too many tags has two risks:
440
+ 1. The model might mistake tag content as lyrics to sing
441
+ 2. Too many instructions confuse the model, making effects worse
442
+
443
+ **Principle**: Keep structure tags concise; put complex style descriptions in Caption.
444
+
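The `[Section - modifier]` convention is regular enough to lint before generating. A sketch, with the tag grammar inferred from the examples above:

```python
def parse_tag(line: str) -> tuple[str, list[str]]:
    """Split '[Chorus - anthemic]' into the section name and its modifiers."""
    inner = line.strip().removeprefix("[").removesuffix("]")
    section, *modifiers = [part.strip() for part in inner.split(" - ")]
    return section, modifiers

def too_many_modifiers(line: str, limit: int = 2) -> bool:
    """Flag tag stacks likely to confuse the model or get sung as lyrics."""
    return len(parse_tag(line)[1]) > limit
```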
445
+ **⚠️ Key: Maintain Consistency Between Caption and Lyrics**
446
+
447
+ **Models are not good at resolving conflicts.** If descriptions in Caption and Lyrics contradict, the model gets confused and output quality decreases.
448
+
449
+ ```
450
+ ❌ Conflict example:
451
+ Caption: "violin solo, classical, intimate chamber music"
452
+ Lyrics: [Guitar Solo - electric - distorted]
453
+
454
+ ✅ Consistent example:
455
+ Caption: "violin solo, classical, intimate chamber music"
456
+ Lyrics: [Violin Solo - expressive]
457
+ ```
458
+
459
+ **Checklist:**
460
+ - Instruments in Caption ↔ Instrumental section tags in Lyrics
461
+ - Emotion in Caption ↔ Energy tags in Lyrics
462
+ - Vocal description in Caption ↔ Vocal control tags in Lyrics
463
+
464
+ Think of Caption as "overall setting" and Lyrics as "shot script"—they should tell the same story.
465
+
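You can even check part of the consistency rule mechanically. The heuristic below is deliberately naive keyword matching, purely illustrative of the checklist; it says nothing about how the model itself parses inputs:

```python
import re

def tag_sections(lyrics: str) -> list[str]:
    """Collect every [tag] in the lyrics, lowercased."""
    return [t.lower() for t in re.findall(r"\[([^\]]+)\]", lyrics)]

def solo_conflicts(caption: str, lyrics: str,
                   instruments=("violin", "guitar", "piano")) -> list[str]:
    """Return instruments that appear in a lyrics solo tag but not in the caption."""
    cap = caption.lower()
    return [inst for inst in instruments
            for tag in tag_sections(lyrics)
            if inst in tag and "solo" in tag and inst not in cap]

# The conflicting pair from the example above is caught:
bad = solo_conflicts("violin solo, classical, intimate chamber music",
                     "[Guitar Solo - electric - distorted]")
```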
466
+ **Vocal Control Tags:**
467
+
468
+ | Tag | Effect |
469
+ |-----|--------|
470
+ | `[raspy vocal]` | Raspy, textured vocals |
471
+ | `[whispered]` | Whispered |
472
+ | `[falsetto]` | Falsetto |
473
+ | `[powerful belting]` | Powerful, high-pitched singing |
474
+ | `[spoken word]` | Rap/recitation |
475
+ | `[harmonies]` | Layered harmonies |
476
+ | `[call and response]` | Call and response |
477
+ | `[ad-lib]` | Improvised embellishments |
478
+
479
+ **Energy and Emotion Tags:**
480
+
481
+ | Tag | Effect |
482
+ |-----|--------|
483
+ | `[high energy]` | High energy, passionate |
484
+ | `[low energy]` | Low energy, restrained |
485
+ | `[building energy]` | Increasing energy |
486
+ | `[explosive]` | Explosive energy |
487
+ | `[melancholic]` | Melancholic |
488
+ | `[euphoric]` | Euphoric |
489
+ | `[dreamy]` | Dreamy |
490
+ | `[aggressive]` | Aggressive |
491
+
492
+ **Lyric Text Writing Tips**
493
+
494
+ **1. Control Syllable Count**
495
+
496
+ **6-10 syllables per line** usually works best. The model aligns syllables to beats—if one line has 6 syllables and the next has 14, rhythm becomes strange.
497
+
498
+ ```
499
+ ❌ Bad example:
500
+ I was standing by the window watching all the world outside me change (17 syllables)
501
+ Hello (2 syllables)
502
+
503
+ ✅ Good example:
504
+ I stand by the window (6 syllables)
505
+ Watching the world outside (6 syllables)
506
+ Nothing stays the same now (6 syllables)
507
+ ```
508
+
509
+ **Tip**: Keep syllable counts for lines in the same position (e.g., the first line of each verse) within 1-2 syllables of each other.
510
+
511
+ **2. Use Case to Control Intensity**
512
+
513
+ Uppercase indicates stronger vocal intensity:
514
+
515
+ ```
516
+ [Verse]
517
+ walking through the empty streets (normal intensity)
518
+
519
+ [Chorus]
520
+ WE ARE THE CHAMPIONS! (high intensity, shouting)
521
+ ```
522
+
523
+ **3. Use Parentheses for Background Vocals**
524
+
525
+ ```
526
+ [Chorus]
527
+ We rise together (together)
528
+ Into the light (into the light)
529
+ ```
530
+
531
+ Content in parentheses is processed as background vocals or harmonies.
532
+
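Since the convention is purely textual, separating the lead line from its backing phrases is a one-regex job. A sketch:

```python
import re

def split_backing(line: str) -> tuple[str, list[str]]:
    """Separate the lead vocal line from parenthesised backing-vocal phrases."""
    backing = re.findall(r"\(([^)]+)\)", line)          # text inside parentheses
    lead = re.sub(r"\s*\([^)]*\)", "", line).strip()    # line with parentheses removed
    return lead, backing

lead, backing = split_backing("We rise together (together)")
```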
533
+ **4. Extend Vowels**
534
+
535
+ You can extend sounds by repeating vowels:
536
+
537
+ ```
538
+ Feeeling so aliiive
539
+ ```
540
+
541
+ Use this cautiously: the effect is unstable and may be ignored or mispronounced.
542
+
543
+ **5. Clear Section Separation**
544
+
545
+ Separate each section with blank lines:
546
+
547
+ ```
548
+ [Verse 1]
549
+ First verse lyrics
550
+ Continue first verse
551
+
552
+ [Chorus]
553
+ Chorus lyrics
554
+ Chorus continues
555
+ ```
556
+
557
+ **Avoiding "AI-flavored" Lyrics**
558
+
559
+ These characteristics make lyrics seem mechanical and lack human touch:
560
+
561
+ | Red Flag 🚩 | Description |
562
+ |-------------|-------------|
563
+ | **Adjective stacking** | "neon skies, electric hearts, endless dreams"—filling a section with vague imagery |
564
+ | **Rhyme chaos** | Inconsistent rhyme patterns, or forced rhymes causing semantic breaks |
565
+ | **Blurred section boundaries** | Lyric content crosses structure tags, Verse content "flows" into Chorus |
566
+ | **No breathing room** | Each line too long, can't sing in one breath |
567
+ | **Mixed metaphors** | The first verse uses water imagery, the second suddenly becomes fire, the third is flying; listeners have nothing to anchor to |
568
+
569
+ **Metaphor discipline**: Stick to one core metaphor per song and explore its facets. For example, with "water" as the metaphor you can explore how love flows around obstacles like water, can be gentle rain or a flood, reflects the other person's image, and exists even though it can't be grasped. One image, many facets: this gives lyrics cohesion.
570
+
571
+ **Writing Instrumental Music**
572
+
573
+ If generating pure instrumental music without vocals:
574
+
575
+ ```
576
+ [Instrumental]
577
+ ```
578
+
579
+ Or use structure tags to describe instrumental development:
580
+
581
+ ```
582
+ [Intro - ambient]
583
+
584
+ [Main Theme - piano]
585
+
586
+ [Climax - powerful]
587
+
588
+ [Outro - fade out]
589
+ ```
590
+
591
+ **Complete Example**
592
+
593
+ Assuming Caption is: `female vocal, piano ballad, emotional, intimate atmosphere, strings, building to powerful chorus`
594
+
595
+ ```
596
+ [Intro - piano]
597
+
598
+ [Verse 1]
599
+ 月光洒在窗台上
600
+ 我听见你的呼吸
601
+ 城市在远处沉睡
602
+ 只有我们还醒着
603
+
604
+ [Pre-Chorus]
605
+ 这一刻如此安静
606
+ 却藏着汹涌的心
607
+
608
+ [Chorus - powerful]
609
+ 让我们燃烧吧
610
+ 像夜空中的烟火
611
+ 短暂却绚烂
612
+ 这就是我们的时刻
613
+
614
+ [Verse 2]
615
+ 时间在指尖流过
616
+ 我们抓不住什么
617
+ 但至少此刻拥有
618
+ 彼此眼中的火焰
619
+
620
+ [Bridge - whispered]
621
+ 如果明天一切消散
622
+ 至少我们曾经闪耀
623
+
624
+ [Final Chorus]
625
+ 让我们燃烧吧
626
+ 像夜空中的烟火
627
+ 短暂却绚烂
628
+ THIS IS OUR MOMENT!
629
+
630
+ [Outro - fade out]
631
+ ```
632
+
633
+ Note: In this example, Lyrics tags (piano, powerful, whispered) are consistent with Caption descriptions (piano ballad, building to powerful chorus, intimate), with no conflicts.
634
+
635
+ ---
636
+
637
+ #### About Music Metadata: Optional Fine Control
638
+
639
+ **Most of the time, you don't need to manually set metadata.**
640
+
641
+ When you enable `thinking` mode (or enable `use_cot_metas`), the LM automatically infers an appropriate BPM, key, time signature, and so on from your Caption and Lyrics. This is usually good enough.
642
+
643
+ But if you have clear ideas, you can also manually control them:
644
+
645
+ | Parameter | Control Range | Description |
646
+ |-----------|--------------|-------------|
647
+ | `bpm` | 30–300 | Tempo. Common distribution: slow songs 60–80, mid-tempo 90–120, fast songs 130–180 |
648
+ | `keyscale` | Key | e.g., `C Major`, `Am`, `F# Minor`. Affects overall pitch and emotional color |
649
+ | `timesignature` | Time signature | `4/4` (most common), `3/4` (waltz), `6/8` (swing feel) |
650
+ | `vocal_language` | Language | Vocal language. LM usually auto-detects from lyrics |
651
+ | `duration` | Seconds | Target duration. Actual generation may vary slightly |
652
+
653
+ **Understanding Control Boundaries**
654
+
655
+ These parameters are **guidance** rather than **precise commands**:
656
+
657
+ - **BPM**: Common range (60–180) works well; extreme values (like 30 or 280) have less training data, may be unstable
658
+ - **Key**: Common keys (C, G, D, Am, Em) are stable; rare keys may be ignored or shifted
659
+ - **Time signature**: `4/4` is most reliable; `3/4`, `6/8` usually OK; complex signatures (like `5/4`, `7/8`) are advanced, effects vary by style
660
+ - **Duration**: Short songs (30–60s) and medium length (2–4min) are stable; very long generation may have repetition or structure issues
661
+
662
+ **The Model's "Reference" Approach**
663
+
664
+ The model doesn't mechanically execute `bpm=120`, but rather:
665
+ 1. Uses `120 BPM` as an **anchor point**
666
+ 2. Samples from distribution near this anchor
667
+ 3. Final result might be 118 or 122, not exactly 120
668
+
669
+ It's like telling a musician to play "around 120": they'll naturally stay in that range rather than rigidly following a metronome.
670
+
671
+ **When Do You Need Manual Settings?**
672
+
673
+ | Scenario | Suggestion |
674
+ |----------|------------|
675
+ | Daily generation | Don't worry, let LM auto-infer |
676
+ | Clear tempo requirement | Manually set `bpm` |
677
+ | Specific style (e.g., waltz) | Manually set `timesignature=3/4` |
678
+ | Need to match other material | Manually set `bpm` and `duration` |
679
+ | Pursue specific key color | Manually set `keyscale` |
680
+
681
+ **Tip**: If you manually set metadata but the generated result clearly doesn't match, check for conflicts with the Caption or Lyrics. For example, if the Caption says "slow ballad" but `bpm=160`, the model gets confused.
682
+
683
+ **Recommended Practice**: Don't describe tempo, BPM, key, or other metadata in the Caption. Set these through the dedicated metadata parameters (`bpm`, `keyscale`, `timesignature`, etc.) and let the Caption focus on style, emotion, instruments, timbre, and other musical characteristics.
684
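Following this practice, a request keeps style in the Caption and tempo/key/length in metadata. The field names below come from the parameter table above; the overall request shape is illustrative, not a literal API schema.

```python
# Style and emotion live in the caption; tempo, key, and length are
# passed as dedicated metadata parameters (names from the table above).
request = {
    "caption": "female vocal, piano ballad, emotional, intimate atmosphere",
    "bpm": 72,                 # slow-song range (60-80)
    "keyscale": "Am",
    "timesignature": "4/4",
    "duration": 180,           # seconds; actual output may vary slightly
}

# The caption should not restate the metadata.
assert "bpm" not in request["caption"].lower()
assert 30 <= request["bpm"] <= 300
```

Note how the caption ("piano ballad") and the metadata (`bpm=72`) agree, avoiding the slow-ballad-at-160-BPM conflict described above.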
+
685
+ ---
686
+
687
+ #### About Audio Control: Controlling Sound with Sound
688
+
689
+ **Text is a lossy, low-dimensional abstraction; the most precise control is still audio itself.**
690
+
691
+ There are three ways to control generation with audio, each with different control ranges and uses:
692
+
693
+ ---
694
+
695
+ ##### 1. Reference Audio: Global Acoustic Feature Control
696
+
697
+ Reference audio (`reference_audio`) controls the **acoustic features** of generated music: timbre, mixing style, performance style, and so on. It **averages information across the time dimension** and acts **globally**.
698
+
699
+ **What Does Reference Audio Control?**
700
+
701
+ Reference audio mainly controls the **acoustic features** of generated music, including:
702
+ - **Timbre texture**: Vocal timbre, instrument timbre
703
+ - **Mixing style**: Spatial sense, dynamic range, frequency distribution
704
+ - **Performance style**: Vocal techniques, playing techniques, expression
705
+ - **Overall atmosphere**: The "feeling" conveyed through reference audio
706
+
707
+ **How Does the Backend Process Reference Audio?**
708
+
709
+ When you provide reference audio, the system performs the following processing:
710
+
711
+ 1. **Audio Preprocessing**:
712
+ - Load audio file, normalize to **stereo 48kHz** format
713
+ - Detect silence, ignore if audio is completely silent
714
+ - If audio length is less than 30 seconds, repeat to fill to at least 30 seconds
715
+ - Randomly select 10-second segments from front, middle, and back positions, concatenate into 30-second reference segment
716
+
717
+ 2. **Encoding Conversion**:
718
+ - Use **VAE (Variational Autoencoder)** `tiled_encode` method to encode audio into **latent representation (latents)**
719
+ - These latents contain acoustic feature information but remove specific melody, rhythm, and other structural information
720
+ - The encoded latents are fed as conditions into the DiT generation process, **averaged across the time dimension so they act globally on the entire generation**
721
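The segment-selection step can be sketched in terms of sample indices. This is a simplification: it assumes 48 kHz sample counts and deterministic window positions, while the real pipeline picks positions randomly and also handles channels and silence.

```python
SR = 48_000  # reference audio is normalized to 48 kHz

def reference_windows(num_samples: int) -> list[tuple[int, int]]:
    """Three 10 s windows from the front, middle, and back of the
    (repeated) audio; together they form the 30 s reference segment."""
    target = 30 * SR
    while num_samples < target:      # short audio is repeated to >= 30 s
        num_samples *= 2
    ten = 10 * SR
    mid = (num_samples - ten) // 2
    return [(0, ten), (mid, mid + ten), (num_samples - ten, num_samples)]

windows = reference_windows(12 * SR)             # a 12-second clip
assert sum(end - start for start, end in windows) == 30 * SR
```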
+
722
+ ---
723
+
724
+ ##### 2. Source Audio: Semantic Structure Control
725
+
726
+ Source audio (`src_audio`) is used for **Cover tasks**, performing **melodic structure control**. Its principle is to quantize your input source audio into semantically structured information.
727
+
728
+ **What Does Source Audio Control?**
729
+
730
+ Source audio is converted into **semantically structured information**, including:
731
+ - **Melody**: Note direction and pitch
732
+ - **Rhythm**: Beat, accent, groove
733
+ - **Chords**: Harmonic progression and changes
734
+ - **Orchestration**: Instrument arrangement and layers
735
+ - **Some timbre**: Partial timbre information
736
+
737
+ **What Can You Do With It?**
738
+
739
+ 1. **Control style**: Maintain source audio structure, change style and details
740
+ 2. **Transfer style**: Apply source audio structure to different styles
741
+ 3. **Retake lottery**: Generate similar structure but different variants, get different interpretations through multiple generations
742
+ 4. **Control influence degree**: Control source audio influence strength through `audio_cover_strength` parameter (0.0–1.0)
743
+ - Higher strength: generation results more strictly follow source audio structure
744
+ - Lower strength: generation results have more room for free play
745
+
746
+ **Advanced Cover Usage**
747
+
748
+ You can use Cover to **Remix a song**, and it supports changing Caption and Lyrics:
749
+
750
+ - **Remix creation**: Input a song as source audio, reinterpret it by modifying Caption and Lyrics
751
+ - Change style: Use different Caption descriptions (e.g., change from pop to rock)
752
+ - Change lyrics: Rewrite lyrics with new Lyrics, maintaining original melody structure
753
+ - Change emotion: Adjust overall atmosphere through Caption (e.g., change from sad to joyful)
754
+
755
+ - **Build complex music structures**: Build complex melodic direction, layers, and groove based on your needed structure influence degree
756
+ - Fine-tune structure adherence through `audio_cover_strength`
757
+ - Combine Caption and Lyrics modifications to create new expression while maintaining core structure
758
+ - Can generate multiple versions, each with different emphasis on structure, style, lyrics
759
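A remix-style Cover call might look like the sketch below. The field names (`src_audio`, `audio_cover_strength`) follow this guide; the request shape and file names are placeholders, not a literal API schema.

```python
# Remix sketch: keep the source song's melodic structure but change
# the style and lyrics. All file names here are hypothetical.
cover_request = {
    "src_audio": "original_song.wav",
    "caption": "rock, electric guitar, driving drums, high energy",  # new style
    "lyrics": "[Verse 1]\nnew words over the old melody",
    "audio_cover_strength": 0.8,   # high: follow the source structure closely
}
assert 0.0 <= cover_request["audio_cover_strength"] <= 1.0
```

Lowering `audio_cover_strength` toward 0.3-0.5 would give the model more freedom to depart from the source arrangement.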
+
760
+ ---
761
+
762
+ ##### 3. Source Audio Context-Based Control: Local Completion and Modification
763
+
764
+ This is the **Repaint task**, performing completion or modification based on source audio context.
765
+
766
+ **Repaint Principle**
767
+
768
+ Repaint is based on **context completion** principle:
769
+ - It can complete the **beginning**, a **middle section**, the **ending**, or **any region**
770
+ - Operation range: **3 seconds to 90 seconds**
771
+ - Model references source audio context information, generating within specified interval
772
+
773
+ **What Can You Do With It?**
774
+
775
+ 1. **Local modification**: Modify lyrics, structure, or content in specified interval
776
+ 2. **Change lyrics**: Maintain melody and orchestration, only change lyric content
777
+ 3. **Change structure**: Change music structure in specified interval (e.g., change Verse to Chorus)
778
+ 4. **Continue writing**: Continue writing beginning or ending based on context
779
+ 5. **Clone timbre**: Clone source audio timbre characteristics based on context
780
+
781
+ **Advanced Repaint Usage**
782
+
783
+ You can use Repaint for more complex creative needs:
784
+
785
+ - **Infinite duration generation**:
786
+ - Through repeated Repaint operations you can keep extending the audio, effectively generating unlimited duration
787
+ - Each continuation is based on previous segment's context, maintaining natural transitions and coherence
788
+ - Can generate in segments, each 3–90 seconds, finally concatenate into complete work
789
+
790
+ - **Intelligent audio stitching**:
791
+ - Intelligently organize and stitch two audios together
792
+ - Use Repaint at first audio's end to continue, making transitions naturally connect
793
+ - Or use Repaint to rework the junction between the two audio clips for a smooth transition
794
+ - Model automatically handles rhythm, harmony, timbre connections based on context, making stitched audio sound like a complete work
795
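The extension idea reduces to a simple window planner: each Repaint continues from the current end by 3-90 seconds. The `plan_extensions` helper below is illustrative, not an ACE-Step API.

```python
def plan_extensions(current_len: float, target_len: float,
                    chunk: float = 60.0) -> list[tuple[float, float]]:
    """Plan (start, end) repaint windows, in seconds, until the track
    reaches target_len. Each window stays inside the 3-90 s range."""
    windows = []
    while current_len < target_len:
        step = min(chunk, target_len - current_len, 90.0)
        step = max(step, 3.0)              # repaint minimum is 3 s
        windows.append((current_len, current_len + step))
        current_len += step
    return windows

# Extend a 2-minute track to 5 minutes in 60 s continuations.
print(plan_extensions(120.0, 300.0))
```

Each planned window would be one Repaint call anchored at the current end of the audio, so every continuation sees the previous segment as context.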
+
796
+ ---
797
+
798
+ ##### 4. Base Model Advanced Audio Control Tasks
799
+
800
+ In the **Base model**, we also support more advanced audio control tasks:
801
+
802
+ **Lego Task**: Intelligently add new tracks based on existing tracks
803
+ - Input an existing audio track (e.g., vocals)
804
+ - Model intelligently adds new tracks (e.g., drums, guitar, bass, etc.)
805
+ - New tracks coordinate with original tracks in rhythm and harmony
806
+
807
+ **Complete Task**: Add mixed tracks to single track
808
+ - Input a single-track audio (e.g., a cappella vocals)
809
+ - Model generates complete mixed accompaniment tracks
810
+ - Generated accompaniment matches vocals in style, rhythm, and harmony
811
+
812
+ **These advanced context-completion tasks** greatly expand the available control methods and are an intelligent source of inspiration and creativity.
813
+
814
+ ---
815
+
816
+ The combination of these parameters determines what you "want." We'll explain input control **principles** and **techniques** in detail later.
817
+
818
+ ### II. Inference Hyperparameters: How Does the Model Generate?
819
+
820
+ This is the part that affects generation-process behavior: it doesn't change what you want, but it changes how the model gets there.
821
+
822
+ **DiT (Diffusion Model) Hyperparameters:**
823
+
824
+ | Parameter | Function | Default | Tuning Advice |
825
+ |-----------|----------|---------|---------------|
826
+ | `inference_steps` | Diffusion steps | 8 (turbo) | More steps = finer but slower. Turbo uses 8, Base uses 32–100 |
827
+ | `guidance_scale` | CFG strength | 7.0 | Higher = more prompt adherence, but may overfit. Base model only |
828
+ | `use_adg` | Adaptive Dual Guidance | False | Dynamically adjusts CFG when enabled; Base model only |
829
+ | `cfg_interval_start/end` | CFG effective interval | 0.0–1.0 | Controls which stage to apply CFG |
830
+ | `shift` | Timestep offset | 1.0 | Adjusts denoising trajectory, affects generation style |
831
+ | `infer_method` | Inference method | "ode" | `ode` deterministic, `sde` introduces randomness |
832
+ | `timesteps` | Custom timesteps | None | Advanced usage, overrides steps and shift |
833
+ | `audio_cover_strength` | Reference audio/codes influence strength | 1.0 | 0.0–1.0, higher = closer to reference, lower = more freedom |
834
+
835
+ **5Hz LM (Language Model) Hyperparameters:**
836
+
837
+ | Parameter | Function | Default | Tuning Advice |
838
+ |-----------|----------|---------|---------------|
839
+ | `thinking` | Enable CoT reasoning | True | Enable to let LM reason metadata and codes |
840
+ | `lm_temperature` | Sampling temperature | 0.85 | Higher = more random/creative, lower = more conservative/deterministic |
841
+ | `lm_cfg_scale` | LM CFG strength | 2.0 | Higher = more positive prompt adherence |
842
+ | `lm_top_k` | Top-K sampling | 0 | 0 means disabled, limits candidate word count |
843
+ | `lm_top_p` | Top-P sampling | 0.9 | Nucleus sampling, limits cumulative probability |
844
+ | `lm_negative_prompt` | Negative prompt | "NO USER INPUT" | Tells LM what not to generate |
845
+ | `use_cot_metas` | CoT reason metadata | True | Let LM auto-infer BPM, key, etc. |
846
+ | `use_cot_caption` | CoT rewrite caption | True | Let LM optimize your description |
847
+ | `use_cot_language` | CoT detect language | True | Let LM auto-detect vocal language |
848
+ | `use_constrained_decoding` | Constrained decoding | True | Ensures correct output format |
849
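Put together, a typical turbo-model configuration simply keeps the defaults from both tables. The dict below is a sketch of those values, not a literal request schema.

```python
# Defaults from the DiT and 5Hz LM tables above (turbo model).
infer_params = {
    # DiT (diffusion) side
    "inference_steps": 8,        # turbo default; Base uses 32-100
    "shift": 1.0,
    "infer_method": "ode",       # deterministic; "sde" adds randomness
    # 5Hz LM side
    "thinking": True,            # let the LM reason metadata and codes
    "lm_temperature": 0.85,
    "lm_cfg_scale": 2.0,
    "lm_top_p": 0.9,
    "use_cot_metas": True,
}
assert infer_params["inference_steps"] == 8
```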
+
850
+ The combination of these parameters determines how the model "does it."
851
+
852
+ **About Parameter Tuning**
853
+
854
+ It's important to emphasize that **tuning factors and random factors sometimes have comparable influence**. When you adjust a parameter, it may be hard to tell if it's the parameter's effect or randomness causing the change.
855
+
856
+ Therefore, **we recommend fixing the random factors when tuning**: set a fixed `seed` so each generation starts from the same initial noise, letting you accurately feel a parameter's real impact on the generated audio. Otherwise a parameter's effect may be masked by randomness, causing you to misjudge its role.
857
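A fixed-seed sweep makes this concrete: hold `seed` constant and vary exactly one parameter, so any audible difference is attributable to that parameter. The request dicts below are illustrative.

```python
base = {"caption": "lofi hip hop, mellow, vinyl crackle", "seed": 42}

# Sweep only lm_temperature; everything else (including the seed) is fixed.
sweep = [dict(base, lm_temperature=t) for t in (0.6, 0.85, 1.1)]

assert all(run["seed"] == 42 for run in sweep)   # same initial noise each time
assert sorted(run["lm_temperature"] for run in sweep) == [0.6, 0.85, 1.1]
```

Once you've picked a value you like, release the seed again to explore variants around it.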
+
858
+ ### III. Random Factors: Sources of Uncertainty
859
+
860
+ Even with identical inputs and hyperparameters, two generations may produce different results. This is because:
861
+
862
+ **1. DiT's Initial Noise**
863
+ - Diffusion models start from random noise and gradually denoise
864
+ - `seed` parameter controls this initial noise
865
+ - Different seed → different starting point → different endpoint
866
+
867
+ **2. LM's Sampling Randomness**
868
+ - When `lm_temperature > 0`, the sampling process itself has randomness
869
+ - Same prompt, each sampling may choose different tokens
870
+
871
+ **3. Additional Noise When `infer_method = "sde"`**
872
+ - SDE method injects additional randomness during denoising
873
+
874
+ ---
875
+
876
+ #### Pros and Cons of Random Factors
877
+
878
+ Randomness is a double-edged sword.
879
+
880
+ **Benefits of Randomness:**
881
+ - **Explore creative space**: Same input can produce different variants, giving you more choices
882
+ - **Discover unexpected surprises**: Sometimes randomness brings excellent results you didn't expect
883
+ - **Avoid repetition**: Each generation is different, won't fall into single-pattern loops
884
+
885
+ **Challenges of Randomness:**
886
+ - **Uncontrollable results**: You can't precisely predict the output and may need several generations before you're satisfied
887
+ - **Hard to reproduce**: Even with identical inputs, hard to reproduce a specific good result
888
+ - **Tuning difficulty**: When adjusting a parameter, it's hard to tell whether a change comes from the parameter or from randomness
889
+ - **Screening cost**: Need to generate multiple versions to find satisfactory ones, increasing time cost
890
+
891
+ #### What Mindset to Face Random Factors?
892
+
893
+ **1. Accept Uncertainty**
894
+ - Randomness is an essential characteristic of AI music generation, not a bug, but a feature
895
+ - Don't expect every generation to be perfect; treat randomness as an exploration tool
896
+
897
+ **2. Embrace the Exploration Process**
898
+ - Treat the generation process as a gacha pull or treasure hunt: try multiple times and you'll always find surprises
899
+ - Enjoy discovering unexpectedly good results, rather than obsessing over one-time success
900
+
901
+ **3. Use Fixed Seed Wisely**
902
+ - When you want to **understand parameter effects**, fix `seed` to eliminate randomness interference
903
+ - When you want to **explore creative space**, let `seed` vary randomly
904
+
905
+ **4. Batch Generation + Intelligent Screening**
906
+ - Don't rely on single generation; batch generate multiple versions
907
+ - Use automatic scoring mechanisms for initial screening to improve efficiency
908
+
909
+ #### Our Solution: Large Batch + Automatic Scoring
910
+
911
+ Because our inference is extremely fast, if your GPU has sufficient VRAM you can explore the random space with a **large batch**:
912
+
913
+ - **Batch generation**: Generate multiple versions at once (e.g., batch_size=2,4,8), quickly explore random space
914
+ - **Automatic scoring mechanism**: We provide automatic scoring mechanisms that can help you initially screen, doing **test time scaling**
915
+
916
+ **Automatic Scoring Mechanism**
917
+
918
+ We provide multiple scoring metrics, among which **my favorite is DiT Lyrics Alignment Score**:
919
+
920
+ - **DiT Lyrics Alignment Score**: This score implicitly tracks lyric accuracy
921
+ - It evaluates the alignment degree between lyrics and audio in generated audio
922
+ - Higher score means lyrics are more accurately positioned in audio, better match between singing and lyrics
923
+ - This is particularly important for music generation with lyrics, can help you screen versions with higher lyric accuracy
924
+
925
+ - **Other scoring metrics**: Also include other quality assessment metrics, can evaluate generation results from multiple dimensions
926
+
927
+ **Recommended Workflow:**
928
+
929
+ 1. **Batch generation**: Set larger `batch_size` (e.g., 2, 4, 8), generate multiple versions at once
930
+ 2. **Enable AutoGen**: Enable automatic generation, let system continuously generate new batches in background
931
+ - **AutoGen mechanism**: AutoGen automatically uses same parameters (but random seed) to generate next batch in background while you're viewing current batch results
932
+ - This lets you continuously explore random space without manually clicking generate button
933
+ - Each new batch uses new random seed, ensuring result diversity
934
+ 3. **Automatic scoring**: Enable automatic scoring, let system automatically score each version
935
+ 4. **Initial screening**: Screen versions with higher scores based on DiT Lyrics Alignment Score and other metrics
936
+ 5. **Manual selection**: Manually select the final version that best meets your needs from screened versions
937
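Step 4's screening reduces to a sort-and-slice over per-candidate scores. The file names and score values below are made-up stand-ins for whatever the automatic scorer returns.

```python
# One generated batch with hypothetical DiT Lyrics Alignment scores.
batch = [
    {"path": "take_1.flac", "lyric_align": 0.71},
    {"path": "take_2.flac", "lyric_align": 0.88},
    {"path": "take_3.flac", "lyric_align": 0.64},
    {"path": "take_4.flac", "lyric_align": 0.83},
]

# Keep the top 2 for manual listening.
shortlist = sorted(batch, key=lambda c: c["lyric_align"], reverse=True)[:2]
print([c["path"] for c in shortlist])
```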
+
938
+ This fully utilizes randomness to explore creative space while improving efficiency through automation tools, avoiding blind searching in large generation results. AutoGen lets you "generate while listening"—while browsing current results, the next batch is already prepared in the background.
939
+
940
+ ---
941
+
942
+ ## Conclusion
943
+
944
+ This tutorial currently covers ACE-Step 1.5's core concepts and usage methods:
945
+
946
+ - **Mental Models**: Understanding human-centered generation design philosophy
947
+ - **Model Architecture**: Understanding how LM and DiT work together
948
+ - **Input Control**: Mastering text (Caption, Lyrics, metadata) and audio (reference audio, source audio) control methods
949
+ - **Inference Hyperparameters**: Understanding parameters affecting generation process
950
+ - **Random Factors**: Learning to use randomness to explore creative space, improving efficiency through Large Batch + AutoGen + Automatic Scoring
951
+
952
+ This is just the beginning. There's much more content we want to share with you:
953
+
954
+ - More Prompting tips and practical cases
955
+ - Detailed usage guides for different task types
956
+ - Advanced techniques and creative workflows
957
+ - Common issues and solutions
958
+ - Performance optimization suggestions
959
+
960
+ **This tutorial will continue to be updated and improved.** If you have any questions or suggestions during use, feedback is welcome. Let's make ACE-Step your creative partner in your pocket together.
961
+
962
+ ---
963
+
964
+ *To be continued...*
.claude/skills/acestep-docs/guides/ENVIRONMENT_SETUP.md ADDED
@@ -0,0 +1,542 @@
1
+ # Environment Setup Guide
2
+
3
+ This guide covers Python environment setup for ACE-Step on Windows, Linux, and macOS.
4
+
5
+ ## Environment Options
6
+
7
+ ### Windows
8
+
9
+ **Option 1: python_embeded (Portable Package)**
10
+ - **Best for**: New users, zero-configuration setup
11
+ - **Pros**: Extract and run, no installation required
12
+ - **Cons**: Large download size (~7GB)
13
+ - **Location**: `python_embeded\python.exe`
14
+ - **Download**: https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
15
+
16
+ **Option 2: uv (Package Manager)**
17
+ - **Best for**: Developers, Git repository users
18
+ - **Pros**: Smaller installation, easy updates, excellent tooling
19
+ - **Cons**: Requires uv installation
20
+ - **Installation**: See [Installing uv](#installing-uv) below
21
+
22
+ ### Linux
23
+
24
+ **uv (Package Manager)**
25
+ - **Only supported option** (no portable package available for Linux)
26
+ - **Best for**: All Linux users
27
+ - **Requires**: uv package manager
28
+ - **Backend**: vllm (default) or pt (PyTorch)
29
+ - **Installation**: See [Installing uv](#installing-uv) below
30
+
31
+ ### macOS (Apple Silicon)
32
+
33
+ **uv with MLX Backend**
34
+ - **Only supported option** (no portable package available for macOS)
35
+ - **Best for**: All macOS Apple Silicon (M1/M2/M3/M4) users
36
+ - **Requires**: uv package manager
37
+ - **Backend**: mlx (native Apple Silicon acceleration)
38
+ - **Dedicated scripts**: `start_gradio_ui_macos.sh`, `start_api_server_macos.sh`
39
+ - **Installation**: See [Installing uv](#installing-uv) below
40
+
41
+ Note: Intel Macs can use the standard `start_gradio_ui.sh` with the PyTorch (pt) backend, but Apple Silicon Macs should use the macOS-specific scripts for optimal performance.
42
+
43
+ ## Automatic Detection
44
+
45
+ ### Windows (bat scripts)
46
+
47
+ The `.bat` startup scripts detect the environment in this order:
48
+
49
+ 1. **First**: Check for `python_embeded\python.exe`
50
+ - If found: Use embedded Python directly
51
+ - If not found: Continue to step 2
52
+
53
+ 2. **Second**: Check for `uv` command
54
+ - If found: Use uv
55
+ - If not found: Prompt to install uv
56
+
57
+ **Example output:**
58
+ ```
59
+ [Environment] Using embedded Python...
60
+ ```
61
+ or
62
+ ```
63
+ [Environment] Embedded Python not found, checking for uv...
64
+ [Environment] Using uv package manager...
65
+ ```
66
+
67
+ ### Linux/macOS (sh scripts)
68
+
69
+ The `.sh` startup scripts detect the environment in this order:
70
+
71
+ 1. **First**: Check for `uv` in PATH
72
+ - Also checks `~/.local/bin/uv` and `~/.cargo/bin/uv`
73
+ - If found: Use uv
74
+ - If not found: Prompt to install uv
75
+
76
+ 2. **If not found**: Offer automatic installation
77
+ - Calls `install_uv.sh --silent` to install uv
78
+ - Updates PATH and continues
79
+
80
+ **Example output (Linux):**
81
+ ```
82
+ [Environment] Using uv package manager...
83
+ ```
84
+
85
+ **Example output (macOS):**
86
+ ```
87
+ ============================================
88
+ ACE-Step 1.5 - macOS Apple Silicon (MLX)
89
+ ============================================
90
+ [Environment] Using uv package manager...
91
+ ```
92
+
93
+ ## Installing uv
94
+
95
+ ### All Platforms
96
+
97
+ **Automatic**: When you run a startup script and uv is not found, you will be prompted:
98
+
99
+ ```
100
+ uv package manager not found!
101
+
102
+ Install uv now? (Y/N):
103
+ ```
104
+
105
+ Type `Y` and press Enter. The script will automatically install uv using the appropriate method for your platform.
106
+
107
+ ### Windows Methods
108
+
109
+ **Method 1: PowerShell (Recommended)**
110
+ ```powershell
111
+ irm https://astral.sh/uv/install.ps1 | iex
112
+ ```
113
+
114
+ **Method 2: winget (Windows 10 1809+, Windows 11)**
115
+ ```batch
116
+ winget install --id=astral-sh.uv -e
117
+ ```
118
+
119
+ **Method 3: Run the install script**
120
+ ```batch
121
+ install_uv.bat
122
+ ```
123
+
124
+ The `install_uv.bat` script tries PowerShell first, then falls back to winget if PowerShell fails.
125
+
126
+ ### Linux Methods
127
+
128
+ **Method 1: curl installer (Recommended)**
129
+ ```bash
130
+ curl -LsSf https://astral.sh/uv/install.sh | sh
131
+ ```
132
+
133
+ **Method 2: Run the install script**
134
+ ```bash
135
+ chmod +x install_uv.sh
136
+ ./install_uv.sh
137
+ ```
138
+
139
+ The `install_uv.sh` script uses `curl` or `wget` to download and run the official installer.
140
+
141
+ ### macOS Methods
142
+
143
+ **Method 1: curl installer (Recommended)**
144
+ ```bash
145
+ curl -LsSf https://astral.sh/uv/install.sh | sh
146
+ ```
147
+
148
+ **Method 2: Homebrew**
149
+ ```bash
150
+ brew install uv
151
+ ```
152
+
153
+ **Method 3: Run the install script**
154
+ ```bash
155
+ chmod +x install_uv.sh
156
+ ./install_uv.sh
157
+ ```
158
+
159
+ The `install_uv.sh` script works on both Linux and macOS, and will suggest `brew install curl` on macOS if neither `curl` nor `wget` is available.
160
+
161
+ ## Installation Locations
162
+
163
+ ### Windows
164
+
165
+ **PowerShell installation:**
166
+ ```
167
+ %USERPROFILE%\.local\bin\uv.exe
168
+ Example: C:\Users\YourName\.local\bin\uv.exe
169
+ ```
170
+
171
+ **winget installation:**
172
+ ```
173
+ %LOCALAPPDATA%\Microsoft\WinGet\Links\uv.exe
174
+ Example: C:\Users\YourName\AppData\Local\Microsoft\WinGet\Links\uv.exe
175
+ ```
176
+
177
+ ### Linux
178
+
179
+ **Default installation (curl installer):**
180
+ ```
181
+ ~/.local/bin/uv
182
+ Example: /home/yourname/.local/bin/uv
183
+ ```
184
+
185
+ **Alternative location (cargo):**
186
+ ```
187
+ ~/.cargo/bin/uv
188
+ Example: /home/yourname/.cargo/bin/uv
189
+ ```
190
+
191
+ ### macOS
192
+
193
+ **Default installation (curl installer):**
194
+ ```
195
+ ~/.local/bin/uv
196
+ Example: /Users/yourname/.local/bin/uv
197
+ ```
198
+
199
+ **Alternative location (cargo):**
200
+ ```
201
+ ~/.cargo/bin/uv
202
+ Example: /Users/yourname/.cargo/bin/uv
203
+ ```
204
+
205
+ **Homebrew installation:**
206
+ ```
207
+ /opt/homebrew/bin/uv (Apple Silicon)
208
+ /usr/local/bin/uv (Intel)
209
+ ```
210
+
211
+ ## First Run
212
+
213
+ ### Windows with python_embeded
214
+
215
+ ```batch
216
+ REM Download and extract portable package from:
217
+ REM https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
218
+
219
+ REM Run the startup script
220
+ start_gradio_ui.bat
221
+
222
+ REM Output:
223
+ REM [Environment] Using embedded Python...
224
+ REM Starting ACE-Step Gradio UI...
225
+ ```
226
+
227
+ ### Windows with uv
228
+
229
+ ```batch
230
+ REM First time: uv will create a virtual environment and sync dependencies
231
+ start_gradio_ui.bat
232
+
233
+ REM Output:
234
+ REM [Environment] Using uv package manager...
235
+ REM [Setup] Virtual environment not found. Setting up environment...
236
+ REM Running: uv sync
237
+ ```
238
+
239
+ ### Linux with uv
240
+
241
+ ```bash
242
+ # Make scripts executable (first time only)
243
+ chmod +x start_gradio_ui.sh install_uv.sh
244
+
245
+ # First time: uv will create a virtual environment and sync dependencies
246
+ ./start_gradio_ui.sh
247
+
248
+ # Output:
249
+ # [Environment] Using uv package manager...
250
+ # [Setup] Virtual environment not found. Setting up environment...
251
+ # Running: uv sync
252
+ ```
253
+
254
+ ### macOS (Apple Silicon) with uv
255
+
256
+ ```bash
257
+ # Make scripts executable (first time only)
258
+ chmod +x start_gradio_ui_macos.sh install_uv.sh
259
+
260
+ # Use the macOS-specific script for MLX backend
261
+ ./start_gradio_ui_macos.sh
262
+
263
+ # Output:
264
+ # ============================================
265
+ # ACE-Step 1.5 - macOS Apple Silicon (MLX)
266
+ # ============================================
267
+ # [Environment] Using uv package manager...
268
+ # [Setup] Virtual environment not found. Setting up environment...
269
+ # Running: uv sync
270
+ ```
271
+
272
+ Note: On macOS Apple Silicon, always use `start_gradio_ui_macos.sh` instead of `start_gradio_ui.sh` to enable the MLX backend for native acceleration.
273
+
274
+ ## Troubleshooting
275
+
276
+ ### "uv not found" after installation
277
+
278
+ **Windows**
279
+
280
+ Cause: PATH not refreshed after installation.
281
+
282
+ Solution 1: Restart your terminal (close and reopen Command Prompt or PowerShell).
283
+
284
+ Solution 2: Use the full path temporarily:
285
+ ```batch
286
+ %USERPROFILE%\.local\bin\uv.exe run acestep
287
+ ```
288
+
289
+ **Linux/macOS**
290
+
291
+ Cause: uv installed but not in PATH.
292
+
293
+ Solution 1: Restart your terminal or source your profile:
294
+ ```bash
295
+ source ~/.bashrc # or ~/.zshrc on macOS
296
+ ```
297
+
298
+ Solution 2: Add uv to your PATH manually:
299
+ ```bash
300
+ # For ~/.local/bin installation
301
+ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
302
+ source ~/.bashrc
303
+
304
+ # For macOS with zsh (default shell)
305
+ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
306
+ source ~/.zshrc
307
+ ```
308
+
309
+ Solution 3: Use the full path temporarily:
310
+ ```bash
311
+ ~/.local/bin/uv run acestep
312
+ ```
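After adjusting PATH, you can confirm the `uv` binary is actually visible. The Python sketch below mirrors what `command -v uv` does in a shell; `uv_on_path` is a hypothetical helper for illustration, not part of ACE-Step:

```python
import os
import shutil
import stat
import tempfile

def uv_on_path(path_value):
    """Return the resolved `uv` executable for a given PATH string, or None.
    Equivalent to `command -v uv` in a shell."""
    return shutil.which("uv", path=path_value)

# Example: simulate ~/.local/bin being (or not being) on PATH.
with tempfile.TemporaryDirectory() as fake_local_bin:
    exe = os.path.join(fake_local_bin, "uv")
    with open(exe, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(exe, os.stat(exe).st_mode | stat.S_IXUSR)  # must be executable
    uv_on_path(fake_local_bin)    # resolves to the fake binary
    uv_on_path("/nonexistent")    # None: not on this PATH
```

If the check returns `None`, revisit the PATH steps above.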
313
+
314
+ ### Permission issues (Linux/macOS)
315
+
316
+ **Symptom**: `Permission denied` when running scripts.
317
+
318
+ **Solution**:
319
+ ```bash
320
+ chmod +x start_gradio_ui.sh
321
+ chmod +x start_gradio_ui_macos.sh
322
+ chmod +x install_uv.sh
323
+ ```
324
+
325
+ **Symptom**: `Permission denied` during uv installation.
326
+
327
+ **Solution**: The curl installer installs to `~/.local/bin`, which should not require root. If you see permission errors:
328
+ ```bash
329
+ # Ensure the directory exists and is writable
330
+ mkdir -p ~/.local/bin
331
+ ```
332
+
333
+ Do not use `sudo` with the uv installer.
334
+
335
+ ### winget not available (Windows)
336
+
337
+ **Symptom**:
338
+ ```
339
+ 'winget' is not recognized as an internal or external command
340
+ ```
341
+
342
+ **Solution**:
343
+ - Windows 11: Should be pre-installed. Try updating Windows.
344
+ - Windows 10: Install "App Installer" from the Microsoft Store.
345
+ - Alternative: Use the PowerShell installation method instead:
346
+ ```powershell
347
+ irm https://astral.sh/uv/install.ps1 | iex
348
+ ```
349
+
350
+ ### Installation fails
351
+
352
+ **Common causes**:
353
+ - Network connection issues
354
+ - Firewall blocking downloads
355
+ - Antivirus software interference (Windows)
356
+ - Missing `curl` or `wget` (Linux/macOS)
357
+
358
+ **Solutions**:
359
+
360
+ 1. Check your internet connection.
361
+ 2. Temporarily disable firewall/antivirus (Windows).
362
+ 3. Try an alternative installation method:
363
+ - **Windows**: Use PowerShell method if winget fails, or vice versa.
364
+ - **Linux**: Install `curl` first (`sudo apt install curl` on Ubuntu/Debian, `sudo yum install curl` on CentOS/RHEL).
365
+ - **macOS**: Use `brew install uv` as an alternative.
366
+ 4. **Windows only**: Use the portable package instead: https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
367
+
368
+ ## Switching Environments (Windows Only)
369
+
370
+ Windows is the only platform with two environment options. Linux and macOS use uv exclusively.
371
+
372
+ ### From python_embeded to uv
373
+
374
+ ```batch
375
+ REM 1. Install uv
376
+ install_uv.bat
377
+
378
+ REM 2. Rename or delete python_embeded folder
379
+ rename python_embeded python_embeded_backup
380
+
381
+ REM 3. Run startup script (will use uv)
382
+ start_gradio_ui.bat
383
+ ```
384
+
385
+ ### From uv to python_embeded
386
+
387
+ ```batch
388
+ REM 1. Download portable package
389
+ REM https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
390
+
391
+ REM 2. Extract python_embeded folder to project root
392
+
393
+ REM 3. Run startup script (will use python_embeded)
394
+ start_gradio_ui.bat
395
+ ```
396
+
397
+ ## Environment Variables (.env)
398
+
399
+ ACE-Step can be configured using environment variables in a `.env` file.
400
+
401
+ ### Setup
402
+
403
+ ```bash
404
+ # Copy the example file
405
+ cp .env.example .env
406
+
407
+ # Edit .env with your preferred settings
408
+ ```
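For illustration, the `.env` format used here can be read with any standard loader. The minimal line-based parser below is a sketch of the format only (ACE-Step's actual loader may accept more syntax, e.g. quoting or `export` prefixes):

```python
def load_dotenv_line_based(text):
    """Parse KEY=VALUE lines from .env-style text, skipping comments and blanks."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

example = """
# macOS example
ACESTEP_LM_BACKEND=mlx
ACESTEP_DEVICE=auto
"""
config = load_dotenv_line_based(example)
```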
409
+
410
+ ### Available Variables
411
+
412
+ | Variable | Default | Description |
413
+ |----------|---------|-------------|
414
+ | `ACESTEP_INIT_LLM` | auto | LLM initialization control |
415
+ | `ACESTEP_CONFIG_PATH` | acestep-v15-turbo | DiT model path |
416
+ | `ACESTEP_LM_MODEL_PATH` | acestep-5Hz-lm-1.7B | LM model path |
417
+ | `ACESTEP_DEVICE` | auto | Device: auto, cuda, cpu, xpu |
418
+ | `ACESTEP_LM_BACKEND` | vllm | LM backend: vllm, pt, mlx |
419
+ | `ACESTEP_DOWNLOAD_SOURCE` | auto | Download source |
420
+ | `ACESTEP_API_KEY` | (none) | API authentication key |
421
+
422
+ ### ACESTEP_LM_BACKEND
423
+
424
+ Controls which backend is used for the Language Model.
425
+
426
+ | Value | Platform | Description |
427
+ |-------|----------|-------------|
428
+ | `vllm` | Linux (CUDA) | Default. Fastest backend for NVIDIA GPUs. |
429
+ | `pt` | All | PyTorch native backend. Works everywhere but slower. |
430
+ | `mlx` | macOS (Apple Silicon) | Native Apple Silicon acceleration via MLX. |
431
+
432
+ **Platform-specific recommendations:**
433
+ - **Windows**: Use `vllm` (default) with NVIDIA GPU, or `pt` as fallback.
434
+ - **Linux**: Use `vllm` (default) with NVIDIA GPU, or `pt` as fallback.
435
+ - **macOS Apple Silicon**: Use `mlx` for best performance. The `start_gradio_ui_macos.sh` script sets this automatically via `export ACESTEP_LM_BACKEND="mlx"`.
436
+
437
+ **Example .env for macOS Apple Silicon:**
438
+ ```bash
439
+ ACESTEP_LM_BACKEND=mlx
440
+ ACESTEP_CONFIG_PATH=acestep-v15-turbo
441
+ ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-0.6B
442
+ ```
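The platform recommendations above can be summarized as a small decision function. `pick_lm_backend` is a hypothetical helper for illustration; the shipped startup scripts set the backend via environment variables instead:

```python
def pick_lm_backend(platform, has_cuda, apple_silicon):
    """Illustrative backend choice following the recommendations above:
    mlx on Apple Silicon, vllm with CUDA, pt as the universal fallback."""
    if platform == "darwin":
        return "mlx" if apple_silicon else "pt"  # Intel Macs fall back to pt
    if has_cuda:
        return "vllm"
    return "pt"
```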
443
+
444
+ ### ACESTEP_INIT_LLM - LLM Initialization Control
445
+
446
+ Controls whether the Language Model (5Hz LM) is initialized at startup.
447
+
448
+ **Processing Flow:**
449
+ ```
450
+ GPU Detection (full) --> ACESTEP_INIT_LLM Override --> Model Loading
451
+ ```
452
+
453
+ - GPU optimizations (offload, quantization, batch limits) are **always applied**
454
+ - `ACESTEP_INIT_LLM` only overrides the "should we load LLM" decision
455
+ - When forcing, model validation shows warnings but does not block loading
456
+
457
+ | Value | Behavior |
458
+ |-------|----------|
459
+ | `auto` (or empty) | Use GPU auto-detection result (recommended) |
460
+ | `true` / `1` / `yes` | Force enable LLM after GPU detection (may cause OOM) |
461
+ | `false` / `0` / `no` | Force disable for pure DiT mode |
462
+
463
+ **Example configurations:**
464
+
465
+ ```bash
466
+ # Auto mode (recommended) - let GPU detection decide
467
+ ACESTEP_INIT_LLM=auto
468
+
469
+ # Auto mode - leave empty (same as above)
470
+ ACESTEP_INIT_LLM=
471
+
472
+ # Force enable on low VRAM GPU (GPU optimizations still applied)
473
+ ACESTEP_INIT_LLM=true
474
+ ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-0.6B # Use smallest model
475
+
476
+ # Force disable LLM for faster generation
477
+ ACESTEP_INIT_LLM=false
478
+ ```
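The value parsing described in the table can be sketched as follows (illustrative only; the actual implementation may accept additional spellings):

```python
def parse_init_llm(raw):
    """Map an ACESTEP_INIT_LLM value onto the auto / force-enable /
    force-disable decision described above."""
    value = (raw or "").strip().lower()
    if value in ("true", "1", "yes"):
        return True       # force enable (may cause OOM on low VRAM)
    if value in ("false", "0", "no"):
        return False      # force disable: pure DiT mode
    return None           # auto (or empty): defer to GPU detection
```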
479
+
480
+ ### Features Affected by LLM
481
+
482
+ When LLM is disabled (`ACESTEP_INIT_LLM=false`), these features are unavailable:
483
+
484
+ | Feature | Description | Available without LLM |
485
+ |---------|-------------|----------------------|
486
+ | Thinking mode | LLM generates audio codes | No |
487
+ | CoT caption | LLM enhances captions | No (auto-disabled) |
488
+ | CoT language | LLM detects vocal language | No (auto-disabled) |
489
+ | Sample mode | Generate from description | No |
490
+ | Format mode | LLM-enhanced input | No |
491
+ | Basic generation | DiT-based synthesis | Yes |
492
+ | Cover/Repaint | Audio editing tasks | Yes |
493
+
494
+ Note: When using the API server, CoT features (`use_cot_caption`, `use_cot_language`) are automatically disabled when LLM is unavailable, allowing basic generation to proceed.
495
+
496
+ ## Environment Comparison
497
+
498
+ | Feature | python_embeded (Windows) | uv (Windows) | uv (Linux) | uv (macOS) |
499
+ |---------|--------------------------|---------------|-------------|-------------|
500
+ | Setup Difficulty | Zero config | Need install | Need install | Need install |
501
+ | Startup Speed | Fast | Fast | Fast | Fast |
502
+ | Update Ease | Re-download | uv command | uv command | uv command |
503
+ | Environment Isolation | Complete | Virtual env | Virtual env | Virtual env |
504
+ | Development | Basic | Excellent | Excellent | Excellent |
505
+ | Beginner Friendly | Best | Good | Good | Good |
506
+ | GPU Backend | CUDA | CUDA | CUDA (vllm) | MLX (Apple Silicon) |
507
+ | Install Script | N/A | install_uv.bat | install_uv.sh | install_uv.sh |
508
+ | Launch Script | start_gradio_ui.bat | start_gradio_ui.bat | start_gradio_ui.sh | start_gradio_ui_macos.sh |
509
+
510
+ ## Recommendations
511
+
512
+ ### Windows
513
+
514
+ **Use python_embeded if:**
515
+ - First time using ACE-Step
516
+ - Want zero configuration
517
+ - Do not need frequent updates
518
+ - Prefer a self-contained package
519
+
520
+ **Use uv if:**
521
+ - Developer or experienced with Python
522
+ - Need to modify dependencies
523
+ - Using the Git repository
524
+ - Want smaller installation size
525
+ - Need frequent code updates
526
+
527
+ ### Linux
528
+
529
+ **Use uv (only option):**
530
+ - Install uv via the curl installer or `install_uv.sh`
531
+ - Use `start_gradio_ui.sh` to launch
532
+ - NVIDIA GPU with CUDA is recommended for vllm backend
533
+ - CPU-only is possible with `ACESTEP_DEVICE=cpu` and `ACESTEP_LM_BACKEND=pt`
534
+
535
+ ### macOS (Apple Silicon)
536
+
537
+ **Use uv with MLX backend (recommended):**
538
+ - Install uv via curl installer, Homebrew, or `install_uv.sh`
539
+ - Use `start_gradio_ui_macos.sh` to launch (sets MLX backend automatically)
540
+ - The 0.6B LM model (`acestep-5Hz-lm-0.6B`) is recommended for devices with limited unified memory
541
+ - Set `ACESTEP_LM_BACKEND=mlx` in `.env` if launching manually
542
+ - Intel Macs should use `start_gradio_ui.sh` with `ACESTEP_LM_BACKEND=pt` instead
.claude/skills/acestep-docs/guides/GPU_COMPATIBILITY.md ADDED
@@ -0,0 +1,134 @@
1
+ # GPU Compatibility Guide
2
+
3
+ ACE-Step 1.5 automatically adapts to your GPU's available VRAM, adjusting generation limits and LM model availability accordingly. The system detects GPU memory at startup and configures optimal settings.
4
+
5
+ ## GPU Tier Configuration
6
+
7
+ | VRAM | Tier | LM Mode | Max Duration | Max Batch Size | LM Memory Allocation |
8
+ |------|------|---------|--------------|----------------|---------------------|
9
+ | ≤4GB | Tier 1 | Not available | 3 min | 1 | - |
10
+ | 4-6GB | Tier 2 | Not available | 6 min | 1 | - |
11
+ | 6-8GB | Tier 3 | 0.6B (optional) | With LM: 4 min / Without: 6 min | With LM: 1 / Without: 2 | 3GB |
12
+ | 8-12GB | Tier 4 | 0.6B (optional) | With LM: 4 min / Without: 6 min | With LM: 2 / Without: 4 | 3GB |
13
+ | 12-16GB | Tier 5 | 0.6B / 1.7B | With LM: 4 min / Without: 6 min | With LM: 2 / Without: 4 | 0.6B: 3GB, 1.7B: 8GB |
14
+ | 16-24GB | Tier 6 | 0.6B / 1.7B / 4B | 8 min | With LM: 4 / Without: 8 | 0.6B: 3GB, 1.7B: 8GB, 4B: 12GB |
15
+ | ≥24GB | Unlimited | All models | 10 min | 8 | Unrestricted |
16
+
17
+ ## Notes
18
+
19
+ - **Default settings** are automatically configured based on detected GPU memory
20
+ - **LM Mode** refers to the Language Model used for Chain-of-Thought generation and audio understanding
21
+ - **Flash Attention**, **CPU Offload**, **Compile**, and **Quantization** are enabled by default for optimal performance
22
+ - If you request a duration or batch size exceeding your GPU's limits, a warning will be displayed and values will be clamped
23
+ - **Constrained Decoding**: When LM is initialized, the LM's duration generation is also constrained to the GPU tier's maximum duration limit, preventing out-of-memory errors during CoT generation
24
+ - For GPUs with ≤6GB VRAM, LM initialization is disabled by default to preserve memory for the DiT model
25
+ - You can manually override settings via command-line arguments or the Gradio UI
26
+
27
+ ## Overriding LLM Initialization
28
+
29
+ By default, LLM is auto-enabled/disabled based on GPU VRAM. You can override this behavior.
30
+
31
+ **Important:** GPU optimizations (offload, quantization, batch limits) are **always applied** regardless of override. `ACESTEP_INIT_LLM` only controls whether to attempt LLM loading.
32
+
33
+ ### Processing Flow
34
+
35
+ ```
36
+ GPU Detection (full) → ACESTEP_INIT_LLM Override → Model Loading
37
+ │ │ │
38
+ ├─ offload settings ├─ auto: use GPU result ├─ Download model
39
+ ├─ batch limits ├─ true: force enable ├─ Initialize LLM
40
+ ├─ duration limits └─ false: force disable └─ (with GPU settings)
41
+ └─ recommended models
42
+ ```
43
+
44
+ ### Gradio UI
45
+
46
+ ```bash
47
+ # Force enable LLM (may cause OOM on low VRAM)
48
+ uv run acestep --init_llm true
49
+
50
+ # Force disable LLM (pure DiT mode)
51
+ uv run acestep --init_llm false
52
+ ```
53
+
54
+ Or in `start_gradio_ui.bat`:
55
+ ```batch
56
+ set INIT_LLM=--init_llm true
57
+ ```
58
+
59
+ ### API Server
60
+
61
+ Using environment variable:
62
+ ```bash
63
+ # Auto mode (recommended)
64
+ set ACESTEP_INIT_LLM=auto
65
+ uv run acestep-api
66
+
67
+ # Force enable LLM
68
+ set ACESTEP_INIT_LLM=true
69
+ uv run acestep-api
70
+
71
+ # Force disable LLM
72
+ set ACESTEP_INIT_LLM=false
73
+ uv run acestep-api
74
+ ```
75
+
76
+ Or using command line:
77
+ ```bash
78
+ uv run acestep-api --init-llm
79
+ ```
80
+
81
+ Or in `start_api_server.bat`:
82
+ ```batch
83
+ set ACESTEP_INIT_LLM=true
84
+ ```
85
+
86
+ ### When to Override
87
+
88
+ | Scenario | Setting | Notes |
89
+ |----------|---------|-------|
90
+ | Low VRAM but need thinking mode | `true` | May cause OOM, use with caution |
91
+ | Fast generation without CoT | `false` | Skips LLM, uses pure DiT |
92
+ | API server pure DiT mode | `false` | Faster responses, simpler setup |
93
+ | High VRAM but want minimal setup | `false` | No LLM model download needed |
94
+
95
+ ### Features Affected by LLM
96
+
97
+ When LLM is disabled, these features are automatically disabled:
98
+ - **Thinking mode** (`thinking=true`)
99
+ - **CoT caption/language detection** (`use_cot_caption`, `use_cot_language`)
100
+ - **Sample mode** (generate from description)
101
+ - **Format mode** (LLM-enhanced input)
102
+
103
+ The API server will automatically fall back to pure DiT mode when these features are requested but the LLM is unavailable.
104
+
105
+ > **Community Contributions Welcome**: The GPU tier configurations above are based on our testing across common hardware. If you find that your device's actual performance differs from these parameters (e.g., can handle longer durations or larger batch sizes), we welcome you to conduct more thorough testing and submit a PR to optimize these configurations in `acestep/gpu_config.py`. Your contributions help improve the experience for all users!
106
+
107
+ ## Memory Optimization Tips
108
+
109
+ 1. **Low VRAM (<8GB)**: Use DiT-only mode without LM initialization for maximum duration
110
+ 2. **Medium VRAM (8-16GB)**: Use the 0.6B LM model for best balance of quality and memory
111
+ 3. **High VRAM (>16GB)**: Enable larger LM models (1.7B/4B) for better audio understanding and generation quality
112
+
113
+ ## Debug Mode: Simulating Different GPU Configurations
114
+
115
+ For testing and development, you can simulate different GPU memory sizes using the `MAX_CUDA_VRAM` environment variable:
116
+
117
+ ```bash
118
+ # Simulate a 4GB GPU (Tier 1)
119
+ MAX_CUDA_VRAM=4 uv run acestep
120
+
121
+ # Simulate an 8GB GPU (Tier 4)
122
+ MAX_CUDA_VRAM=8 uv run acestep
123
+
124
+ # Simulate a 12GB GPU (Tier 5)
125
+ MAX_CUDA_VRAM=12 uv run acestep
126
+
127
+ # Simulate a 16GB GPU (Tier 6)
128
+ MAX_CUDA_VRAM=16 uv run acestep
129
+ ```
130
+
131
+ This is useful for:
132
+ - Testing GPU tier configurations on high-end hardware
133
+ - Verifying that warnings and limits work correctly for each tier
134
+ - Developing and testing new GPU configuration parameters before submitting a PR
.claude/skills/acestep-docs/guides/GRADIO_GUIDE.md ADDED
@@ -0,0 +1,549 @@
1
+ # ACE-Step Gradio Demo User Guide
2
+
3
+ ---
4
+
5
+ This guide provides comprehensive documentation for using the ACE-Step Gradio web interface for music generation, including all features and settings.
6
+
7
+ ## Table of Contents
8
+
9
+ - [Getting Started](#getting-started)
10
+ - [Service Configuration](#service-configuration)
11
+ - [Generation Modes](#generation-modes)
12
+ - [Task Types](#task-types)
13
+ - [Input Parameters](#input-parameters)
14
+ - [Advanced Settings](#advanced-settings)
15
+ - [Results Section](#results-section)
16
+ - [LoRA Training](#lora-training)
17
+ - [Tips and Best Practices](#tips-and-best-practices)
18
+
19
+ ---
20
+
21
+ ## Getting Started
22
+
23
+ ### Launching the Demo
24
+
25
+ ```bash
26
+ # Basic launch
27
+ python app.py
28
+
29
+ # With pre-initialization
30
+ python app.py --config acestep-v15-turbo --init-llm
31
+
32
+ # With specific port
33
+ python app.py --port 7860
34
+ ```
35
+
36
+ ### Interface Overview
37
+
38
+ The Gradio interface consists of several main sections:
39
+
40
+ 1. **Service Configuration** - Model loading and initialization
41
+ 2. **Required Inputs** - Task type, audio uploads, and generation mode
42
+ 3. **Music Caption & Lyrics** - Text inputs for generation
43
+ 4. **Optional Parameters** - Metadata like BPM, key, duration
44
+ 5. **Advanced Settings** - Fine-grained control over generation
45
+ 6. **Results** - Generated audio playback and management
46
+
47
+ ---
48
+
49
+ ## Service Configuration
50
+
51
+ ### Model Selection
52
+
53
+ | Setting | Description |
54
+ |---------|-------------|
55
+ | **Checkpoint File** | Select a trained model checkpoint (if available) |
56
+ | **Main Model Path** | Choose the DiT model configuration (e.g., `acestep-v15-turbo`, `acestep-v15-turbo-shift3`) |
57
+ | **Device** | Processing device: `auto` (recommended), `cuda`, or `cpu` |
58
+
59
+ ### 5Hz LM Configuration
60
+
61
+ | Setting | Description |
62
+ |---------|-------------|
63
+ | **5Hz LM Model Path** | Select the language model (e.g., `acestep-5Hz-lm-0.6B`, `acestep-5Hz-lm-1.7B`) |
64
+ | **5Hz LM Backend** | `vllm` (faster, recommended) or `pt` (PyTorch, more compatible) |
65
+ | **Initialize 5Hz LM** | Check to load the LM during initialization (required for thinking mode) |
66
+
67
+ ### Performance Options
68
+
69
+ | Setting | Description |
70
+ |---------|-------------|
71
+ | **Use Flash Attention** | Enable for faster inference (requires flash_attn package) |
72
+ | **Offload to CPU** | Offload models to CPU when idle to save GPU memory |
73
+ | **Offload DiT to CPU** | Specifically offload the DiT model to CPU |
74
+
75
+ ### LoRA Adapter
76
+
77
+ | Setting | Description |
78
+ |---------|-------------|
79
+ | **LoRA Path** | Path to trained LoRA adapter directory |
80
+ | **Load LoRA** | Load the specified LoRA adapter |
81
+ | **Unload** | Remove the currently loaded LoRA |
82
+ | **Use LoRA** | Enable/disable the loaded LoRA for inference |
83
+
84
+ ### Initialization
85
+
86
+ Click **Initialize Service** to load the models. The status box will show progress and confirmation.
87
+
88
+ ---
89
+
90
+ ## Generation Modes
91
+
92
+ ### Simple Mode
93
+
94
+ Simple mode is designed for quick, natural language-based music generation.
95
+
96
+ **How to use:**
97
+ 1. Select "Simple" in the Generation Mode radio button
98
+ 2. Enter a natural language description in the "Song Description" field
99
+ 3. Optionally check "Instrumental" if you don't want vocals
100
+ 4. Optionally select a preferred vocal language
101
+ 5. Click **Create Sample** to generate caption, lyrics, and metadata
102
+ 6. Review the generated content in the expanded sections
103
+ 7. Click **Generate Music** to create the audio
104
+
105
+ **Example descriptions:**
106
+ - "a soft Bengali love song for a quiet evening"
107
+ - "upbeat electronic dance music with heavy bass drops"
108
+ - "melancholic indie folk with acoustic guitar"
109
+ - "jazz trio playing in a smoky bar"
110
+
111
+ **Random Sample:** Click the 🎲 button to load a random example description.
112
+
113
+ ### Custom Mode
114
+
115
+ Custom mode provides full control over all generation parameters.
116
+
117
+ **How to use:**
118
+ 1. Select "Custom" in the Generation Mode radio button
119
+ 2. Manually fill in the Caption and Lyrics fields
120
+ 3. Set optional metadata (BPM, Key, Duration, etc.)
121
+ 4. Optionally click **Format** to enhance your input using the LM
122
+ 5. Configure advanced settings as needed
123
+ 6. Click **Generate Music** to create the audio
124
+
125
+ ---
126
+
127
+ ## Task Types
128
+
129
+ ### text2music (Default)
130
+
131
+ Generate music from text descriptions and/or lyrics.
132
+
133
+ **Use case:** Creating new music from scratch based on prompts.
134
+
135
+ **Required inputs:** Caption or Lyrics (at least one)
136
+
137
+ ### cover
138
+
139
+ Transform existing audio while maintaining structure but changing style.
140
+
141
+ **Use case:** Creating cover versions in different styles.
142
+
143
+ **Required inputs:**
144
+ - Source Audio (upload in Audio Uploads section)
145
+ - Caption describing the target style
146
+
147
+ **Key parameter:** `Audio Cover Strength` (0.0-1.0)
148
+ - Higher values maintain more of the original structure
149
+ - Lower values allow more creative freedom
150
+
151
+ ### repaint
152
+
153
+ Regenerate a specific time segment of audio.
154
+
155
+ **Use case:** Fixing or modifying specific sections of generated music.
156
+
157
+ **Required inputs:**
158
+ - Source Audio
159
+ - Repainting Start (seconds)
160
+ - Repainting End (seconds, -1 for end of file)
161
+ - Caption describing the desired content
162
+
163
+ ### lego (Base Model Only)
164
+
165
+ Generate a specific instrument track in context of existing audio.
166
+
167
+ **Use case:** Adding instrument layers to backing tracks.
168
+
169
+ **Required inputs:**
170
+ - Source Audio
171
+ - Track Name (select from dropdown)
172
+ - Caption describing the track characteristics
173
+
174
+ **Available tracks:** vocals, backing_vocals, drums, bass, guitar, keyboard, percussion, strings, synth, fx, brass, woodwinds
175
+
176
+ ### extract (Base Model Only)
177
+
178
+ Extract/isolate a specific instrument track from mixed audio.
179
+
180
+ **Use case:** Stem separation, isolating instruments.
181
+
182
+ **Required inputs:**
183
+ - Source Audio
184
+ - Track Name to extract
185
+
186
+ ### complete (Base Model Only)
187
+
188
+ Complete partial tracks with specified instruments.
189
+
190
+ **Use case:** Auto-arranging incomplete compositions.
191
+
192
+ **Required inputs:**
193
+ - Source Audio
194
+ - Track Names (multiple selection)
195
+ - Caption describing the desired style
196
+
197
+ ---
198
+
199
+ ## Input Parameters
200
+
201
+ ### Required Inputs
202
+
203
+ #### Task Type
204
+ Select the generation task from the dropdown. The instruction field updates automatically based on the selected task.
205
+
206
+ #### Audio Uploads
207
+
208
+ | Field | Description |
209
+ |-------|-------------|
210
+ | **Reference Audio** | Optional audio for style reference |
211
+ | **Source Audio** | Required for cover, repaint, lego, extract, complete tasks |
212
+ | **Convert to Codes** | Extract 5Hz semantic codes from source audio |
213
+
214
+ #### LM Codes Hints
215
+
216
+ Pre-computed audio semantic codes can be pasted here to guide generation. Use the **Transcribe** button to analyze codes and extract metadata.
217
+
218
+ ### Music Caption
219
+
220
+ The text description of the desired music. Be specific about:
221
+ - Genre and style
222
+ - Instruments
223
+ - Mood and atmosphere
224
+ - Tempo feel (if not specifying BPM)
225
+
226
+ **Example:** "upbeat pop rock with electric guitars, driving drums, and catchy synth hooks"
227
+
228
+ Click 🎲 to load a random example caption.
229
+
230
+ ### Lyrics
231
+
232
+ Enter lyrics with structure tags:
233
+
234
+ ```
235
+ [Verse 1]
236
+ Walking down the street today
237
+ Thinking of the words you used to say
238
+
239
+ [Chorus]
240
+ I'm moving on, I'm staying strong
241
+ This is where I belong
242
+
243
+ [Verse 2]
244
+ ...
245
+ ```
246
+
247
+ **Instrumental checkbox:** Check this to generate instrumental music regardless of lyrics content.
248
+
249
+ **Vocal Language:** Select the language for vocals. Use "unknown" for auto-detection or instrumental tracks.
250
+
251
+ **Format button:** Click to enhance caption and lyrics using the 5Hz LM.
252
+
253
+ ### Optional Parameters
254
+
255
+ | Parameter | Default | Description |
256
+ |-----------|---------|-------------|
257
+ | **BPM** | Auto | Tempo in beats per minute (30-300) |
258
+ | **Key Scale** | Auto | Musical key (e.g., "C Major", "Am", "F# minor") |
259
+ | **Time Signature** | Auto | Time signature: 2 (2/4), 3 (3/4), 4 (4/4), 6 (6/8) |
260
+ | **Audio Duration** | Auto/-1 | Target length in seconds (10-600). -1 for automatic |
261
+ | **Batch Size** | 2 | Number of audio variations to generate (1-8) |
262
+
263
+ ---
264
+
265
+ ## Advanced Settings
266
+
267
+ ### DiT Parameters
268
+
269
+ | Parameter | Default | Description |
270
+ |-----------|---------|-------------|
271
+ | **Inference Steps** | 8 | Denoising steps. Turbo: 1-20, Base: 1-200 |
272
+ | **Guidance Scale** | 7.0 | CFG strength (base model only). Higher = follows prompt more |
273
+ | **Seed** | -1 | Random seed. Use comma-separated values for batches |
274
+ | **Random Seed** | ✓ | When checked, generates random seeds |
275
+ | **Audio Format** | mp3 | Output format: mp3, flac |
276
+ | **Shift** | 3.0 | Timestep shift factor (1.0-5.0). Recommended 3.0 for turbo |
277
+ | **Inference Method** | ode | ode (Euler, faster) or sde (stochastic) |
278
+ | **Custom Timesteps** | - | Override timesteps (e.g., "0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0") |
279
+
280
+ ### Base Model Only Parameters
281
+
282
+ | Parameter | Default | Description |
283
+ |-----------|---------|-------------|
284
+ | **Use ADG** | ✗ | Enable Adaptive Dual Guidance for better quality |
285
+ | **CFG Interval Start** | 0.0 | When to start applying CFG (0.0-1.0) |
286
+ | **CFG Interval End** | 1.0 | When to stop applying CFG (0.0-1.0) |
287
+
288
+ ### LM Parameters
289
+
290
+ | Parameter | Default | Description |
291
+ |-----------|---------|-------------|
292
+ | **LM Temperature** | 0.85 | Sampling temperature (0.0-2.0). Higher = more creative |
293
+ | **LM CFG Scale** | 2.0 | LM guidance strength (1.0-3.0) |
294
+ | **LM Top-K** | 0 | Top-K sampling. 0 disables |
295
+ | **LM Top-P** | 0.9 | Nucleus sampling (0.0-1.0) |
296
+ | **LM Negative Prompt** | "NO USER INPUT" | Negative prompt for CFG |
297
+
298
+ ### CoT (Chain-of-Thought) Options
299
+
300
+ | Option | Default | Description |
301
+ |--------|---------|-------------|
302
+ | **CoT Metas** | ✓ | Generate metadata via LM reasoning |
303
+ | **CoT Language** | ✓ | Detect vocal language via LM |
304
+ | **Constrained Decoding Debug** | ✗ | Enable debug logging |
305
+
306
+ ### Generation Options
307
+
308
+ | Option | Default | Description |
309
+ |--------|---------|-------------|
310
+ | **LM Codes Strength** | 1.0 | How strongly LM codes influence generation (0.0-1.0) |
311
+ | **Auto Score** | ✗ | Automatically calculate quality scores |
312
+ | **Auto LRC** | ✗ | Automatically generate lyrics timestamps |
313
+ | **LM Batch Chunk Size** | 8 | Max items per LM batch (GPU memory) |
314
+
315
+ ### Main Generation Controls
316
+
317
+ | Control | Description |
318
+ |---------|-------------|
319
+ | **Think** | Enable 5Hz LM for code generation and metadata |
320
+ | **ParallelThinking** | Enable parallel LM batch processing |
321
+ | **CaptionRewrite** | Let LM enhance the input caption |
322
+ | **AutoGen** | Automatically start next batch after completion |
323
+
324
+ ---
325
+
326
+ ## Results Section
327
+
328
+ ### Generated Audio
329
+
330
+ Up to 8 audio samples are displayed based on batch size. Each sample includes:
331
+
332
+ - **Audio Player** - Play, pause, and download the generated audio
333
+ - **Send To Src** - Send this audio to the Source Audio input for further processing
334
+ - **Save** - Save audio and metadata to a JSON file
335
+ - **Score** - Calculate perplexity-based quality score
336
+ - **LRC** - Generate lyrics timestamps (LRC format)
337
+
338
+ ### Details Accordion
339
+
340
+ Click "Score & LRC & LM Codes" to expand and view:
341
+ - **LM Codes** - The 5Hz semantic codes for this sample
342
+ - **Quality Score** - Perplexity-based quality metric
343
+ - **Lyrics Timestamps** - LRC format timing data
344
+
345
+ ### Batch Navigation
346
+
347
+ | Control | Description |
348
+ |---------|-------------|
349
+ | **◀ Previous** | View the previous batch |
350
+ | **Batch Indicator** | Shows current batch position (e.g., "Batch 1 / 3") |
351
+ | **Next Batch Status** | Shows background generation progress |
352
+ | **Next ▶** | View the next batch (triggers generation if AutoGen is on) |
353
+
354
+ ### Restore Parameters
355
+
356
+ Click **Apply These Settings to UI** to restore all generation parameters from the current batch back to the input fields. Useful for iterating on a good result.
357
+
358
+ ### Batch Results
359
+
360
+ The "Batch Results & Generation Details" accordion contains:
361
+ - **All Generated Files** - Download all files from all batches
362
+ - **Generation Details** - Detailed information about the generation process
363
+
364
+ ---
365
+
366
+ ## LoRA Training
367
+
368
+ The LoRA Training tab provides tools for creating custom LoRA adapters.
369
+
370
+ ### Dataset Builder Tab
371
+
372
+ #### Step 1: Load or Scan
373
+
374
+ **Option A: Load Existing Dataset**
375
+ 1. Enter the path to a previously saved dataset JSON
376
+ 2. Click **Load**
377
+
378
+ **Option B: Scan New Directory**
379
+ 1. Enter the path to your audio folder
380
+ 2. Click **Scan** to find audio files (wav, mp3, flac, ogg, opus)
381
+
382
+ #### Step 2: Configure Dataset
383
+
384
+ | Setting | Description |
385
+ |---------|-------------|
386
+ | **Dataset Name** | Name for your dataset |
387
+ | **All Instrumental** | Check if all tracks have no vocals |
388
+ | **Custom Activation Tag** | Unique tag to activate this LoRA's style |
389
+ | **Tag Position** | Where to place the tag: Prepend, Append, or Replace caption |
390
+
391
+ #### Step 3: Auto-Label
392
+
393
+ Click **Auto-Label All** to generate metadata for all audio files:
394
+ - Caption (music description)
395
+ - BPM
396
+ - Key
397
+ - Time Signature
398
+
399
+ The **Skip Metas** option skips LLM labeling and uses N/A values for the metadata fields instead.
400
+
401
+ #### Step 4: Preview & Edit
402
+
403
+ Use the slider to select samples and manually edit:
404
+ - Caption
405
+ - Lyrics
406
+ - BPM, Key, Time Signature
407
+ - Language
408
+ - Instrumental flag
409
+
410
+ Click **Save Changes** to update the sample.
411
+
412
+ #### Step 5: Save Dataset
413
+
414
+ Enter a save path and click **Save Dataset** to export as JSON.
415
+
416
+ #### Step 6: Preprocess
417
+
418
+ Convert the dataset to pre-computed tensors for fast training:
419
+ 1. Optionally load an existing dataset JSON
420
+ 2. Set the tensor output directory
421
+ 3. Click **Preprocess**
422
+
423
+ This encodes audio to VAE latents, text to embeddings, and runs the condition encoder.
424
+
425
+ ### Train LoRA Tab
426
+
427
+ #### Dataset Selection
428
+
429
+ Enter the path to preprocessed tensors directory and click **Load Dataset**.
430
+
431
+ #### LoRA Settings
432
+
433
+ | Setting | Default | Description |
434
+ |---------|---------|-------------|
435
+ | **LoRA Rank (r)** | 64 | Capacity of LoRA. Higher = more capacity, more memory |
436
+ | **LoRA Alpha** | 128 | Scaling factor (typically 2x rank) |
437
+ | **LoRA Dropout** | 0.1 | Dropout rate for regularization |
438
+
439
+ #### Training Parameters
440
+
441
+ | Setting | Default | Description |
442
+ |---------|---------|-------------|
443
+ | **Learning Rate** | 1e-4 | Optimization learning rate |
444
+ | **Max Epochs** | 500 | Maximum training epochs |
445
+ | **Batch Size** | 1 | Training batch size |
446
+ | **Gradient Accumulation** | 1 | Effective batch = batch_size × accumulation |
447
+ | **Save Every N Epochs** | 200 | Checkpoint save frequency |
448
+ | **Shift** | 3.0 | Timestep shift for turbo model |
449
+ | **Seed** | 42 | Random seed for reproducibility |
450
+
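The two derived quantities in the tables above can be sanity-checked with plain arithmetic. A minimal sketch (the batch values are illustrative, not required settings; the `alpha / rank` ratio follows the standard LoRA formulation, where the "2x rank" guideline gives a scaling of 2.0):

```python
# LoRA adapter scaling in the standard formulation is alpha / rank.
rank, alpha = 64, 128
scaling = alpha / rank  # the "alpha = 2x rank" guideline yields 2.0

# Effective batch = batch_size x gradient accumulation (from the table above).
batch_size, grad_accum = 1, 4  # grad_accum=4 is an example; the default is 1
effective_batch = batch_size * grad_accum

print(scaling, effective_batch)  # 2.0 4
```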
451
+ #### Training Controls
452
+
453
+ - **Start Training** - Begin the training process
454
+ - **Stop Training** - Interrupt training
455
+ - **Training Progress** - Shows current epoch and loss
456
+ - **Training Log** - Detailed training output
457
+ - **Training Loss Plot** - Visual loss curve
458
+
459
+ #### Export LoRA
460
+
461
+ After training, export the final adapter:
462
+ 1. Enter the export path
463
+ 2. Click **Export LoRA**
464
+
465
+ ---
466
+
467
+ ## Tips and Best Practices
468
+
469
+ ### For Best Quality
470
+
471
+ 1. **Use thinking mode** - Keep "Think" checkbox enabled for LM-enhanced generation
472
+ 2. **Be specific in captions** - Include genre, instruments, mood, and style details
473
+ 3. **Let LM detect metadata** - Leave BPM/Key/Duration empty for auto-detection
474
+ 4. **Use batch generation** - Generate 2-4 variations and pick the best
475
+
476
+ ### For Faster Generation
477
+
478
+ 1. **Use turbo model** - Select `acestep-v15-turbo` or `acestep-v15-turbo-shift3`
479
+ 2. **Keep inference steps at 8** - Default is optimal for turbo
480
+ 3. **Reduce batch size** - Lower batch size if you need quick results
481
+ 4. **Disable AutoGen** - Manual control over batch generation
482
+
483
+ ### For Consistent Results
484
+
485
+ 1. **Set a specific seed** - Uncheck "Random Seed" and enter a seed value
486
+ 2. **Save good results** - Use "Save" to export parameters for reproduction
487
+ 3. **Use "Apply These Settings"** - Restore parameters from a good batch
488
+
489
+ ### For Long-form Music
490
+
491
+ 1. **Set explicit duration** - Specify duration in seconds
492
+ 2. **Use repaint task** - Fix problematic sections after initial generation
493
+ 3. **Chain generations** - Use "Send To Src" to build upon previous results
494
+
495
+ ### For Style Consistency
496
+
497
+ 1. **Train a LoRA** - Create a custom adapter for your style
498
+ 2. **Use reference audio** - Upload style reference in Audio Uploads
499
+ 3. **Use consistent captions** - Maintain similar descriptive language
500
+
501
+ ### Troubleshooting
502
+
503
+ **No audio generated:**
504
+ - Check that the model is initialized (green status message)
505
+ - Ensure 5Hz LM is initialized if using thinking mode
506
+ - Check the status output for error messages
507
+
508
+ **Poor quality results:**
509
+ - Increase inference steps (for base model)
510
+ - Adjust guidance scale
511
+ - Try different seeds
512
+ - Make caption more specific
513
+
514
+ **Out of memory:**
515
+ - Reduce batch size
516
+ - Enable CPU offloading
517
+ - Reduce LM batch chunk size
518
+
519
+ **LM not working:**
520
+ - Ensure "Initialize 5Hz LM" was checked during initialization
521
+ - Check that a valid LM model path is selected
522
+ - Verify vllm or PyTorch backend is available
523
+
524
+ ---
525
+
526
+ ## Keyboard Shortcuts
527
+
528
+ The Gradio interface supports standard web shortcuts:
529
+ - **Tab** - Move between input fields
530
+ - **Enter** - Submit text inputs
531
+ - **Space** - Toggle checkboxes
532
+
533
+ ---
534
+
535
+ ## Language Support
536
+
537
+ The interface supports multiple UI languages:
538
+ - **English** (en)
539
+ - **Chinese** (zh)
540
+ - **Japanese** (ja)
541
+
542
+ Select your preferred language in the Service Configuration section.
543
+
544
+ ---
545
+
546
+ For more information, see:
547
+ - Main README: [`../../README.md`](../../README.md)
548
+ - REST API Documentation: [`API.md`](API.md)
549
+ - Python Inference API: [`INFERENCE.md`](INFERENCE.md)
.claude/skills/acestep-docs/guides/INFERENCE.md ADDED
@@ -0,0 +1,1191 @@
1
+ # ACE-Step Inference API Documentation
2
+
3
+ ---
4
+
5
+ This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types.
6
+
7
+ ## Table of Contents
8
+
9
+ - [Quick Start](#quick-start)
10
+ - [API Overview](#api-overview)
11
+ - [GenerationParams Parameters](#generationparams-parameters)
12
+ - [GenerationConfig Parameters](#generationconfig-parameters)
13
+ - [Task Types](#task-types)
14
+ - [Helper Functions](#helper-functions)
15
+ - [Complete Examples](#complete-examples)
16
+ - [Best Practices](#best-practices)
17
+
18
+ ---
19
+
20
+ ## Quick Start
21
+
22
+ ### Basic Usage
23
+
24
+ ```python
25
+ from acestep.handler import AceStepHandler
26
+ from acestep.llm_inference import LLMHandler
27
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music
28
+
29
+ # Initialize handlers
30
+ dit_handler = AceStepHandler()
31
+ llm_handler = LLMHandler()
32
+
33
+ # Initialize services
34
+ dit_handler.initialize_service(
35
+ project_root="/path/to/project",
36
+ config_path="acestep-v15-turbo",
37
+ device="cuda"
38
+ )
39
+
40
+ llm_handler.initialize(
41
+ checkpoint_dir="/path/to/checkpoints",
42
+ lm_model_path="acestep-5Hz-lm-0.6B",
43
+ backend="vllm",
44
+ device="cuda"
45
+ )
46
+
47
+ # Configure generation parameters
48
+ params = GenerationParams(
49
+ caption="upbeat electronic dance music with heavy bass",
50
+ bpm=128,
51
+ duration=30,
52
+ )
53
+
54
+ # Configure generation settings
55
+ config = GenerationConfig(
56
+ batch_size=2,
57
+ audio_format="flac",
58
+ )
59
+
60
+ # Generate music
61
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")
62
+
63
+ # Access results
64
+ if result.success:
65
+ for audio in result.audios:
66
+ print(f"Generated: {audio['path']}")
67
+ print(f"Key: {audio['key']}")
68
+ print(f"Seed: {audio['params']['seed']}")
69
+ else:
70
+ print(f"Error: {result.error}")
71
+ ```
72
+
73
+ ---
74
+
75
+ ## API Overview
76
+
77
+ ### Main Functions
78
+
79
+ #### generate_music
80
+
81
+ ```python
82
+ def generate_music(
83
+ dit_handler,
84
+ llm_handler,
85
+ params: GenerationParams,
86
+ config: GenerationConfig,
87
+ save_dir: Optional[str] = None,
88
+ progress=None,
89
+ ) -> GenerationResult
90
+ ```
91
+
92
+ Main function for generating music using the ACE-Step model.
93
+
94
+ #### understand_music
95
+
96
+ ```python
97
+ def understand_music(
98
+ llm_handler,
99
+ audio_codes: str,
100
+ temperature: float = 0.85,
101
+ top_k: Optional[int] = None,
102
+ top_p: Optional[float] = None,
103
+ repetition_penalty: float = 1.0,
104
+ use_constrained_decoding: bool = True,
105
+ constrained_decoding_debug: bool = False,
106
+ ) -> UnderstandResult
107
+ ```
108
+
109
+ Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.).
110
+
111
+ #### create_sample
112
+
113
+ ```python
114
+ def create_sample(
115
+ llm_handler,
116
+ query: str,
117
+ instrumental: bool = False,
118
+ vocal_language: Optional[str] = None,
119
+ temperature: float = 0.85,
120
+ top_k: Optional[int] = None,
121
+ top_p: Optional[float] = None,
122
+ repetition_penalty: float = 1.0,
123
+ use_constrained_decoding: bool = True,
124
+ constrained_decoding_debug: bool = False,
125
+ ) -> CreateSampleResult
126
+ ```
127
+
128
+ Generate a complete music sample (caption, lyrics, metadata) from a natural language description.
129
+
130
+ #### format_sample
131
+
132
+ ```python
133
+ def format_sample(
134
+ llm_handler,
135
+ caption: str,
136
+ lyrics: str,
137
+ user_metadata: Optional[Dict[str, Any]] = None,
138
+ temperature: float = 0.85,
139
+ top_k: Optional[int] = None,
140
+ top_p: Optional[float] = None,
141
+ repetition_penalty: float = 1.0,
142
+ use_constrained_decoding: bool = True,
143
+ constrained_decoding_debug: bool = False,
144
+ ) -> FormatSampleResult
145
+ ```
146
+
147
+ Format and enhance user-provided caption and lyrics, generating structured metadata.
148
+
149
+ ### Configuration Objects
150
+
151
+ The API uses two configuration dataclasses:
152
+
153
+ **GenerationParams** - Contains all music generation parameters:
154
+
155
+ ```python
156
+ @dataclass
157
+ class GenerationParams:
158
+ # Task & Instruction
159
+ task_type: str = "text2music"
160
+ instruction: str = "Fill the audio semantic mask based on the given conditions:"
161
+
162
+ # Audio Uploads
163
+ reference_audio: Optional[str] = None
164
+ src_audio: Optional[str] = None
165
+
166
+ # LM Codes Hints
167
+ audio_codes: str = ""
168
+
169
+ # Text Inputs
170
+ caption: str = ""
171
+ lyrics: str = ""
172
+ instrumental: bool = False
173
+
174
+ # Metadata
175
+ vocal_language: str = "unknown"
176
+ bpm: Optional[int] = None
177
+ keyscale: str = ""
178
+ timesignature: str = ""
179
+ duration: float = -1.0
180
+
181
+ # Advanced Settings
182
+ inference_steps: int = 8
183
+ seed: int = -1
184
+ guidance_scale: float = 7.0
185
+ use_adg: bool = False
186
+ cfg_interval_start: float = 0.0
187
+ cfg_interval_end: float = 1.0
188
+ shift: float = 1.0 # NEW: Timestep shift factor
189
+ infer_method: str = "ode" # NEW: Diffusion inference method
190
+ timesteps: Optional[List[float]] = None # NEW: Custom timesteps
191
+
192
+ repainting_start: float = 0.0
193
+ repainting_end: float = -1
194
+ audio_cover_strength: float = 1.0
195
+
196
+ # 5Hz Language Model Parameters
197
+ thinking: bool = True
198
+ lm_temperature: float = 0.85
199
+ lm_cfg_scale: float = 2.0
200
+ lm_top_k: int = 0
201
+ lm_top_p: float = 0.9
202
+ lm_negative_prompt: str = "NO USER INPUT"
203
+ use_cot_metas: bool = True
204
+ use_cot_caption: bool = True
205
+ use_cot_lyrics: bool = False
206
+ use_cot_language: bool = True
207
+ use_constrained_decoding: bool = True
208
+
209
+ # CoT Generated Values (auto-filled by LM)
210
+ cot_bpm: Optional[int] = None
211
+ cot_keyscale: str = ""
212
+ cot_timesignature: str = ""
213
+ cot_duration: Optional[float] = None
214
+ cot_vocal_language: str = "unknown"
215
+ cot_caption: str = ""
216
+ cot_lyrics: str = ""
217
+ ```
218
+
219
+ **GenerationConfig** - Contains batch and output configuration:
220
+
221
+ ```python
222
+ @dataclass
223
+ class GenerationConfig:
224
+ batch_size: int = 2
225
+ allow_lm_batch: bool = False
226
+ use_random_seed: bool = True
227
+ seeds: Optional[List[int]] = None
228
+ lm_batch_chunk_size: int = 8
229
+ constrained_decoding_debug: bool = False
230
+ audio_format: str = "flac"
231
+ ```
232
+
233
+ ### Result Objects
234
+
235
+ **GenerationResult** - Result of music generation:
236
+
237
+ ```python
238
+ @dataclass
239
+ class GenerationResult:
240
+ # Audio Outputs
241
+ audios: List[Dict[str, Any]] # List of audio dictionaries
242
+
243
+ # Generation Information
244
+ status_message: str # Status message from generation
245
+ extra_outputs: Dict[str, Any] # Extra outputs (latents, masks, lm_metadata, time_costs)
246
+
247
+ # Success Status
248
+ success: bool # Whether generation succeeded
249
+ error: Optional[str] # Error message if failed
250
+ ```
251
+
252
+ **Audio Dictionary Structure:**
253
+
254
+ Each item in `audios` list contains:
255
+
256
+ ```python
257
+ {
258
+ "path": str, # File path to saved audio
259
+ "tensor": Tensor, # Audio tensor [channels, samples], CPU, float32
260
+ "key": str, # Unique audio key (UUID based on params)
261
+ "sample_rate": int, # Sample rate (default: 48000)
262
+ "params": Dict, # Generation params for this audio (includes seed, audio_codes, etc.)
263
+ }
264
+ ```
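As a quick sketch of consuming this structure, the clip length can be recovered from `tensor` and `sample_rate`. The helper below is hypothetical (not part of the ACE-Step API) and assumes only the fields documented above:

```python
def audio_duration_seconds(audio: dict) -> float:
    """Length of one generated clip in seconds.

    Hypothetical convenience helper: `tensor` has shape
    [channels, samples], per the structure documented above.
    """
    _channels, samples = audio["tensor"].shape
    return samples / audio["sample_rate"]
```

For example, a `[2, 96000]` tensor at the default 48000 Hz sample rate is a 2-second clip.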
265
+
266
+ **UnderstandResult** - Result of music understanding:
267
+
268
+ ```python
269
+ @dataclass
270
+ class UnderstandResult:
271
+ # Metadata Fields
272
+ caption: str = ""
273
+ lyrics: str = ""
274
+ bpm: Optional[int] = None
275
+ duration: Optional[float] = None
276
+ keyscale: str = ""
277
+ language: str = ""
278
+ timesignature: str = ""
279
+
280
+ # Status
281
+ status_message: str = ""
282
+ success: bool = True
283
+ error: Optional[str] = None
284
+ ```
285
+
286
+ **CreateSampleResult** - Result of sample creation:
287
+
288
+ ```python
289
+ @dataclass
290
+ class CreateSampleResult:
291
+ # Metadata Fields
292
+ caption: str = ""
293
+ lyrics: str = ""
294
+ bpm: Optional[int] = None
295
+ duration: Optional[float] = None
296
+ keyscale: str = ""
297
+ language: str = ""
298
+ timesignature: str = ""
299
+ instrumental: bool = False
300
+
301
+ # Status
302
+ status_message: str = ""
303
+ success: bool = True
304
+ error: Optional[str] = None
305
+ ```
306
+
307
+ **FormatSampleResult** - Result of sample formatting:
308
+
309
+ ```python
310
+ @dataclass
311
+ class FormatSampleResult:
312
+ # Metadata Fields
313
+ caption: str = ""
314
+ lyrics: str = ""
315
+ bpm: Optional[int] = None
316
+ duration: Optional[float] = None
317
+ keyscale: str = ""
318
+ language: str = ""
319
+ timesignature: str = ""
320
+
321
+ # Status
322
+ status_message: str = ""
323
+ success: bool = True
324
+ error: Optional[str] = None
325
+ ```
326
+
327
+ ---
328
+
329
+ ## GenerationParams Parameters
330
+
331
+ ### Text Inputs
332
+
333
+ | Parameter | Type | Default | Description |
334
+ |-----------|------|---------|-------------|
335
+ | `caption` | `str` | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or detailed description with genre, mood, instruments, etc. Max 512 characters. |
336
+ | `lyrics` | `str` | `""` | Lyrics text for vocal music. Use `"[Instrumental]"` for instrumental tracks. Supports multiple languages. Max 4096 characters. |
337
+ | `instrumental` | `bool` | `False` | If True, generate instrumental music regardless of lyrics. |
338
+
339
+ ### Music Metadata
340
+
341
+ | Parameter | Type | Default | Description |
342
+ |-----------|------|---------|-------------|
343
+ | `bpm` | `Optional[int]` | `None` | Beats per minute (30-300). `None` enables auto-detection via LM. |
344
+ | `keyscale` | `str` | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
345
+ | `timesignature` | `str` | `""` | Time signature, given as the beat count (`2` for 2/4, `3` for 3/4, `4` for 4/4, `6` for 6/8). Empty string enables auto-detection. |
346
+ | `vocal_language` | `str` | `"unknown"` | Language code for vocals (ISO 639-1). Supported: `"en"`, `"zh"`, `"ja"`, `"es"`, `"fr"`, etc. Use `"unknown"` for auto-detection. |
347
+ | `duration` | `float` | `-1.0` | Target audio length in seconds (10-600). If `<= 0` or `None`, the model chooses the length automatically based on the lyrics. |
348
+
349
+ ### Generation Parameters
350
+
351
+ | Parameter | Type | Default | Description |
352
+ |-----------|------|---------|-------------|
353
+ | `inference_steps` | `int` | `8` | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
354
+ | `guidance_scale` | `float` | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to the text prompt. Only supported by the non-turbo (base) model. Typical range: 5.0-9.0. |
355
+ | `seed` | `int` | `-1` | Random seed for reproducibility. Use `-1` for random seed, or any positive integer for fixed seed. |
356
+
357
+ ### Advanced DiT Parameters
358
+
359
+ | Parameter | Type | Default | Description |
360
+ |-----------|------|---------|-------------|
361
+ | `use_adg` | `bool` | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
362
+ | `cfg_interval_start` | `float` | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
363
+ | `cfg_interval_end` | `float` | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
364
+ | `shift` | `float` | `1.0` | Timestep shift factor (range 1.0-5.0, default 1.0). When != 1.0, applies `t = shift * t / (1 + (shift - 1) * t)` to timesteps. Recommended 3.0 for turbo models. |
365
+ | `infer_method` | `str` | `"ode"` | Diffusion inference method. `"ode"` (Euler) is faster and deterministic. `"sde"` (stochastic) introduces sampling variance and may produce different results across runs. |
366
+ | `timesteps` | `Optional[List[float]]` | `None` | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., `[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]`). If provided, overrides `inference_steps` and `shift`. |
367
+
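The `shift` transform above can be checked in isolation. This standalone sketch applies the documented formula to a uniform 8-step schedule (`shift=3.0` is the recommendation for turbo models; the schedule itself is illustrative):

```python
def shift_timestep(t: float, shift: float) -> float:
    # Documented transform: t = shift * t / (1 + (shift - 1) * t)
    return shift * t / (1 + (shift - 1) * t)

steps = 8
uniform = [(steps - i) / steps for i in range(steps)]  # 1.0 down to 0.125
shifted = [shift_timestep(t, 3.0) for t in uniform]
# shift > 1 pushes timesteps toward 1.0, spending more of the
# schedule in the high-noise region; shift = 1.0 is the identity.
```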
368
+ ### Task-Specific Parameters
369
+
370
+ | Parameter | Type | Default | Description |
371
+ |-----------|------|---------|-------------|
372
+ | `task_type` | `str` | `"text2music"` | Generation task type. See [Task Types](#task-types) section for details. |
373
+ | `instruction` | `str` | `"Fill the audio semantic mask based on the given conditions:"` | Task-specific instruction prompt. |
374
+ | `reference_audio` | `Optional[str]` | `None` | Path to reference audio file for style transfer or continuation tasks. |
375
+ | `src_audio` | `Optional[str]` | `None` | Path to source audio file for audio-to-audio tasks (cover, repaint, etc.). |
376
+ | `audio_codes` | `str` | `""` | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
377
+ | `repainting_start` | `float` | `0.0` | Repainting start time in seconds (for repaint/lego tasks). |
378
+ | `repainting_end` | `float` | `-1` | Repainting end time in seconds. Use `-1` for end of audio. |
379
+ | `audio_cover_strength` | `float` | `1.0` | Strength of audio cover/codes influence (0.0-1.0). Use smaller values (e.g., 0.2) for style transfer tasks. |
380
+
381
+ ### 5Hz Language Model Parameters
382
+
383
+ | Parameter | Type | Default | Description |
384
+ |-----------|------|---------|-------------|
385
+ | `thinking` | `bool` | `True` | Enable 5Hz Language Model "Chain-of-Thought" reasoning for semantic/music metadata and codes. |
386
+ | `lm_temperature` | `float` | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
387
+ | `lm_cfg_scale` | `float` | `2.0` | LM classifier-free guidance scale. Higher = stronger adherence to prompt. |
388
+ | `lm_top_k` | `int` | `0` | LM top-k sampling. `0` disables top-k filtering. Typical values: 40-100. |
389
+ | `lm_top_p` | `float` | `0.9` | LM nucleus sampling (0.0-1.0). `1.0` disables nucleus sampling. Typical values: 0.9-0.95. |
390
+ | `lm_negative_prompt` | `str` | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
391
+ | `use_cot_metas` | `bool` | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). |
392
+ | `use_cot_caption` | `bool` | `True` | Refine user caption using LM CoT reasoning. |
393
+ | `use_cot_language` | `bool` | `True` | Detect vocal language using LM CoT reasoning. |
394
+ | `use_cot_lyrics` | `bool` | `False` | (Reserved for future use) Generate/refine lyrics using LM CoT. |
395
+ | `use_constrained_decoding` | `bool` | `True` | Enable constrained decoding for structured LM output. |
396
+
397
+ ### CoT Generated Values
398
+
399
+ These fields are automatically populated by the LM when CoT reasoning is enabled:
400
+
401
+ | Parameter | Type | Default | Description |
402
+ |-----------|------|---------|-------------|
403
+ | `cot_bpm` | `Optional[int]` | `None` | LM-generated BPM value. |
404
+ | `cot_keyscale` | `str` | `""` | LM-generated key/scale. |
405
+ | `cot_timesignature` | `str` | `""` | LM-generated time signature. |
406
+ | `cot_duration` | `Optional[float]` | `None` | LM-generated duration. |
407
+ | `cot_vocal_language` | `str` | `"unknown"` | LM-detected vocal language. |
408
+ | `cot_caption` | `str` | `""` | LM-refined caption. |
409
+ | `cot_lyrics` | `str` | `""` | LM-generated/refined lyrics. |
410
+
411
+ ---
412
+
413
+ ## GenerationConfig Parameters
414
+
415
+ | Parameter | Type | Default | Description |
416
+ |-----------|------|---------|-------------|
417
+ | `batch_size` | `int` | `2` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
418
+ | `allow_lm_batch` | `bool` | `False` | Allow batch processing in LM. Faster when `batch_size >= 2` and `thinking=True`. |
419
+ | `use_random_seed` | `bool` | `True` | Whether to use random seed. `True` for different results each time, `False` for reproducible results. |
420
+ | `seeds` | `Optional[List[int]]` | `None` | Seeds for batch generation; a single int is also accepted. If fewer seeds than `batch_size` are provided, the list is padded with random seeds. |
421
+ | `lm_batch_chunk_size` | `int` | `8` | Maximum batch size per LM inference chunk (GPU memory constraint). |
422
+ | `constrained_decoding_debug` | `bool` | `False` | Enable debug logging for constrained decoding. |
423
+ | `audio_format` | `str` | `"flac"` | Output audio format. Options: `"mp3"`, `"wav"`, `"flac"`. Default is FLAC for fast saving. |
424
+
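The documented `seeds` padding behavior ("padded with random seeds if fewer than batch_size") can be sketched as a standalone helper. `resolve_seeds` is hypothetical and only mirrors the description above, not the library's internal implementation:

```python
import random

def resolve_seeds(batch_size: int, seeds=None) -> list:
    """Hypothetical sketch of the documented seed handling:
    accept a single int or a list, then pad with random seeds
    until there is one seed per batch item."""
    if seeds is None:
        seeds = []
    elif isinstance(seeds, int):
        seeds = [seeds]
    resolved = list(seeds)[:batch_size]
    while len(resolved) < batch_size:
        resolved.append(random.randint(0, 2**31 - 1))
    return resolved
```

For example, `resolve_seeds(4, [7, 8])` keeps the two fixed seeds and draws two random ones, so the first two batch items stay reproducible.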
425
+ ---
426
+
427
+ ## Task Types
428
+
429
+ ACE-Step supports 6 different generation task types, each optimized for specific use cases.
430
+
431
+ ### 1. Text2Music (Default)
432
+
433
+ **Purpose**: Generate music from text descriptions and optional metadata.
434
+
435
+ **Key Parameters**:
436
+ ```python
437
+ params = GenerationParams(
438
+ task_type="text2music",
439
+ caption="energetic rock music with electric guitar",
440
+ lyrics="[Instrumental]", # or actual lyrics
441
+ bpm=140,
442
+ duration=30,
443
+ )
444
+ ```
445
+
446
+ **Required**:
447
+ - `caption` or `lyrics` (at least one)
448
+
449
+ **Optional but Recommended**:
450
+ - `bpm`: Controls tempo
451
+ - `keyscale`: Controls musical key
452
+ - `timesignature`: Controls rhythm structure
453
+ - `duration`: Controls length
454
+ - `vocal_language`: Controls vocal characteristics
455
+
456
+ **Use Cases**:
457
+ - Generate music from text descriptions
458
+ - Create backing tracks from prompts
459
+ - Generate songs with lyrics
460
+
461
+ ---
462
+
463
+ ### 2. Cover
464
+
465
+ **Purpose**: Transform existing audio while maintaining structure but changing style/timbre.
466
+
467
+ **Key Parameters**:
468
+ ```python
469
+ params = GenerationParams(
470
+ task_type="cover",
471
+ src_audio="original_song.mp3",
472
+ caption="jazz piano version",
473
+ audio_cover_strength=0.8, # 0.0-1.0
474
+ )
475
+ ```
476
+
477
+ **Required**:
478
+ - `src_audio`: Path to source audio file
479
+ - `caption`: Description of desired style/transformation
480
+
481
+ **Optional**:
482
+ - `audio_cover_strength`: Controls influence of original audio
483
+ - `1.0`: Strong adherence to original structure
484
+ - `0.5`: Balanced transformation
485
+ - `0.1`: Loose interpretation
486
+ - `lyrics`: New lyrics (if changing vocals)
487
+
488
+ **Use Cases**:
489
+ - Create covers in different styles
490
+ - Change instrumentation while keeping melody
491
+ - Genre transformation
492
+
493
+ ---
494
+
495
+ ### 3. Repaint
496
+
497
+ **Purpose**: Regenerate a specific time segment of audio while keeping the rest unchanged.
498
+
499
+ **Key Parameters**:
500
+ ```python
501
+ params = GenerationParams(
502
+ task_type="repaint",
503
+ src_audio="original.mp3",
504
+ repainting_start=10.0, # seconds
505
+ repainting_end=20.0, # seconds
506
+ caption="smooth transition with piano solo",
507
+ )
508
+ ```
509
+
510
+ **Required**:
511
+ - `src_audio`: Path to source audio file
512
+ - `repainting_start`: Start time in seconds
513
+ - `repainting_end`: End time in seconds (use `-1` for end of file)
514
+ - `caption`: Description of desired content for repainted section
515
+
516
+ **Use Cases**:
517
+ - Fix specific sections of generated music
518
+ - Add variations to parts of a song
519
+ - Create smooth transitions
520
+ - Replace problematic segments
521
+
522
+ ---
523
+
524
+ ### 4. Lego (Base Model Only)
525
+
526
+ **Purpose**: Generate a specific instrument track in context of existing audio.
527
+
528
+ **Key Parameters**:
529
+ ```python
530
+ params = GenerationParams(
531
+ task_type="lego",
532
+ src_audio="backing_track.mp3",
533
+ instruction="Generate the guitar track based on the audio context:",
534
+ caption="lead guitar melody with bluesy feel",
535
+ repainting_start=0.0,
536
+ repainting_end=-1,
537
+ )
538
+ ```
539
+
540
+ **Required**:
541
+ - `src_audio`: Path to source/backing audio
542
+ - `instruction`: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...")
543
+ - `caption`: Description of desired track characteristics
544
+
545
+ **Available Tracks**:
546
+ - `"vocals"`, `"backing_vocals"`, `"drums"`, `"bass"`, `"guitar"`, `"keyboard"`
547
+ - `"percussion"`, `"strings"`, `"synth"`, `"fx"`, `"brass"`, `"woodwinds"`
548
+
549
+ **Use Cases**:
550
+ - Add specific instrument tracks
551
+ - Layer additional instruments over backing tracks
552
+ - Create multi-track compositions iteratively
553
+
554
+ ---
555
+
556
+ ### 5. Extract (Base Model Only)
557
+
558
+ **Purpose**: Extract/isolate a specific instrument track from mixed audio.
559
+
560
+ **Key Parameters**:
561
+ ```python
562
+ params = GenerationParams(
563
+ task_type="extract",
564
+ src_audio="full_mix.mp3",
565
+ instruction="Extract the vocals track from the audio:",
566
+ )
567
+ ```
568
+
569
+ **Required**:
570
+ - `src_audio`: Path to mixed audio file
571
+ - `instruction`: Must specify track to extract
572
+
573
+ **Available Tracks**: Same as Lego task
574
+
575
+ **Use Cases**:
576
+ - Stem separation
577
+ - Isolate specific instruments
578
+ - Create remixes
579
+ - Analyze individual tracks
580
+
581
+ ---
582
+
583
+ ### 6. Complete (Base Model Only)
584
+
585
+ **Purpose**: Complete/extend partial tracks with specified instruments.
586
+
587
+ **Key Parameters**:
588
+ ```python
589
+ params = GenerationParams(
590
+ task_type="complete",
591
+ src_audio="incomplete_track.mp3",
592
+ instruction="Complete the input track with drums, bass, guitar:",
593
+ caption="rock style completion",
594
+ )
595
+ ```
596
+
597
+ **Required**:
598
+ - `src_audio`: Path to incomplete/partial track
599
+ - `instruction`: Must specify which tracks to add
600
+ - `caption`: Description of desired style
601
+
602
+ **Use Cases**:
603
+ - Arrange incomplete compositions
604
+ - Add backing tracks
605
+ - Auto-complete musical ideas
606
+
607
+ ---
608
+
609
+ ## Helper Functions
610
+
611
+ ### understand_music
612
+
613
+ Analyze audio codes to extract metadata about the music.
614
+
615
+ ```python
616
+ from acestep.inference import understand_music
617
+
618
+ result = understand_music(
619
+ llm_handler=llm_handler,
620
+ audio_codes="<|audio_code_123|><|audio_code_456|>...",
621
+ temperature=0.85,
622
+ use_constrained_decoding=True,
623
+ )
624
+
625
+ if result.success:
626
+ print(f"Caption: {result.caption}")
627
+ print(f"Lyrics: {result.lyrics}")
628
+ print(f"BPM: {result.bpm}")
629
+ print(f"Key: {result.keyscale}")
630
+ print(f"Duration: {result.duration}s")
631
+ print(f"Language: {result.language}")
632
+ else:
633
+ print(f"Error: {result.error}")
634
+ ```
635
+
636
+ **Use Cases**:
637
+ - Analyze existing music
638
+ - Extract metadata from audio codes
639
+ - Reverse-engineer generation parameters
640
+
641
+ ---
642
+
643
+ ### create_sample
644
+
645
+ Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature.
646
+
647
+ ```python
648
+ from acestep.inference import create_sample
649
+
650
+ result = create_sample(
651
+ llm_handler=llm_handler,
652
+ query="a soft Bengali love song for a quiet evening",
653
+ instrumental=False,
654
+ vocal_language="bn", # Optional: constrain to Bengali
655
+ temperature=0.85,
656
+ )
657
+
658
+ if result.success:
659
+ print(f"Caption: {result.caption}")
660
+ print(f"Lyrics: {result.lyrics}")
661
+ print(f"BPM: {result.bpm}")
662
+ print(f"Duration: {result.duration}s")
663
+ print(f"Key: {result.keyscale}")
664
+ print(f"Is Instrumental: {result.instrumental}")
665
+
666
+ # Use with generate_music
667
+ params = GenerationParams(
668
+ caption=result.caption,
669
+ lyrics=result.lyrics,
670
+ bpm=result.bpm,
671
+ duration=result.duration,
672
+ keyscale=result.keyscale,
673
+ vocal_language=result.language,
674
+ )
675
+ else:
676
+ print(f"Error: {result.error}")
677
+ ```
678
+
679
+ **Parameters**:
680
+
681
+ | Parameter | Type | Default | Description |
682
+ |-----------|------|---------|-------------|
683
+ | `query` | `str` | required | Natural language description of desired music |
684
+ | `instrumental` | `bool` | `False` | Whether to generate instrumental music |
685
+ | `vocal_language` | `Optional[str]` | `None` | Constrain lyrics to specific language (e.g., "en", "zh", "bn") |
686
+ | `temperature` | `float` | `0.85` | Sampling temperature |
687
+ | `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) |
688
+ | `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) |
689
+ | `repetition_penalty` | `float` | `1.0` | Repetition penalty |
690
+ | `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |
691
+
692
+ ---
693
+
694
+ ### format_sample
695
+
696
+ Format and enhance user-provided caption and lyrics, generating structured metadata.
697
+
698
+ ```python
699
+ from acestep.inference import format_sample
700
+
701
+ result = format_sample(
702
+ llm_handler=llm_handler,
703
+ caption="Latin pop, reggaeton",
704
+ lyrics="[Verse 1]\nBailando en la noche...",
705
+ user_metadata={"bpm": 95}, # Optional: constrain specific values
706
+ temperature=0.85,
707
+ )
708
+
709
+ if result.success:
710
+ print(f"Enhanced Caption: {result.caption}")
711
+ print(f"Formatted Lyrics: {result.lyrics}")
712
+ print(f"BPM: {result.bpm}")
713
+ print(f"Duration: {result.duration}s")
714
+ print(f"Key: {result.keyscale}")
715
+ print(f"Detected Language: {result.language}")
716
+ else:
717
+ print(f"Error: {result.error}")
718
+ ```
719
+
720
+ **Parameters**:
721
+
722
+ | Parameter | Type | Default | Description |
723
+ |-----------|------|---------|-------------|
724
+ | `caption` | `str` | required | User's caption/description |
725
+ | `lyrics` | `str` | required | User's lyrics with structure tags |
726
+ | `user_metadata` | `Optional[Dict]` | `None` | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language) |
727
+ | `temperature` | `float` | `0.85` | Sampling temperature |
728
+ | `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) |
729
+ | `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) |
730
+ | `repetition_penalty` | `float` | `1.0` | Repetition penalty |
731
+ | `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |
732
+
733
+ ---
734
+
735
+ ## Complete Examples
736
+
737
+ ### Example 1: Simple Text-to-Music Generation
738
+
739
+ ```python
740
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music
741
+
742
+ params = GenerationParams(
743
+ task_type="text2music",
744
+ caption="calm ambient music with soft piano and strings",
745
+ duration=60,
746
+ bpm=80,
747
+ keyscale="C Major",
748
+ )
749
+
750
+ config = GenerationConfig(
751
+ batch_size=2, # Generate 2 variations
752
+ audio_format="flac",
753
+ )
754
+
755
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
756
+
757
+ if result.success:
758
+ for i, audio in enumerate(result.audios, 1):
759
+ print(f"Variation {i}: {audio['path']}")
760
+ ```
761
+
762
+ ### Example 2: Song Generation with Lyrics
763
+
764
+ ```python
765
+ params = GenerationParams(
766
+ task_type="text2music",
767
+ caption="pop ballad with emotional vocals",
768
+ lyrics="""Verse 1:
769
+ Walking down the street today
770
+ Thinking of the words you used to say
771
+ Everything feels different now
772
+ But I'll find my way somehow
773
+
774
+ [Chorus]
775
+ I'm moving on, I'm staying strong
776
+ This is where I belong
777
+ """,
778
+ vocal_language="en",
779
+ bpm=72,
780
+ duration=45,
781
+ )
782
+
783
+ config = GenerationConfig(batch_size=1)
784
+
785
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
786
+ ```
787
+
788
+ ### Example 3: Using Custom Timesteps
789
+
790
+ ```python
791
+ params = GenerationParams(
792
+ task_type="text2music",
793
+ caption="jazz fusion with complex harmonies",
794
+ # Custom 9-step schedule
795
+ timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
796
+ thinking=True,
797
+ )
798
+
799
+ config = GenerationConfig(batch_size=1)
800
+
801
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
802
+ ```
803
+
804
+ ### Example 4: Using Shift Parameter (Turbo Model)
805
+
806
+ ```python
807
+ params = GenerationParams(
808
+ task_type="text2music",
809
+ caption="upbeat electronic dance music",
810
+ inference_steps=8,
811
+ shift=3.0, # Recommended for turbo models
812
+ infer_method="ode",
813
+ )
814
+
815
+ config = GenerationConfig(batch_size=2)
816
+
817
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
818
+ ```
819
+
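The effect of `shift` can be illustrated numerically. Flow-matching samplers often implement it as the warp t' = s*t / (1 + (s - 1)*t) applied to a uniform schedule; whether ACE-Step uses exactly this form is an assumption, but the sketch shows why `shift=3.0` spends more steps at high noise levels:

```python
# Hypothetical sketch of a flow-matching "shift" warp -- an assumption
# about the mechanism, not code taken from the ACE-Step source.
def shift_timesteps(steps: int, shift: float) -> list[float]:
    """Warp a uniform 1.0 -> 0.0 schedule: t' = s*t / (1 + (s - 1)*t)."""
    ts = [1 - i / steps for i in range(steps + 1)]  # uniform 1.0 ... 0.0
    return [round(shift * t / (1 + (shift - 1) * t), 4) for t in ts]

print(shift_timesteps(4, 1.0))  # [1.0, 0.75, 0.5, 0.25, 0.0] (unchanged)
print(shift_timesteps(4, 3.0))  # [1.0, 0.9, 0.75, 0.5, 0.0] (pushed toward 1.0)
```

With `shift > 1` the schedule values are pulled toward 1.0, so more of the step budget is spent at high noise levels, which matters most when `inference_steps` is small.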
820
+ ### Example 5: Simple Mode with create_sample
821
+
822
+ ```python
823
+ from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music
824
+
825
+ # Step 1: Create sample from description
826
+ sample = create_sample(
827
+ llm_handler=llm_handler,
828
+ query="energetic K-pop dance track with catchy hooks",
829
+ vocal_language="ko",
830
+ )
831
+
832
+ if sample.success:
833
+ # Step 2: Generate music using the sample
834
+ params = GenerationParams(
835
+ caption=sample.caption,
836
+ lyrics=sample.lyrics,
837
+ bpm=sample.bpm,
838
+ duration=sample.duration,
839
+ keyscale=sample.keyscale,
840
+ vocal_language=sample.language,
841
+ thinking=True,
842
+ )
843
+
844
+ config = GenerationConfig(batch_size=2)
845
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
846
+ ```
847
+
848
+ ### Example 6: Format and Enhance User Input
849
+
850
+ ```python
851
+ from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music
852
+
853
+ # Step 1: Format user input
854
+ formatted = format_sample(
855
+ llm_handler=llm_handler,
856
+ caption="rock ballad",
857
+ lyrics="[Verse]\nIn the darkness I find my way...",
858
+ )
859
+
860
+ if formatted.success:
861
+ # Step 2: Generate with enhanced input
862
+ params = GenerationParams(
863
+ caption=formatted.caption,
864
+ lyrics=formatted.lyrics,
865
+ bpm=formatted.bpm,
866
+ duration=formatted.duration,
867
+ keyscale=formatted.keyscale,
868
+ thinking=True,
869
+ use_cot_metas=False, # Already formatted, skip metas CoT
870
+ )
871
+
872
+ config = GenerationConfig(batch_size=2)
873
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
874
+ ```
875
+
876
+ ### Example 7: Style Cover with LM Reasoning
877
+
878
+ ```python
879
+ params = GenerationParams(
880
+ task_type="cover",
881
+ src_audio="original_pop_song.mp3",
882
+ caption="orchestral symphonic arrangement",
883
+ audio_cover_strength=0.7,
884
+ thinking=True, # Enable LM for metadata
885
+ use_cot_metas=True,
886
+ )
887
+
888
+ config = GenerationConfig(batch_size=1)
889
+
890
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
891
+
892
+ # Access LM-generated metadata
893
+ if result.extra_outputs.get("lm_metadata"):
894
+ lm_meta = result.extra_outputs["lm_metadata"]
895
+ print(f"LM detected BPM: {lm_meta.get('bpm')}")
896
+ print(f"LM detected Key: {lm_meta.get('keyscale')}")
897
+ ```
898
+
899
+ ### Example 8: Batch Generation with Specific Seeds
900
+
901
+ ```python
902
+ params = GenerationParams(
903
+ task_type="text2music",
904
+ caption="epic cinematic trailer music",
905
+ )
906
+
907
+ config = GenerationConfig(
908
+ batch_size=4, # Generate 4 variations
909
+ seeds=[42, 123, 456], # Specify 3 seeds, 4th will be random
910
+ use_random_seed=False, # Use provided seeds
911
+ lm_batch_chunk_size=2, # Process 2 at a time (GPU memory)
912
+ )
913
+
914
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
915
+
916
+ if result.success:
917
+ print(f"Generated {len(result.audios)} variations")
918
+ for audio in result.audios:
919
+ print(f" Seed {audio['params']['seed']}: {audio['path']}")
920
+ ```
921
+
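The seed-padding behavior noted above (three seeds supplied for four variations) can be sketched as follows; `pad_seeds` is a hypothetical helper for illustration, not part of the `acestep` API:

```python
import random

# Hypothetical helper (not part of the acestep API): pad a user-provided
# seed list up to batch_size, filling the remainder with random seeds.
def pad_seeds(seeds: list[int], batch_size: int) -> list[int]:
    padded = list(seeds[:batch_size])
    while len(padded) < batch_size:
        padded.append(random.randint(0, 2**32 - 1))
    return padded

print(pad_seeds([42, 123, 456], 4))  # e.g. [42, 123, 456, <random>]
```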
922
+ ### Example 9: High-Quality Generation (Base Model)
923
+
924
+ ```python
925
+ params = GenerationParams(
926
+ task_type="text2music",
927
+ caption="intricate jazz fusion with complex harmonies",
928
+ inference_steps=64, # High quality
929
+ guidance_scale=8.0,
930
+ use_adg=True, # Adaptive Dual Guidance
931
+ cfg_interval_start=0.0,
932
+ cfg_interval_end=1.0,
933
+ shift=3.0, # Timestep shift
934
+ seed=42, # Reproducible results
935
+ )
936
+
937
+ config = GenerationConfig(
938
+ batch_size=1,
939
+ use_random_seed=False,
940
+ audio_format="wav", # Lossless format
941
+ )
942
+
943
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
944
+ ```
945
+
946
+ ### Example 10: Understand Audio from Codes
947
+
948
+ ```python
949
+ from acestep.inference import understand_music
950
+
951
+ # Analyze audio codes (e.g., from a previous generation)
952
+ result = understand_music(
953
+ llm_handler=llm_handler,
954
+ audio_codes="<|audio_code_10695|><|audio_code_54246|>...",
955
+ temperature=0.85,
956
+ )
957
+
958
+ if result.success:
959
+ print(f"Detected Caption: {result.caption}")
960
+ print(f"Detected Lyrics: {result.lyrics}")
961
+ print(f"Detected BPM: {result.bpm}")
962
+ print(f"Detected Key: {result.keyscale}")
963
+ print(f"Detected Duration: {result.duration}s")
964
+ print(f"Detected Language: {result.language}")
965
+ ```
966
+
967
+ ---
968
+
969
+ ## Best Practices
970
+
971
+ ### 1. Caption Writing
972
+
973
+ **Good Captions**:
974
+ ```python
975
+ # Specific and descriptive
976
+ caption="upbeat electronic dance music with heavy bass and synthesizer leads"
977
+
978
+ # Include mood and genre
979
+ caption="melancholic indie folk with acoustic guitar and soft vocals"
980
+
981
+ # Specify instruments
982
+ caption="jazz trio with piano, upright bass, and brush drums"
983
+ ```
984
+
985
+ **Avoid**:
986
+ ```python
987
+ # Too vague
988
+ caption="good music"
989
+
990
+ # Contradictory
991
+ caption="fast slow music" # Conflicting tempos
992
+ ```
993
+
994
+ ### 2. Parameter Tuning
995
+
996
+ **For Best Quality**:
997
+ - Use base model with `inference_steps=64` or higher
998
+ - Enable `use_adg=True`
999
+ - Set `guidance_scale=7.0-9.0`
1000
+ - Set `shift=3.0` for better timestep distribution
1001
+ - Use lossless audio format (`audio_format="wav"`)
1002
+
1003
+ **For Speed**:
1004
+ - Use turbo model with `inference_steps=8`
1005
+ - Disable ADG (`use_adg=False`)
1006
+ - Use `infer_method="ode"` (default)
1007
+ - Use compressed format (`audio_format="mp3"`) or default FLAC
1008
+
1009
+ **For Consistency**:
1010
+ - Set `use_random_seed=False` in config
1011
+ - Use fixed `seeds` list or single `seed` in params
1012
+ - Keep `lm_temperature` lower (0.7-0.85)
1013
+
1014
+ **For Diversity**:
1015
+ - Set `use_random_seed=True` in config
1016
+ - Increase `lm_temperature` (0.9-1.1)
1017
+ - Use `batch_size > 1` for variations
1018
+
1019
+ ### 3. Duration Guidelines
1020
+
1021
+ - **Instrumental**: 30-180 seconds works well
1022
+ - **With Lyrics**: Auto-detection recommended (set `duration=-1` or leave default)
1023
+ - **Short clips**: 10-20 seconds minimum
1024
+ - **Long form**: Up to 600 seconds (10 minutes) maximum
1025
+
1026
+ ### 4. LM Usage
1027
+
1028
+ **When to Enable LM (`thinking=True`)**:
1029
+ - Need automatic metadata detection
1030
+ - Want caption refinement
1031
+ - Generating from minimal input
1032
+ - Need diverse outputs
1033
+
1034
+ **When to Disable LM (`thinking=False`)**:
1035
+ - Have precise metadata already
1036
+ - Need faster generation
1037
+ - Want full control over parameters
1038
+
1039
+ ### 5. Batch Processing
1040
+
1041
+ ```python
1042
+ # Efficient batch generation
1043
+ config = GenerationConfig(
1044
+ batch_size=8, # Max supported
1045
+ allow_lm_batch=True, # Enable for speed (when thinking=True)
1046
+ lm_batch_chunk_size=4, # Adjust based on GPU memory
1047
+ )
1048
+ ```
1049
+
1050
+ ### 6. Error Handling
1051
+
1052
+ ```python
1053
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
1054
+
1055
+ if not result.success:
1056
+ print(f"Generation failed: {result.error}")
1057
+ print(f"Status: {result.status_message}")
1058
+ else:
1059
+ # Process successful result
1060
+ for audio in result.audios:
1061
+ path = audio['path']
1062
+ key = audio['key']
1063
+ seed = audio['params']['seed']
1064
+ # ... process audio files
1065
+ ```
1066
+
1067
+ ### 7. Memory Management
1068
+
1069
+ For large batch sizes or long durations:
1070
+ - Monitor GPU memory usage
1071
+ - Reduce `batch_size` if OOM errors occur
1072
+ - Reduce `lm_batch_chunk_size` for LM operations
1073
+ - Consider using `offload_to_cpu=True` during initialization
1074
+
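The "reduce `batch_size` if OOM errors occur" advice can be automated with a simple retry loop; `run_generation` below is a stand-in for your own wrapper around `generate_music`, not a library function:

```python
# Sketch of the "halve batch_size on OOM" strategy described above.
# run_generation is a stand-in for your own wrapper around generate_music.
def generate_with_backoff(run_generation, batch_size: int, min_batch: int = 1):
    while True:
        try:
            return run_generation(batch_size)
        except RuntimeError as exc:
            # Only retry on out-of-memory errors, and only while we can shrink.
            if "out of memory" not in str(exc).lower() or batch_size <= min_batch:
                raise
            batch_size = max(min_batch, batch_size // 2)

# Demo with a fake generator that only fits batch_size <= 2:
def fake_run(bs):
    if bs > 2:
        raise RuntimeError("CUDA out of memory")
    return f"ok:{bs}"

print(generate_with_backoff(fake_run, 8))  # "ok:2" after two retries
```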
1075
+ ### 8. Accessing Time Costs
1076
+
1077
+ ```python
1078
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
1079
+
1080
+ if result.success:
1081
+ time_costs = result.extra_outputs.get("time_costs", {})
1082
+ print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
1083
+ print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
1084
+ print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
1085
+ print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")
1086
+ ```
1087
+
1088
+ ---
1089
+
1090
+ ## Troubleshooting
1091
+
1092
+ ### Common Issues
1093
+
1094
+ **Issue**: Out of memory errors
1095
+ - **Solution**: Reduce `batch_size`, `inference_steps`, or enable CPU offloading
1096
+
1097
+ **Issue**: Poor quality results
1098
+ - **Solution**: Increase `inference_steps`, adjust `guidance_scale`, use base model
1099
+
1100
+ **Issue**: Results don't match prompt
1101
+ - **Solution**: Make caption more specific, increase `guidance_scale`, enable LM refinement (`thinking=True`)
1102
+
1103
+ **Issue**: Slow generation
1104
+ - **Solution**: Use turbo model, reduce `inference_steps`, disable ADG
1105
+
1106
+ **Issue**: LM not generating codes
1107
+ - **Solution**: Verify `llm_handler` is initialized, check `thinking=True` and `use_cot_metas=True`
1108
+
1109
+ **Issue**: Seeds not being respected
1110
+ - **Solution**: Set `use_random_seed=False` in config and provide `seeds` list or `seed` in params
1111
+
1112
+ **Issue**: Custom timesteps not working
1113
+ - **Solution**: Ensure timesteps are a list of floats from 1.0 to 0.0, properly ordered
1114
+
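A quick up-front check catches malformed schedules before generation; `validate_timesteps` is an illustrative helper, not part of the `acestep` package:

```python
# Illustrative validator (not part of acestep): custom timesteps must be
# numbers within [0.0, 1.0], strictly decreasing from high noise down to 0.
def validate_timesteps(timesteps) -> bool:
    if not timesteps or not all(isinstance(t, (int, float)) for t in timesteps):
        return False
    if not all(0.0 <= t <= 1.0 for t in timesteps):
        return False
    return all(a > b for a, b in zip(timesteps, timesteps[1:]))

print(validate_timesteps([0.97, 0.76, 0.5, 0.28, 0.0]))  # True
print(validate_timesteps([0.5, 0.76, 0.0]))              # False (not decreasing)
```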
1115
+ ---
1116
+
1117
+ ## API Reference Summary
1118
+
1119
+ ### GenerationParams Fields
1120
+
1121
+ See [GenerationParams Parameters](#generationparams-parameters) for complete documentation.
1122
+
1123
+ ### GenerationConfig Fields
1124
+
1125
+ See [GenerationConfig Parameters](#generationconfig-parameters) for complete documentation.
1126
+
1127
+ ### GenerationResult Fields
1128
+
1129
+ ```python
1130
+ @dataclass
1131
+ class GenerationResult:
1132
+ # Audio Outputs
1133
+ audios: List[Dict[str, Any]]
1134
+ # Each audio dict contains:
1135
+ # - "path": str (file path)
1136
+ # - "tensor": Tensor (audio data)
1137
+ # - "key": str (unique identifier)
1138
+ # - "sample_rate": int (48000)
1139
+ # - "params": Dict (generation params with seed, audio_codes, etc.)
1140
+
1141
+ # Generation Information
1142
+ status_message: str
1143
+ extra_outputs: Dict[str, Any]
1144
+ # extra_outputs contains:
1145
+ # - "lm_metadata": Dict (LM-generated metadata)
1146
+ # - "time_costs": Dict (timing information)
1147
+ # - "latents": Tensor (intermediate latents, if available)
1148
+ # - "masks": Tensor (attention masks, if available)
1149
+
1150
+ # Success Status
1151
+ success: bool
1152
+ error: Optional[str]
1153
+ ```
1154
+
1155
+ ---
1156
+
1157
+ ## Version History
1158
+
1159
+ - **v1.5.2**: Current version
1160
+ - Added `shift` parameter for timestep shifting
1161
+ - Added `infer_method` parameter for ODE/SDE selection
1162
+ - Added `timesteps` parameter for custom timestep schedules
1163
+ - Added `understand_music()` function for audio analysis
1164
+ - Added `create_sample()` function for simple mode generation
1165
+ - Added `format_sample()` function for input enhancement
1166
+ - Added `UnderstandResult`, `CreateSampleResult`, `FormatSampleResult` dataclasses
1167
+
1168
+ - **v1.5.1**: Previous version
1169
+ - Split `GenerationConfig` into `GenerationParams` and `GenerationConfig`
1170
+ - Renamed parameters for consistency (`key_scale` → `keyscale`, `time_signature` → `timesignature`, `audio_duration` → `duration`, `use_llm_thinking` → `thinking`, `audio_code_string` → `audio_codes`)
1171
+ - Added `instrumental` parameter
1172
+ - Added `use_constrained_decoding` parameter
1173
+ - Added CoT auto-filled fields (`cot_*`)
1174
+ - Changed default `audio_format` to "flac"
1175
+ - Changed default `batch_size` to 2
1176
+ - Changed default `thinking` to True
1177
+ - Simplified `GenerationResult` structure with unified `audios` list
1178
+ - Added unified `time_costs` in `extra_outputs`
1179
+
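Code written against v1.5.0 can apply the renames listed above mechanically; this migration helper is illustrative and not shipped with the package:

```python
# Illustrative v1.5.0 -> v1.5.1 key migration for the renames listed above.
# Not shipped with acestep -- adapt to your own call sites.
V151_RENAMES = {
    "key_scale": "keyscale",
    "time_signature": "timesignature",
    "audio_duration": "duration",
    "use_llm_thinking": "thinking",
    "audio_code_string": "audio_codes",
}

def migrate_params(old: dict) -> dict:
    return {V151_RENAMES.get(k, k): v for k, v in old.items()}

print(migrate_params({"key_scale": "C Major", "audio_duration": 60}))
# {'keyscale': 'C Major', 'duration': 60}
```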
1180
+ - **v1.5**: Initial version
1181
+ - Introduced `GenerationConfig` and `GenerationResult` dataclasses
1182
+ - Simplified parameter passing
1183
+ - Added comprehensive documentation
1184
+
1185
+ ---
1186
+
1187
+ For more information, see:
1188
+ - Main README: [`../../README.md`](../../README.md)
1189
+ - REST API Documentation: [`API.md`](API.md)
1190
+ - Gradio Demo Guide: [`GRADIO_GUIDE.md`](GRADIO_GUIDE.md)
1191
+ - Project repository: [ACE-Step-1.5](https://github.com/yourusername/ACE-Step-1.5)
.claude/skills/acestep-docs/guides/SCRIPT_CONFIGURATION.md ADDED
@@ -0,0 +1,615 @@
1
+ # Launch Script Configuration Guide
2
+
3
+ This guide explains how to configure the startup scripts for ACE-Step across all supported platforms: Windows (.bat), Linux (.sh), and macOS (.sh).
4
+
5
+ > **Note for uv/Python users**: If you're using `uv run acestep` or running Python directly (not using launch scripts), configure settings via the `.env` file instead. See [ENVIRONMENT_SETUP.md](ENVIRONMENT_SETUP.md#environment-variables-env) for details.
6
+
7
+ ## How to Modify
8
+
9
+ All configurable options are variables at the top of each script. Open the script with any text editor and modify the values.
10
+
11
+ **Windows (.bat)**:
12
+ - Set a variable: `set VARIABLE=value`
13
+ - Comment out a line: `REM set VARIABLE=value`
14
+ - Uncomment a line: Remove the leading `REM`
15
+
16
+ **Linux/macOS (.sh)**:
17
+ - Set a variable: `VARIABLE="value"`
18
+ - Comment out a line: `# VARIABLE="value"`
19
+ - Uncomment a line: Remove the leading `#`
20
+
21
+ ---
22
+
23
+ ## Available Launch Scripts
24
+
25
+ | Platform | Script | Purpose |
26
+ |----------|--------|---------|
27
+ | Windows (NVIDIA) | `start_gradio_ui.bat` | Gradio Web UI |
28
+ | Windows (NVIDIA) | `start_api_server.bat` | REST API Server |
29
+ | Windows (AMD ROCm) | `start_gradio_ui_rocm.bat` | Gradio Web UI for AMD GPUs |
30
+ | Windows (AMD ROCm) | `start_api_server_rocm.bat` | REST API Server for AMD GPUs |
31
+ | Linux (CUDA) | `start_gradio_ui.sh` | Gradio Web UI |
32
+ | Linux (CUDA) | `start_api_server.sh` | REST API Server |
33
+ | macOS (Apple Silicon) | `start_gradio_ui_macos.sh` | Gradio Web UI (MLX backend) |
34
+ | macOS (Apple Silicon) | `start_api_server_macos.sh` | REST API Server (MLX backend) |
35
+
36
+ ---
37
+
38
+ ## Configuration Sections
39
+
40
+ ### 1. UI Language
41
+
42
+ Controls the language displayed in the Gradio Web UI.
43
+
44
+ **Options**: `en` (English), `zh` (Chinese), `he` (Hebrew), `ja` (Japanese)
45
+
46
+ **Windows (.bat)**:
47
+ ```batch
48
+ REM UI language: en, zh, he, ja
49
+ set LANGUAGE=en
50
+ ```
51
+
52
+ **Linux/macOS (.sh)**:
53
+ ```bash
54
+ # UI language: en, zh, he, ja
55
+ LANGUAGE="en"
56
+ ```
57
+
58
+ **Example -- switch to Chinese**:
59
+
60
+ | Platform | Setting |
61
+ |----------|---------|
62
+ | Windows | `set LANGUAGE=zh` |
63
+ | Linux/macOS | `LANGUAGE="zh"` |
64
+
65
+ > **Note**: The `LANGUAGE` variable is only available in Gradio UI scripts. API server scripts do not have a UI language setting.
66
+
67
+ ---
68
+
69
+ ### 2. Server Port
70
+
71
+ Controls which port the server listens on and which address it binds to.
72
+
73
+ **Gradio UI scripts**:
74
+
75
+ | Platform | Default Port | Default Address |
76
+ |----------|-------------|-----------------|
77
+ | Windows | `7860` | `127.0.0.1` |
78
+ | Linux | `7860` | `127.0.0.1` |
79
+ | macOS | `7860` | `127.0.0.1` |
80
+
81
+ **Windows (.bat)** -- Gradio UI:
82
+ ```batch
83
+ REM Server settings
84
+ set PORT=7860
85
+ set SERVER_NAME=127.0.0.1
86
+ REM set SERVER_NAME=0.0.0.0
87
+ REM set SHARE=--share
88
+ ```
89
+
90
+ **Linux/macOS (.sh)** -- Gradio UI:
91
+ ```bash
92
+ # Server settings
93
+ PORT=7860
94
+ SERVER_NAME="127.0.0.1"
95
+ # SERVER_NAME="0.0.0.0"
96
+ SHARE=""
97
+ # SHARE="--share"
98
+ ```
99
+
100
+ **API Server scripts**:
101
+
102
+ | Platform | Default Port | Default Host |
103
+ |----------|-------------|--------------|
104
+ | Windows | `8001` | `127.0.0.1` |
105
+ | Linux | `8001` | `127.0.0.1` |
106
+ | macOS | `8001` | `127.0.0.1` |
107
+
108
+ **Windows (.bat)** -- API Server:
109
+ ```batch
110
+ set HOST=127.0.0.1
111
+ set PORT=8001
112
+ ```
113
+
114
+ **Linux/macOS (.sh)** -- API Server:
115
+ ```bash
116
+ HOST="127.0.0.1"
117
+ PORT=8001
118
+ ```
119
+
120
+ **Default URLs**:
121
+ - Gradio UI: http://127.0.0.1:7860
122
+ - API Server: http://127.0.0.1:8001
123
+ - API Documentation: http://127.0.0.1:8001/docs
124
+
125
+ **To expose to the network** (allow access from other devices):
126
+ - Set `SERVER_NAME` or `HOST` to `0.0.0.0`
127
+ - Or enable `SHARE` for Gradio's public sharing link
128
+
129
+ ---
130
+
131
+ ### 3. Download Source
132
+
133
+ Controls where model files are downloaded from. Affects all scripts that download models.
134
+
135
+ **Windows (.bat)**:
136
+ ```batch
137
+ REM Download source: auto (default), huggingface, or modelscope
138
+ REM set DOWNLOAD_SOURCE=--download-source modelscope
139
+ REM set DOWNLOAD_SOURCE=--download-source huggingface
140
+ set DOWNLOAD_SOURCE=
141
+ ```
142
+
143
+ **Linux/macOS (.sh)**:
144
+ ```bash
145
+ # Download source: auto (default), huggingface, or modelscope
146
+ DOWNLOAD_SOURCE=""
147
+ # DOWNLOAD_SOURCE="--download-source modelscope"
148
+ # DOWNLOAD_SOURCE="--download-source huggingface"
149
+ ```
150
+
151
+ **Options**:
152
+
153
+ | Value | When to Use | Speed |
154
+ |-------|-------------|-------|
155
+ | (empty) or `auto` | Auto-detect network | Automatic |
156
+ | `modelscope` | China mainland users | Fast in China |
157
+ | `huggingface` | Overseas users | Fast outside China |
158
+
159
+ **How auto-detection works**:
160
+ 1. Tests Google connectivity
161
+ - Can access Google --> uses HuggingFace Hub
162
+ - Cannot access Google --> uses ModelScope
163
+ 2. If primary source fails, falls back to the alternate source
164
+
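The auto-detection steps above can be sketched in Python; this is an illustration of the described behavior, not the launcher's actual code:

```python
import socket

# Illustrative sketch of the auto-detection described above (not the actual
# launcher code). An explicit override always wins; otherwise a quick
# connectivity probe to Google chooses between the two hubs.
def pick_download_source(override: str = "auto") -> str:
    if override in ("huggingface", "modelscope"):
        return override
    try:
        socket.create_connection(("www.google.com", 443), timeout=3).close()
        return "huggingface"  # Google reachable -> HuggingFace Hub
    except OSError:
        return "modelscope"   # Google unreachable -> ModelScope

print(pick_download_source("modelscope"))  # override wins: "modelscope"
```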
165
+ **Examples**:
166
+
167
+ | Platform | China Users | Overseas Users |
168
+ |----------|-------------|----------------|
169
+ | Windows | `set DOWNLOAD_SOURCE=--download-source modelscope` | `set DOWNLOAD_SOURCE=--download-source huggingface` |
170
+ | Linux/macOS | `DOWNLOAD_SOURCE="--download-source modelscope"` | `DOWNLOAD_SOURCE="--download-source huggingface"` |
171
+
172
+ ---
173
+
174
+ ### 4. Update Check
175
+
176
+ Controls whether the script checks GitHub for updates before launching.
177
+
178
+ **Default**: `true` (enabled)
179
+
180
+ **Windows (.bat)**:
181
+ ```batch
182
+ REM Update check on startup (set to false to disable)
183
+ set CHECK_UPDATE=true
184
+ REM set CHECK_UPDATE=false
185
+ ```
186
+
187
+ **Linux/macOS (.sh)**:
188
+ ```bash
189
+ # Update check on startup (set to "false" to disable)
190
+ CHECK_UPDATE="true"
191
+ # CHECK_UPDATE="false"
192
+ ```
193
+
194
+ **Git detection by platform**:
195
+
196
+ | Platform | Git Resolution |
197
+ |----------|---------------|
198
+ | Windows | Tries `PortableGit\bin\git.exe` first, then falls back to system `git` (e.g., Git for Windows) |
199
+ | Linux | Uses system `git` |
200
+ | macOS | Uses system `git` (Xcode Command Line Tools or Homebrew) |
201
+
202
+ > **Important**: On Windows, PortableGit is no longer strictly required. If you have Git for Windows installed system-wide, the update check will find it automatically.
203
+
204
+ **Behavior when enabled**:
205
+ 1. Fetches the latest commits from GitHub with a 10-second timeout
206
+ 2. Compares local commit hash against remote
207
+ 3. If an update is available, shows new commits and prompts `Y/N`
208
+ 4. If the network is unreachable or the fetch times out, automatically skips and continues startup
209
+
210
+ **Timeout handling by platform**:
211
+ - Linux: Uses `timeout` command (10 seconds)
212
+ - macOS: Uses `gtimeout` (from coreutils) or `timeout` if available, otherwise runs without timeout
213
+ - Windows: Network-level timeout via `git fetch`
214
+
215
+ See [UPDATE_AND_BACKUP.md](UPDATE_AND_BACKUP.md) for full details on the update process and file backup.
216
+
217
+ ---
218
+
219
+ ### 5. Model Configuration
220
+
221
+ Controls which DiT model and Language Model (LM) are loaded.
222
+
223
+ **Windows (.bat)** -- Gradio UI:
224
+ ```batch
225
+ REM Model settings
226
+ set CONFIG_PATH=--config_path acestep-v15-turbo
227
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-0.6B
228
+ REM set OFFLOAD_TO_CPU=--offload_to_cpu true
229
+ ```
230
+
231
+ **Linux/macOS (.sh)** -- Gradio UI:
232
+ ```bash
233
+ # Model settings
234
+ CONFIG_PATH="--config_path acestep-v15-turbo"
235
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-0.6B"
236
+ # OFFLOAD_TO_CPU="--offload_to_cpu true"
237
+ OFFLOAD_TO_CPU=""
238
+ ```
239
+
240
+ **API Server** -- Windows (.bat):
241
+ ```batch
242
+ REM LM model path (optional, only used when LLM is enabled)
243
+ REM set LM_MODEL_PATH=--lm-model-path acestep-5Hz-lm-0.6B
244
+ ```
245
+
246
+ **API Server** -- Linux/macOS (.sh):
247
+ ```bash
248
+ # LM model path (optional, only used when LLM is enabled)
249
+ LM_MODEL_PATH=""
250
+ # LM_MODEL_PATH="--lm-model-path acestep-5Hz-lm-0.6B"
251
+ ```
252
+
253
+ > **Note**: The API server uses `--lm-model-path` (hyphens) while the Gradio UI uses `--lm_model_path` (underscores).
254
+
255
+ **Available DiT Models**:
256
+
257
+ | Model | Description |
258
+ |-------|-------------|
259
+ | `acestep-v15-turbo` | Default turbo model (8 steps, no CFG) |
260
+ | `acestep-v15-base` | Base model (50 steps, with CFG, high diversity) |
261
+ | `acestep-v15-sft` | SFT model (50 steps, with CFG, high quality) |
262
+ | `acestep-v15-turbo-shift1` | Turbo with shift1 |
263
+ | `acestep-v15-turbo-shift3` | Turbo with shift3 |
264
+ | `acestep-v15-turbo-continuous` | Turbo with continuous shift (1-5) |
265
+
266
+ **Available Language Models**:
267
+
268
+ | LM Model | Size | Quality |
269
+ |----------|------|---------|
270
+ | `acestep-5Hz-lm-0.6B` | 0.6B | Standard |
271
+ | `acestep-5Hz-lm-1.7B` | 1.7B | Better |
272
+ | `acestep-5Hz-lm-4B` | 4B | Best (requires more VRAM/RAM) |
273
+
274
+ **CPU Offload**: Enable `OFFLOAD_TO_CPU` when using larger models (especially 4B) on GPUs with limited VRAM. Models shuttle between CPU and GPU as needed, adding ~8-10s overhead per generation but preventing VRAM oversubscription.
275
+
276
+ ---
277
+
278
+ ### 6. LLM Initialization Control
279
+
280
+ Controls whether the Language Model (5Hz LM) is initialized at startup. By default, LLM is automatically enabled or disabled based on GPU VRAM:
281
+ - **<=6GB VRAM**: LLM disabled (DiT-only mode)
282
+ - **>6GB VRAM**: LLM enabled
283
+
284
+ **Processing Flow:**
285
+ ```
286
+ GPU Detection (full) --> ACESTEP_INIT_LLM / INIT_LLM Override --> Model Loading
287
+ ```
288
+
289
+ GPU optimizations (offload, quantization, batch limits) are **always applied** regardless of this setting. The override only controls whether to attempt LLM loading.
290
+
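The override precedence can be summarized as a small decision function; this is an illustration of the documented behavior, not the actual detection code (which also applies offload, quantization, and batch limits):

```python
# Illustrative decision logic for the VRAM-based default described above.
# Not the actual detection code -- GPU optimizations are applied separately.
def should_init_llm(vram_gb: float, override: str = "auto") -> bool:
    if override == "true":
        return True   # force enable (may OOM on low-VRAM GPUs)
    if override == "false":
        return False  # pure DiT mode
    return vram_gb > 6  # auto: LLM only when more than 6GB VRAM

print(should_init_llm(4))          # False (<=6GB -> DiT-only mode)
print(should_init_llm(4, "true"))  # True  (forced)
print(should_init_llm(12))         # True
```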
291
+ **Gradio UI** -- Windows (.bat):
292
+ ```batch
293
+ REM LLM initialization: auto (default), true, false
294
+ REM set INIT_LLM=--init_llm auto
295
+ REM set INIT_LLM=--init_llm true
296
+ REM set INIT_LLM=--init_llm false
297
+ ```
298
+
299
+ **Gradio UI** -- Linux/macOS (.sh):
300
+ ```bash
301
+ # LLM initialization: auto (default), true, false
302
+ INIT_LLM=""
303
+ # INIT_LLM="--init_llm auto"
304
+ # INIT_LLM="--init_llm true"
305
+ # INIT_LLM="--init_llm false"
306
+ ```
307
+
308
+ **API Server** -- Windows (.bat):
309
+ ```batch
310
+ REM Values: auto (default), true (force enable), false (force disable)
311
+ REM set ACESTEP_INIT_LLM=auto
312
+ REM set ACESTEP_INIT_LLM=true
313
+ REM set ACESTEP_INIT_LLM=false
314
+ ```
315
+
316
+ **API Server** -- Linux/macOS (.sh):
317
+ ```bash
318
+ # Values: auto (default), true (force enable), false (force disable)
319
+ # export ACESTEP_INIT_LLM=auto
320
+ # export ACESTEP_INIT_LLM=true
321
+ # export ACESTEP_INIT_LLM=false
322
+ ```
323
+
324
+ > **Note**: Gradio UI scripts use `--init_llm` as a command-line argument. API server scripts use the `ACESTEP_INIT_LLM` environment variable.
325
+
326
+ **When to use**:
327
+
328
+ | Setting | Use Case |
329
+ |---------|----------|
330
+ | `auto` (default) | Let GPU detection decide (recommended) |
331
+ | `true` | Force LLM on low VRAM GPU (GPU optimizations still applied, may cause OOM) |
332
+ | `false` | Pure DiT mode for faster generation, no LLM features |
333
+
334
+ **Features affected by LLM**:
335
+ - **Thinking mode**: LLM generates audio codes for better quality
336
+ - **Chain-of-Thought (CoT)**: Auto-enhance captions, detect language, generate metadata
337
+ - **Sample mode**: Generate random songs from descriptions
338
+ - **Format mode**: Enhance user input via LLM
339
+
340
+ When LLM is disabled, these features are automatically disabled, and generation uses pure DiT mode.
341
+
342
+ ---
343
+
344
+ ## Complete Configuration Examples
345
+
346
+ ### Chinese Users
347
+
348
+ **Windows (.bat)** -- `start_gradio_ui.bat`:
349
+ ```batch
350
+ REM UI language
351
+ set LANGUAGE=zh
352
+
353
+ REM Server port
354
+ set PORT=7860
355
+ set SERVER_NAME=127.0.0.1
356
+
357
+ REM Download source
358
+ set DOWNLOAD_SOURCE=--download-source modelscope
359
+
360
+ REM Update check
361
+ set CHECK_UPDATE=true
362
+
363
+ REM Model settings
364
+ set CONFIG_PATH=--config_path acestep-v15-turbo
365
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-0.6B
366
+ ```
367
+
368
+ **Linux (.sh)** -- `start_gradio_ui.sh`:
369
+ ```bash
370
+ # UI language
371
+ LANGUAGE="zh"
372
+
373
+ # Server port
374
+ PORT=7860
375
+ SERVER_NAME="127.0.0.1"
376
+
377
+ # Download source
378
+ DOWNLOAD_SOURCE="--download-source modelscope"
379
+
380
+ # Update check
381
+ CHECK_UPDATE="true"
382
+
383
+ # Model settings
384
+ CONFIG_PATH="--config_path acestep-v15-turbo"
385
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-0.6B"
386
+ ```
387
+
388
+ ---
389
+
390
+ ### Overseas Users
391
+
392
+ **Windows (.bat)** -- `start_gradio_ui.bat`:
393
+ ```batch
394
+ REM UI language
395
+ set LANGUAGE=en
396
+
397
+ REM Server port
398
+ set PORT=7860
399
+ set SERVER_NAME=127.0.0.1
400
+
401
+ REM Download source
402
+ set DOWNLOAD_SOURCE=--download-source huggingface
403
+
404
+ REM Update check
405
+ set CHECK_UPDATE=true
406
+
407
+ REM Model settings
408
+ set CONFIG_PATH=--config_path acestep-v15-turbo
409
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-1.7B
410
+ ```
411
+
412
+ **Linux (.sh)** -- `start_gradio_ui.sh`:
413
+ ```bash
414
+ # UI language
415
+ LANGUAGE="en"
416
+
417
+ # Server port
418
+ PORT=7860
419
+ SERVER_NAME="127.0.0.1"
420
+
421
+ # Download source
422
+ DOWNLOAD_SOURCE="--download-source huggingface"
423
+
424
+ # Update check
425
+ CHECK_UPDATE="true"
426
+
427
+ # Model settings
428
+ CONFIG_PATH="--config_path acestep-v15-turbo"
429
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-1.7B"
430
+ ```
431
+
432
+ ---
433
+
434
+ ### macOS Users (Apple Silicon / MLX)
435
+
436
+ **`start_gradio_ui_macos.sh`**:
437
+ ```bash
438
+ # MLX backend is set automatically by the script:
439
+ # export ACESTEP_LM_BACKEND="mlx"
440
+
441
+ # UI language
442
+ LANGUAGE="en"
443
+
444
+ # Server port
445
+ PORT=7860
446
+ SERVER_NAME="127.0.0.1"
447
+
448
+ # Download source (HuggingFace recommended outside China)
449
+ DOWNLOAD_SOURCE="--download-source huggingface"
450
+
451
+ # Update check
452
+ CHECK_UPDATE="true"
453
+
454
+ # Model settings
455
+ CONFIG_PATH="--config_path acestep-v15-turbo"
456
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-0.6B"
457
+
458
+ # MLX backend (set automatically, do not change)
459
+ BACKEND="--backend mlx"
460
+
461
+ # CPU offload (enable for models larger than 0.6B on limited memory)
462
+ OFFLOAD_TO_CPU=""
463
+ # OFFLOAD_TO_CPU="--offload_to_cpu true"
464
+ ```
465
+
466
+ > **Note**: The macOS scripts automatically detect Apple Silicon (arm64). On Intel Macs, the MLX backend is unavailable and the script falls back to the PyTorch backend.
467
+
468
+ ---
469
+
470
+ ## ROCm Configuration
471
+
472
+ The `start_gradio_ui_rocm.bat` and `start_api_server_rocm.bat` scripts include additional settings specific to AMD GPUs running ROCm on Windows.
473
+
474
+ ### ROCm-Specific Variables
475
+
476
+ ```batch
477
+ REM ==================== ROCm Configuration ====================
478
+ REM Force PyTorch LM backend (bypasses nano-vllm flash_attn dependency)
479
+ set ACESTEP_LM_BACKEND=pt
480
+
481
+ REM RDNA3 GPU architecture override
482
+ set HSA_OVERRIDE_GFX_VERSION=11.0.0
483
+
484
+ REM Disable torch.compile Triton backend (not available on ROCm Windows)
485
+ set TORCH_COMPILE_BACKEND=eager
486
+
487
+ REM MIOpen: fast heuristic kernel selection instead of exhaustive benchmarking
488
+ set MIOPEN_FIND_MODE=FAST
489
+
490
+ REM HuggingFace tokenizer parallelism
491
+ set TOKENIZERS_PARALLELISM=false
492
+ ```
493
+
494
+ **Variable details**:
495
+
496
+ | Variable | Purpose | Common Values |
497
+ |----------|---------|---------------|
498
+ | `ACESTEP_LM_BACKEND` | Forces PyTorch backend instead of vLLM | `pt` (required for ROCm) |
499
+ | `HSA_OVERRIDE_GFX_VERSION` | Overrides GPU architecture for ROCm compatibility | `11.0.0` (gfx1100, RX 7900 XT/XTX), `11.0.1` (gfx1101, RX 7700/7800 XT), `11.0.2` (gfx1102, RX 7600) |
500
+ | `TORCH_COMPILE_BACKEND` | Sets the torch.compile backend | `eager` (required, Triton unavailable on ROCm Windows) |
501
+ | `MIOPEN_FIND_MODE` | Controls MIOpen kernel selection strategy | `FAST` (recommended; prevents first-run hangs on VAE decode) |
502
+ | `TOKENIZERS_PARALLELISM` | Controls HuggingFace tokenizer parallelism | `false` (suppresses warnings) |
503
+
504
+ **ROCm model settings**:
505
+
506
+ ```batch
507
+ REM Model settings (ROCm)
508
+ set CONFIG_PATH=--config_path acestep-v15-turbo
509
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-4B
510
+
511
+ REM CPU offload: required for 4B LM on GPUs with <=20GB VRAM
512
+ set OFFLOAD_TO_CPU=--offload_to_cpu true
513
+
514
+ REM LM backend: pt (PyTorch) recommended for ROCm
515
+ set BACKEND=--backend pt
516
+ ```
517
+
518
+ **ROCm virtual environment**:
519
+
520
+ The ROCm script uses a separate virtual environment (`venv_rocm`) instead of the standard `.venv` or `python_embeded`:
521
+ ```batch
522
+ set VENV_DIR=%~dp0venv_rocm
523
+ ```
524
+
525
+ > **Note**: The ROCm script requires a separate Python environment with ROCm-compatible PyTorch installed. See `requirements-rocm.txt` for setup instructions.
526
+
527
+ ---
528
+
529
+ ## Troubleshooting
530
+
531
+ ### Changes not taking effect
532
+
533
+ **Solution**: Save the file and restart the script. Changes only apply on the next launch.
534
+
535
+ Windows:
536
+ ```batch
537
+ REM Close current process (Ctrl+C), then run again
538
+ start_gradio_ui.bat
539
+ ```
540
+
541
+ Linux/macOS:
542
+ ```bash
543
+ # Close current process (Ctrl+C), then run again
544
+ ./start_gradio_ui.sh
545
+ ```
546
+
547
+ ### Model download is slow
548
+
549
+ **For Chinese users** -- set ModelScope:
550
+
551
+ | Platform | Setting |
552
+ |----------|---------|
553
+ | Windows | `set DOWNLOAD_SOURCE=--download-source modelscope` |
554
+ | Linux/macOS | `DOWNLOAD_SOURCE="--download-source modelscope"` |
555
+
556
+ **For overseas users** -- set HuggingFace:
557
+
558
+ | Platform | Setting |
559
+ |----------|---------|
560
+ | Windows | `set DOWNLOAD_SOURCE=--download-source huggingface` |
561
+ | Linux/macOS | `DOWNLOAD_SOURCE="--download-source huggingface"` |
562
+
563
+ ### Wrong language displayed
564
+
565
+ Verify the `LANGUAGE` variable in your Gradio UI script:
566
+
567
+ | Platform | Chinese | English |
568
+ |----------|---------|---------|
569
+ | Windows | `set LANGUAGE=zh` | `set LANGUAGE=en` |
570
+ | Linux/macOS | `LANGUAGE="zh"` | `LANGUAGE="en"` |
571
+
572
+ ### Port already in use
573
+
574
+ **Error**: `Address already in use`
575
+
576
+ **Solution 1**: Change the port number.
577
+
578
+ | Platform | Setting |
579
+ |----------|---------|
580
+ | Windows | `set PORT=7861` |
581
+ | Linux/macOS | `PORT=7861` |
582
+
583
+ **Solution 2**: Find and close the process using the port.
584
+
585
+ Windows:
586
+ ```batch
587
+ REM Find process using port 7860
588
+ netstat -ano | findstr :7860
589
+
590
+ REM Kill process (replace <PID> with the actual process ID)
591
+ taskkill /PID <PID> /F
592
+ ```
593
+
594
+ Linux/macOS:
595
+ ```bash
596
+ # Find process using port 7860
597
+ lsof -i :7860
598
+
599
+ # Kill process (replace <PID> with the actual process ID)
600
+ kill <PID>
601
+ ```
602
+
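A third option is to probe upward from 7860 until a free port is found. This is a sketch (bash-only, using the `/dev/tcp` pseudo-device instead of `lsof`); set the result as `PORT` in your launch script:

```bash
#!/usr/bin/env bash
# Probe ports upward from 7860 until one is free (bash /dev/tcp, no lsof needed).
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

PORT=7860
while port_in_use "$PORT"; do
  PORT=$((PORT + 1))
done
echo "Free port: $PORT"
```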
603
+ ---
604
+
605
+ ## Best Practices
606
+
607
+ 1. **Backup before editing**: Make a copy of the script before modifying it.
608
+ - Windows: `copy start_gradio_ui.bat start_gradio_ui.bat.backup`
609
+ - Linux/macOS: `cp start_gradio_ui.sh start_gradio_ui.sh.backup`
610
+
611
+ 2. **Use comments to document your changes**: Add a note explaining why you changed a value so you remember later.
612
+ - Windows: `REM Changed to port 8080 for testing`
613
+ - Linux/macOS: `# Changed to port 8080 for testing`
614
+
615
+ 3. **Test after changes**: Save the file, close any running instance, re-launch the script, and verify the changes took effect.
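Combining points 1 and 3, a timestamped backup keeps every edit generation around instead of overwriting a single `.backup` copy. A minimal sketch (function name is illustrative; shown for the Linux/macOS script, adjust the filename for `.bat`):

```bash
#!/usr/bin/env bash
# Keep a dated backup so successive edits never clobber each other.
backup_script() {
  local stamp
  stamp="$(date +%Y%m%d_%H%M%S)"
  cp "$1" "$1.backup_${stamp}" && echo "$1.backup_${stamp}"
}

# Usage: backup_script start_gradio_ui.sh   # prints the backup filename
```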
.claude/skills/acestep-docs/guides/UPDATE_AND_BACKUP.md ADDED
@@ -0,0 +1,496 @@
1
+ # Update and Backup Guide
2
+
3
+ ## Overview
4
+
5
+ All ACE-Step launch scripts check for updates on startup by default. The update check is a lightweight inline operation that runs before the application starts, ensuring you are always notified about new versions without any manual setup.
6
+
7
+ - **Default behavior**: Update checking is enabled (`CHECK_UPDATE=true`) in every launch script.
8
+ - **Platforms supported**: Windows, Linux, and macOS.
9
+ - **Graceful failures**: If git is not installed, the network is unreachable, or the project is not a git repository, the check is skipped (with a brief console note) and the application starts normally.
10
+ - **User control**: You can disable the check at any time by setting `CHECK_UPDATE=false`.
11
+
12
+ ---
13
+
14
+ ## Update Check Feature
15
+
16
+ ### How It Works
17
+
18
+ Each launch script contains a lightweight inline update check that runs before the main application starts. The check does not require any external update service -- it uses git directly to compare your local commit with the remote.
19
+
20
+ **Flow:**
21
+
22
+ ```text
23
+ Startup
24
+ |
25
+ v
26
+ CHECK_UPDATE=true? --No--> Skip, start app
27
+ |
28
+ Yes
29
+ v
30
+ Git available? --No--> Skip, start app
31
+ |
32
+ Yes
33
+ v
34
+ Valid git repo? --No--> Skip, start app
35
+ |
36
+ Yes
37
+ v
38
+ Fetch origin (10s timeout) --Timeout/Error--> Skip, start app
39
+ |
40
+ Success
41
+ v
42
+ Compare local HEAD vs origin HEAD
43
+ |
44
+ +-- Same commit --> "Already up to date", start app
45
+ |
46
+ +-- Different commit --> Show new commits, ask Y/N
47
+ |
48
+ +-- N --> Skip, start app
49
+ |
50
+ +-- Y --> Run check_update.bat / check_update.sh for full update
51
+ |
52
+ v
53
+ Start app
54
+ ```
55
+
56
+ At every failure point (no git, no network, not a repo), the check exits gracefully and the application starts without interruption.
57
+
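The graceful-skip behavior above can be sketched in a few lines of shell. This is an illustrative reimplementation, not the exact code in the launch scripts, and it omits the 10-second fetch timeout for brevity:

```bash
#!/usr/bin/env bash
# Sketch of the inline check: every failure path prints a note and returns 0,
# so the application still starts.
check_update() {
  [ "${CHECK_UPDATE:-true}" = "true" ] || return 0
  command -v git >/dev/null 2>&1 \
    || { echo "[Update] git not found, skipping."; return 0; }
  git rev-parse --is-inside-work-tree >/dev/null 2>&1 \
    || { echo "[Update] Not a git repository, skipping."; return 0; }
  git fetch origin --quiet 2>/dev/null \
    || { echo "[Update] Network unreachable, skipping."; return 0; }
  local local_head remote_head
  local_head="$(git rev-parse --short HEAD)"
  remote_head="$(git rev-parse --short '@{upstream}' 2>/dev/null)" || return 0
  if [ "$local_head" = "$remote_head" ]; then
    echo "[Update] Already up to date (${local_head})."
  else
    echo "[Update] Update available: ${local_head} -> ${remote_head}"
  fi
}
```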
58
+ ### Enabling and Disabling
59
+
60
+ The update check is controlled by the `CHECK_UPDATE` variable near the top of each launch script.
61
+
62
+ **Windows** (`start_gradio_ui.bat`, `start_api_server.bat`):
63
+
64
+ ```batch
65
+ REM Update check on startup (set to false to disable)
66
+ set CHECK_UPDATE=true
67
+ REM set CHECK_UPDATE=false
68
+ ```
69
+
70
+ **Linux / macOS** (`start_gradio_ui.sh`, `start_api_server.sh`, `start_gradio_ui_macos.sh`, `start_api_server_macos.sh`):
71
+
72
+ ```bash
73
+ # Update check on startup (set to "false" to disable)
74
+ CHECK_UPDATE="true"
75
+ # CHECK_UPDATE="false"
76
+ ```
77
+
78
+ To disable, change the active line to `false`. To re-enable, change it back to `true`.
79
+
80
+ ### Git Requirements by Platform
81
+
82
+ The inline update check requires git to be available. How you obtain git depends on your platform.
83
+
84
+ **Windows:**
85
+
86
+ - **Option A -- PortableGit** (no installation required): Download from <https://git-scm.com/download/win>, choose the portable version, and extract to a `PortableGit\` folder in the project root. The launch scripts look for `PortableGit\bin\git.exe` first.
87
+ - **Option B -- System git**: Install git through any standard method (Git for Windows installer, winget, scoop, etc.). The launch scripts fall back to system git if PortableGit is not found.
88
+
89
+ ```text
90
+ Project Root/
91
+ ├── PortableGit/ <-- Optional, checked first on Windows
92
+ │ └── bin/
93
+ │ └── git.exe
94
+ ├── start_gradio_ui.bat
95
+ ├── check_update.bat
96
+ └── ...
97
+ ```
98
+
99
+ **Linux:**
100
+
101
+ Install git through your distribution's package manager:
102
+
103
+ ```bash
104
+ # Ubuntu / Debian
105
+ sudo apt install git
106
+
107
+ # CentOS / RHEL / Fedora
108
+ sudo yum install git
109
+ # or
110
+ sudo dnf install git
111
+
112
+ # Arch Linux
113
+ sudo pacman -S git
114
+ ```
115
+
116
+ **macOS:**
117
+
118
+ Install git through Xcode command-line tools or Homebrew:
119
+
120
+ ```bash
121
+ # Xcode command-line tools (includes git)
122
+ xcode-select --install
123
+
124
+ # Or via Homebrew
125
+ brew install git
126
+ ```
127
+
128
+ ### Example Output
129
+
130
+ **Already up to date:**
131
+
132
+ ```text
133
+ [Update] Checking for updates...
134
+ [Update] Already up to date (abc1234).
135
+
136
+ Starting ACE-Step Gradio Web UI...
137
+ ```
138
+
139
+ **Update available:**
140
+
141
+ ```text
142
+ [Update] Checking for updates...
143
+
144
+ ========================================
145
+ Update available!
146
+ ========================================
147
+ Current: abc1234 -> Latest: def5678
148
+
149
+ Recent changes:
150
+ * def5678 Fix audio processing bug
151
+ * ccc3333 Add new model support
152
+
153
+ Update now before starting? (Y/N):
154
+ ```
155
+
156
+ If you choose **Y**, the script delegates to `check_update.bat` (Windows) or `check_update.sh` (Linux/macOS) for the full update process including backup handling. If you choose **N**, the update is skipped and the application starts with the current version.
157
+
158
+ **Network unreachable (auto-skip):**
159
+
160
+ ```text
161
+ [Update] Checking for updates...
162
+ [Update] Network unreachable, skipping.
163
+
164
+ Starting ACE-Step Gradio Web UI...
165
+ ```
166
+
167
+ ---
168
+
169
+ ## Manual Update
170
+
171
+ You can run the update check manually at any time, outside of the launch scripts.
172
+
173
+ **Windows:**
174
+
175
+ ```batch
176
+ check_update.bat
177
+ ```
178
+
179
+ **Linux / macOS:**
180
+
181
+ ```bash
182
+ ./check_update.sh
183
+ ```
184
+
185
+ The manual update scripts perform the same 4-step process:
186
+
187
+ 1. Detect git and verify the repository
188
+ 2. Fetch from origin with a 10-second timeout
189
+ 3. Compare local and remote commits
190
+ 4. If an update is available, prompt to apply it (with automatic backup of conflicting files)
191
+
192
+ ---
193
+
194
+ ## File Backup During Updates
195
+
196
+ ### Automatic Backup
197
+
198
+ When you choose to update and you have locally modified files that also changed on the remote, ACE-Step automatically creates a backup before applying the update.
199
+
200
+ **Supported file types** (any modified text file is backed up):
201
+
202
+ - Configuration files: `.bat`, `.sh`, `.yaml`, `.json`, `.ini`
203
+ - Python code: `.py`
204
+ - Documentation: `.md`, `.txt`
205
+
206
+ ### Backup Process
207
+
208
+ ```text
209
+ 1. Update detects locally modified files
210
+ that also changed on the remote
211
+ |
212
+ v
213
+ 2. Creates a timestamped backup directory
214
+ .update_backup_YYYYMMDD_HHMMSS/
215
+ |
216
+ v
217
+ 3. Copies conflicting files into the backup
218
+ (preserves directory structure)
219
+ |
220
+ v
221
+ 4. Resets working tree to the remote version
222
+ |
223
+ v
224
+ 5. Displays backup location and instructions
225
+ ```
226
+
227
+ ### Example
228
+
229
+ **Your local modifications:**
230
+
231
+ - `start_gradio_ui.bat` -- Changed language to Chinese
232
+ - `acestep/handler.py` -- Added debug logging
233
+ - `config.yaml` -- Changed model path
234
+
235
+ **Remote updates:**
236
+
237
+ - `start_gradio_ui.bat` -- Added new features
238
+ - `acestep/handler.py` -- Bug fixes
239
+ - `config.yaml` -- New parameters
240
+
241
+ **Backup created:**
242
+
243
+ ```text
244
+ .update_backup_20260205_143022/
245
+ ├── start_gradio_ui.bat (your version)
246
+ ├── config.yaml (your version)
247
+ └── acestep/
248
+ └── handler.py (your version)
249
+ ```
250
+
251
+ **Working tree after update:**
252
+
253
+ ```text
254
+ start_gradio_ui.bat (new version from GitHub)
255
+ config.yaml (new version from GitHub)
256
+ acestep/
257
+ └── handler.py (new version from GitHub)
258
+ ```
259
+
260
+ Your original files are preserved in the backup directory so you can merge your changes back in.
261
+
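The backup step itself amounts to copying each conflicting file into the timestamped directory while preserving its relative path. A minimal sketch (the function name is illustrative, and it takes the file list as arguments rather than computing it from git):

```bash
#!/usr/bin/env bash
# Copy each given file into a timestamped backup dir, preserving structure.
backup_files() {
  local backup_dir=".update_backup_$(date +%Y%m%d_%H%M%S)"
  local f
  for f in "$@"; do
    mkdir -p "${backup_dir}/$(dirname "$f")"
    cp "$f" "${backup_dir}/${f}"
  done
  echo "$backup_dir"
}

# Usage: backup_files start_gradio_ui.bat config.yaml acestep/handler.py
```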
262
+ ---
263
+
264
+ ## Merging Configurations
265
+
266
+ After an update that backed up your files, use the merge helper to compare and restore your settings.
267
+
268
+ ### Windows: merge_config.bat
269
+
270
+ ```batch
271
+ merge_config.bat
272
+ ```
273
+
274
+ When comparing files, this script opens two Notepad windows side by side -- one with the backup version and one with the current version -- so you can manually copy your settings across.
275
+
276
+ ### Linux / macOS: merge_config.sh
277
+
278
+ ```bash
279
+ ./merge_config.sh
280
+ ```
281
+
282
+ When comparing files, this script uses `colordiff` (if installed) or `diff` to display a unified diff in the terminal, showing exactly what changed between your backed-up version and the new version.
283
+
284
+ To install colordiff for colored output:
285
+
286
+ ```bash
287
+ # Ubuntu / Debian
288
+ sudo apt install colordiff
289
+
290
+ # macOS (Homebrew)
291
+ brew install colordiff
292
+
293
+ # Arch Linux
294
+ sudo pacman -S colordiff
295
+ ```
296
+
297
+ ### Menu Options (Both Platforms)
298
+
299
+ Both `merge_config.bat` and `merge_config.sh` present the same interactive menu:
300
+
301
+ ```text
302
+ ========================================
303
+ ACE-Step Backup Merge Helper
304
+ ========================================
305
+
306
+ 1. Compare backup with current files
307
+ 2. Restore a file from backup
308
+ 3. List all backed up files
309
+ 4. Delete old backups
310
+ 5. Exit
311
+ ```
312
+
313
+ | Option | Description |
314
+ |--------|-------------|
315
+ | **1. Compare** | Show differences between your backup and the current (updated) file. On Windows this opens two Notepad windows. On Linux/macOS this prints a unified diff to the terminal. |
316
+ | **2. Restore** | Copy a file from the backup back into the project, overwriting the updated version. Use this only if the new version causes problems. |
317
+ | **3. List** | Display all files stored in backup directories. |
318
+ | **4. Delete** | Permanently remove old backup directories. Only do this after you have finished merging. |
319
+
320
+ ### Merging Common Files
321
+
322
+ **Launch scripts** (`start_gradio_ui.bat`, `start_gradio_ui.sh`, etc.):
323
+
324
+ Look for your custom settings in the backup (language, port, download source, etc.) and copy them into the corresponding lines of the new version.
325
+
326
+ ```bash
327
+ # Example settings you may want to preserve:
328
+ LANGUAGE="zh"
329
+ PORT=8080
330
+ DOWNLOAD_SOURCE="--download-source modelscope"
331
+ ```
332
+
333
+ **Configuration files** (`config.yaml`, `.json`):
334
+
335
+ Compare the two structures: keep your custom values and add any new keys introduced by the updated version.
336
+
337
+ ```yaml
338
+ # Backup (your version)
339
+ model_path: "custom/path"
340
+ custom_setting: true
341
+
342
+ # Current (new version)
343
+ model_path: "default/path"
344
+ new_feature: enabled
345
+
346
+ # Merged result
347
+ model_path: "custom/path" # Keep your setting
348
+ custom_setting: true # Keep your setting
349
+ new_feature: enabled # Add new feature
350
+ ```
351
+
352
+ ---
353
+
354
+ ## Testing Update Functionality
355
+
356
+ Use the test scripts to verify that your git setup and update mechanism are working correctly before relying on them.
357
+
358
+ **Windows:**
359
+
360
+ ```batch
361
+ test_git_update.bat
362
+ ```
363
+
364
+ **Linux / macOS:**
365
+
366
+ ```bash
367
+ ./test_git_update.sh
368
+ ```
369
+
370
+ ### What the Tests Check
371
+
372
+ 1. **Git availability**: Verifies that git can be found (PortableGit or system git on Windows; system git on Linux/macOS).
373
+ 2. **Repository validity**: Confirms the project directory is a valid git repository.
374
+ 3. **Update script presence**: Checks that `check_update.bat` / `check_update.sh` exists.
375
+ 4. **Network connectivity**: Attempts an actual fetch from the remote (with timeout).
376
+
377
+ ### Example Test Output
378
+
379
+ ```text
380
+ ========================================
381
+ Test Git Update Check
382
+ ========================================
383
+
384
+ [Test 1] Checking Git...
385
+ [PASS] Git found
386
+ git version 2.43.0
387
+
388
+ [Test 2] Checking git repository...
389
+ [PASS] Valid git repository
390
+ Branch: main
391
+ Commit: a1b2c3d
392
+
393
+ [Test 3] Checking update script...
394
+ [PASS] check_update.sh found
395
+
396
+ [Test 4] Running update check...
397
+ [PASS] Update check completed successfully
398
+
399
+ [PASS] All tests completed
400
+ ```
401
+
402
+ ---
403
+
404
+ ## Troubleshooting
405
+
406
+ ### Git not found
407
+
408
+ The update check is silently skipped if git is not available. To enable it, install git for your platform:
409
+
410
+ | Platform | Install Command |
411
+ |----------|----------------|
412
+ | **Windows (PortableGit)** | Download from <https://git-scm.com/download/win> and extract to `PortableGit\` in the project root |
413
+ | **Windows (system)** | `winget install --id Git.Git -e` or use the Git for Windows installer |
414
+ | **Ubuntu / Debian** | `sudo apt install git` |
415
+ | **CentOS / RHEL** | `sudo yum install git` |
416
+ | **Arch Linux** | `sudo pacman -S git` |
417
+ | **macOS** | `xcode-select --install` or `brew install git` |
418
+
419
+ ### Network timeout
420
+
421
+ The fetch operation has a 10-second timeout. If it times out, the update check is skipped automatically and the application starts normally. This is expected behavior on slow or restricted networks.
422
+
423
+ On macOS, the timeout mechanism uses `gtimeout` from GNU coreutils if available, or falls back to a plain fetch without a timeout. To get proper timeout support:
424
+
425
+ ```bash
426
+ brew install coreutils
427
+ ```
428
+
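The fallback logic can be sketched as a small wrapper that prefers `gtimeout`, then GNU `timeout`, then runs the command bare. Illustrative only; the actual scripts may differ:

```bash
#!/usr/bin/env bash
# Run a command under a timeout if any timeout tool exists, else run it bare.
run_with_timeout() {
  local secs="$1"
  shift
  if command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$secs" "$@"
  elif command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    "$@"   # no timeout available, plain execution
  fi
}

# Usage: run_with_timeout 10 git fetch origin
```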
429
+ ### Proxy configuration
430
+
431
+ **Windows (`check_update.bat`):**
432
+
433
+ Create a `proxy_config.txt` file in the project root:
434
+
435
+ ```text
436
+ PROXY_ENABLED=1
437
+ PROXY_URL=http://127.0.0.1:7890
438
+ ```
439
+
440
+ Or configure interactively:
441
+
442
+ ```batch
443
+ check_update.bat proxy
444
+ ```
445
+
446
+ Common proxy formats:
447
+
448
+ | Type | Example |
449
+ |------|---------|
450
+ | HTTP proxy | `http://127.0.0.1:7890` |
451
+ | HTTPS proxy | `https://proxy.company.com:8080` |
452
+ | SOCKS5 proxy | `socks5://127.0.0.1:1080` |
453
+
454
+ To disable the proxy, set `PROXY_ENABLED=0` in `proxy_config.txt`.
455
+
456
+ **Linux / macOS:**
457
+
458
+ Set standard environment variables before running the script:
459
+
460
+ ```bash
461
+ export http_proxy="http://127.0.0.1:7890"
462
+ export https_proxy="http://127.0.0.1:7890"
463
+ ./check_update.sh
464
+ ```
465
+
466
+ Or add them to your shell profile (`~/.bashrc`, `~/.zshrc`) for persistence.
467
+
468
+ ### Merge conflicts
469
+
470
+ If the automatic update fails or produces unexpected results:
471
+
472
+ 1. Check for backup directories: look for `.update_backup_*` folders in the project root.
473
+ 2. Use the merge helper (`merge_config.bat` or `./merge_config.sh`) to compare and restore files.
474
+ 3. If needed, manually inspect the diff between your backup and the current files.
475
+
476
+ ### Lost configuration after update
477
+
478
+ 1. Find your backup:
479
+ - **Windows:** `dir /b .update_backup_*`
480
+ - **Linux / macOS:** `ls -d .update_backup_*`
481
+ 2. Use the merge helper (Option 2) to restore specific files, or manually copy settings from the backup.
482
+
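Manual restore is just a copy from the backup directory back into place. A small helper (hypothetical name; the directory in the usage line is the example timestamp from earlier) makes the two paths explicit:

```bash
#!/usr/bin/env bash
# Copy a file from a backup directory back into the project, overwriting it.
restore_from_backup() {
  cp "$1/$2" "$2"
}

# Usage: restore_from_backup .update_backup_20260205_143022 start_gradio_ui.sh
```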
483
+ ---
484
+
485
+ ## Quick Reference
486
+
487
+ | Action | Windows | Linux / macOS |
488
+ |--------|---------|---------------|
489
+ | **Enable update check** | `set CHECK_UPDATE=true` (in `.bat`) | `CHECK_UPDATE="true"` (in `.sh`) |
490
+ | **Disable update check** | `set CHECK_UPDATE=false` (in `.bat`) | `CHECK_UPDATE="false"` (in `.sh`) |
491
+ | **Manual update** | `check_update.bat` | `./check_update.sh` |
492
+ | **Configure proxy** | `check_update.bat proxy` or edit `proxy_config.txt` | `export http_proxy=... && ./check_update.sh` |
493
+ | **Merge configurations** | `merge_config.bat` | `./merge_config.sh` |
494
+ | **Test update setup** | `test_git_update.bat` | `./test_git_update.sh` |
495
+ | **List backups** | `dir /b .update_backup_*` | `ls -d .update_backup_*` |
496
+ | **Delete a backup** | `rmdir /s /q .update_backup_YYYYMMDD_HHMMSS` | `rm -rf .update_backup_YYYYMMDD_HHMMSS` |
.claude/skills/acestep-lyrics-transcription/SKILL.md ADDED
@@ -0,0 +1,173 @@
1
+ ---
2
+ name: acestep-lyrics-transcription
3
+ description: Transcribe audio to timestamped lyrics using OpenAI Whisper or ElevenLabs Scribe API. Outputs LRC, SRT, or JSON with word-level timestamps. Use when users want to transcribe songs, generate LRC files, or extract lyrics with timestamps from audio.
4
+ allowed-tools: Read, Write, Bash
5
+ ---
6
+
7
+ # Lyrics Transcription Skill
8
+
9
+ Transcribe audio files to timestamped lyrics (LRC/SRT/JSON) via OpenAI Whisper or ElevenLabs Scribe API.
10
+
11
+ ## API Key Setup Guide
12
+
13
+ **Before transcribing, you MUST check whether the user's API key is configured.** Run the following command to check:
14
+
15
+ ```bash
16
+ cd "{project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/" && bash ./scripts/acestep-lyrics-transcription.sh config --check-key
17
+ ```
18
+
19
+ This command only reports whether the active provider's API key is set or empty — it does NOT print the actual key value. **NEVER read or display the user's API key content.** Do not use `config --get` on key fields or read `config.json` directly. The `config --list` command is safe — it automatically masks API keys as `***` in output.
20
+
21
+ **If the command reports the key is empty**, you MUST stop and guide the user to configure it before proceeding. Do NOT attempt transcription without a valid key — it will fail.
22
+
23
+ Use `AskUserQuestion` to ask the user to provide their API key, with the following options and guidance:
24
+
25
+ 1. Tell the user which provider is currently active (openai or elevenlabs) and that its API key is not configured. Explain that transcription cannot proceed without it.
26
+ 2. Provide clear instructions on where to obtain a key:
27
+ - **OpenAI**: Get an API key at https://platform.openai.com/api-keys — requires an OpenAI account with billing enabled. The Whisper API costs ~$0.006/min.
28
+ - **ElevenLabs**: Get an API key at https://elevenlabs.io/app/settings/api-keys — requires an ElevenLabs account. Free tier includes limited credits.
29
+ 3. Also offer the option to switch to the other provider if they already have a key for it.
30
+ 4. Once the user provides the key, configure it using:
31
+ ```bash
32
+ cd "{project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/" && bash ./scripts/acestep-lyrics-transcription.sh config --set <provider>.api_key <KEY>
33
+ ```
34
+ 5. If the user wants to switch providers, also run:
35
+ ```bash
36
+ cd "{project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/" && bash ./scripts/acestep-lyrics-transcription.sh config --set provider <provider_name>
37
+ ```
38
+ 6. After configuring, re-run `config --check-key` to verify the key is set before proceeding.
39
+
40
+ **If the API key is already configured**, proceed directly to transcription without asking.
41
+
42
+ ## Quick Start
43
+
44
+ ```bash
45
+ # 1. cd to this skill's directory
46
+ cd {project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/
47
+
48
+ # 2. Configure API key (choose one)
49
+ ./scripts/acestep-lyrics-transcription.sh config --set openai.api_key sk-...
50
+ # or
51
+ ./scripts/acestep-lyrics-transcription.sh config --set elevenlabs.api_key ...
52
+ ./scripts/acestep-lyrics-transcription.sh config --set provider elevenlabs
53
+
54
+ # 3. Transcribe
55
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio /path/to/song.mp3 --language zh
56
+
57
+ # 4. Output saved to: {project_root}/acestep_output/<filename>.lrc
58
+ ```
59
+
60
+ ## Prerequisites
61
+
62
+ - curl, jq, python3 (or python)
63
+ - An API key for OpenAI or ElevenLabs
64
+
65
+ ## Script Usage
66
+
67
+ ```bash
68
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio <file> [options]
69
+
70
+ Options:
71
+ -a, --audio Audio file path (required)
72
+ -l, --language Language code (zh, en, ja, etc.)
73
+ -f, --format Output format: lrc, srt, json (default: lrc)
74
+ -p, --provider API provider: openai, elevenlabs (overrides config)
75
+ -o, --output Output file path (default: acestep_output/<filename>.lrc)
76
+ ```
77
+
78
+ ## Post-Transcription Lyrics Correction (MANDATORY)
79
+
80
+ **CRITICAL**: After transcription, you MUST manually correct the LRC file before using it for MV rendering. Transcription models frequently produce errors on sung lyrics:
81
+
82
+ - Proper nouns: "ACE-Step" → "AC step", "Spotify" → "spot a fly"
83
+ - Similar-sounding words: "arrives" → "eyes", "open source" → "open sores"
84
+ - Merged/split words: "lighting up" → "lightin' nup"
85
+
86
+ ### Correction Workflow
87
+
88
+ 1. **Read the transcribed LRC file** using the Read tool
89
+ 2. **Read the original lyrics** from the ACE-Step output JSON file
90
+ 3. **Use original lyrics as a whole reference**: Do NOT attempt line-by-line alignment — transcription often splits, merges, or reorders lines differently from the original. Instead, read the original lyrics in full to understand the correct wording, then scan each LRC line and fix any misrecognized words based on your knowledge of what the original lyrics say.
91
+ 4. **Fix transcription errors**: Replace misrecognized words with the correct original words, keeping the timestamps intact
92
+ 5. **Write the corrected LRC** back using the Write tool
93
+
94
+ ### What to Correct
95
+
96
+ - Replace misrecognized words with their correct original versions
97
+ - Keep all `[MM:SS.cc]` timestamps exactly as-is (timestamps from transcription are accurate)
98
+ - Do NOT add structure tags like `[Verse]` or `[Chorus]` — the LRC should only have timestamped text lines
99
+
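As a quick sanity check after editing, every line should start with a `[MM:SS.cc]` timestamp. The (assumed) grep-based helper below returns 0 only when that holds; leftover structure tags like `[Verse]` fail the same check:

```bash
#!/usr/bin/env bash
# Return 0 if every line starts with a [MM:SS.cc] timestamp.
# Structure tags like [Verse] or [Chorus] fail this check too.
lrc_ok() {
  ! grep -qvE '^\[[0-9]{2}:[0-9]{2}\.[0-9]{2}\]' "$1"
}

# Usage: lrc_ok corrected.lrc && echo "LRC looks clean"
```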
100
+ ### Example
101
+
102
+ **Transcribed (wrong):**
103
+ ```
104
+ [00:46.96]AC step alive,
105
+ [00:50.80]one point five eyes.
106
+ ```
107
+
108
+ **Original lyrics reference:**
109
+ ```
110
+ ACE-Step alive
111
+ One point five arrives
112
+ ```
113
+
114
+ **Corrected (right):**
115
+ ```
116
+ [00:46.96]ACE-Step alive,
117
+ [00:50.80]One point five arrives.
118
+ ```
119
+
120
+ ## Configuration
121
+
122
+ Config file: `scripts/config.json`
123
+
124
+ ```bash
125
+ # Switch provider
126
+ ./scripts/acestep-lyrics-transcription.sh config --set provider openai
127
+ ./scripts/acestep-lyrics-transcription.sh config --set provider elevenlabs
128
+
129
+ # Set API keys
130
+ ./scripts/acestep-lyrics-transcription.sh config --set openai.api_key sk-...
131
+ ./scripts/acestep-lyrics-transcription.sh config --set elevenlabs.api_key ...
132
+
133
+ # View config
134
+ ./scripts/acestep-lyrics-transcription.sh config --list
135
+ ```
136
+
137
+ | Option | Default | Description |
138
+ |--------|---------|-------------|
139
+ | `provider` | `openai` | Active provider: `openai` or `elevenlabs` |
140
+ | `output_format` | `lrc` | Default output: `lrc`, `srt`, or `json` |
141
+ | `openai.api_key` | `""` | OpenAI API key |
142
+ | `openai.api_url` | `https://api.openai.com/v1` | OpenAI API base URL |
143
+ | `openai.model` | `whisper-1` | OpenAI model (whisper-1 for word timestamps) |
144
+ | `elevenlabs.api_key` | `""` | ElevenLabs API key |
145
+ | `elevenlabs.api_url` | `https://api.elevenlabs.io/v1` | ElevenLabs API base URL |
146
+ | `elevenlabs.model` | `scribe_v2` | ElevenLabs model |
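Dotted keys like `openai.api_key` are resolved as nested JSON paths (the script hands them to jq as `.openai.api_key`). A minimal Python equivalent of the lookup, with a hypothetical `get_config` helper that falls back to an empty string like the script does:

```python
import json

def get_config(config, dotted_key):
    """Resolve a dotted key such as 'openai.api_key' against a nested
    config dict, returning '' when any path segment is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return ""
        node = node[part]
    return node

config = json.loads('{"provider": "openai", "openai": {"model": "whisper-1"}}')
print(get_config(config, "openai.model"))      # -> whisper-1
print(get_config(config, "elevenlabs.model"))  # -> '' (missing keys fall back to empty)
```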
147
+
148
+ ## Provider Notes
149
+
150
+ | Provider | Model | Word Timestamps | Pricing |
151
+ |----------|-------|-----------------|---------|
152
+ | OpenAI | whisper-1 | Yes (segment + word) | $0.006/min |
153
+ | ElevenLabs | scribe_v2 | Yes (word-level) | Varies by plan |
154
+
155
+ - OpenAI `whisper-1` is the only OpenAI model supporting word-level timestamps
156
+ - ElevenLabs `scribe_v2` returns word-level timestamps with type filtering
157
+ - Both support multilingual transcription
158
+
159
+ ## Examples
160
+
161
+ ```bash
162
+ # Basic transcription (uses config defaults)
163
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3
164
+
165
+ # Chinese song to LRC
166
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3 --language zh
167
+
168
+ # Use ElevenLabs, output SRT
169
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3 --provider elevenlabs --format srt
170
+
171
+ # Custom output path
172
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3 --output ./my_lyrics.lrc
173
+ ```
.claude/skills/acestep-lyrics-transcription/scripts/acestep-lyrics-transcription.sh ADDED
@@ -0,0 +1,584 @@
1
+ #!/bin/bash
2
+ #
3
+ # acestep-lyrics-transcription.sh - Transcribe audio to timestamped lyrics (LRC/SRT/JSON)
4
+ #
5
+ # Requirements: curl, jq
6
+ #
7
+ # Usage:
8
+ # ./acestep-lyrics-transcription.sh transcribe --audio <file> [options]
9
+ # ./acestep-lyrics-transcription.sh config [--get|--set|--reset]
10
+ #
11
+ # Output:
12
+ # - LRC/SRT/JSON files saved to output directory
13
+
14
+ set -e
15
+
16
+ export LANG="${LANG:-en_US.UTF-8}"
17
+ export LC_ALL="${LC_ALL:-en_US.UTF-8}"
18
+
19
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
20
+ CONFIG_FILE="${SCRIPT_DIR}/config.json"
21
+ OUTPUT_DIR="$(cd "${SCRIPT_DIR}/../../../.." && pwd)/acestep_output"
22
+
23
+ # Colors
24
+ RED='\033[0;31m'
25
+ GREEN='\033[0;32m'
26
+ YELLOW='\033[1;33m'
27
+ CYAN='\033[0;36m'
28
+ NC='\033[0m'
29
+
30
+ # Convert MSYS2/Cygwin paths to Windows-native paths for Python
31
+ to_python_path() {
32
+ if command -v cygpath &> /dev/null; then
33
+ cygpath -m "$1"
34
+ else
35
+ echo "$1"
36
+ fi
37
+ }
38
+
39
+ # Detect python executable (python3 or python)
40
+ PYTHON_CMD=""
41
+ find_python() {
42
+ if [ -n "$PYTHON_CMD" ]; then return; fi
43
+ # Test actual execution, not just existence (Windows Store python3 shim returns exit 49)
44
+ if python3 -c "pass" &> /dev/null; then
45
+ PYTHON_CMD="python3"
46
+ elif python -c "pass" &> /dev/null; then
47
+ PYTHON_CMD="python"
48
+ else
49
+ echo -e "${RED}Error: python3 or python is required but not found.${NC}"
50
+ exit 1
51
+ fi
52
+ }
53
+
54
+ # ─── Dependencies ───
55
+
56
+ check_deps() {
57
+ if ! command -v curl &> /dev/null; then
58
+ echo -e "${RED}Error: curl is required but not installed.${NC}"
59
+ exit 1
60
+ fi
61
+ if ! command -v jq &> /dev/null; then
62
+ echo -e "${RED}Error: jq is required but not installed.${NC}"
63
+ echo "Install: apt install jq / brew install jq / choco install jq"
64
+ exit 1
65
+ fi
66
+ }
67
+
68
+ # ─── Config ───
69
+
70
+ DEFAULT_CONFIG='{
71
+ "provider": "openai",
72
+ "output_format": "lrc",
73
+ "openai": {
74
+ "api_key": "",
75
+ "api_url": "https://api.openai.com/v1",
76
+ "model": "whisper-1"
77
+ },
78
+ "elevenlabs": {
79
+ "api_key": "",
80
+ "api_url": "https://api.elevenlabs.io/v1",
81
+ "model": "scribe_v2"
82
+ }
83
+ }'
84
+
85
+ ensure_config() {
86
+ if [ ! -f "$CONFIG_FILE" ]; then
87
+ local example="${SCRIPT_DIR}/config.example.json"
88
+ if [ -f "$example" ]; then
89
+ cp "$example" "$CONFIG_FILE"
90
+ echo -e "${YELLOW}Config file created from config.example.json. Please configure your settings:${NC}"
91
+ echo -e " ${CYAN}./scripts/acestep-lyrics-transcription.sh config --set provider <openai|elevenlabs>${NC}"
92
+ echo -e " ${CYAN}./scripts/acestep-lyrics-transcription.sh config --set <provider>.api_key <key>${NC}"
93
+ else
94
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
95
+ fi
96
+ fi
97
+ }
98
+
99
+ get_config() {
100
+ local key="$1"
101
+ ensure_config
102
+ local jq_path=".${key}"
103
+ local value
104
+ value=$(jq -r "$jq_path" "$CONFIG_FILE" 2>/dev/null)
105
+ if [ "$value" = "null" ]; then
106
+ echo ""
107
+ else
108
+ echo "$value" | tr -d '\r\n'
109
+ fi
110
+ }
111
+
112
+ set_config() {
113
+ local key="$1"
114
+ local value="$2"
115
+ ensure_config
116
+ local tmp_file="${CONFIG_FILE}.tmp"
117
+ local jq_path=".${key}"
118
+
119
+ if [ "$value" = "true" ] || [ "$value" = "false" ]; then
120
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
121
+ elif [[ "$value" =~ ^-?[0-9]+$ ]] || [[ "$value" =~ ^-?[0-9]+\.[0-9]+$ ]]; then
122
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
123
+ else
124
+ jq "$jq_path = \"$value\"" "$CONFIG_FILE" > "$tmp_file"
125
+ fi
126
+
127
+ mv "$tmp_file" "$CONFIG_FILE"
128
+ echo "Set $key = $value"
129
+ }
130
+
131
+ ensure_output_dir() {
132
+ mkdir -p "$OUTPUT_DIR"
133
+ }
134
+
135
+ # ─── Format Conversion ───
136
+
137
+ # Convert word-level timestamps to LRC format
138
+ # Input: JSON array of {word, start, end} on stdin
139
+ # Output: LRC text
140
+ words_to_lrc() {
141
+ local json_file="$(to_python_path "$1")"
142
+ local output_file="$(to_python_path "$2")"
143
+ local line_gap="${3:-1.5}"
144
+ find_python
145
+
146
+ $PYTHON_CMD -c "
147
+ import json, sys
148
+
149
+ def is_cjk(ch):
150
+ cp = ord(ch)
151
+ return (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or
152
+ 0x20000 <= cp <= 0x2A6DF or 0x2A700 <= cp <= 0x2B73F or
153
+ 0x2B740 <= cp <= 0x2B81F or 0x2B820 <= cp <= 0x2CEAF or
154
+ 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F or
155
+ 0x3000 <= cp <= 0x303F or 0x3040 <= cp <= 0x309F or
156
+ 0x30A0 <= cp <= 0x30FF or 0xFF00 <= cp <= 0xFFEF)
157
+
158
+ def smart_join(word_list):
159
+ if not word_list:
160
+ return ''
161
+ result = word_list[0]
162
+ for j in range(1, len(word_list)):
163
+ prev_w = word_list[j-1]
164
+ curr_w = word_list[j]
165
+ prev_last = prev_w[-1] if prev_w else ''
166
+ curr_first = curr_w[0] if curr_w else ''
167
+ if is_cjk(prev_last) or is_cjk(curr_first):
168
+ result += curr_w
169
+ else:
170
+ result += ' ' + curr_w
171
+ return result.strip()
172
+
173
+ with open('$json_file', 'r', encoding='utf-8') as f:
174
+ words = json.load(f)
175
+
176
+ if not words:
177
+ sys.exit(0)
178
+
179
+ lines = []
180
+ current_line = []
181
+ current_start = words[0]['start']
182
+
183
+ for i, w in enumerate(words):
184
+ current_line.append(w['word'])
185
+ is_last = (i == len(words) - 1)
186
+ has_punct = w['word'].rstrip().endswith(('.', '!', '?', '。', '!', '?', ',', ','))
187
+ has_gap = (not is_last and words[i+1]['start'] - w['end'] > $line_gap)
188
+
189
+ if is_last or has_punct or has_gap:
190
+ text = smart_join(current_line)
191
+ text = text.rstrip(',。,.')
192
+ if text:
193
+ mins = int(current_start) // 60
194
+ secs = current_start - mins * 60
195
+ lines.append(f'[{mins:02d}:{secs:05.2f}]{text}')
196
+ current_line = []
197
+ if not is_last:
198
+ current_start = words[i+1]['start']
199
+
200
+ with open('$output_file', 'w', encoding='utf-8') as f:
201
+ for line in lines:
202
+ f.write(line + '\n')
203
+ "
204
+ }
205
+
206
+ # Convert word-level timestamps to SRT format
207
+ words_to_srt() {
208
+ local json_file="$(to_python_path "$1")"
209
+ local output_file="$(to_python_path "$2")"
210
+ local line_gap="${3:-1.5}"
211
+ find_python
212
+
213
+ $PYTHON_CMD -c "
214
+ import json, sys
215
+
216
+ def is_cjk(ch):
217
+ cp = ord(ch)
218
+ return (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or
219
+ 0x20000 <= cp <= 0x2A6DF or 0x2A700 <= cp <= 0x2B73F or
220
+ 0x2B740 <= cp <= 0x2B81F or 0x2B820 <= cp <= 0x2CEAF or
221
+ 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F or
222
+ 0x3000 <= cp <= 0x303F or 0x3040 <= cp <= 0x309F or
223
+ 0x30A0 <= cp <= 0x30FF or 0xFF00 <= cp <= 0xFFEF)
224
+
225
+ def smart_join(word_list):
226
+ if not word_list:
227
+ return ''
228
+ result = word_list[0]
229
+ for j in range(1, len(word_list)):
230
+ prev_w = word_list[j-1]
231
+ curr_w = word_list[j]
232
+ prev_last = prev_w[-1] if prev_w else ''
233
+ curr_first = curr_w[0] if curr_w else ''
234
+ if is_cjk(prev_last) or is_cjk(curr_first):
235
+ result += curr_w
236
+ else:
237
+ result += ' ' + curr_w
238
+ return result.strip()
239
+
240
+ with open('$json_file', 'r', encoding='utf-8') as f:
241
+ words = json.load(f)
242
+
243
+ if not words:
244
+ sys.exit(0)
245
+
246
+ def fmt(t):
247
+ h = int(t) // 3600
248
+ m = (int(t) % 3600) // 60
249
+ s = t - h*3600 - m*60
250
+ return f'{h:02d}:{m:02d}:{s:06.3f}'.replace('.', ',')
251
+
252
+ lines = []
253
+ current_line = []
254
+ current_start = words[0]['start']
255
+ current_end = words[0]['end']
256
+
257
+ for i, w in enumerate(words):
258
+ current_line.append(w['word'])
259
+ current_end = w['end']
260
+ is_last = (i == len(words) - 1)
261
+ has_punct = w['word'].rstrip().endswith(('.', '!', '?', '。', '!', '?', ',', ','))
262
+ has_gap = (not is_last and words[i+1]['start'] - w['end'] > $line_gap)
263
+
264
+ if is_last or has_punct or has_gap:
265
+ text = smart_join(current_line)
266
+ text = text.rstrip(',。,.')
267
+ if text:
268
+ lines.append((current_start, current_end, text))
269
+ current_line = []
270
+ if not is_last:
271
+ current_start = words[i+1]['start']
272
+
273
+ with open('$output_file', 'w', encoding='utf-8') as f:
274
+ for idx, (s, e, text) in enumerate(lines, 1):
275
+ f.write(f'{idx}\n')
276
+ f.write(f'{fmt(s)} --> {fmt(e)}\n')
277
+ f.write(f'{text}\n')
278
+ f.write('\n')
279
+ "
280
+ }
281
+
282
+ # ─── OpenAI Whisper ───
283
+
284
+ transcribe_openai() {
285
+ local audio_file="$1"
286
+ local language="$2"
287
+ local words_file="$3"
288
+
289
+ local api_key=$(get_config "openai.api_key")
290
+ local api_url=$(get_config "openai.api_url")
291
+ local model=$(get_config "openai.model")
292
+
293
+ [ -z "$api_key" ] && { echo -e "${RED}Error: OpenAI API key not configured.${NC}"; echo "Run: ./acestep-lyrics-transcription.sh config --set openai.api_key YOUR_KEY"; exit 1; }
294
+ [ -z "$api_url" ] && api_url="https://api.openai.com/v1"
295
+ [ -z "$model" ] && model="whisper-1"
296
+
297
+ echo -e " Provider: OpenAI (${model})"
298
+
299
+ local resp_file=$(mktemp)
300
+
301
+ # Build curl command
302
+ local curl_args=(
303
+ -s -w "%{http_code}"
304
+ -o "$resp_file"
305
+ -X POST "${api_url}/audio/transcriptions"
306
+ -H "Authorization: Bearer ${api_key}"
307
+ -F "file=@${audio_file}"
308
+ -F "model=${model}"
309
+ -F "response_format=verbose_json"
310
+ -F "timestamp_granularities[]=word"
311
+ -F "timestamp_granularities[]=segment"
312
+ )
313
+
314
+ [ -n "$language" ] && curl_args+=(-F "language=${language}")
315
+
316
+ local http_code
317
+ http_code=$(curl "${curl_args[@]}")
318
+
319
+ if [ "$http_code" != "200" ]; then
320
+ local err
321
+ err=$(jq -r '.error.message // .detail // "Unknown error"' "$resp_file" 2>/dev/null)
322
+ echo -e "${RED}Error: HTTP $http_code - $err${NC}"
323
+ rm -f "$resp_file"
324
+ return 1
325
+ fi
326
+
327
+ # Extract word-level timestamps into normalized format [{word, start, end}]
328
+ jq '[.words[] | {word: .word, start: .start, end: .end}]' "$resp_file" > "$words_file" 2>/dev/null
329
+
330
+ # Show transcription text
331
+ local text
332
+ text=$(jq -r '.text // empty' "$resp_file" 2>/dev/null)
333
+ echo -e " ${GREEN}Transcription complete${NC}"
334
+ echo ""
335
+ echo "$text"
336
+
337
+ rm -f "$resp_file"
338
+ }
339
+
340
+ # ─── ElevenLabs Scribe ───
341
+
342
+ transcribe_elevenlabs() {
343
+ local audio_file="$1"
344
+ local language="$2"
345
+ local words_file="$3"
346
+
347
+ local api_key=$(get_config "elevenlabs.api_key")
348
+ local api_url=$(get_config "elevenlabs.api_url")
349
+ local model=$(get_config "elevenlabs.model")
350
+
351
+ [ -z "$api_key" ] && { echo -e "${RED}Error: ElevenLabs API key not configured.${NC}"; echo "Run: ./acestep-lyrics-transcription.sh config --set elevenlabs.api_key YOUR_KEY"; exit 1; }
352
+ [ -z "$api_url" ] && api_url="https://api.elevenlabs.io/v1"
353
+ [ -z "$model" ] && model="scribe_v2"
354
+
355
+ echo -e " Provider: ElevenLabs (${model})"
356
+
357
+ local resp_file=$(mktemp)
358
+
359
+ local curl_args=(
360
+ -s -w "%{http_code}"
361
+ -o "$resp_file"
362
+ -X POST "${api_url}/speech-to-text"
363
+ -H "xi-api-key: ${api_key}"
364
+ -F "file=@${audio_file}"
365
+ -F "model_id=${model}"
366
+ )
367
+
368
+ [ -n "$language" ] && curl_args+=(-F "language_code=${language}")
369
+
370
+ local http_code
371
+ http_code=$(curl "${curl_args[@]}")
372
+
373
+ if [ "$http_code" != "200" ]; then
374
+ local err
375
+ err=$(jq -r '.detail.message // .detail // "Unknown error"' "$resp_file" 2>/dev/null)
376
+ echo -e "${RED}Error: HTTP $http_code - $err${NC}"
377
+ rm -f "$resp_file"
378
+ return 1
379
+ fi
380
+
381
+ # ElevenLabs response: { text, words: [{text, start, end, type}...] }
382
+ # Normalize to [{word, start, end}], timestamps already in seconds, filter only "word" type
383
+ jq '[.words[] | select(.type == "word") | {word: .text, start: .start, end: .end}]' "$resp_file" > "$words_file" 2>/dev/null
384
+
385
+ local text
386
+ text=$(jq -r '.text // empty' "$resp_file" 2>/dev/null)
387
+ echo -e " ${GREEN}Transcription complete${NC}"
388
+ echo ""
389
+ echo "$text"
390
+
391
+ rm -f "$resp_file"
392
+ }
393
+
394
+ # ─── Commands ───
395
+
396
+ cmd_transcribe() {
397
+ check_deps
398
+ ensure_config
399
+
400
+ local audio="" language="" output="" format="" provider=""
401
+
402
+ while [[ $# -gt 0 ]]; do
403
+ case $1 in
404
+ --audio|-a) audio="$2"; shift 2 ;;
405
+ --language|-l) language="$2"; shift 2 ;;
406
+ --output|-o) output="$2"; shift 2 ;;
407
+ --format|-f) format="$2"; shift 2 ;;
408
+ --provider|-p) provider="$2"; shift 2 ;;
409
+ *) [ -z "$audio" ] && audio="$1"; shift ;;
410
+ esac
411
+ done
412
+
413
+ [ -z "$audio" ] && { echo -e "${RED}Error: --audio is required${NC}"; echo "Usage: $0 transcribe --audio <file> [options]"; exit 1; }
414
+ [ ! -f "$audio" ] && { echo -e "${RED}Error: audio file not found: $audio${NC}"; exit 1; }
415
+
416
+ # Resolve absolute path
417
+ audio="$(cd "$(dirname "$audio")" && pwd)/$(basename "$audio")"
418
+
419
+ [ -z "$provider" ] && provider=$(get_config "provider")
420
+ [ -z "$provider" ] && provider="openai"
421
+
422
+ [ -z "$format" ] && format=$(get_config "output_format")
423
+ [ -z "$format" ] && format="lrc"
424
+
425
+ # Default output path
426
+ if [ -z "$output" ]; then
427
+ ensure_output_dir
428
+ local basename="$(basename "${audio%.*}")"
429
+ output="${OUTPUT_DIR}/${basename}.${format}"
430
+ fi
431
+
432
+ echo "Transcribing..."
433
+ echo " Audio: $(basename "$audio")"
434
+ echo " Format: $format"
435
+
436
+ # Transcribe to normalized word timestamps
437
+ local words_file=$(mktemp)
438
+
439
+ case "$provider" in
440
+ openai) transcribe_openai "$audio" "$language" "$words_file" ;;
441
+ elevenlabs) transcribe_elevenlabs "$audio" "$language" "$words_file" ;;
442
+ *) echo -e "${RED}Error: unknown provider: $provider${NC}"; echo "Supported: openai, elevenlabs"; rm -f "$words_file"; exit 1 ;;
443
+ esac
444
+
445
+ # Check if we got words
446
+ local word_count
447
+ word_count=$(jq 'length' "$words_file" 2>/dev/null)
448
+ if [ -z "$word_count" ] || [ "$word_count" = "0" ]; then
449
+ echo -e "${YELLOW}Warning: no word-level timestamps returned${NC}"
450
+ rm -f "$words_file"
451
+ return 1
452
+ fi
453
+
454
+ echo ""
455
+ echo " Words detected: $word_count"
456
+
457
+ # Convert to output format
458
+ mkdir -p "$(dirname "$output")"
459
+
460
+ case "$format" in
461
+ lrc)
462
+ words_to_lrc "$words_file" "$output"
463
+ ;;
464
+ srt)
465
+ words_to_srt "$words_file" "$output"
466
+ ;;
467
+ json)
468
+ cp "$words_file" "$output"
469
+ ;;
470
+ *)
471
+ echo -e "${RED}Error: unknown format: $format (supported: lrc, srt, json)${NC}"
472
+ rm -f "$words_file"
473
+ exit 1
474
+ ;;
475
+ esac
476
+
477
+ rm -f "$words_file"
478
+
479
+ echo -e " ${GREEN}Saved: $output${NC}"
480
+ echo ""
481
+ echo -e "${GREEN}Done!${NC}"
482
+ }
483
+
484
+ cmd_config() {
485
+ check_deps
486
+ ensure_config
487
+
488
+ local action="" key="" value=""
489
+
490
+ while [[ $# -gt 0 ]]; do
491
+ case $1 in
492
+ --get) action="get"; key="$2"; shift 2 ;;
493
+ --set) action="set"; key="$2"; value="$3"; shift 3 ;;
494
+ --reset) action="reset"; shift ;;
495
+ --list) action="list"; shift ;;
496
+ --check-key) action="check-key"; shift ;;
497
+ *) shift ;;
498
+ esac
499
+ done
500
+
501
+ case "$action" in
502
+ "check-key")
503
+ local provider=$(get_config "provider")
504
+ [ -z "$provider" ] && provider="openai"
505
+ local api_key=$(get_config "${provider}.api_key")
506
+ echo "provider: $provider"
507
+ if [ -n "$api_key" ]; then
508
+ echo "api_key: configured"
509
+ else
510
+ echo "api_key: empty"
511
+ fi
512
+ ;;
513
+ "get")
514
+ [ -z "$key" ] && { echo -e "${RED}Error: --get requires KEY${NC}"; exit 1; }
515
+ local result=$(get_config "$key")
516
+ [ -n "$result" ] && echo "$key = $result" || echo "Key not found: $key"
517
+ ;;
518
+ "set")
519
+ [ -z "$key" ] || [ -z "$value" ] && { echo -e "${RED}Error: --set requires KEY VALUE${NC}"; exit 1; }
520
+ set_config "$key" "$value"
521
+ ;;
522
+ "reset")
523
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
524
+ echo -e "${GREEN}Configuration reset to defaults.${NC}"
525
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
526
+ ;;
527
+ "list")
528
+ echo "Current configuration:"
529
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
530
+ ;;
531
+ *)
532
+ echo "Config file: $CONFIG_FILE"
533
+ echo "----------------------------------------"
534
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
535
+ echo ""
536
+ echo "----------------------------------------"
537
+ echo ""
538
+ echo "Usage:"
539
+ echo " config --list Show config"
540
+ echo " config --get <key> Get value"
541
+ echo " config --set <key> <val> Set value"
542
+ echo " config --reset Reset to defaults"
543
+ echo ""
544
+ echo "Examples:"
545
+ echo " config --set provider elevenlabs"
546
+ echo " config --set openai.api_key sk-..."
547
+ echo " config --set elevenlabs.api_key ..."
548
+ echo " config --set output_format srt"
549
+ ;;
550
+ esac
551
+ }
552
+
553
+ show_help() {
554
+ echo "Lyrics Transcription CLI"
555
+ echo ""
556
+ echo "Requirements: curl, jq, python3"
557
+ echo ""
558
+ echo "Usage: $0 <command> [options]"
559
+ echo ""
560
+ echo "Commands:"
561
+ echo " transcribe Transcribe audio to timestamped lyrics"
562
+ echo " config Manage configuration"
563
+ echo ""
564
+ echo "Transcribe Options:"
565
+ echo " -a, --audio Audio file path (required)"
566
+ echo " -l, --language Language code (e.g. zh, en, ja)"
567
+ echo " -f, --format Output format: lrc, srt, json (default: lrc)"
568
+ echo " -p, --provider API provider: openai, elevenlabs"
569
+ echo " -o, --output Output file path"
570
+ echo ""
571
+ echo "Examples:"
572
+ echo " $0 transcribe --audio song.mp3"
573
+ echo " $0 transcribe --audio song.mp3 --language zh --format lrc"
574
+ echo " $0 config --set provider openai"
575
+ }
576
+
577
+ # ─── Main ───
578
+
579
+ case "$1" in
580
+ transcribe) shift; cmd_transcribe "$@" ;;
581
+ config) shift; cmd_config "$@" ;;
582
+ help|--help|-h) show_help ;;
583
+ *) show_help; exit 1 ;;
584
+ esac
.claude/skills/acestep-lyrics-transcription/scripts/config.example.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "provider": "elevenlabs",
3
+ "output_format": "lrc",
4
+ "openai": {
5
+ "api_key": "",
6
+ "api_url": "https://api.openai.com/v1",
7
+ "model": "whisper-1"
8
+ },
9
+ "elevenlabs": {
10
+ "api_key": "",
11
+ "api_url": "https://api.elevenlabs.io/v1",
12
+ "model": "scribe_v2"
13
+ }
14
+ }
.claude/skills/acestep-simplemv/SKILL.md ADDED
@@ -0,0 +1,133 @@
1
+ ---
2
+ name: acestep-simplemv
3
+ description: Render music videos from audio files and lyrics using Remotion. Accepts audio + LRC/JSON lyrics + title to produce MP4 videos with waveform visualization and synced lyrics display. Use when users mention MV generation, music video rendering, creating video from audio/lyrics, or visualizing songs.
4
+ ---
5
+
6
+ # MV Render
7
+
8
+ Render music videos with waveform visualization and synced lyrics from audio + lyrics input.
9
+
10
+ ## Prerequisites
11
+
12
+ - Remotion project at `scripts/` directory within this skill
13
+ - Node.js + npm dependencies installed
14
+ - ffprobe available (for audio duration detection)
15
+
16
+ ### First-Time Setup
17
+
18
+ Before first use, check and install dependencies:
19
+
20
+ ```bash
21
+ # 1. Check Node.js
22
+ node --version
23
+
24
+ # 2. Install npm dependencies
25
+ cd {project_root}/{.claude or .codex}/skills/acestep-simplemv/scripts && npm install
26
+
27
+ # 3. Check ffprobe
28
+ ffprobe -version
29
+ ```
30
+
31
+ If ffprobe is not available, install ffmpeg (which includes ffprobe):
32
+ - **Windows**: `choco install ffmpeg` or download from https://ffmpeg.org/download.html and add to PATH
33
+ - **macOS**: `brew install ffmpeg`
34
+ - **Linux**: `sudo apt-get install ffmpeg` (Debian/Ubuntu) or `sudo dnf install ffmpeg` (Fedora)
35
+
36
+ ## Quick Start
37
+
38
+ ```bash
39
+ cd {project_root}/{.claude or .codex}/skills/acestep-simplemv/
40
+ ./scripts/render-mv.sh --audio /path/to/song.mp3 --lyrics /path/to/song.lrc --title "Song Title"
41
+ ```
42
+
43
+ Output: MP4 file at `out/<audio_basename>.mp4` (or custom `--output` path).
44
+
45
+ ## Script Usage
46
+
47
+ ```bash
48
+ ./scripts/render-mv.sh --audio <file> --lyrics <lrc_file> --title "Title" [options]
49
+
50
+ Options:
51
+ --audio Audio file path (absolute paths supported)
52
+ --lyrics LRC format lyrics file (timestamped)
53
+ --lyrics-json JSON lyrics file [{start, end, text}] (alternative to --lyrics)
54
+ --title Video title (default: "Music Video")
55
+ --subtitle Subtitle text
56
+ --credit Bottom credit text
57
+ --offset Lyric timing offset in seconds (default: -0.5)
58
+ --output Output file path (default: out/<audio_basename>.mp4)
59
+ --codec h264|h265|vp8|vp9 (default: h264)
60
+ --browser Custom browser executable path (Chrome/Edge/Chromium)
61
+
62
+ Environment variables:
63
+ BROWSER_EXECUTABLE Path to browser executable (overrides auto-detection)
64
+ ```
65
+
66
+ ## Browser Detection
67
+
68
+ Remotion requires a Chromium-based browser for rendering. The script auto-detects browsers in this priority order:
69
+
70
+ 1. `BROWSER_EXECUTABLE` environment variable
71
+ 2. `--browser` CLI argument
72
+ 3. Remotion cache (`chrome-headless-shell`, downloaded by Remotion)
73
+ 4. System Chrome (auto-uses `--chrome-mode=chrome-for-testing`)
74
+ 5. **System Edge** (pre-installed on Windows 10/11, auto-uses `--chrome-mode=chrome-for-testing`)
75
+ 6. System Chromium (auto-uses `--chrome-mode=chrome-for-testing`)
76
+
77
+ **Important**: New versions of Chrome/Edge removed the old headless mode. When using regular Chrome/Edge/Chromium, the script automatically sets `--chrome-mode=chrome-for-testing` (which uses `--headless=new`). When using `chrome-headless-shell`, it uses the default `headless-shell` mode (which uses `--headless=old`). This is handled transparently.
78
+
79
+ If no browser is found, Remotion will attempt to download `chrome-headless-shell` from Google servers. **This will fail if Google servers are inaccessible from your network.**
80
+
81
+ ### Workarounds for restricted networks
82
+
83
+ Since **Edge is pre-installed on Windows 10/11**, it should be auto-detected without any manual configuration. The script automatically detects Chrome/Edge and uses the correct headless mode. If auto-detection fails:
84
+
85
+ ```bash
86
+ # Option 1: Set environment variable
87
+ export BROWSER_EXECUTABLE="/path/to/msedge.exe"
88
+
89
+ # Option 2: Pass as CLI argument
90
+ ./scripts/render-mv.sh --audio song.mp3 --lyrics song.lrc --title "Song" --browser "/path/to/msedge.exe"
91
+
92
+ # Option 3: Enable proxy and let Remotion download chrome-headless-shell
93
+ ```
94
+
95
+ ## Examples
96
+
97
+ ```bash
98
+ # Basic render
99
+ ./scripts/render-mv.sh --audio /tmp/abc123_1.mp3 --lyrics /tmp/abc123.lrc --title "夜桜"
100
+
101
+ # Custom output path
102
+ ./scripts/render-mv.sh --audio song.mp3 --lyrics song.lrc --title "My Song" --output /tmp/my_mv.mp4
103
+
104
+ # With subtitle and credit
105
+ ./scripts/render-mv.sh --audio song.mp3 --lyrics song.lrc --title "Song" --subtitle "Artist Name" --credit "Generated by ACE-Step"
106
+ ```
107
+
108
+ ## File Naming
109
+
110
+ **IMPORTANT**: Use the audio file's job ID as the output filename to avoid overwriting. Do NOT invent custom names like `--output my_song.mp4`; always derive the video name from the audio filename.
111
+
112
+ Default output uses the audio filename as base:
113
+ - Audio: `acestep_output/{job_id}_1.mp3`
114
+ - Lyrics: `acestep_output/{job_id}_1.lrc`
115
+ - Video: Pass `--output acestep_output/{job_id}.mp4` (use the job ID from the audio file)
116
+
117
+ Example: if audio is `chatcmpl-abc123_1.mp3`, pass `--output acestep_output/chatcmpl-abc123.mp4`
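The naming rule above (keep the job ID, drop the extension and any trailing track index) can be sketched as a Python helper. `default_video_path` is a hypothetical name for illustration, not a function in the scripts:

```python
import os
import re

def default_video_path(audio_path, out_dir="acestep_output"):
    """Derive the MV output path from the audio filename: drop the
    extension and any trailing track index like `_1`, keeping the
    job ID as the base name."""
    base = os.path.splitext(os.path.basename(audio_path))[0]
    base = re.sub(r"_\d+$", "", base)  # chatcmpl-abc123_1 -> chatcmpl-abc123
    return os.path.join(out_dir, base + ".mp4")

print(default_video_path("acestep_output/chatcmpl-abc123_1.mp3"))
# -> acestep_output/chatcmpl-abc123.mp4 (on POSIX paths)
```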
118
+
119
+ ## Title Guidelines
120
+
121
+ - Keep `--title` short and single-line (max ~50 chars, auto-truncated)
122
+ - Use `--subtitle` for additional info
123
+ - Do NOT put newlines in `--title`
124
+
125
+ Good: `--title "Open Source" --subtitle "ACE-Step v1.5"`
126
+ Bad: `--title "Open Source - ACE-Step v1.5\nCelebrating Music AI"`
127
+
128
+ ## Notes
129
+
130
+ - Audio files with absolute paths are auto-copied to `public/` by render.mjs
131
+ - Duration is auto-detected via ffprobe
132
+ - Typical render time: ~1-2 minutes for a 90s song
133
+ - Output resolution: 1920x1080, 30fps
.claude/skills/acestep-simplemv/scripts/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
.claude/skills/acestep-simplemv/scripts/package.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "name": "acestep-video",
3
+ "version": "1.0.0",
4
+ "description": "",
5
+ "main": "index.js",
6
+ "scripts": {
7
+ "start": "remotion preview",
8
+ "build": "remotion render MusicVideo out/video.mp4",
9
+ "render": "node render.mjs",
10
+ "upgrade": "remotion upgrade"
11
+ },
12
+ "keywords": [],
13
+ "author": "",
14
+ "license": "ISC",
15
+ "type": "commonjs",
16
+ "dependencies": {
17
+ "@remotion/cli": "^4.0.417",
18
+ "@remotion/media-utils": "^4.0.417",
19
+ "react": "^18.3.1",
20
+ "react-dom": "^18.3.1",
21
+ "remotion": "^4.0.417"
22
+ },
23
+ "devDependencies": {
24
+ "@types/react": "^19.2.13",
25
+ "typescript": "^5.9.3"
26
+ }
27
+ }
.claude/skills/acestep-simplemv/scripts/remotion.config.ts ADDED
@@ -0,0 +1,4 @@
1
+ import {Config} from '@remotion/cli/config';
2
+
3
+ Config.setVideoImageFormat('jpeg');
4
+ Config.setOverwriteOutput(true);
.claude/skills/acestep-simplemv/scripts/render-mv.sh ADDED
@@ -0,0 +1,123 @@
1
+ #!/bin/bash
2
+ # render-mv.sh - Render a music video from audio + lyrics
3
+ #
4
+ # Usage:
5
+ # ./render-mv.sh --audio <file> --lyrics <lrc_file> --title "Title" [options]
6
+ #
7
+ # Options:
8
+ # --audio Audio file path (absolute or relative)
9
+ # --lyrics LRC format lyrics file
10
+ # --lyrics-json JSON lyrics file [{start, end, text}]
11
+ # --title Video title (default: "Music Video")
12
+ # --subtitle Subtitle text
13
+ # --credit Bottom credit text
14
+ # --offset Lyric timing offset in seconds (default: -0.5)
15
+ # --output Output file path (default: out/<audio_basename>.mp4)
16
+ # --codec h264|h265|vp8|vp9 (default: h264)
17
+ # --browser Custom browser executable path (Chrome/Edge/Chromium)
18
+ #
19
+ # Environment variables:
20
+ # BROWSER_EXECUTABLE Path to browser executable (overrides auto-detection)
21
+ #
22
+ # Examples:
23
+ # ./render-mv.sh --audio song.mp3 --lyrics song.lrc --title "My Song"
24
+ # ./render-mv.sh --audio /path/to/abc123_1.mp3 --lyrics /path/to/abc123.lrc --title "夜桜"
25
+
26
+ set -euo pipefail
27
+
28
+ RENDER_DIR="$(cd "$(dirname "$0")" && pwd)"
29
+
30
+ # Ensure output directory exists
31
+ mkdir -p "${RENDER_DIR}/out"
32
+
33
+ # Cross-platform realpath alternative (works on macOS/Linux/Windows MSYS2)
34
+ resolve_path() {
35
+ local dir base
36
+ dir="$(cd "$(dirname "$1")" && pwd)"
37
+ base="$(basename "$1")"
38
+ echo "${dir}/${base}"
39
+ }
40
+
41
+ AUDIO=""
42
+ LYRICS=""
43
+ LYRICS_JSON=""
44
+ TITLE="Music Video"
45
+ SUBTITLE=""
46
+ CREDIT=""
47
+ OFFSET="-0.5"
48
+ OUTPUT=""
49
+ CODEC="h264"
50
+ BROWSER=""
51
+
52
+ # Parse args
53
+ while [[ $# -gt 0 ]]; do
54
+ case "$1" in
55
+ --audio) AUDIO="$2"; shift 2 ;;
56
+ --lyrics) LYRICS="$2"; shift 2 ;;
57
+ --lyrics-json) LYRICS_JSON="$2"; shift 2 ;;
58
+ --title) TITLE="$2"; shift 2 ;;
59
+ --subtitle) SUBTITLE="$2"; shift 2 ;;
60
+ --credit) CREDIT="$2"; shift 2 ;;
61
+ --offset) OFFSET="$2"; shift 2 ;;
62
+ --output) OUTPUT="$2"; shift 2 ;;
63
+ --codec) CODEC="$2"; shift 2 ;;
64
+ --browser) BROWSER="$2"; shift 2 ;;
65
+ -h|--help)
66
+ head -20 "$0" | tail -18
67
+ exit 0
68
+ ;;
69
+ *)
70
+ echo "Error: unknown argument: $1" >&2
71
+ exit 1
72
+ ;;
73
+ esac
74
+ done
75
+
76
+ if [[ -z "$AUDIO" ]]; then
77
+ echo "Error: --audio is required" >&2
78
+ exit 1
79
+ fi
80
+
81
+ if [[ ! -f "$AUDIO" ]]; then
82
+ echo "Error: audio file not found: $AUDIO" >&2
83
+ exit 1
84
+ fi
85
+
86
+ # Resolve absolute path for audio
87
+ AUDIO="$(resolve_path "$AUDIO")"
88
+
89
+ # Default output: acestep_output/<audio_basename>.mp4
90
+ if [[ -z "$OUTPUT" ]]; then
91
+ BASENAME="$(basename "${AUDIO%.*}")"
92
+ # Strip trailing _1, _2 etc from audio filename for cleaner video name
93
+ OUTPUT="${RENDER_DIR}/out/${BASENAME}.mp4"
94
+ fi
95
+
96
+ # Ensure output directory exists
97
+ mkdir -p "$(dirname "$OUTPUT")"
98
+
99
+ # Build node args array (safe quoting, no eval)
100
+ NODE_ARGS=(render.mjs --audio "$AUDIO" --title "$TITLE" --offset "$OFFSET" --output "$OUTPUT" --codec "$CODEC")
101
+
102
+ if [[ -n "$LYRICS" ]]; then
103
+ LYRICS="$(resolve_path "$LYRICS")"
104
+ NODE_ARGS+=(--lyrics "$LYRICS")
105
+ elif [[ -n "$LYRICS_JSON" ]]; then
106
+ LYRICS_JSON="$(resolve_path "$LYRICS_JSON")"
107
+ NODE_ARGS+=(--lyrics-json "$LYRICS_JSON")
108
+ fi
109
+
110
+ [[ -n "$SUBTITLE" ]] && NODE_ARGS+=(--subtitle "$SUBTITLE")
111
+ [[ -n "$CREDIT" ]] && NODE_ARGS+=(--credit "$CREDIT")
112
+ [[ -n "$BROWSER" ]] && NODE_ARGS+=(--browser "$BROWSER")
113
+
114
+ echo "Rendering MV..."
115
+ echo " Audio: $(basename "$AUDIO")"
116
+ echo " Title: $TITLE"
117
+ echo " Output: $OUTPUT"
118
+
119
+ cd "$RENDER_DIR"
120
+ node "${NODE_ARGS[@]}"
121
+
122
+ echo ""
123
+ echo "Output: $OUTPUT"
.claude/skills/acestep-simplemv/scripts/render.mjs ADDED
@@ -0,0 +1,345 @@
+ #!/usr/bin/env node
+
+ /**
+  * CLI entry point for rendering music videos.
+  *
+  * Usage:
+  *   node render.mjs --audio music.mp3 --lyrics lyrics.lrc --title "Song Name" --output out/video.mp4
+  *   node render.mjs --audio music.mp3 --lyrics-json lyrics.json --title "Song Name"
+  *
+  * Options:
+  *   --audio        Audio file path (absolute paths auto-copied to public/) or filename in public/
+  *   --lyrics       Path to LRC format lyrics file
+  *   --lyrics-json  Path to JSON lyrics file [{start, end, text}]
+  *   --title        Main title (default: "Music Video")
+  *   --subtitle     Subtitle (default: "")
+  *   --credit       Bottom credit text (default: "")
+  *   --duration     Audio duration in seconds (auto-detected if omitted)
+  *   --offset       Lyric timing offset in seconds (default: -0.5)
+  *   --output       Output file path (default: out/video.mp4)
+  *   --codec        Video codec: h264, h265, vp8, vp9 (default: h264)
+  */
+
+ import {execSync} from 'child_process';
+ import {readFileSync, readdirSync, existsSync, copyFileSync, mkdirSync, writeFileSync, unlinkSync} from 'fs';
+ import {resolve, basename, isAbsolute, join} from 'path';
+ import {homedir} from 'os';
+
+ /**
+  * Resolve a file path that may be a MSYS2/Cygwin-style path on Windows.
+  * Converts paths like /e/foo/bar to E:/foo/bar for Node.js compatibility.
+  */
+ function resolveFilePath(p) {
+   if (process.platform === 'win32' && /^\/[a-zA-Z]\//.test(p)) {
+     // Convert MSYS2 path /x/... to X:/...
+     return p[1].toUpperCase() + ':' + p.slice(2);
+   }
+   return resolve(p);
+ }
+
+ /**
+  * Find a usable browser executable for Remotion rendering.
+  *
+  * Search priority:
+  *   1. Environment variable BROWSER_EXECUTABLE
+  *   2. CLI argument --browser
+  *   3. Local Remotion cache (node_modules/.remotion, chrome-headless-shell)
+  *   4. User-level Remotion cache (chrome-headless-shell)
+  *   5. System Chrome/Edge/Chromium (require --chrome-mode=chrome-for-testing)
+  *
+  * Returns {path, chromeMode}, or {path: null, chromeMode: 'headless-shell'} if nothing is found.
+  *
+  * chromeMode:
+  *   - 'headless-shell': for the chrome-headless-shell binary (uses --headless=old)
+  *   - 'chrome-for-testing': for regular Chrome/Edge/Chromium (uses --headless=new)
+  */
+ function findBrowserExecutable(cliOverride) {
+   // 1. Environment variable (highest priority)
+   const envExe = process.env.BROWSER_EXECUTABLE;
+   if (envExe && existsSync(envExe)) {
+     const mode = isHeadlessShell(envExe) ? 'headless-shell' : 'chrome-for-testing';
+     return {path: envExe, chromeMode: mode};
+   }
+
+   // 2. CLI argument
+   if (cliOverride && existsSync(cliOverride)) {
+     const mode = isHeadlessShell(cliOverride) ? 'headless-shell' : 'chrome-for-testing';
+     return {path: cliOverride, chromeMode: mode};
+   }
+
+   const platform = process.platform;
+   const home = homedir();
+
+   // 3. Local node_modules/.remotion (chrome-headless-shell, uses --headless=old)
+   const localCacheDir = join(process.cwd(), 'node_modules', '.remotion', 'chrome-headless-shell');
+   if (existsSync(localCacheDir)) {
+     try {
+       // Structure: chrome-headless-shell/linux64/chrome-headless-shell-linux64/chrome-headless-shell
+       const platformDir = platform === 'win32' ? 'win64' : platform === 'darwin' ? 'mac-arm64' : 'linux64';
+       const exeName = platform === 'win32' ? 'chrome-headless-shell.exe' : 'chrome-headless-shell';
+       const platformPath = join(localCacheDir, platformDir);
+
+       if (existsSync(platformPath)) {
+         const subdirs = readdirSync(platformPath);
+         for (const subdir of subdirs) {
+           const exe = join(platformPath, subdir, exeName);
+           if (existsSync(exe)) return {path: exe, chromeMode: 'headless-shell'};
+         }
+       }
+     } catch {}
+   }
+
+   // 4. User home Remotion cache (chrome-headless-shell, uses --headless=old)
+   let cacheDir;
+   if (platform === 'win32') {
+     cacheDir = join(home, 'AppData', 'Local', 'remotion', 'chrome-headless-shell');
+   } else if (platform === 'darwin') {
+     cacheDir = join(home, 'Library', 'Caches', 'remotion', 'chrome-headless-shell');
+   } else {
+     cacheDir = join(home, '.cache', 'remotion', 'chrome-headless-shell');
+   }
+
+   if (existsSync(cacheDir)) {
+     try {
+       const versions = readdirSync(cacheDir).sort().reverse();
+       const exeName = platform === 'win32' ? 'chrome-headless-shell.exe' : 'chrome-headless-shell';
+       for (const ver of versions) {
+         const exe = join(cacheDir, ver, exeName);
+         if (existsSync(exe)) return {path: exe, chromeMode: 'headless-shell'};
+       }
+     } catch {}
+   }
+
+   // 5. System browsers: Chrome, Edge, Chromium (require --chrome-mode=chrome-for-testing)
+   const browserPaths = platform === 'win32' ? [
+     // Chrome
+     'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
+     'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
+     // Edge (pre-installed on Windows 10/11)
+     'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe',
+     'C:\\Program Files\\Microsoft\\Edge\\Application\\msedge.exe',
+   ] : platform === 'darwin' ? [
+     '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
+     '/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge',
+     '/Applications/Chromium.app/Contents/MacOS/Chromium',
+   ] : [
+     '/usr/bin/google-chrome',
+     '/usr/bin/google-chrome-stable',
+     '/usr/bin/chromium',
+     '/usr/bin/chromium-browser',
+     '/usr/bin/microsoft-edge',
+     '/usr/bin/microsoft-edge-stable',
+   ];
+
+   for (const p of browserPaths) {
+     if (existsSync(p)) return {path: p, chromeMode: 'chrome-for-testing'};
+   }
+
+   return {path: null, chromeMode: 'headless-shell'};
+ }
+
+ /**
+  * Check if the given executable path is a chrome-headless-shell binary.
+  */
+ function isHeadlessShell(exePath) {
+   const name = exePath.toLowerCase().replace(/\\/g, '/');
+   return name.includes('chrome-headless-shell');
+ }
+
+ function parseLrc(content) {
+   const lines = content.split(/\r?\n/).filter(l => l.trim());
+   const parsed = [];
+   for (const line of lines) {
+     const match = line.match(/^\[(\d{2}):(\d{2})(?:\.(\d{2,3}))?\]\s*(.*)$/);
+     if (match) {
+       const minutes = parseInt(match[1], 10);
+       const seconds = parseInt(match[2], 10);
+       const cs = match[3] ? parseInt(match[3].padEnd(3, '0'), 10) / 1000 : 0;
+       const time = minutes * 60 + seconds + cs;
+       const text = match[4].trim();
+       parsed.push({time, text});
+     }
+   }
+   const result = [];
+   for (let i = 0; i < parsed.length; i++) {
+     const start = parsed[i].time;
+     const end = i < parsed.length - 1 ? parsed[i + 1].time : start + 5;
+     if (parsed[i].text) {
+       result.push({start, end, text: parsed[i].text});
+     }
+   }
+   return result;
+ }
+
+ function getAudioDuration(filePath) {
+   try {
+     const result = execSync(
+       `ffprobe -v error -show_entries format=duration -of csv=p=0 "${filePath}"`,
+       {encoding: 'utf-8'}
+     ).trim();
+     return parseFloat(result);
+   } catch {
+     return null;
+   }
+ }
+
+ function parseArgs(argv) {
+   const args = {};
+   for (let i = 2; i < argv.length; i++) {
+     const key = argv[i];
+     if (key.startsWith('--') && i + 1 < argv.length) {
+       const name = key.slice(2);
+       args[name] = argv[i + 1];
+       i++;
+     }
+   }
+   return args;
+ }
+
+ const args = parseArgs(process.argv);
+
+ // Validate required args
+ if (!args.audio) {
+   console.error('Error: --audio is required');
+   console.error('Usage: node render.mjs --audio music.mp3 --lyrics lyrics.lrc --title "Song"');
+   process.exit(1);
+ }
+
+ // If audio is an absolute path, copy it into public/ and use the filename
+ let audioFileName = args.audio;
+ const resolvedAudio = resolveFilePath(args.audio);
+ if (isAbsolute(resolvedAudio)) {
+   if (!existsSync(resolvedAudio)) {
+     console.error(`Error: Audio file not found: ${resolvedAudio}`);
+     process.exit(1);
+   }
+   const pubDir = resolve('public');
+   mkdirSync(pubDir, {recursive: true});
+   const fname = basename(resolvedAudio);
+   const dest = resolve(pubDir, fname);
+   if (resolve(resolvedAudio) !== dest) {
+     copyFileSync(resolvedAudio, dest);
+     console.log(`Copied audio to public/${fname}`);
+   }
+   audioFileName = fname;
+ } else {
+   // Relative name: must exist in public/
+   const audioPath = resolve('public', args.audio);
+   if (!existsSync(audioPath)) {
+     console.error(`Error: Audio file not found in public/: ${args.audio}`);
+     process.exit(1);
+   }
+ }
+
+ // Parse lyrics
+ let lyrics = [];
+ if (args.lyrics) {
+   const lrcPath = resolveFilePath(args.lyrics);
+   if (!existsSync(lrcPath)) {
+     console.error(`Error: LRC file not found: ${lrcPath}`);
+     process.exit(1);
+   }
+   lyrics = parseLrc(readFileSync(lrcPath, 'utf-8'));
+   console.log(`Parsed ${lyrics.length} lyric lines from LRC file`);
+ } else if (args['lyrics-json']) {
+   const jsonPath = resolveFilePath(args['lyrics-json']);
+   if (!existsSync(jsonPath)) {
+     console.error(`Error: JSON lyrics file not found: ${jsonPath}`);
+     process.exit(1);
+   }
+   lyrics = JSON.parse(readFileSync(jsonPath, 'utf-8'));
+   console.log(`Loaded ${lyrics.length} lyric lines from JSON file`);
+ }
+
+ // Determine audio duration
+ let duration = args.duration ? parseFloat(args.duration) : null;
+ if (!duration) {
+   const audioPath = resolve('public', audioFileName);
+   if (existsSync(audioPath)) {
+     duration = getAudioDuration(audioPath);
+     if (duration) {
+       console.log(`Auto-detected audio duration: ${duration.toFixed(2)}s`);
+     }
+   }
+ }
+ if (!duration) {
+   console.error('Error: Could not detect audio duration. Please provide --duration');
+   process.exit(1);
+ }
+
+ // Build input props
+ // Sanitize title: single-line, max 50 chars
+ const rawTitle = (args.title || 'Music Video').replace(/[\r\n]+/g, ' ').trim();
+ const title = rawTitle.length > 50 ? rawTitle.slice(0, 47) + '...' : rawTitle;
+
+ const inputProps = {
+   audioFileName: audioFileName,
+   lyrics,
+   title,
+   subtitle: (args.subtitle || '').replace(/[\r\n]+/g, ' ').trim(),
+   creditText: args.credit || '',
+   durationInSeconds: duration,
+   lyricOffset: args.offset ? parseFloat(args.offset) : -0.5,
+ };
+
+ const output = args.output ? resolveFilePath(args.output) : 'out/video.mp4';
+ const codec = args.codec || 'h264';
+
+ // Write props to a temp file to avoid shell escaping issues
+ const propsFile = resolve('.render-props.json');
+ writeFileSync(propsFile, JSON.stringify(inputProps));
+
+ // Find a browser executable to avoid re-downloading
+ const {path: browserExe, chromeMode} = findBrowserExecutable(args.browser);
+
+ if (!browserExe) {
+   console.warn('⚠️  No browser found. Remotion will attempt to download chrome-headless-shell from Google servers.');
+   console.warn('   If the download fails (e.g. Google servers inaccessible), try one of these:');
+   console.warn('   1. Set environment variable: BROWSER_EXECUTABLE=/path/to/chrome-or-edge');
+   console.warn('   2. Pass CLI argument: --browser /path/to/chrome-or-edge');
+   console.warn('   3. Enable a proxy and retry');
+   console.warn('');
+ }
+
+ const cmd = [
+   'npx remotion render',
+   'MusicVideo',
+   `"${output}"`,
+   `--props="${propsFile}"`,
+   `--codec=${codec}`,
+   '--log=error',
+   browserExe ? `--browser-executable="${browserExe}"` : '',
+   chromeMode !== 'headless-shell' ? `--chrome-mode=${chromeMode}` : '',
+ ].filter(Boolean).join(' ');
+
+ console.log(`\nRendering video...`);
+ console.log(`  Audio:    ${args.audio}`);
+ console.log(`  Title:    ${inputProps.title}`);
+ console.log(`  Duration: ${duration.toFixed(1)}s`);
+ console.log(`  Lyrics:   ${lyrics.length} lines`);
+ console.log(`  Output:   ${output}`);
+ console.log(`  Codec:    ${codec}`);
+ if (browserExe) console.log(`  Browser:  ${browserExe}`);
+ if (chromeMode !== 'headless-shell') console.log(`  Chrome mode: ${chromeMode}`);
+ console.log('');
+
+ try {
+   const result = execSync(cmd, {encoding: 'utf-8', stdio: ['pipe', 'pipe', 'pipe']});
+   // Only show the final output file line (starts with '+') and size info
+   const outputLines = result.split(/\r?\n/).filter(l => l.includes(output) || /^\+/.test(l.replace(/\x1b\[[0-9;]*m/g, '').trim()));
+   if (outputLines.length) console.log(outputLines.join('\n'));
+   console.log(`\n✅ Video rendered successfully: ${output}`);
+ } catch (e) {
+   // Show stderr on failure for debugging
+   if (e.stderr) console.error(e.stderr.toString());
+   console.error('\n❌ Render failed');
+   process.exit(1);
+ } finally {
+   // Clean up temp props file
+   try {
+     unlinkSync(propsFile);
+   } catch {}
+ }
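The MSYS2 path handling in `resolveFilePath()` can be isolated into a pure function. `msysToWindows` below is a hypothetical helper for illustration only; it omits the `resolve()` fallback the real function applies to ordinary paths, so it stays platform-independent:

```javascript
// Standalone sketch of the MSYS2-to-Windows drive conversion used by
// resolveFilePath() above: /e/foo/bar -> E:/foo/bar. Paths that do not
// match the /<letter>/ prefix pass through unchanged.
function msysToWindows(p) {
  if (/^\/[a-zA-Z]\//.test(p)) {
    return p[1].toUpperCase() + ':' + p.slice(2);
  }
  return p;
}

console.log(msysToWindows('/e/music/song.mp3')); // E:/music/song.mp3
console.log(msysToWindows('C:/already/windows')); // unchanged
```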
.claude/skills/acestep-simplemv/scripts/render.sh ADDED
@@ -0,0 +1,12 @@
+ #!/bin/bash
+ # render.sh - Convenience wrapper for rendering music videos
+ #
+ # Usage:
+ #   ./render.sh --audio music.mp3 --lyrics lyrics.lrc --title "Song Name"
+ #   ./render.sh --audio music.mp3 --lyrics-json lyrics.json --title "Song" --output out/mv.mp4
+ #
+ # All options are passed through to render.mjs. See render.mjs for the full options list.
+
+ set -e
+ cd "$(dirname "$0")"
+ node render.mjs "$@"
.claude/skills/acestep-simplemv/scripts/src/AudioVisualization.tsx ADDED
@@ -0,0 +1,314 @@
+ import React from 'react';
+ import {
+   AbsoluteFill,
+   Audio,
+   useCurrentFrame,
+   useVideoConfig,
+   interpolate,
+   Easing,
+   staticFile,
+ } from 'remotion';
+ import {useAudioData, visualizeAudio} from '@remotion/media-utils';
+ import {MVInputProps} from './types';
+
+ export const AudioVisualization: React.FC<MVInputProps> = ({
+   audioFileName,
+   lyrics,
+   title,
+   subtitle,
+   creditText,
+   lyricOffset,
+ }) => {
+   const frame = useCurrentFrame();
+   const {fps, durationInFrames} = useVideoConfig();
+
+   const audioSrc = audioFileName.startsWith('http')
+     ? audioFileName
+     : staticFile(audioFileName);
+
+   const audioData = useAudioData(audioSrc);
+
+   if (!audioData) {
+     return null;
+   }
+
+   const visualization = visualizeAudio({
+     fps,
+     frame,
+     audioData,
+     numberOfSamples: 128,
+     optimizeFor: 'speed',
+   });
+
+   const currentTime = frame / fps + lyricOffset;
+
+   const currentLyric = lyrics.find(
+     (lyric) => currentTime >= lyric.start && currentTime < lyric.end
+   );
+
+   const lyricProgress = currentLyric
+     ? interpolate(
+         currentTime,
+         [currentLyric.start, currentLyric.start + 0.3],
+         [0, 1],
+         {extrapolateRight: 'clamp'}
+       )
+     : 0;
+
+   const titleOpacity = interpolate(frame, [0, 30], [0, 1], {
+     extrapolateRight: 'clamp',
+   });
+
+   const titleY = interpolate(frame, [0, 30], [-50, 0], {
+     extrapolateRight: 'clamp',
+     easing: Easing.out(Easing.ease),
+   });
+
+   const hue = interpolate(frame, [0, durationInFrames], [200, 320], {
+     extrapolateRight: 'wrap',
+   });
+
+   const avgAmplitude =
+     visualization.reduce((sum, val) => sum + val, 0) / visualization.length;
+
+   return (
+     <AbsoluteFill>
+       {/* Animated gradient background */}
+       <AbsoluteFill
+         style={{
+           background: `linear-gradient(135deg, hsl(${hue}, 80%, 12%) 0%, hsl(${hue + 80}, 70%, 8%) 100%)`,
+         }}
+       />
+
+       {/* Radial glow effect */}
+       <AbsoluteFill
+         style={{
+           background: `radial-gradient(circle at 50% 50%, hsla(${hue}, 100%, 50%, ${avgAmplitude * 0.3}) 0%, transparent 50%)`,
+         }}
+       />
+
+       {/* Audio source */}
+       <Audio src={audioSrc} />
+
+       {/* Bottom frequency bars */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'flex-end',
+           alignItems: 'center',
+         }}
+       >
+         <div
+           style={{
+             display: 'flex',
+             alignItems: 'flex-end',
+             justifyContent: 'center',
+             gap: 4,
+             height: 350,
+             width: '90%',
+             marginBottom: 180,
+           }}
+         >
+           {visualization.map((value, index) => {
+             const scaledValue = Math.pow(value, 0.6);
+             const barHeight = Math.max(scaledValue * 800, 20);
+             const colorIndex = (index / visualization.length) * 360;
+
+             return (
+               <div
+                 key={index}
+                 style={{
+                   width: `${100 / visualization.length}%`,
+                   height: barHeight,
+                   background: `linear-gradient(to top,
+                     hsl(${(colorIndex + hue) % 360}, 90%, 60%),
+                     hsl(${(colorIndex + hue + 40) % 360}, 90%, 70%))`,
+                   borderRadius: '4px 4px 0 0',
+                   boxShadow: `0 0 ${10 + scaledValue * 30}px hsla(${(colorIndex + hue) % 360}, 100%, 60%, ${scaledValue})`,
+                   transition: 'height 0.05s ease-out',
+                 }}
+               />
+             );
+           })}
+         </div>
+       </AbsoluteFill>
+
+       {/* Symmetrical side bars */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'center',
+           alignItems: 'center',
+         }}
+       >
+         {/* Left bars */}
+         <div
+           style={{
+             position: 'absolute',
+             left: 40,
+             display: 'flex',
+             flexDirection: 'column',
+             gap: 8,
+             height: '80%',
+             justifyContent: 'space-around',
+           }}
+         >
+           {visualization.slice(0, 20).map((value, index) => {
+             const scaledValue = Math.pow(value, 0.6);
+             const barWidth = Math.max(scaledValue * 300, 10);
+             const colorIndex = (index / 20) * 360;
+             return (
+               <div
+                 key={index}
+                 style={{
+                   width: barWidth,
+                   height: 12,
+                   background: `linear-gradient(to right,
+                     hsl(${(colorIndex + hue) % 360}, 90%, 60%),
+                     hsl(${(colorIndex + hue + 40) % 360}, 90%, 70%))`,
+                   borderRadius: '0 6px 6px 0',
+                   boxShadow: `0 0 ${10 + scaledValue * 20}px hsla(${(colorIndex + hue) % 360}, 100%, 60%, ${scaledValue})`,
+                 }}
+               />
+             );
+           })}
+         </div>
+
+         {/* Right bars */}
+         <div
+           style={{
+             position: 'absolute',
+             right: 40,
+             display: 'flex',
+             flexDirection: 'column',
+             gap: 8,
+             height: '80%',
+             justifyContent: 'space-around',
+             alignItems: 'flex-end',
+           }}
+         >
+           {visualization.slice(0, 20).map((value, index) => {
+             const scaledValue = Math.pow(value, 0.6);
+             const barWidth = Math.max(scaledValue * 300, 10);
+             const colorIndex = (index / 20) * 360;
+             return (
+               <div
+                 key={index}
+                 style={{
+                   width: barWidth,
+                   height: 12,
+                   background: `linear-gradient(to left,
+                     hsl(${(colorIndex + hue + 180) % 360}, 90%, 60%),
+                     hsl(${(colorIndex + hue + 220) % 360}, 90%, 70%))`,
+                   borderRadius: '6px 0 0 6px',
+                   boxShadow: `0 0 ${10 + scaledValue * 20}px hsla(${(colorIndex + hue + 180) % 360}, 100%, 60%, ${scaledValue})`,
+                 }}
+               />
+             );
+           })}
+         </div>
+       </AbsoluteFill>
+
+       {/* Center title area */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'flex-start',
+           alignItems: 'center',
+           paddingTop: 60,
+         }}
+       >
+         <div
+           style={{
+             textAlign: 'center',
+             transform: `scale(${1 + avgAmplitude * 0.1})`,
+             transition: 'transform 0.1s ease-out',
+           }}
+         >
+           <div
+             style={{
+               fontSize: 96,
+               fontWeight: 'bold',
+               color: 'white',
+               opacity: titleOpacity,
+               transform: `translateY(${titleY}px)`,
+               textShadow: `0 0 40px hsla(${hue}, 100%, 70%, 0.8), 0 4px 20px rgba(0,0,0,0.5)`,
+               fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+               marginBottom: 10,
+             }}
+           >
+             {title}
+           </div>
+           <div
+             style={{
+               fontSize: 56,
+               fontWeight: '600',
+               color: 'rgba(255,255,255,0.95)',
+               opacity: titleOpacity,
+               transform: `translateY(${titleY}px)`,
+               textShadow: `0 0 30px hsla(${hue + 60}, 100%, 70%, 0.6), 0 2px 10px rgba(0,0,0,0.5)`,
+               fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+               letterSpacing: '4px',
+             }}
+           >
+             {subtitle}
+           </div>
+         </div>
+       </AbsoluteFill>
+
+       {/* Lyrics display */}
+       {currentLyric && currentLyric.text && (
+         <AbsoluteFill
+           style={{
+             justifyContent: 'center',
+             alignItems: 'center',
+             paddingTop: 100,
+           }}
+         >
+           <div
+             style={{
+               fontSize: 48,
+               fontWeight: '600',
+               color: 'white',
+               textAlign: 'center',
+               maxWidth: '85%',
+               opacity: lyricProgress,
+               transform: `translateY(${(1 - lyricProgress) * 30}px)`,
+               textShadow: `0 0 40px hsla(${hue}, 100%, 70%, 0.8), 0 4px 30px rgba(0,0,0,0.9)`,
+               fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+               lineHeight: 1.5,
+               padding: '25px 50px',
+               background: `linear-gradient(135deg, rgba(0,0,0,0.4), rgba(0,0,0,0.2))`,
+               backdropFilter: 'blur(15px)',
+               borderRadius: '20px',
+               border: `2px solid hsla(${hue}, 80%, 60%, 0.3)`,
+               boxShadow: `0 8px 32px rgba(0,0,0,0.5), inset 0 0 40px hsla(${hue}, 100%, 50%, 0.1)`,
+             }}
+           >
+             {currentLyric.text}
+           </div>
+         </AbsoluteFill>
+       )}
+
+       {/* Bottom credit text */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'flex-end',
+           alignItems: 'center',
+           padding: 50,
+         }}
+       >
+         <div
+           style={{
+             fontSize: 32,
+             fontWeight: '500',
+             color: 'white',
+             opacity: 0.8,
+             textAlign: 'center',
+             textShadow: `0 0 20px hsla(${hue}, 100%, 70%, 0.6), 0 2px 10px rgba(0,0,0,0.7)`,
+             fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+           }}
+         >
+           {creditText}
+         </div>
+       </AbsoluteFill>
+     </AbsoluteFill>
+   );
+ };
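The lyric-selection math in this component (the `currentLyric` lookup plus a 0.3 s fade-in ramp) can be sketched without Remotion. `currentLyricState` is a hypothetical helper for illustration, with `interpolate()` replaced by a plain clamped linear ramp:

```javascript
// Sketch of the lyric timing logic: offset the playhead, find the active
// line, and fade it in over its first 0.3 seconds (clamped to [0, 1]).
function currentLyricState(lyrics, frame, fps, lyricOffset) {
  const t = frame / fps + lyricOffset;
  const lyric = lyrics.find((l) => t >= l.start && t < l.end) ?? null;
  const progress = lyric
    ? Math.min(Math.max((t - lyric.start) / 0.3, 0), 1)
    : 0;
  return {lyric, progress};
}

const lyrics = [{start: 2, end: 5, text: 'hello'}];
// At 30 fps with a -0.5 s offset, frame 90 maps to t = 2.5 s:
// inside the line and past the fade window, so progress is 1.
console.log(currentLyricState(lyrics, 90, 30, -0.5));
```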
.claude/skills/acestep-simplemv/scripts/src/Root.tsx ADDED
@@ -0,0 +1,31 @@
+ import React from 'react';
+ import {Composition, CalculateMetadataFunction} from 'remotion';
+ import {AudioVisualization} from './AudioVisualization';
+ import {MVInputProps, defaultProps} from './types';
+
+ const calculateMetadata: CalculateMetadataFunction<MVInputProps> = ({props}) => {
+   const fps = 30;
+   const durationInFrames = Math.ceil(props.durationInSeconds * fps);
+   return {
+     durationInFrames,
+     fps,
+     width: 1920,
+     height: 1080,
+   };
+ };
+
+ export const RemotionRoot: React.FC = () => {
+   return (
+     <>
+       <Composition
+         id="MusicVideo"
+         component={AudioVisualization}
+         fps={30}
+         width={1920}
+         height={1080}
+         defaultProps={defaultProps}
+         calculateMetadata={calculateMetadata}
+       />
+     </>
+   );
+ };
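The `calculateMetadata` hook above derives the composition length from the audio duration. A minimal sketch of that arithmetic (`durationInFrames` here is a hypothetical stand-in, assuming the same 30 fps):

```javascript
// Partial final frames are rounded up so the audio is never truncated.
const fps = 30;
const durationInFrames = (seconds) => Math.ceil(seconds * fps);

console.log(durationInFrames(150));    // 4500
console.log(durationInFrames(150.02)); // 4501 (partial frame rounds up)
```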
.claude/skills/acestep-simplemv/scripts/src/index.ts ADDED
@@ -0,0 +1,4 @@
+ import {registerRoot} from 'remotion';
+ import {RemotionRoot} from './Root';
+
+ registerRoot(RemotionRoot);
.claude/skills/acestep-simplemv/scripts/src/parseLrc.ts ADDED
@@ -0,0 +1,40 @@
+ import {LyricLine} from './types';
+
+ /**
+  * Parse LRC format lyrics into a LyricLine array.
+  * LRC format: [mm:ss.xx] lyrics text
+  *
+  * Example:
+  *   [00:02.99] Version one point five is here today
+  *   [00:07.00] ACE-Step's rising, leading the way
+  */
+ export function parseLrc(lrcContent: string): LyricLine[] {
+   const lines = lrcContent.split('\n').filter((line) => line.trim());
+   const parsed: {time: number; text: string}[] = [];
+
+   for (const line of lines) {
+     // Match [mm:ss.xx] or [mm:ss] format
+     const match = line.match(/^\[(\d{2}):(\d{2})(?:\.(\d{2,3}))?\]\s*(.*)$/);
+     if (match) {
+       const minutes = parseInt(match[1], 10);
+       const seconds = parseInt(match[2], 10);
+       // 2-3 digit fractional part, padded to milliseconds: "99" -> 0.99 s
+       const fracSeconds = match[3] ? parseInt(match[3].padEnd(3, '0'), 10) / 1000 : 0;
+       const time = minutes * 60 + seconds + fracSeconds;
+       const text = match[4].trim();
+       parsed.push({time, text});
+     }
+   }
+
+   // Convert to LyricLine with start/end: each line ends where the next
+   // begins; the last line gets a fixed 5-second window.
+   const result: LyricLine[] = [];
+   for (let i = 0; i < parsed.length; i++) {
+     const start = parsed[i].time;
+     const end = i < parsed.length - 1 ? parsed[i + 1].time : start + 5;
+     const text = parsed[i].text;
+     if (text) {
+       result.push({start, end, text});
+     }
+   }
+
+   return result;
+ }
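The timestamp arithmetic above (a 2-3 digit fractional part padded to milliseconds, added to minutes and seconds) can be checked in isolation. `lrcTimeToSeconds` is a hypothetical helper mirroring that math:

```javascript
// [mm:ss.xx] -> seconds. frac is the raw 2-3 digit fractional string,
// padded to 3 digits so "99" and "990" both mean 0.99 s.
function lrcTimeToSeconds(mm, ss, frac) {
  const sub = frac ? parseInt(frac.padEnd(3, '0'), 10) / 1000 : 0;
  return parseInt(mm, 10) * 60 + parseInt(ss, 10) + sub;
}

console.log(lrcTimeToSeconds('00', '02', '99'));      // 2.99
console.log(lrcTimeToSeconds('01', '07', undefined)); // 67
```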
.claude/skills/acestep-simplemv/scripts/src/types.ts ADDED
@@ -0,0 +1,32 @@
+ export interface LyricLine {
+   start: number;
+   end: number;
+   text: string;
+ }
+
+ export interface MVInputProps extends Record<string, unknown> {
+   /** Path to audio file (relative to public/ or absolute URL) */
+   audioFileName: string;
+   /** Lyrics as JSON array [{start, end, text}] */
+   lyrics: LyricLine[];
+   /** Main title displayed at top */
+   title: string;
+   /** Subtitle displayed below title */
+   subtitle: string;
+   /** Bottom credit text */
+   creditText: string;
+   /** Audio duration in seconds (used to calculate total frames) */
+   durationInSeconds: number;
+   /** Lyric timing offset in seconds (positive = delay, negative = advance) */
+   lyricOffset: number;
+ }
+
+ export const defaultProps: MVInputProps = {
+   audioFileName: 'celebration.mp3',
+   lyrics: [],
+   title: 'ACE-Step',
+   subtitle: 'v1.5',
+   creditText: 'Powered by Claude Code + ACE-Step',
+   durationInSeconds: 150,
+   lyricOffset: -0.5,
+ };
.claude/skills/acestep-simplemv/scripts/tsconfig.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "compilerOptions": {
+     "target": "ES2022",
+     "module": "ES2022",
+     "moduleResolution": "Bundler",
+     "lib": ["DOM", "ES2022"],
+     "jsx": "react-jsx",
+     "skipLibCheck": true,
+     "strict": true,
+     "esModuleInterop": true,
+     "allowSyntheticDefaultImports": true,
+     "forceConsistentCasingInFileNames": true,
+     "resolveJsonModule": true,
+     "isolatedModules": true,
+     "noEmit": true
+   },
+   "include": ["src/**/*"]
+ }
.claude/skills/acestep-songwriting/SKILL.md ADDED
@@ -0,0 +1,194 @@
---
name: acestep-songwriting
description: Music songwriting guide for ACE-Step. Provides professional knowledge on writing captions and lyrics, choosing BPM/key/duration, and structuring songs. Use this skill when users want to create, write, or plan a song before generating it with ACE-Step.
allowed-tools: Read
---

# ACE-Step Songwriting Guide

Professional music creation knowledge for writing captions, lyrics, and choosing music parameters for ACE-Step.

## Output Format

After using this guide, produce three things for the acestep skill:
1. **Caption** (`-c`): Style/genre/instruments/emotion description
2. **Lyrics** (`-l`): Complete structured lyrics with tags
3. **Parameters**: `--duration`, `--bpm`, `--key`, `--time-signature`, `--language`

---

## Caption: The Most Important Input

**The caption is the single most important factor affecting the generated music.**

It supports multiple formats: simple style words, comma-separated tags, or complex natural-language descriptions.

### Common Dimensions

| Dimension | Examples |
|-----------|----------|
| **Style/Genre** | pop, rock, jazz, electronic, hip-hop, R&B, folk, classical, lo-fi, synthwave |
| **Emotion/Atmosphere** | melancholic, uplifting, energetic, dreamy, dark, nostalgic, euphoric, intimate |
| **Instruments** | acoustic guitar, piano, synth pads, 808 drums, strings, brass, electric bass |
| **Timbre Texture** | warm, bright, crisp, muddy, airy, punchy, lush, raw, polished |
| **Era Reference** | 80s synth-pop, 90s grunge, 2010s EDM, vintage soul, modern trap |
| **Production Style** | lo-fi, high-fidelity, live recording, studio-polished, bedroom pop |
| **Vocal Characteristics** | female vocal, male vocal, breathy, powerful, falsetto, raspy, choir |
| **Speed/Rhythm** | slow tempo, mid-tempo, fast-paced, groovy, driving, laid-back |
| **Structure Hints** | building intro, catchy chorus, dramatic bridge, fade-out ending |

### Caption Writing Principles

1. **Specific beats vague** — "sad piano ballad with breathy female vocal" > "a sad song"
2. **Combine multiple dimensions** — style + emotion + instruments + timbre anchors the direction precisely
3. **Use references well** — "in the style of 80s synthwave" conveys a complex aesthetic quickly
4. **Texture words are useful** — warm, crisp, airy, punchy influence mixing and timbre
5. **Don't pursue perfection** — the caption is a starting point; iterate based on results
6. **Granularity determines freedom** — less detail = more model creativity; more detail = more control
7. **Avoid conflicting words** — "classical strings" + "hardcore metal" degrades output
   - **Fix: repetition reinforcement** — repeat the elements you want more of
   - **Fix: conflict to evolution** — "Start with soft strings, middle becomes metal rock, end turns to hip-hop"
8. **Don't put BPM/key/tempo in the caption** — use the dedicated parameters instead

---

## Lyrics: The Temporal Script

The Lyrics input controls how the music unfolds over time. It carries:
- The lyric text itself
- **Structure tags** ([Verse], [Chorus], [Bridge]...)
- **Vocal style hints** ([raspy vocal], [whispered]...)
- **Instrumental sections** ([guitar solo], [drum break]...)
- **Energy changes** ([building energy], [explosive drop]...)

### Structure Tags

| Category | Tag | Description |
|----------|-----|-------------|
| **Basic Structure** | `[Intro]` | Opening, establish atmosphere |
| | `[Verse]` / `[Verse 1]` | Verse, narrative progression |
| | `[Pre-Chorus]` | Pre-chorus, build energy |
| | `[Chorus]` | Chorus, emotional climax |
| | `[Bridge]` | Bridge, transition or elevation |
| | `[Outro]` | Ending, conclusion |
| **Dynamic Sections** | `[Build]` | Energy gradually rising |
| | `[Drop]` | Electronic music energy release |
| | `[Breakdown]` | Reduced instrumentation, space |
| **Instrumental** | `[Instrumental]` | Pure instrumental, no vocals |
| | `[Guitar Solo]` | Guitar solo |
| | `[Piano Interlude]` | Piano interlude |
| **Special** | `[Fade Out]` | Fade-out ending |
| | `[Silence]` | Silence |

### Combining Tags

Use `-` for finer control, but keep it concise:

```
✅ [Chorus - anthemic]
❌ [Chorus - anthemic - stacked harmonies - high energy - powerful - epic]
```

Put complex style descriptions in the Caption, not in tags.

### Caption-Lyrics Consistency

**Models are not good at resolving conflicts.** Checklist:
- Instruments in Caption ↔ instrumental section tags in Lyrics
- Emotion in Caption ↔ energy tags in Lyrics
- Vocal description in Caption ↔ vocal control tags in Lyrics

### Vocal Control Tags

| Tag | Effect |
|-----|--------|
| `[raspy vocal]` | Raspy, textured vocals |
| `[whispered]` | Whispered delivery |
| `[falsetto]` | Falsetto |
| `[powerful belting]` | Powerful, high-pitched singing |
| `[spoken word]` | Rap/recitation |
| `[harmonies]` | Layered harmonies |
| `[call and response]` | Call and response |
| `[ad-lib]` | Improvised embellishments |

### Energy and Emotion Tags

| Tag | Effect |
|-----|--------|
| `[high energy]` | High energy, passionate |
| `[low energy]` | Low energy, restrained |
| `[building energy]` | Increasing energy |
| `[explosive]` | Explosive energy |
| `[melancholic]` | Melancholic |
| `[euphoric]` | Euphoric |
| `[dreamy]` | Dreamy |
| `[aggressive]` | Aggressive |

### Lyric Writing Tips

1. **6-10 syllables per line** — the model aligns syllables to beats; keep similar counts for lines in the same position (±1-2)
2. **Uppercase = stronger intensity** — `WE ARE THE CHAMPIONS!` (shouting) vs `walking through the streets` (normal)
3. **Parentheses = background vocals** — `We rise together (together)`
4. **Extend vowels** — `Feeeling so aliiive` (use cautiously; effects are unstable)
5. **Clear section separation** — blank lines between sections

### Avoiding "AI-flavored" Lyrics

| Red Flag | Description |
|----------|-------------|
| **Adjective stacking** | "neon skies, electric hearts, endless dreams" — vague imagery filler |
| **Rhyme chaos** | Inconsistent patterns or forced rhymes that break the meaning |
| **Blurred boundaries** | Lyric content crosses structure tags |
| **No breathing room** | Lines too long to sing in one breath |
| **Mixed metaphors** | Water → fire → flying — listeners can't anchor |

**Metaphor discipline**: one core metaphor per song; explore its multiple aspects.

---

## Music Metadata

**Most of the time, let the LM auto-infer these.** Only set them manually when you have clear requirements.

| Parameter | Range | Description |
|-----------|-------|-------------|
| `bpm` | 30–300 | Slow 60–80, mid 90–120, fast 130–180 |
| `keyscale` | Key | e.g. `C Major`, `Am`. Common keys (C, G, D, Am, Em) are most stable |
| `timesignature` | Time sig | `4/4` (most common), `3/4` (waltz), `6/8` (swing) |
| `vocal_language` | Language | Usually auto-detected from lyrics |
| `duration` | Seconds | See duration calculation below |

### When to Set Manually

| Scenario | Set |
|----------|-----|
| Daily generation | Let the LM auto-infer |
| Clear tempo requirement | `bpm` |
| Specific style (waltz) | `timesignature=3/4` |
| Match other material | `bpm` + `duration` |
| Specific key color | `keyscale` |

---

## Duration Calculation

### Estimation Method

- **Intro/Outro**: 5-10 seconds each
- **Instrumental sections**: 5-15 seconds each
- **Typical structures**:
  - 2 verses + 2 choruses: 120-150s minimum
  - 2 verses + 2 choruses + bridge: 180-240s minimum
  - Full song with intro/outro: 210-270s (3.5-4.5 min)

### BPM and Duration Relationship

- **Slower BPM (60-80)**: needs MORE duration for the same lyrics
- **Medium BPM (100-130)**: standard duration
- **Faster BPM (150-180)**: can fit more lyrics, but still needs breathing room

**Rule of thumb**: when in doubt, estimate longer. A song that is too short feels rushed.
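The estimation method above can be sketched as a small helper. The per-section lengths here are assumed midpoints of the heuristic ranges in this guide, not anything ACE-Step enforces:

```shell
#!/bin/sh
# Rough duration estimate from song structure.
# Assumed per-section lengths (seconds): intro/outro 10 each,
# verse 40, chorus 35, bridge 25, instrumental section 10.
estimate_duration() {
  verses=$1; choruses=$2; bridges=$3; instrumentals=$4
  echo $(( 10 + 10 + verses * 40 + choruses * 35 + bridges * 25 + instrumentals * 10 ))
}

# 2 verses + 2 choruses + bridge + one solo, with intro/outro:
estimate_duration 2 2 1 1   # → 205
```

This lands inside the 180-240s band suggested above; round up rather than down when passing `--duration`.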

---

Note: keep Lyrics tags (piano, powerful, whispered) consistent with the Caption (piano ballad, building to powerful chorus, intimate).
.claude/skills/acestep/SKILL.md ADDED
@@ -0,0 +1,253 @@
---
name: acestep
description: Use ACE-Step API to generate music, edit songs, and remix music. Supports text-to-music, lyrics generation, audio continuation, and audio repainting. Use this skill when users mention generating music, creating songs, music production, remixing, or audio continuation.
allowed-tools: Read, Write, Bash, Skill
---

# ACE-Step Music Generation Skill

Use the ACE-Step V1.5 API for music generation. **Always use the `scripts/acestep.sh` script** — do NOT call API endpoints directly.

## Quick Start

```bash
# 1. cd to this skill's directory
cd {project_root}/{.claude or .codex}/skills/acestep/

# 2. Check API service health
./scripts/acestep.sh health

# 3. Generate with lyrics (recommended)
./scripts/acestep.sh generate -c "pop, female vocal, piano" -l "[Verse] Your lyrics here..." --duration 120 --language zh

# 4. Output saved to: {project_root}/acestep_output/
```

## Workflow

For user requests requiring vocals:
1. Use the **acestep-songwriting** skill for lyrics writing, caption creation, and duration/BPM/key selection
2. Write complete, well-structured lyrics yourself based on the songwriting guide
3. Generate using Caption mode with the `-c` and `-l` parameters

Only use Simple/Random mode (`-d` or `random`) for quick inspiration or instrumental exploration.

If the user needs a simple music video, use the **acestep-simplemv** skill to render one with waveform visualization and synced lyrics.

**MV Production Requirements**: Making a simple MV requires three additional skills to be installed:
- **acestep-songwriting** — for writing lyrics and planning song structure
- **acestep-lyrics-transcription** — for transcribing audio to timestamped lyrics (LRC)
- **acestep-simplemv** — for rendering the final music video

## Script Commands

**CRITICAL - Complete Lyrics Input**: When providing lyrics via the `-l` parameter, you MUST pass ALL lyrics content WITHOUT any omission:
- If the user provides lyrics, pass the ENTIRE text they give you
- If you generate lyrics yourself, pass the COMPLETE lyrics you created
- NEVER truncate, shorten, or pass only partial lyrics
- Missing lyrics will result in incomplete or incoherent songs

**Music Parameters**: Use the **acestep-songwriting** skill for guidance on duration, BPM, key scale, and time signature.

```bash
# need to cd to this skill's directory first
cd {project_root}/{.claude or .codex}/skills/acestep/

# Caption mode - RECOMMENDED: Write lyrics first, then generate
./scripts/acestep.sh generate -c "Electronic pop, energetic synths" -l "[Verse] Your complete lyrics
[Chorus] Full chorus here..." --duration 120 --bpm 128

# Instrumental only
./scripts/acestep.sh generate "Jazz with saxophone"

# Quick exploration (Simple/Random mode)
./scripts/acestep.sh generate -d "A cheerful song about spring"
./scripts/acestep.sh random

# Options
./scripts/acestep.sh generate "Rock" --duration 60 --batch 2
./scripts/acestep.sh generate "EDM" --no-thinking   # Faster

# Other commands
./scripts/acestep.sh status <job_id>
./scripts/acestep.sh health
./scripts/acestep.sh models
```

## Output Files

After generation, the script automatically saves results to the `acestep_output` folder in the project root (at the same level as `.claude`):

```
project_root/
├── .claude/
│   └── skills/acestep/...
├── acestep_output/          # Output directory
│   ├── <job_id>.json        # Complete task result (JSON)
│   ├── <job_id>_1.mp3       # First audio file
│   ├── <job_id>_2.mp3       # Second audio file (if batch_size > 1)
│   └── ...
└── ...
```

### JSON Result Structure

**Important**: When LM enhancement is enabled (`use_format=true`), the final synthesized content may differ from your input. Check the JSON file for the actual values:

| Field | Description |
|-------|-------------|
| `prompt` | **Actual caption** used for synthesis (may be LM-enhanced) |
| `lyrics` | **Actual lyrics** used for synthesis (may be LM-enhanced) |
| `metas.prompt` | Original input caption |
| `metas.lyrics` | Original input lyrics |
| `metas.bpm` | BPM used |
| `metas.keyscale` | Key scale used |
| `metas.duration` | Duration in seconds |
| `generation_info` | Detailed timing and model info |
| `seed_value` | Seeds used (for reproducibility) |
| `lm_model` | LM model name |
| `dit_model` | DiT model name |

To get the actual synthesized lyrics, parse the JSON and read the top-level `lyrics` field, not `metas.lyrics`.
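For example, the two fields can be read with jq like this. The file below is a hypothetical, minimal stand-in for a real `acestep_output/<job_id>.json`:

```shell
#!/bin/sh
# Hypothetical minimal job result — real files contain many more fields.
cat > job.json <<'EOF'
{"lyrics": "[Verse] enhanced line", "metas": {"lyrics": "[Verse] original line", "bpm": 120}}
EOF

jq -r '.lyrics' job.json        # actual synthesized lyrics (possibly LM-enhanced)
jq -r '.metas.lyrics' job.json  # original input lyrics
```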

## Configuration

**Important**: Configuration follows this priority (high to low):

1. **Command-line arguments** > **config.json defaults**
2. User-specified parameters **temporarily override** defaults but **do not modify** config.json
3. Only the `config --set` command **permanently modifies** config.json

### Default Config File (`scripts/config.json`)

```json
{
  "api_url": "http://127.0.0.1:8001",
  "api_key": "",
  "api_mode": "completion",
  "generation": {
    "thinking": true,
    "use_format": false,
    "use_cot_caption": true,
    "use_cot_language": false,
    "batch_size": 1,
    "audio_format": "mp3",
    "vocal_language": "en"
  }
}
```

| Option | Default | Description |
|--------|---------|-------------|
| `api_url` | `http://127.0.0.1:8001` | API server address |
| `api_key` | `""` | API authentication key (optional) |
| `api_mode` | `completion` | API mode: `completion` (OpenRouter, default) or `native` (polling) |
| `generation.thinking` | `true` | Enable 5Hz LM (higher quality, slower) |
| `generation.audio_format` | `mp3` | Output format (mp3/wav/flac) |
| `generation.vocal_language` | `en` | Vocal language |

## Prerequisites - ACE-Step API Service

**IMPORTANT**: This skill requires the ACE-Step API server to be running.

### Required Dependencies

The `scripts/acestep.sh` script requires: **curl** and **jq**.

```bash
# Check dependencies
curl --version
jq --version
```

If jq is not installed, the script will attempt to install it automatically. If automatic installation fails:
- **Windows**: `choco install jq` or download from https://jqlang.github.io/jq/download/
- **macOS**: `brew install jq`
- **Linux**: `sudo apt-get install jq` (Debian/Ubuntu) or `sudo dnf install jq` (Fedora)

### Before First Use

**You MUST check the API key and URL status before proceeding.** Run:

```bash
cd "{project_root}/{.claude or .codex}/skills/acestep/" && bash ./scripts/acestep.sh config --check-key
cd "{project_root}/{.claude or .codex}/skills/acestep/" && bash ./scripts/acestep.sh config --get api_url
```

#### Case 1: Using Official Cloud API (`https://api.acemusic.ai`) without API key

If `api_url` is `https://api.acemusic.ai` and `api_key` is `empty`, you MUST stop and guide the user to configure their key:

1. Tell the user: "You're using the ACE-Step official cloud API, but no API key is configured. An API key is required to use this service."
2. Explain how to get a key: API keys are currently available through the official ACE-Step Discord community (https://discord.gg/bGVxwUyD). Additional distribution methods will be added in the future.
3. Use `AskUserQuestion` to ask the user to provide their API key.
4. Once provided, configure it:
   ```bash
   cd "{project_root}/{.claude or .codex}/skills/acestep/" && bash ./scripts/acestep.sh config --set api_key <KEY>
   ```
5. Additionally, inform the user: "If you also want to render music videos (MV), it's recommended to configure a lyrics transcription API key as well (OpenAI Whisper or ElevenLabs Scribe), so that lyrics can be automatically transcribed with accurate timestamps. You can configure it later via the `acestep-lyrics-transcription` skill."

#### Case 2: API key is configured

Verify the API endpoint with `./scripts/acestep.sh health` and proceed with music generation.

#### Case 3: Using local/custom API without key

Local services (`http://127.0.0.1:*`) typically don't require a key. Verify with `./scripts/acestep.sh health` and proceed.

If the health check fails:
- Ask: "Do you have ACE-Step installed?"
- **If installed but not running**: Use the acestep-docs skill to help them start the service
- **If not installed**: Use the acestep-docs skill to guide them through installation

### Service Configuration

**Official Cloud API:** ACE-Step provides an official API endpoint at `https://api.acemusic.ai`. To use it:
```bash
./scripts/acestep.sh config --set api_url "https://api.acemusic.ai"
./scripts/acestep.sh config --set api_key "your-key"
./scripts/acestep.sh config --set api_mode completion
```
API keys are currently available through the official ACE-Step Discord community. Additional distribution methods will be added in the future.

**Local Service (Default):** No configuration needed — connects to `http://127.0.0.1:8001`.

**Custom Remote Service:** Update `scripts/config.json` or use:
```bash
./scripts/acestep.sh config --set api_url "http://remote-server:8001"
./scripts/acestep.sh config --set api_key "your-key"
```

**API Key Handling**: When checking whether an API key is configured, use `config --check-key`, which only reports `configured` or `empty` without printing the actual key. **NEVER use `config --get api_key`** or read `config.json` directly — these would expose the user's API key. The `config --list` command is safe — it automatically masks API keys as `***` in output.

### API Mode

The skill supports two API modes. Switch via `api_mode` in `scripts/config.json`:

| Mode | Endpoint | Description |
|------|----------|-------------|
| `completion` (default) | `/v1/chat/completions` | OpenRouter-compatible, sync request, audio returned as base64 |
| `native` | `/release_task` + `/query_result` | Async polling mode, supports all parameters |

**Switch mode:**
```bash
./scripts/acestep.sh config --set api_mode completion
./scripts/acestep.sh config --set api_mode native
```

**Completion mode notes:**
- No polling needed — a single request returns the result directly
- Audio is base64-encoded inline in the response (auto-decoded and saved)
- `inference_steps`, `infer_method`, `shift` are not configurable (server defaults)
- `--no-wait` and `status` commands are not applicable in completion mode
- Requires the `model` field — auto-detected from `/v1/models` if not specified

### Using acestep-docs Skill for Setup Help

**IMPORTANT**: For installation and startup, always use the acestep-docs skill to get complete and accurate guidance.

**DO NOT provide simplified startup commands** — each user's environment may be different. Always guide them to use acestep-docs for proper setup.

---

For API debugging, see [API Reference](./api-reference.md).
.claude/skills/acestep/api-reference.md ADDED
@@ -0,0 +1,149 @@
# ACE-Step API Reference

> For debugging and advanced usage only. Normal operations should use `scripts/acestep.sh`.

## Native Mode Endpoints

All responses are wrapped: `{"data": <payload>, "code": 200, "error": null, "timestamp": ...}`

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/release_task` | POST | Create generation task |
| `/query_result` | POST | Query task status, body: `{"task_id_list": ["id"]}` |
| `/v1/models` | GET | List available models |
| `/v1/audio?path={path}` | GET | Download audio file |

## Completion Mode Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Generate music (OpenRouter-compatible) |
| `/v1/models` | GET | List available models (OpenRouter format) |

## Query Result Response

```json
{
  "data": [{
    "task_id": "xxx",
    "status": 1,
    "result": "[{\"file\":\"/v1/audio?path=...\",\"metas\":{\"bpm\":120,\"duration\":60,\"keyscale\":\"C Major\"}}]"
  }]
}
```

Status codes: `0` = processing, `1` = success, `2` = failed
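Note that `result` is itself a JSON *string*, so it needs a second decode. A minimal jq sketch against a hypothetical response body:

```shell
#!/bin/sh
# Hypothetical /query_result response; .result is a JSON string,
# so pipe it through jq's fromjson for the second parse.
response='{"data":[{"task_id":"xxx","status":1,"result":"[{\"file\":\"/v1/audio?path=out.mp3\",\"metas\":{\"bpm\":120,\"duration\":60,\"keyscale\":\"C Major\"}}]"}]}'

status=$(echo "$response" | jq -r '.data[0].status')
file=$(echo "$response"   | jq -r '.data[0].result | fromjson | .[0].file')
bpm=$(echo "$response"    | jq -r '.data[0].result | fromjson | .[0].metas.bpm')
echo "status=$status file=$file bpm=$bpm"
```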

## Completion Mode Request (`/v1/chat/completions`)

**Caption mode** — prompt and lyrics wrapped in XML tags inside the message content:
```json
{
  "model": "acestep/ACE-Step-v1.5",
  "messages": [{"role": "user", "content": "<prompt>Jazz with saxophone</prompt><lyrics>[Verse] Hello...</lyrics>"}],
  "stream": false,
  "thinking": true,
  "use_format": false,
  "audio_config": {"duration": 90, "bpm": 110, "format": "mp3", "vocal_language": "en"}
}
```

**Simple mode** — plain-text message, set `sample_mode: true`:
```json
{
  "model": "acestep/ACE-Step-v1.5",
  "messages": [{"role": "user", "content": "A cheerful pop song about spring"}],
  "stream": false,
  "sample_mode": true,
  "thinking": true
}
```

## Completion Mode Response

```json
{
  "id": "chatcmpl-abc123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "## Metadata\n**Caption:** ...\n**BPM:** 128\n\n## Lyrics\n...",
      "audio": [{"type": "audio_url", "audio_url": {"url": "data:audio/mpeg;base64,..."}}]
    },
    "finish_reason": "stop"
  }]
}
```

Audio is base64-encoded inline — the script auto-decodes and saves to `acestep_output/`.
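A sketch of that decode step, should you need it outside the script. The response here is hypothetical, with `SGVsbG8=` standing in for real audio bytes:

```shell
#!/bin/sh
# Extract the data: URL from the response, strip its media-type prefix,
# then base64-decode the payload to a file.
response='{"choices":[{"message":{"audio":[{"type":"audio_url","audio_url":{"url":"data:audio/mpeg;base64,SGVsbG8="}}]}}]}'

url=$(echo "$response" | jq -r '.choices[0].message.audio[0].audio_url.url')
printf '%s' "${url#data:audio/mpeg;base64,}" | base64 -d > out.bin
```

(`base64 -d` is the GNU coreutils flag; BSD/macOS also accepts `--decode`.)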

## Request Parameters (`/release_task`)

Parameters can be placed in the `param_obj` object.

### Generation Modes

| Mode | Usage | When to Use |
|------|-------|-------------|
| **Caption** (Recommended) | `generate -c "style" -l "lyrics"` | For vocal songs — write lyrics yourself first |
| **Simple** | `generate -d "description"` | Quick exploration, LM generates everything |
| **Random** | `random` | Random generation for inspiration |

### Core Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prompt` | string | "" | Music style description (Caption mode) |
| `lyrics` | string | "" | **Full lyrics content** — pass ALL lyrics without omission. Use `[inst]` for instrumental. Partial/truncated lyrics = incomplete songs |
| `sample_mode` | bool | false | Enable Simple/Random mode |
| `sample_query` | string | "" | Description for Simple mode |
| `thinking` | bool | false | Enable 5Hz LM for audio code generation |
| `use_format` | bool | false | Use LM to enhance caption/lyrics |
| `model` | string | - | DiT model name |
| `batch_size` | int | 1 | Number of audio files to generate |

### Music Attributes

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_duration` | float | - | Duration in seconds |
| `bpm` | int | - | Tempo (beats per minute) |
| `key_scale` | string | "" | Key (e.g. "C Major") |
| `time_signature` | string | "" | Time signature (e.g. "4/4") |
| `vocal_language` | string | "en" | Language code (en, zh, ja, etc.) |
| `audio_format` | string | "mp3" | Output format (mp3/wav/flac) |

### Generation Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `inference_steps` | int | 8 | Diffusion steps |
| `guidance_scale` | float | 7.0 | CFG scale |
| `seed` | int | -1 | Random seed (-1 for random) |
| `infer_method` | string | "ode" | Diffusion method (ode/sde) |

### Audio Task Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | string | "text2music" | text2music / continuation / repainting |
| `src_audio_path` | string | - | Source audio for continuation |
| `repainting_start` | float | 0.0 | Repainting start position (seconds) |
| `repainting_end` | float | - | Repainting end position (seconds) |

### Example Request (Simple Mode)

```json
{
  "sample_mode": true,
  "sample_query": "A cheerful pop song about spring",
  "thinking": true,
  "param_obj": {
    "duration": 60,
    "bpm": 120,
    "language": "en"
  },
  "batch_size": 2
}
```
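For debugging by hand, the same request can be built with jq and (commented out here) sent with curl; the URL and example values are placeholders, and normal use should still go through `scripts/acestep.sh`:

```shell
#!/bin/sh
# Build the Simple-mode /release_task payload programmatically,
# then (commented) POST it to a local server.
payload=$(jq -n \
  --arg query "A cheerful pop song about spring" \
  --argjson thinking true \
  '{sample_mode: true, sample_query: $query, thinking: $thinking,
    param_obj: {duration: 60, bpm: 120, language: "en"}, batch_size: 2}')
echo "$payload"
# curl -s -X POST "http://127.0.0.1:8001/release_task" \
#   -H "Content-Type: application/json" -d "$payload"
```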
.claude/skills/acestep/scripts/acestep.sh ADDED
@@ -0,0 +1,1093 @@
1
+ #!/bin/bash
2
+ #
3
+ # ACE-Step Music Generation CLI (Bash + Curl + jq)
4
+ #
5
+ # Requirements: curl, jq
6
+ #
7
+ # Usage:
8
+ # ./acestep.sh generate "Music description" [options]
9
+ # ./acestep.sh random [--no-thinking]
10
+ # ./acestep.sh status <job_id>
11
+ # ./acestep.sh models
12
+ # ./acestep.sh health
13
+ # ./acestep.sh config [--get|--set|--reset]
14
+ #
15
+ # Output:
16
+ # - Results saved to output/<job_id>.json
17
+ # - Audio files downloaded to output/<job_id>_1.mp3, output/<job_id>_2.mp3, ...
18
+
19
+ set -e
20
+
21
+ # Ensure UTF-8 encoding for non-ASCII characters (Japanese, Chinese, etc.)
22
+ export LANG="${LANG:-en_US.UTF-8}"
23
+ export LC_ALL="${LC_ALL:-en_US.UTF-8}"
24
+
25
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
26
+ CONFIG_FILE="${SCRIPT_DIR}/config.json"
27
+ # Output dir at same level as .claude (go up 4 levels from scripts/)
28
+ OUTPUT_DIR="$(cd "${SCRIPT_DIR}/../../../.." && pwd)/acestep_output"
29
+ DEFAULT_API_URL="http://127.0.0.1:8001"
30
+
31
+ # Colors
32
+ RED='\033[0;31m'
33
+ GREEN='\033[0;32m'
34
+ YELLOW='\033[1;33m'
35
+ CYAN='\033[0;36m'
36
+ NC='\033[0m'
37
+
38
+ # Check dependencies
39
+ check_deps() {
40
+ if ! command -v curl &> /dev/null; then
41
+ echo -e "${RED}Error: curl is required but not installed.${NC}"
42
+ exit 1
43
+ fi
44
+ if ! command -v jq &> /dev/null; then
45
+ echo -e "${RED}Error: jq is required but not installed.${NC}"
46
+ echo "Install: apt install jq / brew install jq / choco install jq"
47
+ exit 1
48
+ fi
49
+ }
50
+
51
+ # JSON value extractor using jq
52
+ # Usage: json_get "$json" ".key" or json_get "$json" ".nested.key"
53
+ json_get() {
54
+ local json="$1"
55
+ local path="$2"
56
+ echo "$json" | jq -r "$path // empty" 2>/dev/null
57
+ }
58
+
59
+ # Extract array values using jq
60
+ json_get_array() {
61
+ local json="$1"
62
+ local path="$2"
63
+ echo "$json" | jq -r "$path[]? // empty" 2>/dev/null
64
+ }
65
+
66
+ # Ensure output directory exists
67
+ ensure_output_dir() {
68
+ mkdir -p "$OUTPUT_DIR"
69
+ }
70
+
71
+ # Default config
72
+ DEFAULT_CONFIG='{
73
+ "api_url": "http://127.0.0.1:8001",
74
+ "api_key": "",
75
+ "api_mode": "native",
76
+ "generation": {
77
+ "thinking": true,
78
+ "use_format": true,
79
+ "use_cot_caption": true,
80
+ "use_cot_language": true,
81
+ "audio_format": "mp3",
82
+ "vocal_language": "en"
83
+ }
84
+ }'
85
+
86
+ # Ensure config file exists
87
+ ensure_config() {
88
+ if [ ! -f "$CONFIG_FILE" ]; then
89
+ local example="${SCRIPT_DIR}/config.example.json"
90
+ if [ -f "$example" ]; then
91
+ cp "$example" "$CONFIG_FILE"
92
+ echo -e "${YELLOW}Config file created from config.example.json. Please configure your settings:${NC}"
93
+ echo -e " ${CYAN}./scripts/acestep.sh config --set api_url <url>${NC}"
94
+ echo -e " ${CYAN}./scripts/acestep.sh config --set api_key <key>${NC}"
95
+ else
96
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
97
+ fi
98
+ fi
99
+ }
100
+
101
+ # Get config value using jq
102
+ get_config() {
103
+ local key="$1"
104
+ ensure_config
105
+ # Convert dot notation to jq path: "generation.thinking" -> ".generation.thinking"
106
+ local jq_path=".${key}"
107
+ local value
108
+ # Don't use // operator as it treats boolean false as falsy
109
+ value=$(jq -r "$jq_path" "$CONFIG_FILE" 2>/dev/null)
110
+ # Remove any trailing whitespace/newlines (Windows compatibility)
111
+ # Return empty string if value is "null" (key doesn't exist)
112
+ if [ "$value" = "null" ]; then
113
+ echo ""
114
+ else
115
+ echo "$value" | tr -d '\r\n'
116
+ fi
117
+ }
118
+
119
+ # Normalize boolean value for jq --argjson
120
+ normalize_bool() {
121
+ local val="$1"
122
+ local default="${2:-false}"
123
+ case "$val" in
124
+ true|True|TRUE|1) echo "true" ;;
125
+ false|False|FALSE|0) echo "false" ;;
126
+ *) echo "$default" ;;
127
+ esac
128
+ }
129
+
130
+ # Set config value using jq
131
+ set_config() {
132
+ local key="$1"
133
+ local value="$2"
134
+ ensure_config
135
+
136
+ local tmp_file="${CONFIG_FILE}.tmp"
137
+ local jq_path=".${key}"
138
+
139
+ # Determine value type and set accordingly
140
+ if [ "$value" = "true" ] || [ "$value" = "false" ]; then
141
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
142
+ elif [[ "$value" =~ ^-?[0-9]+$ ]] || [[ "$value" =~ ^-?[0-9]+\.[0-9]+$ ]]; then
143
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
144
+ else
145
+ jq "$jq_path = \"$value\"" "$CONFIG_FILE" > "$tmp_file"
146
+ fi
147
+
148
+ mv "$tmp_file" "$CONFIG_FILE"
149
+ echo "Set $key = $value"
150
+ }
151
+
152
+ # Load API URL
153
+ load_api_url() {
154
+ local url=$(get_config "api_url")
155
+ echo "${url:-$DEFAULT_API_URL}"
156
+ }
157
+
158
+ # Load API Key
159
+ load_api_key() {
160
+ local key=$(get_config "api_key")
161
+ echo "${key:-}"
162
+ }
163
+
164
+ # Check API health
165
+ check_health() {
166
+ local url="$1"
167
+ local status
168
+ status=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 "${url}/health" 2>/dev/null) || true
169
+ [ "$status" = "200" ]
170
+ }
171
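The exit-status contract matters more than any output here: callers use `check_health` directly as an `if` condition. With `-o /dev/null`, curl's `-w "%{http_code}"` prints only the status code, and the function succeeds exactly when that code is `200`. A stubbed sketch (the curl call replaced by a fixed code; `check_health_demo` is a hypothetical name):

```shell
check_health_demo() {
  local status="$1"   # stand-in for: curl -s -o /dev/null -w "%{http_code}" "$url/health"
  [ "$status" = "200" ]   # function's exit status IS the test's exit status
}

check_health_demo 200 && echo "healthy"
check_health_demo 503 || echo "unhealthy"
```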
+
172
+ # Build auth header
173
+ build_auth_header() {
174
+ local api_key=$(load_api_key)
175
+ if [ -n "$api_key" ]; then
176
+ echo "-H \"Authorization: Bearer ${api_key}\""
177
+ fi
178
+ }
179
+
180
+ # Prompt for URL
181
+ prompt_for_url() {
182
+ echo ""
183
+ echo -e "${YELLOW}API server is not responding.${NC}"
184
+ echo "Please enter the API URL (or press Enter for default):"
185
+ read -p "API URL [$DEFAULT_API_URL]: " user_input
186
+ echo "${user_input:-$DEFAULT_API_URL}"
187
+ }
188
+
189
+ # Ensure API connection
190
+ ensure_connection() {
191
+ ensure_config
192
+ local api_url=$(load_api_url)
193
+
194
+ if check_health "$api_url"; then
195
+ echo "$api_url"
196
+ return 0
197
+ fi
198
+
199
+ echo -e "${YELLOW}Cannot connect to: $api_url${NC}" >&2
200
+ local new_url=$(prompt_for_url)
201
+
202
+ if check_health "$new_url"; then
203
+ set_config "api_url" "$new_url" > /dev/null
204
+ echo -e "${GREEN}Saved API URL: $new_url${NC}" >&2
205
+ echo "$new_url"
206
+ return 0
207
+ fi
208
+
209
+ echo -e "${RED}Error: Cannot connect to $new_url${NC}" >&2
210
+ exit 1
211
+ }
212
+
213
+ # Save result to JSON file
214
+ save_result() {
215
+ local job_id="$1"
216
+ local result_json="$2"
217
+
218
+ ensure_output_dir
219
+ local output_file="${OUTPUT_DIR}/${job_id}.json"
220
+ echo "$result_json" > "$output_file"
221
+ echo -e "${GREEN}Result saved: $output_file${NC}"
222
+ }
223
+
224
+ # Health command
225
+ cmd_health() {
226
+ check_deps
227
+ ensure_config
228
+ local api_url=$(load_api_url)
229
+
230
+ echo "Checking API at: $api_url"
231
+ if check_health "$api_url"; then
232
+ echo -e "${GREEN}Status: OK${NC}"
233
+ curl -s "${api_url}/health"
234
+ echo ""
235
+ else
236
+ echo -e "${RED}Status: FAILED${NC}"
237
+ exit 1
238
+ fi
239
+ }
240
+
241
+ # Config command
242
+ cmd_config() {
243
+ check_deps
244
+ ensure_config
245
+
246
+ local action=""
247
+ local key=""
248
+ local value=""
249
+
250
+ while [[ $# -gt 0 ]]; do
251
+ case $1 in
252
+ --get) action="get"; key="$2"; shift 2 ;;
253
+ --set) action="set"; key="${2:-}"; value="${3:-}"; shift 3 2>/dev/null || break ;;
254
+ --reset) action="reset"; shift ;;
255
+ --list) action="list"; shift ;;
256
+ --check-key) action="check-key"; shift ;;
257
+ *) shift ;;
258
+ esac
259
+ done
260
+
261
+ case "$action" in
262
+ "check-key")
263
+ local api_key=$(get_config "api_key")
264
+ if [ -n "$api_key" ]; then
265
+ echo "api_key: configured"
266
+ else
267
+ echo "api_key: empty"
268
+ fi
269
+ ;;
270
+ "get")
271
+ [ -z "$key" ] && { echo -e "${RED}Error: --get requires KEY${NC}"; exit 1; }
272
+ local result=$(get_config "$key")
273
+ [ -n "$result" ] && echo "$key = $result" || echo "Key not found: $key"
274
+ ;;
275
+ "set")
276
+ [ -z "$key" ] || [ -z "$value" ] && { echo -e "${RED}Error: --set requires KEY VALUE${NC}"; exit 1; }
277
+ set_config "$key" "$value"
278
+ ;;
279
+ "reset")
280
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
281
+ echo -e "${GREEN}Configuration reset to defaults.${NC}"
282
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
283
+ ;;
284
+ "list")
285
+ echo "Current configuration:"
286
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
287
+ ;;
288
+ *)
289
+ echo "Config file: $CONFIG_FILE"
290
+ echo "Output dir: $OUTPUT_DIR"
291
+ echo "----------------------------------------"
292
+ cat "$CONFIG_FILE"
293
+ echo "----------------------------------------"
294
+ echo ""
295
+ echo "Usage:"
296
+ echo " config --list Show config"
297
+ echo " config --get <key> Get value"
298
+ echo " config --set <key> <val> Set value"
299
+ echo " config --reset Reset to defaults"
300
+ ;;
301
+ esac
302
+ }
303
+
304
+ # Models command
305
+ cmd_models() {
306
+ check_deps
307
+ local api_url=$(ensure_connection)
308
+ local api_key=$(load_api_key)
309
+
310
+ echo "Available Models:"
311
+ echo "----------------------------------------"
312
+ if [ -n "$api_key" ]; then
313
+ curl -s -H "Authorization: Bearer ${api_key}" "${api_url}/v1/models"
314
+ else
315
+ curl -s "${api_url}/v1/models"
316
+ fi
317
+ echo ""
318
+ }
319
+
320
+ # Query job result via /query_result endpoint
321
+ query_job_result() {
322
+ local api_url="$1"
323
+ local job_id="$2"
324
+ local api_key=$(load_api_key)
325
+
326
+ local payload=$(jq -n --arg id "$job_id" '{"task_id_list": [$id]}')
327
+
328
+ if [ -n "$api_key" ]; then
329
+ curl -s -X POST "${api_url}/query_result" \
330
+ -H "Content-Type: application/json; charset=utf-8" \
331
+ -H "Authorization: Bearer ${api_key}" \
332
+ -d "$payload"
333
+ else
334
+ curl -s -X POST "${api_url}/query_result" \
335
+ -H "Content-Type: application/json; charset=utf-8" \
336
+ -d "$payload"
337
+ fi
338
+ }
339
+
340
+ # Parse query_result response to extract status (0=processing, 1=success, 2=failed)
341
+ # Response is wrapped: {"data": [...], "code": 200, ...}
342
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
343
+ parse_query_status() {
344
+ local response="$1"
345
+ local tmp_file=$(mktemp)
346
+ printf '%s' "$response" > "$tmp_file"
347
+ jq -r '.data[0].status // .[0].status // 0' "$tmp_file"
348
+ rm -f "$tmp_file"
349
+ }
350
+
351
+ # Parse result JSON string from query_result response
352
+ # The result field is a JSON string that needs to be parsed
353
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
354
+ parse_query_result() {
355
+ local response="$1"
356
+ local tmp_file=$(mktemp)
357
+ printf '%s' "$response" > "$tmp_file"
358
+ jq -r '.data[0].result // .[0].result // "[]"' "$tmp_file"
359
+ rm -f "$tmp_file"
360
+ }
361
+
362
+ # Extract audio file paths from result (returns newline-separated paths)
363
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
364
+ parse_audio_files() {
365
+ local result="$1"
366
+ local tmp_file=$(mktemp)
367
+ printf '%s' "$result" > "$tmp_file"
368
+ jq -r '.[].file // empty' "$tmp_file" 2>/dev/null
369
+ rm -f "$tmp_file"
370
+ }
371
+
372
+ # Extract metas value from result
373
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
374
+ parse_metas_value() {
375
+ local result="$1"
376
+ local key="$2"
377
+ local tmp_file=$(mktemp)
378
+ printf '%s' "$result" > "$tmp_file"
379
+ jq -r ".[0].metas.$key // .[0].$key // empty" "$tmp_file" 2>/dev/null
380
+ rm -f "$tmp_file"
381
+ }
382
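The three parsers above all walk the same wrapped shape, and the key subtlety is the double decode: `result` is JSON serialized inside a JSON string, so it must be run through jq twice. A hypothetical `/query_result` response makes this explicit:

```shell
# Hypothetical wrapped response: {"data":[{"status":...,"result":"<json string>"}],"code":200}
resp='{"data":[{"status":1,"result":"[{\"file\":\"/v1/audio/demo.mp3\",\"metas\":{\"bpm\":120}}]"}],"code":200}'

# First level: task status (0=processing, 1=success, 2=failed)
printf '%s' "$resp" | jq -r '.data[0].status // .[0].status // 0'   # 1

# Second level: decode the result string, then extract fields
printf '%s' "$resp" | jq -r '.data[0].result' | jq -r '.[].file // empty'       # /v1/audio/demo.mp3
printf '%s' "$resp" | jq -r '.data[0].result' | jq -r '.[0].metas.bpm // empty' # 120
```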
+
383
+ # Status command
384
+ cmd_status() {
385
+ check_deps
386
+ local job_id="$1"
387
+
388
+ [ -z "$job_id" ] && { echo -e "${RED}Error: job_id required${NC}"; echo "Usage: $0 status <job_id>"; exit 1; }
389
+
390
+ local api_url=$(ensure_connection)
391
+ local response=$(query_job_result "$api_url" "$job_id")
392
+
393
+ local status=$(parse_query_status "$response")
394
+ echo "Job ID: $job_id"
395
+
396
+ case "$status" in
397
+ 0)
398
+ echo "Status: processing"
399
+ ;;
400
+ 1)
401
+ echo "Status: succeeded"
402
+ echo ""
403
+ local result_file=$(mktemp)
404
+ parse_query_result "$response" > "$result_file"
405
+
406
+ local bpm=$(jq -r '.[0].metas.bpm // .[0].bpm // empty' "$result_file" 2>/dev/null)
407
+ local keyscale=$(jq -r '.[0].metas.keyscale // .[0].keyscale // empty' "$result_file" 2>/dev/null)
408
+ local duration=$(jq -r '.[0].metas.duration // .[0].duration // empty' "$result_file" 2>/dev/null)
409
+
410
+ echo "Result:"
411
+ [ -n "$bpm" ] && echo " BPM: $bpm"
412
+ [ -n "$keyscale" ] && echo " Key: $keyscale"
413
+ [ -n "$duration" ] && echo " Duration: ${duration}s"
414
+
415
+ # Save and download
416
+ save_result "$job_id" "$response"
417
+ download_audios "$api_url" "$job_id" "$result_file"
418
+ rm -f "$result_file"
419
+ ;;
420
+ 2)
421
+ echo "Status: failed"
422
+ echo ""
423
+ echo -e "${RED}Task failed${NC}"
424
+ ;;
425
+ *)
426
+ echo "Status: unknown ($status)"
427
+ ;;
428
+ esac
429
+ }
430
+
431
+ # Download audio files from result file
432
+ # Usage: download_audios <api_url> <job_id> <result_file>
433
+ download_audios() {
434
+ local api_url="$1"
435
+ local job_id="$2"
436
+ local result_file="$3"
437
+ local api_key=$(load_api_key)
438
+
439
+ ensure_output_dir
440
+
441
+ local audio_format=$(get_config "generation.audio_format")
442
+ [ -z "$audio_format" ] && audio_format="mp3"
443
+
444
+ # Read result file content and extract audio paths using pipe (avoid temp file path issues on Windows)
445
+ local result_content
446
+ result_content=$(cat "$result_file" 2>/dev/null)
447
+
448
+ if [ -z "$result_content" ]; then
449
+ echo -e " ${RED}Error: Result file is empty or cannot be read${NC}"
450
+ return 1
451
+ fi
452
+
453
+ # Extract audio paths using pipe instead of file (better Windows compatibility)
454
+ local audio_paths
455
+ audio_paths=$(echo "$result_content" | jq -r '.[].file // empty' 2>&1)
456
+ local jq_exit_code=$?
457
+
458
+ if [ $jq_exit_code -ne 0 ]; then
459
+ echo -e " ${RED}Error: Failed to parse result JSON${NC}"
460
+ echo -e " ${RED}jq error: $audio_paths${NC}"
461
+ return 1
462
+ fi
463
+
464
+ if [ -z "$audio_paths" ]; then
465
+ echo -e " ${YELLOW}No audio files found in result${NC}"
466
+ return 0
467
+ fi
468
+
469
+ local count=1
470
+ while IFS= read -r audio_path; do
471
+ # Skip empty lines and remove potential Windows carriage return
472
+ audio_path=$(echo "$audio_path" | tr -d '\r')
473
+ if [ -n "$audio_path" ]; then
474
+ local output_file="${OUTPUT_DIR}/${job_id}_${count}.${audio_format}"
475
+ local download_url="${api_url}${audio_path}"
476
+
477
+ echo -e " ${CYAN}Downloading audio $count...${NC}"
478
+ local curl_output
479
+ local curl_exit_code
480
+ if [ -n "$api_key" ]; then
481
+ curl_output=$(curl -s --connect-timeout 10 --max-time 300 \
482
+ -w "%{http_code}" \
483
+ -o "$output_file" \
484
+ -H "Authorization: Bearer ${api_key}" \
485
+ "$download_url" 2>&1)
486
+ curl_exit_code=$?
487
+ else
488
+ curl_output=$(curl -s --connect-timeout 10 --max-time 300 \
489
+ -w "%{http_code}" \
490
+ -o "$output_file" \
491
+ "$download_url" 2>&1)
492
+ curl_exit_code=$?
493
+ fi
494
+
495
+ if [ $curl_exit_code -ne 0 ]; then
496
+ echo -e " ${RED}Failed to download (curl error $curl_exit_code): $download_url${NC}"
497
+ rm -f "$output_file" 2>/dev/null
498
+ elif [ -f "$output_file" ] && [ -s "$output_file" ]; then
499
+ echo -e " ${GREEN}Saved: $output_file${NC}"
500
+ else
501
+ echo -e " ${RED}Failed to download (HTTP $curl_output): $download_url${NC}"
502
+ rm -f "$output_file" 2>/dev/null
503
+ fi
504
+ count=$((count + 1))
505
+ fi
506
+ done <<< "$audio_paths"
507
+ }
508
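The per-line handling inside the download loop is what keeps it working under Windows shells: jq output read via `while IFS= read -r` may carry a trailing `\r` on MSYS, so each path is scrubbed with `tr -d '\r'` before use. A self-contained sketch with fake paths and the curl call replaced by an echo:

```shell
# Simulated jq -r '.[].file' output, with a stray Windows carriage return.
audio_paths=$(printf '/v1/audio/a.mp3\r\n/v1/audio/b.mp3\n')
count=1
while IFS= read -r audio_path; do
  audio_path=$(echo "$audio_path" | tr -d '\r')   # strip CR before building URLs
  if [ -n "$audio_path" ]; then
    echo "would fetch ${audio_path} -> job_${count}.mp3"
    count=$((count + 1))
  fi
done <<< "$audio_paths"
```

The `<<< "$audio_paths"` here-string (rather than a pipe) keeps the loop in the current shell, so `count` survives past the loop.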
+
509
+ # =============================================================================
510
+ # Completion Mode (OpenRouter /v1/chat/completions)
511
+ # =============================================================================
512
+
513
+ # Load api_mode from config (default: native)
514
+ load_api_mode() {
515
+ local mode=$(get_config "api_mode")
516
+ echo "${mode:-native}"
517
+ }
518
+
519
+ # Get model ID from /v1/models endpoint for completion mode
520
+ get_completion_model() {
521
+ local api_url="$1"
522
+ local user_model="$2"
523
+ local api_key=$(load_api_key)
524
+
525
+ # If user specified a model, prefix with acemusic/ if needed
526
+ if [ -n "$user_model" ]; then
527
+ if [[ "$user_model" == */* ]]; then
528
+ echo "$user_model"
529
+ else
530
+ echo "acemusic/${user_model}"
531
+ fi
532
+ return
533
+ fi
534
+
535
+ # Query /v1/models for the first available model
536
+ local response
537
+ if [ -n "$api_key" ]; then
538
+ response=$(curl -s -H "Authorization: Bearer ${api_key}" "${api_url}/v1/models" 2>/dev/null)
539
+ else
540
+ response=$(curl -s "${api_url}/v1/models" 2>/dev/null)
541
+ fi
542
+
543
+ local model_id
544
+ model_id=$(echo "$response" | jq -r '.data[0].id // empty' 2>/dev/null)
545
+ echo "${model_id:-acemusic/acestep-v15-turbo}"
546
+ }
547
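The prefixing branch above is worth isolating: bare model names get the `acemusic/` vendor prefix, while anything already containing a slash passes through untouched. A hypothetical standalone copy (`qualify_model` is an illustrative name):

```shell
qualify_model() {
  local m="$1"
  if [[ "$m" == */* ]]; then
    echo "$m"                 # fully qualified id: keep as-is
  else
    echo "acemusic/${m}"      # bare name: add the vendor prefix
  fi
}

qualify_model acestep-v15-turbo   # -> acemusic/acestep-v15-turbo
qualify_model vendor/custom-x     # -> vendor/custom-x
```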
+
548
+ # Decode base64 audio data URL and save to file
549
+ # Handles cross-platform compatibility (Linux/macOS/Windows MSYS)
550
+ decode_base64_audio() {
551
+ local data_url="$1"
552
+ local output_file="$2"
553
+
554
+ # Strip data URL prefix: data:audio/mpeg;base64,...
555
+ local b64_data="${data_url#data:*;base64,}"
556
+
557
+ local tmp_b64=$(mktemp)
558
+ printf '%s' "$b64_data" > "$tmp_b64"
559
+
560
+ if command -v base64 &> /dev/null; then
561
+ # Linux / macOS / MSYS2
562
+ base64 -d < "$tmp_b64" > "$output_file" 2>/dev/null || \
563
+ base64 -D < "$tmp_b64" > "$output_file" 2>/dev/null || \
564
+ python3 -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null || \
565
+ python -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null
566
+ else
567
+ # Fallback to python
568
+ python3 -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null || \
569
+ python -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null
570
+ fi
571
+
572
+ local decode_ok=$?
573
+ rm -f "$tmp_b64"
574
+ return $decode_ok
575
+ }
576
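The decode path can be exercised end to end with a tiny payload. The `${data_url#data:*;base64,}` expansion removes the shortest matching prefix, leaving just the base64 body, which then goes through `base64 -d` (the GNU form; the function above also tries BSD `-D` and a Python fallback):

```shell
# Round-trip a small payload through the same prefix-strip + decode steps.
data_url="data:audio/mpeg;base64,$(printf 'RIFF-demo-bytes' | base64)"
b64_data="${data_url#data:*;base64,}"   # shortest-prefix strip, as in the function above
printf '%s' "$b64_data" | base64 -d     # -> RIFF-demo-bytes
echo
```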
+
577
+ # Parse completion response: extract metadata, save audio files
578
+ # Usage: parse_completion_response <response_file> <job_id>
579
+ parse_completion_response() {
580
+ local resp_file="$1"
581
+ local job_id="$2"
582
+
583
+ ensure_output_dir
584
+
585
+ local audio_format=$(get_config "generation.audio_format")
586
+ [ -z "$audio_format" ] && audio_format="mp3"
587
+
588
+ # Check for error
589
+ local finish_reason
590
+ finish_reason=$(jq -r '.choices[0].finish_reason // "stop"' "$resp_file" 2>/dev/null)
591
+ if [ "$finish_reason" = "error" ]; then
592
+ local err_content
593
+ err_content=$(jq -r '.choices[0].message.content // "Unknown error"' "$resp_file" 2>/dev/null)
594
+ echo -e "${RED}Generation failed: $err_content${NC}"
595
+ return 1
596
+ fi
597
+
598
+ # Extract and display text content (metadata + lyrics)
599
+ local content
600
+ content=$(jq -r '.choices[0].message.content // empty' "$resp_file" 2>/dev/null)
601
+ if [ -n "$content" ]; then
602
+ echo "$content"
603
+ echo ""
604
+ fi
605
+
606
+ # Extract and save audio files
607
+ local audio_count
608
+ audio_count=$(jq -r '.choices[0].message.audio | length // 0' "$resp_file" 2>/dev/null)
609
+
610
+ if [ "$audio_count" -gt 0 ] 2>/dev/null; then
611
+ local i=0
612
+ while [ "$i" -lt "$audio_count" ]; do
613
+ local audio_url
614
+ audio_url=$(jq -r ".choices[0].message.audio[$i].audio_url.url // empty" "$resp_file" 2>/dev/null)
615
+
616
+ if [ -n "$audio_url" ]; then
617
+ local output_file="${OUTPUT_DIR}/${job_id}_$((i+1)).${audio_format}"
618
+ echo -e " ${CYAN}Decoding audio $((i+1))...${NC}"
619
+
620
+ if decode_base64_audio "$audio_url" "$output_file"; then
621
+ if [ -f "$output_file" ] && [ -s "$output_file" ]; then
622
+ echo -e " ${GREEN}Saved: $output_file${NC}"
623
+ else
624
+ echo -e " ${RED}Failed to decode audio $((i+1))${NC}"
625
+ rm -f "$output_file" 2>/dev/null
626
+ fi
627
+ else
628
+ echo -e " ${RED}Failed to decode audio $((i+1))${NC}"
629
+ rm -f "$output_file" 2>/dev/null
630
+ fi
631
+ fi
632
+ i=$((i+1))
633
+ done
634
+ else
635
+ echo -e " ${YELLOW}No audio files in response${NC}"
636
+ fi
637
+
638
+ # Save full response JSON (strip base64 audio to keep file small)
639
+ local clean_resp
640
+ clean_resp=$(jq 'del(.choices[].message.audio[].audio_url.url)' "$resp_file" 2>/dev/null)
641
+ if [ -n "$clean_resp" ]; then
642
+ save_result "$job_id" "$clean_resp"
643
+ else
644
+ save_result "$job_id" "$(cat "$resp_file")"
645
+ fi
646
+ }
647
+
648
+ # Send request to /v1/chat/completions and handle response
649
+ # Usage: send_completion_request <api_url> <payload_file> <job_id_var>
650
+ send_completion_request() {
651
+ local api_url="$1"
652
+ local payload_file="$2"
653
+ local api_key=$(load_api_key)
654
+
655
+ local resp_file=$(mktemp)
656
+
657
+ local http_code
658
+ if [ -n "$api_key" ]; then
659
+ http_code=$(curl -s -w "%{http_code}" --connect-timeout 10 --max-time 660 \
660
+ -o "$resp_file" \
661
+ -X POST "${api_url}/v1/chat/completions" \
662
+ -H "Content-Type: application/json; charset=utf-8" \
663
+ -H "Authorization: Bearer ${api_key}" \
664
+ --data-binary "@${payload_file}")
665
+ else
666
+ http_code=$(curl -s -w "%{http_code}" --connect-timeout 10 --max-time 660 \
667
+ -o "$resp_file" \
668
+ -X POST "${api_url}/v1/chat/completions" \
669
+ -H "Content-Type: application/json; charset=utf-8" \
670
+ --data-binary "@${payload_file}")
671
+ fi
672
+
673
+ rm -f "$payload_file"
674
+
675
+ if [ "$http_code" != "200" ]; then
676
+ local err_detail
677
+ err_detail=$(jq -r '.detail // .error.message // empty' "$resp_file" 2>/dev/null)
678
+ echo -e "${RED}Error: HTTP $http_code${NC}"
679
+ [ -n "$err_detail" ] && echo -e "${RED}$err_detail${NC}"
680
+ rm -f "$resp_file"
681
+ return 1
682
+ fi
683
+
684
+ # Generate a job_id from the completion id
685
+ local job_id
686
+ job_id=$(jq -r '.id // empty' "$resp_file" 2>/dev/null)
687
+ [ -z "$job_id" ] && job_id="completion-$(date +%s)"
688
+
689
+ echo ""
690
+ echo -e "${GREEN}Generation completed!${NC}"
691
+ echo ""
692
+
693
+ parse_completion_response "$resp_file" "$job_id"
694
+ rm -f "$resp_file"
695
+
696
+ echo ""
697
+ echo -e "${GREEN}Done! Files saved to: $OUTPUT_DIR${NC}"
698
+ }
699
+
700
+ # Wait for job and download results
701
+ wait_for_job() {
702
+ local api_url="$1"
703
+ local job_id="$2"
704
+
705
+ echo "Job created: $job_id"
706
+ echo "Output: $OUTPUT_DIR"
707
+ echo ""
708
+
709
+ while true; do
710
+ local response=$(query_job_result "$api_url" "$job_id")
711
+ local status=$(parse_query_status "$response")
712
+
713
+ case "$status" in
714
+ 1)
715
+ echo ""
716
+ echo -e "${GREEN}Generation completed!${NC}"
717
+ echo ""
718
+
719
+ local result_file=$(mktemp)
720
+ parse_query_result "$response" > "$result_file"
721
+
722
+ local bpm=$(jq -r '.[0].metas.bpm // .[0].bpm // empty' "$result_file" 2>/dev/null)
723
+ local keyscale=$(jq -r '.[0].metas.keyscale // .[0].keyscale // empty' "$result_file" 2>/dev/null)
724
+ local duration=$(jq -r '.[0].metas.duration // .[0].duration // empty' "$result_file" 2>/dev/null)
725
+
726
+ echo "Metadata:"
727
+ [ -n "$bpm" ] && echo " BPM: $bpm"
728
+ [ -n "$keyscale" ] && echo " Key: $keyscale"
729
+ [ -n "$duration" ] && echo " Duration: ${duration}s"
730
+ echo ""
731
+
732
+ # Save result JSON
733
+ save_result "$job_id" "$response"
734
+
735
+ # Download audio files
736
+ echo "Downloading audio files..."
737
+ download_audios "$api_url" "$job_id" "$result_file"
738
+ rm -f "$result_file"
739
+
740
+ echo ""
741
+ echo -e "${GREEN}Done! Files saved to: $OUTPUT_DIR${NC}"
742
+ return 0
743
+ ;;
744
+ 2)
745
+ echo ""
746
+ echo -e "${RED}Generation failed!${NC}"
747
+
748
+ # Save error result
749
+ save_result "$job_id" "$response"
750
+ return 1
751
+ ;;
752
+ 0)
753
+ printf "\rProcessing... "
754
+ ;;
755
+ *)
756
+ printf "\rWaiting... "
757
+ ;;
758
+ esac
759
+ sleep 5
760
+ done
761
+ }
762
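Stripped of the download and metadata handling, `wait_for_job` is a poll-until-terminal-status loop over the three status codes. A control-flow sketch with the API query stubbed (status stays 0 for two polls, then flips to 1):

```shell
attempts=0
while true; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 3 ]; then status=1; else status=0; fi  # stubbed query_job_result
  case "$status" in
    1) echo "done after $attempts polls"; break ;;
    2) echo "failed"; break ;;
    *) : ;;   # 0 = processing: the real loop prints progress and sleeps 5s here
  esac
done
```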
+
763
+ # Generate command
764
+ cmd_generate() {
765
+ check_deps
766
+ ensure_config
767
+
768
+ local caption="" lyrics="" description="" thinking="" use_format=""
769
+ local no_thinking=false no_format=false no_wait=false
770
+ local model="" language="" steps="" guidance="" seed="" duration="" bpm="" batch=""
771
+
772
+ while [[ $# -gt 0 ]]; do
773
+ case $1 in
774
+ --caption|-c) caption="$2"; shift 2 ;;
775
+ --lyrics|-l) lyrics="$2"; shift 2 ;;
776
+ --description|-d) description="$2"; shift 2 ;;
777
+ --thinking|-t) thinking="true"; shift ;;
778
+ --no-thinking) no_thinking=true; shift ;;
779
+ --use-format) use_format="true"; shift ;;
780
+ --no-format) no_format=true; shift ;;
781
+ --model|-m) model="$2"; shift 2 ;;
782
+ --language|--vocal-language) language="$2"; shift 2 ;;
783
+ --steps) steps="$2"; shift 2 ;;
784
+ --guidance) guidance="$2"; shift 2 ;;
785
+ --seed) seed="$2"; shift 2 ;;
786
+ --duration) duration="$2"; shift 2 ;;
787
+ --bpm) bpm="$2"; shift 2 ;;
788
+ --batch) batch="$2"; shift 2 ;;
789
+ --no-wait) no_wait=true; shift ;;
790
+ *) [ -z "$caption" ] && caption="$1"; shift ;;
791
+ esac
792
+ done
793
+
794
+ # If no caption but has description, use simple mode
795
+ if [ -z "$caption" ] && [ -z "$description" ]; then
796
+ echo -e "${RED}Error: caption or description required${NC}"
797
+ echo "Usage: $0 generate \"Music description\" [options]"
798
+ echo " $0 generate -d \"Simple description\" [options]"
799
+ exit 1
800
+ fi
801
+
802
+ local api_url=$(ensure_connection)
803
+
804
+ # Get defaults
805
+ local def_thinking=$(get_config "generation.thinking")
806
+ local def_format=$(get_config "generation.use_format")
807
+ local def_cot_caption=$(get_config "generation.use_cot_caption")
808
+ local def_cot_language=$(get_config "generation.use_cot_language")
809
+ local def_language=$(get_config "generation.vocal_language")
810
+ local def_audio_format=$(get_config "generation.audio_format")
811
+
812
+ [ -z "$thinking" ] && thinking="${def_thinking:-true}"
813
+ [ -z "$use_format" ] && use_format="${def_format:-true}"
814
+ [ -z "$language" ] && language="${def_language:-en}"
815
+
816
+ [ "$no_thinking" = true ] && thinking="false"
817
+ [ "$no_format" = true ] && use_format="false"
818
+
819
+ # Normalize boolean values for jq --argjson
820
+ thinking=$(normalize_bool "$thinking" "true")
821
+ use_format=$(normalize_bool "$use_format" "true")
822
+ local cot_caption=$(normalize_bool "$def_cot_caption" "true")
823
+ local cot_language=$(normalize_bool "$def_cot_language" "true")
824
+
825
+ # Build payload using jq for proper escaping
826
+ local payload=$(jq -n \
827
+ --arg prompt "$caption" \
828
+ --arg lyrics "${lyrics:-}" \
829
+ --arg sample_query "${description:-}" \
830
+ --argjson thinking "$thinking" \
831
+ --argjson use_format "$use_format" \
832
+ --argjson use_cot_caption "$cot_caption" \
833
+ --argjson use_cot_language "$cot_language" \
834
+ --arg vocal_language "$language" \
835
+ --arg audio_format "${def_audio_format:-mp3}" \
836
+ '{
837
+ prompt: $prompt,
838
+ lyrics: $lyrics,
839
+ sample_query: $sample_query,
840
+ thinking: $thinking,
841
+ use_format: $use_format,
842
+ use_cot_caption: $use_cot_caption,
843
+ use_cot_language: $use_cot_language,
844
+ vocal_language: $vocal_language,
845
+ audio_format: $audio_format,
846
+ use_random_seed: true
847
+ }')
848
+
849
+ # Add optional parameters
850
+ [ -n "$model" ] && payload=$(echo "$payload" | jq --arg v "$model" '. + {model: $v}')
851
+ [ -n "$steps" ] && payload=$(echo "$payload" | jq --argjson v "$steps" '. + {inference_steps: $v}')
852
+ [ -n "$guidance" ] && payload=$(echo "$payload" | jq --argjson v "$guidance" '. + {guidance_scale: $v}')
853
+ [ -n "$seed" ] && payload=$(echo "$payload" | jq --argjson v "$seed" '. + {seed: $v, use_random_seed: false}')
854
+ [ -n "$duration" ] && payload=$(echo "$payload" | jq --argjson v "$duration" '. + {audio_duration: $v}')
855
+ [ -n "$bpm" ] && payload=$(echo "$payload" | jq --argjson v "$bpm" '. + {bpm: $v}')
856
+ [ -n "$batch" ] && payload=$(echo "$payload" | jq --argjson v "$batch" '. + {batch_size: $v}')
857
+
858
+ local api_mode=$(load_api_mode)
859
+
860
+ echo "Generating music..."
861
+ if [ -n "$description" ]; then
862
+ echo " Mode: Simple (description)"
863
+ echo " Description: ${description:0:50}..."
864
+ else
865
+ echo " Mode: Caption"
866
+ echo " Caption: ${caption:0:50}..."
867
+ fi
868
+ echo " Thinking: $thinking, Format: $use_format"
869
+ echo " API: $api_mode"
870
+ echo " Output: $OUTPUT_DIR"
871
+ echo ""
872
+
873
+ if [ "$api_mode" = "completion" ]; then
874
+ # --- Completion mode: /v1/chat/completions ---
875
+ local model_id=$(get_completion_model "$api_url" "$model")
876
+
877
+ # Build message content
878
+ local message_content=""
879
+ local sample_mode=false
880
+ if [ -n "$description" ]; then
881
+ message_content="$description"
882
+ sample_mode=true
883
+ else
884
+ message_content="<prompt>${caption}</prompt>"
885
+ [ -n "$lyrics" ] && message_content="${message_content}<lyrics>${lyrics}</lyrics>"
886
+ fi
887
+
888
+ # Build completion payload
889
+ local payload_c=$(jq -n \
890
+ --arg model "$model_id" \
891
+ --arg content "$message_content" \
892
+ --argjson thinking "$thinking" \
893
+ --argjson use_format "$use_format" \
894
+ --argjson sample_mode "$sample_mode" \
895
+ --argjson use_cot_caption "$cot_caption" \
896
+ --argjson use_cot_language "$cot_language" \
897
+ --arg vocal_language "$language" \
898
+ --arg format "${def_audio_format:-mp3}" \
899
+ '{
900
+ model: $model,
901
+ messages: [{"role": "user", "content": $content}],
902
+ stream: false,
903
+ thinking: $thinking,
904
+ use_format: $use_format,
905
+ sample_mode: $sample_mode,
906
+ use_cot_caption: $use_cot_caption,
907
+ use_cot_language: $use_cot_language,
908
+ audio_config: {
909
+ format: $format,
910
+ vocal_language: $vocal_language
911
+ }
912
+ }')
913
+
914
+ # Add optional parameters to completion payload
915
+ [ -n "$guidance" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$guidance" '. + {guidance_scale: $v}')
916
+ [ -n "$seed" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$seed" '. + {seed: $v}')
917
+ [ -n "$batch" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$batch" '. + {batch_size: $v}')
918
+ [ -n "$duration" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$duration" '.audio_config.duration = $v')
919
+ [ -n "$bpm" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$bpm" '.audio_config.bpm = $v')
920
+
921
+ local temp_payload=$(mktemp)
922
+ printf '%s' "$payload_c" > "$temp_payload"
923
+
924
+ send_completion_request "$api_url" "$temp_payload"
925
+ else
926
+ # --- Native mode: /release_task + polling ---
927
+ local temp_payload=$(mktemp)
928
+ printf '%s' "$payload" > "$temp_payload"
929
+
930
+ local api_key=$(load_api_key)
931
+ local response
932
+ if [ -n "$api_key" ]; then
933
+ response=$(curl -s -X POST "${api_url}/release_task" \
934
+ -H "Content-Type: application/json; charset=utf-8" \
935
+ -H "Authorization: Bearer ${api_key}" \
936
+ --data-binary "@${temp_payload}")
937
+ else
938
+ response=$(curl -s -X POST "${api_url}/release_task" \
939
+ -H "Content-Type: application/json; charset=utf-8" \
940
+ --data-binary "@${temp_payload}")
941
+ fi
942
+
943
+ rm -f "$temp_payload"
944
+
945
+ local job_id=$(echo "$response" | jq -r '.data.task_id // .task_id // empty')
946
+ [ -z "$job_id" ] && { echo -e "${RED}Error: Failed to create job${NC}"; echo "$response"; exit 1; }
947
+
948
+ if [ "$no_wait" = true ]; then
949
+ echo "Job ID: $job_id"
950
+ echo "Use '$0 status $job_id' to check progress and download"
951
+ else
952
+ wait_for_job "$api_url" "$job_id"
953
+ fi
954
+ fi
955
+ }
956
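`cmd_generate` builds its JSON exclusively through jq rather than string interpolation, and this sketch shows why: `--arg` safely escapes arbitrary user text (quotes, ampersands, newlines), while `--argjson` keeps booleans and numbers typed instead of stringifying them. Optional fields are then layered on with `. + {...}`, exactly as the command does:

```shell
payload=$(jq -n \
  --arg prompt 'Pop with "quotes" & ampersands' \
  --argjson thinking true \
  '{prompt: $prompt, thinking: $thinking, use_random_seed: true}')

echo "$payload" | jq -r '.thinking'   # true -- a JSON boolean, not the string "true"

# Layer an optional field on afterwards, mirroring the --seed handling above.
payload=$(echo "$payload" | jq --argjson v 42 '. + {seed: $v, use_random_seed: false}')
echo "$payload" | jq -r '.seed'       # 42
```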
+
957
+ # Random command
958
+ cmd_random() {
959
+ check_deps
960
+ ensure_config
961
+
962
+ local thinking="" no_thinking=false no_wait=false
963
+
964
+ while [[ $# -gt 0 ]]; do
965
+ case $1 in
966
+ --thinking|-t) thinking="true"; shift ;;
967
+ --no-thinking) no_thinking=true; shift ;;
968
+ --no-wait) no_wait=true; shift ;;
969
+ *) shift ;;
970
+ esac
971
+ done
972
+
973
+ local api_url=$(ensure_connection)
974
+
975
+ local def_thinking=$(get_config "generation.thinking")
976
+ [ -z "$thinking" ] && thinking="${def_thinking:-true}"
977
+ [ "$no_thinking" = true ] && thinking="false"
978
+
979
+ # Normalize boolean for jq --argjson
980
+ thinking=$(normalize_bool "$thinking" "true")
981
+
982
+ local api_mode=$(load_api_mode)
983
+
984
+ echo "Generating random music..."
985
+ echo " Thinking: $thinking"
986
+ echo " API: $api_mode"
987
+ echo " Output: $OUTPUT_DIR"
988
+ echo ""
989
+
990
+ if [ "$api_mode" = "completion" ]; then
991
+ # --- Completion mode ---
992
+ local model_id=$(get_completion_model "$api_url" "")
993
+ local def_audio_format=$(get_config "generation.audio_format")
994
+
995
+ local payload_c=$(jq -n \
996
+ --arg model "$model_id" \
997
+ --argjson thinking "$thinking" \
998
+ --arg format "${def_audio_format:-mp3}" \
999
+ '{
1000
+ model: $model,
1001
+ messages: [{"role": "user", "content": "Generate a random song"}],
1002
+ stream: false,
1003
+ sample_mode: true,
1004
+ thinking: $thinking,
1005
+ audio_config: { format: $format }
1006
+ }')
1007
+
1008
+ local temp_payload=$(mktemp)
1009
+ printf '%s' "$payload_c" > "$temp_payload"
1010
+
1011
+ send_completion_request "$api_url" "$temp_payload"
1012
+ else
1013
+ # --- Native mode ---
1014
+ local payload=$(jq -n --argjson thinking "$thinking" '{sample_mode: true, thinking: $thinking}')
1015
+
1016
+ local temp_payload=$(mktemp)
1017
+ printf '%s' "$payload" > "$temp_payload"
1018
+
1019
+ local api_key=$(load_api_key)
1020
+ local response
1021
+ if [ -n "$api_key" ]; then
1022
+ response=$(curl -s -X POST "${api_url}/release_task" \
1023
+ -H "Content-Type: application/json; charset=utf-8" \
1024
+ -H "Authorization: Bearer ${api_key}" \
1025
+ --data-binary "@${temp_payload}")
1026
+ else
1027
+ response=$(curl -s -X POST "${api_url}/release_task" \
1028
+ -H "Content-Type: application/json; charset=utf-8" \
1029
+ --data-binary "@${temp_payload}")
1030
+ fi
1031
+
1032
+ rm -f "$temp_payload"
1033
+
1034
+ local job_id=$(echo "$response" | jq -r '.data.task_id // .task_id // empty')
1035
+ [ -z "$job_id" ] && { echo -e "${RED}Error: Failed to create job${NC}"; echo "$response"; exit 1; }
1036
+
1037
+ if [ "$no_wait" = true ]; then
1038
+ echo "Job ID: $job_id"
1039
+ echo "Use '$0 status $job_id' to check progress and download"
1040
+ else
1041
+ wait_for_job "$api_url" "$job_id"
1042
+ fi
1043
+ fi
1044
+ }
1045
+
1046
+ # Help
1047
+ show_help() {
1048
+ echo "ACE-Step Music Generation CLI"
1049
+ echo ""
1050
+ echo "Requirements: curl, jq"
1051
+ echo ""
1052
+ echo "Usage: $0 <command> [options]"
1053
+ echo ""
1054
+ echo "Commands:"
1055
+ echo " generate Generate music from text"
1056
+ echo " random Generate random music"
1057
+ echo " status Check job status and download results"
1058
+ echo " models List available models"
1059
+ echo " health Check API health"
1060
+ echo " config Manage configuration"
1061
+ echo ""
1062
+ echo "Output:"
1063
+ echo " Results saved to: $OUTPUT_DIR/<job_id>.json"
1064
+ echo " Audio files: $OUTPUT_DIR/<job_id>_1.mp3, ..."
1065
+ echo ""
1066
+ echo "Generate Options:"
1067
+ echo " -c, --caption Music style/genre description (caption mode)"
1068
+ echo " -d, --description Simple description, LM auto-generates caption/lyrics"
1069
+ echo " -l, --lyrics Lyrics text"
1070
+ echo " -t, --thinking Enable thinking mode (default: true)"
1071
+ echo " --no-thinking Disable thinking mode"
1072
+ echo " --no-format Disable format enhancement"
1073
+ echo ""
1074
+ echo "Examples:"
1075
+ echo " $0 generate \"Pop music with guitar\" # Caption mode"
1076
+ echo " $0 generate -d \"A February love song\" # Simple mode (LM generates)"
1077
+ echo " $0 generate -c \"Jazz\" -l \"[Verse] Hello\" # With lyrics"
1078
+ echo " $0 random"
1079
+ echo " $0 status <job_id>"
1080
+ echo " $0 config --set generation.thinking false"
1081
+ }
1082
+
1083
+ # Main
1084
+ case "$1" in
1085
+ generate) shift; cmd_generate "$@" ;;
1086
+ random) shift; cmd_random "$@" ;;
1087
+ status) shift; cmd_status "$@" ;;
1088
+ models) cmd_models ;;
1089
+ health) cmd_health ;;
1090
+ config) shift; cmd_config "$@" ;;
1091
+ help|--help|-h) show_help ;;
1092
+ *) show_help; exit 1 ;;
1093
+ esac
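The `case "$1" in … esac` block above is the whole command router: each subcommand shifts itself off the argument list and hands the rest to its handler. The same pattern in miniature (the `greet` and `health` commands here are hypothetical, not part of the real CLI):

```shell
#!/bin/sh
# Minimal sketch of the subcommand-dispatch pattern used by the CLI above.
# "greet" and "health" are illustrative commands only.
dispatch() {
  case "$1" in
    greet) shift; echo "hello $1" ;;          # consume subcommand, use remaining args
    health) echo "ok" ;;
    *) echo "usage: dispatch <greet|health>" >&2; return 1 ;;
  esac
}

dispatch greet world   # prints "hello world"
dispatch health        # prints "ok"
```

Unknown subcommands fall through to the `*` arm, print usage, and return nonzero, mirroring the `*) show_help; exit 1` arm above.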
.claude/skills/acestep/scripts/config.example.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "api_url": "https://api.acemusic.ai",
3
+ "api_key": "",
4
+ "api_mode": "completion",
5
+ "generation": {
6
+ "thinking": true,
7
+ "use_format": false,
8
+ "use_cot_caption": true,
9
+ "use_cot_language": false,
10
+ "audio_format": "mp3",
11
+ "batch_size": 1,
12
+ "vocal_language": "en"
13
+ }
14
+ }
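A script consuming a config file like this typically lets environment variables override the file's defaults. A minimal sketch of that precedence, assuming a hypothetical `ACESTEP_CLI_API_URL` override variable (the real CLI's lookup order may differ):

```shell
# Sketch: environment variable overrides the config-file default.
# ACESTEP_CLI_API_URL is a hypothetical name for illustration.
config_api_url="https://api.acemusic.ai"           # value as read from the JSON file
api_url="${ACESTEP_CLI_API_URL:-$config_api_url}"  # env wins when set and non-empty
echo "$api_url"
```

With the variable unset this prints the config default; exporting `ACESTEP_CLI_API_URL` before running would switch the endpoint without editing the file.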
.dockerignore ADDED
@@ -0,0 +1,42 @@
1
+ # Reduce build context; models are downloaded at runtime from HuggingFace
2
+ .git
3
+ .gitignore
4
+ .dockerignore
5
+ *.md
6
+ !README.md
7
+
8
+ __pycache__
9
+ *.py[cod]
10
+ *$py.class
11
+ .venv
12
+ venv
13
+ .env
14
+ .env.*
15
+ !.env.example
16
+
17
+ checkpoints/
18
+ gradio_outputs/
19
+ datasets/
20
+ lora_output/
21
+ lokr_output/
22
+ *.log
23
+ .cache
24
+ .pytest_cache
25
+ .ruff_cache
26
+ .mypy_cache
27
+ torchinductor_root/
28
+ PortableGit/
29
+ proxy_config.txt
30
+ *.7z
31
+ .history/
32
+ discord_bot/
33
+ feishu_bot/
34
+ test_*.py
35
+ **/tests/
36
+ **/test_*.py
37
+ playground.ipynb
38
+ issues/
39
+ checkpoints_legacy/
40
+ checkpoints_pack/
41
+ python_embeded/
42
+ acestep/third_parts/vllm/
.editorconfig ADDED
@@ -0,0 +1,16 @@
1
+ root = true
2
+
3
+ [*]
4
+ charset = utf-8
5
+ end_of_line = lf
6
+ insert_final_newline = true
7
+ trim_trailing_whitespace = true
8
+
9
+ [*.{bat,cmd,ps1}]
10
+ end_of_line = crlf
11
+
12
+ [*.png]
13
+ charset = unset
14
+ end_of_line = unset
15
+ insert_final_newline = false
16
+ trim_trailing_whitespace = false
.env.example ADDED
@@ -0,0 +1,78 @@
1
+ # ACE-Step Environment Configuration
2
+ # Copy this file to .env and modify as needed
3
+ #
4
+ # This file is used by:
5
+ # - Python scripts (acestep_v15_pipeline.py, api_server.py, etc.)
6
+ # - Windows launcher (start_gradio_ui.bat)
7
+ # - Linux/macOS launchers (start_gradio_ui.sh, start_gradio_ui_macos.sh)
8
+ #
9
+ # Settings in .env will survive repository updates, unlike hardcoded values
10
+ # in launcher scripts which get overwritten on each update.
11
+
12
+ # ==================== Model Settings ====================
13
+ # DiT model path
14
+ ACESTEP_CONFIG_PATH=acestep-v15-turbo
15
+
16
+ # LM model path (used when LLM is enabled)
17
+ # Available: acestep-5Hz-lm-0.6B, acestep-5Hz-lm-1.7B, acestep-5Hz-lm-4B
18
+ ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-1.7B
19
+
20
+ # Device selection: auto, cuda, cpu, xpu
21
+ ACESTEP_DEVICE=auto
22
+
23
+ # LM backend: vllm (faster) or pt (PyTorch native)
24
+ ACESTEP_LM_BACKEND=vllm
25
+
26
+ # ==================== LLM Initialization ====================
27
+ # Controls whether to initialize the Language Model (LLM/5Hz LM)
28
+ #
29
+ # Flow: GPU detection (full) → ACESTEP_INIT_LLM override → Model loading
30
+ # GPU optimizations (offload, quantization, batch limits) are ALWAYS applied.
31
+ # ACESTEP_INIT_LLM only overrides the "should we try to load LLM" decision.
32
+ #
33
+ # Values:
34
+ # auto (or empty) = Use GPU auto-detection result (recommended)
35
+ # true/1/yes = Force enable LLM after GPU detection (may cause OOM)
36
+ # false/0/no = Force disable LLM (pure DiT mode, faster)
37
+ #
38
+ # Examples:
39
+ # ACESTEP_INIT_LLM=auto # Let GPU detection decide (recommended)
40
+ # ACESTEP_INIT_LLM= # Same as auto
41
+ # ACESTEP_INIT_LLM=true # Force enable even on low VRAM GPU
42
+ # ACESTEP_INIT_LLM=false # Force disable for pure DiT mode
43
+ #
44
+ # When LLM is disabled, these features are unavailable:
45
+ # - Thinking mode (thinking=true)
46
+ # - Chain-of-Thought caption/language detection
47
+ # - Sample mode (generate from description)
48
+ # - Format mode (LLM-enhanced input)
49
+ #
50
+ # Default: auto (based on GPU VRAM detection)
51
+ ACESTEP_INIT_LLM=auto
52
+
53
+ # ==================== Download Settings ====================
54
+ # Preferred download source: auto, huggingface, modelscope
55
+ # ACESTEP_DOWNLOAD_SOURCE=auto
56
+
57
+ # ==================== API Server Settings ====================
58
+ # API key for authentication (optional)
59
+ # ACESTEP_API_KEY=sk-your-secret-key
60
+
61
+ # ==================== Gradio UI Settings ====================
62
+ # Server port (default: 7860)
63
+ # PORT=7860
64
+
65
+ # Server name/host (default: 127.0.0.1 for local only, 0.0.0.0 for network access)
66
+ # SERVER_NAME=127.0.0.1
67
+
68
+ # UI language: en, zh, he, ja (default: en)
69
+ # LANGUAGE=en
70
+
71
+ # Default batch size for generation (1 to GPU-dependent max)
72
+ # When not specified, defaults to min(2, GPU_max)
73
+ # ACESTEP_BATCH_SIZE=2
74
+
75
+ # ==================== Startup Settings ====================
76
+ # Skip model loading at startup (models will be lazy-loaded on first request)
77
+ # Set to true to start server quickly without loading models
78
+ # ACESTEP_NO_INIT=false
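The launchers read this file at startup. One common way a shell launcher sources a `KEY=VALUE` file is `set -a`, which exports every variable assigned while the file is sourced; this is a sketch only, and the actual launchers may parse quoting and comments differently:

```shell
# Minimal sketch of loading a .env-style file in a launcher script.
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
# comments and blank lines are harmless when sourcing

ACESTEP_DEVICE=auto
ACESTEP_LM_BACKEND=vllm
EOF

set -a            # auto-export every variable assigned below
. "$env_file"     # source the file in the current shell
set +a
rm -f "$env_file"

echo "device=$ACESTEP_DEVICE backend=$ACESTEP_LM_BACKEND"
```

Because the file is sourced rather than parsed, values are plain shell assignments, which is why the comments above recommend `.env` over hardcoding values in the launcher scripts themselves.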
.github/ISSUE_TEMPLATE/bug_report.md ADDED
@@ -0,0 +1,38 @@
1
+ ---
2
+ name: Bug report
3
+ about: Create a report to help us improve
4
+ title: ''
5
+ labels: ''
6
+ assignees: ''
7
+
8
+ ---
9
+
10
+ **Describe the bug**
11
+ A clear and concise description of what the bug is.
12
+
13
+ **To Reproduce**
14
+ Steps to reproduce the behavior:
15
+ 1. Go to '...'
16
+ 2. Click on '....'
17
+ 3. Scroll down to '....'
18
+ 4. See error
19
+
20
+ **Expected behavior**
21
+ A clear and concise description of what you expected to happen.
22
+
23
+ **Screenshots**
24
+ If applicable, add screenshots to help explain your problem.
25
+
26
+ **Desktop (please complete the following information):**
27
+ - OS: [e.g. iOS]
28
+ - Browser [e.g. chrome, safari]
29
+ - Version [e.g. 22]
30
+
31
+ **Smartphone (please complete the following information):**
32
+ - Device: [e.g. iPhone6]
33
+ - OS: [e.g. iOS8.1]
34
+ - Browser [e.g. stock browser, safari]
35
+ - Version [e.g. 22]
36
+
37
+ **Additional context**
38
+ Add any other context about the problem here.
.github/ISSUE_TEMPLATE/feature_request.md ADDED
@@ -0,0 +1,20 @@
1
+ ---
2
+ name: Feature request
3
+ about: Suggest an idea for this project
4
+ title: ''
5
+ labels: ''
6
+ assignees: ''
7
+
8
+ ---
9
+
10
+ **Is your feature request related to a problem? Please describe.**
11
+ A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12
+
13
+ **Describe the solution you'd like**
14
+ A clear and concise description of what you want to happen.
15
+
16
+ **Describe alternatives you've considered**
17
+ A clear and concise description of any alternative solutions or features you've considered.
18
+
19
+ **Additional context**
20
+ Add any other context or screenshots about the feature request here.
.github/copilot-instructions.md ADDED
@@ -0,0 +1,67 @@
1
+ # ACE-Step 1.5 - GitHub Copilot Instructions
2
+
3
+ ## Project Overview
4
+
5
+ ACE-Step 1.5 is an open-source music foundation model combining a Language Model (LM) as a planner with a Diffusion Transformer (DiT) for audio synthesis. It generates commercial-grade music on consumer hardware (< 4GB VRAM).
6
+
7
+ ## Tech Stack
8
+
9
+ - **Python 3.11-3.12** (ROCm on Windows requires 3.12; other platforms use 3.11)
10
+ - **PyTorch 2.7+** with CUDA 12.8 (Windows/Linux), MPS (macOS ARM64)
11
+ - **Transformers 4.51.0-4.57.x** for LLM inference
12
+ - **Diffusers** for diffusion models
13
+ - **Gradio 6.2.0** for web UI
14
+ - **FastAPI + Uvicorn** for REST API server
15
+ - **uv** for dependency management
16
+ - **MLX** (Apple Silicon native acceleration, macOS ARM64)
17
+ - **nano-vllm** (optimized LLM inference, non-macOS ARM64)
18
+
19
+ ## Multi-Platform Support
20
+
21
+ **CRITICAL**: Supports CUDA, ROCm, Intel XPU, MPS, MLX, and CPU. When fixing bugs or adding features:
22
+ - **DO NOT alter non-target platform paths** unless explicitly required
23
+ - Changes to CUDA code should not affect MPS/XPU/CPU paths
24
+ - Use `gpu_config.py` for hardware detection and configuration
25
+
26
+ ## Code Organization
27
+
28
+ ### Main Entry Points
29
+ - `acestep/acestep_v15_pipeline.py` - Gradio UI pipeline
30
+ - `acestep/api_server.py` - REST API server
31
+ - `cli.py` - Command-line interface
32
+ - `acestep/model_downloader.py` - Model downloader
33
+
34
+ ### Core Modules
35
+ - `acestep/handler.py` - Audio generation handler (AceStepHandler)
36
+ - `acestep/llm_inference.py` - LLM handler for text processing
37
+ - `acestep/inference.py` - Generation logic and parameters
38
+ - `acestep/gpu_config.py` - Hardware detection and GPU configuration
39
+ - `acestep/audio_utils.py` - Audio processing utilities
40
+ - `acestep/constants.py` - Global constants
41
+
42
+ ### UI & Internationalization
43
+ - `acestep/gradio_ui/` - Gradio interface components
44
+ - `acestep/gradio_ui/i18n.py` - i18n system (50+ languages)
45
+ - All user-facing strings must use i18n translation keys
46
+
47
+ ### Training
48
+ - `acestep/training/` - LoRA training pipeline
49
+ - `acestep/dataset/` - Dataset handling
50
+
51
+ ## Key Conventions
52
+
53
+ - **Python style**: PEP 8, 4 spaces, double quotes for strings
54
+ - **Naming**: `snake_case` functions/variables, `PascalCase` classes, `UPPER_SNAKE_CASE` constants
55
+ - **Logging**: Use `loguru` logger (not `print()` except CLI output)
56
+ - **Dependencies**: Use `uv add <package>` to add to `pyproject.toml`
57
+
58
+ ## Performance
59
+
60
+ - Target: 4GB VRAM - minimize memory allocations
61
+ - Lazy load models when needed
62
+ - Batch operations supported (up to 8 songs)
63
+
64
+ ## Additional Resources
65
+
66
+ - **AGENTS.md**: Detailed guidance for AI coding agents
67
+ - **CONTRIBUTING.md**: Contribution workflow and guidelines
.github/workflows/codeql.yml ADDED
@@ -0,0 +1,99 @@
1
+ # For most projects, this workflow file will not need changing; you simply need
2
+ # to commit it to your repository.
3
+ #
4
+ # You may wish to alter this file to override the set of languages analyzed,
5
+ # or to provide custom queries or build logic.
6
+ #
7
+ # ******** NOTE ********
8
+ # We have attempted to detect the languages in your repository. Please check
9
+ # the `language` matrix defined below to confirm you have the correct set of
10
+ # supported CodeQL languages.
11
+ #
12
+ name: "CodeQL Advanced"
13
+
14
+ on:
15
+ push:
16
+ branches: [ "main" ]
17
+ pull_request:
18
+ branches: [ "main" ]
19
+ schedule:
20
+ - cron: '26 2 * * 5'
21
+
22
+ jobs:
23
+ analyze:
24
+ name: Analyze (${{ matrix.language }})
25
+ # Runner size impacts CodeQL analysis time. To learn more, please see:
26
+ # - https://gh.io/recommended-hardware-resources-for-running-codeql
27
+ # - https://gh.io/supported-runners-and-hardware-resources
28
+ # - https://gh.io/using-larger-runners (GitHub.com only)
29
+ # Consider using larger runners or machines with greater resources for possible analysis time improvements.
30
+ runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}
31
+ permissions:
32
+ # required for all workflows
33
+ security-events: write
34
+
35
+ # required to fetch internal or private CodeQL packs
36
+ packages: read
37
+
38
+ # only required for workflows in private repositories
39
+ actions: read
40
+ contents: read
41
+
42
+ strategy:
43
+ fail-fast: false
44
+ matrix:
45
+ include:
46
+ - language: python
47
+ build-mode: none
48
+ # CodeQL supports the following values keywords for 'language': 'actions', 'c-cpp', 'csharp', 'go', 'java-kotlin', 'javascript-typescript', 'python', 'ruby', 'rust', 'swift'
49
+ # Use `c-cpp` to analyze code written in C, C++ or both
50
+ # Use 'java-kotlin' to analyze code written in Java, Kotlin or both
51
+ # Use 'javascript-typescript' to analyze code written in JavaScript, TypeScript or both
52
+ # To learn more about changing the languages that are analyzed or customizing the build mode for your analysis,
53
+ # see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning.
54
+ # If you are analyzing a compiled language, you can modify the 'build-mode' for that language to customize how
55
+ # your codebase is analyzed, see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/codeql-code-scanning-for-compiled-languages
56
+ steps:
57
+ - name: Checkout repository
58
+ uses: actions/checkout@v4
59
+
60
+ # Add any setup steps before running the `github/codeql-action/init` action.
61
+ # This includes steps like installing compilers or runtimes (`actions/setup-node`
62
+ # or others). This is typically only required for manual builds.
63
+ # - name: Setup runtime (example)
64
+ # uses: actions/setup-example@v1
65
+
66
+ # Initializes the CodeQL tools for scanning.
67
+ - name: Initialize CodeQL
68
+ uses: github/codeql-action/init@v4
69
+ with:
70
+ languages: ${{ matrix.language }}
71
+ build-mode: ${{ matrix.build-mode }}
72
+ # If you wish to specify custom queries, you can do so here or in a config file.
73
+ # By default, queries listed here will override any specified in a config file.
74
+ # Prefix the list here with "+" to use these queries and those in the config file.
75
+
76
+ # For more details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
77
+ # queries: security-extended,security-and-quality
78
+
79
+ # If the analyze step fails for one of the languages you are analyzing with
80
+ # "We were unable to automatically build your code", modify the matrix above
81
+ # to set the build mode to "manual" for that language. Then modify this step
82
+ # to build your code.
83
+ # ℹ️ Command-line programs to run using the OS shell.
84
+ # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun
85
+ - name: Run manual build steps
86
+ if: matrix.build-mode == 'manual'
87
+ shell: bash
88
+ run: |
89
+ echo 'If you are using a "manual" build mode for one or more of the' \
90
+ 'languages you are analyzing, replace this with the commands to build' \
91
+ 'your code, for example:'
92
+ echo ' make bootstrap'
93
+ echo ' make release'
94
+ exit 1
95
+
96
+ - name: Perform CodeQL Analysis
97
+ uses: github/codeql-action/analyze@v4
98
+ with:
99
+ category: "/language:${{matrix.language}}"
.gitignore ADDED
@@ -0,0 +1,250 @@
1
+ # HF Spaces reject binaries in git
2
+ assets/*.png
3
+ assets/*.gif
4
+ acestep/third_parts/nano-vllm/assets/
5
+
6
+ # Exclude potentially copyrighted training data
7
+ *_lyrics.txt
8
+ *.mp3
9
+ AlbumArt*.jpg
10
+ Folder.jpg
11
+
12
+
13
+ data/
14
+ *.mp3
15
+ *.wav
16
+
17
+ # Byte-compiled / optimized / DLL files
18
+ __pycache__/
19
+ *.py[codz]
20
+ *$py.class
21
+
22
+ # C extensions
23
+ *.so
24
+
25
+ # Distribution / packaging
26
+ .Python
27
+ build/
28
+ develop-eggs/
29
+ dist/
30
+ downloads/
31
+ eggs/
32
+ .eggs/
33
+ lib/
34
+ lib64/
35
+ parts/
36
+ sdist/
37
+ var/
38
+ wheels/
39
+ share/python-wheels/
40
+ *.egg-info/
41
+ .installed.cfg
42
+ *.egg
43
+ MANIFEST
44
+
45
+ # PyInstaller
46
+ # Usually these files are written by a python script from a template
47
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
48
+ *.manifest
49
+ *.spec
50
+
51
+ # Installer logs
52
+ pip-log.txt
53
+ pip-delete-this-directory.txt
54
+
55
+ # Unit test / coverage reports
56
+ htmlcov/
57
+ .tox/
58
+ .nox/
59
+ .coverage
60
+ .coverage.*
61
+ .cache
62
+ nosetests.xml
63
+ coverage.xml
64
+ *.cover
65
+ *.py.cover
66
+ .hypothesis/
67
+ .pytest_cache/
68
+ cover/
69
+
70
+ # Translations
71
+ *.mo
72
+ *.pot
73
+
74
+ # Django stuff:
75
+ *.log
76
+ local_settings.py
77
+ db.sqlite3
78
+ db.sqlite3-journal
79
+
80
+ # Flask stuff:
81
+ instance/
82
+ .webassets-cache
83
+
84
+ # Scrapy stuff:
85
+ .scrapy
86
+
87
+ # Sphinx documentation
88
+ docs/_build/
89
+
90
+ # PyBuilder
91
+ .pybuilder/
92
+ target/
93
+
94
+ # Jupyter Notebook
95
+ .ipynb_checkpoints
96
+
97
+ # IPython
98
+ profile_default/
99
+ ipython_config.py
100
+
101
+ # pyenv
102
+ # For a library or package, you might want to ignore these files since the code is
103
+ # intended to run in multiple environments; otherwise, check them in:
104
+ # .python-version
105
+
106
+ # pipenv
107
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
108
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
109
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
110
+ # install all needed dependencies.
111
+ #Pipfile.lock
112
+
113
+ # UV
114
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
115
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
116
+ # commonly ignored for libraries.
117
+ uv.lock
118
+
119
+ # poetry
120
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
121
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
122
+ # commonly ignored for libraries.
123
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
124
+ #poetry.lock
125
+ #poetry.toml
126
+
127
+ # pdm
128
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
129
+ # pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
130
+ # https://pdm-project.org/en/latest/usage/project/#working-with-version-control
131
+ #pdm.lock
132
+ #pdm.toml
133
+ .pdm-python
134
+ .pdm-build/
135
+
136
+ # pixi
137
+ # Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
138
+ #pixi.lock
139
+ # Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
140
+ # in the .venv directory. It is recommended not to include this directory in version control.
141
+ .pixi
142
+
143
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
144
+ __pypackages__/
145
+
146
+ # Celery stuff
147
+ celerybeat-schedule
148
+ celerybeat.pid
149
+
150
+ # SageMath parsed files
151
+ *.sage.py
152
+
153
+ # Environments
154
+ .env
155
+ .envrc
156
+ .venv
157
+ env/
158
+ venv/
159
+ ENV/
160
+ env.bak/
161
+ venv.bak/
162
+
163
+ # Spyder project settings
164
+ .spyderproject
165
+ .spyproject
166
+
167
+ # Rope project settings
168
+ .ropeproject
169
+
170
+ # mkdocs documentation
171
+ /site
172
+
173
+ # mypy
174
+ .mypy_cache/
175
+ .dmypy.json
176
+ dmypy.json
177
+
178
+ # Pyre type checker
179
+ .pyre/
180
+
181
+ # pytype static type analyzer
182
+ .pytype/
183
+
184
+ # Cython debug symbols
185
+ cython_debug/
186
+
187
+ # PyCharm
188
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
189
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
190
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
191
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
192
+ #.idea/
193
+
194
+ # Abstra
195
+ # Abstra is an AI-powered process automation framework.
196
+ # Ignore directories containing user credentials, local state, and settings.
197
+ # Learn more at https://abstra.io/docs
198
+ .abstra/
199
+
200
+ # Visual Studio Code
201
+ # Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
202
+ # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
203
+ # and can be added to the global gitignore or merged into this file. However, if you prefer,
204
+ # you could uncomment the following to ignore the entire vscode folder
205
+ # .vscode/
206
+
207
+ # Ruff stuff:
208
+ .ruff_cache/
209
+
210
+ # PyPI configuration file
211
+ .pypirc
212
+
213
+ # Cursor
214
+ # Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
215
+ # exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
216
+ # refer to https://docs.cursor.com/context/ignore-files
217
+ .cursorignore
218
+ .cursorindexingignore
219
+
220
+ # Marimo
221
+ marimo/_static/
222
+ marimo/_lsp/
223
+ __marimo__/
224
+ tests/
225
+ checkpoints/
226
+ playground.ipynb
227
+ .history/
228
+ upload_checkpoints.sh
229
+ checkpoints.7z
230
+ README_old.md
231
+ discord_bot/
232
+ feishu_bot/
233
+ tmp*
234
+ torchinductor_root/
235
+ scripts/*.pyc
236
+ scripts/__pycache__/
237
+ !scripts/check_gpu.py
238
+ !scripts/prepare_vae_calibration_data.py
239
+ checkpoints_legacy/
240
+ lora_output/
241
+ datasets/
242
+ python_embeded/
243
+ checkpoints_pack/
244
+ issues/
245
+ PortableGit/
246
+ proxy_config.txt
247
+ gradio_outputs/
248
+ acestep/third_parts/vllm/
249
+ test_lora_scale_fix.py
250
+ lokr_output/
AGENTS.md ADDED
@@ -0,0 +1,96 @@
1
+ # AGENTS.md
2
+
3
+ Guidance for AI coding agents working in `ace-step/ACE-Step-1.5`.
4
+
5
+ This document is aligned with the intent from:
6
+ - Discussion #408: functional decomposition to reduce risk from large mixed-responsibility files.
7
+ - Discussion #365: low-risk contribution workflow, minimal scope, and review rigor.
8
+
9
+ ## Primary Objectives
10
+
11
+ 1. Keep changes safe and reviewable.
12
+ 2. Prefer small, maintainable, decomposed modules.
13
+ 3. Preserve behavior outside the target fix.
14
+ 4. Validate with focused Python unit tests.
15
+
16
+ ## Scope and Change Control (Required)
17
+
18
+ - Solve one problem per task/PR.
19
+ - Keep edits minimal: touch only files/functions required for the requested change.
20
+ - Do not make drive-by refactors, formatting sweeps, or opportunistic cleanups.
21
+ - Do not alter non-target hardware/runtime paths (CPU/CUDA/MPS/XPU) unless required by the task.
22
+ - If any cross-path change is necessary, isolate it and justify it in the PR notes.
23
+ - Preserve existing public interfaces unless the task explicitly requires an interface change.
24
+
25
+ ## Decomposition and Module Size Policy
26
+
27
+ - Prefer single-responsibility modules with clear boundaries.
28
+ - Target module size:
29
+ - Optimal: `<= 150` LOC @coderabbit: Exceeding this limit is NIGGLE, not hard cap or major issue.
30
+ - Hard cap: `200` LOC
31
+ - Function decomposition rules:
32
+ - Do one thing at a time; if a function description naturally contains "and", split it.
33
+ - Split by responsibility, not by convenience.
34
+ - Keep data flow explicit (`data in, data out`); side effects must be obvious and deliberate.
35
+ - Push decisions up and push work down (orchestration at higher layers, execution details in lower layers).
36
+ - The call graph should read clearly from top-level orchestration to leaf operations.
37
+ - If a module would exceed `200` LOC:
38
+ - Split by responsibility before merging, or
39
+ - Add a short justification in PR notes and include a concrete follow-up split plan.
40
+ - Keep orchestrator/facade modules thin. Move logic into focused helpers/services.
41
+ - Preserve stable facade imports when splitting large files so external callers are not broken.
42
+
43
+ ## Python Unit Testing Expectations
44
+
45
+ - Add or update tests for every behavior change and bug fix.
46
+ - Match repository conventions:
47
+ - Use `unittest`-style tests.
48
+ - Name test files as `*_test.py` or `test_*.py`.
49
+ - Keep tests deterministic, fast, and scoped to changed behavior.
50
+ - Use mocks/fakes for GPU, filesystem, network, and external services where possible.
51
+ - If a change requires mocking a large portion of the system to test one unit, treat that as a decomposition smell and refactor boundaries.
52
+ - Include at least:
53
+ - One success-path test.
54
+ - One regression/edge-case test for the bug being fixed.
55
+ - One non-target behavior check when relevant.
56
+ - Run targeted tests locally before submitting.
57
+
58
+ ## Feature Gating and WIP Safety
59
+
60
+ - Do not expose unfinished or non-functional user-facing flows by default.
61
+ - Gate WIP or unstable UI/API paths behind explicit feature/release flags.
62
+ - Keep default behavior stable; "coming soon" paths must not appear as usable functionality unless they are operational and tested.
63
+
64
+ ## Python Coding Best Practices
65
+
66
+ - Use explicit, readable code over clever shortcuts.
67
+ - Docstrings are mandatory for all new or modified Python modules, classes, and functions.
68
+ - Docstrings must be concise and include purpose plus key inputs/outputs (and raised exceptions when relevant).
69
+ - Add type hints for new/modified functions when practical.
70
+ - Keep functions focused and short; extract helpers instead of nesting complexity.
71
+ - Use clear names that describe behavior, not implementation trivia.
72
+ - Prefer pure functions for logic-heavy paths where possible.
73
+ - Avoid duplicated logic, but do not introduce broad abstractions too early; prefer simple local duplication over unstable premature abstraction.
74
+ - Handle errors explicitly; avoid bare `except`.
75
+ - Keep logging actionable; avoid noisy logs and `print` debugging in committed code.
76
+ - Avoid hidden state and unintended side effects.
77
+ - Write comments only where intent is non-obvious; keep comments concise and technical.
78
+
79
+ ## AI-Agent Workflow (Recommended)
80
+
81
+ 1. Understand the task and define explicit in-scope/out-of-scope boundaries.
82
+ 2. Propose a minimal patch plan before editing.
83
+ 3. Implement the smallest viable change.
84
+ 4. Add/update focused tests.
85
+ 5. Self-review only changed hunks for regressions and scope creep.
86
+ 6. Summarize risk, validation, and non-target impact in PR notes.
87
+
88
+ ## PR Readiness Checklist
89
+
90
+ - [ ] Change is tightly scoped to one problem.
91
+ - [ ] Non-target paths are unchanged, or changes are explicitly justified.
92
+ - [ ] New/updated tests cover changed behavior and edge cases.
93
+ - [ ] No unrelated refactor/formatting churn.
94
+ - [ ] Required docstrings are present for all new/modified modules, classes, and functions.
95
+ - [ ] WIP/unstable functionality is feature-flagged and not exposed as default-ready behavior.
96
+ - [ ] Module LOC policy is met (`<=150` target, `<=200` hard cap or justified exception).
CONTRIBUTING.md ADDED
@@ -0,0 +1,175 @@
1
+ Hopefully this will provide a simple, easy to understand guide to making safe contributions to the project, happy coding!
2
+ ---
3
+
4
+ ## Why This Matters
5
+
6
+ This project supports **many hardware and runtime combinations**.
7
+ A change that works perfectly on one setup can unintentionally break another if scope is not tightly controlled.
8
+
9
+ The project has kind of gone viral, and has thousands of users, amature, semi professional and professional, technical and none technical, it is important that Ace-Step has reliable builds to maintain user trust and engagement.
10
+
11
+ Recent PR patterns have shown avoidable regressions, for example:
12
+
13
+ - Fixes that changed behaviour outside the intended target path
14
+ - Hardware-specific assumptions leaking into general code paths
15
+ - String / status handling changes that broke downstream logic
16
+ - Missing or weak review before merge
17
+
18
+ The goal here is **not blame**.
19
+ The goal is **predictable, low-risk contributions** that maintainers can trust and merge with confidence.
20
+
21
+ ---
22
+
23
+ ## Core Principles for Contributors
24
+
25
+ ### Solve One Problem at a Time
26
+ - Keep each PR focused on **a single bug or feature**.
27
+ - Do **not** mix refactors, formatting, and behaviour changes unless absolutely required.
28
+
29
+ ### Minimize Blast Radius
30
+ - Touch **only** the files and functions required for the fix.
31
+ - Avoid “drive-by improvements” in unrelated code.
32
+
33
+ ### Preserve Non-Target Platforms
34
+ - If fixing **CUDA behaviour**, do not change **CPU / MPS / XPU** paths unless needed.
35
+ - Explicitly state **“non-target platforms unchanged”** in the PR notes — and verify it.
36
+
37
+ ### Prove the Change
38
+ - Add or run **targeted checks** for the affected path.
39
+ - Include a short **regression checklist** in the PR description.
40
+
41
+ ### Be Explicit About Risk
42
+ - Call out edge cases and trade-offs up front.
43
+ - If uncertain, say so and ask maintainers or experienced contributors for preferred direction.
44
+
45
+ Clarity beats confidence.
46
+
47
+ ---
48
+ ## AI Prompt Guardrails for Multi-Platform Projects
49
+
50
+ Tell your coding agent explicitly:
51
+
52
+ - Ask for a proposal and plan before making code changes.
53
+ - Make only the **minimum required changes** for the target issue.
54
+ - Do **not** refactor unrelated code.
55
+ - Do **not** alter non-target hardware/runtime paths unless required.
56
+ - If a cross-platform change is necessary, **isolate and justify it explicitly**.
57
+ - Preserve existing behaviour and interfaces unless the bug fix requires change.
58
+
59
+ These guardrails dramatically reduce accidental regressions from broad AI edits.
60
+
61
+ ---
62
+
63
+ ## Recommended AI-Assisted Workflow (Copilot / CodePilot / Codex)
64
+
65
+ ### Step 1: Commit-Scoped Review (First Pass)
66
+ Once you feel work is complete, and whatever manual or automated testing passes, commit your work to your local project. Note the commit number, or ask your agent to provide the number for your latest commit.
67
+
68
+ **Use a different agent to review your work than was used to produce the work**
69
+
70
+ If you use Claud or OpenAI codex, use your free Copilot tokens in VScode to get a Copilot review. If in doubt, ask your main agent to formulate a prompt for the review agent. It will 'know' what it has worked on and can suggest appropriate focus areas for the review agent.
71
+
72
+ Ask the agent to review **only your commit diff**, not the whole repo.
73
+
74
+ Prompt example:
75
+
76
+ Review commit <sha> only.
77
+ Focus on regressions, behaviour changes, and missing tests.
78
+ Ignore pre-existing issues outside changed hunks.
79
+ Output findings by severity with file/line references.
80
+
81
+ Fix the issues raised by the review, then rerun the review until only trivial, non-breaking issues remain. This may take several iterations before the commit is clean, but watch that the cycle does not expand scope beyond what the primary fix requires.
82
+ ---
83
+
84
+ ### Step 2: Validate Findings
85
+
86
+ Classify each finding as:
87
+
88
+ - **Accept** — real issue introduced or exposed by your change
89
+ - **Rebut** — incorrect or out-of-scope concern
90
+ - **Pre-existing** — not introduced by this PR (note separately)
91
+
92
+ ---
93
+
94
+ ### Step 3: Apply Minimal Fixes
95
+ - Fix **accepted** issues with the **smallest possible patch**.
96
+ - Do **not** broaden scope or refactor opportunistically.
97
+
98
+ ---
99
+
100
+ ### Step 4: PR-Scoped Review (Second Pass)
101
+
102
+ Run the review on the **entire PR diff**: all changes together, but nothing outside them.
103
+
104
+ Prompt example:
105
+
106
+ Review PR diff only (base <base>, head <head>),
107
+ (or alternatively, "treat commit a/b/c... as a whole.")
108
+ Prioritize regression risk across hardware paths.
109
+ Verify unchanged behaviour on non-target platforms.
110
+ Flag only issues in changed code.
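The PR-scoped diff itself can be produced with Git's three-dot syntax, which compares the head against the merge base (a sketch; `main` and `my-feature` are placeholder branch names):

```shell
# Everything the PR adds on top of the base branch, and nothing else.
git diff --stat "main...my-feature"
git diff "main...my-feature" > pr.diff
```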
111
+
112
+ ---
113
+
114
+ ### Step 5: Write Reviewer Responses
115
+
116
+ For each reviewer comment:
117
+
118
+ - Quote the concern
119
+ - Respond in one line
120
+ - Mark disposition clearly
121
+ - Link to fix if applicable
122
+
123
+ ---
124
+
125
+ ## Review / Accept / Rebut / Fix Cycle (Practical Template)
126
+
127
+ Use this structure in your notes:
128
+
129
+ Comment: <reviewer concern>
130
+ Disposition: Accepted | Rebutted | Pre-existing
131
+ Response: <one-line rationale>
132
+ Action: <commit / file / line> or No code change
133
+
134
+ This keeps discussion **objective, fast, and easy to follow**.
135
+
136
+ ---
137
+
138
+ ## PR Description Template (Recommended)
139
+
140
+ ### Summary
141
+ - What bug or feature is addressed
142
+ - Why this change is needed
143
+
144
+ ### Scope
145
+ - Files changed
146
+ - What is explicitly **out of scope**
147
+
148
+ ### Risk and Compatibility
149
+ - Target platform / path
150
+ - Confirmation that **non-target paths are unchanged**
151
+ (or describe exactly what changed and why)
152
+
153
+ ### Regression Checks
154
+ - Checks run (manual and/or automated)
155
+ - Key scenarios validated
156
+
157
+ ### Reviewer Notes
158
+ - Known pre-existing issues not addressed
159
+ - Follow-up items (if any)
160
+
161
+ Your PR description should look something like [this](https://github.com/ace-step/ACE-Step-1.5/pull/309), demonstrating the care and rigor the author applied before opening the PR. If a review bot (CodeRabbit/Copilot) leaves multiple rounds of comments on your PR, it's probably a good idea to close the PR, fix the issues it raised, and resubmit.
162
+
163
+ ---
164
+
165
+ Maintainers are balancing **correctness, stability, and review bandwidth**.
166
+
167
+ PRs that are:
168
+ - tightly scoped
169
+ - clearly explained
170
+ - minimally risky
171
+ - easy to reason about
172
+
173
+ are **much more likely to be reviewed and merged quickly**.
174
+
175
+ Thanks for helping keep the project stable and enjoyable to work on.
Dockerfile ADDED
@@ -0,0 +1,28 @@
1
+ # ACE-Step 1.5 - Hugging Face Docker Space (GPU)
2
+ # Uses CUDA base; no GPU at build time. Port 7860 for Gradio.
3
+ # See https://huggingface.co/docs/hub/spaces-sdks-docker
4
+
5
+ FROM nvidia/cuda:12.4.0-cudnn8-runtime-ubuntu22.04
6
+
7
+ ENV DEBIAN_FRONTEND=noninteractive
8
+ RUN apt-get update && apt-get install -y --no-install-recommends \
9
+ python3 python3-pip python3-venv python3-dev \
10
+ git build-essential \
11
+ && rm -rf /var/lib/apt/lists/*
12
+
13
+ # HF Spaces run as user 1000
14
+ RUN useradd -m -u 1000 user
15
+ USER user
16
+ ENV HOME=/home/user PATH=/home/user/.local/bin:$PATH
17
+ WORKDIR /home/user/app
18
+
19
+ # Install Python deps (no GPU ops at build time)
20
+ COPY --chown=user requirements.txt .
21
+ RUN pip install --no-cache-dir --upgrade pip && \
22
+ pip install --no-cache-dir -r requirements.txt
23
+
24
+ # App code (acestep/, configs/, app.py, etc.) - copied from Space repo root
25
+ COPY --chown=user . .
26
+
27
+ EXPOSE 7860
28
+ CMD ["python3", "app.py"]
README.md CHANGED
@@ -1,285 +1,16 @@
1
  ---
2
- title: Ace Step Munk
3
  emoji: 🎵
4
- colorFrom: indigo
5
- colorTo: blue
6
- sdk: gradio
7
- sdk_version: 6.2.0
8
- app_file: app.py
9
- hf_oauth: true
10
  pinned: false
 
11
  ---
12
 
13
- <h1 align="center">ACE-Step 1.5</h1>
14
- <h1 align="center">Pushing the Boundaries of Open-Source Music Generation</h1>
15
- <p align="center">
16
- <a href="https://ace-step.github.io/ace-step-v1.5.github.io/">Project</a> |
17
- <a href="https://huggingface.co/ACE-Step/Ace-Step1.5">Hugging Face</a> |
18
- <a href="https://modelscope.cn/models/ACE-Step/Ace-Step1.5">ModelScope</a> |
19
- <a href="https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5">Space Demo</a> |
20
- <a href="https://discord.gg/PeWDxrkdj7">Discord</a> |
21
- <a href="https://arxiv.org/abs/2602.00744">Technical Report</a>
22
- </p>
23
 
24
- <p align="center">
25
- <img src="./assets/orgnization_logos.png" width="100%" alt="StepFun Logo">
26
- </p>
27
 
28
- ## Table of Contents
29
-
30
- - [✨ Features](#-features)
31
- - [⚡ Quick Start](#-quick-start)
32
- - [🚀 Launch Scripts](#-launch-scripts)
33
- - [📚 Documentation](#-documentation)
34
- - [📖 Tutorial](#-tutorial)
35
- - [🏗️ Architecture](#️-architecture)
36
- - [🦁 Model Zoo](#-model-zoo)
37
- - [🔬 Benchmark](#-benchmark)
38
-
39
- ## 📝 Abstract
40
- 🚀 We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast—under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style.
41
-
42
- 🌉 At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints—scaling from short loops to 10-minute compositions—while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). ⚡ Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. 🎚️
43
-
44
- 🔮 Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities—such as cover generation, repainting, and vocal-to-BGM conversion—while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. 🎸
45
-
46
-
47
- ## ✨ Features
48
-
49
- <p align="center">
50
- <img src="./assets/application_map.png" width="100%" alt="ACE-Step Framework">
51
- </p>
52
-
53
- ### ⚡ Performance
54
- - ✅ **Ultra-Fast Generation** — Under 2s per full song on A100, under 10s on RTX 3090 (0.5s to 10s on A100 depending on think mode & diffusion steps)
55
- - ✅ **Flexible Duration** — Supports 10 seconds to 10 minutes (600s) audio generation
56
- - ✅ **Batch Generation** — Generate up to 8 songs simultaneously
57
-
58
- ### 🎵 Generation Quality
59
- - ✅ **Commercial-Grade Output** — Quality beyond most commercial music models (between Suno v4.5 and Suno v5)
60
- - ✅ **Rich Style Support** — 1000+ instruments and styles with fine-grained timbre description
61
- - ✅ **Multi-Language Lyrics** — Supports 50+ languages with lyrics prompt for structure & style control
62
-
63
- ### 🎛️ Versatility & Control
64
-
65
- | Feature | Description |
66
- |---------|-------------|
67
- | ✅ Reference Audio Input | Use reference audio to guide generation style |
68
- | ✅ Cover Generation | Create covers from existing audio |
69
- | ✅ Repaint & Edit | Selective local audio editing and regeneration |
70
- | ✅ Track Separation | Separate audio into individual stems |
71
- | ✅ Multi-Track Generation | Add layers like Suno Studio's "Add Layer" feature |
72
- | ✅ Vocal2BGM | Auto-generate accompaniment for vocal tracks |
73
- | ✅ Metadata Control | Control duration, BPM, key/scale, time signature |
74
- | ✅ Simple Mode | Generate full songs from simple descriptions |
75
- | ✅ Query Rewriting | Auto LM expansion of tags and lyrics |
76
- | ✅ Audio Understanding | Extract BPM, key/scale, time signature & caption from audio |
77
- | ✅ LRC Generation | Auto-generate lyric timestamps for generated music |
78
- | ✅ LoRA Training | One-click annotation & training in Gradio. 8 songs, 1 hour on 3090 (12GB VRAM) |
79
- | ✅ Quality Scoring | Automatic quality assessment for generated audio |
80
-
81
- ## Staying ahead
82
- -----------------
83
- Star ACE-Step on GitHub and be instantly notified of new releases
84
- ![](assets/star.gif)
85
-
86
- ## ⚡ Quick Start
87
-
88
- > **Requirements:** Python 3.11-3.12, CUDA GPU recommended (also supports MPS / ROCm / Intel XPU / CPU)
89
- >
90
- > **Note:** ROCm on Windows requires Python 3.12 (AMD officially provides Python 3.12 wheels only)
91
-
92
- ```bash
93
- # 1. Install uv
94
- curl -LsSf https://astral.sh/uv/install.sh | sh # macOS / Linux
95
- # powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" # Windows
96
-
97
- # 2. Clone & install
98
- git clone https://github.com/ACE-Step/ACE-Step-1.5.git
99
- cd ACE-Step-1.5
100
- uv sync
101
-
102
- # 3. Launch Gradio UI (models auto-download on first run)
103
- uv run acestep
104
-
105
- # Or launch REST API server
106
- uv run acestep-api
107
- ```
108
-
109
- Open http://localhost:7860 (Gradio) or http://localhost:8001 (API).
110
-
111
- > 📦 **Windows users:** A [portable package](https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z) with pre-installed dependencies is available. See [Installation Guide](./docs/en/INSTALL.md#-windows-portable-package).
112
-
113
- > 📖 **Full installation guide** (AMD/ROCm, Intel GPU, CPU, environment variables, command-line options): [English](./docs/en/INSTALL.md) | [中文](./docs/zh/INSTALL.md) | [日本語](./docs/ja/INSTALL.md)
114
-
115
- ### 💡 Which Model Should I Choose?
116
-
117
- | Your GPU VRAM | Recommended LM Model | Backend | Notes |
118
- |---------------|---------------------|---------|-------|
119
- | **≤6GB** | None (DiT only) | — | LM disabled by default; INT8 quantization + full CPU offload |
120
- | **6-8GB** | `acestep-5Hz-lm-0.6B` | `pt` | Lightweight LM with PyTorch backend |
121
- | **8-16GB** | `acestep-5Hz-lm-0.6B` / `1.7B` | `vllm` | 0.6B for 8-12GB, 1.7B for 12-16GB |
122
- | **16-24GB** | `acestep-5Hz-lm-1.7B` | `vllm` | 4B available on 20GB+; no offload needed on 20GB+ |
123
- | **≥24GB** | `acestep-5Hz-lm-4B` | `vllm` | Best quality, all models fit without offload |
124
-
125
- The UI automatically selects the best configuration for your GPU. All settings (LM model, backend, offloading, quantization) are tier-aware and pre-configured.
126
-
127
- > 📖 GPU compatibility details: [English](./docs/en/GPU_COMPATIBILITY.md) | [中文](./docs/zh/GPU_COMPATIBILITY.md) | [日本語](./docs/ja/GPU_COMPATIBILITY.md) | [한국어](./docs/ko/GPU_COMPATIBILITY.md)
128
-
129
- ## 🚀 Launch Scripts
130
-
131
- Ready-to-use launch scripts for all platforms with auto environment detection, update checking, and dependency installation.
132
-
133
- | Platform | Scripts | Backend |
134
- |----------|---------|---------|
135
- | **Windows** | `start_gradio_ui.bat`, `start_api_server.bat` | CUDA |
136
- | **Windows (ROCm)** | `start_gradio_ui_rocm.bat`, `start_api_server_rocm.bat` | AMD ROCm |
137
- | **Linux** | `start_gradio_ui.sh`, `start_api_server.sh` | CUDA |
138
- | **macOS** | `start_gradio_ui_macos.sh`, `start_api_server_macos.sh` | MLX (Apple Silicon) |
139
-
140
- ```bash
141
- # Windows
142
- start_gradio_ui.bat
143
-
144
- # Linux
145
- chmod +x start_gradio_ui.sh && ./start_gradio_ui.sh
146
-
147
- # macOS (Apple Silicon)
148
- chmod +x start_gradio_ui_macos.sh && ./start_gradio_ui_macos.sh
149
- ```
150
-
151
- ### ⚙️ Customizing Launch Settings
152
-
153
- **Recommended:** Create a `.env` file to customize models, ports, and other settings. Your `.env` configuration will survive repository updates.
154
-
155
- ```bash
156
- # Copy the example file
157
- cp .env.example .env
158
-
159
- # Edit with your preferred settings
160
- # Examples in .env:
161
- ACESTEP_CONFIG_PATH=acestep-v15-turbo
162
- ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-1.7B
163
- PORT=7860
164
- LANGUAGE=en
165
- ```
166
-
167
- > 📖 **Script configuration & customization:** [English](./docs/en/INSTALL.md#-launch-scripts) | [中文](./docs/zh/INSTALL.md#-启动脚本) | [日本語](./docs/ja/INSTALL.md#-起動スクリプト)
168
-
169
- ## 📚 Documentation
170
-
171
- ### Usage Guides
172
-
173
- | Method | Description | Documentation |
174
- |--------|-------------|---------------|
175
- | 🖥️ **Gradio Web UI** | Interactive web interface for music generation | [Guide](./docs/en/GRADIO_GUIDE.md) |
176
- | 🎚️ **Studio UI** | Optional HTML frontend (DAW-like) | [Guide](./docs/en/studio.md) |
177
- | 🐍 **Python API** | Programmatic access for integration | [Guide](./docs/en/INFERENCE.md) |
178
- | 🌐 **REST API** | HTTP-based async API for services | [Guide](./docs/en/API.md) |
179
- | ⌨️ **CLI** | Interactive wizard and configuration | [Guide](./docs/en/CLI.md) |
180
-
181
- ### Setup & Configuration
182
-
183
- | Topic | Documentation |
184
- |-------|---------------|
185
- | 📦 Installation (all platforms) | [English](./docs/en/INSTALL.md) \| [中文](./docs/zh/INSTALL.md) \| [日本語](./docs/ja/INSTALL.md) |
186
- | 🎮 GPU Compatibility | [English](./docs/en/GPU_COMPATIBILITY.md) \| [中文](./docs/zh/GPU_COMPATIBILITY.md) \| [日本語](./docs/ja/GPU_COMPATIBILITY.md) |
187
- | 🔧 GPU Troubleshooting | [English](./docs/en/GPU_TROUBLESHOOTING.md) |
188
- | 🔬 Benchmark & Profiling | [English](./docs/en/BENCHMARK.md) \| [中文](./docs/zh/BENCHMARK.md) |
189
-
190
- ### Multi-Language Docs
191
-
192
- | Language | API | Gradio | Inference | Tutorial | Install | Benchmark |
193
- |----------|-----|--------|-----------|----------|---------|-----------|
194
- | 🇺🇸 English | [Link](./docs/en/API.md) | [Link](./docs/en/GRADIO_GUIDE.md) | [Link](./docs/en/INFERENCE.md) | [Link](./docs/en/Tutorial.md) | [Link](./docs/en/INSTALL.md) | [Link](./docs/en/BENCHMARK.md) |
195
- | 🇨🇳 中文 | [Link](./docs/zh/API.md) | [Link](./docs/zh/GRADIO_GUIDE.md) | [Link](./docs/zh/INFERENCE.md) | [Link](./docs/zh/Tutorial.md) | [Link](./docs/zh/INSTALL.md) | [Link](./docs/zh/BENCHMARK.md) |
196
- | 🇯🇵 日本語 | [Link](./docs/ja/API.md) | [Link](./docs/ja/GRADIO_GUIDE.md) | [Link](./docs/ja/INFERENCE.md) | [Link](./docs/ja/Tutorial.md) | [Link](./docs/ja/INSTALL.md) | — |
197
- | 🇰🇷 한국어 | [Link](./docs/ko/API.md) | [Link](./docs/ko/GRADIO_GUIDE.md) | [Link](./docs/ko/INFERENCE.md) | [Link](./docs/ko/Tutorial.md) | — | — |
198
-
199
- ## 📖 Tutorial
200
-
201
- **🎯 Must Read:** Comprehensive guide to ACE-Step 1.5's design philosophy and usage methods.
202
-
203
- | Language | Link |
204
- |----------|------|
205
- | 🇺🇸 English | [English Tutorial](./docs/en/Tutorial.md) |
206
- | 🇨🇳 中文 | [中文教程](./docs/zh/Tutorial.md) |
207
- | 🇯🇵 日本語 | [日本語チュートリアル](./docs/ja/Tutorial.md) |
208
-
209
- This tutorial covers: mental models and design philosophy, model architecture and selection, input control (text and audio), inference hyperparameters, random factors and optimization strategies.
210
-
211
- ## 🔨 Train
212
-
213
- See the **LoRA Training** tab in Gradio UI for one-click training, or check [Gradio Guide - LoRA Training](./docs/en/GRADIO_GUIDE.md#lora-training) for details.
214
-
215
- ## 🏗️ Architecture
216
-
217
- <p align="center">
218
- <img src="./assets/ACE-Step_framework.png" width="100%" alt="ACE-Step Framework">
219
- </p>
220
-
221
- ## 🦁 Model Zoo
222
-
223
- <p align="center">
224
- <img src="./assets/model_zoo.png" width="100%" alt="Model Zoo">
225
- </p>
226
-
227
- ### DiT Models
228
-
229
- | DiT Model | Pre-Training | SFT | RL | CFG | Step | Refer audio | Text2Music | Cover | Repaint | Extract | Lego | Complete | Quality | Diversity | Fine-Tunability | Hugging Face |
230
- |-----------|:------------:|:---:|:--:|:---:|:----:|:-----------:|:----------:|:-----:|:-------:|:-------:|:----:|:--------:|:-------:|:---------:|:---------------:|--------------|
231
- | `acestep-v15-base` | ✅ | ❌ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | High | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-base) |
232
- | `acestep-v15-sft` | ✅ | ✅ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | High | Medium | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-sft) |
233
- | `acestep-v15-turbo` | ✅ | ✅ | ❌ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |
234
- | `acestep-v15-turbo-rl` | ✅ | ✅ | ✅ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | To be released |
235
-
236
- ### LM Models
237
-
238
- | LM Model | Pretrain from | Pre-Training | SFT | RL | CoT metas | Query rewrite | Audio Understanding | Composition Capability | Copy Melody | Hugging Face |
239
- |----------|---------------|:------------:|:---:|:--:|:---------:|:-------------:|:-------------------:|:----------------------:|:-----------:|--------------|
240
- | `acestep-5Hz-lm-0.6B` | Qwen3-0.6B | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | Medium | Weak | ✅ |
241
- | `acestep-5Hz-lm-1.7B` | Qwen3-1.7B | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | Medium | Medium | ✅ |
242
- | `acestep-5Hz-lm-4B` | Qwen3-4B | ✅ | ✅ | ✅ | ✅ | ✅ | Strong | Strong | Strong | ✅ |
243
-
244
- ## 🔬 Benchmark
245
-
246
- ACE-Step 1.5 includes `profile_inference.py`, a profiling & benchmarking tool that measures LLM, DiT, and VAE timing across devices and configurations.
247
-
248
- ```bash
249
- python profile_inference.py # Single-run profile
250
- python profile_inference.py --mode benchmark # Configuration matrix
251
- ```
252
-
253
- > 📖 **Full guide** (all modes, CLI options, output interpretation): [English](./docs/en/BENCHMARK.md) | [中文](./docs/zh/BENCHMARK.md)
254
-
255
- ## 📜 License & Disclaimer
256
-
257
- This project is licensed under [MIT](./LICENSE)
258
-
259
- ACE-Step enables original music generation across diverse genres, with applications in creative production, education, and entertainment. While designed to support positive and artistic use cases, we acknowledge potential risks such as unintentional copyright infringement due to stylistic similarity, inappropriate blending of cultural elements, and misuse for generating harmful content. To ensure responsible use, we encourage users to verify the originality of generated works, clearly disclose AI involvement, and obtain appropriate permissions when adapting protected styles or materials. By using ACE-Step, you agree to uphold these principles and respect artistic integrity, cultural diversity, and legal compliance. The authors are not responsible for any misuse of the model, including but not limited to copyright violations, cultural insensitivity, or the generation of harmful content.
260
-
261
- 🔔 Important Notice
262
- The only official website for the ACE-Step project is our GitHub Pages site.
263
- We do not operate any other websites.
264
- 🚫 Fake domains include but are not limited to:
265
- ac\*\*p.com, a\*\*p.org, a\*\*\*c.org
266
- ⚠️ Please be cautious. Do not visit, trust, or make payments on any of those sites.
267
-
268
- ## 🙏 Acknowledgements
269
-
270
- This project is co-led by ACE Studio and StepFun.
271
-
272
-
273
- ## 📖 Citation
274
-
275
- If you find this project useful for your research, please consider citing:
276
-
277
- ```BibTeX
278
- @misc{gong2026acestep,
279
- title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
280
- author={Junmin Gong, Yulin Song, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo},
281
- howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
282
- year={2026},
283
- note={GitHub repository}
284
- }
285
- ```
 
1
  ---
2
+ title: ACE-Step 1.5 Music Gen
3
  emoji: 🎵
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: docker
7
+ app_port: 7860
 
 
8
  pinned: false
9
+ license: mit
10
  ---
11
 
12
+ # ACE-Step 1.5 (Docker)
 
 
 
 
 
 
 
 
 
13
 
14
+ Lyric-controllable, open-source text-to-music. Runs as a Docker Space with GPU.
 
 
15
 
16
+ Models are downloaded from the Hub on first run (ACE-Step/Ace-Step1.5). Select **GPU** (e.g. T4 or A10G) in Space Settings.
SECURITY.md ADDED
@@ -0,0 +1,27 @@
1
+ # Security Policy
2
+
3
+ ## Reporting a Vulnerability
4
+
5
+ We take security issues seriously and appreciate responsible disclosure.
6
+
7
+ If you believe you have found a security vulnerability, **please do not report it in a public GitHub issue**.
8
+
9
+ Instead, use one of the following private channels:
10
+
11
+ - Open a **GitHub Security Advisory** for this repository (preferred)
12
+ - Or contact the maintainers directly if a private email channel is listed
13
+
14
+ Please include:
15
+ - A clear description of the issue
16
+ - Steps to reproduce (if applicable)
17
+ - Potential impact
18
+ - Any relevant proof-of-concept or logs
19
+
20
+ We will acknowledge receipt and work to assess the issue as quickly as possible.
21
+
22
+ ## Bug Bounties
23
+
24
+ At this time, this project does **not** operate a formal bug bounty program.
25
+ However, valid and responsibly disclosed security issues may be acknowledged in release notes or documentation at the maintainers’ discretion.
26
+
27
+ Thank you for helping keep the project and its users safe.
app.py CHANGED
@@ -1,21 +1,26 @@
 
 
 
 
1
  import os
2
  import sys
3
 
4
- # Ensure current directory is in sys.path
5
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
 
 
 
 
 
 
 
 
 
 
 
 
6
 
7
  from acestep.acestep_v15_pipeline import main
8
 
9
  if __name__ == "__main__":
10
- # ZeroGPU specific settings if needed
11
- # Usually ZeroGPU works out of the box with @spaces.GPU
12
-
13
- # Run the main function from the pipeline
14
- # We pass arguments as if they were from command line
15
- import sys
16
- sys.argv = [
17
- "app.py",
18
- "--server-name", "0.0.0.0",
19
- "--port", "7860",
20
- ]
21
  main()
 
1
+ """
2
+ Hugging Face Space entry point for ACE-Step 1.5.
3
+ Runs the Gradio app bound to 0.0.0.0:7860 with init_service enabled so models load on startup.
4
+ """
5
  import os
6
  import sys
7
 
8
+ # Ensure this repo root is on path (Space repo contains app.py + acestep/ at same level)
9
+ _REPO_ROOT = os.path.dirname(os.path.abspath(__file__))
10
+ if _REPO_ROOT not in sys.path:
11
+ sys.path.insert(0, _REPO_ROOT)
12
+
13
+ # Override argv for Space: bind all interfaces, port 7860, init service
14
+ os.chdir(_REPO_ROOT)
15
+ sys.argv = [
16
+ sys.argv[0],
17
+ "--server_name", "0.0.0.0",
18
+ "--port", "7860",
19
+ "--init_service", "true",
20
+ "--download-source", "huggingface",
21
+ ]
22
 
23
  from acestep.acestep_v15_pipeline import main
24
 
25
  if __name__ == "__main__":
 
26
  main()
check_update.bat ADDED
@@ -0,0 +1,609 @@
1
+ @echo off
2
+ REM Git Update Check Utility
3
+ REM This script checks for updates from GitHub and optionally updates the repository
4
+
5
+ setlocal enabledelayedexpansion
6
+
7
+ REM Configuration
8
+ set TIMEOUT_SECONDS=10
9
+ set GIT_PORTABLE_PATH=%~dp0PortableGit\bin\git.exe
10
+ set GIT_PATH=
11
+ set REPO_PATH=%~dp0
12
+ set PROXY_CONFIG_FILE=%~dp0proxy_config.txt
13
+
14
+ echo ========================================
15
+ echo ACE-Step Update Check
16
+ echo ========================================
17
+ echo.
18
+
19
+ REM Check for Git: first try PortableGit, then system Git
20
+ if exist "%GIT_PORTABLE_PATH%" (
21
+ set "GIT_PATH=%GIT_PORTABLE_PATH%"
22
+ echo [Git] Using PortableGit
23
+ ) else (
24
+ REM Try to find git in system PATH
25
+ where git >nul 2>&1
26
+ if !ERRORLEVEL! EQU 0 (
27
+ for /f "tokens=*" %%i in ('where git 2^>nul') do (
28
+ if not defined GIT_PATH set "GIT_PATH=%%i"
29
+ )
30
+ echo [Git] Using system Git: !GIT_PATH!
31
+ ) else (
32
+ echo [Error] Git not found.
33
+ echo - PortableGit not found at: %GIT_PORTABLE_PATH%
34
+ echo - System Git not found in PATH
35
+ echo.
36
+ echo Please either:
37
+ echo 1. Install PortableGit in the PortableGit folder, or
38
+ echo 2. Install Git and add it to your system PATH
39
+ echo.
40
+ echo ========================================
41
+ echo Press any key to close...
42
+ echo ========================================
43
+ pause >nul
44
+ exit /b 1
45
+ )
46
+ )
47
+ echo.
48
+
49
+ REM Check if this is a git repository
50
+ cd /d "%REPO_PATH%"
51
+ "!GIT_PATH!" rev-parse --git-dir >nul 2>&1
52
+ if %ERRORLEVEL% NEQ 0 (
53
+ echo [Error] Not a git repository.
54
+ echo This folder does not appear to be a git repository.
55
+ echo.
56
+ echo ========================================
57
+ echo Press any key to close...
58
+ echo ========================================
59
+ pause >nul
60
+ exit /b 1
61
+ )
62
+
63
+ REM Load proxy configuration if exists
64
+ set PROXY_ENABLED=0
65
+ set PROXY_URL=
66
+ if exist "%PROXY_CONFIG_FILE%" (
67
+ for /f "usebackq tokens=1,* delims==" %%a in ("%PROXY_CONFIG_FILE%") do (
68
+ if /i "%%a"=="PROXY_ENABLED" set PROXY_ENABLED=%%b
69
+ if /i "%%a"=="PROXY_URL" set PROXY_URL=%%b
70
+ )
71
+
72
+ if "!PROXY_ENABLED!"=="1" (
73
+ if not "!PROXY_URL!"=="" (
74
+ echo [Proxy] Using proxy server: !PROXY_URL!
75
+ "!GIT_PATH!" config --local http.proxy "!PROXY_URL!"
76
+ "!GIT_PATH!" config --local https.proxy "!PROXY_URL!"
77
+ echo.
78
+ )
79
+ )
80
+ )
81
+
82
+ echo [1/4] Checking current version...
83
+ REM Get current branch
84
+ for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --abbrev-ref HEAD 2^>nul') do set CURRENT_BRANCH=%%i
85
+ if "%CURRENT_BRANCH%"=="" set CURRENT_BRANCH=main
86
+
87
+ REM Get current commit
88
+ for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --short HEAD 2^>nul') do set CURRENT_COMMIT=%%i
89
+
90
+ echo Branch: %CURRENT_BRANCH%
91
+ echo Commit: %CURRENT_COMMIT%
92
+ echo.
93
+
94
+ echo [2/4] Checking for updates (timeout: %TIMEOUT_SECONDS%s)...
95
+ echo Connecting to GitHub...
96
+
97
+ :FetchRetry
98
+ REM Fetch remote with timeout (stderr visible so "Bad credentials" etc. are shown)
99
+ set FETCH_SUCCESS=0
100
+ "!GIT_PATH!" fetch origin --quiet
101
+ if %ERRORLEVEL% EQU 0 (
102
+ set FETCH_SUCCESS=1
103
+ )
104
+ if !FETCH_SUCCESS! EQU 1 goto :FetchDone
105
+
106
+ REM Try with timeout using a temp marker file
107
+ set TEMP_MARKER=%TEMP%\acestep_git_fetch_%RANDOM%.tmp
108
+
109
+ REM Start fetch in background
110
+ set "FETCH_CMD=!GIT_PATH! fetch origin --quiet"
111
+ start /b "" cmd /c "!FETCH_CMD! >nul 2>&1 && echo SUCCESS > "!TEMP_MARKER!""
112
+
113
+ REM Wait with timeout
114
+ set /a COUNTER=0
115
+ :WaitLoop
116
+ if exist "!TEMP_MARKER!" (
117
+ set FETCH_SUCCESS=1
118
+ del "!TEMP_MARKER!" >nul 2>&1
119
+ goto :FetchDone
120
+ )
121
+
122
+ timeout /t 1 /nobreak >nul
123
+ set /a COUNTER+=1
124
+ if !COUNTER! LSS %TIMEOUT_SECONDS% goto :WaitLoop
125
+
126
+ REM Timeout reached
127
+ echo [Timeout] Could not connect to GitHub within %TIMEOUT_SECONDS% seconds.
128
+
129
+ :FetchDone
130
+ if %FETCH_SUCCESS% EQU 0 (
131
+ echo [Failed] Could not fetch from GitHub.
132
+ echo If the error above is "Bad credentials", update or clear stored Git credentials.
133
+ echo This repo is public and does not require login: https://docs.github.com/en/get-started/getting-started-with-git/caching-your-github-credentials-in-git
134
+ echo Otherwise check your internet connection or proxy.
135
+ echo.
136
+
137
+     REM Ask if user wants to configure proxy
+     set /p PROXY_CHOICE="Do you want to configure a proxy server to retry? (Y/N): "
+     if /i "!PROXY_CHOICE!"=="Y" (
+         call :ConfigureProxy
+         if !ERRORLEVEL! EQU 0 (
+             echo.
+             echo [Proxy] Retrying with proxy configuration...
+             echo.
+             goto :FetchRetry
+         )
+     )
+
+     echo.
+     echo ========================================
+     echo Press any key to close...
+     echo ========================================
+     pause >nul
+     exit /b 2
+ )
+
+ echo [Success] Fetched latest information from GitHub.
+ echo.
+
+ echo [3/4] Comparing versions...
+ REM Get remote commit
+ for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --short origin/%CURRENT_BRANCH% 2^>nul') do set REMOTE_COMMIT=%%i
+
+ if "%REMOTE_COMMIT%"=="" (
+     echo [Warning] Remote branch 'origin/%CURRENT_BRANCH%' not found.
+     echo.
+     echo Your current branch '%CURRENT_BRANCH%' does not exist on the remote repository.
+     echo This might be a local development branch.
+     echo.
+
+     REM Try to get main branch instead
+     set FALLBACK_BRANCH=main
+     echo Checking main branch instead...
+     for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --short origin/!FALLBACK_BRANCH! 2^>nul') do set REMOTE_COMMIT=%%i
+
+     if "!REMOTE_COMMIT!"=="" (
+         echo [Error] Could not find remote main branch either.
+         echo Please ensure you are connected to the correct repository.
+         echo.
+         echo ========================================
+         echo Press any key to close...
+         echo ========================================
+         pause >nul
+         exit /b 1
+     )
+
+     echo Found main branch: !REMOTE_COMMIT!
+     echo.
+     echo Recommendation: Switch to main branch to check for official updates.
+     echo Command: git checkout main
+     echo.
+
+     set /p SWITCH_BRANCH="Do you want to switch to main branch now? (Y/N): "
+     if /i "!SWITCH_BRANCH!"=="Y" (
+         echo.
+         echo Switching to main branch...
+         "!GIT_PATH!" checkout main
+
+         if !ERRORLEVEL! EQU 0 (
+             echo [Success] Switched to main branch.
+             echo.
+             echo Please run this script again to check for updates.
+             echo.
+             echo ========================================
+             echo Press any key to close...
+             echo ========================================
+             pause >nul
+             exit /b 0
+         ) else (
+             echo [Error] Failed to switch branch.
+             echo.
+             echo ========================================
+             echo Press any key to close...
+             echo ========================================
+             pause >nul
+             exit /b 1
+         )
+     ) else (
+         echo.
+         echo Staying on branch '%CURRENT_BRANCH%'. No update performed.
+         echo.
+         echo ========================================
+         echo Press any key to close...
+         echo ========================================
+         pause >nul
+         exit /b 0
+     )
+ )
+
+ echo Local:  %CURRENT_COMMIT%
+ echo Remote: %REMOTE_COMMIT%
+ echo.
+
+ REM Compare commits
+ if "%CURRENT_COMMIT%"=="%REMOTE_COMMIT%" (
+     echo [4/4] Result: Already up to date!
+     echo You have the latest version.
+     echo.
+     echo ========================================
+     echo Press any key to close...
+     echo ========================================
+     pause >nul
+     exit /b 0
+ ) else (
+     echo [4/4] Result: Update available!
+
+     REM Check if local is behind remote
+     "!GIT_PATH!" merge-base --is-ancestor HEAD origin/%CURRENT_BRANCH% 2>nul
+     if !ERRORLEVEL! EQU 0 (
+         echo A new version is available on GitHub.
+         echo.
+
+         REM Show commits behind (do not suppress stderr so ref/encoding errors are visible)
+         echo New commits:
+         "!GIT_PATH!" --no-pager log --oneline --graph --decorate "HEAD..origin/!CURRENT_BRANCH!"
+         if !ERRORLEVEL! NEQ 0 (
+             echo [Could not show commit log. Check branch name and network.]
+         )
+         echo.
+
+         REM Ask if user wants to update
+         set /p UPDATE_CHOICE="Do you want to update now? (Y/N): "
+         if /i "!UPDATE_CHOICE!"=="Y" (
+             echo.
+             echo Updating...
+
+             REM First, refresh the index to avoid false positives from line ending changes
+             "!GIT_PATH!" update-index --refresh >nul 2>&1
+
+             REM Check for uncommitted changes
+             "!GIT_PATH!" diff-index --quiet HEAD -- 2>nul
+             if !ERRORLEVEL! NEQ 0 (
+                 echo.
+                 echo [Info] Checking for potential conflicts...
+
+                 REM Get list of locally modified files
+                 set TEMP_LOCAL_CHANGES=%TEMP%\acestep_local_changes_%RANDOM%.txt
+                 "!GIT_PATH!" diff --name-only HEAD 2>nul > "!TEMP_LOCAL_CHANGES!"
+
+                 REM Get list of files changed in remote
+                 set TEMP_REMOTE_CHANGES=%TEMP%\acestep_remote_changes_%RANDOM%.txt
+                 "!GIT_PATH!" diff --name-only HEAD..origin/%CURRENT_BRANCH% 2>nul > "!TEMP_REMOTE_CHANGES!"
+
+                 REM Check for conflicts
+                 set HAS_CONFLICTS=0
+                 REM Use wmic to get locale-independent date/time format (YYYYMMDDHHMMSS)
+                 for /f "tokens=2 delims==" %%a in ('wmic os get localdatetime /value 2^>nul') do set "DATETIME=%%a"
+                 set "BACKUP_DIR=%~dp0.update_backup_!DATETIME:~0,8!_!DATETIME:~8,6!"
+
+                 REM Find conflicting files
+                 for /f "usebackq delims=" %%a in ("!TEMP_LOCAL_CHANGES!") do (
+                     findstr /x /c:"%%a" "!TEMP_REMOTE_CHANGES!" >nul 2>&1
+                     if !ERRORLEVEL! EQU 0 (
+                         REM Found a conflict
+                         set HAS_CONFLICTS=1
+
+                         REM Create backup directory if not exists
+                         if not exist "!BACKUP_DIR!" (
+                             mkdir "!BACKUP_DIR!"
+                             echo.
+                             echo [Backup] Creating backup directory: !BACKUP_DIR!
+                         )
+
+                         REM Backup the file
+                         echo [Backup] Backing up: %%a
+                         set FILE_PATH=%%a
+                         set FILE_DIR=
+                         for %%i in ("!FILE_PATH!") do set FILE_DIR=%%~dpi
+
+                         REM Create subdirectories in backup if needed
+                         if not "!FILE_DIR!"=="" (
+                             if not "!FILE_DIR!"=="." (
+                                 if not exist "!BACKUP_DIR!\!FILE_DIR!" (
+                                     mkdir "!BACKUP_DIR!\!FILE_DIR!" 2>nul
+                                 )
+                             )
+                         )
+
+                         REM Copy file to backup
+                         copy "%%a" "!BACKUP_DIR!\%%a" >nul 2>&1
+                     )
+                 )
+
+                 REM Clean up temp files
+                 del "!TEMP_LOCAL_CHANGES!" >nul 2>&1
+                 del "!TEMP_REMOTE_CHANGES!" >nul 2>&1
+
+                 if !HAS_CONFLICTS! EQU 1 (
+                     echo.
+                     echo ========================================
+                     echo [Warning] Potential conflicts detected!
+                     echo ========================================
+                     echo.
+                     echo Your modified files may conflict with remote updates.
+                     echo Your changes have been backed up to:
+                     echo !BACKUP_DIR!
+                     echo.
+                     echo Update will restore these files to the remote version.
+                     echo You can manually merge your changes later.
+                     echo.
+                     set /p CONFLICT_CHOICE="Continue with update? (Y/N): "
+
+                     if /i "!CONFLICT_CHOICE!"=="Y" (
+                         echo.
+                         echo [Restore] Proceeding with update...
+                         echo [Restore] Files will be updated to remote version.
+                     ) else (
+                         echo.
+                         echo Update cancelled.
+                         echo Your backup remains at: !BACKUP_DIR!
+                         echo.
+                         echo ========================================
+                         echo Press any key to close...
+                         echo ========================================
+                         pause >nul
+                         exit /b 0
+                     )
+                 ) else (
+                     echo.
+                     echo [Info] No conflicts detected. Safe to stash and update.
+                     echo.
+                     set /p STASH_CHOICE="Stash your changes and continue? (Y/N): "
+                     if /i "!STASH_CHOICE!"=="Y" (
+                         echo Stashing changes...
+                         "!GIT_PATH!" stash push -m "Auto-stash before update - %date% %time%"
+                     ) else (
+                         echo.
+                         echo Update cancelled.
+                         echo.
+                         echo ========================================
+                         echo Press any key to close...
+                         echo ========================================
+                         pause >nul
+                         exit /b 0
+                     )
+                 )
+             )
+
+             REM Check for untracked files that could be overwritten
+             set STASHED_UNTRACKED=0
+             set TEMP_UNTRACKED=%TEMP%\acestep_untracked_%RANDOM%.txt
+             "!GIT_PATH!" ls-files --others --exclude-standard 2>nul > "!TEMP_UNTRACKED!"
+
+             REM Check if there are any untracked files
+             set HAS_UNTRACKED=0
+             for /f "usebackq delims=" %%u in ("!TEMP_UNTRACKED!") do set HAS_UNTRACKED=1
+
+             if !HAS_UNTRACKED! EQU 1 (
+                 REM Get files added in remote
+                 set TEMP_REMOTE_ADDED=%TEMP%\acestep_remote_added_%RANDOM%.txt
+                 "!GIT_PATH!" diff --name-only --diff-filter=A HEAD..origin/%CURRENT_BRANCH% 2>nul > "!TEMP_REMOTE_ADDED!"
+
+                 set HAS_UNTRACKED_CONFLICTS=0
+                 for /f "usebackq delims=" %%u in ("!TEMP_UNTRACKED!") do (
+                     findstr /x /c:"%%u" "!TEMP_REMOTE_ADDED!" >nul 2>&1
+                     if !ERRORLEVEL! EQU 0 (
+                         if !HAS_UNTRACKED_CONFLICTS! EQU 0 (
+                             echo.
+                             echo ========================================
+                             echo [Warning] Untracked files conflict with update!
+                             echo ========================================
+                             echo.
+                             echo The following untracked files would be overwritten:
+                         )
+                         set HAS_UNTRACKED_CONFLICTS=1
+                         echo %%u
+                     )
+                 )
+
+                 del "!TEMP_REMOTE_ADDED!" >nul 2>&1
+
+                 if !HAS_UNTRACKED_CONFLICTS! EQU 1 (
+                     echo.
+                     set /p STASH_UNTRACKED_CHOICE="Stash untracked files before updating? (Y/N): "
+                     if /i "!STASH_UNTRACKED_CHOICE!"=="Y" (
+                         echo Stashing all changes including untracked files...
+                         "!GIT_PATH!" stash push --include-untracked -m "pre-update-%RANDOM%" >nul 2>&1
+                         if !ERRORLEVEL! EQU 0 (
+                             set STASHED_UNTRACKED=1
+                             echo [Stash] Changes stashed successfully.
+                         ) else (
+                             echo [Error] Failed to stash changes. Update aborted.
+                             del "!TEMP_UNTRACKED!" >nul 2>&1
+                             echo.
+                             echo ========================================
+                             echo Press any key to close...
+                             echo ========================================
+                             pause >nul
+                             exit /b 1
+                         )
+                     ) else (
+                         echo.
+                         echo Update cancelled. Please move or remove the conflicting files manually.
+                         del "!TEMP_UNTRACKED!" >nul 2>&1
+                         echo.
+                         echo ========================================
+                         echo Press any key to close...
+                         echo ========================================
+                         pause >nul
+                         exit /b 1
+                     )
+                     echo.
+                 )
+             )
+
+             del "!TEMP_UNTRACKED!" >nul 2>&1
+
+             REM Pull changes
+             echo Pulling latest changes...
+             REM Force update by resetting to remote branch (discards any remaining local changes)
+             "!GIT_PATH!" reset --hard origin/%CURRENT_BRANCH% >nul 2>&1
+
+             if !ERRORLEVEL! EQU 0 (
+                 echo.
+                 echo ========================================
+                 echo Update completed successfully!
+                 echo ========================================
+                 echo.
+
+                 REM Check if backup was created
+                 if defined BACKUP_DIR (
+                     if exist "!BACKUP_DIR!" (
+                         echo [Important] Your modified files were backed up to:
+                         echo !BACKUP_DIR!
+                         echo.
+                         echo To restore your changes:
+                         echo 1. Run merge_config.bat to compare and merge files
+                         echo 2. Or manually compare backup with new version
+                         echo.
+                         echo Backed up files:
+                         set "BACKUP_DIR_DISPLAY=!BACKUP_DIR!"
+                         for /f "delims=" %%f in ('dir /b /s "!BACKUP_DIR!\*.*" 2^>nul') do (
+                             set "FILEPATH=%%f"
+                             REM Use call to safely handle the string replacement
+                             call set "FILEPATH=%%FILEPATH:!BACKUP_DIR_DISPLAY!\=%%"
+                             echo - !FILEPATH!
+                         )
+                         echo.
+                     )
+                 )
+
+                 if !STASHED_UNTRACKED! EQU 1 (
+                     echo [Stash] Untracked files were stashed before the update.
+                     echo To restore them: git stash pop
+                     echo To discard them: git stash drop
+                     echo.
+                     echo Note: 'git stash pop' may produce merge conflicts if
+                     echo the update modified the same files. Resolve manually.
+                     echo.
+                 )
+
+                 echo Please restart the application to use the new version.
+                 echo.
+                 echo ========================================
+                 echo Press any key to close...
+                 echo ========================================
+                 pause >nul
+                 exit /b 0
+             ) else (
+                 echo.
+                 echo [Error] Update failed.
+                 echo Please check the error messages above.
+
+                 if !STASHED_UNTRACKED! EQU 1 (
+                     echo.
+                     echo [Stash] Restoring stashed changes...
+                     "!GIT_PATH!" stash pop >nul 2>&1
+                     if !ERRORLEVEL! EQU 0 (
+                         echo [Stash] Changes restored successfully.
+                     ) else (
+                         echo [Stash] Could not auto-restore. Run 'git stash pop' manually.
+                     )
+                 )
+
+                 REM If backup exists, mention it
+                 if defined BACKUP_DIR (
+                     if exist "!BACKUP_DIR!" (
+                         echo.
+                         echo Your backup is still available at: !BACKUP_DIR!
+                     )
+                 )
+
+                 echo.
+                 echo ========================================
+                 echo Press any key to close...
+                 echo ========================================
+                 pause >nul
+                 exit /b 1
+             )
+         ) else (
+             echo.
+             echo Update skipped.
+             echo.
+             echo ========================================
+             echo Press any key to close...
+             echo ========================================
+             pause >nul
+             exit /b 0
+         )
+     ) else (
+         echo [Warning] Local version has diverged from remote.
+         echo This might be because you have local commits.
+         echo Please update manually or consult the documentation.
+         echo.
+         echo ========================================
+         echo Press any key to close...
+         echo ========================================
+         pause >nul
+         exit /b 0
+     )
+ )
+
+ REM ========================================
+ REM Function: ConfigureProxy
+ REM Configure proxy server for git
+ REM ========================================
+ :ConfigureProxy
+ echo.
+ echo ========================================
+ echo Proxy Server Configuration
+ echo ========================================
+ echo.
+ echo Please enter your proxy server URL.
+ echo.
+ echo Examples:
+ echo - HTTP proxy: http://127.0.0.1:7890
+ echo - HTTPS proxy: https://proxy.example.com:8080
+ echo - SOCKS5: socks5://127.0.0.1:1080
+ echo.
+ echo Leave empty to disable proxy.
+ echo.
+ set /p NEW_PROXY_URL="Proxy URL: "
+
+ if "!NEW_PROXY_URL!"=="" (
+     echo.
+     echo [Proxy] Disabling proxy...
+
+     REM Remove proxy configuration
+     "!GIT_PATH!" config --local --unset http.proxy 2>nul
+     "!GIT_PATH!" config --local --unset https.proxy 2>nul
+
+     REM Update config file
+     (
+         echo PROXY_ENABLED=0
+         echo PROXY_URL=
+     ) > "%PROXY_CONFIG_FILE%"
+
+     echo [Proxy] Proxy disabled.
+     exit /b 0
+ ) else (
+     echo.
+     echo [Proxy] Configuring proxy: !NEW_PROXY_URL!
+
+     REM Apply proxy to git
+     "!GIT_PATH!" config --local http.proxy "!NEW_PROXY_URL!"
+     "!GIT_PATH!" config --local https.proxy "!NEW_PROXY_URL!"
+
+     REM Save to config file
+     (
+         echo PROXY_ENABLED=1
+         echo PROXY_URL=!NEW_PROXY_URL!
+     ) > "%PROXY_CONFIG_FILE%"
+
+     echo [Proxy] Proxy configured successfully.
+     echo [Proxy] Configuration saved to: %PROXY_CONFIG_FILE%
+     exit /b 0
+ )
+
+ endlocal
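Aside (not part of the commit): the batch script names its backup folder with a locale-independent timestamp by slicing the `YYYYMMDDHHMMSS...` value that `wmic os get localdatetime` returns. A minimal Python sketch of that slicing, with a hypothetical helper name and sample value:

```python
# Mirrors the batch substring ops `!DATETIME:~0,8!` (date) and
# `!DATETIME:~8,6!` (time) used to build the backup directory suffix.
def backup_suffix(localdatetime: str) -> str:
    # wmic returns e.g. "20240131093045.123456+480" regardless of locale
    return f"{localdatetime[0:8]}_{localdatetime[8:14]}"

print(backup_suffix("20240131093045.123456+480"))  # -> 20240131_093045
```

Using a fixed-width, locale-independent source string is what makes blind index slicing safe here; parsing the localized output of `date /t` would break on non-English Windows installs.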
check_update.sh ADDED
@@ -0,0 +1,330 @@
+ #!/usr/bin/env bash
+ # Git Update Check Utility - Linux/macOS
+ # This script checks for updates from GitHub and optionally updates the repository
+
+ set -euo pipefail
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+ # Configuration
+ TIMEOUT_SECONDS=10
+ GIT_PATH=""
+ REPO_PATH="$SCRIPT_DIR"
+
+ echo "========================================"
+ echo "ACE-Step Update Check"
+ echo "========================================"
+ echo
+
+ # Find git
+ if command -v git &>/dev/null; then
+     GIT_PATH="$(command -v git)"
+     echo "[Git] Using system Git: $GIT_PATH"
+ else
+     echo "[Error] Git not found."
+     echo
+     if [[ "$(uname)" == "Darwin" ]]; then
+         echo "Please install Git:"
+         echo " xcode-select --install"
+         echo " or: brew install git"
+     else
+         echo "Please install Git:"
+         echo " Ubuntu/Debian: sudo apt install git"
+         echo " CentOS/RHEL: sudo yum install git"
+         echo " Arch: sudo pacman -S git"
+     fi
+     echo
+     exit 1
+ fi
+ echo
+
+ # Check if this is a git repository
+ cd "$REPO_PATH"
+ if ! "$GIT_PATH" rev-parse --git-dir &>/dev/null; then
+     echo "[Error] Not a git repository."
+     echo "This folder does not appear to be a git repository."
+     echo
+     exit 1
+ fi
+
+ echo "[1/4] Checking current version..."
+ CURRENT_BRANCH="$("$GIT_PATH" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "main")"
+ CURRENT_COMMIT="$("$GIT_PATH" rev-parse --short HEAD 2>/dev/null || echo "unknown")"
+
+ echo " Branch: $CURRENT_BRANCH"
+ echo " Commit: $CURRENT_COMMIT"
+ echo
+
+ echo "[2/4] Checking for updates (timeout: ${TIMEOUT_SECONDS}s)..."
+ echo " Connecting to GitHub..."
+
+ # Fetch remote with timeout (stderr visible so "Bad credentials" etc. are shown)
+ # Use GNU timeout (Linux) or gtimeout (macOS with coreutils) when available;
+ # otherwise fall back to a plain fetch without a timeout.
+ TIMEOUT_CMD=""
+ if command -v timeout &>/dev/null; then
+     TIMEOUT_CMD="timeout"
+ elif command -v gtimeout &>/dev/null; then
+     TIMEOUT_CMD="gtimeout"
+ fi
+
+ FETCH_SUCCESS=0
+ if [[ -n "$TIMEOUT_CMD" ]]; then
+     if "$TIMEOUT_CMD" "$TIMEOUT_SECONDS" "$GIT_PATH" fetch origin --quiet; then
+         FETCH_SUCCESS=1
+     fi
+ else
+     if "$GIT_PATH" fetch origin --quiet; then
+         FETCH_SUCCESS=1
+     fi
+ fi
+
+ if [[ $FETCH_SUCCESS -eq 0 ]]; then
+     echo " [Failed] Could not fetch from GitHub."
+     echo " If the error above is 'Bad credentials', update or clear stored Git credentials."
+     echo " This repo is public and does not require login: https://docs.github.com/en/get-started/getting-started-with-git/caching-your-github-credentials-in-git"
+     echo " Otherwise check your internet connection or proxy."
+     echo
+     exit 2
+ fi
+
+ echo " [Success] Fetched latest information from GitHub."
+ echo
+
+ echo "[3/4] Comparing versions..."
+ REMOTE_COMMIT="$("$GIT_PATH" rev-parse --short "origin/$CURRENT_BRANCH" 2>/dev/null || echo "")"
+
+ if [[ -z "$REMOTE_COMMIT" ]]; then
+     echo " [Warning] Remote branch 'origin/$CURRENT_BRANCH' not found."
+     echo
+     echo " Checking main branch instead..."
+     FALLBACK_BRANCH="main"
+     REMOTE_COMMIT="$("$GIT_PATH" rev-parse --short "origin/$FALLBACK_BRANCH" 2>/dev/null || echo "")"
+
+     if [[ -z "$REMOTE_COMMIT" ]]; then
+         echo " [Error] Could not find remote main branch either."
+         exit 1
+     fi
+
+     echo " Found main branch: $REMOTE_COMMIT"
+     echo
+
+     read -rp " Switch to main branch? (Y/N): " SWITCH_BRANCH
+     # Pattern match keeps this compatible with macOS's bash 3.2 (no ${var^^})
+     if [[ "$SWITCH_BRANCH" == [Yy] ]]; then
+         echo
+         echo " Switching to main branch..."
+         if "$GIT_PATH" checkout main; then
+             echo " [Success] Switched to main branch."
+             echo " Please run this script again to check for updates."
+             exit 0
+         else
+             echo " [Error] Failed to switch branch."
+             exit 1
+         fi
+     else
+         echo
+         echo " Staying on branch '$CURRENT_BRANCH'. No update performed."
+         exit 0
+     fi
+ fi
+
+ echo " Local:  $CURRENT_COMMIT"
+ echo " Remote: $REMOTE_COMMIT"
+ echo
+
+ # Compare commits
+ if [[ "$CURRENT_COMMIT" == "$REMOTE_COMMIT" ]]; then
+     echo "[4/4] Result: Already up to date!"
+     echo " You have the latest version."
+     echo
+     exit 0
+ fi
+
+ echo "[4/4] Result: Update available!"
+
+ # Check if local is behind remote
+ if "$GIT_PATH" merge-base --is-ancestor HEAD "origin/$CURRENT_BRANCH" 2>/dev/null; then
+     echo " A new version is available on GitHub."
+     echo
+
+     # Show new commits (do not suppress stderr so ref/encoding errors are visible)
+     echo " New commits:"
+     if ! "$GIT_PATH" --no-pager log --oneline --graph --decorate "HEAD..origin/$CURRENT_BRANCH"; then
+         echo " [Could not show commit log. Check branch name and network.]"
+     fi
+     echo
+
+     read -rp "Do you want to update now? (Y/N): " UPDATE_CHOICE
+     if [[ "$UPDATE_CHOICE" != [Yy] ]]; then
+         echo
+         echo "Update skipped."
+         exit 0
+     fi
+
+     echo
+     echo "Updating..."
+
+     # Refresh index
+     "$GIT_PATH" update-index --refresh &>/dev/null || true
+
+     # Check for uncommitted changes
+     if ! "$GIT_PATH" diff-index --quiet HEAD -- 2>/dev/null; then
+         echo
+         echo "[Info] Checking for potential conflicts..."
+
+         # Get locally modified files
+         LOCAL_CHANGES="$("$GIT_PATH" diff --name-only HEAD 2>/dev/null || echo "")"
+         REMOTE_CHANGES="$("$GIT_PATH" diff --name-only "HEAD..origin/$CURRENT_BRANCH" 2>/dev/null || echo "")"
+
+         # Check for conflicting files
+         HAS_CONFLICTS=0
+         BACKUP_DIR="$SCRIPT_DIR/.update_backup_$(date +%Y%m%d_%H%M%S)"
+
+         while IFS= read -r local_file; do
+             [[ -z "$local_file" ]] && continue
+             if echo "$REMOTE_CHANGES" | grep -qxF "$local_file"; then
+                 HAS_CONFLICTS=1
+
+                 # Create backup directory if not exists
+                 if [[ ! -d "$BACKUP_DIR" ]]; then
+                     mkdir -p "$BACKUP_DIR"
+                     echo
+                     echo "[Backup] Creating backup directory: $BACKUP_DIR"
+                 fi
+
+                 # Backup the file
+                 echo "[Backup] Backing up: $local_file"
+                 FILE_DIR="$(dirname "$local_file")"
+                 if [[ "$FILE_DIR" != "." ]]; then
+                     mkdir -p "$BACKUP_DIR/$FILE_DIR"
+                 fi
+                 cp "$local_file" "$BACKUP_DIR/$local_file" 2>/dev/null || true
+             fi
+         done <<< "$LOCAL_CHANGES"
+
+         if [[ $HAS_CONFLICTS -eq 1 ]]; then
+             echo
+             echo "========================================"
+             echo "[Warning] Potential conflicts detected!"
+             echo "========================================"
+             echo
+             echo "Your modified files may conflict with remote updates."
+             echo "Your changes have been backed up to:"
+             echo " $BACKUP_DIR"
+             echo
+
+             read -rp "Continue with update? (Y/N): " CONFLICT_CHOICE
+             if [[ "$CONFLICT_CHOICE" != [Yy] ]]; then
+                 echo
+                 echo "Update cancelled. Your backup remains at: $BACKUP_DIR"
+                 exit 0
+             fi
+             echo
+             echo "[Restore] Proceeding with update..."
+         else
+             echo
+             echo "[Info] No conflicts detected. Safe to stash and update."
+             echo
+
+             read -rp "Stash your changes and continue? (Y/N): " STASH_CHOICE
+             if [[ "$STASH_CHOICE" == [Yy] ]]; then
+                 echo "Stashing changes..."
+                 "$GIT_PATH" stash push -m "Auto-stash before update - $(date)"
+             else
+                 echo
+                 echo "Update cancelled."
+                 exit 0
+             fi
+         fi
+     fi
+
+     # Check for untracked files that could be overwritten
+     UNTRACKED_FILES="$("$GIT_PATH" ls-files --others --exclude-standard 2>/dev/null || echo "")"
+     STASHED_UNTRACKED=0
+
+     if [[ -n "$UNTRACKED_FILES" ]]; then
+         # Check if any untracked files conflict with incoming changes
+         REMOTE_ALL_FILES="$("$GIT_PATH" diff --name-only --diff-filter=A "HEAD..origin/$CURRENT_BRANCH" 2>/dev/null || echo "")"
+         CONFLICTING_UNTRACKED=""
+
+         while IFS= read -r ufile; do
+             [[ -z "$ufile" ]] && continue
+             if echo "$REMOTE_ALL_FILES" | grep -qxF "$ufile"; then
+                 CONFLICTING_UNTRACKED="${CONFLICTING_UNTRACKED}${ufile}"$'\n'
+             fi
+         done <<< "$UNTRACKED_FILES"
+
+         if [[ -n "$CONFLICTING_UNTRACKED" ]]; then
+             echo
+             echo "========================================"
+             echo "[Warning] Untracked files conflict with update!"
+             echo "========================================"
+             echo
+             echo "The following untracked files would be overwritten:"
+             echo "$CONFLICTING_UNTRACKED" | sed '/^$/d; s/^/ /'
+             echo
+
+             read -rp "Stash untracked files before updating? (Y/N): " STASH_UNTRACKED_CHOICE
+             if [[ "$STASH_UNTRACKED_CHOICE" != [Yy] ]]; then
+                 echo
+                 echo "Update cancelled. Please move or remove the conflicting files manually."
+                 exit 1
+             fi
+
+             echo "Stashing all changes including untracked files..."
+             if "$GIT_PATH" stash push --include-untracked -m "pre-update-$(date +%s)"; then
+                 STASHED_UNTRACKED=1
+                 echo "[Stash] Changes stashed successfully."
+             else
+                 echo "[Error] Failed to stash changes. Update aborted."
+                 exit 1
+             fi
+             echo
+         fi
+     fi
+
+     # Pull changes
+     echo "Pulling latest changes..."
+     if "$GIT_PATH" reset --hard "origin/$CURRENT_BRANCH" &>/dev/null; then
+         echo
+         echo "========================================"
+         echo "Update completed successfully!"
+         echo "========================================"
+         echo
+
+         if [[ -d "${BACKUP_DIR:-}" ]]; then
+             echo "[Important] Your modified files were backed up to:"
+             echo " $BACKUP_DIR"
+             echo
+             echo "To restore your changes:"
+             echo " 1. Run ./merge_config.sh to compare and merge files"
+             echo " 2. Or manually compare backup with new version"
+             echo
+         fi
+
+         if [[ $STASHED_UNTRACKED -eq 1 ]]; then
+             echo "[Stash] Untracked files were stashed before the update."
+             echo " To restore them: git stash pop"
+             echo " To discard them: git stash drop"
+             echo
+             echo " Note: 'git stash pop' may produce merge conflicts if"
+             echo " the update modified the same files. Resolve manually."
+             echo
+         fi
+
+         echo "Please restart the application to use the new version."
+         exit 0
+     else
+         echo
+         echo "[Error] Update failed."
+         if [[ $STASHED_UNTRACKED -eq 1 ]]; then
+             echo "[Stash] Restoring stashed changes..."
+             if "$GIT_PATH" stash pop &>/dev/null; then
+                 echo "[Stash] Changes restored successfully."
+             else
+                 echo "[Stash] Could not auto-restore. Run 'git stash pop' manually."
+             fi
+         fi
+         if [[ -d "${BACKUP_DIR:-}" ]]; then
+             echo "Your backup is still available at: $BACKUP_DIR"
+         fi
+         exit 1
+     fi
+ else
+     echo " [Warning] Local version has diverged from remote."
+     echo " This might be because you have local commits."
+     echo " Please update manually or consult the documentation."
+     exit 0
+ fi
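Aside (not part of the commit): in both update scripts, the conflict check reduces to a set intersection between files modified locally (`git diff --name-only HEAD`) and files changed upstream (`git diff --name-only HEAD..origin/<branch>`). A minimal Python sketch of that idea, with illustrative names and sample file lists:

```python
# Only files in BOTH lists need a backup before `git reset --hard`:
# locally edited AND changed by the incoming update.
def find_conflicts(local_changes: list[str], remote_changes: list[str]) -> set[str]:
    return set(local_changes) & set(remote_changes)

conflicts = find_conflicts(
    ["config.yaml", "notes.txt"],      # e.g. git diff --name-only HEAD
    ["config.yaml", "cli.py"],         # e.g. git diff --name-only HEAD..origin/main
)
print(sorted(conflicts))  # -> ['config.yaml']
```

The shell versions implement the same intersection with `grep -qxF` (bash) and `findstr /x /c:` (batch), both of which do exact whole-line matching so that `cli.py` does not falsely match `tools/cli.py`.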
cli.py ADDED
@@ -0,0 +1,1998 @@
+ import argparse
+ import re
+ import ast
+ import os
+ import sys
+ import toml
+ from pathlib import Path
+ from typing import List, Optional, Tuple
+
+ # Load environment variables from .env or .env.example (if available)
+ try:
+     from dotenv import load_dotenv
+     _current_file = os.path.abspath(__file__)
+     _project_root = os.path.dirname(_current_file)
+     _env_path = os.path.join(_project_root, '.env')
+     _env_example_path = os.path.join(_project_root, '.env.example')
+
+     if os.path.exists(_env_path):
+         load_dotenv(_env_path)
+         print(f"Loaded configuration from {_env_path}")
+     elif os.path.exists(_env_example_path):
+         load_dotenv(_env_example_path)
+         print(f"Loaded configuration from {_env_example_path} (fallback)")
+ except ImportError:
+     pass
+
+ # Clear proxy settings that may affect network behavior
+ for _proxy_var in ['http_proxy', 'https_proxy', 'HTTP_PROXY', 'HTTPS_PROXY', 'ALL_PROXY']:
+     os.environ.pop(_proxy_var, None)
+
+ def _configure_logging(
+     level: Optional[str] = None,
+     suppress_audio_tokens: Optional[bool] = None,
+ ) -> None:
+     try:
+         from loguru import logger
+     except Exception:
+         return
+
+     if suppress_audio_tokens is None:
+         suppress_audio_tokens = os.environ.get("ACE_STEP_SUPPRESS_AUDIO_TOKENS", "1") not in {"0", "false", "False"}
+     if level is None:
+         level = "INFO"
+     level = str(level).upper()
+
+     def _log_filter(record) -> bool:
+         message = record.get("message", "")
+         # Suppress duplicate DiT prompt logs (we print a single final prompt in cli.py)
+         if (
+             "DiT TEXT ENCODER INPUT" in message
+             or "text_prompt:" in message
+             or (message.strip() and set(message.strip()) == {"="})
+         ):
+             return False
+         if not suppress_audio_tokens:
+             return True
+         return "<|audio_code_" not in message
+
+     logger.remove()
+     logger.add(sys.stderr, level=level, filter=_log_filter)
+
+
+ _configure_logging()
+
+ from acestep.handler import AceStepHandler
+ from acestep.llm_inference import LLMHandler
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music, create_sample, format_sample
+ from acestep.constants import DEFAULT_DIT_INSTRUCTION, TASK_INSTRUCTIONS
+ from acestep.gpu_config import get_gpu_config, set_global_gpu_config, is_mps_platform
+ import torch
+
+
+ TRACK_CHOICES = [
+     "vocals",
+     "backing_vocals",
+     "drums",
+     "bass",
+     "guitar",
+     "keyboard",
+     "percussion",
+     "strings",
+     "synth",
+     "fx",
+     "brass",
+     "woodwinds",
+ ]
+
+
+ def _get_project_root() -> str:
+     return os.path.dirname(os.path.abspath(__file__))
+
+
+ def _parse_description_hints(description: str) -> tuple[Optional[str], bool]:
+     import re
+
+     if not description:
+         return None, False
+
+     description_lower = description.lower().strip()
+
+     language_mapping = {
+         'english': 'en', 'en': 'en',
+         'chinese': 'zh', '中文': 'zh', 'zh': 'zh', 'mandarin': 'zh',
+         'japanese': 'ja', '日本語': 'ja', 'ja': 'ja',
+         'korean': 'ko', '한국어': 'ko', 'ko': 'ko',
+         'spanish': 'es', 'español': 'es', 'es': 'es',
+         'french': 'fr', 'français': 'fr', 'fr': 'fr',
+         'german': 'de', 'deutsch': 'de', 'de': 'de',
+         'italian': 'it', 'italiano': 'it', 'it': 'it',
+         'portuguese': 'pt', 'português': 'pt', 'pt': 'pt',
+         'russian': 'ru', 'русский': 'ru', 'ru': 'ru',
+         'bengali': 'bn', 'bn': 'bn',
+         'hindi': 'hi', 'hi': 'hi',
+         'arabic': 'ar', 'ar': 'ar',
+         'thai': 'th', 'th': 'th',
+         'vietnamese': 'vi', 'vi': 'vi',
+         'indonesian': 'id', 'id': 'id',
+         'turkish': 'tr', 'tr': 'tr',
+         'dutch': 'nl', 'nl': 'nl',
+         'polish': 'pl', 'pl': 'pl',
+     }
+
+     detected_language = None
+     for lang_name, lang_code in language_mapping.items():
+         if len(lang_name) <= 2:
+             pattern = r'(?:^|\s|[.,;:!?])' + re.escape(lang_name) + r'(?:$|\s|[.,;:!?])'
+         else:
+             pattern = r'\b' + re.escape(lang_name) + r'\b'
+         if re.search(pattern, description_lower):
+             detected_language = lang_code
+             break
+
+     is_instrumental = False
+     if 'instrumental' in description_lower:
+         is_instrumental = True
+     elif 'pure music' in description_lower or 'pure instrument' in description_lower:
+         is_instrumental = True
+     elif description_lower.endswith(' solo') or description_lower == 'solo':
+         is_instrumental = True
+
+     return detected_language, is_instrumental
+
+
+ def _prompt_non_empty(prompt: str) -> str:
+     value = input(prompt).strip()
+     while not value:
+         value = input(prompt).strip()
+     return value
+
+
+ def _prompt_with_default(prompt: str, default: Optional[str] = None, required: bool = False) -> str:
+     while True:
+         suffix = f" [{default}]" if default not in (None, "") else ""
154
+ value = input(f"{prompt}{suffix}: ").strip()
155
+ if value:
156
+ return value
157
+ if default not in (None, ""):
158
+ return str(default)
159
+ if not required:
160
+ return ""
161
+ print("This value is required. Please try again.")
162
+
163
+
164
+ def _prompt_bool(prompt: str, default: bool) -> bool:
165
+ default_str = "y" if default else "n"
166
+ while True:
167
+ value = input(f"{prompt} (y/n) [default: {default_str}]: ").strip().lower()
168
+ if not value:
169
+ return default
170
+ if value in {"y", "yes", "1", "true"}:
171
+ return True
172
+ if value in {"n", "no", "0", "false"}:
173
+ return False
174
+ print("Please enter 'y' or 'n'.")
175
+
176
+
177
+ def _prompt_choice_from_list(
178
+ prompt: str,
179
+ options: List[str],
180
+ default: Optional[str] = None,
181
+ allow_custom: bool = True,
182
+ custom_validator=None,
183
+ custom_error: Optional[str] = None,
184
+ ) -> Optional[str]:
185
+ if not options:
186
+ return default
187
+ print("\n" + prompt)
188
+ for idx, option in enumerate(options, start=1):
189
+ print(f"{idx}. {option}")
190
+ default_display = default if default not in (None, "") else "auto"
191
+ while True:
192
+ choice = input(f"Choose a model (number or name) [default: {default_display}]: ").strip()
193
+ if not choice:
194
+ return None if default_display == "auto" else default
195
+ if choice.lower() == "auto":
196
+ return None
197
+ if choice.isdigit():
198
+ idx = int(choice)
199
+ if 1 <= idx <= len(options):
200
+ return options[idx - 1]
201
+ print("Invalid selection. Please choose a valid number.")
202
+ continue
203
+ if allow_custom:
204
+ if custom_validator and not custom_validator(choice):
205
+ print(custom_error or "Invalid selection. Please try again.")
206
+ continue
207
+ if choice not in options:
208
+ print("Unknown model. Using as-is.")
209
+ return choice
210
+ print("Please choose a valid option.")
211
+
212
+
213
+ def _edit_formatted_prompt_via_file(formatted_prompt: str, instruction_path: str) -> str:
214
+ """Write formatted prompt to file, wait for user edits, then read back."""
215
+ try:
216
+ with open(instruction_path, "w", encoding="utf-8") as f:
217
+ f.write(formatted_prompt)
218
+ except Exception as e:
219
+ print(f"WARNING: Failed to write {instruction_path}: {e}")
220
+ return formatted_prompt
221
+
222
+ print("\n--- Final Draft Saved ---")
223
+ print(f"Saved to {instruction_path}")
224
+ print("Edit the file now. Press Enter when ready to continue.")
225
+ input()
226
+
227
+ try:
228
+ with open(instruction_path, "r", encoding="utf-8") as f:
229
+ return f.read()
230
+ except Exception as e:
231
+ print(f"WARNING: Failed to read {instruction_path}: {e}")
232
+ return formatted_prompt
233
+
234
+
235
+ def _extract_caption_lyrics_from_formatted_prompt(formatted_prompt: str) -> Tuple[Optional[str], Optional[str]]:
236
+ """Best-effort extraction of caption/lyrics from a formatted prompt string."""
237
+ matches = list(re.finditer(r"# Caption\n(.*?)\n+# Lyric\n(.*)", formatted_prompt, re.DOTALL))
238
+ if not matches:
239
+ return None, None
240
+
241
+ caption = matches[-1].group(1).strip()
242
+ lyrics = matches[-1].group(2)
243
+
244
+ # Trim lyrics if chat-template markers appear after the user message.
245
+ cut_markers = ["<|eot_id|>", "<|start_header_id|>", "<|assistant|>", "<|user|>", "<|system|>", "<|im_end|>", "<|im_start|>"]
246
+ cut_at = len(lyrics)
247
+ for marker in cut_markers:
248
+ pos = lyrics.find(marker)
249
+ if pos != -1:
250
+ cut_at = min(cut_at, pos)
251
+ lyrics = lyrics[:cut_at].rstrip()
252
+
253
+ return caption or None, lyrics or None
254
+
255
+
256
+ def _extract_instruction_from_formatted_prompt(formatted_prompt: str) -> Optional[str]:
257
+ """Best-effort extraction of instruction text from a formatted prompt string."""
258
+ match = re.search(r"# Instruction\n(.*?)\n\n", formatted_prompt, re.DOTALL)
259
+ if not match:
260
+ return None
261
+ instruction = match.group(1).strip()
262
+ return instruction or None
263
+
264
+
265
+ def _extract_cot_metadata_from_formatted_prompt(formatted_prompt: str) -> dict:
266
+ """Best-effort extraction of COT metadata from a formatted prompt string,
267
+ supporting multi-line values.
268
+ """
269
+ matches = list(re.finditer(r"<think>\n(.*?)\n</think>", formatted_prompt, re.DOTALL))
270
+ if not matches:
271
+ return {}
272
+ block = matches[-1].group(1)
273
+ metadata = {}
274
+ current_key = None
275
+ current_value_lines = []
276
+
277
+ for line in block.splitlines():
278
+ line = line.strip()
279
+ if not line:
280
+ continue
281
+
282
+ key_match = re.match(r"^(\w+):\s*(.*)", line)
283
+ if key_match:
284
+ if current_key:
285
+ metadata[current_key] = " ".join(current_value_lines).strip()
286
+
287
+ current_key = key_match.group(1).strip().lower()
288
+ current_value_lines = [key_match.group(2).strip()]
289
+ else:
290
+ if current_key:
291
+ current_value_lines.append(line)
292
+
293
+ if current_key and current_value_lines:
294
+ metadata[current_key] = " ".join(current_value_lines).strip()
295
+
296
+ return metadata
297
+
298
+
299
+ def _parse_number(value: str) -> Optional[float]:
300
+ try:
301
+ match = re.search(r"[-+]?\d*\.?\d+", value)
302
+ if not match:
303
+ return None
304
+ return float(match.group(0))
305
+ except Exception:
306
+ return None
307
+
308
+
309
+ def _parse_timesteps_input(value) -> Optional[List[float]]:
310
+ if value is None:
311
+ return None
312
+ if isinstance(value, list):
313
+ if all(isinstance(t, (int, float)) for t in value):
314
+ return [float(t) for t in value]
315
+ return None
316
+ if not isinstance(value, str):
317
+ return None
318
+ raw = value.strip()
319
+ if not raw:
320
+ return None
321
+ if raw.startswith("[") or raw.startswith("("):
322
+ try:
323
+ parsed = ast.literal_eval(raw)
324
+ except Exception:
325
+ return None
326
+ if isinstance(parsed, list) and all(isinstance(t, (int, float)) for t in parsed):
327
+ return [float(t) for t in parsed]
328
+ return None
329
+ try:
330
+ return [float(t.strip()) for t in raw.split(",") if t.strip()]
331
+ except Exception:
332
+ return None
333
+
334
+
335
+ def _install_prompt_edit_hook(
+     llm_handler: LLMHandler,
+     instruction_path: str,
+     preloaded_prompt: Optional[str] = None,
+ ) -> None:
+     """Intercept formatted prompt generation to allow user editing before audio tokens."""
+     original = llm_handler.build_formatted_prompt_with_cot
+     cache = {}
+
+     def wrapped(caption, lyrics, cot_text, is_negative_prompt=False, negative_prompt="NO USER INPUT"):
+         prompt = original(
+             caption,
+             lyrics,
+             cot_text,
+             is_negative_prompt=is_negative_prompt,
+             negative_prompt=negative_prompt,
+         )
+         if is_negative_prompt:
+             conditional_prompt = original(
+                 caption,
+                 lyrics,
+                 cot_text,
+                 is_negative_prompt=False,
+                 negative_prompt=negative_prompt,
+             )
+             cached = cache.get(conditional_prompt)
+             if cached and (cached.get("edited_caption") or cached.get("edited_lyrics")):
+                 edited_caption = cached.get("edited_caption") or caption
+                 edited_lyrics = cached.get("edited_lyrics") or lyrics
+                 return original(
+                     edited_caption,
+                     edited_lyrics,
+                     cot_text,
+                     is_negative_prompt=True,
+                     negative_prompt=negative_prompt,
+                 )
+             return prompt
+         cached = cache.get(prompt)
+         if cached:
+             return cached["edited_prompt"]
+         if getattr(llm_handler, "_skip_prompt_edit", False):
+             cache[prompt] = {
+                 "edited_prompt": prompt,
+                 "edited_caption": None,
+                 "edited_lyrics": None,
+             }
+             return prompt
+         if preloaded_prompt is not None:
+             edited = preloaded_prompt
+         else:
+             edited = _edit_formatted_prompt_via_file(prompt, instruction_path)
+         edited_caption, edited_lyrics = _extract_caption_lyrics_from_formatted_prompt(edited)
+         if edited != prompt:
+             print("INFO: Using edited draft for audio-token prompt.")
+         if edited_caption or edited_lyrics:
+             llm_handler._edited_caption = edited_caption
+             llm_handler._edited_lyrics = edited_lyrics
+         edited_instruction = _extract_instruction_from_formatted_prompt(edited)
+         if edited_instruction:
+             llm_handler._edited_instruction = edited_instruction
+         edited_metas = _extract_cot_metadata_from_formatted_prompt(edited)
+         if edited_metas:
+             llm_handler._edited_metas = edited_metas
+         cache[prompt] = {
+             "edited_prompt": edited,
+             "edited_caption": edited_caption,
+             "edited_lyrics": edited_lyrics,
+         }
+         return edited
+
+     llm_handler.build_formatted_prompt_with_cot = wrapped
+
+
+ def _prompt_int(prompt: str, default: Optional[int] = None, min_value: Optional[int] = None,
+                 max_value: Optional[int] = None) -> Optional[int]:
+     default_display = "auto" if default is None else default
+     while True:
+         value = input(f"{prompt} [{default_display}]: ").strip()
+         if not value:
+             return default
+         try:
+             parsed = int(value)
+         except ValueError:
+             print("Invalid input. Please enter an integer.")
+             continue
+         if min_value is not None and parsed < min_value:
+             print(f"Please enter a value >= {min_value}.")
+             continue
+         if max_value is not None and parsed > max_value:
+             print(f"Please enter a value <= {max_value}.")
+             continue
+         return parsed
+
+
+ def _prompt_float(prompt: str, default: Optional[float] = None, min_value: Optional[float] = None,
+                   max_value: Optional[float] = None) -> Optional[float]:
+     default_display = "auto" if default is None else default
+     while True:
+         value = input(f"{prompt} [{default_display}]: ").strip()
+         if not value:
+             return default
+         try:
+             parsed = float(value)
+         except ValueError:
+             print("Invalid input. Please enter a number.")
+             continue
+         if min_value is not None and parsed < min_value:
+             print(f"Please enter a value >= {min_value}.")
+             continue
+         if max_value is not None and parsed > max_value:
+             print(f"Please enter a value <= {max_value}.")
+             continue
+         return parsed
+
+
+ def _prompt_existing_file(prompt: str, default: Optional[str] = None) -> str:
+     while True:
+         suffix = f" [{default}]" if default else ""
+         path = input(f"{prompt}{suffix}: ").strip()
+         if not path and default:
+             path = default
+         if os.path.isfile(path):
+             return _expand_audio_path(path)
+         print("Invalid file path. Please try again.")
+
+
+ def _expand_audio_path(path_str: Optional[str]) -> Optional[str]:
+     if not path_str or not isinstance(path_str, str):
+         return path_str
+     try:
+         return Path(path_str).expanduser().resolve(strict=False).as_posix()
+     except Exception:
+         return Path(path_str).expanduser().absolute().as_posix()
+
+
+ def _parse_bool(value: str) -> bool:
+     return str(value).lower() in {"true", "1", "yes", "y"}
+
+
+ def _resolve_device(device: str) -> str:
+     if device == "auto":
+         if hasattr(torch, 'xpu') and torch.xpu.is_available():
+             return "xpu"
+         if torch.cuda.is_available():
+             return "cuda"
+         if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+             return "mps"
+         return "cpu"
+     return device
+
+
+ def _default_instruction_for_task(task_type: str, tracks: Optional[List[str]] = None) -> str:
+     if task_type == "lego":
+         track = tracks[0] if tracks else "guitar"
+         return TASK_INSTRUCTIONS["lego"].format(TRACK_NAME=track.upper())
+     if task_type == "extract":
+         track = tracks[0] if tracks else "vocals"
+         return TASK_INSTRUCTIONS["extract"].format(TRACK_NAME=track.upper())
+     if task_type == "complete":
+         tracks_list = ", ".join(tracks) if tracks else "drums, bass, guitar"
+         return TASK_INSTRUCTIONS["complete"].format(TRACK_CLASSES=tracks_list)
+     return DEFAULT_DIT_INSTRUCTION
+
+
+ def _apply_optional_defaults(args, params_defaults: GenerationParams, config_defaults: GenerationConfig) -> None:
+     optional_defaults = {
+         "duration": params_defaults.duration,
+         "bpm": params_defaults.bpm,
+         "keyscale": params_defaults.keyscale,
+         "timesignature": params_defaults.timesignature,
+         "vocal_language": params_defaults.vocal_language,
+         "inference_steps": params_defaults.inference_steps,
+         "seed": params_defaults.seed,
+         "guidance_scale": params_defaults.guidance_scale,
+         "use_adg": params_defaults.use_adg,
+         "cfg_interval_start": params_defaults.cfg_interval_start,
+         "cfg_interval_end": params_defaults.cfg_interval_end,
+         "shift": 3.0,
+         "infer_method": params_defaults.infer_method,
+         "timesteps": None,
+         "repainting_start": params_defaults.repainting_start,
+         "repainting_end": params_defaults.repainting_end,
+         "audio_cover_strength": params_defaults.audio_cover_strength,
+         "thinking": params_defaults.thinking,
+         "lm_temperature": params_defaults.lm_temperature,
+         "lm_cfg_scale": params_defaults.lm_cfg_scale,
+         "lm_top_k": params_defaults.lm_top_k,
+         "lm_top_p": params_defaults.lm_top_p,
+         "lm_negative_prompt": params_defaults.lm_negative_prompt,
+         "use_cot_metas": params_defaults.use_cot_metas,
+         "use_cot_caption": params_defaults.use_cot_caption,
+         "use_cot_lyrics": params_defaults.use_cot_lyrics,
+         "use_cot_language": params_defaults.use_cot_language,
+         "use_constrained_decoding": params_defaults.use_constrained_decoding,
+         "batch_size": config_defaults.batch_size,
+         "allow_lm_batch": config_defaults.allow_lm_batch,
+         "use_random_seed": config_defaults.use_random_seed,
+         "seeds": config_defaults.seeds,
+         "lm_batch_chunk_size": config_defaults.lm_batch_chunk_size,
+         "constrained_decoding_debug": config_defaults.constrained_decoding_debug,
+         "audio_format": config_defaults.audio_format,
+         "sample_mode": False,
+         "sample_query": "",
+         "use_format": False,
+     }
+
+     for key, default_value in optional_defaults.items():
+         if getattr(args, key, None) is None:
+             setattr(args, key, default_value)
+
+
+ def _summarize_lyrics(lyrics: Optional[str]) -> str:
+     if not lyrics:
+         return "none"
+     if isinstance(lyrics, str):
+         stripped = lyrics.strip()
+         if not stripped:
+             return "none"
+         if os.path.isfile(stripped):
+             return f"file: {os.path.basename(stripped)}"
+         if len(stripped) <= 60:
+             return stripped.replace("\n", " ")
+         return f"text ({len(stripped)} chars)"
+     return "provided"
+
+
+ def _print_final_parameters(
+     args,
+     params: GenerationParams,
+     config: GenerationConfig,
+     params_defaults: GenerationParams,
+     config_defaults: GenerationConfig,
+     compact: bool,
+     resolved_device: Optional[str] = None,
+ ) -> None:
+     if not compact:
+         print("\n--- Final Parameters (Args) ---")
+         for k in sorted(vars(args).keys()):
+             print(f"{k}: {getattr(args, k)}")
+         print("------------------------------")
+         print("\n--- Final Parameters (GenerationParams) ---")
+         for k in sorted(vars(params).keys()):
+             print(f"{k}: {getattr(params, k)}")
+         print("-------------------------------------------")
+         print("\n--- Final Parameters (GenerationConfig) ---")
+         for k in sorted(vars(config).keys()):
+             print(f"{k}: {getattr(config, k)}")
+         print("-------------------------------------------\n")
+         return
+
+     device_display = args.device
+     if resolved_device and resolved_device != args.device:
+         device_display = f"{args.device} -> {resolved_device}"
+
+     print("\n--- Final Parameters (Summary) ---")
+     print(f"task_type: {params.task_type}")
+     print(f"caption: {params.caption or 'none'}")
+     print(f"lyrics: {_summarize_lyrics(params.lyrics)}")
+     print(f"duration: {params.duration}s")
+     print(f"outputs: {config.batch_size}")
+     if params.bpm not in (None, params_defaults.bpm):
+         print(f"bpm: {params.bpm}")
+     if params.keyscale not in (None, params_defaults.keyscale):
+         print(f"keyscale: {params.keyscale}")
+     if params.timesignature not in (None, params_defaults.timesignature):
+         print(f"timesignature: {params.timesignature}")
+     print(f"instrumental: {params.instrumental}")
+     print(f"thinking: {params.thinking}")
+     print(f"lm_model: {args.lm_model_path or 'auto'}")
+     print(f"dit_model: {args.config_path or 'auto'}")
+     print(f"backend: {args.backend}")
+     print(f"device: {device_display}")
+     print(f"audio_format: {config.audio_format}")
+     print(f"save_dir: {args.save_dir}")
+     if config.seeds:
+         print(f"seeds: {config.seeds}")
+     else:
+         print(f"seed: {params.seed} (random={config.use_random_seed})")
+     print("-------------------------------\n")
+
+
+ def _build_meta_dict(params: GenerationParams) -> Optional[dict]:
617
+ meta = {}
618
+ if params.bpm is not None:
619
+ meta["bpm"] = params.bpm
620
+ if params.timesignature:
621
+ meta["timesignature"] = params.timesignature
622
+ if params.keyscale:
623
+ meta["keyscale"] = params.keyscale
624
+ if params.duration is not None:
625
+ meta["duration"] = params.duration
626
+ return meta or None
627
+
628
+
629
+ def _print_dit_prompt(dit_handler: "AceStepHandler", params: GenerationParams) -> None:
630
+ meta = _build_meta_dict(params)
631
+ caption_input, lyrics_input = dit_handler.build_dit_inputs(
632
+ task=params.task_type,
633
+ instruction=params.instruction,
634
+ caption=params.caption or "",
635
+ lyrics=params.lyrics or "",
636
+ metas=meta,
637
+ vocal_language=params.vocal_language or "unknown",
638
+ )
639
+ print("\n--- Final DiT Prompt (Caption Branch) ---")
640
+ print(caption_input)
641
+ print("\n--- Final DiT Prompt (Lyrics Branch) ---")
642
+ print(lyrics_input)
643
+ print("----------------------------------------\n")
644
+
645
+
646
+ def run_wizard(args, configure_only: bool = False, default_config_path: Optional[str] = None,
647
+ params_defaults: Optional[GenerationParams] = None,
648
+ config_defaults: Optional[GenerationConfig] = None):
649
+ """
650
+ Runs an interactive wizard to set generation parameters.
651
+ """
652
+ print("Welcome to the ACE-Step Music Generation Wizard!")
653
+ print("This will guide you through creating your music.")
654
+ print("Press Ctrl+C at any time to exit.")
655
+ print("Note: Required models will be auto-downloaded if missing.")
656
+ print("-" * 30)
657
+
658
+ try:
659
+ # Task selection
660
+ print("\n--- Task Type ---")
661
+ print("1. text2music - generate music from text/lyrics.")
662
+ print("2. cover - transform existing audio into a new style.")
663
+ print("3. repaint - regenerate a specific time segment of audio.")
664
+ print("4. lego - generate a specific instrument track in context.")
665
+ print("5. extract - isolate a specific instrument track from a mix.")
666
+ print("6. complete - complete/extend partial tracks with new instruments.")
667
+ task_map = {
668
+ "1": "text2music",
669
+ "2": "cover",
670
+ "3": "repaint",
671
+ "4": "lego",
672
+ "5": "extract",
673
+ "6": "complete",
674
+ }
675
+ current_task = args.task_type or "text2music"
676
+ task_default = next((k for k, v in task_map.items() if v == current_task), "1")
677
+ task_choice = input(f"Choose a task (1-6) [default: {task_default}]: ").strip()
678
+ if not task_choice:
679
+ task_choice = task_default
680
+ args.task_type = task_map.get(task_choice, "text2music")
681
+ if args.task_type in {"lego", "extract", "complete"}:
682
+ print("Note: This task requires a base DiT model (acestep-v15-base). It will be auto-downloaded if missing.")
683
+
684
+ # Model selection (DiT)
685
+ dit_handler = AceStepHandler()
686
+ available_dit_models = dit_handler.get_available_acestep_v15_models()
687
+ base_only = args.task_type in {"lego", "extract", "complete"}
688
+ if base_only and available_dit_models:
689
+ available_dit_models = [m for m in available_dit_models if "base" in m.lower()]
690
+
691
+ if base_only and args.config_path and "base" not in str(args.config_path).lower():
692
+ args.config_path = None
693
+
694
+ if base_only:
695
+ if available_dit_models:
696
+ if args.config_path in available_dit_models:
697
+ selected = args.config_path
698
+ else:
699
+ selected = available_dit_models[0]
700
+ args.config_path = selected
701
+ print(f"\nNote: This task requires a base model. Using: {selected}")
702
+ else:
703
+ print("\nNote: This task requires a base model (e.g., 'acestep-v15-base'). It will be auto-downloaded if missing.")
704
+ elif available_dit_models:
705
+ selected = _prompt_choice_from_list(
706
+ "--- Available DiT Models ---",
707
+ available_dit_models,
708
+ default=args.config_path,
709
+ allow_custom=True,
710
+ )
711
+ if selected is not None:
712
+ args.config_path = selected
713
+ else:
714
+ print("\nNote: No local DiT models found. The main model will be auto-downloaded during initialization.")
715
+
716
+ # Model selection (LM)
717
+ llm_handler = LLMHandler()
718
+ available_lm_models = llm_handler.get_available_5hz_lm_models()
719
+ if available_lm_models:
720
+ selected_lm = _prompt_choice_from_list(
721
+ "--- Available LM Models ---",
722
+ available_lm_models,
723
+ default=args.lm_model_path,
724
+ allow_custom=True,
725
+ )
726
+ if selected_lm is not None:
727
+ args.lm_model_path = selected_lm
728
+ else:
729
+ print("\nNote: No local LM models found. If LM features are enabled, a default LM will be auto-downloaded.")
730
+
731
+ # Task-specific inputs
732
+ if args.task_type in {"cover", "repaint", "lego", "extract", "complete"}:
733
+ args.src_audio = _prompt_existing_file("Enter path to source audio file", default=args.src_audio)
734
+
735
+ if args.task_type == "repaint":
736
+ args.repainting_start = _prompt_float(
737
+ "Repaint start time in seconds", args.repainting_start
738
+ )
739
+ args.repainting_end = _prompt_float(
740
+ "Repaint end time in seconds", args.repainting_end
741
+ )
742
+
743
+ if args.task_type in {"lego", "extract"}:
744
+ print("\nAvailable tracks:")
745
+ print(", ".join(TRACK_CHOICES))
746
+ track_default = args.lego_track if args.task_type == "lego" else args.extract_track
747
+ track = _prompt_with_default("Choose a track", track_default, required=True)
748
+ if track not in TRACK_CHOICES:
749
+ print("Unknown track. Using as-is.")
750
+ if args.task_type == "lego":
751
+ args.lego_track = track
752
+ else:
753
+ args.extract_track = track
754
+ if not args.instruction or args.instruction == DEFAULT_DIT_INSTRUCTION:
755
+ args.instruction = _default_instruction_for_task(args.task_type, [track])
756
+ args.instruction = _prompt_with_default("Instruction", args.instruction, required=True)
757
+
758
+ if args.task_type == "complete":
759
+ print("\nAvailable tracks:")
760
+ print(", ".join(TRACK_CHOICES))
761
+ tracks_raw = _prompt_with_default("Choose tracks (comma-separated)", args.complete_tracks, required=True)
762
+ tracks = [t.strip() for t in tracks_raw.split(",") if t.strip()]
763
+ args.complete_tracks = ",".join(tracks)
764
+ if not args.instruction or args.instruction == DEFAULT_DIT_INSTRUCTION:
765
+ args.instruction = _default_instruction_for_task(args.task_type, tracks)
766
+ args.instruction = _prompt_with_default("Instruction", args.instruction, required=True)
767
+
768
+ if args.task_type in {"cover", "repaint", "lego", "complete"}:
769
+ args.caption = _prompt_with_default(
770
+ "Enter a music description (e.g., 'upbeat electronic dance music')",
771
+ args.caption,
772
+ required=True,
773
+ )
774
+ elif args.task_type == "text2music":
775
+ args.sample_mode = _prompt_bool("Use Simple Mode (auto-generate caption/lyrics via LM)", args.sample_mode)
776
+ if args.sample_mode:
777
+ args.sample_query = _prompt_with_default(
778
+ "Describe the music you want (for auto-generation)",
779
+ args.sample_query,
780
+ required=False,
781
+ )
782
+ if not args.sample_mode:
783
+ caption = _prompt_with_default(
784
+ "Enter a music description (optional if you provide lyrics)",
785
+ args.caption,
786
+ required=False,
787
+ )
788
+ if caption:
789
+ args.caption = caption
790
+
791
+ # Lyrics
792
+ if args.task_type in {"text2music", "cover", "repaint", "lego", "complete"} and not args.sample_mode:
793
+ print("\n--- Lyrics Options ---")
794
+ print("1. Instrumental (no lyrics).")
795
+ print("2. Generate lyrics automatically.")
796
+ print("3. Provide path to a .txt file.")
797
+ print("4. Paste lyrics directly.")
798
+
799
+ if args.instrumental or args.lyrics == "[Instrumental]":
800
+ default_choice = "1"
801
+ elif args.use_cot_lyrics:
802
+ default_choice = "2"
803
+ elif args.lyrics and isinstance(args.lyrics, str) and os.path.isfile(args.lyrics):
804
+ default_choice = "3"
805
+ elif args.lyrics:
806
+ default_choice = "4"
807
+ else:
808
+ default_choice = "1"
809
+ choice = input(f"Your choice (1-4) [default: {default_choice}]: ").strip()
810
+ if not choice:
811
+ choice = default_choice
812
+
813
+ if choice == "1": # Instrumental
814
+ args.instrumental = True
815
+ args.lyrics = "[Instrumental]"
816
+ args.use_cot_lyrics = False
817
+ print("Instrumental music will be generated.")
818
+ elif choice == "2": # Generate lyrics automatically
819
+ args.use_cot_lyrics = True
820
+ args.lyrics = ""
821
+ args.instrumental = False
822
+ print("Lyrics will be generated automatically.")
823
+ elif choice == "3":
824
+ args.instrumental = False
825
+ args.use_cot_lyrics = False
826
+ default_lyrics_path = args.lyrics if isinstance(args.lyrics, str) and os.path.isfile(args.lyrics) else None
827
+ while True:
828
+ lyrics_path = _prompt_existing_file("Please enter the path to your .txt lyrics file", default_lyrics_path)
829
+ if lyrics_path.endswith('.txt'):
830
+ args.lyrics = lyrics_path
831
+ print(f"Lyrics will be loaded from: {lyrics_path}")
832
+ break
833
+ print("Invalid file path or not a .txt file. Please try again.")
834
+ elif choice == "4":
835
+ args.instrumental = False
836
+ args.use_cot_lyrics = False
837
+ default_lyrics = args.lyrics if isinstance(args.lyrics, str) and args.lyrics and not os.path.isfile(args.lyrics) else None
838
+ args.lyrics = _prompt_with_default("Paste lyrics (single line or use \\n)", default_lyrics, required=True)
839
+
840
+ if not args.instrumental:
841
+ lang = _prompt_with_default(
842
+ "Vocal language (e.g., 'en', 'zh', 'unknown')",
843
+ args.vocal_language,
844
+ required=False
845
+ ).lower()
846
+ if lang:
847
+ args.vocal_language = lang
848
+
849
+ if args.use_cot_lyrics:
850
+ if not args.caption:
851
+ args.caption = _prompt_non_empty("Enter a music description for lyric generation: ")
852
+ if not args.thinking:
853
+ print("INFO: Automatic lyric generation requires the LM handler. Enabling LM 'thinking'.")
854
+ args.thinking = True
855
+
856
+ args.batch_size = _prompt_int(
857
+ "Number of outputs (audio clips) to generate",
858
+ args.batch_size if args.batch_size is not None else 2,
859
+ min_value=1,
860
+ )
861
+
862
+ advanced = input("\nConfigure advanced parameters? (y/n) [default: n]: ").lower()
863
+ if advanced == 'y':
864
+ if args.task_type == "text2music" and not args.sample_mode:
865
+ args.use_format = _prompt_bool("Use format_sample to enhance caption/lyrics", args.use_format)
866
+ print("\n--- Optional Metadata ---")
867
+ args.duration = _prompt_float("Duration in seconds (10-600)", args.duration, min_value=10, max_value=600)
868
+ args.bpm = _prompt_int("BPM (30-300, empty for auto)", args.bpm, min_value=30, max_value=300)
869
+ args.keyscale = _prompt_with_default("Keyscale (e.g., 'C Major', empty for auto)", args.keyscale)
870
+ args.timesignature = _prompt_with_default("Time signature (e.g., '4/4', empty for auto)", args.timesignature)
871
+ args.vocal_language = _prompt_with_default("Vocal language (e.g., 'en', 'zh', 'unknown')", args.vocal_language)
+
+ print("\n--- Advanced DiT Settings ---")
+ args.seed = _prompt_int("Random seed (-1 for random)", args.seed)
+ args.inference_steps = _prompt_int("Inference steps", args.inference_steps, min_value=1)
+ if args.config_path and 'base' in args.config_path:
+ args.guidance_scale = _prompt_float("Guidance scale (for base models)", args.guidance_scale)
+ args.use_adg = _prompt_bool("Enable Adaptive Dual Guidance (ADG)", args.use_adg)
+ args.cfg_interval_start = _prompt_float("CFG interval start (0.0-1.0)", args.cfg_interval_start, 0.0, 1.0)
+ args.cfg_interval_end = _prompt_float("CFG interval end (0.0-1.0)", args.cfg_interval_end, 0.0, 1.0)
+ args.shift = _prompt_float("Timestep shift (1.0-5.0)", args.shift, 1.0, 5.0)
+ args.infer_method = _prompt_with_default("Inference method (ode/sde)", args.infer_method)
+ timesteps_input = _prompt_with_default(
+ "Custom timesteps list (e.g., [0.97, 0.5, 0])",
+ args.timesteps,
+ required=False,
+ )
+ if timesteps_input:
+ args.timesteps = timesteps_input
+
+ if args.task_type == "cover":
+ args.audio_cover_strength = _prompt_float(
+ "Audio cover strength (0.0-1.0)", args.audio_cover_strength, 0.0, 1.0
+ )
+
+ print("\n--- Advanced LM Settings ---")
+ args.thinking = _prompt_bool("Enable LM 'thinking'", args.thinking)
+ args.lm_temperature = _prompt_float("LM temperature (0.0-2.0)", args.lm_temperature, 0.0, 2.0)
+ args.lm_cfg_scale = _prompt_float("LM CFG scale", args.lm_cfg_scale)
+ args.lm_top_k = _prompt_int("LM top-k (0 disables)", args.lm_top_k, min_value=0)
+ args.lm_top_p = _prompt_float("LM top-p (0.0-1.0)", args.lm_top_p, 0.0, 1.0)
+ args.lm_negative_prompt = _prompt_with_default("LM negative prompt", args.lm_negative_prompt)
+ args.use_cot_metas = _prompt_bool("Use CoT for metadata", args.use_cot_metas)
+ args.use_cot_caption = _prompt_bool("Use CoT for caption refinement", args.use_cot_caption)
+ args.use_cot_lyrics = _prompt_bool("Use CoT for lyrics generation", args.use_cot_lyrics)
+ args.use_cot_language = _prompt_bool("Use CoT for language detection", args.use_cot_language)
+ args.use_constrained_decoding = _prompt_bool("Use constrained decoding", args.use_constrained_decoding)
+
+ print("\n--- Output Settings ---")
+ args.save_dir = _prompt_with_default("Save directory", args.save_dir)
+ args.audio_format = _prompt_with_default("Audio format (mp3/wav/flac)", args.audio_format)
+ # Batch size already captured above.
+ args.use_random_seed = _prompt_bool("Use random seed per batch", args.use_random_seed)
+ seeds_input = _prompt_with_default(
+ "Custom seeds (comma/space separated, leave empty for random)",
+ "",
+ required=False,
+ )
+ if seeds_input:
+ seeds = [s for s in seeds_input.replace(",", " ").split() if s.strip()]
+ try:
+ args.seeds = [int(s) for s in seeds]
+ except ValueError:
+ print("Invalid seeds input. Ignoring custom seeds.")
+ args.allow_lm_batch = _prompt_bool("Allow LM batch processing", args.allow_lm_batch)
+ args.lm_batch_chunk_size = _prompt_int("LM batch chunk size", args.lm_batch_chunk_size, min_value=1)
+ args.constrained_decoding_debug = _prompt_bool("Constrained decoding debug", args.constrained_decoding_debug)
+ else:
+ if params_defaults and config_defaults:
+ _apply_optional_defaults(args, params_defaults, config_defaults)
+
+ # Ensure LM thinking is enabled when lyric generation is requested.
+ if args.use_cot_lyrics and not args.thinking:
+ print("INFO: Automatic lyric generation requires the LM handler. Enabling LM 'thinking'.")
+ args.thinking = True
+
+ print("\n--- Summary ---")
+ print(f"Task: {args.task_type}")
+ if args.caption:
+ print(f"Description: {args.caption}")
+ if args.task_type in {"lego", "extract", "complete"}:
+ print(f"Instruction: {args.instruction}")
+ if args.src_audio:
+ print(f"Source audio: {args.src_audio}")
+ print(f"Duration: {args.duration}s")
+ print(f"Outputs: {args.batch_size}")
+ if args.instrumental:
+ print("Lyrics: Instrumental")
+ elif args.use_cot_lyrics:
+ print(f"Lyrics: Auto-generated ({args.vocal_language})")
+ elif args.lyrics and os.path.isfile(args.lyrics):
+ print(f"Lyrics: Provided from file ({args.lyrics})")
+ elif args.lyrics:
+ print("Lyrics: Provided as text")
+
+ print("-" * 30)
+ if not configure_only:
+ confirm = input("Start generation with these settings? (y/n) [default: y]: ").lower()
+ if confirm == 'n':
+ print("Generation cancelled.")
+ sys.exit(0)
+
+ default_filename = default_config_path or "config.toml"
+ config_filename = input(f"\nEnter filename to save configuration [{default_filename}]: ")
+ if not config_filename:
+ config_filename = default_filename
+ if not config_filename.endswith(".toml"):
+ config_filename += ".toml"
+
+ try:
+ config_to_save = {
+ k: v for k, v in vars(args).items()
+ if k not in ['config'] and not k.startswith('_')
+ }
+ with open(config_filename, 'w') as f:
+ toml.dump(config_to_save, f)
+ print(f"Configuration saved to {config_filename}")
+ print(f"You can reuse it next time with: python cli.py -c {config_filename}")
+ except Exception as e:
+ print(f"Error saving configuration: {e}. Please try again.")
+
+ except (KeyboardInterrupt, EOFError):
+ print("\nWizard cancelled. Exiting.")
+ sys.exit(0)
+
+ return args, not configure_only
+
+
+ def main():
+ """
+ Main function to run ACE-Step music generation from the command line.
+ """
+
+ gpu_config = get_gpu_config()
+ set_global_gpu_config(gpu_config)
+ mps_available = is_mps_platform()
+ # Mac (Apple Silicon) uses unified memory — offloading provides no benefit
+ auto_offload = (not mps_available) and gpu_config.gpu_memory_gb > 0 and gpu_config.gpu_memory_gb < 16
+ print(f"\n{'='*60}")
+ print("GPU Configuration Detected:")
+ print(f"{'='*60}")
+ print(f" GPU Memory: {gpu_config.gpu_memory_gb:.2f} GiB")
+ print(f" Configuration Tier: {gpu_config.tier}")
+ print(f" Max Duration (with LM): {gpu_config.max_duration_with_lm}s ({gpu_config.max_duration_with_lm // 60} min)")
+ print(f" Max Duration (without LM): {gpu_config.max_duration_without_lm}s ({gpu_config.max_duration_without_lm // 60} min)")
+ print(f" Max Batch Size (with LM): {gpu_config.max_batch_size_with_lm}")
+ print(f" Max Batch Size (without LM): {gpu_config.max_batch_size_without_lm}")
+ print(f" Default LM Init: {gpu_config.init_lm_default}")
+ print(f" Available LM Models: {gpu_config.available_lm_models or 'None'}")
+ print(f"{'='*60}\n")
+
+ if auto_offload:
+ print("Auto-enabling CPU offload (GPU < 16GB)")
+ elif gpu_config.gpu_memory_gb > 0:
+ print("CPU offload disabled by default (GPU >= 16GB)")
+ elif mps_available:
+ print("MPS detected, running on Apple GPU")
+ else:
+ print("No GPU detected, running on CPU")
+
+ params_defaults = GenerationParams()
+ config_defaults = GenerationConfig()
+
+ parser = argparse.ArgumentParser(
+ description="ACE-Step 1.5: Music generation (wizard/config only).",
+ formatter_class=argparse.ArgumentDefaultsHelpFormatter
+ )
+ parser.add_argument("-c", "--config", type=str, help="Path to a TOML configuration file to load.")
+ parser.add_argument("--configure", action="store_true", help="Run wizard to save configuration without generating.")
+ parser.add_argument(
+ "--backend",
+ type=str,
+ default=None,
+ choices=["vllm", "pt", "mlx"],
+ help="5Hz LM backend. Auto-detected if not specified: 'mlx' on Apple Silicon, 'vllm' on CUDA, 'pt' otherwise.",
+ )
+ parser.add_argument(
+ "--log-level",
+ type=str,
+ default="INFO",
+ help="Logging level for internal modules (TRACE/DEBUG/INFO/WARNING/ERROR/CRITICAL).",
+ )
+ cli_args = parser.parse_args()
+
+ _configure_logging(level=cli_args.log_level)
+
+ default_batch_size = 1 if not cli_args.config else config_defaults.batch_size
+
+ # Auto-detect MLX on Apple Silicon, fall back to vllm
+ if mps_available:
+ try:
+ import mlx.core # noqa: F401
+ default_backend = "mlx"
+ print("Apple Silicon detected with MLX available. Using MLX backend.")
+ except ImportError:
+ default_backend = "vllm"
+ else:
+ default_backend = "vllm"
+
+ defaults = {
+ "project_root": _get_project_root(),
+ "config_path": None,
+ "checkpoint_dir": os.path.join(_get_project_root(), "checkpoints"),
+ "lm_model_path": None,
+ "backend": default_backend,
+ "device": "auto",
+ "use_flash_attention": None,
+ "offload_to_cpu": auto_offload,
+ "offload_dit_to_cpu": False,
+ "save_dir": "output",
+ "audio_format": config_defaults.audio_format,
+ "caption": "",
+ "prompt": "",
+ "lyrics": None,
+ "duration": params_defaults.duration,
+ "instrumental": False,
+ "bpm": params_defaults.bpm,
+ "keyscale": params_defaults.keyscale,
+ "timesignature": params_defaults.timesignature,
+ "vocal_language": params_defaults.vocal_language,
+ "task_type": params_defaults.task_type,
+ "instruction": params_defaults.instruction,
+ "reference_audio": params_defaults.reference_audio,
+ "src_audio": params_defaults.src_audio,
+ "repainting_start": params_defaults.repainting_start,
+ "repainting_end": params_defaults.repainting_end,
+ "audio_cover_strength": params_defaults.audio_cover_strength,
+ "lego_track": "",
+ "extract_track": "",
+ "complete_tracks": "",
+ "sample_mode": False,
+ "sample_query": "",
+ "use_format": False,
+ "inference_steps": params_defaults.inference_steps,
+ "seed": params_defaults.seed,
+ "guidance_scale": params_defaults.guidance_scale,
+ "use_adg": params_defaults.use_adg,
+ "shift": 3.0,
+ "infer_method": params_defaults.infer_method,
+ "timesteps": None,
+ "thinking": gpu_config.init_lm_default,
+ "lm_temperature": params_defaults.lm_temperature,
+ "lm_cfg_scale": params_defaults.lm_cfg_scale,
+ "lm_top_k": params_defaults.lm_top_k,
+ "lm_top_p": params_defaults.lm_top_p,
+ "use_cot_metas": params_defaults.use_cot_metas,
+ "use_cot_caption": params_defaults.use_cot_caption,
+ "use_cot_lyrics": params_defaults.use_cot_lyrics,
+ "use_cot_language": params_defaults.use_cot_language,
+ "use_constrained_decoding": params_defaults.use_constrained_decoding,
+ "batch_size": default_batch_size,
+ "seeds": None,
+ "use_random_seed": config_defaults.use_random_seed,
+ "allow_lm_batch": config_defaults.allow_lm_batch,
+ "lm_batch_chunk_size": config_defaults.lm_batch_chunk_size,
+ "constrained_decoding_debug": config_defaults.constrained_decoding_debug,
+ "audio_codes": "",
+ "cfg_interval_start": params_defaults.cfg_interval_start,
+ "cfg_interval_end": params_defaults.cfg_interval_end,
+ "lm_negative_prompt": params_defaults.lm_negative_prompt,
+ "log_level": cli_args.log_level,
+ }
+
+ args = argparse.Namespace(**defaults)
+ args.config = None
+ if cli_args.config:
+ if not os.path.exists(cli_args.config):
+ parser.error(f"Config file not found: {cli_args.config}")
+ try:
+ with open(cli_args.config, 'r') as f:
+ config_from_file = toml.load(f)
+ print(f"Configuration loaded from {cli_args.config}")
+ except Exception as e:
+ parser.error(f"Error loading TOML config file {cli_args.config}: {e}")
+ for key, value in config_from_file.items():
+ setattr(args, key, value)
+ args.config = cli_args.config
+
+ # CLI --backend overrides config file and auto-detection
+ if cli_args.backend is not None:
+ args.backend = cli_args.backend
+
+ if cli_args.configure:
+ args, _ = run_wizard(
+ args,
+ configure_only=True,
+ default_config_path=cli_args.config,
+ params_defaults=params_defaults,
+ config_defaults=config_defaults,
+ )
+ print("Configuration complete. Exiting without generation.")
+ sys.exit(0)
+
+ if not cli_args.config:
+ args, should_generate = run_wizard(
+ args,
+ configure_only=False,
+ default_config_path=None,
+ params_defaults=params_defaults,
+ config_defaults=config_defaults,
+ )
+ if not should_generate:
+ print("Configuration complete. Exiting without generation.")
+ sys.exit(0)
+
+ # --- Post-parsing Setup ---
+ if args.use_cot_lyrics and not args.thinking:
+ print("INFO: Automatic lyric generation requires the LM handler. Forcing --thinking=True.")
+ args.thinking = True
+
+ if not args.project_root:
+ args.project_root = _get_project_root()
+ else:
+ args.project_root = os.path.abspath(os.path.expanduser(str(args.project_root)))
+
+ if args.checkpoint_dir:
+ args.checkpoint_dir = os.path.expanduser(str(args.checkpoint_dir))
+ if not os.path.isabs(args.checkpoint_dir):
+ args.checkpoint_dir = os.path.join(args.project_root, args.checkpoint_dir)
+
+ if args.src_audio:
+ args.src_audio = _expand_audio_path(args.src_audio)
+ if args.reference_audio:
+ args.reference_audio = _expand_audio_path(args.reference_audio)
+
+ device = _resolve_device(args.device)
+
+ # --- Argument Post-processing ---
+ try:
+ timesteps = _parse_timesteps_input(args.timesteps)
+ if args.timesteps and timesteps is None:
+ raise ValueError("Timesteps must be a list of numbers or a comma-separated string.")
+ except ValueError as e:
+ parser.error(f"Invalid format for timesteps. Expected a list of numbers (e.g., '[1.0, 0.5, 0.0]' or '0.97,0.5,0'). Error: {e}")
+
+ if args.seeds:
+ args.batch_size = len(args.seeds)
+ args.use_random_seed = False
+ args.seed = -1
+
+ if args.instrumental and not args.lyrics:
+ args.lyrics = "[Instrumental]"
+ elif isinstance(args.lyrics, str) and args.lyrics.strip().lower() in {"[inst]", "[instrumental]"}:
+ args.instrumental = True
+
+ # --- Task-specific validation and instruction helpers ---
+ if args.task_type in {"cover", "repaint", "lego", "extract", "complete"}:
+ if not args.src_audio:
+ parser.error(f"--src_audio is required for task_type '{args.task_type}'.")
+
+ if args.task_type in {"cover", "repaint", "lego", "complete"}:
+ if not args.caption:
+ parser.error(f"--caption is required for task_type '{args.task_type}'.")
+
+ if args.task_type == "text2music":
+ if not args.caption and not args.lyrics:
+ if not args.sample_mode and not args.sample_query:
+ parser.error("--caption or --lyrics is required for text2music.")
+ if args.use_cot_lyrics and not args.caption:
+ parser.error("--use_cot_lyrics requires --caption for lyric generation.")
+ if args.sample_mode or args.sample_query:
+ args.sample_mode = True
+ else:
+ if args.sample_mode or args.sample_query:
+ parser.error("--sample_mode/sample_query are only supported for task_type 'text2music'.")
+
+ if args.sample_mode and args.use_cot_lyrics:
+ print("INFO: sample_mode enabled. Disabling --use_cot_lyrics.")
+ args.use_cot_lyrics = False
+
+ # Auto-select instruction based on task_type if user didn't provide a custom instruction.
+ # Align with api_server behavior and TASK_INSTRUCTIONS defaults.
+ if args.instruction == DEFAULT_DIT_INSTRUCTION and args.task_type in TASK_INSTRUCTIONS:
+ if args.task_type in {"text2music", "cover", "repaint"}:
+ args.instruction = TASK_INSTRUCTIONS[args.task_type]
+
+ # Base-model-only task enforcement
+ base_only_tasks = {"lego", "extract", "complete"}
+ if args.task_type in base_only_tasks and args.config_path:
+ if "base" not in str(args.config_path).lower():
+ parser.error(f"task_type '{args.task_type}' requires a base model config (e.g., 'acestep-v15-base').")
+
+ if args.task_type == "repaint":
+ if args.repainting_end != -1 and args.repainting_end <= args.repainting_start:
+ parser.error("--repainting_end must be greater than --repainting_start (or -1).")
+
+ if args.task_type in {"lego", "extract", "complete"}:
+ has_custom_instruction = bool(args.instruction and args.instruction.strip() and args.instruction.strip() != params_defaults.instruction)
+ if not has_custom_instruction:
+ if args.task_type == "lego":
+ if not args.lego_track:
+ parser.error("--instruction or --lego_track is required for lego task.")
+ args.instruction = _default_instruction_for_task("lego", [args.lego_track.strip()])
+ elif args.task_type == "extract":
+ if not args.extract_track:
+ parser.error("--instruction or --extract_track is required for extract task.")
+ args.instruction = _default_instruction_for_task("extract", [args.extract_track.strip()])
+ elif args.task_type == "complete":
+ if not args.complete_tracks:
+ parser.error("--instruction or --complete_tracks is required for complete task.")
+ tracks = [t.strip() for t in args.complete_tracks.split(",") if t.strip()]
+ if not tracks:
+ parser.error("--complete_tracks must contain at least one track.")
+ args.instruction = _default_instruction_for_task("complete", tracks)
+
+ # Handle lyrics argument
+ lyrics_arg = args.lyrics
+ if isinstance(lyrics_arg, str) and lyrics_arg:
+ lyrics_arg = os.path.expanduser(lyrics_arg)
+ if not os.path.isabs(lyrics_arg):
+ # Resolve relative lyrics path against config file location first, then project_root.
+ resolved = None
+ if args.config:
+ config_dir = os.path.dirname(os.path.abspath(args.config))
+ candidate = os.path.join(config_dir, lyrics_arg)
+ if os.path.isfile(candidate):
+ resolved = candidate
+ if resolved is None and args.project_root:
+ candidate = os.path.join(os.path.abspath(args.project_root), lyrics_arg)
+ if os.path.isfile(candidate):
+ resolved = candidate
+ if resolved is not None:
+ lyrics_arg = resolved
+
+ if lyrics_arg is not None:
+ if lyrics_arg == "generate":
+ args.use_cot_lyrics = True
+ args.lyrics = ""
+ print("Lyrics generation enabled.")
+ elif os.path.isfile(lyrics_arg):
+ print(f"INFO: Attempting to load lyrics from file: {lyrics_arg}")
+ try:
+ with open(lyrics_arg, 'r', encoding='utf-8') as f:
+ args.lyrics = f.read()
+ print(f"Lyrics loaded from file: {lyrics_arg}")
+ except Exception as e:
+ parser.error(f"Could not read lyrics file {lyrics_arg}. Error: {e}")
+ # else: lyrics is a string, use as is.
+
+ # --- Handler Initialization ---
+ if args.backend == "pyTorch":
+ args.backend = "pt"
+ if args.backend not in {"vllm", "pt", "mlx"}:
+ args.backend = "vllm"
+
+ print("Initializing ACE-Step handlers...")
+ dit_handler = AceStepHandler()
+ llm_handler = LLMHandler()
+
+ base_only_tasks = {"lego", "extract", "complete"}
+ skip_lm_tasks = {"cover", "repaint"}
+ requires_lm = (
+ args.task_type not in skip_lm_tasks and (
+ args.thinking
+ or args.sample_mode
+ or bool(args.sample_query and str(args.sample_query).strip())
+ or args.use_format
+ or args.use_cot_metas
+ or args.use_cot_caption
+ or args.use_cot_lyrics
+ or args.use_cot_language
+ )
+ )
+
+ if args.config_path is None:
+ available_models = dit_handler.get_available_acestep_v15_models()
+ if args.task_type in base_only_tasks and available_models:
+ available_models = [m for m in available_models if "base" in m.lower()]
+ if not available_models:
+ print("No DiT models found. Downloading main model (acestep-v15-turbo + core components)...")
+ from acestep.model_downloader import ensure_main_model, get_checkpoints_dir
+ checkpoints_dir = get_checkpoints_dir()
+ success, msg = ensure_main_model(checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download main model: {msg}")
+ available_models = dit_handler.get_available_acestep_v15_models()
+ if args.task_type in base_only_tasks and available_models:
+ available_models = [m for m in available_models if "base" in m.lower()]
+ if args.task_type in base_only_tasks and not available_models:
+ print("Base-only task selected. Downloading base DiT model (acestep-v15-base)...")
+ from acestep.model_downloader import ensure_dit_model, get_checkpoints_dir
+ checkpoints_dir = get_checkpoints_dir()
+ success, msg = ensure_dit_model("acestep-v15-base", checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download base DiT model: {msg}")
+ available_models = dit_handler.get_available_acestep_v15_models()
+ if available_models:
+ available_models = [m for m in available_models if "base" in m.lower()]
+ if available_models:
+ if args.task_type in {"lego", "extract", "complete"}:
+ preferred = "acestep-v15-base"
+ else:
+ preferred = "acestep-v15-turbo"
+ args.config_path = preferred if preferred in available_models else available_models[0]
+ print(f"Auto-selected config_path: {args.config_path}")
+ else:
+ parser.error("No available DiT models found. Please specify --config_path.")
+ if args.task_type in {"lego", "extract", "complete"} and "base" not in str(args.config_path).lower():
+ parser.error(f"task_type '{args.task_type}' requires a base model config (e.g., 'acestep-v15-base').")
+
+ # Ensure required DiT/main models are present for the selected task/model.
+ from acestep.model_downloader import (
+ ensure_main_model,
+ ensure_dit_model,
+ get_checkpoints_dir,
+ check_main_model_exists,
+ check_model_exists,
+ SUBMODEL_REGISTRY,
+ )
+ checkpoints_dir = get_checkpoints_dir()
+ if not check_main_model_exists(checkpoints_dir):
+ print("Main model components not found. Downloading main model...")
+ success, msg = ensure_main_model(checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download main model: {msg}")
+ if args.config_path:
+ config_name = str(args.config_path)
+ known_models = {"acestep-v15-turbo"} | set(SUBMODEL_REGISTRY.keys())
+ if check_model_exists(config_name, checkpoints_dir):
+ pass
+ elif config_name in known_models:
+ success, msg = ensure_dit_model(config_name, checkpoints_dir)
+ if not success:
+ parser.error(f"Failed to download DiT model '{config_name}': {msg}")
+ else:
+ print(f"Warning: DiT model '{config_name}' not found locally and not in registry. Skipping auto-download.")
+
+ use_flash_attention = args.use_flash_attention
+ if use_flash_attention is None:
+ use_flash_attention = dit_handler.is_flash_attention_available(device)
+
+ compile_model = os.environ.get("ACESTEP_COMPILE_MODEL", "").strip().lower() in {
+ "1", "true", "yes", "y", "on",
+ }
+
+ print(f"Initializing DiT handler with model: {args.config_path}")
+ dit_handler.initialize_service(
+ project_root=args.project_root,
+ config_path=args.config_path,
+ device=device,
+ use_flash_attention=use_flash_attention,
+ compile_model=compile_model,
+ offload_to_cpu=args.offload_to_cpu,
+ offload_dit_to_cpu=args.offload_dit_to_cpu,
+ )
+
+ if requires_lm:
+ from acestep.model_downloader import ensure_lm_model
+ if args.lm_model_path is None:
+ available_lm_models = llm_handler.get_available_5hz_lm_models()
+ if available_lm_models:
+ args.lm_model_path = available_lm_models[0]
+ print(f"Using default LM model: {args.lm_model_path}")
+ else:
+ success, msg = ensure_lm_model(checkpoints_dir=checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error("No LM models available. Please specify --lm_model_path or disable --thinking.")
+ available_lm_models = llm_handler.get_available_5hz_lm_models()
+ if not available_lm_models:
+ parser.error("No LM models available after download. Please specify --lm_model_path or disable --thinking.")
+ args.lm_model_path = available_lm_models[0]
+ print(f"Using default LM model: {args.lm_model_path}")
+ else:
+ lm_model_path = str(args.lm_model_path)
+ if os.path.isabs(lm_model_path) and os.path.exists(lm_model_path):
+ pass
+ elif check_model_exists(lm_model_path, checkpoints_dir):
+ pass
+ elif lm_model_path in SUBMODEL_REGISTRY:
+ success, msg = ensure_lm_model(lm_model_path, checkpoints_dir=checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download LM model '{lm_model_path}': {msg}")
+ else:
+ parser.error(f"LM model '{lm_model_path}' not found locally and not in registry. Please provide a valid --lm_model_path.")
+
+ print(f"Initializing LM handler with model: {args.lm_model_path}")
+ llm_handler.initialize(
+ checkpoint_dir=args.checkpoint_dir,
+ lm_model_path=args.lm_model_path,
+ backend=args.backend,
+ device=device,
+ offload_to_cpu=args.offload_to_cpu,
+ dtype=None,
+ )
+ else:
+ if args.task_type in skip_lm_tasks:
+ print(f"LM is not required for task_type '{args.task_type}'. Skipping LM handler initialization.")
+ else:
+ print("LM 'thinking' is disabled. Skipping LM handler initialization.")
+
+ print("Handlers initialized.")
+
+ format_has_duration = False
+
+ # --- Sample Mode / Description-based Auto-Generation ---
+ if args.sample_mode or (args.sample_query and str(args.sample_query).strip()):
+ if not llm_handler.llm_initialized:
+ parser.error("--sample_mode/sample_query requires the LM handler, but it's not initialized.")
+
+ sample_query = args.sample_query if args.sample_query and str(args.sample_query).strip() else "NO USER INPUT"
+ parsed_language, parsed_instrumental = _parse_description_hints(sample_query)
+
+ if args.vocal_language and args.vocal_language not in ("en", "unknown", ""):
+ sample_language = args.vocal_language
+ else:
+ sample_language = parsed_language
+
+ print("\nINFO: Creating sample via 'create_sample'...")
+ sample_result = create_sample(
+ llm_handler=llm_handler,
+ query=sample_query,
+ instrumental=parsed_instrumental,
+ vocal_language=sample_language,
+ temperature=args.lm_temperature,
+ top_k=args.lm_top_k,
+ top_p=args.lm_top_p,
+ )
+
+ if sample_result.success:
+ args.caption = sample_result.caption
+ args.lyrics = sample_result.lyrics
+ args.instrumental = bool(sample_result.instrumental)
+ if args.bpm is None:
+ args.bpm = sample_result.bpm
+ if not args.keyscale:
+ args.keyscale = sample_result.keyscale
+ if not args.timesignature:
+ args.timesignature = sample_result.timesignature
+ if args.duration <= 0:
+ args.duration = sample_result.duration
+ if args.vocal_language in ("unknown", "", None):
+ args.vocal_language = sample_result.language
+ args.sample_mode = True
+ print("✓ Sample created. Using generated parameters.")
+ else:
+ parser.error(f"create_sample failed: {sample_result.error or sample_result.status_message}")
+
+ # --- Format caption/lyrics if requested ---
+ if args.use_format and (args.caption or args.lyrics):
+ if not llm_handler.llm_initialized:
+ parser.error("--use_format requires the LM handler, but it's not initialized.")
+
+ user_metadata_for_format = {}
+ if args.bpm is not None:
+ user_metadata_for_format["bpm"] = args.bpm
+ if args.duration is not None and float(args.duration) > 0:
+ user_metadata_for_format["duration"] = float(args.duration)
+ if args.keyscale:
+ user_metadata_for_format["keyscale"] = args.keyscale
+ if args.timesignature:
+ user_metadata_for_format["timesignature"] = args.timesignature
+ if args.vocal_language and args.vocal_language != "unknown":
+ user_metadata_for_format["language"] = args.vocal_language
+
+ print("\nINFO: Formatting caption/lyrics via 'format_sample'...")
+ format_result = format_sample(
+ llm_handler=llm_handler,
+ caption=args.caption or "",
+ lyrics=args.lyrics or "",
+ user_metadata=user_metadata_for_format if user_metadata_for_format else None,
+ temperature=args.lm_temperature,
+ top_k=args.lm_top_k,
+ top_p=args.lm_top_p,
+ )
+
+ if format_result.success:
+ args.caption = format_result.caption or args.caption
+ args.lyrics = format_result.lyrics or args.lyrics
+ if format_result.duration:
+ args.duration = format_result.duration
+ format_has_duration = True
+ if format_result.bpm:
+ args.bpm = format_result.bpm
+ if format_result.keyscale:
+ args.keyscale = format_result.keyscale
+ if format_result.timesignature:
+ args.timesignature = format_result.timesignature
+ print("✓ Format complete.")
+ else:
+ parser.error(f"format_sample failed: {format_result.error or format_result.status_message}")
+
+ # --- Auto-generate Lyrics if Requested ---
+ if args.use_cot_lyrics:
+ if not llm_handler.llm_initialized:
+ parser.error("--use_cot_lyrics requires the LM handler, but it's not initialized. Ensure --thinking is enabled.")
+
+ print("\nINFO: Generating lyrics and metadata via 'create_sample'...")
+ sample_result = create_sample(
+ llm_handler=llm_handler,
+ query=args.caption,
+ instrumental=False,
+ vocal_language=args.vocal_language if args.vocal_language != 'unknown' else None,
+ temperature=args.lm_temperature,
+ top_k=args.lm_top_k,
+ top_p=args.lm_top_p,
+ )
+
+ if sample_result.success:
+ print("✓ Automatic sample creation successful. Using generated parameters:")
+ # Update args with values from create_sample, respecting user-provided values
+ args.caption = sample_result.caption
+ args.lyrics = sample_result.lyrics
+ if args.bpm is None: args.bpm = sample_result.bpm
+ if not args.keyscale: args.keyscale = sample_result.keyscale
+ if not args.timesignature: args.timesignature = sample_result.timesignature
+ if args.duration <= 0: args.duration = sample_result.duration
+ if args.vocal_language == 'unknown': args.vocal_language = sample_result.language
+
+ print(f" - Caption: {args.caption}")
+ lyrics_preview = args.lyrics[:150].strip().replace("\n", " ")
+ print(f" - Lyrics: '{lyrics_preview}...'")
+ print(f" - Metadata: BPM={args.bpm}, Key='{args.keyscale}', Lang='{args.vocal_language}'")
+
+ # Disable subsequent CoT steps to avoid redundancy and save time
+ args.use_cot_metas = False
+ args.use_cot_caption = False
+ else:
+ print(f"⚠️ WARNING: Automatic lyric generation via 'create_sample' failed: {sample_result.error}")
+ print(" Proceeding with an instrumental track instead.")
+ args.lyrics = "[Instrumental]"
+ args.instrumental = True
+
+ # Flag has served its purpose, disable it to avoid issues with GenerationParams
+ args.use_cot_lyrics = False
+
+ if args.sample_mode or format_has_duration:
+ args.use_cot_metas = False
+
+ # --- Prompt Editing Hook for LLM Audio Tokens ---
+ if args.thinking and args.task_type not in skip_lm_tasks:
+ instruction_path = os.path.join(
+ os.path.abspath(args.project_root) if args.project_root else os.getcwd(),
+ "instruction.txt",
+ )
+ preloaded_prompt = None
+ use_instruction_file = False
+ if args.config and os.path.exists(instruction_path):
+ use_instruction_file = True
+ try:
+ with open(instruction_path, "r", encoding="utf-8") as f:
+ preloaded_prompt = f.read()
+ except Exception as e:
+ print(f"WARNING: Failed to read {instruction_path}: {e}")
+ preloaded_prompt = None
+ use_instruction_file = False
+ if use_instruction_file:
+ print(f"INFO: Found {instruction_path}. Using it without editing.")
+ if preloaded_prompt is not None and not preloaded_prompt.strip():
+ preloaded_prompt = None
+ _install_prompt_edit_hook(llm_handler, instruction_path, preloaded_prompt=preloaded_prompt)
+
+ # --- Configure Generation ---
+ params = GenerationParams(
+ task_type=args.task_type,
+ instruction=args.instruction,
+ reference_audio=args.reference_audio,
+ src_audio=args.src_audio,
+ audio_codes=args.audio_codes,
+ caption=args.caption,
+ lyrics=args.lyrics,
+ instrumental=args.instrumental,
+ vocal_language=args.vocal_language,
+ bpm=args.bpm,
+ keyscale=args.keyscale,
+ timesignature=args.timesignature,
+ duration=args.duration,
+ inference_steps=args.inference_steps,
+ seed=args.seed,
+ guidance_scale=args.guidance_scale,
+ use_adg=args.use_adg,
+ cfg_interval_start=args.cfg_interval_start,
+ cfg_interval_end=args.cfg_interval_end,
+ shift=args.shift,
+ infer_method=args.infer_method,
+ timesteps=timesteps,
+ repainting_start=args.repainting_start,
+ repainting_end=args.repainting_end,
+ audio_cover_strength=args.audio_cover_strength,
+ thinking=args.thinking,
+ lm_temperature=args.lm_temperature,
+ lm_cfg_scale=args.lm_cfg_scale,
+ lm_top_k=args.lm_top_k,
+ lm_top_p=args.lm_top_p,
+ lm_negative_prompt=args.lm_negative_prompt,
+ use_cot_metas=args.use_cot_metas,
+ use_cot_caption=args.use_cot_caption,
+ use_cot_lyrics=args.use_cot_lyrics,
+ use_cot_language=args.use_cot_language,
+ use_constrained_decoding=args.use_constrained_decoding
+ )
+
+ config = GenerationConfig(
+ batch_size=args.batch_size,
+ allow_lm_batch=args.allow_lm_batch,
+ use_random_seed=args.use_random_seed,
+ seeds=args.seeds,
+ lm_batch_chunk_size=args.lm_batch_chunk_size,
+ constrained_decoding_debug=args.constrained_decoding_debug,
+ audio_format=args.audio_format
+ )
+
+    # --- Generate Music ---
+    log_level = getattr(args, "log_level", "INFO")
+    log_level_upper = str(log_level).upper()
+    compact_logs = log_level_upper != "DEBUG"
+    _print_final_parameters(
+        args,
+        params,
+        config,
+        params_defaults,
+        config_defaults,
+        compact=compact_logs,
+        resolved_device=device,
+    )
+
+    print("\n--- Starting Generation ---")
+    print(f"Caption: \"{params.caption}\"")
+    print(f"Duration: {params.duration}s | Outputs: {config.batch_size}")
+    if config.seeds:
+        print(f"Custom Seeds: {config.seeds}")
+    print("---------------------------\n")
+
+    manual_edit_pipeline = (
+        args.thinking
+        and args.task_type not in skip_lm_tasks
+        and not (params.audio_codes and str(params.audio_codes).strip())
+    )
+
+    lm_time_costs = None
+    if manual_edit_pipeline:
+        top_k_value = None if not params.lm_top_k or params.lm_top_k == 0 else int(params.lm_top_k)
+        top_p_value = None if not params.lm_top_p or params.lm_top_p >= 1.0 else params.lm_top_p
+
+        actual_batch_size = config.batch_size if config.batch_size is not None else 1
+        seed_for_generation = ""
+        if config.seeds is not None:
+            if isinstance(config.seeds, list) and len(config.seeds) > 0:
+                seed_for_generation = ",".join(str(s) for s in config.seeds)
+            elif isinstance(config.seeds, int):
+                seed_for_generation = str(config.seeds)
+        actual_seed_list, _ = dit_handler.prepare_seeds(actual_batch_size, seed_for_generation, config.use_random_seed)
+
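For clarity, the seed handling above first flattens `config.seeds` (an int or a list of ints) into the comma-separated string that `prepare_seeds` expects. A standalone sketch of that normalization (`normalize_seed_string` is a name invented here, not part of the committed script):

```python
def normalize_seed_string(seeds):
    """Flatten an int or non-empty list of ints into the comma-separated
    string consumed by prepare_seeds; "" means "no fixed seeds"."""
    if isinstance(seeds, list) and len(seeds) > 0:
        return ",".join(str(s) for s in seeds)
    if isinstance(seeds, int):
        return str(seeds)
    return ""

print(normalize_seed_string([7, 13, 42]))  # -> 7,13,42
print(repr(normalize_seed_string(None)))   # -> ''
```

An empty string result lets `prepare_seeds` fall back to its own (random or default) seeding.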
+        original_target_duration = params.duration
+        original_bpm = params.bpm
+        original_keyscale = params.keyscale
+        original_timesignature = params.timesignature
+        original_vocal_language = params.vocal_language
+        lm_result = None
+        lm_metadata = {}
+        edited_caption = None
+        edited_lyrics = None
+        edited_instruction = None
+        edited_metas = {}
+        lm_time_costs = {
+            "phase1_time": 0.0,
+            "phase2_time": 0.0,
+            "total_time": 0.0,
+        }
+        for attempt in range(2):
+            user_metadata = {}
+            if params.bpm is not None:
+                try:
+                    bpm_value = float(params.bpm)
+                    if bpm_value > 0:
+                        user_metadata["bpm"] = int(bpm_value)
+                except (ValueError, TypeError):
+                    pass
+            if params.keyscale and params.keyscale.strip() and params.keyscale.strip().lower() not in ["n/a", ""]:
+                user_metadata["keyscale"] = params.keyscale.strip()
+            if params.timesignature and params.timesignature.strip() and params.timesignature.strip().lower() not in ["n/a", ""]:
+                user_metadata["timesignature"] = params.timesignature.strip()
+            if params.duration is not None:
+                try:
+                    duration_value = float(params.duration)
+                    if duration_value > 0:
+                        user_metadata["duration"] = int(duration_value)
+                except (ValueError, TypeError):
+                    pass
+            # Only include caption and language in user_metadata on
+            # regeneration attempts. On the first attempt the LM should
+            # generate/expand these via CoT (matching inference.py behaviour).
+            if attempt > 0:
+                if params.caption and params.caption.strip():
+                    user_metadata["caption"] = params.caption.strip()
+                if params.vocal_language and params.vocal_language not in ("", "unknown"):
+                    user_metadata["language"] = params.vocal_language
+            user_metadata_to_pass = user_metadata if user_metadata else None
+
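The per-attempt `user_metadata` assembly above coerces BPM and duration to positive ints and drops blank or "N/A" strings. The same filtering can be sketched as a self-contained helper (the function name and signature are illustrative only, not part of the committed script):

```python
def build_user_metadata(bpm=None, keyscale=None, timesignature=None, duration=None):
    """Collect only well-formed metadata: positive numeric bpm/duration
    become ints; blank or "n/a" strings are dropped entirely."""
    meta = {}
    for key, value in (("bpm", bpm), ("duration", duration)):
        try:
            number = float(value)
            if number > 0:
                meta[key] = int(number)
        except (ValueError, TypeError):
            pass
    for key, value in (("keyscale", keyscale), ("timesignature", timesignature)):
        # keyscale/timesignature are expected to be strings here
        if value and value.strip() and value.strip().lower() not in ["n/a", ""]:
            meta[key] = value.strip()
    return meta

print(build_user_metadata(bpm="120.0", keyscale=" C minor ", timesignature="N/A", duration=-3))
# -> {'bpm': 120, 'keyscale': 'C minor'}
```

Returning an empty dict maps onto the script's `user_metadata_to_pass = user_metadata if user_metadata else None` step.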
+            lm_result = llm_handler.generate_with_stop_condition(
+                caption=params.caption or "",
+                lyrics=params.lyrics or "",
+                infer_type="llm_dit",
+                temperature=params.lm_temperature,
+                cfg_scale=params.lm_cfg_scale,
+                negative_prompt=params.lm_negative_prompt,
+                top_k=top_k_value,
+                top_p=top_p_value,
+                target_duration=params.duration,
+                user_metadata=user_metadata_to_pass,
+                use_cot_caption=params.use_cot_caption,
+                use_cot_language=params.use_cot_language,
+                use_cot_metas=params.use_cot_metas,
+                use_constrained_decoding=params.use_constrained_decoding,
+                constrained_decoding_debug=config.constrained_decoding_debug,
+                batch_size=actual_batch_size,
+                seeds=actual_seed_list,
+            )
+            lm_extra_time = (lm_result.get("extra_outputs") or {}).get("time_costs", {})
+            if lm_extra_time:
+                lm_time_costs["phase1_time"] += float(lm_extra_time.get("phase1_time", 0.0) or 0.0)
+                lm_time_costs["phase2_time"] += float(lm_extra_time.get("phase2_time", 0.0) or 0.0)
+                lm_time_costs["total_time"] += float(
+                    lm_extra_time.get(
+                        "total_time",
+                        (lm_extra_time.get("phase1_time", 0.0) or 0.0)
+                        + (lm_extra_time.get("phase2_time", 0.0) or 0.0),
+                    )
+                    or 0.0
+                )
+
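The timing code above sums per-attempt LM costs across regeneration attempts, falling back to `phase1 + phase2` when the handler omits `total_time`. Isolated, the update step looks like this (`accumulate_lm_time` is a hypothetical name for illustration):

```python
def accumulate_lm_time(totals, extra):
    """Add one attempt's time_costs dict into the running totals,
    deriving total_time from the two phases when it is missing."""
    if not extra:
        return totals
    phase1 = float(extra.get("phase1_time", 0.0) or 0.0)
    phase2 = float(extra.get("phase2_time", 0.0) or 0.0)
    totals["phase1_time"] += phase1
    totals["phase2_time"] += phase2
    totals["total_time"] += float(extra.get("total_time", phase1 + phase2) or 0.0)
    return totals

totals = {"phase1_time": 0.0, "phase2_time": 0.0, "total_time": 0.0}
accumulate_lm_time(totals, {"phase1_time": 1.5, "phase2_time": 2.5})  # no total_time key
print(totals["total_time"])  # -> 4.0
```

The `or 0.0` guards also cover handlers that report `None` for a timing field.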
+            if not lm_result.get("success", False):
+                error_msg = lm_result.get("error", "Unknown LM error")
+                print(f"\n❌ Generation failed: {error_msg}")
+                print(f"   Status: {lm_result.get('error', '')}")
+                return
+
+            if actual_batch_size > 1:
+                lm_metadata = (lm_result.get("metadata") or [{}])[0]
+                audio_codes = lm_result.get("audio_codes", [])
+            else:
+                lm_metadata = lm_result.get("metadata", {}) or {}
+                audio_codes = lm_result.get("audio_codes", "")
+
+            if audio_codes:
+                params.audio_codes = audio_codes
+            else:
+                print("WARNING: LM did not return audio codes; proceeding without codes.")
+
+            edited_caption = getattr(llm_handler, "_edited_caption", None)
+            edited_lyrics = getattr(llm_handler, "_edited_lyrics", None)
+            edited_instruction = getattr(llm_handler, "_edited_instruction", None)
+            edited_metas = getattr(llm_handler, "_edited_metas", {})
+
+            parsed_duration = None
+            parsed_bpm = None
+            parsed_keyscale = None
+            parsed_timesignature = None
+            parsed_language = None
+            if edited_metas:
+                bpm_value = edited_metas.get("bpm")
+                if bpm_value:
+                    parsed = _parse_number(bpm_value)
+                    if parsed is not None and parsed > 0:
+                        parsed_bpm = int(parsed)
+                duration_value = edited_metas.get("duration")
+                if duration_value:
+                    parsed = _parse_number(duration_value)
+                    if parsed is not None and parsed > 0:
+                        parsed_duration = float(parsed)
+                keyscale_value = edited_metas.get("keyscale")
+                if keyscale_value:
+                    parsed_keyscale = keyscale_value
+                timesignature_value = edited_metas.get("timesignature")
+                if timesignature_value:
+                    parsed_timesignature = timesignature_value
+                language_value = edited_metas.get("language") or edited_metas.get("vocal_language")
+                if language_value:
+                    parsed_language = language_value
+
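`_parse_number` is defined elsewhere in this script; the parsing above only requires that it return a number or `None`. A plausible minimal implementation with that contract (an assumption for illustration, not the committed code):

```python
import re

def parse_number(value):
    """Best-effort numeric parse: accept ints/floats directly and pull the
    first numeric token out of strings like "120 BPM"; return None otherwise."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.search(r"-?\d+(?:\.\d+)?", str(value))
    return float(match.group()) if match else None

print(parse_number("120 BPM"))   # -> 120.0
print(parse_number("N/A"))       # -> None
```

With this contract, the `parsed is not None and parsed > 0` checks above reject both unparsable and non-positive values.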
+            if attempt == 0:
+                duration_changed = parsed_duration is not None and (
+                    original_target_duration is None
+                    or float(original_target_duration) <= 0
+                    or abs(float(original_target_duration) - parsed_duration) > 1e-6
+                )
+                bpm_changed = parsed_bpm is not None and parsed_bpm != original_bpm
+                keyscale_changed = parsed_keyscale is not None and parsed_keyscale != original_keyscale
+                timesignature_changed = parsed_timesignature is not None and parsed_timesignature != original_timesignature
+                language_changed = parsed_language is not None and parsed_language != original_vocal_language
+                if duration_changed or bpm_changed or keyscale_changed or timesignature_changed or language_changed:
+                    if duration_changed:
+                        params.duration = parsed_duration
+                    if bpm_changed:
+                        params.bpm = parsed_bpm
+                    if keyscale_changed:
+                        params.keyscale = parsed_keyscale
+                    if timesignature_changed:
+                        params.timesignature = parsed_timesignature
+                    if language_changed:
+                        params.vocal_language = parsed_language
+                    # Carry forward the expanded caption so the second
+                    # attempt's <think> block (and user_metadata) use it
+                    # instead of the short original caption.
+                    edited_caption_for_regen = edited_metas.get("caption") if edited_metas else None
+                    if edited_caption_for_regen and edited_caption_for_regen.strip():
+                        params.caption = edited_caption_for_regen
+                    print("INFO: Edited metadata detected. Regenerating audio codes with updated values.")
+                    llm_handler._skip_prompt_edit = True
+                    continue
+            break
+
+        edited_meta_caption = edited_metas.get("caption") if edited_metas else None
+        if edited_meta_caption and edited_meta_caption.strip():
+            params.caption = edited_meta_caption
+        elif edited_caption:
+            params.caption = edited_caption
+        elif params.use_cot_caption and lm_metadata.get("caption"):
+            params.caption = lm_metadata.get("caption")
+
+        if edited_lyrics:
+            params.lyrics = edited_lyrics
+        elif not params.lyrics and lm_metadata.get("lyrics"):
+            params.lyrics = lm_metadata.get("lyrics")
+
+        if edited_instruction:
+            params.instruction = edited_instruction
+
+        if edited_metas:
+            bpm_value = edited_metas.get("bpm")
+            if bpm_value:
+                parsed = _parse_number(bpm_value)
+                if parsed is not None:
+                    params.bpm = int(parsed)
+            duration_value = edited_metas.get("duration")
+            if duration_value:
+                parsed = _parse_number(duration_value)
+                if parsed is not None:
+                    params.duration = float(parsed)
+            keyscale_value = edited_metas.get("keyscale")
+            if keyscale_value:
+                params.keyscale = keyscale_value
+            timesignature_value = edited_metas.get("timesignature")
+            if timesignature_value:
+                params.timesignature = timesignature_value
+            language_value = edited_metas.get("language") or edited_metas.get("vocal_language")
+            if language_value:
+                params.vocal_language = language_value
+        else:
+            if params.bpm is None and lm_metadata.get("bpm") not in (None, "N/A", ""):
+                parsed = _parse_number(str(lm_metadata.get("bpm")))
+                if parsed is not None:
+                    params.bpm = int(parsed)
+            if not params.keyscale and lm_metadata.get("keyscale"):
+                params.keyscale = lm_metadata.get("keyscale")
+            if not params.timesignature and lm_metadata.get("timesignature"):
+                params.timesignature = lm_metadata.get("timesignature")
+            if params.duration is None and lm_metadata.get("duration") not in (None, "N/A", ""):
+                parsed = _parse_number(str(lm_metadata.get("duration")))
+                if parsed is not None:
+                    params.duration = float(parsed)
+            if params.vocal_language in (None, "", "unknown"):
+                language_value = lm_metadata.get("vocal_language") or lm_metadata.get("language")
+                if language_value:
+                    params.vocal_language = language_value
+
+        # use_cot_language: override vocal_language with LM detection unless
+        # the user explicitly edited the language in the think block.
+        if params.use_cot_language:
+            edited_lang = (edited_metas.get("language") or edited_metas.get("vocal_language")) if edited_metas else None
+            if not edited_lang:
+                lm_lang = lm_metadata.get("vocal_language") or lm_metadata.get("language")
+                if lm_lang:
+                    params.vocal_language = lm_lang
+
+        # Populate cot_* fields for downstream reporting (mirrors inference.py)
+        if lm_metadata:
+            if original_bpm is None:
+                params.cot_bpm = params.bpm
+            if not original_keyscale:
+                params.cot_keyscale = params.keyscale
+            if not original_timesignature:
+                params.cot_timesignature = params.timesignature
+            if original_target_duration is None or float(original_target_duration) <= 0:
+                params.cot_duration = params.duration
+            if original_vocal_language in (None, "", "unknown"):
+                params.cot_vocal_language = params.vocal_language
+            if not params.caption:
+                params.cot_caption = lm_metadata.get("caption", "")
+            if not params.lyrics:
+                params.cot_lyrics = lm_metadata.get("lyrics", "")
+
+        params.thinking = False
+        params.use_cot_caption = False
+        params.use_cot_language = False
+        params.use_cot_metas = False
+        if hasattr(llm_handler, "_skip_prompt_edit"):
+            llm_handler._skip_prompt_edit = False
+
+        if log_level_upper in {"INFO", "DEBUG"}:
+            _print_dit_prompt(dit_handler, params)
+        print("Running DiT generation with edited prompt and cached audio codes...")
+        result = generate_music(dit_handler, llm_handler, params, config, save_dir=args.save_dir)
+    else:
+        if log_level_upper in {"INFO", "DEBUG"}:
+            _print_dit_prompt(dit_handler, params)
+        result = generate_music(dit_handler, llm_handler, params, config, save_dir=args.save_dir)
+
+    # --- Process Results ---
+    if result.success:
+        print(f"\n✅ Generation successful! {len(result.audios)} audio(s) saved in '{args.save_dir}/'")
+        for i, audio in enumerate(result.audios):
+            print(f"  [{i+1}] Path: {audio['path']} | Seed: {audio['params']['seed']}")
+
+        time_costs = result.extra_outputs.get("time_costs", {})
+        if manual_edit_pipeline and lm_time_costs and time_costs is not None:
+            if not isinstance(time_costs, dict):
+                time_costs = {}
+            result.extra_outputs["time_costs"] = time_costs
+            if lm_time_costs["total_time"] > 0.0:
+                time_costs["lm_phase1_time"] = lm_time_costs["phase1_time"]
+                time_costs["lm_phase2_time"] = lm_time_costs["phase2_time"]
+                time_costs["lm_total_time"] = lm_time_costs["total_time"]
+                dit_total = float(time_costs.get("dit_total_time_cost", 0.0) or 0.0)
+                time_costs["pipeline_total_time"] = time_costs["lm_total_time"] + dit_total
+        if time_costs:
+            print("\n--- Performance ---")
+            total_time = time_costs.get('pipeline_total_time', 0)
+            print(f"Total time: {total_time:.2f}s")
+            if args.thinking:
+                lm1_time = time_costs.get('lm_phase1_time', 0)
+                lm2_time = time_costs.get('lm_phase2_time', 0)
+                print(f"  - LM time: {lm1_time + lm2_time:.2f}s")
+            dit_time = time_costs.get('dit_total_time_cost', 0)
+            print(f"  - DiT time: {dit_time:.2f}s")
+            print("-------------------\n")
+
+    else:
+        print(f"\n❌ Generation failed: {result.error}")
+        print(f"   Status: {result.status_message}")
+
+
+if __name__ == "__main__":
+    main()