OnyxlMunkey and Cursor committed
Commit e961681 · Parent(s): e391f8d

Add ACE-Step 1.5 Docker app

Co-authored-by: Cursor <cursoragent@cursor.com>

(This view is limited to 50 files because the commit contains too many changes.)

Files changed (50)
  1. .claude/skills/acestep-docs/SKILL.md +60 -0
  2. .claude/skills/acestep-docs/api/API.md +746 -0
  3. .claude/skills/acestep-docs/api/Openrouter_API.md +517 -0
  4. .claude/skills/acestep-docs/getting-started/ABOUT.md +87 -0
  5. .claude/skills/acestep-docs/getting-started/README.md +232 -0
  6. .claude/skills/acestep-docs/getting-started/Tutorial.md +964 -0
  7. .claude/skills/acestep-docs/guides/ENVIRONMENT_SETUP.md +542 -0
  8. .claude/skills/acestep-docs/guides/GPU_COMPATIBILITY.md +134 -0
  9. .claude/skills/acestep-docs/guides/GRADIO_GUIDE.md +549 -0
  10. .claude/skills/acestep-docs/guides/INFERENCE.md +1191 -0
  11. .claude/skills/acestep-docs/guides/SCRIPT_CONFIGURATION.md +615 -0
  12. .claude/skills/acestep-docs/guides/UPDATE_AND_BACKUP.md +496 -0
  13. .claude/skills/acestep-lyrics-transcription/SKILL.md +173 -0
  14. .claude/skills/acestep-lyrics-transcription/scripts/acestep-lyrics-transcription.sh +584 -0
  15. .claude/skills/acestep-lyrics-transcription/scripts/config.example.json +14 -0
  16. .claude/skills/acestep-simplemv/SKILL.md +133 -0
  17. .claude/skills/acestep-simplemv/scripts/package-lock.json +0 -0
  18. .claude/skills/acestep-simplemv/scripts/package.json +27 -0
  19. .claude/skills/acestep-simplemv/scripts/remotion.config.ts +4 -0
  20. .claude/skills/acestep-simplemv/scripts/render-mv.sh +123 -0
  21. .claude/skills/acestep-simplemv/scripts/render.mjs +345 -0
  22. .claude/skills/acestep-simplemv/scripts/render.sh +12 -0
  23. .claude/skills/acestep-simplemv/scripts/src/AudioVisualization.tsx +314 -0
  24. .claude/skills/acestep-simplemv/scripts/src/Root.tsx +31 -0
  25. .claude/skills/acestep-simplemv/scripts/src/index.ts +4 -0
  26. .claude/skills/acestep-simplemv/scripts/src/parseLrc.ts +40 -0
  27. .claude/skills/acestep-simplemv/scripts/src/types.ts +32 -0
  28. .claude/skills/acestep-simplemv/scripts/tsconfig.json +18 -0
  29. .claude/skills/acestep-songwriting/SKILL.md +194 -0
  30. .claude/skills/acestep/SKILL.md +253 -0
  31. .claude/skills/acestep/api-reference.md +149 -0
  32. .claude/skills/acestep/scripts/acestep.sh +1093 -0
  33. .claude/skills/acestep/scripts/config.example.json +14 -0
  34. .dockerignore +42 -0
  35. .editorconfig +16 -0
  36. .env.example +78 -0
  37. .github/ISSUE_TEMPLATE/bug_report.md +38 -0
  38. .github/ISSUE_TEMPLATE/feature_request.md +20 -0
  39. .github/copilot-instructions.md +67 -0
  40. .github/workflows/codeql.yml +99 -0
  41. .gitignore +250 -0
  42. AGENTS.md +96 -0
  43. CONTRIBUTING.md +175 -0
  44. Dockerfile +28 -0
  45. README.md +9 -278
  46. SECURITY.md +27 -0
  47. app.py +18 -13
  48. check_update.bat +609 -0
  49. check_update.sh +330 -0
  50. cli.py +1998 -0
.claude/skills/acestep-docs/SKILL.md ADDED
---
name: acestep-docs
description: ACE-Step documentation and troubleshooting. Use when users ask about installing ACE-Step, GPU configuration, model download, Gradio UI usage, API integration, or troubleshooting issues like VRAM problems, CUDA errors, or model loading failures.
allowed-tools: Read, Glob, Grep
---

# ACE-Step Documentation

Documentation skill for the ACE-Step music generation system.

## Quick Reference

### Getting Started
| Document | Description |
|----------|-------------|
| [README.md](getting-started/README.md) | Installation, model download, startup commands |
| [Tutorial.md](getting-started/Tutorial.md) | Getting-started tutorial, best practices |
| [ABOUT.md](getting-started/ABOUT.md) | Project overview, architecture, model zoo |

### Guides
| Document | Description |
|----------|-------------|
| [GRADIO_GUIDE.md](guides/GRADIO_GUIDE.md) | Web UI usage guide |
| [INFERENCE.md](guides/INFERENCE.md) | Inference parameter tuning |
| [GPU_COMPATIBILITY.md](guides/GPU_COMPATIBILITY.md) | GPU/VRAM configuration, hardware recommendations |
| [ENVIRONMENT_SETUP.md](guides/ENVIRONMENT_SETUP.md) | Environment detection, uv installation, python_embeded setup (Windows/Linux/macOS) |
| [SCRIPT_CONFIGURATION.md](guides/SCRIPT_CONFIGURATION.md) | Configuring launch scripts: .bat (Windows) and .sh (Linux/macOS) |
| [UPDATE_AND_BACKUP.md](guides/UPDATE_AND_BACKUP.md) | Git updates, file backup, conflict resolution (all platforms) |

### API (for developers)
| Document | Description |
|----------|-------------|
| [API.md](api/API.md) | REST API documentation |
| [Openrouter_API.md](api/Openrouter_API.md) | OpenRouter API integration |

## Instructions

1. Installation questions → read [getting-started/README.md](getting-started/README.md)
2. General usage / best practices → read [getting-started/Tutorial.md](getting-started/Tutorial.md)
3. Project overview / architecture → read [getting-started/ABOUT.md](getting-started/ABOUT.md)
4. Web UI questions → read [guides/GRADIO_GUIDE.md](guides/GRADIO_GUIDE.md)
5. Inference parameter tuning → read [guides/INFERENCE.md](guides/INFERENCE.md)
6. GPU/VRAM issues → read [guides/GPU_COMPATIBILITY.md](guides/GPU_COMPATIBILITY.md)
7. Environment setup (uv, python_embeded) → read [guides/ENVIRONMENT_SETUP.md](guides/ENVIRONMENT_SETUP.md)
8. Launch script configuration (.bat/.sh) → read [guides/SCRIPT_CONFIGURATION.md](guides/SCRIPT_CONFIGURATION.md)
9. Updates and backup → read [guides/UPDATE_AND_BACKUP.md](guides/UPDATE_AND_BACKUP.md)
10. API development → read [api/API.md](api/API.md) or [api/Openrouter_API.md](api/Openrouter_API.md)

## Common Issues

- **Installation problems**: See getting-started/README.md
- **Insufficient VRAM**: See guides/GPU_COMPATIBILITY.md
- **Model download failures**: See getting-started/README.md or guides/SCRIPT_CONFIGURATION.md
- **uv not found**: See guides/ENVIRONMENT_SETUP.md
- **Environment detection issues**: See guides/ENVIRONMENT_SETUP.md
- **BAT/SH script configuration**: See guides/SCRIPT_CONFIGURATION.md
- **Updates and backups**: See guides/UPDATE_AND_BACKUP.md
- **Update conflicts**: See guides/UPDATE_AND_BACKUP.md
- **Inference quality issues**: See guides/INFERENCE.md
- **Gradio UI not starting**: See guides/GRADIO_GUIDE.md
.claude/skills/acestep-docs/api/API.md ADDED
# ACE-Step API Client Documentation

---

This service provides an HTTP-based asynchronous music generation API.

**Basic Workflow**:
1. Call `POST /release_task` to submit a task and obtain a `task_id`.
2. Call `POST /query_result` to batch query task status until `status` is `1` (succeeded) or `2` (failed).
3. Download audio files via the `GET /v1/audio?path=...` URLs returned in the result.

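The three steps above can be sketched as a minimal polling client. This is an illustrative sketch using only the standard library, not an official client; it assumes a server on `localhost:8001` and the request/response shapes documented below.

```python
import json
import time
import urllib.request

BASE = "http://localhost:8001"  # assumed server address

def post_json(path, payload):
    """POST a JSON payload and return the parsed response envelope."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def generate(prompt, lyrics="", poll_seconds=2.0):
    # Step 1: submit the task and keep the returned task_id.
    task_id = post_json("/release_task",
                        {"prompt": prompt, "lyrics": lyrics})["data"]["task_id"]
    # Step 2: poll until status is 1 (succeeded) or 2 (failed).
    while True:
        entry = post_json("/query_result", {"task_id_list": [task_id]})["data"][0]
        if entry["status"] == 1:
            # Step 3: `result` is a JSON string; its items carry /v1/audio URLs.
            return [item["file"] for item in json.loads(entry["result"])]
        if entry["status"] == 2:
            raise RuntimeError(f"task {task_id} failed")
        time.sleep(poll_seconds)
```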
---

## Table of Contents

- [Authentication](#1-authentication)
- [Response Format](#2-response-format)
- [Task Status Description](#3-task-status-description)
- [Create Generation Task](#4-create-generation-task)
- [Batch Query Task Results](#5-batch-query-task-results)
- [Format Input](#6-format-input)
- [Get Random Sample](#7-get-random-sample)
- [List Available Models](#8-list-available-models)
- [Server Statistics](#9-server-statistics)
- [Download Audio Files](#10-download-audio-files)
- [Health Check](#11-health-check)
- [Environment Variables](#12-environment-variables)

---

## 1. Authentication

The API supports optional API key authentication. When enabled, a valid key must be provided in requests.

### Authentication Methods

Two authentication methods are supported:

**Method A: ai_token in request body**

```json
{
  "ai_token": "your-api-key",
  "prompt": "upbeat pop song",
  ...
}
```

**Method B: Authorization header**

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Authorization: Bearer your-api-key' \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "upbeat pop song"}'
```

### Configuring API Key

Set via environment variable or command-line argument:

```bash
# Environment variable
export ACESTEP_API_KEY=your-secret-key

# Or command-line argument
python -m acestep.api_server --api-key your-secret-key
```

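In client code, the header method can be attached per request; a small standard-library sketch (the key and prompt values are placeholders):

```python
import json
import urllib.request

def authed_request(url, payload, api_key=None):
    """Build a JSON POST request, optionally with Bearer authentication.

    Method B: pass the key in the Authorization header. Method A would
    instead place it in the body as {"ai_token": api_key, ...}.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers
    )

req = authed_request("http://localhost:8001/release_task",
                     {"prompt": "upbeat pop song"}, api_key="your-api-key")
```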
---

## 2. Response Format

All API responses use a unified wrapper format:

```json
{
  "data": { ... },
  "code": 200,
  "error": null,
  "timestamp": 1700000000000,
  "extra": null
}
```

| Field | Type | Description |
| :--- | :--- | :--- |
| `data` | any | Actual response data |
| `code` | int | Status code (200 = success) |
| `error` | string | Error message (null on success) |
| `timestamp` | int | Response timestamp (milliseconds) |
| `extra` | any | Extra information (usually null) |

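Since every endpoint shares this envelope, a client can unwrap it in one place. A sketch, using only the field names in the table above:

```python
def unwrap(envelope: dict):
    """Return `data` from the standard wrapper, raising on API-level errors."""
    if envelope.get("code") != 200 or envelope.get("error"):
        raise RuntimeError(
            f"API error {envelope.get('code')}: {envelope.get('error')}"
        )
    return envelope["data"]

ok = unwrap({"data": {"task_id": "abc"}, "code": 200, "error": None,
             "timestamp": 1700000000000, "extra": None})
# ok == {"task_id": "abc"}
```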
---

## 3. Task Status Description

Task status (`status`) is represented as an integer:

| Status Code | Status Name | Description |
| :--- | :--- | :--- |
| `0` | queued/running | Task is queued or in progress |
| `1` | succeeded | Generation succeeded, result is ready |
| `2` | failed | Generation failed |

---

## 4. Create Generation Task

### 4.1 API Definition

- **URL**: `/release_task`
- **Method**: `POST`
- **Content-Type**: `application/json`, `multipart/form-data`, or `application/x-www-form-urlencoded`

### 4.2 Request Parameters

#### Parameter Naming Convention

The API accepts both **snake_case** and **camelCase** names for most parameters. For example:
- `audio_duration` / `duration` / `audioDuration`
- `key_scale` / `keyscale` / `keyScale`
- `time_signature` / `timesignature` / `timeSignature`
- `sample_query` / `sampleQuery` / `description` / `desc`
- `use_format` / `useFormat` / `format`

Additionally, metadata can be passed in a nested object (`metas`, `metadata`, or `user_metadata`).

#### Method A: JSON Request (application/json)

Suitable for passing only text parameters, or for referencing audio file paths that already exist on the server.

**Basic Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `prompt` | string | `""` | Music description prompt (alias: `caption`) |
| `lyrics` | string | `""` | Lyrics content |
| `thinking` | bool | `false` | Whether to use the 5Hz LM to generate audio codes (lm-dit behavior) |
| `vocal_language` | string | `"en"` | Lyrics language (en, zh, ja, etc.) |
| `audio_format` | string | `"mp3"` | Output format (mp3, wav, flac) |

**Sample/Description Mode Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `sample_mode` | bool | `false` | Enable random sample generation mode (auto-generates caption/lyrics/metas via LM) |
| `sample_query` | string | `""` | Natural language description for sample generation (e.g., "a soft Bengali love song"). Aliases: `description`, `desc` |
| `use_format` | bool | `false` | Use the LM to enhance/format the provided caption and lyrics. Alias: `format` |

**Multi-Model Support**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `model` | string | null | Select which DiT model to use (e.g., `"acestep-v15-turbo"`, `"acestep-v15-turbo-shift3"`). Use `/v1/models` to list available models. If not specified, the default model is used. |

**thinking Semantics (Important)**:

- `thinking=false`:
  - The server will **NOT** use the 5Hz LM to generate `audio_code_string`.
  - DiT runs in **text2music** mode and **ignores** any provided `audio_code_string`.
- `thinking=true`:
  - The server will use the 5Hz LM to generate `audio_code_string` (lm-dit behavior).
  - DiT runs with LM-generated codes for enhanced music quality.

**Metadata Auto-Completion (Conditional)**:

When `use_cot_caption=true`, `use_cot_language=true`, or metadata fields are missing, the server may call the 5Hz LM to fill the missing fields based on `caption`/`lyrics`:

- `bpm`
- `key_scale`
- `time_signature`
- `audio_duration`

User-provided values always win; the LM only fills fields that are empty or missing.

**Music Attribute Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `bpm` | int | null | Specify tempo (BPM), range 30-300 |
| `key_scale` | string | `""` | Key/scale (e.g., "C Major", "Am"). Aliases: `keyscale`, `keyScale` |
| `time_signature` | string | `""` | Time signature (2, 3, 4, 6 for 2/4, 3/4, 4/4, 6/8). Aliases: `timesignature`, `timeSignature` |
| `audio_duration` | float | null | Generation duration (seconds), range 10-600. Aliases: `duration`, `target_duration` |

**Audio Codes (Optional)**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `audio_code_string` | string or string[] | `""` | Audio semantic tokens (5Hz) for `llm_dit`. Alias: `audioCodeString` |

**Generation Control Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `inference_steps` | int | `8` | Number of inference steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). |
| `guidance_scale` | float | `7.0` | Prompt guidance coefficient. Only effective for the base model. |
| `use_random_seed` | bool | `true` | Whether to use a random seed |
| `seed` | int | `-1` | Specify a seed (when `use_random_seed=false`) |
| `batch_size` | int | `2` | Batch generation count (max 8) |

**Advanced DiT Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `shift` | float | `3.0` | Timestep shift factor (range 1.0-5.0). Only effective for base models, not turbo models. |
| `infer_method` | string | `"ode"` | Diffusion inference method: `"ode"` (Euler, faster) or `"sde"` (stochastic). |
| `timesteps` | string | null | Custom timesteps as comma-separated values (e.g., `"0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0"`). Overrides `inference_steps` and `shift`. |
| `use_adg` | bool | `false` | Use Adaptive Dual Guidance (base model only) |
| `cfg_interval_start` | float | `0.0` | CFG application start ratio (0.0-1.0) |
| `cfg_interval_end` | float | `1.0` | CFG application end ratio (0.0-1.0) |

**5Hz LM Parameters (Optional, server-side)**:

These parameters control 5Hz LM sampling, which is used for metadata auto-completion and (when `thinking=true`) codes generation.

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `lm_model_path` | string | null | 5Hz LM checkpoint dir name (e.g. `acestep-5Hz-lm-0.6B`) |
| `lm_backend` | string | `"vllm"` | `vllm` or `pt` |
| `lm_temperature` | float | `0.85` | Sampling temperature |
| `lm_cfg_scale` | float | `2.5` | CFG scale (>1 enables CFG) |
| `lm_negative_prompt` | string | `"NO USER INPUT"` | Negative prompt used by CFG |
| `lm_top_k` | int | null | Top-k (0/null disables) |
| `lm_top_p` | float | `0.9` | Top-p (values >= 1 are treated as disabled) |
| `lm_repetition_penalty` | float | `1.0` | Repetition penalty |

**LM CoT (Chain-of-Thought) Parameters**:

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `use_cot_caption` | bool | `true` | Let the LM rewrite/enhance the input caption via CoT reasoning. Aliases: `cot_caption`, `cot-caption` |
| `use_cot_language` | bool | `true` | Let the LM detect vocal language via CoT. Aliases: `cot_language`, `cot-language` |
| `constrained_decoding` | bool | `true` | Enable FSM-based constrained decoding for structured LM output. Aliases: `constrainedDecoding`, `constrained` |
| `constrained_decoding_debug` | bool | `false` | Enable debug logging for constrained decoding |
| `allow_lm_batch` | bool | `true` | Allow LM batch processing for efficiency |

**Edit/Reference Audio Parameters** (require absolute paths on the server):

| Parameter Name | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `reference_audio_path` | string | null | Reference audio path (Style Transfer) |
| `src_audio_path` | string | null | Source audio path (Repainting/Cover) |
| `task_type` | string | `"text2music"` | Task type: `text2music`, `cover`, `repaint`, `lego`, `extract`, `complete` |
| `instruction` | string | auto | Edit instruction (auto-generated based on `task_type` if not provided) |
| `repainting_start` | float | `0.0` | Repainting start time (seconds) |
| `repainting_end` | float | null | Repainting end time (seconds), -1 for end of audio |
| `audio_cover_strength` | float | `1.0` | Cover strength (0.0-1.0). Lower values (0.2) for style transfer. |

#### Method B: File Upload (multipart/form-data)

Use this method when you need to upload local audio files as reference or source audio.

In addition to all the fields above (sent as form fields), the following file fields are supported:

- `reference_audio` or `ref_audio`: (file) reference audio upload
- `src_audio` or `ctx_audio`: (file) source audio upload

> **Note**: When files are uploaded, the corresponding `_path` parameters are ignored and the system uses the temporary path of the uploaded file.

### 4.3 Response Example

```json
{
  "data": {
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "queue_position": 1
  },
  "code": 200,
  "error": null,
  "timestamp": 1700000000000,
  "extra": null
}
```

### 4.4 Usage Examples (cURL)

**Basic JSON Method**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "upbeat pop song",
    "lyrics": "Hello world",
    "inference_steps": 8
  }'
```

**With thinking=true (LM generates codes + fills missing metas)**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "upbeat pop song",
    "lyrics": "Hello world",
    "thinking": true,
    "lm_temperature": 0.85,
    "lm_cfg_scale": 2.5
  }'
```

**Description-driven generation (sample_query)**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "sample_query": "a soft Bengali love song for a quiet evening",
    "thinking": true
  }'
```

**With format enhancement (use_format=true)**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "pop rock",
    "lyrics": "[Verse 1]\nWalking down the street...",
    "use_format": true,
    "thinking": true
  }'
```

**Select a specific model**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "electronic dance music",
    "model": "acestep-v15-turbo",
    "thinking": true
  }'
```

**With custom timesteps**:

```bash
curl -X POST http://localhost:8001/release_task \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "jazz piano trio",
    "timesteps": "0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0",
    "thinking": true
  }'
```

**File Upload Method**:

```bash
curl -X POST http://localhost:8001/release_task \
  -F "prompt=remix this song" \
  -F "src_audio=@/path/to/local/song.mp3" \
  -F "task_type=repaint"
```

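The file-upload call above can also be made from Python. Building the multipart body by hand keeps the sketch standard-library only; the boundary format follows multipart/form-data conventions, and the field names (`src_audio`, `task_type`) are the ones documented above:

```python
import io
import urllib.request
import uuid

def multipart_body(fields: dict, files: dict):
    """Build a multipart/form-data body (stdlib only).

    fields: name -> text value; files: name -> (filename, bytes).
    Returns (boundary, body_bytes).
    """
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(f'--{boundary}\r\nContent-Disposition: form-data; '
                  f'name="{name}"\r\n\r\n{value}\r\n'.encode())
    for name, (filename, data) in files.items():
        buf.write(f'--{boundary}\r\nContent-Disposition: form-data; '
                  f'name="{name}"; filename="{filename}"\r\n'
                  f'Content-Type: application/octet-stream\r\n\r\n'.encode())
        buf.write(data + b"\r\n")
    buf.write(f"--{boundary}--\r\n".encode())
    return boundary, buf.getvalue()

boundary, body = multipart_body(
    {"prompt": "remix this song", "task_type": "repaint"},
    {"src_audio": ("song.mp3", b"...audio bytes...")},  # placeholder bytes
)
req = urllib.request.Request(
    "http://localhost:8001/release_task", data=body,
    headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
)
```

In practice a third-party HTTP client (e.g. the `requests` package) does this encoding for you; the sketch only shows what goes over the wire.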
---

## 5. Batch Query Task Results

### 5.1 API Definition

- **URL**: `/query_result`
- **Method**: `POST`
- **Content-Type**: `application/json` or `application/x-www-form-urlencoded`

### 5.2 Request Parameters

| Parameter Name | Type | Description |
| :--- | :--- | :--- |
| `task_id_list` | string (JSON array) or array | List of task IDs to query |

### 5.3 Response Example

```json
{
  "data": [
    {
      "task_id": "550e8400-e29b-41d4-a716-446655440000",
      "status": 1,
      "result": "[{\"file\": \"/v1/audio?path=...\", \"wave\": \"\", \"status\": 1, \"create_time\": 1700000000, \"env\": \"development\", \"prompt\": \"upbeat pop song\", \"lyrics\": \"Hello world\", \"metas\": {\"bpm\": 120, \"duration\": 30, \"genres\": \"\", \"keyscale\": \"C Major\", \"timesignature\": \"4\"}, \"generation_info\": \"...\", \"seed_value\": \"12345,67890\", \"lm_model\": \"acestep-5Hz-lm-0.6B\", \"dit_model\": \"acestep-v15-turbo\"}]"
    }
  ],
  "code": 200,
  "error": null,
  "timestamp": 1700000000000,
  "extra": null
}
```

**Result Field Description** (`result` is a JSON string; once parsed, it contains):

| Field | Type | Description |
| :--- | :--- | :--- |
| `file` | string | Audio file URL (use with the `/v1/audio` endpoint) |
| `wave` | string | Waveform data (usually empty) |
| `status` | int | Status code (0=in progress, 1=success, 2=failed) |
| `create_time` | int | Creation time (Unix timestamp) |
| `env` | string | Environment identifier |
| `prompt` | string | Prompt used |
| `lyrics` | string | Lyrics used |
| `metas` | object | Metadata (bpm, duration, genres, keyscale, timesignature) |
| `generation_info` | string | Generation info summary |
| `seed_value` | string | Seed values used (comma-separated) |
| `lm_model` | string | LM model name used |
| `dit_model` | string | DiT model name used |

### 5.4 Usage Example

```bash
curl -X POST http://localhost:8001/query_result \
  -H 'Content-Type: application/json' \
  -d '{
    "task_id_list": ["550e8400-e29b-41d4-a716-446655440000"]
  }'
```

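Note that `result` arrives as a JSON *string*, so it needs a second parse after the envelope itself. A small sketch with a trimmed-down entry shaped like the response example above:

```python
import json

# A trimmed /query_result entry (fields as documented above).
entry = {
    "task_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": 1,
    "result": json.dumps([{
        "file": "/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3",
        "status": 1,
        "metas": {"bpm": 120, "duration": 30, "keyscale": "C Major"},
        "seed_value": "12345,67890",
    }]),
}

urls, seeds = [], []
if entry["status"] == 1:
    items = json.loads(entry["result"])  # second parse: string -> list of dicts
    urls = [item["file"] for item in items]
    # seed_value is a comma-separated string, one seed per generated track
    seeds = [int(s) for s in items[0]["seed_value"].split(",")]
```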
424
+
425
+ ## 6. Format Input
426
+
427
+ ### 6.1 API Definition
428
+
429
+ - **URL**: `/format_input`
430
+ - **Method**: `POST`
431
+
432
+ This endpoint uses LLM to enhance and format user-provided caption and lyrics.
433
+
434
+ ### 6.2 Request Parameters
435
+
436
+ | Parameter Name | Type | Default | Description |
437
+ | :--- | :--- | :--- | :--- |
438
+ | `prompt` | string | `""` | Music description prompt |
439
+ | `lyrics` | string | `""` | Lyrics content |
440
+ | `temperature` | float | `0.85` | LM sampling temperature |
441
+ | `param_obj` | string (JSON) | `"{}"` | JSON object containing metadata (duration, bpm, key, time_signature, language) |
442
+
443
+ ### 6.3 Response Example
444
+
445
+ ```json
446
+ {
447
+ "data": {
448
+ "caption": "Enhanced music description",
449
+ "lyrics": "Formatted lyrics...",
450
+ "bpm": 120,
451
+ "key_scale": "C Major",
452
+ "time_signature": "4",
453
+ "duration": 180,
454
+ "vocal_language": "en"
455
+ },
456
+ "code": 200,
457
+ "error": null,
458
+ "timestamp": 1700000000000,
459
+ "extra": null
460
+ }
461
+ ```
462
+
463
+ ### 6.4 Usage Example
464
+
465
+ ```bash
466
+ curl -X POST http://localhost:8001/format_input \
467
+ -H 'Content-Type: application/json' \
468
+ -d '{
469
+ "prompt": "pop rock",
470
+ "lyrics": "Walking down the street",
471
+ "param_obj": "{\"duration\": 180, \"language\": \"en\"}"
472
+ }'
473
+ ```
474
+
475
+ ---
476
+
477
+ ## 7. Get Random Sample
478
+
479
+ ### 7.1 API Definition
480
+
481
+ - **URL**: `/create_random_sample`
482
+ - **Method**: `POST`
483
+
484
+ This endpoint returns random sample parameters from pre-loaded example data for form filling.
485
+
486
+ ### 7.2 Request Parameters
487
+
488
+ | Parameter Name | Type | Default | Description |
489
+ | :--- | :--- | :--- | :--- |
490
+ | `sample_type` | string | `"simple_mode"` | Sample type: `"simple_mode"` or `"custom_mode"` |
491
+
492
+ ### 7.3 Response Example
493
+
494
+ ```json
495
+ {
496
+ "data": {
497
+ "caption": "Upbeat pop song with guitar accompaniment",
498
+ "lyrics": "[Verse 1]\nSunshine on my face...",
499
+ "bpm": 120,
500
+ "key_scale": "G Major",
501
+ "time_signature": "4",
502
+ "duration": 180,
503
+ "vocal_language": "en"
504
+ },
505
+ "code": 200,
506
+ "error": null,
507
+ "timestamp": 1700000000000,
508
+ "extra": null
509
+ }
510
+ ```
511
+
512
+ ### 7.4 Usage Example
513
+
514
+ ```bash
515
+ curl -X POST http://localhost:8001/create_random_sample \
516
+ -H 'Content-Type: application/json' \
517
+ -d '{"sample_type": "simple_mode"}'
518
+ ```
519
+
520
+ ---
521
+
522
+ ## 8. List Available Models
523
+
524
+ ### 8.1 API Definition
525
+
526
+ - **URL**: `/v1/models`
527
+ - **Method**: `GET`
528
+
529
+ Returns a list of available DiT models loaded on the server.
530
+
531
+ ### 8.2 Response Example
532
+
533
+ ```json
534
+ {
535
+ "data": {
536
+ "models": [
537
+ {
538
+ "name": "acestep-v15-turbo",
539
+ "is_default": true
540
+ },
541
+ {
542
+ "name": "acestep-v15-turbo-shift3",
543
+ "is_default": false
544
+ }
545
+ ],
546
+ "default_model": "acestep-v15-turbo"
547
+ },
548
+ "code": 200,
549
+ "error": null,
550
+ "timestamp": 1700000000000,
551
+ "extra": null
552
+ }
553
+ ```
554
+
555
+ ### 8.3 Usage Example
556
+
557
+ ```bash
558
+ curl http://localhost:8001/v1/models
559
+ ```
560
+
561
+ ---
562
+
563
+ ## 9. Server Statistics
564
+
565
+ ### 9.1 API Definition
566
+
567
+ - **URL**: `/v1/stats`
568
+ - **Method**: `GET`
569
+
570
+ Returns server runtime statistics.
571
+
572
+ ### 9.2 Response Example
573
+
574
+ ```json
575
+ {
576
+ "data": {
577
+ "jobs": {
578
+ "total": 100,
579
+ "queued": 5,
580
+ "running": 1,
581
+ "succeeded": 90,
582
+ "failed": 4
583
+ },
584
+ "queue_size": 5,
585
+ "queue_maxsize": 200,
586
+ "avg_job_seconds": 8.5
587
+ },
588
+ "code": 200,
589
+ "error": null,
590
+ "timestamp": 1700000000000,
591
+ "extra": null
592
+ }
593
+ ```
594
+
595
+ ### 9.3 Usage Example
596
+
597
+ ```bash
598
+ curl http://localhost:8001/v1/stats
599
+ ```
600
+
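These statistics can drive simple client-side load decisions; for instance, a rough queue-wait estimate from the fields above (illustrative only, not a server-provided value):

```python
def estimated_wait_seconds(stats: dict) -> float:
    """Rough wait estimate: queued jobs times the rolling average job time."""
    return stats["queue_size"] * stats["avg_job_seconds"]

# With the example payload above: 5 queued jobs * 8.5 s/job
wait = estimated_wait_seconds({"queue_size": 5, "avg_job_seconds": 8.5})
# wait == 42.5
```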
---

## 10. Download Audio Files

### 10.1 API Definition

- **URL**: `/v1/audio`
- **Method**: `GET`

Download generated audio files by path.

### 10.2 Request Parameters

| Parameter Name | Type | Description |
| :--- | :--- | :--- |
| `path` | string | URL-encoded path to the audio file |

### 10.3 Usage Example

```bash
# Download using the URL from the task result
curl "http://localhost:8001/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3" -o output.mp3
```

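The `file` value returned by `/query_result` is already a server-relative `/v1/audio?...` URL, so a client only needs to prefix the base URL. A standard-library sketch using the example path above:

```python
import urllib.parse

BASE = "http://localhost:8001"
file_url = "/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3"  # from /query_result

# urljoin keeps the percent-encoded `path` query parameter intact.
full_url = urllib.parse.urljoin(BASE, file_url)
# full_url == "http://localhost:8001/v1/audio?path=%2Ftmp%2Fapi_audio%2Fabc123.mp3"
# urllib.request.urlretrieve(full_url, "output.mp3") would then save the file.
```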
625
+ ---
626
+
627
+ ## 11. Health Check
628
+
629
+ ### 11.1 API Definition
630
+
631
+ - **URL**: `/health`
632
+ - **Method**: `GET`
633
+
634
+ Returns service health status.
635
+
636
+ ### 11.2 Response Example
637
+
638
+ ```json
639
+ {
640
+ "data": {
641
+ "status": "ok",
642
+ "service": "ACE-Step API",
643
+ "version": "1.0"
644
+ },
645
+ "code": 200,
646
+ "error": null,
647
+ "timestamp": 1700000000000,
648
+ "extra": null
649
+ }
650
+ ```
651
+
652
+ ---
653
+
654
+ ## 12. Environment Variables
655
+
656
The API server can be configured using environment variables:

### Server Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_API_HOST` | `127.0.0.1` | Server bind host |
| `ACESTEP_API_PORT` | `8001` | Server bind port |
| `ACESTEP_API_KEY` | (empty) | API authentication key (empty disables auth) |
| `ACESTEP_API_WORKERS` | `1` | API worker thread count |

### Model Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_CONFIG_PATH` | `acestep-v15-turbo` | Primary DiT model path |
| `ACESTEP_CONFIG_PATH2` | (empty) | Secondary DiT model path (optional) |
| `ACESTEP_CONFIG_PATH3` | (empty) | Third DiT model path (optional) |
| `ACESTEP_DEVICE` | `auto` | Device for model loading |
| `ACESTEP_USE_FLASH_ATTENTION` | `true` | Enable flash attention |
| `ACESTEP_OFFLOAD_TO_CPU` | `false` | Offload models to CPU when idle |
| `ACESTEP_OFFLOAD_DIT_TO_CPU` | `false` | Offload the DiT specifically to CPU |

### LM Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_INIT_LLM` | auto | Whether to initialize the LM at startup (auto-determined from the GPU) |
| `ACESTEP_LM_MODEL_PATH` | `acestep-5Hz-lm-0.6B` | Default 5Hz LM model |
| `ACESTEP_LM_BACKEND` | `vllm` | LM backend (`vllm` or `pt`) |
| `ACESTEP_LM_DEVICE` | (same as `ACESTEP_DEVICE`) | Device for the LM |
| `ACESTEP_LM_OFFLOAD_TO_CPU` | `false` | Offload the LM to CPU |

### Queue Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_QUEUE_MAXSIZE` | `200` | Maximum queue size |
| `ACESTEP_QUEUE_WORKERS` | `1` | Number of queue workers |
| `ACESTEP_AVG_JOB_SECONDS` | `5.0` | Initial estimate of the average job duration |
| `ACESTEP_AVG_WINDOW` | `50` | Window size for averaging job duration |

### Cache Configuration

| Variable | Default | Description |
| :--- | :--- | :--- |
| `ACESTEP_TMPDIR` | `.cache/acestep/tmp` | Temporary file directory |
| `TRITON_CACHE_DIR` | `.cache/acestep/triton` | Triton cache directory |
| `TORCHINDUCTOR_CACHE_DIR` | `.cache/acestep/torchinductor` | TorchInductor cache directory |

---
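
As a worked example of the variable tables above, a low-VRAM deployment with authentication enabled might be configured like this before starting the server. The specific values are illustrative, not recommendations:

```shell
# Hypothetical example values combining the variables documented above.
export ACESTEP_API_HOST=0.0.0.0          # bind on all interfaces
export ACESTEP_API_PORT=8001
export ACESTEP_API_KEY=sk-example-key    # any non-empty value enables auth
export ACESTEP_OFFLOAD_TO_CPU=true       # low-VRAM: offload idle models
export ACESTEP_QUEUE_MAXSIZE=100         # reject new work earlier under load
# Then start the server as usual, e.g.: uv run acestep-api
echo "API will bind ${ACESTEP_API_HOST}:${ACESTEP_API_PORT}"
```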

## Error Handling

**HTTP Status Codes**:

- `200`: Success
- `400`: Invalid request (bad JSON, missing fields)
- `401`: Unauthorized (missing or invalid API key)
- `404`: Resource not found
- `415`: Unsupported Content-Type
- `429`: Server busy (queue is full)
- `500`: Internal server error

**Error Response Format**:

```json
{
  "detail": "Error message describing the issue"
}
```
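
Of the codes above, `429` deserves explicit client-side handling: it signals a full queue rather than a hard failure, so a client can back off and retry. A minimal sketch; the `send` callable and the retry/backoff values are illustrative, not part of the API:

```python
import time

def post_with_retry(send, payload, retries=3, backoff=1.0):
    """Retry a request while the server answers 429 (queue full).

    `send` is any callable that takes the payload and returns an object
    with a `status_code` attribute, e.g. a small wrapper around
    requests.post against your server URL.
    """
    resp = None
    for attempt in range(retries):
        resp = send(payload)
        if resp.status_code != 429:
            return resp
        # Exponential backoff between attempts while the queue drains
        time.sleep(backoff * (2 ** attempt))
    return resp
```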

---

## Best Practices

1. **Use `thinking=true`** for the best-quality results with LM-enhanced generation.

2. **Use `sample_query`/`description`** for quick generation from natural-language descriptions.

3. **Use `use_format=true`** when you already have a caption/lyrics but want the LM to enhance them.

4. **Batch task-status queries** with the `/query_result` endpoint, which can report on multiple tasks at once.

5. **Check `/v1/stats`** to understand server load and average job time.

6. **Use multi-model support** by setting the `ACESTEP_CONFIG_PATH2` and `ACESTEP_CONFIG_PATH3` environment variables, then selecting with the `model` parameter.

7. **For production**, set `ACESTEP_API_KEY` to enable authentication and secure your API.

8. **For low-VRAM environments**, enable `ACESTEP_OFFLOAD_TO_CPU=true` to support longer audio generation.
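
Practice 4 can be packaged as a tiny client helper. Note that the request schema is an assumption for illustration (this section only names the endpoint); verify the actual field name against the `/query_result` reference before relying on it:

```python
import json

def build_batch_query(task_ids):
    """Build a single batch status query for POST /query_result.

    NOTE: the "task_ids" field name is assumed for illustration;
    check the endpoint reference for the real schema.
    """
    # De-duplicate so each task is queried once per poll
    return {"task_ids": sorted(set(task_ids))}

payload = build_batch_query(["t2", "t1", "t2"])
print(json.dumps(payload))
```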
.claude/skills/acestep-docs/api/Openrouter_API.md ADDED
@@ -0,0 +1,517 @@
# ACE-Step OpenRouter API Documentation

> OpenAI Chat Completions-compatible API for AI music generation

**Base URL:** `http://{host}:{port}` (default `http://127.0.0.1:8002`)

---

## Table of Contents

- [Authentication](#authentication)
- [Endpoints](#endpoints)
  - [POST /v1/chat/completions - Generate Music](#1-generate-music)
  - [GET /api/v1/models - List Models](#2-list-models)
  - [GET /health - Health Check](#3-health-check)
- [Input Modes](#input-modes)
- [Streaming Responses](#streaming-responses)
- [Examples](#examples)
- [Error Codes](#error-codes)

---

## Authentication

If the server is configured with an API key (via the `OPENROUTER_API_KEY` environment variable or the `--api-key` CLI flag), all requests must include the following header:

```
Authorization: Bearer <your-api-key>
```

No authentication is required when no API key is configured.
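
Client code can mirror this optional-auth behaviour with a small header helper; the helper itself is a client-side convenience, not part of the API:

```python
import os

def auth_headers(api_key=None):
    """Build request headers, attaching Authorization only when a key exists.

    Mirrors the server behaviour above: auth is optional unless a key
    is configured (passed explicitly or via OPENROUTER_API_KEY).
    """
    headers = {"Content-Type": "application/json"}
    key = api_key or os.environ.get("OPENROUTER_API_KEY", "")
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return headers
```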

---

## Endpoints

### 1. Generate Music

**POST** `/v1/chat/completions`

Generates music from chat messages and returns audio data along with LM-generated metadata.

#### Request Parameters

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | No | `"acemusic/acestep-v1.5-turbo"` | Model ID |
| `messages` | array | **Yes** | - | Chat message list. See [Input Modes](#input-modes) |
| `stream` | boolean | No | `false` | Enable streaming response. See [Streaming Responses](#streaming-responses) |
| `temperature` | float | No | `0.85` | LM sampling temperature |
| `top_p` | float | No | `0.9` | LM nucleus sampling parameter |
| `lyrics` | string | No | `""` | Lyrics passed directly (takes priority over lyrics parsed from messages) |
| `duration` | float | No | `null` | Audio duration in seconds. If omitted, determined automatically by the LM |
| `bpm` | integer | No | `null` | Beats per minute. If omitted, determined automatically by the LM |
| `vocal_language` | string | No | `"en"` | Vocal language code (e.g. `"zh"`, `"en"`, `"ja"`) |
| `instrumental` | boolean | No | `false` | Whether to generate instrumental-only music (no vocals) |
| `thinking` | boolean | No | `false` | Enable LLM thinking mode for deeper reasoning |
| `use_cot_metas` | boolean | No | `true` | Auto-generate BPM, duration, key, time signature via Chain-of-Thought |
| `use_cot_caption` | boolean | No | `true` | Rewrite/enhance the music description via Chain-of-Thought |
| `use_cot_language` | boolean | No | `true` | Auto-detect vocal language via Chain-of-Thought |
| `use_format` | boolean | No | `true` | When prompt/lyrics are provided directly, enhance them via LLM formatting |

> **Note on LM parameters:** `use_format` applies when the user provides an explicit prompt/lyrics (tagged or lyrics mode) and enhances the description and lyrics formatting via the LLM. The `use_cot_*` parameters control Phase 1 CoT reasoning during the audio generation stage. When `use_format` or sample mode has already generated a duration, `use_cot_metas` is skipped automatically to avoid redundancy.

#### messages Format

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Your input content"
    }
  ]
}
```

Set `role` to `"user"` and `content` to the text input. The system automatically determines the input mode based on the content. See [Input Modes](#input-modes) for details.

---

#### Non-Streaming Response (`stream: false`)

```json
{
  "id": "chatcmpl-a1b2c3d4e5f6g7h8",
  "object": "chat.completion",
  "created": 1706688000,
  "model": "acemusic/acestep-v1.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "## Metadata\n**Caption:** Upbeat pop song...\n**BPM:** 120\n**Duration:** 30s\n**Key:** C major\n\n## Lyrics\n[Verse 1]\nHello world...",
        "audio": [
          {
            "type": "audio_url",
            "audio_url": {
              "url": "data:audio/mpeg;base64,SUQzBAAAAAAAI1RTU0UAAAA..."
            }
          }
        ]
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 100,
    "total_tokens": 110
  }
}
```

**Response Fields:**

| Field | Description |
|---|---|
| `choices[0].message.content` | Text information generated by the LM, including Metadata (Caption, BPM, Duration, Key, Time Signature, Language) and Lyrics. Returns `"Music generated successfully."` if the LM was not involved |
| `choices[0].message.audio` | Audio data array. Each item contains `type` (`"audio_url"`) and `audio_url.url` (a Base64 Data URL in the format `data:audio/mpeg;base64,...`) |
| `choices[0].finish_reason` | `"stop"` indicates normal completion |

**Decoding Audio:**

The `audio_url.url` value is a Data URL: `data:audio/mpeg;base64,<base64_data>`

Extract the base64 portion after the comma and decode it to get the MP3 file:

```python
import base64

url = response["choices"][0]["message"]["audio"][0]["audio_url"]["url"]
# Strip the "data:audio/mpeg;base64," prefix
b64_data = url.split(",", 1)[1]
audio_bytes = base64.b64decode(b64_data)

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```

```javascript
const url = response.choices[0].message.audio[0].audio_url.url;
const b64Data = url.split(",")[1];
const binaryString = atob(b64Data); // binary string; wrap in Uint8Array for raw bytes
// Or use the Data URL directly in an <audio> element
const audio = new Audio(url);
audio.play();
```

---

### 2. List Models

**GET** `/api/v1/models`

Returns available model information.

#### Response

```json
{
  "data": [
    {
      "id": "acemusic/acestep-v1.5-turbo",
      "name": "ACE-Step",
      "created": 1706688000,
      "description": "High-performance text-to-music generation model...",
      "input_modalities": ["text"],
      "output_modalities": ["audio"],
      "context_length": 4096,
      "pricing": {
        "prompt": "0",
        "completion": "0",
        "request": "0"
      },
      "supported_sampling_parameters": ["temperature", "top_p"]
    }
  ]
}
```

---

### 3. Health Check

**GET** `/health`

#### Response

```json
{
  "status": "ok",
  "service": "ACE-Step OpenRouter API",
  "version": "1.0"
}
```

---

## Input Modes

The system automatically selects the input mode based on the content of the last `user` message:

### Mode 1: Tagged Mode (Recommended)

Use `<prompt>` and `<lyrics>` tags to explicitly specify the music description and lyrics:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "<prompt>A gentle acoustic ballad in C major, 80 BPM, female vocal</prompt>\n<lyrics>[Verse 1]\nSunlight through the window\nA brand new day begins\n\n[Chorus]\nWe are the dreamers\nWe are the light</lyrics>"
    }
  ]
}
```

- `<prompt>...</prompt>` - Music style/scene description (caption)
- `<lyrics>...</lyrics>` - Lyrics content
- Either tag can be used alone
- When `use_format=true`, the LLM automatically enhances both prompt and lyrics
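
When composing tagged content programmatically, a small helper keeps the tag pairing correct; it is a client-side convenience, not part of the API:

```python
def tagged_content(prompt=None, lyrics=None):
    """Compose a tagged-mode message body; either part may be omitted."""
    parts = []
    if prompt:
        parts.append(f"<prompt>{prompt}</prompt>")
    if lyrics:
        parts.append(f"<lyrics>{lyrics}</lyrics>")
    return "\n".join(parts)

# Example message in tagged mode:
message = {
    "role": "user",
    "content": tagged_content(
        prompt="A gentle acoustic ballad in C major, 80 BPM, female vocal",
        lyrics="[Verse 1]\nSunlight through the window",
    ),
}
```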

### Mode 2: Natural Language Mode (Sample Mode)

Describe the desired music in natural language. The system uses the LLM to generate the prompt and lyrics automatically:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Generate an upbeat pop song about summer and travel"
    }
  ]
}
```

**Trigger condition:** Message content contains no tags and does not resemble lyrics (no `[Verse]`/`[Chorus]` markers, few lines, or long single lines).

### Mode 3: Lyrics-Only Mode

Pass in lyrics with structural markers directly. The system identifies them automatically:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "[Verse 1]\nWalking down the street\nFeeling the beat\n\n[Chorus]\nDance with me tonight\nUnder the moonlight"
    }
  ]
}
```

**Trigger condition:** Message content contains `[Verse]`, `[Chorus]`, or similar markers, or has a multi-line short-text structure.

### Instrumental Mode

Set `instrumental: true` or use `[inst]` as the lyrics:

```json
{
  "instrumental": true,
  "messages": [
    {
      "role": "user",
      "content": "<prompt>Epic orchestral cinematic score, dramatic and powerful</prompt>"
    }
  ]
}
```

---

## Streaming Responses

Set `"stream": true` to enable SSE (Server-Sent Events) streaming.

### Event Format

Each event starts with `data: `, followed by JSON, and ends with a double newline `\n\n`:

```
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{...},"finish_reason":null}]}

```

### Streaming Event Sequence

| Phase | Delta Content | Description |
|---|---|---|
| 1. Initialization | `{"role":"assistant","content":""}` | Establishes the connection |
| 2. LM Content (optional) | `{"content":"## Metadata\n..."}` | Metadata and lyrics generated by the LM |
| 3. Heartbeat | `{"content":"."}` | Sent every 2 seconds during audio generation to keep the connection alive |
| 4. Audio Data | `{"audio":[{"type":"audio_url","audio_url":{"url":"data:..."}}]}` | The generated audio |
| 5. Finish | `finish_reason: "stop"` | Generation complete |
| 6. Termination | `data: [DONE]` | End-of-stream marker |

### Streaming Response Example

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"content":"\n\n## Metadata\n**Caption:** Upbeat pop\n**BPM:** 120"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"content":"."},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{"audio":[{"type":"audio_url","audio_url":{"url":"data:audio/mpeg;base64,..."}}]},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v1.5-turbo","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

```

### Client-Side Streaming Handling

```python
import json
import httpx

with httpx.stream("POST", "http://127.0.0.1:8002/v1/chat/completions", json={
    "messages": [{"role": "user", "content": "Generate a cheerful guitar piece"}],
    "stream": True
}) as response:
    content_parts = []
    audio_url = None

    for line in response.iter_lines():
        if not line or not line.startswith("data: "):
            continue
        if line == "data: [DONE]":
            break

        chunk = json.loads(line[6:])
        delta = chunk["choices"][0]["delta"]

        if "content" in delta and delta["content"]:
            # Note: the 2-second keep-alive heartbeats also arrive here as "." chunks
            content_parts.append(delta["content"])

        if "audio" in delta and delta["audio"]:
            audio_url = delta["audio"][0]["audio_url"]["url"]

        if chunk["choices"][0].get("finish_reason") == "stop":
            print("Generation complete!")

print("Content:", "".join(content_parts))
if audio_url:
    import base64
    b64_data = audio_url.split(",", 1)[1]
    with open("output.mp3", "wb") as f:
        f.write(base64.b64decode(b64_data))
```

```javascript
const response = await fetch("http://127.0.0.1:8002/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Generate a cheerful guitar piece" }],
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let audioUrl = null;
let content = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const text = decoder.decode(value);
  for (const line of text.split("\n")) {
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;

    const chunk = JSON.parse(line.slice(6));
    const delta = chunk.choices[0].delta;

    if (delta.content) content += delta.content; // includes "." heartbeats
    if (delta.audio) audioUrl = delta.audio[0].audio_url.url;
  }
}

// audioUrl can be used directly as <audio src="...">
```

---

## Examples

### Example 1: Natural Language Generation (Simplest Usage)

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "A soft folk song about hometown and memories"}
    ],
    "vocal_language": "en"
  }'
```

### Example 2: Tagged Mode with Specific Parameters

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Energetic EDM track with heavy bass drops and synth leads</prompt><lyrics>[Verse 1]\nFeel the rhythm in your soul\nLet the music take control\n\n[Drop]\n(instrumental break)</lyrics>"
      }
    ],
    "bpm": 128,
    "duration": 60,
    "vocal_language": "en"
  }'
```

### Example 3: Instrumental with LM Enhancement Disabled

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Peaceful piano solo, slow tempo, jazz harmony</prompt>"
      }
    ],
    "instrumental": true,
    "use_format": false,
    "use_cot_caption": false,
    "duration": 45
  }'
```

### Example 4: Streaming Request

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "messages": [
      {"role": "user", "content": "Generate a happy birthday song"}
    ],
    "stream": true
  }'
```

### Example 5: Full Control with All Parameters

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Dreamy lo-fi hip hop beat with vinyl crackle</prompt><lyrics>[inst]</lyrics>"
      }
    ],
    "temperature": 0.9,
    "top_p": 0.95,
    "bpm": 85,
    "duration": 30,
    "instrumental": true,
    "thinking": false,
    "use_cot_metas": true,
    "use_cot_caption": true,
    "use_cot_language": false,
    "use_format": true
  }'
```

---

## Error Codes

| HTTP Status | Description |
|---|---|
| 400 | Invalid request format or missing valid input |
| 401 | Missing or invalid API key |
| 500 | Internal error during music generation |
| 503 | Model not yet initialized |

Error response format:

```json
{
  "detail": "Error description message"
}
```

---

## Server Configuration (Environment Variables)

The following environment variables can be used to configure the server (for operations reference):

| Variable | Default | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | None | API authentication key |
| `OPENROUTER_HOST` | `127.0.0.1` | Listen address |
| `OPENROUTER_PORT` | `8002` | Listen port |
| `ACESTEP_CONFIG_PATH` | `acestep-v15-turbo` | DiT model configuration path |
| `ACESTEP_DEVICE` | `auto` | Inference device |
| `ACESTEP_LM_MODEL_PATH` | `acestep-5Hz-lm-0.6B` | LLM model path |
| `ACESTEP_LM_BACKEND` | `vllm` | LLM inference backend |
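
For example, exposing the server on a different host and port with authentication enabled might look like this; the values are illustrative, not recommendations:

```shell
# Hypothetical deployment values; adjust for your environment.
export OPENROUTER_HOST=0.0.0.0          # listen on all interfaces
export OPENROUTER_PORT=9000
export OPENROUTER_API_KEY=sk-example-key
# Then start the server process as usual for your install.
echo "OpenRouter API on ${OPENROUTER_HOST}:${OPENROUTER_PORT}"
```
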
.claude/skills/acestep-docs/getting-started/ABOUT.md ADDED
@@ -0,0 +1,87 @@
# ACE-Step Project Overview

> For installation instructions, see [README.md](README.md)

## Links

- [Project Page](https://ace-step.github.io/ace-step-v1.5.github.io/)
- [Hugging Face](https://huggingface.co/ACE-Step/Ace-Step1.5)
- [ModelScope](https://modelscope.cn/models/ACE-Step/Ace-Step1.5)
- [Space Demo](https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5)
- [Discord](https://discord.gg/PeWDxrkdj7)
- [Technical Report](https://arxiv.org/abs/2602.00744)

## Abstract

ACE-Step v1.5 is a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. Key highlights:

- Quality beyond most commercial music models
- Under 2 seconds per full song on A100, under 10 seconds on RTX 3090
- Runs locally with less than 4GB of VRAM
- Supports lightweight LoRA personalization from just a few songs

The architecture combines a Language Model (LM) as an omni-capable planner with a Diffusion Transformer (DiT). The LM transforms simple user queries into comprehensive song blueprints, scaling from short loops to 10-minute compositions.

## Features

### Performance
- **Ultra-Fast Generation** — Under 2s per full song on A100
- **Flexible Duration** — 10 seconds to 10 minutes (600s)
- **Batch Generation** — Up to 8 songs simultaneously

### Generation Quality
- **Commercial-Grade Output** — Between Suno v4.5 and Suno v5
- **Rich Style Support** — 1000+ instruments and styles
- **Multi-Language Lyrics** — 50+ languages

### Capabilities

| Feature | Description |
|---------|-------------|
| Reference Audio Input | Use reference audio to guide style |
| Cover Generation | Create covers from existing audio |
| Repaint & Edit | Selective local audio editing |
| Track Separation | Separate into individual stems |
| Vocal2BGM | Auto-generate accompaniment |
| Metadata Control | Duration, BPM, key/scale, time signature |
| Simple Mode | Full songs from simple descriptions |
| LoRA Training | 8 songs, 1 hour on 3090 (12GB VRAM) |

## Architecture

The system uses a hybrid LM + DiT architecture:
- **LM (Language Model)**: Plans metadata, lyrics, and captions via Chain-of-Thought
- **DiT (Diffusion Transformer)**: Generates audio from the LM's blueprint

## Model Zoo

### DiT Models

| Model | Steps | Quality | Diversity | HuggingFace |
|-------|:-----:|:-------:|:---------:|-------------|
| `acestep-v15-base` | 50 | Medium | High | [Link](https://huggingface.co/ACE-Step/acestep-v15-base) |
| `acestep-v15-sft` | 50 | High | Medium | [Link](https://huggingface.co/ACE-Step/acestep-v15-sft) |
| `acestep-v15-turbo` | 8 | Very High | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |

### LM Models

| Model | Audio Understanding | Composition | HuggingFace |
|-------|:------------------:|:-----------:|-------------|
| `acestep-5Hz-lm-0.6B` | Medium | Medium | [Link](https://huggingface.co/ACE-Step/acestep-5Hz-lm-0.6B) |
| `acestep-5Hz-lm-1.7B` | Medium | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |
| `acestep-5Hz-lm-4B` | Strong | Strong | [Link](https://huggingface.co/ACE-Step/acestep-5Hz-lm-4B) |

## License

This project is licensed under [MIT](https://github.com/ACE-Step/ACE-Step-1.5/blob/main/LICENSE).

## Citation

```BibTeX
@misc{gong2026acestep,
  title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
  year={2026}
}
```
.claude/skills/acestep-docs/getting-started/README.md ADDED
@@ -0,0 +1,232 @@
# ACE-Step Installation Guide

## Requirements

- Python 3.11
- CUDA GPU recommended (works on CPU/MPS/MLX but slower)

## Installation

### Windows Portable Package (Recommended for Windows)

1. Download and extract: [ACE-Step-1.5.7z](https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z)
2. Requirements: CUDA 12.8
3. The package includes `python_embeded` with all dependencies pre-installed

**Quick Start:**
```bash
# Launch Gradio Web UI (CUDA)
start_gradio_ui.bat

# Launch REST API Server (CUDA)
start_api_server.bat

# Launch Gradio Web UI (AMD ROCm)
start_gradio_ui_rocm.bat

# Launch REST API Server (AMD ROCm)
start_api_server_rocm.bat
```

### Launch Scripts (All Platforms)

Ready-to-use launch scripts with auto environment detection, update checking, and uv auto-install.

**Windows (.bat):**
```bash
start_gradio_ui.bat        # Gradio Web UI (CUDA)
start_api_server.bat       # REST API Server (CUDA)
start_gradio_ui_rocm.bat   # Gradio Web UI (AMD ROCm)
start_api_server_rocm.bat  # REST API Server (AMD ROCm)
```

**Linux (.sh):**
```bash
chmod +x start_gradio_ui.sh start_api_server.sh  # First time only
./start_gradio_ui.sh   # Gradio Web UI (CUDA)
./start_api_server.sh  # REST API Server (CUDA)
```

**macOS Apple Silicon (.sh):**
```bash
chmod +x start_gradio_ui_macos.sh start_api_server_macos.sh  # First time only
./start_gradio_ui_macos.sh   # Gradio Web UI (MLX backend)
./start_api_server_macos.sh  # REST API Server (MLX backend)
```

All launch scripts support:
- Startup update check (enabled by default, configurable)
- Auto environment detection (`python_embeded` or `uv`)
- Auto install of `uv` if needed
- Configurable download source (HuggingFace/ModelScope)
- Customizable language, models, and parameters

See [SCRIPT_CONFIGURATION.md](../guides/SCRIPT_CONFIGURATION.md) for configuration details.

**Manual Launch (Using Python Directly):**
```bash
# Gradio Web UI
python_embeded\python.exe acestep\acestep_v15_pipeline.py  # Windows portable
python acestep/acestep_v15_pipeline.py                     # Linux/macOS

# REST API Server
python_embeded\python.exe acestep\api_server.py  # Windows portable
python acestep/api_server.py                     # Linux/macOS
```

### Standard Installation (All Platforms)

**1. Install uv (Package Manager)**
```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

**2. Clone & Install**
```bash
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync
```

**3. Launch**

**Using uv:**
```bash
# Gradio Web UI (http://localhost:7860)
uv run acestep

# REST API Server (http://localhost:8001)
uv run acestep-api
```

**Using Python directly:**

> **Note:** Make sure to activate your Python environment first:
> - **Conda environment**: Run `conda activate your_env_name` first
> - **venv**: Run `source venv/bin/activate` (Linux/Mac) or `venv\Scripts\activate` (Windows) first
> - **System Python**: Use `python` or `python3` directly

```bash
# Gradio Web UI
python acestep/acestep_v15_pipeline.py

# REST API Server
python acestep/api_server.py
```

## Model Download

Models are automatically downloaded on first run. Manual download options:

### Download Source Configuration

ACE-Step supports multiple download sources:

| Source | Description |
|--------|-------------|
| **auto** (default) | Auto-detect best source based on network |
| **modelscope** | Use ModelScope as download source |
| **huggingface** | Use HuggingFace Hub as download source |

**Using uv:**
```bash
# Download main model
uv run acestep-download

# Download from ModelScope
uv run acestep-download --download-source modelscope

# Download from HuggingFace Hub
uv run acestep-download --download-source huggingface

# Download all models
uv run acestep-download --all

# List available models
uv run acestep-download --list
```

**Using Python directly:**

> **Note:** Replace `python` with your environment's Python executable:
> - Windows portable package: `python_embeded\python.exe`
> - Conda/venv: Activate environment first, then use `python`
> - System: Use `python` or `python3`

```bash
# Download main model
python -m acestep.model_downloader

# Download from ModelScope
python -m acestep.model_downloader --download-source modelscope

# Download from HuggingFace Hub
python -m acestep.model_downloader --download-source huggingface

# Download all models
python -m acestep.model_downloader --all

# List available models
python -m acestep.model_downloader --list
```

### GPU VRAM Recommendations

| GPU VRAM | Recommended LM Model | Notes |
|----------|---------------------|-------|
| ≤6GB | None (DiT only) | LM disabled to save memory |
| 6-12GB | `acestep-5Hz-lm-0.6B` | Lightweight, good balance |
| 12-16GB | `acestep-5Hz-lm-1.7B` | Better quality |
| ≥16GB | `acestep-5Hz-lm-4B` | Best quality |
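
The table above pairs with the `--lm_model_path` option listed under Command Line Options; for example, a 12-16GB card might select the mid-sized LM explicitly (the model name comes straight from the table):

```shell
# Match the LM to available VRAM, e.g. for a 12-16GB GPU:
uv run acestep --lm_model_path acestep-5Hz-lm-1.7B
# On <=6GB cards, skip the LM and run the DiT alone.
```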

## Command Line Options

### Gradio UI (`acestep`)

| Option | Default | Description |
|--------|---------|-------------|
| `--port` | 7860 | Server port |
| `--server-name` | 127.0.0.1 | Server address (`0.0.0.0` for network) |
| `--share` | false | Create public Gradio link |
| `--language` | en | UI language: `en`, `zh`, `ja` |
| `--init_service` | false | Auto-initialize models on startup |
| `--config_path` | auto | DiT model name |
| `--lm_model_path` | auto | LM model name |
| `--offload_to_cpu` | auto | CPU offload (auto if VRAM < 16GB) |
| `--download-source` | auto | Model download source: `auto`, `huggingface`, or `modelscope` |
| `--enable-api` | false | Enable REST API endpoints |
| `--api-key` | none | API authentication key |

**Examples:**

> **Note for Python users:** Replace `python` with your environment's Python executable (see the note in the Launch section above).

```bash
# Public access with Chinese UI
uv run acestep --server-name 0.0.0.0 --share --language zh
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --server-name 0.0.0.0 --share --language zh

# Pre-initialize models
uv run acestep --init_service true --config_path acestep-v15-turbo
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --init_service true --config_path acestep-v15-turbo

# Enable API with authentication
uv run acestep --enable-api --api-key sk-your-secret-key
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --enable-api --api-key sk-your-secret-key

# Use ModelScope as download source
uv run acestep --download-source modelscope
# Or using Python directly:
python acestep/acestep_v15_pipeline.py --download-source modelscope
```

### REST API Server (`acestep-api`)

Same options as Gradio UI. See [API documentation](../api/API.md) for endpoints.
.claude/skills/acestep-docs/getting-started/Tutorial.md ADDED
@@ -0,0 +1,964 @@
1
+ # ACE-Step 1.5 Ultimate Guide (Must Read)
2
+
3
+ ---
4
+
5
+ Hello everyone, I'm Gong Junmin, the developer of ACE-Step. Through this tutorial, I'll guide you through the design philosophy and usage of ACE-Step 1.5.
6
+
7
+ ## Mental Models
8
+
9
+ Before we begin, we need to establish the correct mental models to set proper expectations.
10
+
11
+ ### Human-Centered Design
12
+
13
+ This model is not designed for **one-click generation**, but for **human-centered generation**.
14
+
15
+ Understanding this distinction is crucial.
16
+
17
+ ### What is One-Click Generation?
18
+
19
+ You input a prompt, click generate, listen to a few versions, pick one that sounds good, and use it. If someone else inputs the same prompt, they'll likely get similar results.
20
+
21
+ In this mode, you and AI have a **client-vendor** relationship. You come with a clear purpose, with a vague expectation in mind, hoping AI delivers a product close to that expectation. Essentially, it's not much different from searching on Google or finding songs on Spotify—just with a bit more customization.
22
+
23
+ AI is a service, not a creative inspirer.
24
+
25
+ Suno, Udio, MiniMax, Mureka—these platforms are all designed with this philosophy. They can scale up models as services to ensure delivery. Your generated music is bound by their agreements; you can't run it locally, can't fine-tune for personalized exploration; if they secretly change models or terms, you can only accept it.
26
+
27
+ ### What is Human-Centered Generation?
28
+
29
+ If we weaken the AI layer and strengthen the human layer—letting more human will, creativity, and inspiration give life to AI—this is human-centered generation.
30
+
31
+ Unlike the strong purposefulness of one-click generation, human-centered generation has more of a **playful** nature. It's more like an interactive game where you and the model are **collaborators**.
32
+
33
+ The workflow is like this: you throw out some inspiration seeds, get a few songs, choose interesting directions from them to continue iterating—
34
+ - Adjust prompts to regenerate
35
+ - Use **Cover** to maintain structure and adjust details
36
+ - Use **Repaint** for local modifications
37
+ - Use **Add Layer** to add or remove instrument layers
38
+
39
+ At this point, AI is not a servant to you, but an **inspirer**.
40
+
41
+ ### What Conditions Must This Design Meet?
42
+
43
+ For human-centered generation to truly work, the model must meet several key conditions:
44
+
45
+ **First, it must be open-source, locally runnable, and trainable.**
46
+
47
+ This isn't technical purism, but a matter of ownership. When you use closed-source platforms, you don't own the model, and your generated works are bound by their agreements. Version updates, term changes, service shutdowns—none of these are under your control.
48
+
49
+ But when the model is open-source and locally runnable, everything changes: **You forever own this model, and you forever own all the creations you make with it.** No third-party agreement hassles, no platform risks, you can fine-tune, modify, and build your own creative system based on it. Your works will forever belong to you. It's like buying an instrument—you can use it anytime, anywhere, and adjust it anytime, anywhere.
50
+
51
+ **Second, it must be fast.**
52
+
53
+ Human time is precious, but more importantly—**slow generation breaks flow state**.
54
+
55
+ The core of human-centered workflow is the rapid cycle of "try, listen, adjust." If each generation takes minutes, your inspiration dissipates while waiting, and the "play" experience degrades into the "wait" ordeal.
56
+
57
+ Therefore, we specifically optimized ACE-Step for this: while ensuring quality, we made generation fast enough to support a smooth human-machine dialogue rhythm.
58
+
59
+ ### Finite Game vs Infinite Game
60
+
61
+ One-click generation is a **finite game**—clear goals, result-oriented, ends at the finish line. To some extent, it coldly hollows out the music industry, replacing many people's jobs.
62
+
63
+ Human-centered generation is an **infinite game**—because the fun lies in the process, and the process never ends.
64
+
65
+ Our vision is to democratize AI music generation. Let ACE-Step become a big toy in your pocket, let music return to **Play** itself—the creative "play," not just clicking play.
66
+
67
+ ---
68
+
69
+ ## The Elephant Rider Metaphor
70
+
71
+ > Recommended reading: [The Complete Guide to Mastering Suno](https://www.notion.so/The-Complete-Guide-to-Mastering-Suno-Advanced-Strategies-for-Professional-Music-Generation-2d6ae744ebdf8024be42f6645f884221)—this blog tutorial can help you establish the foundational understanding of AI music.
72
+
73
+ AI music generation is like the famous **elephant rider metaphor** in psychology.
74
+
75
+ Consciousness rides on the subconscious, humans ride on elephants. You can give directions, but you can't make the elephant precisely and instantly execute every command. It has its own inertia, its own temperament, its own will.
76
+
77
+ This elephant is the music generation model.
78
+
79
+ ### The Iceberg Model
80
+
81
+ Between audio and semantics lies a hidden iceberg.
82
+
83
+ What we can describe with language—style, instruments, timbre, emotion, scenes, progression, lyrics, vocal style—these are familiar words, the parts we can touch. But together, they're still just a tiny tip of the audio iceberg above the water.
84
+
85
+ What's the most precise control? You input the expected audio, and the model returns it unchanged.
86
+
87
+ But as long as you're using text descriptions, references, prompts—the model will have room to play. This isn't a bug, it's the nature of things.
88
+
89
+ ### What is the Elephant?
90
+
91
+ This elephant is a fusion of countless elements: data distribution, model scale, algorithm design, annotation bias, evaluation bias—**it's an abstract crystallization of human music history and engineering trade-offs.**
92
+
93
+ Any deviation in these elements will cause it to fail to accurately reflect your taste and expectations.
94
+
95
+ Of course, we can expand data scale, improve algorithm efficiency, increase annotation precision, expand model capacity, introduce more professional evaluation systems—these are all directions we can optimize as model developers.
96
+
97
+ But even if one day we achieve technical "perfection," there's still a fundamental problem we can't avoid: **taste.**
98
+
99
+ ### Taste and Expectations
100
+
101
+ Taste varies from person to person.
102
+
103
+ If a music generation model tries to please all listeners, its output will tend toward the popular average of human music history—**this will be extremely mediocre.**
104
+
105
+ It's humans who give sound meaning, emotion, experience, life, and cultural symbolic value. It's a small group of artists who create unique tastes, then drive ordinary people to consume and follow, turning niche into mainstream popularity. These pioneering minority artists become legends.
106
+
107
+ So when you find the model's output "not to your taste," this might not be the model's problem—**but rather your taste happens to be outside that "average."** This is a good thing.
108
+
109
+ This means: **You need to learn to guide this elephant, not expect it to automatically understand you.**
110
+
111
+ ---
112
+
113
+ ## Knowing the Elephant Herd: Model Architecture and Selection
114
+
115
+ Now you understand the "elephant" metaphor. But actually—
116
+
117
+ **This isn't one elephant, but an entire herd—elephants large and small, forming a family.** 🐘🐘🐘🐘
118
+
119
+ ### Architecture Principles: Two Brains
120
+
121
+ ACE-Step 1.5 uses a **hybrid architecture** with two core components working together:
122
+
123
+ ```
124
+ User Input → [5Hz LM] → Semantic Blueprint → [DiT] → Audio
125
+
126
+ Metadata Inference
127
+ Caption Optimization
128
+ Structure Planning
129
+ ```
130
+
131
+ **5Hz LM (Language Model) — Planner (Optional)**
132
+
133
+ The LM is an "omni-capable planner" responsible for understanding your intent and making plans:
134
+ - Infers music metadata (BPM, key, duration, etc.) through **Chain-of-Thought**
135
+ - Optimizes and expands your caption—understanding and supplementing your intent
136
+ - Generates **semantic codes**—implicitly containing composition melody, orchestration, and some timbre information
137
+
138
+ The LM learns **world knowledge** from training data. It's a planner that improves usability and helps you quickly generate prototypes.
139
+
140
+ **But the LM is not required.**
141
+
142
+ If you're very clear about what you want, or already have a clear planning goal—you can completely skip the LM planning step by not using `thinking` mode.
143
+
144
+ For example, in **Cover mode**, you use reference audio to constrain composition, chords, and structure, letting DiT generate directly. Here, **you replace the LM's work**—you become the planner yourself.
145
+
146
+ Another example: in **Repaint mode**, you use reference audio as context, constraining timbre, mixing, and details, letting DiT directly adjust locally. Here, DiT is more like your creative brainstorming partner, helping with creative ideation and fixing local disharmony.
147
+
148
+ **DiT (Diffusion Transformer) — Executor**
149
+
150
+ DiT is the "audio craftsman," responsible for turning plans into reality:
151
+ - Receives semantic codes and conditions generated by LM
152
+ - Gradually "carves" audio from noise through the **diffusion process**
153
+ - Decides final timbre, mixing, details
154
+
155
+ **Why this design?**
156
+
157
+ Traditional methods let diffusion models generate audio directly from text, but text-to-audio mapping is too vague. ACE-Step introduces LM as an intermediate layer:
158
+ - LM excels at understanding semantics and planning
159
+ - DiT excels at generating high-fidelity audio
160
+ - They work together, each doing their part
161
+
162
+ ### Choosing the Planner: LM Models
163
+
164
+ LM has four options: **No LM** (disable thinking mode), **0.6B**, **1.7B**, **4B**.
165
+
166
+ Their training data is completely identical; the difference is purely in **knowledge capacity**:
167
+ - Larger models have richer world knowledge
168
+ - Larger models have stronger memory (e.g., remembering reference audio melodies)
169
+ - Larger models perform relatively better on long-tail styles or instruments
170
+
171
+ | Choice | Speed | World Knowledge | Memory | Use Cases |
172
+ |--------|:-----:|:---------------:|:------:|-----------|
173
+ | No LM | ⚡⚡⚡⚡ | — | — | You do the planning (e.g., Cover mode) |
174
+ | `0.6B` | ⚡⚡⚡ | Basic | Weak | Low VRAM (< 8GB), rapid prototyping |
175
+ | `1.7B` | ⚡⚡ | Medium | Medium | **Default recommendation** |
176
+ | `4B` | ⚡ | Rich | Strong | Complex tasks, high-quality generation |
177
+
178
+ **How to choose?**
179
+
180
+ Based on your hardware:
181
+ - **VRAM < 8GB** → No LM or `0.6B`
182
+ - **VRAM 8–16GB** → `1.7B` (default)
183
+ - **VRAM > 16GB** → `1.7B` or `4B`
184
+
185
+ ### Choosing the Executor: DiT Models
186
+
187
+ With a planning scheme, you still need to choose an executor. DiT is the core of ACE-Step 1.5—it handles various tasks and decides how to interpret LM-generated codes.
188
+
189
+ We've open-sourced **4 Turbo models**, **1 SFT model**, and **1 Base model**.
190
+
191
+ #### Turbo Series (Recommended for Daily Use)
192
+
193
+ Turbo models are trained with distillation, generating high-quality audio in just 8 steps. The core difference between the four variants is the **shift hyperparameter configuration during distillation**.
194
+
195
+ **What is shift?**
196
+
197
+ Shift determines the "attention allocation" during DiT denoising:
198
+ - **Larger shift** → More effort spent on early denoising (building large structure from pure noise), **stronger semantics**, clearer overall framework
199
+ - **Smaller shift** → More even step distribution, **more details**, but details might also be noise
200
+
201
+ Simple understanding: high shift is like "draw outline first then fill details," low shift is like "draw and fix simultaneously."
202
+
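To make this concrete: one common way flow-matching samplers implement a shift factor is to warp a uniform timestep grid with `t' = shift * t / (1 + (shift - 1) * t)`. Whether ACE-Step uses exactly this formula is an assumption, but it illustrates why a larger shift concentrates the 8 steps at high noise levels, where the large structure is decided:

```python
def shifted_schedule(num_steps: int, shift: float) -> list[float]:
    """Warp a uniform (1 -> 0) timestep grid; t=1 is pure noise, t=0 is clean audio."""
    ts = [1 - i / num_steps for i in range(num_steps)]      # uniform: 1.0, 0.875, ...
    return [shift * t / (1 + (shift - 1) * t) for t in ts]  # shift=1 leaves it unchanged

uniform = shifted_schedule(8, shift=1)  # even spacing: equal effort at every noise level
shifted = shifted_schedule(8, shift=3)  # values pushed toward 1: structural steps dominate
```

With `shift=3`, the midpoint `t=0.5` maps to `0.75`, so more of the step budget is spent "drawing the outline" before any detail work.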
203
+ | Model | Distillation Config | Characteristics |
204
+ |-------|---------------------|-----------------|
205
+ | `turbo` (default) | Joint distillation on shift 1, 2, 3 | **Best balance of creativity and semantics**, thoroughly tested, recommended first choice |
206
+ | `turbo-shift1` | Distilled only on shift=1 | Richer details, but semantics weaker |
207
+ | `turbo-shift3` | Distilled only on shift=3 | Clearer, richer timbre, but may sound "dry," minimal orchestration |
208
+ | `turbo-continuous` | Experimental, supports continuous shift 1–5 | Most flexible tuning, but not thoroughly tested |
209
+
210
+ You can choose based on target music style—you might find you prefer a certain variant. **We recommend starting with default turbo**—it's the most balanced and proven choice.
211
+
212
+ #### SFT Model
213
+
214
+ Compared to Turbo, SFT model has two notable features:
215
+ - **Supports CFG** (Classifier-Free Guidance), allowing fine-tuning of prompt adherence
216
+ - **More steps** (50 steps), giving the model more time to "think"
217
+
218
+ The trade-off: more steps accumulate error, so audio clarity may be slightly inferior to Turbo. But its **detail expression and semantic parsing are better**.
219
+
220
+ If you don't care about inference time, like tuning CFG and steps, and prefer that rich detail feel—SFT is a good choice. LM-generated codes can also work with SFT models.
221
+
222
+ #### Base Model
223
+
224
+ Base is the **master of all tasks**, with three exclusive tasks beyond SFT and Turbo:
225
+
226
+ | Task | Description |
227
+ |------|-------------|
228
+ | `extract` | Extract single tracks from mixed audio (e.g., separate vocals) |
229
+ | `lego` | Add new tracks to existing tracks (e.g., add drums to guitar) |
230
+ | `complete` | Add mixed accompaniment to single track (e.g., add guitar+drums accompaniment to vocals) |
231
+
232
+ Additionally, Base has the **strongest plasticity**. If you have large-scale fine-tuning needs, we recommend starting experiments with Base to train your own SFT model.
233
+
234
+ #### Creating Your Custom Model
235
+
236
+ Beyond official models, you can also use **LoRA fine-tuning** to create your custom model.
237
+
238
+ We'll release an example LoRA model—trained on 20+ "Happy New Year" themed songs, specifically suited for expressing festive atmosphere. This is just a starting point.
239
+
240
+ **What does a custom model mean?**
241
+
242
+ You can reshape DiT's capabilities and preferences with your own data recipe:
243
+ - Like a specific timbre style? Train with that type of songs
244
+ - Want the model better at a certain genre? Collect related data for fine-tuning
245
+ - Have your own unique aesthetic taste? "Teach" it to the model
246
+
247
+ This greatly expands **customization and playability**—train a model unique to you with your aesthetic taste.
248
+
249
+ > For detailed LoRA training guide, see the "LoRA Training" tab in Gradio UI.
250
+
251
+ #### DiT Selection Summary
252
+
253
+ | Model | Steps | CFG | Speed | Exclusive Tasks | Recommended Scenarios |
254
+ |-------|:-----:|:---:|:-----:|-----------------|----------------------|
255
+ | `turbo` (default) | 8 | ❌ | ⚡⚡⚡ | — | Daily use, rapid iteration |
256
+ | `sft` | 50 | ✅ | ⚡ | — | Pursuing details, like tuning |
257
+ | `base` | 50 | ✅ | ⚡ | extract, lego, complete | Special tasks, large-scale fine-tuning |
258
+
259
+ ### Combination Strategies
260
+
261
+ Default configuration is **turbo + 1.7B LM**, suitable for most scenarios.
262
+
263
+ | Need | Recommended Combination |
264
+ |------|------------------------|
265
+ | Fastest speed | `turbo` + No LM or `0.6B` |
266
+ | Daily use | `turbo` + `1.7B` (default) |
267
+ | Pursuing details | `sft` + `1.7B` or `4B` |
268
+ | Special tasks | `base` |
269
+ | Large-scale fine-tuning | `base` |
270
+ | Low VRAM (< 4GB) | `turbo` + No LM + CPU offload |
271
+
272
+ ### Downloading Models
273
+
274
+ ```bash
275
+ # Download default models (turbo + 1.7B LM)
276
+ uv run acestep-download
277
+
278
+ # Download all models
279
+ uv run acestep-download --all
280
+
281
+ # Download specific model
282
+ uv run acestep-download --model acestep-v15-base
283
+ uv run acestep-download --model acestep-5Hz-lm-0.6B
284
+
285
+ # List available models
286
+ uv run acestep-download --list
287
+ ```
288
+
289
+ Download models into a `checkpoints` folder so they are easy to locate.
290
+
291
+ ---
292
+
293
+ ## Guiding the Elephant: What Can You Control?
294
+
295
+ Now that you know this herd of elephants, let's learn how to communicate with them.
296
+
297
+ Each generation is determined by three types of factors: **input control**, **inference hyperparameters**, and **random factors**.
298
+
299
+ ### I. Input Control: What Do You Want?
300
+
301
+ This is the part where you communicate "creative intent" with the model—what kind of music you want to generate.
302
+
303
+ | Category | Parameter | Function |
304
+ |----------|-----------|----------|
305
+ | **Task Type** | `task_type` | Determines generation mode: text2music, cover, repaint, lego, extract, complete |
306
+ | **Text Input** | `caption` | Description of overall music elements: style, instruments, emotion, atmosphere, timbre, vocal gender, progression, etc. |
307
+ | | `lyrics` | Temporal element description: lyric content, music structure evolution, vocal changes, vocal/instrument performance style, start/end style, articulation, etc. (use `[Instrumental]` for instrumental music) |
308
+ | **Music Metadata** | `bpm` | Tempo (30–300) |
309
+ | | `keyscale` | Key (e.g., C Major, Am) |
310
+ | | `timesignature` | Time signature (4/4, 3/4, 6/8) |
311
+ | | `vocal_language` | Vocal language |
312
+ | | `duration` | Target duration (seconds) |
313
+ | **Audio Reference** | `reference_audio` | Global reference for timbre or style (for cover, style transfer) |
314
+ | | `src_audio` | Source audio for non-text2music tasks (text2music defaults to silence, no input needed) |
315
+ | | `audio_codes` | Semantic codes input to model in Cover mode (advanced: reuse codes for variants, convert songs to codes for extension, combine like DJ mixing) |
316
+ | **Interval Control** | `repainting_start/end` | Time interval for operations (repaint redraw area / lego new track area) |
317
+
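Taken together, one generation call is just these parameters bundled into a request. The sketch below is illustrative rather than a confirmed API schema: field names follow the table, and the range checks follow the documented limits (BPM 30-300, `src_audio` only needed outside text2music):

```python
def validate_request(req: dict) -> dict:
    """Sanity-check a generation request against the documented parameter ranges."""
    bpm = req.get("bpm")
    if bpm is not None and not 30 <= bpm <= 300:
        raise ValueError(f"bpm must be in [30, 300], got {bpm}")
    if req.get("task_type") != "text2music" and "src_audio" not in req:
        raise ValueError("non-text2music tasks need src_audio")
    return req

request = validate_request({
    "task_type": "text2music",
    "caption": "female vocal, piano ballad, intimate atmosphere",
    "lyrics": "[Verse]\nI stand by the window",
    "bpm": 72,
    "keyscale": "C Major",
    "timesignature": "4/4",
    "duration": 180,
})
```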
318
+ ---
319
+
320
+ #### About Caption: The Most Important Input
321
+
322
+ **Caption is the most important factor affecting generated music.**
323
+
324
+ It supports multiple input formats: simple style words, comma-separated tags, complex natural language descriptions. We've trained to be compatible with various formats, ensuring text format doesn't significantly affect model performance.
325
+
326
+ **We provide at least 5 ways to help you write good captions:**
327
+
328
+ 1. **Random Dice** — Click the random button in the UI to see how example captions are written. You can use this standardized caption as a template and have an LLM rewrite it to your desired form.
329
+
330
+ 2. **Format Auto-Rewrite** — We support using the `format` feature to automatically expand your handwritten simple caption into complex descriptions.
331
+
332
+ 3. **CoT Rewrite** — If LM is initialized, whether `thinking` mode is enabled or not, we support rewriting and expanding captions through Chain-of-Thought (unless you actively disable it in settings, or LM is not initialized).
333
+
334
+ 4. **Audio to Caption** — Our LM supports converting your input audio to caption. While precision is limited, the vague direction is correct—enough as a starting point.
335
+
336
+ 5. **Simple Mode** — Just input a simple song description, and LM will automatically generate complete caption, lyrics, and metas samples—suitable for quick starts.
337
+
338
+ Whichever method you use, they all solve the same real problem: **as ordinary people, our music vocabulary is impoverished.**
339
+
340
+ If you want generated music to be more interesting and meet expectations, **Prompting is always the optimal option**—it brings the highest marginal returns and surprises.
341
+
342
+ **Common Dimensions for Caption Writing:**
343
+
344
+ | Dimension | Examples |
345
+ |-----------|----------|
346
+ | **Style/Genre** | pop, rock, jazz, electronic, hip-hop, R&B, folk, classical, lo-fi, synthwave |
347
+ | **Emotion/Atmosphere** | melancholic, uplifting, energetic, dreamy, dark, nostalgic, euphoric, intimate |
348
+ | **Instruments** | acoustic guitar, piano, synth pads, 808 drums, strings, brass, electric bass |
349
+ | **Timbre Texture** | warm, bright, crisp, muddy, airy, punchy, lush, raw, polished |
350
+ | **Era Reference** | 80s synth-pop, 90s grunge, 2010s EDM, vintage soul, modern trap |
351
+ | **Production Style** | lo-fi, high-fidelity, live recording, studio-polished, bedroom pop |
352
+ | **Vocal Characteristics** | female vocal, male vocal, breathy, powerful, falsetto, raspy, choir |
353
+ | **Speed/Rhythm** | slow tempo, mid-tempo, fast-paced, groovy, driving, laid-back |
354
+ | **Structure Hints** | building intro, catchy chorus, dramatic bridge, fade-out ending |
355
+
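If you find it easier to think dimension by dimension, a caption can be assembled mechanically. A small sketch: the dimension keys are taken from the table above, the function itself is illustrative:

```python
def build_caption(**dimensions: str) -> str:
    """Join dimension values into a comma-separated caption, in table order."""
    order = ["style", "emotion", "instruments", "timbre", "era",
             "production", "vocal", "tempo", "structure"]
    parts = [dimensions[key] for key in order if key in dimensions]
    return ", ".join(parts)

caption = build_caption(
    style="synthwave",
    emotion="nostalgic",
    instruments="analog synth pads, 808 drums",
    timbre="warm, airy",
    vocal="female vocal, breathy",
)
```

Omitted dimensions are simply left out, which (per principle 6 below the table) hands that freedom back to the model.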
356
+ **Some Practical Principles:**
357
+
358
+ 1. **Specific beats vague** — "sad piano ballad with female breathy vocal" works better than "a sad song."
359
+
360
+ 2. **Combine multiple dimensions** — Single-dimension descriptions give the model too much room to play; combining style+emotion+instruments+timbre can more precisely anchor your desired direction.
361
+
362
+ 3. **Use references well** — "in the style of 80s synthwave" or "reminiscent of Bon Iver" can quickly convey complex aesthetic preferences.
363
+
364
+ 4. **Texture words are useful** — Adjectives like warm, crisp, airy, punchy can influence mixing and timbre tendencies.
365
+
366
+ 5. **Don't pursue perfect descriptions** — Caption is a starting point, not an endpoint. Write a general direction first, then iterate based on results.
367
+
368
+ 6. **Description granularity determines freedom** — More omitted descriptions give the model more room to play, more random factor influence; more detailed descriptions constrain the model more. Decide specificity based on your needs—want surprises? Write less. Want control? Write more details.
369
+
370
+ 7. **Avoid conflicting words** — Conflicting style combinations easily lead to degraded output. For example, asking for both "classical strings" and "hardcore metal" at once: the model will try to fuse them, usually with unsatisfying results. This matters especially when `thinking` mode is enabled, because the LM generalizes over captions less well than DiT; when a prompt is unreasonable, the chance of pleasant surprises shrinks.
371
+
372
+ **Ways to resolve conflicts:**
373
+ - **Repetition reinforcement** — Strengthen the elements you want more in mixed styles by repeating certain words
374
+ - **Conflict to evolution** — Transform style conflicts into temporal style evolution. For example: "Start with soft strings, middle becomes noisy dynamic metal rock, end turns to hip-hop"—this gives the model clear guidance on how to handle different styles, rather than mixing them into a mess
375
+
376
+ > For more prompting tips, see: [The Complete Guide to Mastering Suno](https://www.notion.so/The-Complete-Guide-to-Mastering-Suno-Advanced-Strategies-for-Professional-Music-Generation-2d6ae744ebdf8024be42f6645f884221)—although it's a Suno tutorial, prompting ideas are universal.
377
+
378
+ ---
379
+
380
+ #### About Lyrics: The Temporal Script
381
+
382
+ If Caption describes the music's "overall portrait"—style, atmosphere, timbre—then **Lyrics is the music's "temporal script"**, controlling how music unfolds over time.
383
+
384
+ Lyrics is not just lyric content. It carries:
385
+ - The lyric text itself
386
+ - **Structure tags** ([Verse], [Chorus], [Bridge]...)
387
+ - **Vocal style hints** ([raspy vocal], [whispered]...)
388
+ - **Instrumental sections** ([guitar solo], [drum break]...)
389
+ - **Energy changes** ([building energy], [explosive drop]...)
390
+
391
+ **Structure Tags are Key**
392
+
393
+ Structure tags (Meta Tags) are the most powerful tool in Lyrics. They tell the model: "What is this section, how should it be performed?"
394
+
395
+ **Common Structure Tags:**
396
+
397
+ | Category | Tag | Description |
398
+ |----------|-----|-------------|
399
+ | **Basic Structure** | `[Intro]` | Opening, establish atmosphere |
400
+ | | `[Verse]` / `[Verse 1]` | Verse, narrative progression |
401
+ | | `[Pre-Chorus]` | Pre-chorus, build energy |
402
+ | | `[Chorus]` | Chorus, emotional climax |
403
+ | | `[Bridge]` | Bridge, transition or elevation |
404
+ | | `[Outro]` | Ending, conclusion |
405
+ | **Dynamic Sections** | `[Build]` | Energy gradually rising |
406
+ | | `[Drop]` | Electronic music energy release |
407
+ | | `[Breakdown]` | Reduced instrumentation, space |
408
+ | **Instrumental Sections** | `[Instrumental]` | Pure instrumental, no vocals |
409
+ | | `[Guitar Solo]` | Guitar solo |
410
+ | | `[Piano Interlude]` | Piano interlude |
411
+ | **Special Tags** | `[Fade Out]` | Fade out ending |
412
+ | | `[Silence]` | Silence |
413
+
414
+ **Combining Tags: Use Moderately**
415
+
416
+ Structure tags can be combined with `-` for finer control:
417
+
418
+ ```
419
+ [Chorus - anthemic]
420
+ This is the chorus lyrics
421
+ Dreams are burning
422
+
423
+ [Bridge - whispered]
424
+ Whisper those words softly
425
+ ```
426
+
427
+ This works better than writing `[Chorus]` alone—you're telling the model both what this section is (Chorus) and how to sing it (anthemic).
428
+
429
+ **⚠️ Note: Don't stack too many tags.**
430
+
431
+ ```
432
+ ❌ Not recommended:
433
+ [Chorus - anthemic - stacked harmonies - high energy - powerful - epic]
434
+
435
+ ✅ Recommended:
436
+ [Chorus - anthemic]
437
+ ```
438
+
439
+ Stacking too many tags has two risks:
440
+ 1. The model might mistake tag content as lyrics to sing
441
+ 2. Too many instructions confuse the model, making effects worse
442
+
443
+ **Principle**: Keep structure tags concise; put complex style descriptions in Caption.
444
+
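The `[Section - modifier]` convention is regular enough to lint before generating. A sketch, with the tag grammar inferred from the examples above:

```python
def parse_tag(line: str) -> tuple[str, list[str]]:
    """Split '[Chorus - anthemic]' into the section name and its modifiers."""
    inner = line.strip().removeprefix("[").removesuffix("]")
    section, *modifiers = [part.strip() for part in inner.split(" - ")]
    return section, modifiers

def too_many_modifiers(line: str, limit: int = 2) -> bool:
    """Flag tag stacks likely to confuse the model or get sung as lyrics."""
    return len(parse_tag(line)[1]) > limit
```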
445
+ **⚠️ Key: Maintain Consistency Between Caption and Lyrics**
446
+
447
+ **Models are not good at resolving conflicts.** If descriptions in Caption and Lyrics contradict, the model gets confused and output quality decreases.
448
+
449
+ ```
450
+ ❌ Conflict example:
451
+ Caption: "violin solo, classical, intimate chamber music"
452
+ Lyrics: [Guitar Solo - electric - distorted]
453
+
454
+ ✅ Consistent example:
455
+ Caption: "violin solo, classical, intimate chamber music"
456
+ Lyrics: [Violin Solo - expressive]
457
+ ```
458
+
459
+ **Checklist:**
460
+ - Instruments in Caption ↔ Instrumental section tags in Lyrics
461
+ - Emotion in Caption ↔ Energy tags in Lyrics
462
+ - Vocal description in Caption ↔ Vocal control tags in Lyrics
463
+
464
+ Think of Caption as "overall setting" and Lyrics as "shot script"—they should tell the same story.
465
+
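You can even check part of the consistency rule mechanically. The heuristic below is deliberately naive keyword matching, purely illustrative of the checklist; it says nothing about how the model itself parses inputs:

```python
import re

def tag_sections(lyrics: str) -> list[str]:
    """Collect every [tag] in the lyrics, lowercased."""
    return [t.lower() for t in re.findall(r"\[([^\]]+)\]", lyrics)]

def solo_conflicts(caption: str, lyrics: str,
                   instruments=("violin", "guitar", "piano")) -> list[str]:
    """Return instruments that appear in a lyrics solo tag but not in the caption."""
    cap = caption.lower()
    return [inst for inst in instruments
            for tag in tag_sections(lyrics)
            if inst in tag and "solo" in tag and inst not in cap]

# The conflicting pair from the example above is caught:
bad = solo_conflicts("violin solo, classical, intimate chamber music",
                     "[Guitar Solo - electric - distorted]")
```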
466
+ **Vocal Control Tags:**
467
+
468
+ | Tag | Effect |
469
+ |-----|--------|
470
+ | `[raspy vocal]` | Raspy, textured vocals |
471
+ | `[whispered]` | Whispered |
472
+ | `[falsetto]` | Falsetto |
473
+ | `[powerful belting]` | Powerful, high-pitched singing |
474
+ | `[spoken word]` | Rap/recitation |
475
+ | `[harmonies]` | Layered harmonies |
476
+ | `[call and response]` | Call and response |
477
+ | `[ad-lib]` | Improvised embellishments |
478
+
479
+ **Energy and Emotion Tags:**
480
+
481
+ | Tag | Effect |
482
+ |-----|--------|
483
+ | `[high energy]` | High energy, passionate |
484
+ | `[low energy]` | Low energy, restrained |
485
+ | `[building energy]` | Increasing energy |
486
+ | `[explosive]` | Explosive energy |
487
+ | `[melancholic]` | Melancholic |
488
+ | `[euphoric]` | Euphoric |
489
+ | `[dreamy]` | Dreamy |
490
+ | `[aggressive]` | Aggressive |
491
+
492
+ **Lyric Text Writing Tips**
493
+
494
+ **1. Control Syllable Count**
495
+
496
+ **6-10 syllables per line** usually works best. The model aligns syllables to beats—if one line has 6 syllables and the next has 14, rhythm becomes strange.
497
+
498
+ ```
499
+ ❌ Bad example:
500
+ I was standing by the window watching all the world outside me change (17 syllables)
501
+ Hello (2 syllables)
502
+
503
+ ✅ Good example:
504
+ I stand by the window (6 syllables)
505
+ Watching the world outside (6 syllables)
506
+ Nothing stays the same now (6 syllables)
507
+ ```
508
+
509
+ **Tip**: Keep syllable counts for lines in the same position (e.g., the first line of each verse) within 1-2 syllables of each other.
510
+
511
+ **2. Use Case to Control Intensity**
512
+
513
+ Uppercase indicates stronger vocal intensity:
514
+
515
+ ```
516
+ [Verse]
517
+ walking through the empty streets (normal intensity)
518
+
519
+ [Chorus]
520
+ WE ARE THE CHAMPIONS! (high intensity, shouting)
521
+ ```
522
+
523
+ **3. Use Parentheses for Background Vocals**
524
+
525
+ ```
526
+ [Chorus]
527
+ We rise together (together)
528
+ Into the light (into the light)
529
+ ```
530
+
531
+ Content in parentheses is processed as background vocals or harmonies.
532
+
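Since the convention is purely textual, separating the lead line from its backing phrases is a one-regex job. A sketch:

```python
import re

def split_backing(line: str) -> tuple[str, list[str]]:
    """Separate the lead vocal line from parenthesised backing-vocal phrases."""
    backing = re.findall(r"\(([^)]+)\)", line)          # text inside parentheses
    lead = re.sub(r"\s*\([^)]*\)", "", line).strip()    # line with parentheses removed
    return lead, backing

lead, backing = split_backing("We rise together (together)")
```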
533
+ **4. Extend Vowels**
534
+
535
+ You can extend sounds by repeating vowels:
536
+
537
+ ```
538
+ Feeeling so aliiive
539
+ ```
540
+
541
+ Use this cautiously: the effect is unstable and may be ignored or mispronounced.
542
+
543
+ **5. Clear Section Separation**
544
+
545
+ Separate each section with blank lines:
546
+
547
+ ```
548
+ [Verse 1]
549
+ First verse lyrics
550
+ Continue first verse
551
+
552
+ [Chorus]
553
+ Chorus lyrics
554
+ Chorus continues
555
+ ```
556
+
557
+ **Avoiding "AI-flavored" Lyrics**
558
+
559
+ These characteristics make lyrics seem mechanical and lack human touch:
560
+
561
+ | Red Flag 🚩 | Description |
562
+ |-------------|-------------|
563
+ | **Adjective stacking** | "neon skies, electric hearts, endless dreams"—filling a section with vague imagery |
564
+ | **Rhyme chaos** | Inconsistent rhyme patterns, or forced rhymes causing semantic breaks |
565
+ | **Blurred section boundaries** | Lyric content crosses structure tags, Verse content "flows" into Chorus |
566
+ | **No breathing room** | Each line too long, can't sing in one breath |
567
+ | **Mixed metaphors** | The first verse uses water imagery, the second suddenly becomes fire, the third is flying; listeners have nothing to anchor to |
568
+
569
+ **Metaphor discipline**: Stick to one core metaphor per song and explore its facets. For example, with "water" as the metaphor you can explore how love flows around obstacles like water, can be gentle rain or a flood, reflects the other person's image, and exists even though it can't be grasped. One image, many facets: this gives lyrics cohesion.
570
+
571
+ **Writing Instrumental Music**
572
+
573
+ If generating pure instrumental music without vocals:
574
+
575
+ ```
576
+ [Instrumental]
577
+ ```
578
+
579
+ Or use structure tags to describe instrumental development:
580
+
581
+ ```
582
+ [Intro - ambient]
583
+
584
+ [Main Theme - piano]
585
+
586
+ [Climax - powerful]
587
+
588
+ [Outro - fade out]
589
+ ```
590
+
591
+ **Complete Example**
592
+
593
+ Assuming Caption is: `female vocal, piano ballad, emotional, intimate atmosphere, strings, building to powerful chorus`
594
+
595
+ ```
596
+ [Intro - piano]
597
+
598
+ [Verse 1]
599
+ 月光洒在窗台上
600
+ 我听见你的呼吸
601
+ 城市在远处沉睡
602
+ 只有我们还醒着
603
+
604
+ [Pre-Chorus]
605
+ 这一刻如此安静
606
+ 却藏着汹涌的心
607
+
608
+ [Chorus - powerful]
609
+ 让我们燃烧吧
610
+ 像夜空中的烟火
611
+ 短暂却绚烂
612
+ 这就是我们的时刻
613
+
614
+ [Verse 2]
615
+ 时间在指尖流过
616
+ 我们抓不住什么
617
+ 但至少此刻拥有
618
+ 彼此眼中的火焰
619
+
620
+ [Bridge - whispered]
621
+ 如果明天一切消散
622
+ 至少我们曾经闪耀
623
+
624
+ [Final Chorus]
625
+ 让我们燃烧吧
626
+ 像夜空中的烟火
627
+ 短暂却绚烂
628
+ THIS IS OUR MOMENT!
629
+
630
+ [Outro - fade out]
631
+ ```
632
+
633
+ Note: In this example, Lyrics tags (piano, powerful, whispered) are consistent with Caption descriptions (piano ballad, building to powerful chorus, intimate), with no conflicts.
634
+
635
+ ---
636
+
637
+ #### About Music Metadata: Optional Fine Control
638
+
639
+ **Most of the time, you don't need to manually set metadata.**
640
+
641
+ When you enable `thinking` mode (or enable `use_cot_metas`), the LM automatically infers an appropriate BPM, key, time signature, and so on from your Caption and Lyrics. This is usually good enough.
642
+
643
+ But if you have clear ideas, you can also manually control them:
644
+
645
+ | Parameter | Control Range | Description |
646
+ |-----------|--------------|-------------|
647
+ | `bpm` | 30–300 | Tempo. Common distribution: slow songs 60–80, mid-tempo 90–120, fast songs 130–180 |
648
+ | `keyscale` | Key | e.g., `C Major`, `Am`, `F# Minor`. Affects overall pitch and emotional color |
649
+ | `timesignature` | Time signature | `4/4` (most common), `3/4` (waltz), `6/8` (swing feel) |
650
+ | `vocal_language` | Language | Vocal language. LM usually auto-detects from lyrics |
651
+ | `duration` | Seconds | Target duration. Actual generation may vary slightly |
652
+
653
+ **Understanding Control Boundaries**
654
+
655
+ These parameters are **guidance** rather than **precise commands**:
656
+
657
+ - **BPM**: Common range (60–180) works well; extreme values (like 30 or 280) have less training data, may be unstable
658
+ - **Key**: Common keys (C, G, D, Am, Em) are stable; rare keys may be ignored or shifted
659
+ - **Time signature**: `4/4` is most reliable; `3/4`, `6/8` usually OK; complex signatures (like `5/4`, `7/8`) are advanced, effects vary by style
660
+ - **Duration**: Short songs (30–60s) and medium length (2–4min) are stable; very long generation may have repetition or structure issues
661
+
662
+ **The Model's "Reference" Approach**
663
+
664
+ The model doesn't mechanically execute `bpm=120`, but rather:
665
+ 1. Uses `120 BPM` as an **anchor point**
666
+ 2. Samples from distribution near this anchor
667
+ 3. Final result might be 118 or 122, not exactly 120
668
+
669
+ It's like telling a musician to play "around 120": they'll naturally stay in that range rather than rigidly following a metronome.
670
+
671
+ **When Do You Need Manual Settings?**
672
+
673
+ | Scenario | Suggestion |
674
+ |----------|------------|
675
+ | Daily generation | Don't worry, let LM auto-infer |
676
+ | Clear tempo requirement | Manually set `bpm` |
677
+ | Specific style (e.g., waltz) | Manually set `timesignature=3/4` |
678
+ | Need to match other material | Manually set `bpm` and `duration` |
679
+ | Pursue specific key color | Manually set `keyscale` |
680
+
681
+ **Tip**: If you manually set metadata but the generated result clearly doesn't match, check for conflicts with the Caption or Lyrics. For example, if the Caption says "slow ballad" but `bpm=160`, the model gets confused.
682
+
683
+ **Recommended Practice**: Don't describe tempo, BPM, key, or other metadata in the Caption. Set these through the dedicated metadata parameters (`bpm`, `keyscale`, `timesignature`, etc.) and let the Caption focus on style, emotion, instruments, timbre, and other musical characteristics.
684
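Following this practice, a request keeps style in the Caption and tempo/key/length in metadata. The field names below come from the parameter table above; the overall request shape is illustrative, not a literal API schema.

```python
# Style and emotion live in the caption; tempo, key, and length are
# passed as dedicated metadata parameters (names from the table above).
request = {
    "caption": "female vocal, piano ballad, emotional, intimate atmosphere",
    "bpm": 72,                 # slow-song range (60-80)
    "keyscale": "Am",
    "timesignature": "4/4",
    "duration": 180,           # seconds; actual output may vary slightly
}

# The caption should not restate the metadata.
assert "bpm" not in request["caption"].lower()
assert 30 <= request["bpm"] <= 300
```

Note how the caption ("piano ballad") and the metadata (`bpm=72`) agree, avoiding the slow-ballad-at-160-BPM conflict described above.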
+
685
+ ---
686
+
687
+ #### About Audio Control: Controlling Sound with Sound
688
+
689
+ **Text is a lossy, low-dimensional abstraction; the most precise control is still audio itself.**
690
+
691
+ There are three ways to control generation with audio, each with different control ranges and uses:
692
+
693
+ ---
694
+
695
+ ##### 1. Reference Audio: Global Acoustic Feature Control
696
+
697
+ Reference audio (`reference_audio`) controls the **acoustic features** of generated music: timbre, mixing style, performance style, and so on. It **averages information across the time dimension** and acts **globally**.
698
+
699
+ **What Does Reference Audio Control?**
700
+
701
+ Reference audio mainly controls the **acoustic features** of generated music, including:
702
+ - **Timbre texture**: Vocal timbre, instrument timbre
703
+ - **Mixing style**: Spatial sense, dynamic range, frequency distribution
704
+ - **Performance style**: Vocal techniques, playing techniques, expression
705
+ - **Overall atmosphere**: The "feeling" conveyed through reference audio
706
+
707
+ **How Does the Backend Process Reference Audio?**
708
+
709
+ When you provide reference audio, the system performs the following processing:
710
+
711
+ 1. **Audio Preprocessing**:
712
+ - Load audio file, normalize to **stereo 48kHz** format
713
+ - Detect silence, ignore if audio is completely silent
714
+ - If audio length is less than 30 seconds, repeat to fill to at least 30 seconds
715
+ - Randomly select 10-second segments from front, middle, and back positions, concatenate into 30-second reference segment
716
+
717
+ 2. **Encoding Conversion**:
718
+ - Use **VAE (Variational Autoencoder)** `tiled_encode` method to encode audio into **latent representation (latents)**
719
+ - These latents contain acoustic feature information but remove specific melody, rhythm, and other structural information
720
+ - The encoded latents are fed as conditions into the DiT generation process, **averaged across the time dimension so they act globally on the entire generation**
721
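The segment-selection step can be sketched in terms of sample indices. This is a simplification: it assumes 48 kHz sample counts and deterministic window positions, while the real pipeline picks positions randomly and also handles channels and silence.

```python
SR = 48_000  # reference audio is normalized to 48 kHz

def reference_windows(num_samples: int) -> list[tuple[int, int]]:
    """Three 10 s windows from the front, middle, and back of the
    (repeated) audio; together they form the 30 s reference segment."""
    target = 30 * SR
    while num_samples < target:      # short audio is repeated to >= 30 s
        num_samples *= 2
    ten = 10 * SR
    mid = (num_samples - ten) // 2
    return [(0, ten), (mid, mid + ten), (num_samples - ten, num_samples)]

windows = reference_windows(12 * SR)             # a 12-second clip
assert sum(end - start for start, end in windows) == 30 * SR
```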
+
722
+ ---
723
+
724
+ ##### 2. Source Audio: Semantic Structure Control
725
+
726
+ Source audio (`src_audio`) is used for **Cover tasks**, performing **melodic structure control**. Its principle is to quantize your input source audio into semantically structured information.
727
+
728
+ **What Does Source Audio Control?**
729
+
730
+ Source audio is converted into **semantically structured information**, including:
731
+ - **Melody**: Note direction and pitch
732
+ - **Rhythm**: Beat, accent, groove
733
+ - **Chords**: Harmonic progression and changes
734
+ - **Orchestration**: Instrument arrangement and layers
735
+ - **Some timbre**: Partial timbre information
736
+
737
+ **What Can You Do With It?**
738
+
739
+ 1. **Control style**: Maintain source audio structure, change style and details
740
+ 2. **Transfer style**: Apply source audio structure to different styles
741
+ 3. **Retake lottery**: Generate similar structure but different variants, get different interpretations through multiple generations
742
+ 4. **Control influence degree**: Control source audio influence strength through `audio_cover_strength` parameter (0.0–1.0)
743
+ - Higher strength: generation results more strictly follow source audio structure
744
+ - Lower strength: generation results have more room for free play
745
+
746
+ **Advanced Cover Usage**
747
+
748
+ You can use Cover to **Remix a song**, and it supports changing Caption and Lyrics:
749
+
750
+ - **Remix creation**: Input a song as source audio, reinterpret it by modifying Caption and Lyrics
751
+ - Change style: Use different Caption descriptions (e.g., change from pop to rock)
752
+ - Change lyrics: Rewrite lyrics with new Lyrics, maintaining original melody structure
753
+ - Change emotion: Adjust overall atmosphere through Caption (e.g., change from sad to joyful)
754
+
755
+ - **Build complex music structures**: Build complex melodic direction, layers, and groove based on your needed structure influence degree
756
+ - Fine-tune structure adherence through `audio_cover_strength`
757
+ - Combine Caption and Lyrics modifications to create new expression while maintaining core structure
758
+ - Can generate multiple versions, each with different emphasis on structure, style, lyrics
759
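A remix-style Cover call might look like the sketch below. The field names (`src_audio`, `audio_cover_strength`) follow this guide; the request shape and file names are placeholders, not a literal API schema.

```python
# Remix sketch: keep the source song's melodic structure but change
# the style and lyrics. All file names here are hypothetical.
cover_request = {
    "src_audio": "original_song.wav",
    "caption": "rock, electric guitar, driving drums, high energy",  # new style
    "lyrics": "[Verse 1]\nnew words over the old melody",
    "audio_cover_strength": 0.8,   # high: follow the source structure closely
}
assert 0.0 <= cover_request["audio_cover_strength"] <= 1.0
```

Lowering `audio_cover_strength` toward 0.3-0.5 would give the model more freedom to depart from the source arrangement.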
+
760
+ ---
761
+
762
+ ##### 3. Source Audio Context-Based Control: Local Completion and Modification
763
+
764
+ This is the **Repaint task**, performing completion or modification based on source audio context.
765
+
766
+ **Repaint Principle**
767
+
768
+ Repaint is based on **context completion** principle:
769
+ - It can complete the **beginning**, a **middle section**, the **ending**, or **any region**
770
+ - Operation range: **3 seconds to 90 seconds**
771
+ - Model references source audio context information, generating within specified interval
772
+
773
+ **What Can You Do With It?**
774
+
775
+ 1. **Local modification**: Modify lyrics, structure, or content in specified interval
776
+ 2. **Change lyrics**: Maintain melody and orchestration, only change lyric content
777
+ 3. **Change structure**: Change music structure in specified interval (e.g., change Verse to Chorus)
778
+ 4. **Continue writing**: Continue writing beginning or ending based on context
779
+ 5. **Clone timbre**: Clone source audio timbre characteristics based on context
780
+
781
+ **Advanced Repaint Usage**
782
+
783
+ You can use Repaint for more complex creative needs:
784
+
785
+ - **Infinite duration generation**:
786
+ - Through repeated Repaint operations you can keep extending the audio, effectively generating unlimited duration
787
+ - Each continuation is based on previous segment's context, maintaining natural transitions and coherence
788
+ - Can generate in segments, each 3–90 seconds, finally concatenate into complete work
789
+
790
+ - **Intelligent audio stitching**:
791
+ - Intelligently organize and stitch two audios together
792
+ - Use Repaint at first audio's end to continue, making transitions naturally connect
793
+ - Or use Repaint to rework the junction between the two audio clips for a smooth transition
794
+ - Model automatically handles rhythm, harmony, timbre connections based on context, making stitched audio sound like a complete work
795
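The extension idea reduces to a simple window planner: each Repaint continues from the current end by 3-90 seconds. The `plan_extensions` helper below is illustrative, not an ACE-Step API.

```python
def plan_extensions(current_len: float, target_len: float,
                    chunk: float = 60.0) -> list[tuple[float, float]]:
    """Plan (start, end) repaint windows, in seconds, until the track
    reaches target_len. Each window stays inside the 3-90 s range."""
    windows = []
    while current_len < target_len:
        step = min(chunk, target_len - current_len, 90.0)
        step = max(step, 3.0)              # repaint minimum is 3 s
        windows.append((current_len, current_len + step))
        current_len += step
    return windows

# Extend a 2-minute track to 5 minutes in 60 s continuations.
print(plan_extensions(120.0, 300.0))
```

Each planned window would be one Repaint call anchored at the current end of the audio, so every continuation sees the previous segment as context.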
+
796
+ ---
797
+
798
+ ##### 4. Base Model Advanced Audio Control Tasks
799
+
800
+ In the **Base model**, we also support more advanced audio control tasks:
801
+
802
+ **Lego Task**: Intelligently add new tracks based on existing tracks
803
+ - Input an existing audio track (e.g., vocals)
804
+ - Model intelligently adds new tracks (e.g., drums, guitar, bass, etc.)
805
+ - New tracks coordinate with original tracks in rhythm and harmony
806
+
807
+ **Complete Task**: Add mixed tracks to single track
808
+ - Input a single-track audio (e.g., a cappella vocals)
809
+ - Model generates complete mixed accompaniment tracks
810
+ - Generated accompaniment matches vocals in style, rhythm, and harmony
811
+
812
+ **These advanced context-completion tasks** greatly expand the available control methods and are an intelligent source of inspiration and creativity.
813
+
814
+ ---
815
+
816
+ The combination of these parameters determines what you "want." We'll explain input control **principles** and **techniques** in detail later.
817
+
818
+ ### II. Inference Hyperparameters: How Does the Model Generate?
819
+
820
+ This is the part that affects generation-process behavior: it doesn't change what you want, but it changes how the model gets there.
821
+
822
+ **DiT (Diffusion Model) Hyperparameters:**
823
+
824
+ | Parameter | Function | Default | Tuning Advice |
825
+ |-----------|----------|---------|---------------|
826
+ | `inference_steps` | Diffusion steps | 8 (turbo) | More steps = finer but slower. Turbo uses 8, Base uses 32–100 |
827
+ | `guidance_scale` | CFG strength | 7.0 | Higher = more prompt adherence, but may overfit. Base model only |
828
+ | `use_adg` | Adaptive Dual Guidance | False | Dynamically adjusts CFG when enabled; Base model only |
829
+ | `cfg_interval_start/end` | CFG effective interval | 0.0–1.0 | Controls which stage to apply CFG |
830
+ | `shift` | Timestep offset | 1.0 | Adjusts denoising trajectory, affects generation style |
831
+ | `infer_method` | Inference method | "ode" | `ode` deterministic, `sde` introduces randomness |
832
+ | `timesteps` | Custom timesteps | None | Advanced usage, overrides steps and shift |
833
+ | `audio_cover_strength` | Reference audio/codes influence strength | 1.0 | 0.0–1.0, higher = closer to reference, lower = more freedom |
834
+
835
+ **5Hz LM (Language Model) Hyperparameters:**
836
+
837
+ | Parameter | Function | Default | Tuning Advice |
838
+ |-----------|----------|---------|---------------|
839
+ | `thinking` | Enable CoT reasoning | True | Enable to let LM reason metadata and codes |
840
+ | `lm_temperature` | Sampling temperature | 0.85 | Higher = more random/creative, lower = more conservative/deterministic |
841
+ | `lm_cfg_scale` | LM CFG strength | 2.0 | Higher = more positive prompt adherence |
842
+ | `lm_top_k` | Top-K sampling | 0 | 0 means disabled, limits candidate word count |
843
+ | `lm_top_p` | Top-P sampling | 0.9 | Nucleus sampling, limits cumulative probability |
844
+ | `lm_negative_prompt` | Negative prompt | "NO USER INPUT" | Tells LM what not to generate |
845
+ | `use_cot_metas` | CoT reason metadata | True | Let LM auto-infer BPM, key, etc. |
846
+ | `use_cot_caption` | CoT rewrite caption | True | Let LM optimize your description |
847
+ | `use_cot_language` | CoT detect language | True | Let LM auto-detect vocal language |
848
+ | `use_constrained_decoding` | Constrained decoding | True | Ensures correct output format |
849
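Put together, a typical turbo-model configuration simply keeps the defaults from both tables. The dict below is a sketch of those values, not a literal request schema.

```python
# Defaults from the DiT and 5Hz LM tables above (turbo model).
infer_params = {
    # DiT (diffusion) side
    "inference_steps": 8,        # turbo default; Base uses 32-100
    "shift": 1.0,
    "infer_method": "ode",       # deterministic; "sde" adds randomness
    # 5Hz LM side
    "thinking": True,            # let the LM reason metadata and codes
    "lm_temperature": 0.85,
    "lm_cfg_scale": 2.0,
    "lm_top_p": 0.9,
    "use_cot_metas": True,
}
assert infer_params["inference_steps"] == 8
```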
+
850
+ The combination of these parameters determines how the model "does it."
851
+
852
+ **About Parameter Tuning**
853
+
854
+ It's important to emphasize that **tuning factors and random factors sometimes have comparable influence**. When you adjust a parameter, it may be hard to tell if it's the parameter's effect or randomness causing the change.
855
+
856
+ Therefore, **we recommend fixing the random factors when tuning**: set a fixed `seed` so each generation starts from the same initial noise, letting you accurately feel a parameter's real impact on the generated audio. Otherwise a parameter's effect may be masked by randomness, causing you to misjudge its role.
857
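A fixed-seed sweep makes this concrete: hold `seed` constant and vary exactly one parameter, so any audible difference is attributable to that parameter. The request dicts below are illustrative.

```python
base = {"caption": "lofi hip hop, mellow, vinyl crackle", "seed": 42}

# Sweep only lm_temperature; everything else (including the seed) is fixed.
sweep = [dict(base, lm_temperature=t) for t in (0.6, 0.85, 1.1)]

assert all(run["seed"] == 42 for run in sweep)   # same initial noise each time
assert sorted(run["lm_temperature"] for run in sweep) == [0.6, 0.85, 1.1]
```

Once you've picked a value you like, release the seed again to explore variants around it.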
+
858
+ ### III. Random Factors: Sources of Uncertainty
859
+
860
+ Even with identical inputs and hyperparameters, two generations may produce different results. This is because:
861
+
862
+ **1. DiT's Initial Noise**
863
+ - Diffusion models start from random noise and gradually denoise
864
+ - `seed` parameter controls this initial noise
865
+ - Different seed → different starting point → different endpoint
866
+
867
+ **2. LM's Sampling Randomness**
868
+ - When `lm_temperature > 0`, the sampling process itself has randomness
869
+ - Same prompt, each sampling may choose different tokens
870
+
871
+ **3. Additional Noise When `infer_method = "sde"`**
872
+ - SDE method injects additional randomness during denoising
873
+
874
+ ---
875
+
876
+ #### Pros and Cons of Random Factors
877
+
878
+ Randomness is a double-edged sword.
879
+
880
+ **Benefits of Randomness:**
881
+ - **Explore creative space**: Same input can produce different variants, giving you more choices
882
+ - **Discover unexpected surprises**: Sometimes randomness brings excellent results you didn't expect
883
+ - **Avoid repetition**: Each generation is different, won't fall into single-pattern loops
884
+
885
+ **Challenges of Randomness:**
886
+ - **Uncontrollable results**: You can't precisely predict the output and may need several generations before you're satisfied
887
+ - **Hard to reproduce**: Even with identical inputs, hard to reproduce a specific good result
888
+ - **Tuning difficulty**: When adjusting a parameter, it's hard to tell whether a change comes from the parameter or from randomness
889
+ - **Screening cost**: Need to generate multiple versions to find satisfactory ones, increasing time cost
890
+
891
+ #### What Mindset to Face Random Factors?
892
+
893
+ **1. Accept Uncertainty**
894
+ - Randomness is an essential characteristic of AI music generation, not a bug, but a feature
895
+ - Don't expect every generation to be perfect; treat randomness as an exploration tool
896
+
897
+ **2. Embrace the Exploration Process**
898
+ - Treat the generation process as a gacha pull or treasure hunt: try multiple times and you'll always find surprises
899
+ - Enjoy discovering unexpectedly good results, rather than obsessing over one-time success
900
+
901
+ **3. Use Fixed Seed Wisely**
902
+ - When you want to **understand parameter effects**, fix `seed` to eliminate randomness interference
903
+ - When you want to **explore creative space**, let `seed` vary randomly
904
+
905
+ **4. Batch Generation + Intelligent Screening**
906
+ - Don't rely on single generation; batch generate multiple versions
907
+ - Use automatic scoring mechanisms for initial screening to improve efficiency
908
+
909
+ #### Our Solution: Large Batch + Automatic Scoring
910
+
911
+ Because our inference is extremely fast, if your GPU has sufficient VRAM you can explore the random space with a **large batch**:
912
+
913
+ - **Batch generation**: Generate multiple versions at once (e.g., batch_size=2,4,8), quickly explore random space
914
+ - **Automatic scoring mechanism**: We provide automatic scoring mechanisms that can help you initially screen, doing **test time scaling**
915
+
916
+ **Automatic Scoring Mechanism**
917
+
918
+ We provide multiple scoring metrics, among which **my favorite is DiT Lyrics Alignment Score**:
919
+
920
+ - **DiT Lyrics Alignment Score**: This score implicitly tracks lyric accuracy
921
+ - It evaluates the alignment degree between lyrics and audio in generated audio
922
+ - Higher score means lyrics are more accurately positioned in audio, better match between singing and lyrics
923
+ - This is particularly important for music generation with lyrics, can help you screen versions with higher lyric accuracy
924
+
925
+ - **Other scoring metrics**: Also include other quality assessment metrics, can evaluate generation results from multiple dimensions
926
+
927
+ **Recommended Workflow:**
928
+
929
+ 1. **Batch generation**: Set larger `batch_size` (e.g., 2, 4, 8), generate multiple versions at once
930
+ 2. **Enable AutoGen**: Enable automatic generation, let system continuously generate new batches in background
931
+ - **AutoGen mechanism**: AutoGen automatically uses same parameters (but random seed) to generate next batch in background while you're viewing current batch results
932
+ - This lets you continuously explore random space without manually clicking generate button
933
+ - Each new batch uses new random seed, ensuring result diversity
934
+ 3. **Automatic scoring**: Enable automatic scoring, let system automatically score each version
935
+ 4. **Initial screening**: Screen versions with higher scores based on DiT Lyrics Alignment Score and other metrics
936
+ 5. **Manual selection**: Manually select the final version that best meets your needs from screened versions
937
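Step 4's screening reduces to a sort-and-slice over per-candidate scores. The file names and score values below are made-up stand-ins for whatever the automatic scorer returns.

```python
# One generated batch with hypothetical DiT Lyrics Alignment scores.
batch = [
    {"path": "take_1.flac", "lyric_align": 0.71},
    {"path": "take_2.flac", "lyric_align": 0.88},
    {"path": "take_3.flac", "lyric_align": 0.64},
    {"path": "take_4.flac", "lyric_align": 0.83},
]

# Keep the top 2 for manual listening.
shortlist = sorted(batch, key=lambda c: c["lyric_align"], reverse=True)[:2]
print([c["path"] for c in shortlist])
```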
+
938
+ This fully utilizes randomness to explore creative space while improving efficiency through automation tools, avoiding blind searching in large generation results. AutoGen lets you "generate while listening"—while browsing current results, the next batch is already prepared in the background.
939
+
940
+ ---
941
+
942
+ ## Conclusion
943
+
944
+ This tutorial currently covers ACE-Step 1.5's core concepts and usage methods:
945
+
946
+ - **Mental Models**: Understanding human-centered generation design philosophy
947
+ - **Model Architecture**: Understanding how LM and DiT work together
948
+ - **Input Control**: Mastering text (Caption, Lyrics, metadata) and audio (reference audio, source audio) control methods
949
+ - **Inference Hyperparameters**: Understanding parameters affecting generation process
950
+ - **Random Factors**: Learning to use randomness to explore creative space, improving efficiency through Large Batch + AutoGen + Automatic Scoring
951
+
952
+ This is just the beginning. There's much more content we want to share with you:
953
+
954
+ - More Prompting tips and practical cases
955
+ - Detailed usage guides for different task types
956
+ - Advanced techniques and creative workflows
957
+ - Common issues and solutions
958
+ - Performance optimization suggestions
959
+
960
+ **This tutorial will continue to be updated and improved.** If you have any questions or suggestions during use, feedback is welcome. Let's make ACE-Step your creative partner in your pocket together.
961
+
962
+ ---
963
+
964
+ *To be continued...*
.claude/skills/acestep-docs/guides/ENVIRONMENT_SETUP.md ADDED
@@ -0,0 +1,542 @@
1
+ # Environment Setup Guide
2
+
3
+ This guide covers Python environment setup for ACE-Step on Windows, Linux, and macOS.
4
+
5
+ ## Environment Options
6
+
7
+ ### Windows
8
+
9
+ **Option 1: python_embeded (Portable Package)**
10
+ - **Best for**: New users, zero-configuration setup
11
+ - **Pros**: Extract and run, no installation required
12
+ - **Cons**: Large download size (~7GB)
13
+ - **Location**: `python_embeded\python.exe`
14
+ - **Download**: https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
15
+
16
+ **Option 2: uv (Package Manager)**
17
+ - **Best for**: Developers, Git repository users
18
+ - **Pros**: Smaller installation, easy updates, excellent tooling
19
+ - **Cons**: Requires uv installation
20
+ - **Installation**: See [Installing uv](#installing-uv) below
21
+
22
+ ### Linux
23
+
24
+ **uv (Package Manager)**
25
+ - **Only supported option** (no portable package available for Linux)
26
+ - **Best for**: All Linux users
27
+ - **Requires**: uv package manager
28
+ - **Backend**: vllm (default) or pt (PyTorch)
29
+ - **Installation**: See [Installing uv](#installing-uv) below
30
+
31
+ ### macOS (Apple Silicon)
32
+
33
+ **uv with MLX Backend**
34
+ - **Only supported option** (no portable package available for macOS)
35
+ - **Best for**: All macOS Apple Silicon (M1/M2/M3/M4) users
36
+ - **Requires**: uv package manager
37
+ - **Backend**: mlx (native Apple Silicon acceleration)
38
+ - **Dedicated scripts**: `start_gradio_ui_macos.sh`, `start_api_server_macos.sh`
39
+ - **Installation**: See [Installing uv](#installing-uv) below
40
+
41
+ Note: Intel Macs can use the standard `start_gradio_ui.sh` with the PyTorch (pt) backend, but Apple Silicon Macs should use the macOS-specific scripts for optimal performance.
42
+
43
+ ## Automatic Detection
44
+
45
+ ### Windows (bat scripts)
46
+
47
+ The `.bat` startup scripts detect the environment in this order:
48
+
49
+ 1. **First**: Check for `python_embeded\python.exe`
50
+ - If found: Use embedded Python directly
51
+ - If not found: Continue to step 2
52
+
53
+ 2. **Second**: Check for `uv` command
54
+ - If found: Use uv
55
+ - If not found: Prompt to install uv
56
+
57
+ **Example output:**
58
+ ```
59
+ [Environment] Using embedded Python...
60
+ ```
61
+ or
62
+ ```
63
+ [Environment] Embedded Python not found, checking for uv...
64
+ [Environment] Using uv package manager...
65
+ ```
66
+
67
+ ### Linux/macOS (sh scripts)
68
+
69
+ The `.sh` startup scripts detect the environment in this order:
70
+
71
+ 1. **First**: Check for `uv` in PATH
72
+ - Also checks `~/.local/bin/uv` and `~/.cargo/bin/uv`
73
+ - If found: Use uv
74
+ - If not found: Prompt to install uv
75
+
76
+ 2. **If not found**: Offer automatic installation
77
+ - Calls `install_uv.sh --silent` to install uv
78
+ - Updates PATH and continues
79
+
80
+ **Example output (Linux):**
81
+ ```
82
+ [Environment] Using uv package manager...
83
+ ```
84
+
85
+ **Example output (macOS):**
86
+ ```
87
+ ============================================
88
+ ACE-Step 1.5 - macOS Apple Silicon (MLX)
89
+ ============================================
90
+ [Environment] Using uv package manager...
91
+ ```
92
+
93
+ ## Installing uv
94
+
95
+ ### All Platforms
96
+
97
+ **Automatic**: When you run a startup script and uv is not found, you will be prompted:
98
+
99
+ ```
100
+ uv package manager not found!
101
+
102
+ Install uv now? (Y/N):
103
+ ```
104
+
105
+ Type `Y` and press Enter. The script will automatically install uv using the appropriate method for your platform.
106
+
107
+ ### Windows Methods
108
+
109
+ **Method 1: PowerShell (Recommended)**
110
+ ```powershell
111
+ irm https://astral.sh/uv/install.ps1 | iex
112
+ ```
113
+
114
+ **Method 2: winget (Windows 10 1809+, Windows 11)**
115
+ ```batch
116
+ winget install --id=astral-sh.uv -e
117
+ ```
118
+
119
+ **Method 3: Run the install script**
120
+ ```batch
121
+ install_uv.bat
122
+ ```
123
+
124
+ The `install_uv.bat` script tries PowerShell first, then falls back to winget if PowerShell fails.
125
+
126
+ ### Linux Methods
127
+
128
+ **Method 1: curl installer (Recommended)**
129
+ ```bash
130
+ curl -LsSf https://astral.sh/uv/install.sh | sh
131
+ ```
132
+
133
+ **Method 2: Run the install script**
134
+ ```bash
135
+ chmod +x install_uv.sh
136
+ ./install_uv.sh
137
+ ```
138
+
139
+ The `install_uv.sh` script uses `curl` or `wget` to download and run the official installer.
140
+
141
+ ### macOS Methods
142
+
143
+ **Method 1: curl installer (Recommended)**
144
+ ```bash
145
+ curl -LsSf https://astral.sh/uv/install.sh | sh
146
+ ```
147
+
148
+ **Method 2: Homebrew**
149
+ ```bash
150
+ brew install uv
151
+ ```
152
+
153
+ **Method 3: Run the install script**
154
+ ```bash
155
+ chmod +x install_uv.sh
156
+ ./install_uv.sh
157
+ ```
158
+
159
+ The `install_uv.sh` script works on both Linux and macOS, and will suggest `brew install curl` on macOS if neither `curl` nor `wget` is available.
160
+
161
+ ## Installation Locations
162
+
163
+ ### Windows
164
+
165
+ **PowerShell installation:**
166
+ ```
167
+ %USERPROFILE%\.local\bin\uv.exe
168
+ Example: C:\Users\YourName\.local\bin\uv.exe
169
+ ```
170
+
171
+ **winget installation:**
172
+ ```
173
+ %LOCALAPPDATA%\Microsoft\WinGet\Links\uv.exe
174
+ Example: C:\Users\YourName\AppData\Local\Microsoft\WinGet\Links\uv.exe
175
+ ```
176
+
177
+ ### Linux
178
+
179
+ **Default installation (curl installer):**
180
+ ```
181
+ ~/.local/bin/uv
182
+ Example: /home/yourname/.local/bin/uv
183
+ ```
184
+
185
+ **Alternative location (cargo):**
186
+ ```
187
+ ~/.cargo/bin/uv
188
+ Example: /home/yourname/.cargo/bin/uv
189
+ ```
190
+
191
+ ### macOS
192
+
193
+ **Default installation (curl installer):**
194
+ ```
195
+ ~/.local/bin/uv
196
+ Example: /Users/yourname/.local/bin/uv
197
+ ```
198
+
199
+ **Alternative location (cargo):**
200
+ ```
201
+ ~/.cargo/bin/uv
202
+ Example: /Users/yourname/.cargo/bin/uv
203
+ ```
204
+
205
+ **Homebrew installation:**
206
+ ```
207
+ /opt/homebrew/bin/uv (Apple Silicon)
208
+ /usr/local/bin/uv (Intel)
209
+ ```
210
+
211
+ ## First Run
212
+
213
+ ### Windows with python_embeded
214
+
215
+ ```batch
216
+ REM Download and extract portable package from:
217
+ REM https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
218
+
219
+ REM Run the startup script
220
+ start_gradio_ui.bat
221
+
222
+ REM Output:
223
+ REM [Environment] Using embedded Python...
224
+ REM Starting ACE-Step Gradio UI...
225
+ ```
226
+
227
+ ### Windows with uv
228
+
229
+ ```batch
230
+ REM First time: uv will create a virtual environment and sync dependencies
231
+ start_gradio_ui.bat
232
+
233
+ REM Output:
234
+ REM [Environment] Using uv package manager...
235
+ REM [Setup] Virtual environment not found. Setting up environment...
236
+ REM Running: uv sync
237
+ ```
238
+
239
+ ### Linux with uv
240
+
241
+ ```bash
242
+ # Make scripts executable (first time only)
243
+ chmod +x start_gradio_ui.sh install_uv.sh
244
+
245
+ # First time: uv will create a virtual environment and sync dependencies
246
+ ./start_gradio_ui.sh
247
+
248
+ # Output:
249
+ # [Environment] Using uv package manager...
250
+ # [Setup] Virtual environment not found. Setting up environment...
251
+ # Running: uv sync
252
+ ```
253
+
254
+ ### macOS (Apple Silicon) with uv
255
+
256
+ ```bash
257
+ # Make scripts executable (first time only)
258
+ chmod +x start_gradio_ui_macos.sh install_uv.sh
259
+
260
+ # Use the macOS-specific script for MLX backend
261
+ ./start_gradio_ui_macos.sh
262
+
263
+ # Output:
264
+ # ============================================
265
+ # ACE-Step 1.5 - macOS Apple Silicon (MLX)
266
+ # ============================================
267
+ # [Environment] Using uv package manager...
268
+ # [Setup] Virtual environment not found. Setting up environment...
269
+ # Running: uv sync
270
+ ```
271
+
272
+ Note: On macOS Apple Silicon, always use `start_gradio_ui_macos.sh` instead of `start_gradio_ui.sh` to enable the MLX backend for native acceleration.
273
+
274
+ ## Troubleshooting
275
+
276
+ ### "uv not found" after installation
277
+
278
+ **Windows**
279
+
280
+ Cause: PATH not refreshed after installation.
281
+
282
+ Solution 1: Restart your terminal (close and reopen Command Prompt or PowerShell).
283
+
284
+ Solution 2: Use the full path temporarily:
285
+ ```batch
286
+ %USERPROFILE%\.local\bin\uv.exe run acestep
287
+ ```
288
+
289
+ **Linux/macOS**
290
+
291
+ Cause: uv installed but not in PATH.
292
+
293
+ Solution 1: Restart your terminal or source your profile:
294
+ ```bash
295
+ source ~/.bashrc # or ~/.zshrc on macOS
296
+ ```
297
+
298
+ Solution 2: Add uv to your PATH manually:
299
+ ```bash
300
+ # For ~/.local/bin installation
301
+ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
302
+ source ~/.bashrc
303
+
304
+ # For macOS with zsh (default shell)
305
+ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
306
+ source ~/.zshrc
307
+ ```
308
+
309
+ Solution 3: Use the full path temporarily:
310
+ ```bash
311
+ ~/.local/bin/uv run acestep
312
+ ```
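After adjusting PATH, you can confirm the `uv` binary is actually visible. The Python sketch below mirrors what `command -v uv` does in a shell; `uv_on_path` is a hypothetical helper for illustration, not part of ACE-Step:

```python
import os
import shutil
import stat
import tempfile

def uv_on_path(path_value):
    """Return the resolved `uv` executable for a given PATH string, or None.
    Equivalent to `command -v uv` in a shell."""
    return shutil.which("uv", path=path_value)

# Example: simulate ~/.local/bin being (or not being) on PATH.
with tempfile.TemporaryDirectory() as fake_local_bin:
    exe = os.path.join(fake_local_bin, "uv")
    with open(exe, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(exe, os.stat(exe).st_mode | stat.S_IXUSR)  # must be executable
    uv_on_path(fake_local_bin)    # resolves to the fake binary
    uv_on_path("/nonexistent")    # None: not on this PATH
```

If the check returns `None`, revisit the PATH steps above.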
313
+
314
+ ### Permission issues (Linux/macOS)
315
+
316
+ **Symptom**: `Permission denied` when running scripts.
317
+
318
+ **Solution**:
319
+ ```bash
320
+ chmod +x start_gradio_ui.sh
321
+ chmod +x start_gradio_ui_macos.sh
322
+ chmod +x install_uv.sh
323
+ ```
324
+
325
+ **Symptom**: `Permission denied` during uv installation.
326
+
327
+ **Solution**: The curl installer installs to `~/.local/bin`, which should not require root. If you see permission errors:
328
+ ```bash
329
+ # Ensure the directory exists and is writable
330
+ mkdir -p ~/.local/bin
331
+ ```
332
+
333
+ Do not use `sudo` with the uv installer.
334
+
335
+ ### winget not available (Windows)
336
+
337
+ **Symptom**:
338
+ ```
339
+ 'winget' is not recognized as an internal or external command
340
+ ```
341
+
342
+ **Solution**:
343
+ - Windows 11: Should be pre-installed. Try updating Windows.
344
+ - Windows 10: Install "App Installer" from the Microsoft Store.
345
+ - Alternative: Use the PowerShell installation method instead:
346
+ ```powershell
347
+ irm https://astral.sh/uv/install.ps1 | iex
348
+ ```
349
+
350
+ ### Installation fails
351
+
352
+ **Common causes**:
353
+ - Network connection issues
354
+ - Firewall blocking downloads
355
+ - Antivirus software interference (Windows)
356
+ - Missing `curl` or `wget` (Linux/macOS)
357
+
358
+ **Solutions**:
359
+
360
+ 1. Check your internet connection.
361
+ 2. Temporarily disable firewall/antivirus (Windows).
362
+ 3. Try an alternative installation method:
363
+ - **Windows**: Use PowerShell method if winget fails, or vice versa.
364
+ - **Linux**: Install `curl` first (`sudo apt install curl` on Ubuntu/Debian, `sudo yum install curl` on CentOS/RHEL).
365
+ - **macOS**: Use `brew install uv` as an alternative.
366
+ 4. **Windows only**: Use the portable package instead: https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
367
+
368
+ ## Switching Environments (Windows Only)
369
+
370
+ Windows is the only platform with two environment options. Linux and macOS use uv exclusively.
371
+
372
+ ### From python_embeded to uv
373
+
374
+ ```batch
375
+ REM 1. Install uv
376
+ install_uv.bat
377
+
378
+ REM 2. Rename or delete python_embeded folder
379
+ rename python_embeded python_embeded_backup
380
+
381
+ REM 3. Run startup script (will use uv)
382
+ start_gradio_ui.bat
383
+ ```
384
+
385
+ ### From uv to python_embeded
386
+
387
+ ```batch
388
+ REM 1. Download portable package
389
+ REM https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z
390
+
391
+ REM 2. Extract python_embeded folder to project root
392
+
393
+ REM 3. Run startup script (will use python_embeded)
394
+ start_gradio_ui.bat
395
+ ```
396
+
397
+ ## Environment Variables (.env)
398
+
399
+ ACE-Step can be configured using environment variables in a `.env` file.
400
+
401
+ ### Setup
402
+
403
+ ```bash
404
+ # Copy the example file
405
+ cp .env.example .env
406
+
407
+ # Edit .env with your preferred settings
408
+ ```
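For illustration, the `.env` format used here can be read with any standard loader. The minimal line-based parser below is a sketch of the format only (ACE-Step's actual loader may accept more syntax, e.g. quoting or `export` prefixes):

```python
def load_dotenv_line_based(text):
    """Parse KEY=VALUE lines from .env-style text, skipping comments and blanks."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

example = """
# macOS example
ACESTEP_LM_BACKEND=mlx
ACESTEP_DEVICE=auto
"""
config = load_dotenv_line_based(example)
```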
409
+
410
+ ### Available Variables
411
+
412
+ | Variable | Default | Description |
413
+ |----------|---------|-------------|
414
+ | `ACESTEP_INIT_LLM` | auto | LLM initialization control |
415
+ | `ACESTEP_CONFIG_PATH` | acestep-v15-turbo | DiT model path |
416
+ | `ACESTEP_LM_MODEL_PATH` | acestep-5Hz-lm-1.7B | LM model path |
417
+ | `ACESTEP_DEVICE` | auto | Device: auto, cuda, cpu, xpu |
418
+ | `ACESTEP_LM_BACKEND` | vllm | LM backend: vllm, pt, mlx |
419
+ | `ACESTEP_DOWNLOAD_SOURCE` | auto | Download source |
420
+ | `ACESTEP_API_KEY` | (none) | API authentication key |
421
+
422
+ ### ACESTEP_LM_BACKEND
423
+
424
+ Controls which backend is used for the Language Model.
425
+
426
+ | Value | Platform | Description |
427
+ |-------|----------|-------------|
428
+ | `vllm` | Linux (CUDA) | Default. Fastest backend for NVIDIA GPUs. |
429
+ | `pt` | All | PyTorch native backend. Works everywhere but slower. |
430
+ | `mlx` | macOS (Apple Silicon) | Native Apple Silicon acceleration via MLX. |
431
+
432
+ **Platform-specific recommendations:**
433
+ - **Windows**: Use `vllm` (default) with NVIDIA GPU, or `pt` as fallback.
434
+ - **Linux**: Use `vllm` (default) with NVIDIA GPU, or `pt` as fallback.
435
+ - **macOS Apple Silicon**: Use `mlx` for best performance. The `start_gradio_ui_macos.sh` script sets this automatically via `export ACESTEP_LM_BACKEND="mlx"`.
436
+
437
+ **Example .env for macOS Apple Silicon:**
438
+ ```bash
439
+ ACESTEP_LM_BACKEND=mlx
440
+ ACESTEP_CONFIG_PATH=acestep-v15-turbo
441
+ ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-0.6B
442
+ ```
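The platform recommendations above can be summarized as a small decision function. `pick_lm_backend` is a hypothetical helper for illustration; the shipped startup scripts set the backend via environment variables instead:

```python
def pick_lm_backend(platform, has_cuda, apple_silicon):
    """Illustrative backend choice following the recommendations above:
    mlx on Apple Silicon, vllm with CUDA, pt as the universal fallback."""
    if platform == "darwin":
        return "mlx" if apple_silicon else "pt"  # Intel Macs fall back to pt
    if has_cuda:
        return "vllm"
    return "pt"
```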
443
+
444
+ ### ACESTEP_INIT_LLM - LLM Initialization Control
445
+
446
+ Controls whether the Language Model (5Hz LM) is initialized at startup.
447
+
448
+ **Processing Flow:**
449
+ ```
450
+ GPU Detection (full) --> ACESTEP_INIT_LLM Override --> Model Loading
451
+ ```
452
+
453
+ - GPU optimizations (offload, quantization, batch limits) are **always applied**
454
+ - `ACESTEP_INIT_LLM` only overrides the "should we load LLM" decision
455
+ - When forcing, model validation shows warnings but does not block loading
456
+
457
+ | Value | Behavior |
458
+ |-------|----------|
459
+ | `auto` (or empty) | Use GPU auto-detection result (recommended) |
460
+ | `true` / `1` / `yes` | Force enable LLM after GPU detection (may cause OOM) |
461
+ | `false` / `0` / `no` | Force disable for pure DiT mode |
462
+
463
+ **Example configurations:**
464
+
465
+ ```bash
466
+ # Auto mode (recommended) - let GPU detection decide
467
+ ACESTEP_INIT_LLM=auto
468
+
469
+ # Auto mode - leave empty (same as above)
470
+ ACESTEP_INIT_LLM=
471
+
472
+ # Force enable on low VRAM GPU (GPU optimizations still applied)
473
+ ACESTEP_INIT_LLM=true
474
+ ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-0.6B # Use smallest model
475
+
476
+ # Force disable LLM for faster generation
477
+ ACESTEP_INIT_LLM=false
478
+ ```
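The value parsing described in the table can be sketched as follows (illustrative only; the actual implementation may accept additional spellings):

```python
def parse_init_llm(raw):
    """Map an ACESTEP_INIT_LLM value onto the auto / force-enable /
    force-disable decision described above."""
    value = (raw or "").strip().lower()
    if value in ("true", "1", "yes"):
        return True       # force enable (may cause OOM on low VRAM)
    if value in ("false", "0", "no"):
        return False      # force disable: pure DiT mode
    return None           # auto (or empty): defer to GPU detection
```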
479
+
480
+ ### Features Affected by LLM
481
+
482
+ When LLM is disabled (`ACESTEP_INIT_LLM=false`), these features are unavailable:
483
+
484
+ | Feature | Description | Available without LLM |
485
+ |---------|-------------|----------------------|
486
+ | Thinking mode | LLM generates audio codes | No |
487
+ | CoT caption | LLM enhances captions | No (auto-disabled) |
488
+ | CoT language | LLM detects vocal language | No (auto-disabled) |
489
+ | Sample mode | Generate from description | No |
490
+ | Format mode | LLM-enhanced input | No |
491
+ | Basic generation | DiT-based synthesis | Yes |
492
+ | Cover/Repaint | Audio editing tasks | Yes |
493
+
494
+ Note: When using the API server, CoT features (`use_cot_caption`, `use_cot_language`) are automatically disabled when LLM is unavailable, allowing basic generation to proceed.
495
+
496
+ ## Environment Comparison
497
+
498
+ | Feature | python_embeded (Windows) | uv (Windows) | uv (Linux) | uv (macOS) |
499
+ |---------|--------------------------|---------------|-------------|-------------|
500
+ | Setup Difficulty | Zero config | Need install | Need install | Need install |
501
+ | Startup Speed | Fast | Fast | Fast | Fast |
502
+ | Update Ease | Re-download | uv command | uv command | uv command |
503
+ | Environment Isolation | Complete | Virtual env | Virtual env | Virtual env |
504
+ | Development | Basic | Excellent | Excellent | Excellent |
505
+ | Beginner Friendly | Best | Good | Good | Good |
506
+ | GPU Backend | CUDA | CUDA | CUDA (vllm) | MLX (Apple Silicon) |
507
+ | Install Script | N/A | install_uv.bat | install_uv.sh | install_uv.sh |
508
+ | Launch Script | start_gradio_ui.bat | start_gradio_ui.bat | start_gradio_ui.sh | start_gradio_ui_macos.sh |
509
+
510
+ ## Recommendations
511
+
512
+ ### Windows
513
+
514
+ **Use python_embeded if:**
515
+ - First time using ACE-Step
516
+ - Want zero configuration
517
+ - Do not need frequent updates
518
+ - Prefer a self-contained package
519
+
520
+ **Use uv if:**
521
+ - Developer or experienced with Python
522
+ - Need to modify dependencies
523
+ - Using the Git repository
524
+ - Want smaller installation size
525
+ - Need frequent code updates
526
+
527
+ ### Linux
528
+
529
+ **Use uv (only option):**
530
+ - Install uv via the curl installer or `install_uv.sh`
531
+ - Use `start_gradio_ui.sh` to launch
532
+ - NVIDIA GPU with CUDA is recommended for vllm backend
533
+ - CPU-only is possible with `ACESTEP_DEVICE=cpu` and `ACESTEP_LM_BACKEND=pt`
534
+
535
+ ### macOS (Apple Silicon)
536
+
537
+ **Use uv with MLX backend (recommended):**
538
+ - Install uv via curl installer, Homebrew, or `install_uv.sh`
539
+ - Use `start_gradio_ui_macos.sh` to launch (sets MLX backend automatically)
540
+ - The 0.6B LM model (`acestep-5Hz-lm-0.6B`) is recommended for devices with limited unified memory
541
+ - Set `ACESTEP_LM_BACKEND=mlx` in `.env` if launching manually
542
+ - Intel Macs should use `start_gradio_ui.sh` with `ACESTEP_LM_BACKEND=pt` instead
.claude/skills/acestep-docs/guides/GPU_COMPATIBILITY.md ADDED
@@ -0,0 +1,134 @@
1
+ # GPU Compatibility Guide
2
+
3
+ ACE-Step 1.5 automatically adapts to your GPU's available VRAM, adjusting generation limits and LM model availability accordingly. The system detects GPU memory at startup and configures optimal settings.
4
+
5
+ ## GPU Tier Configuration
6
+
7
+ | VRAM | Tier | LM Mode | Max Duration | Max Batch Size | LM Memory Allocation |
8
+ |------|------|---------|--------------|----------------|---------------------|
9
+ | ≤4GB | Tier 1 | Not available | 3 min | 1 | - |
10
+ | 4-6GB | Tier 2 | Not available | 6 min | 1 | - |
11
+ | 6-8GB | Tier 3 | 0.6B (optional) | With LM: 4 min / Without: 6 min | With LM: 1 / Without: 2 | 3GB |
12
+ | 8-12GB | Tier 4 | 0.6B (optional) | With LM: 4 min / Without: 6 min | With LM: 2 / Without: 4 | 3GB |
13
+ | 12-16GB | Tier 5 | 0.6B / 1.7B | With LM: 4 min / Without: 6 min | With LM: 2 / Without: 4 | 0.6B: 3GB, 1.7B: 8GB |
14
+ | 16-24GB | Tier 6 | 0.6B / 1.7B / 4B | 8 min | With LM: 4 / Without: 8 | 0.6B: 3GB, 1.7B: 8GB, 4B: 12GB |
15
+ | ≥24GB | Unlimited | All models | 10 min | 8 | Unrestricted |
16
+
17
+ ## Notes
18
+
19
+ - **Default settings** are automatically configured based on detected GPU memory
20
+ - **LM Mode** refers to the Language Model used for Chain-of-Thought generation and audio understanding
21
+ - **Flash Attention**, **CPU Offload**, **Compile**, and **Quantization** are enabled by default for optimal performance
22
+ - If you request a duration or batch size exceeding your GPU's limits, a warning will be displayed and values will be clamped
23
+ - **Constrained Decoding**: When LM is initialized, the LM's duration generation is also constrained to the GPU tier's maximum duration limit, preventing out-of-memory errors during CoT generation
24
+ - For GPUs with ≤6GB VRAM, LM initialization is disabled by default to preserve memory for the DiT model
25
+ - You can manually override settings via command-line arguments or the Gradio UI
26
+
27
+ ## Overriding LLM Initialization
28
+
29
+ By default, LLM is auto-enabled/disabled based on GPU VRAM. You can override this behavior.
30
+
31
+ **Important:** GPU optimizations (offload, quantization, batch limits) are **always applied** regardless of override. `ACESTEP_INIT_LLM` only controls whether to attempt LLM loading.
32
+
33
+ ### Processing Flow
34
+
35
+ ```
36
+ GPU Detection (full) → ACESTEP_INIT_LLM Override → Model Loading
37
+ │ │ │
38
+ ├─ offload settings ├─ auto: use GPU result ├─ Download model
39
+ ├─ batch limits ├─ true: force enable ├─ Initialize LLM
40
+ ├─ duration limits └─ false: force disable └─ (with GPU settings)
41
+ └─ recommended models
42
+ ```
43
+
44
+ ### Gradio UI
45
+
46
+ ```bash
47
+ # Force enable LLM (may cause OOM on low VRAM)
48
+ uv run acestep --init_llm true
49
+
50
+ # Force disable LLM (pure DiT mode)
51
+ uv run acestep --init_llm false
52
+ ```
53
+
54
+ Or in `start_gradio_ui.bat`:
55
+ ```batch
56
+ set INIT_LLM=--init_llm true
57
+ ```
58
+
59
+ ### API Server
60
+
61
+ Using environment variable:
62
+ ```bash
63
+ # Auto mode (recommended)
64
+ set ACESTEP_INIT_LLM=auto
65
+ uv run acestep-api
66
+
67
+ # Force enable LLM
68
+ set ACESTEP_INIT_LLM=true
69
+ uv run acestep-api
70
+
71
+ # Force disable LLM
72
+ set ACESTEP_INIT_LLM=false
73
+ uv run acestep-api
74
+ ```
75
+
76
+ Or using command line:
77
+ ```bash
78
+ uv run acestep-api --init-llm
79
+ ```
80
+
81
+ Or in `start_api_server.bat`:
82
+ ```batch
83
+ set ACESTEP_INIT_LLM=true
84
+ ```
85
+
86
+ ### When to Override
87
+
88
+ | Scenario | Setting | Notes |
89
+ |----------|---------|-------|
90
+ | Low VRAM but need thinking mode | `true` | May cause OOM, use with caution |
91
+ | Fast generation without CoT | `false` | Skips LLM, uses pure DiT |
92
+ | API server pure DiT mode | `false` | Faster responses, simpler setup |
93
+ | High VRAM but want minimal setup | `false` | No LLM model download needed |
94
+
95
+ ### Features Affected by LLM
96
+
97
+ When LLM is disabled, these features are automatically disabled:
98
+ - **Thinking mode** (`thinking=true`)
99
+ - **CoT caption/language detection** (`use_cot_caption`, `use_cot_language`)
100
+ - **Sample mode** (generate from description)
101
+ - **Format mode** (LLM-enhanced input)
102
+
103
+ The API server will automatically fall back to pure DiT mode when these features are requested but the LLM is unavailable.
104
+
105
+ > **Community Contributions Welcome**: The GPU tier configurations above are based on our testing across common hardware. If you find that your device's actual performance differs from these parameters (e.g., can handle longer durations or larger batch sizes), we welcome you to conduct more thorough testing and submit a PR to optimize these configurations in `acestep/gpu_config.py`. Your contributions help improve the experience for all users!
106
+
107
+ ## Memory Optimization Tips
108
+
109
+ 1. **Low VRAM (<8GB)**: Use DiT-only mode without LM initialization for maximum duration
110
+ 2. **Medium VRAM (8-16GB)**: Use the 0.6B LM model for best balance of quality and memory
111
+ 3. **High VRAM (>16GB)**: Enable larger LM models (1.7B/4B) for better audio understanding and generation quality
112
+
113
+ ## Debug Mode: Simulating Different GPU Configurations
114
+
115
+ For testing and development, you can simulate different GPU memory sizes using the `MAX_CUDA_VRAM` environment variable:
116
+
117
+ ```bash
118
+ # Simulate a 4GB GPU (Tier 1)
119
+ MAX_CUDA_VRAM=4 uv run acestep
120
+
121
+ # Simulate an 8GB GPU (Tier 4)
122
+ MAX_CUDA_VRAM=8 uv run acestep
123
+
124
+ # Simulate a 12GB GPU (Tier 5)
125
+ MAX_CUDA_VRAM=12 uv run acestep
126
+
127
+ # Simulate a 16GB GPU (Tier 6)
128
+ MAX_CUDA_VRAM=16 uv run acestep
129
+ ```
130
+
131
+ This is useful for:
132
+ - Testing GPU tier configurations on high-end hardware
133
+ - Verifying that warnings and limits work correctly for each tier
134
+ - Developing and testing new GPU configuration parameters before submitting a PR
.claude/skills/acestep-docs/guides/GRADIO_GUIDE.md ADDED
@@ -0,0 +1,549 @@
1
+ # ACE-Step Gradio Demo User Guide
2
+
3
+ ---
4
+
5
+ This guide provides comprehensive documentation for using the ACE-Step Gradio web interface for music generation, including all features and settings.
6
+
7
+ ## Table of Contents
8
+
9
+ - [Getting Started](#getting-started)
10
+ - [Service Configuration](#service-configuration)
11
+ - [Generation Modes](#generation-modes)
12
+ - [Task Types](#task-types)
13
+ - [Input Parameters](#input-parameters)
14
+ - [Advanced Settings](#advanced-settings)
15
+ - [Results Section](#results-section)
16
+ - [LoRA Training](#lora-training)
17
+ - [Tips and Best Practices](#tips-and-best-practices)
18
+
19
+ ---
20
+
21
+ ## Getting Started
22
+
23
+ ### Launching the Demo
24
+
25
+ ```bash
26
+ # Basic launch
27
+ python app.py
28
+
29
+ # With pre-initialization
30
+ python app.py --config acestep-v15-turbo --init-llm
31
+
32
+ # With specific port
33
+ python app.py --port 7860
34
+ ```
35
+
36
+ ### Interface Overview
37
+
38
+ The Gradio interface consists of several main sections:
39
+
40
+ 1. **Service Configuration** - Model loading and initialization
41
+ 2. **Required Inputs** - Task type, audio uploads, and generation mode
42
+ 3. **Music Caption & Lyrics** - Text inputs for generation
43
+ 4. **Optional Parameters** - Metadata like BPM, key, duration
44
+ 5. **Advanced Settings** - Fine-grained control over generation
45
+ 6. **Results** - Generated audio playback and management
46
+
47
+ ---
48
+
49
+ ## Service Configuration
50
+
51
+ ### Model Selection
52
+
53
+ | Setting | Description |
54
+ |---------|-------------|
55
+ | **Checkpoint File** | Select a trained model checkpoint (if available) |
56
+ | **Main Model Path** | Choose the DiT model configuration (e.g., `acestep-v15-turbo`, `acestep-v15-turbo-shift3`) |
57
+ | **Device** | Processing device: `auto` (recommended), `cuda`, or `cpu` |
58
+
59
+ ### 5Hz LM Configuration
60
+
61
+ | Setting | Description |
62
+ |---------|-------------|
63
+ | **5Hz LM Model Path** | Select the language model (e.g., `acestep-5Hz-lm-0.6B`, `acestep-5Hz-lm-1.7B`) |
64
+ | **5Hz LM Backend** | `vllm` (faster, recommended) or `pt` (PyTorch, more compatible) |
65
+ | **Initialize 5Hz LM** | Check to load the LM during initialization (required for thinking mode) |
66
+
67
+ ### Performance Options
68
+
69
+ | Setting | Description |
70
+ |---------|-------------|
71
+ | **Use Flash Attention** | Enable for faster inference (requires flash_attn package) |
72
+ | **Offload to CPU** | Offload models to CPU when idle to save GPU memory |
73
+ | **Offload DiT to CPU** | Specifically offload the DiT model to CPU |
74
+
75
+ ### LoRA Adapter
76
+
77
+ | Setting | Description |
78
+ |---------|-------------|
79
+ | **LoRA Path** | Path to trained LoRA adapter directory |
80
+ | **Load LoRA** | Load the specified LoRA adapter |
81
+ | **Unload** | Remove the currently loaded LoRA |
82
+ | **Use LoRA** | Enable/disable the loaded LoRA for inference |
83
+
84
+ ### Initialization
85
+
86
+ Click **Initialize Service** to load the models. The status box will show progress and confirmation.
87
+
88
+ ---
89
+
90
+ ## Generation Modes
91
+
92
+ ### Simple Mode
93
+
94
+ Simple mode is designed for quick, natural language-based music generation.
95
+
96
+ **How to use:**
97
+ 1. Select "Simple" in the Generation Mode radio button
98
+ 2. Enter a natural language description in the "Song Description" field
99
+ 3. Optionally check "Instrumental" if you don't want vocals
100
+ 4. Optionally select a preferred vocal language
101
+ 5. Click **Create Sample** to generate caption, lyrics, and metadata
102
+ 6. Review the generated content in the expanded sections
103
+ 7. Click **Generate Music** to create the audio
104
+
105
+ **Example descriptions:**
106
+ - "a soft Bengali love song for a quiet evening"
107
+ - "upbeat electronic dance music with heavy bass drops"
108
+ - "melancholic indie folk with acoustic guitar"
109
+ - "jazz trio playing in a smoky bar"
110
+
111
+ **Random Sample:** Click the 🎲 button to load a random example description.
112
+
113
+ ### Custom Mode
114
+
115
+ Custom mode provides full control over all generation parameters.
116
+
117
+ **How to use:**
118
+ 1. Select "Custom" in the Generation Mode radio button
119
+ 2. Manually fill in the Caption and Lyrics fields
120
+ 3. Set optional metadata (BPM, Key, Duration, etc.)
121
+ 4. Optionally click **Format** to enhance your input using the LM
122
+ 5. Configure advanced settings as needed
123
+ 6. Click **Generate Music** to create the audio
124
+
125
+ ---
126
+
127
+ ## Task Types
128
+
129
+ ### text2music (Default)
130
+
131
+ Generate music from text descriptions and/or lyrics.
132
+
133
+ **Use case:** Creating new music from scratch based on prompts.
134
+
135
+ **Required inputs:** Caption or Lyrics (at least one)
136
+
137
+ ### cover
138
+
139
+ Transform existing audio while maintaining structure but changing style.
140
+
141
+ **Use case:** Creating cover versions in different styles.
142
+
143
+ **Required inputs:**
144
+ - Source Audio (upload in Audio Uploads section)
145
+ - Caption describing the target style
146
+
147
+ **Key parameter:** `Audio Cover Strength` (0.0-1.0)
148
+ - Higher values maintain more of the original structure
149
+ - Lower values allow more creative freedom
150
+
151
+ ### repaint
152
+
153
+ Regenerate a specific time segment of audio.
154
+
155
+ **Use case:** Fixing or modifying specific sections of generated music.
156
+
157
+ **Required inputs:**
158
+ - Source Audio
159
+ - Repainting Start (seconds)
160
+ - Repainting End (seconds, -1 for end of file)
161
+ - Caption describing the desired content
162
+
163
+ ### lego (Base Model Only)
164
+
165
+ Generate a specific instrument track in context of existing audio.
166
+
167
+ **Use case:** Adding instrument layers to backing tracks.
168
+
169
+ **Required inputs:**
170
+ - Source Audio
171
+ - Track Name (select from dropdown)
172
+ - Caption describing the track characteristics
173
+
174
+ **Available tracks:** vocals, backing_vocals, drums, bass, guitar, keyboard, percussion, strings, synth, fx, brass, woodwinds
175
+
176
+ ### extract (Base Model Only)
177
+
178
+ Extract/isolate a specific instrument track from mixed audio.
179
+
180
+ **Use case:** Stem separation, isolating instruments.
181
+
182
+ **Required inputs:**
183
+ - Source Audio
184
+ - Track Name to extract
185
+
186
+ ### complete (Base Model Only)
187
+
188
+ Complete partial tracks with specified instruments.
189
+
190
+ **Use case:** Auto-arranging incomplete compositions.
191
+
192
+ **Required inputs:**
193
+ - Source Audio
194
+ - Track Names (multiple selection)
195
+ - Caption describing the desired style
196
+
197
+ ---
198
+
199
+ ## Input Parameters
200
+
201
+ ### Required Inputs
202
+
203
+ #### Task Type
204
+ Select the generation task from the dropdown. The instruction field updates automatically based on the selected task.
205
+
206
+ #### Audio Uploads
207
+
208
+ | Field | Description |
209
+ |-------|-------------|
210
+ | **Reference Audio** | Optional audio for style reference |
211
+ | **Source Audio** | Required for cover, repaint, lego, extract, complete tasks |
212
+ | **Convert to Codes** | Extract 5Hz semantic codes from source audio |
213
+
214
+ #### LM Codes Hints
215
+
216
+ Pre-computed audio semantic codes can be pasted here to guide generation. Use the **Transcribe** button to analyze codes and extract metadata.
217
+
218
+ ### Music Caption
219
+
220
+ The text description of the desired music. Be specific about:
221
+ - Genre and style
222
+ - Instruments
223
+ - Mood and atmosphere
224
+ - Tempo feel (if not specifying BPM)
225
+
226
+ **Example:** "upbeat pop rock with electric guitars, driving drums, and catchy synth hooks"
227
+
228
+ Click 🎲 to load a random example caption.
229
+
230
+ ### Lyrics
231
+
232
+ Enter lyrics with structure tags:
233
+
234
+ ```
235
+ [Verse 1]
236
+ Walking down the street today
237
+ Thinking of the words you used to say
238
+
239
+ [Chorus]
240
+ I'm moving on, I'm staying strong
241
+ This is where I belong
242
+
243
+ [Verse 2]
244
+ ...
245
+ ```
246
+
247
+ **Instrumental checkbox:** Check this to generate instrumental music regardless of lyrics content.
248
+
249
+ **Vocal Language:** Select the language for vocals. Use "unknown" for auto-detection or instrumental tracks.
250
+
251
+ **Format button:** Click to enhance caption and lyrics using the 5Hz LM.
252
+
253
+ ### Optional Parameters
254
+
255
+ | Parameter | Default | Description |
256
+ |-----------|---------|-------------|
257
+ | **BPM** | Auto | Tempo in beats per minute (30-300) |
258
+ | **Key Scale** | Auto | Musical key (e.g., "C Major", "Am", "F# minor") |
259
+ | **Time Signature** | Auto | Time signature: 2 (2/4), 3 (3/4), 4 (4/4), 6 (6/8) |
260
+ | **Audio Duration** | Auto/-1 | Target length in seconds (10-600). -1 for automatic |
261
+ | **Batch Size** | 2 | Number of audio variations to generate (1-8) |
262
+
263
+ ---
264
+
265
+ ## Advanced Settings
266
+
267
+ ### DiT Parameters
268
+
269
+ | Parameter | Default | Description |
270
+ |-----------|---------|-------------|
271
+ | **Inference Steps** | 8 | Denoising steps. Turbo: 1-20, Base: 1-200 |
272
+ | **Guidance Scale** | 7.0 | CFG strength (base model only). Higher = follows prompt more |
273
+ | **Seed** | -1 | Random seed. Use comma-separated values for batches |
274
+ | **Random Seed** | ✓ | When checked, generates random seeds |
275
+ | **Audio Format** | mp3 | Output format: mp3, flac |
276
+ | **Shift** | 3.0 | Timestep shift factor (1.0-5.0). Recommended 3.0 for turbo |
277
+ | **Inference Method** | ode | ode (Euler, faster) or sde (stochastic) |
278
+ | **Custom Timesteps** | - | Override timesteps (e.g., "0.97,0.76,0.615,0.5,0.395,0.28,0.18,0.085,0") |
279
+
280
+ ### Base Model Only Parameters
281
+
282
+ | Parameter | Default | Description |
283
+ |-----------|---------|-------------|
284
+ | **Use ADG** | ✗ | Enable Adaptive Dual Guidance for better quality |
285
+ | **CFG Interval Start** | 0.0 | When to start applying CFG (0.0-1.0) |
286
+ | **CFG Interval End** | 1.0 | When to stop applying CFG (0.0-1.0) |
287
+
288
+ ### LM Parameters
289
+
290
+ | Parameter | Default | Description |
291
+ |-----------|---------|-------------|
292
+ | **LM Temperature** | 0.85 | Sampling temperature (0.0-2.0). Higher = more creative |
293
+ | **LM CFG Scale** | 2.0 | LM guidance strength (1.0-3.0) |
294
+ | **LM Top-K** | 0 | Top-K sampling. 0 disables |
295
+ | **LM Top-P** | 0.9 | Nucleus sampling (0.0-1.0) |
296
+ | **LM Negative Prompt** | "NO USER INPUT" | Negative prompt for CFG |
297
+
298
+ ### CoT (Chain-of-Thought) Options
299
+
300
+ | Option | Default | Description |
301
+ |--------|---------|-------------|
302
+ | **CoT Metas** | ✓ | Generate metadata via LM reasoning |
303
+ | **CoT Language** | ✓ | Detect vocal language via LM |
304
+ | **Constrained Decoding Debug** | ✗ | Enable debug logging |
305
+
306
+ ### Generation Options
307
+
308
+ | Option | Default | Description |
309
+ |--------|---------|-------------|
310
+ | **LM Codes Strength** | 1.0 | How strongly LM codes influence generation (0.0-1.0) |
311
+ | **Auto Score** | ✗ | Automatically calculate quality scores |
312
+ | **Auto LRC** | ✗ | Automatically generate lyrics timestamps |
313
+ | **LM Batch Chunk Size** | 8 | Max items per LM batch (GPU memory) |
314
+
315
+ ### Main Generation Controls
316
+
317
+ | Control | Description |
318
+ |---------|-------------|
319
+ | **Think** | Enable 5Hz LM for code generation and metadata |
320
+ | **ParallelThinking** | Enable parallel LM batch processing |
321
+ | **CaptionRewrite** | Let LM enhance the input caption |
322
+ | **AutoGen** | Automatically start next batch after completion |
323
+
324
+ ---
325
+
326
+ ## Results Section
327
+
328
+ ### Generated Audio
329
+
330
+ Up to 8 audio samples are displayed based on batch size. Each sample includes:
331
+
332
+ - **Audio Player** - Play, pause, and download the generated audio
333
+ - **Send To Src** - Send this audio to the Source Audio input for further processing
334
+ - **Save** - Save audio and metadata to a JSON file
335
+ - **Score** - Calculate perplexity-based quality score
336
+ - **LRC** - Generate lyrics timestamps (LRC format)
337
+
338
+ ### Details Accordion
339
+
340
+ Click "Score & LRC & LM Codes" to expand and view:
341
+ - **LM Codes** - The 5Hz semantic codes for this sample
342
+ - **Quality Score** - Perplexity-based quality metric
343
+ - **Lyrics Timestamps** - LRC format timing data
344
+
345
+ ### Batch Navigation
346
+
347
+ | Control | Description |
348
+ |---------|-------------|
349
+ | **◀ Previous** | View the previous batch |
350
+ | **Batch Indicator** | Shows current batch position (e.g., "Batch 1 / 3") |
351
+ | **Next Batch Status** | Shows background generation progress |
352
+ | **Next ▶** | View the next batch (triggers generation if AutoGen is on) |
353
+
354
+ ### Restore Parameters
355
+
356
+ Click **Apply These Settings to UI** to restore all generation parameters from the current batch back to the input fields. Useful for iterating on a good result.
357
+
358
+ ### Batch Results
359
+
360
+ The "Batch Results & Generation Details" accordion contains:
361
+ - **All Generated Files** - Download all files from all batches
362
+ - **Generation Details** - Detailed information about the generation process
363
+
364
+ ---
365
+
366
+ ## LoRA Training
367
+
368
+ The LoRA Training tab provides tools for creating custom LoRA adapters.
369
+
370
+ ### Dataset Builder Tab
371
+
372
+ #### Step 1: Load or Scan
373
+
374
+ **Option A: Load Existing Dataset**
375
+ 1. Enter the path to a previously saved dataset JSON
376
+ 2. Click **Load**
377
+
378
+ **Option B: Scan New Directory**
379
+ 1. Enter the path to your audio folder
380
+ 2. Click **Scan** to find audio files (wav, mp3, flac, ogg, opus)
381
+
382
+ #### Step 2: Configure Dataset
383
+
384
+ | Setting | Description |
385
+ |---------|-------------|
386
+ | **Dataset Name** | Name for your dataset |
387
+ | **All Instrumental** | Check if all tracks have no vocals |
388
+ | **Custom Activation Tag** | Unique tag to activate this LoRA's style |
389
+ | **Tag Position** | Where to place the tag: Prepend, Append, or Replace caption |
390
+
391
+ #### Step 3: Auto-Label
392
+
393
+ Click **Auto-Label All** to generate metadata for all audio files:
394
+ - Caption (music description)
395
+ - BPM
396
+ - Key
397
+ - Time Signature
398
+
399
+ The **Skip Metas** option skips LLM labeling and uses N/A values for the metadata fields instead.
400
+
401
+ #### Step 4: Preview & Edit
402
+
403
+ Use the slider to select samples and manually edit:
404
+ - Caption
405
+ - Lyrics
406
+ - BPM, Key, Time Signature
407
+ - Language
408
+ - Instrumental flag
409
+
410
+ Click **Save Changes** to update the sample.
411
+
412
+ #### Step 5: Save Dataset
413
+
414
+ Enter a save path and click **Save Dataset** to export as JSON.
415
+
416
+ #### Step 6: Preprocess
417
+
418
+ Convert the dataset to pre-computed tensors for fast training:
419
+ 1. Optionally load an existing dataset JSON
420
+ 2. Set the tensor output directory
421
+ 3. Click **Preprocess**
422
+
423
+ This encodes audio to VAE latents, text to embeddings, and runs the condition encoder.
424
+
425
+ ### Train LoRA Tab
426
+
427
+ #### Dataset Selection
428
+
429
+ Enter the path to preprocessed tensors directory and click **Load Dataset**.
430
+
431
+ #### LoRA Settings
432
+
433
+ | Setting | Default | Description |
434
+ |---------|---------|-------------|
435
+ | **LoRA Rank (r)** | 64 | Capacity of LoRA. Higher = more capacity, more memory |
436
+ | **LoRA Alpha** | 128 | Scaling factor (typically 2x rank) |
437
+ | **LoRA Dropout** | 0.1 | Dropout rate for regularization |
438
+
439
+ #### Training Parameters
440
+
441
+ | Setting | Default | Description |
442
+ |---------|---------|-------------|
443
+ | **Learning Rate** | 1e-4 | Optimization learning rate |
444
+ | **Max Epochs** | 500 | Maximum training epochs |
445
+ | **Batch Size** | 1 | Training batch size |
446
+ | **Gradient Accumulation** | 1 | Effective batch = batch_size × accumulation |
447
+ | **Save Every N Epochs** | 200 | Checkpoint save frequency |
448
+ | **Shift** | 3.0 | Timestep shift for turbo model |
449
+ | **Seed** | 42 | Random seed for reproducibility |
450
+
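The two derived quantities in the tables above can be sanity-checked with plain arithmetic. A minimal sketch (the batch values are illustrative, not required settings; the `alpha / rank` ratio follows the standard LoRA formulation, where the "2x rank" guideline gives a scaling of 2.0):

```python
# LoRA adapter scaling in the standard formulation is alpha / rank.
rank, alpha = 64, 128
scaling = alpha / rank  # the "alpha = 2x rank" guideline yields 2.0

# Effective batch = batch_size x gradient accumulation (from the table above).
batch_size, grad_accum = 1, 4  # grad_accum=4 is an example; the default is 1
effective_batch = batch_size * grad_accum

print(scaling, effective_batch)  # 2.0 4
```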
451
+ #### Training Controls
452
+
453
+ - **Start Training** - Begin the training process
454
+ - **Stop Training** - Interrupt training
455
+ - **Training Progress** - Shows current epoch and loss
456
+ - **Training Log** - Detailed training output
457
+ - **Training Loss Plot** - Visual loss curve
458
+
459
+ #### Export LoRA
460
+
461
+ After training, export the final adapter:
462
+ 1. Enter the export path
463
+ 2. Click **Export LoRA**
464
+
465
+ ---
466
+
467
+ ## Tips and Best Practices
468
+
469
+ ### For Best Quality
470
+
471
+ 1. **Use thinking mode** - Keep "Think" checkbox enabled for LM-enhanced generation
472
+ 2. **Be specific in captions** - Include genre, instruments, mood, and style details
473
+ 3. **Let LM detect metadata** - Leave BPM/Key/Duration empty for auto-detection
474
+ 4. **Use batch generation** - Generate 2-4 variations and pick the best
475
+
476
+ ### For Faster Generation
477
+
478
+ 1. **Use turbo model** - Select `acestep-v15-turbo` or `acestep-v15-turbo-shift3`
479
+ 2. **Keep inference steps at 8** - Default is optimal for turbo
480
+ 3. **Reduce batch size** - Lower batch size if you need quick results
481
+ 4. **Disable AutoGen** - Manual control over batch generation
482
+
483
+ ### For Consistent Results
484
+
485
+ 1. **Set a specific seed** - Uncheck "Random Seed" and enter a seed value
486
+ 2. **Save good results** - Use "Save" to export parameters for reproduction
487
+ 3. **Use "Apply These Settings"** - Restore parameters from a good batch
488
+
489
+ ### For Long-form Music
490
+
491
+ 1. **Set explicit duration** - Specify duration in seconds
492
+ 2. **Use repaint task** - Fix problematic sections after initial generation
493
+ 3. **Chain generations** - Use "Send To Src" to build upon previous results
494
+
495
+ ### For Style Consistency
496
+
497
+ 1. **Train a LoRA** - Create a custom adapter for your style
498
+ 2. **Use reference audio** - Upload style reference in Audio Uploads
499
+ 3. **Use consistent captions** - Maintain similar descriptive language
500
+
501
+ ### Troubleshooting
502
+
503
+ **No audio generated:**
504
+ - Check that the model is initialized (green status message)
505
+ - Ensure 5Hz LM is initialized if using thinking mode
506
+ - Check the status output for error messages
507
+
508
+ **Poor quality results:**
509
+ - Increase inference steps (for base model)
510
+ - Adjust guidance scale
511
+ - Try different seeds
512
+ - Make caption more specific
513
+
514
+ **Out of memory:**
515
+ - Reduce batch size
516
+ - Enable CPU offloading
517
+ - Reduce LM batch chunk size
518
+
519
+ **LM not working:**
520
+ - Ensure "Initialize 5Hz LM" was checked during initialization
521
+ - Check that a valid LM model path is selected
522
+ - Verify vllm or PyTorch backend is available
523
+
524
+ ---
525
+
526
+ ## Keyboard Shortcuts
527
+
528
+ The Gradio interface supports standard web shortcuts:
529
+ - **Tab** - Move between input fields
530
+ - **Enter** - Submit text inputs
531
+ - **Space** - Toggle checkboxes
532
+
533
+ ---
534
+
535
+ ## Language Support
536
+
537
+ The interface supports multiple UI languages:
538
+ - **English** (en)
539
+ - **Chinese** (zh)
540
+ - **Japanese** (ja)
541
+
542
+ Select your preferred language in the Service Configuration section.
543
+
544
+ ---
545
+
546
+ For more information, see:
547
+ - Main README: [`../../README.md`](../../README.md)
548
+ - REST API Documentation: [`API.md`](API.md)
549
+ - Python Inference API: [`INFERENCE.md`](INFERENCE.md)
.claude/skills/acestep-docs/guides/INFERENCE.md ADDED
@@ -0,0 +1,1191 @@
1
+ # ACE-Step Inference API Documentation
2
+
3
+ ---
4
+
5
+ This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types.
6
+
7
+ ## Table of Contents
8
+
9
+ - [Quick Start](#quick-start)
10
+ - [API Overview](#api-overview)
11
+ - [GenerationParams Parameters](#generationparams-parameters)
12
+ - [GenerationConfig Parameters](#generationconfig-parameters)
13
+ - [Task Types](#task-types)
14
+ - [Helper Functions](#helper-functions)
15
+ - [Complete Examples](#complete-examples)
16
+ - [Best Practices](#best-practices)
17
+
18
+ ---
19
+
20
+ ## Quick Start
21
+
22
+ ### Basic Usage
23
+
24
+ ```python
25
+ from acestep.handler import AceStepHandler
26
+ from acestep.llm_inference import LLMHandler
27
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music
28
+
29
+ # Initialize handlers
30
+ dit_handler = AceStepHandler()
31
+ llm_handler = LLMHandler()
32
+
33
+ # Initialize services
34
+ dit_handler.initialize_service(
35
+ project_root="/path/to/project",
36
+ config_path="acestep-v15-turbo",
37
+ device="cuda"
38
+ )
39
+
40
+ llm_handler.initialize(
41
+ checkpoint_dir="/path/to/checkpoints",
42
+ lm_model_path="acestep-5Hz-lm-0.6B",
43
+ backend="vllm",
44
+ device="cuda"
45
+ )
46
+
47
+ # Configure generation parameters
48
+ params = GenerationParams(
49
+ caption="upbeat electronic dance music with heavy bass",
50
+ bpm=128,
51
+ duration=30,
52
+ )
53
+
54
+ # Configure generation settings
55
+ config = GenerationConfig(
56
+ batch_size=2,
57
+ audio_format="flac",
58
+ )
59
+
60
+ # Generate music
61
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")
62
+
63
+ # Access results
64
+ if result.success:
65
+ for audio in result.audios:
66
+ print(f"Generated: {audio['path']}")
67
+ print(f"Key: {audio['key']}")
68
+ print(f"Seed: {audio['params']['seed']}")
69
+ else:
70
+ print(f"Error: {result.error}")
71
+ ```
72
+
73
+ ---
74
+
75
+ ## API Overview
76
+
77
+ ### Main Functions
78
+
79
+ #### generate_music
80
+
81
+ ```python
82
+ def generate_music(
83
+ dit_handler,
84
+ llm_handler,
85
+ params: GenerationParams,
86
+ config: GenerationConfig,
87
+ save_dir: Optional[str] = None,
88
+ progress=None,
89
+ ) -> GenerationResult
90
+ ```
91
+
92
+ Main function for generating music using the ACE-Step model.
93
+
94
+ #### understand_music
95
+
96
+ ```python
97
+ def understand_music(
98
+ llm_handler,
99
+ audio_codes: str,
100
+ temperature: float = 0.85,
101
+ top_k: Optional[int] = None,
102
+ top_p: Optional[float] = None,
103
+ repetition_penalty: float = 1.0,
104
+ use_constrained_decoding: bool = True,
105
+ constrained_decoding_debug: bool = False,
106
+ ) -> UnderstandResult
107
+ ```
108
+
109
+ Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.).
110
+
111
+ #### create_sample
112
+
113
+ ```python
114
+ def create_sample(
115
+ llm_handler,
116
+ query: str,
117
+ instrumental: bool = False,
118
+ vocal_language: Optional[str] = None,
119
+ temperature: float = 0.85,
120
+ top_k: Optional[int] = None,
121
+ top_p: Optional[float] = None,
122
+ repetition_penalty: float = 1.0,
123
+ use_constrained_decoding: bool = True,
124
+ constrained_decoding_debug: bool = False,
125
+ ) -> CreateSampleResult
126
+ ```
127
+
128
+ Generate a complete music sample (caption, lyrics, metadata) from a natural language description.
129
+
130
+ #### format_sample
131
+
132
+ ```python
133
+ def format_sample(
134
+ llm_handler,
135
+ caption: str,
136
+ lyrics: str,
137
+ user_metadata: Optional[Dict[str, Any]] = None,
138
+ temperature: float = 0.85,
139
+ top_k: Optional[int] = None,
140
+ top_p: Optional[float] = None,
141
+ repetition_penalty: float = 1.0,
142
+ use_constrained_decoding: bool = True,
143
+ constrained_decoding_debug: bool = False,
144
+ ) -> FormatSampleResult
145
+ ```
146
+
147
+ Format and enhance user-provided caption and lyrics, generating structured metadata.
148
+
149
+ ### Configuration Objects
150
+
151
+ The API uses two configuration dataclasses:
152
+
153
+ **GenerationParams** - Contains all music generation parameters:
154
+
155
+ ```python
156
+ @dataclass
157
+ class GenerationParams:
158
+ # Task & Instruction
159
+ task_type: str = "text2music"
160
+ instruction: str = "Fill the audio semantic mask based on the given conditions:"
161
+
162
+ # Audio Uploads
163
+ reference_audio: Optional[str] = None
164
+ src_audio: Optional[str] = None
165
+
166
+ # LM Codes Hints
167
+ audio_codes: str = ""
168
+
169
+ # Text Inputs
170
+ caption: str = ""
171
+ lyrics: str = ""
172
+ instrumental: bool = False
173
+
174
+ # Metadata
175
+ vocal_language: str = "unknown"
176
+ bpm: Optional[int] = None
177
+ keyscale: str = ""
178
+ timesignature: str = ""
179
+ duration: float = -1.0
180
+
181
+ # Advanced Settings
182
+ inference_steps: int = 8
183
+ seed: int = -1
184
+ guidance_scale: float = 7.0
185
+ use_adg: bool = False
186
+ cfg_interval_start: float = 0.0
187
+ cfg_interval_end: float = 1.0
188
+ shift: float = 1.0 # NEW: Timestep shift factor
189
+ infer_method: str = "ode" # NEW: Diffusion inference method
190
+ timesteps: Optional[List[float]] = None # NEW: Custom timesteps
191
+
192
+ repainting_start: float = 0.0
193
+ repainting_end: float = -1
194
+ audio_cover_strength: float = 1.0
195
+
196
+ # 5Hz Language Model Parameters
197
+ thinking: bool = True
198
+ lm_temperature: float = 0.85
199
+ lm_cfg_scale: float = 2.0
200
+ lm_top_k: int = 0
201
+ lm_top_p: float = 0.9
202
+ lm_negative_prompt: str = "NO USER INPUT"
203
+ use_cot_metas: bool = True
204
+ use_cot_caption: bool = True
205
+ use_cot_lyrics: bool = False
206
+ use_cot_language: bool = True
207
+ use_constrained_decoding: bool = True
208
+
209
+ # CoT Generated Values (auto-filled by LM)
210
+ cot_bpm: Optional[int] = None
211
+ cot_keyscale: str = ""
212
+ cot_timesignature: str = ""
213
+ cot_duration: Optional[float] = None
214
+ cot_vocal_language: str = "unknown"
215
+ cot_caption: str = ""
216
+ cot_lyrics: str = ""
217
+ ```
218
+
219
+ **GenerationConfig** - Contains batch and output configuration:
220
+
221
+ ```python
222
+ @dataclass
223
+ class GenerationConfig:
224
+ batch_size: int = 2
225
+ allow_lm_batch: bool = False
226
+ use_random_seed: bool = True
227
+ seeds: Optional[List[int]] = None
228
+ lm_batch_chunk_size: int = 8
229
+ constrained_decoding_debug: bool = False
230
+ audio_format: str = "flac"
231
+ ```
232
+
233
+ ### Result Objects
234
+
235
+ **GenerationResult** - Result of music generation:
236
+
237
+ ```python
238
+ @dataclass
239
+ class GenerationResult:
240
+ # Audio Outputs
241
+ audios: List[Dict[str, Any]] # List of audio dictionaries
242
+
243
+ # Generation Information
244
+ status_message: str # Status message from generation
245
+ extra_outputs: Dict[str, Any] # Extra outputs (latents, masks, lm_metadata, time_costs)
246
+
247
+ # Success Status
248
+ success: bool # Whether generation succeeded
249
+ error: Optional[str] # Error message if failed
250
+ ```
251
+
252
+ **Audio Dictionary Structure:**
253
+
254
+ Each item in `audios` list contains:
255
+
256
+ ```python
257
+ {
258
+ "path": str, # File path to saved audio
259
+ "tensor": Tensor, # Audio tensor [channels, samples], CPU, float32
260
+ "key": str, # Unique audio key (UUID based on params)
261
+ "sample_rate": int, # Sample rate (default: 48000)
262
+ "params": Dict, # Generation params for this audio (includes seed, audio_codes, etc.)
263
+ }
264
+ ```
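As a quick sketch of consuming this structure, the clip length can be recovered from `tensor` and `sample_rate`. The helper below is hypothetical (not part of the ACE-Step API) and assumes only the fields documented above:

```python
def audio_duration_seconds(audio: dict) -> float:
    """Length of one generated clip in seconds.

    Hypothetical convenience helper: `tensor` has shape
    [channels, samples], per the structure documented above.
    """
    _channels, samples = audio["tensor"].shape
    return samples / audio["sample_rate"]
```

For example, a `[2, 96000]` tensor at the default 48000 Hz sample rate is a 2-second clip.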
265
+
266
+ **UnderstandResult** - Result of music understanding:
267
+
268
+ ```python
269
+ @dataclass
270
+ class UnderstandResult:
271
+ # Metadata Fields
272
+ caption: str = ""
273
+ lyrics: str = ""
274
+ bpm: Optional[int] = None
275
+ duration: Optional[float] = None
276
+ keyscale: str = ""
277
+ language: str = ""
278
+ timesignature: str = ""
279
+
280
+ # Status
281
+ status_message: str = ""
282
+ success: bool = True
283
+ error: Optional[str] = None
284
+ ```
285
+
286
+ **CreateSampleResult** - Result of sample creation:
287
+
288
+ ```python
289
+ @dataclass
290
+ class CreateSampleResult:
291
+ # Metadata Fields
292
+ caption: str = ""
293
+ lyrics: str = ""
294
+ bpm: Optional[int] = None
295
+ duration: Optional[float] = None
296
+ keyscale: str = ""
297
+ language: str = ""
298
+ timesignature: str = ""
299
+ instrumental: bool = False
300
+
301
+ # Status
302
+ status_message: str = ""
303
+ success: bool = True
304
+ error: Optional[str] = None
305
+ ```
306
+
307
+ **FormatSampleResult** - Result of sample formatting:
308
+
309
+ ```python
310
+ @dataclass
311
+ class FormatSampleResult:
312
+ # Metadata Fields
313
+ caption: str = ""
314
+ lyrics: str = ""
315
+ bpm: Optional[int] = None
316
+ duration: Optional[float] = None
317
+ keyscale: str = ""
318
+ language: str = ""
319
+ timesignature: str = ""
320
+
321
+ # Status
322
+ status_message: str = ""
323
+ success: bool = True
324
+ error: Optional[str] = None
325
+ ```
326
+
327
+ ---
328
+
329
+ ## GenerationParams Parameters
330
+
331
+ ### Text Inputs
332
+
333
+ | Parameter | Type | Default | Description |
334
+ |-----------|------|---------|-------------|
335
+ | `caption` | `str` | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or detailed description with genre, mood, instruments, etc. Max 512 characters. |
336
+ | `lyrics` | `str` | `""` | Lyrics text for vocal music. Use `"[Instrumental]"` for instrumental tracks. Supports multiple languages. Max 4096 characters. |
337
+ | `instrumental` | `bool` | `False` | If True, generate instrumental music regardless of lyrics. |
338
+
339
+ ### Music Metadata
340
+
341
+ | Parameter | Type | Default | Description |
342
+ |-----------|------|---------|-------------|
343
+ | `bpm` | `Optional[int]` | `None` | Beats per minute (30-300). `None` enables auto-detection via LM. |
344
+ | `keyscale` | `str` | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
345
+ | `timesignature` | `str` | `""` | Time signature, given as the beat count (`2` for 2/4, `3` for 3/4, `4` for 4/4, `6` for 6/8). Empty string enables auto-detection. |
346
+ | `vocal_language` | `str` | `"unknown"` | Language code for vocals (ISO 639-1). Supported: `"en"`, `"zh"`, `"ja"`, `"es"`, `"fr"`, etc. Use `"unknown"` for auto-detection. |
347
+ | `duration` | `float` | `-1.0` | Target audio length in seconds (10-600). If `<= 0` or `None`, the model chooses the length automatically based on the lyrics. |
348
+
349
+ ### Generation Parameters
350
+
351
+ | Parameter | Type | Default | Description |
352
+ |-----------|------|---------|-------------|
353
+ | `inference_steps` | `int` | `8` | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
354
+ | `guidance_scale` | `float` | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to the text prompt. Only supported by the non-turbo (base) model. Typical range: 5.0-9.0. |
355
+ | `seed` | `int` | `-1` | Random seed for reproducibility. Use `-1` for random seed, or any positive integer for fixed seed. |
356
+
357
+ ### Advanced DiT Parameters
358
+
359
+ | Parameter | Type | Default | Description |
360
+ |-----------|------|---------|-------------|
361
+ | `use_adg` | `bool` | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
362
+ | `cfg_interval_start` | `float` | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
363
+ | `cfg_interval_end` | `float` | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
364
+ | `shift` | `float` | `1.0` | Timestep shift factor (range 1.0-5.0, default 1.0). When != 1.0, applies `t = shift * t / (1 + (shift - 1) * t)` to timesteps. Recommended 3.0 for turbo models. |
365
+ | `infer_method` | `str` | `"ode"` | Diffusion inference method. `"ode"` (Euler) is faster and deterministic. `"sde"` (stochastic) introduces sampling variance and may produce different results across runs. |
366
+ | `timesteps` | `Optional[List[float]]` | `None` | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., `[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]`). If provided, overrides `inference_steps` and `shift`. |
367
+
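The `shift` transform above can be checked in isolation. This standalone sketch applies the documented formula to a uniform 8-step schedule (`shift=3.0` is the recommendation for turbo models; the schedule itself is illustrative):

```python
def shift_timestep(t: float, shift: float) -> float:
    # Documented transform: t = shift * t / (1 + (shift - 1) * t)
    return shift * t / (1 + (shift - 1) * t)

steps = 8
uniform = [(steps - i) / steps for i in range(steps)]  # 1.0 down to 0.125
shifted = [shift_timestep(t, 3.0) for t in uniform]
# shift > 1 pushes timesteps toward 1.0, spending more of the
# schedule in the high-noise region; shift = 1.0 is the identity.
```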
368
+ ### Task-Specific Parameters
369
+
370
+ | Parameter | Type | Default | Description |
371
+ |-----------|------|---------|-------------|
372
+ | `task_type` | `str` | `"text2music"` | Generation task type. See [Task Types](#task-types) section for details. |
373
+ | `instruction` | `str` | `"Fill the audio semantic mask based on the given conditions:"` | Task-specific instruction prompt. |
374
+ | `reference_audio` | `Optional[str]` | `None` | Path to reference audio file for style transfer or continuation tasks. |
375
+ | `src_audio` | `Optional[str]` | `None` | Path to source audio file for audio-to-audio tasks (cover, repaint, etc.). |
376
+ | `audio_codes` | `str` | `""` | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
377
+ | `repainting_start` | `float` | `0.0` | Repainting start time in seconds (for repaint/lego tasks). |
378
+ | `repainting_end` | `float` | `-1` | Repainting end time in seconds. Use `-1` for end of audio. |
379
+ | `audio_cover_strength` | `float` | `1.0` | Strength of audio cover/codes influence (0.0-1.0). Use smaller values (e.g., 0.2) for style transfer tasks. |
380
+
381
+ ### 5Hz Language Model Parameters
382
+
383
+ | Parameter | Type | Default | Description |
384
+ |-----------|------|---------|-------------|
385
+ | `thinking` | `bool` | `True` | Enable 5Hz Language Model "Chain-of-Thought" reasoning for semantic/music metadata and codes. |
386
+ | `lm_temperature` | `float` | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
387
+ | `lm_cfg_scale` | `float` | `2.0` | LM classifier-free guidance scale. Higher = stronger adherence to prompt. |
388
+ | `lm_top_k` | `int` | `0` | LM top-k sampling. `0` disables top-k filtering. Typical values: 40-100. |
389
+ | `lm_top_p` | `float` | `0.9` | LM nucleus sampling (0.0-1.0). `1.0` disables nucleus sampling. Typical values: 0.9-0.95. |
390
+ | `lm_negative_prompt` | `str` | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
391
+ | `use_cot_metas` | `bool` | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). |
392
+ | `use_cot_caption` | `bool` | `True` | Refine user caption using LM CoT reasoning. |
393
+ | `use_cot_language` | `bool` | `True` | Detect vocal language using LM CoT reasoning. |
394
+ | `use_cot_lyrics` | `bool` | `False` | (Reserved for future use) Generate/refine lyrics using LM CoT. |
395
+ | `use_constrained_decoding` | `bool` | `True` | Enable constrained decoding for structured LM output. |
396
+
397
+ ### CoT Generated Values
398
+
399
+ These fields are automatically populated by the LM when CoT reasoning is enabled:
400
+
401
+ | Parameter | Type | Default | Description |
402
+ |-----------|------|---------|-------------|
403
+ | `cot_bpm` | `Optional[int]` | `None` | LM-generated BPM value. |
404
+ | `cot_keyscale` | `str` | `""` | LM-generated key/scale. |
405
+ | `cot_timesignature` | `str` | `""` | LM-generated time signature. |
406
+ | `cot_duration` | `Optional[float]` | `None` | LM-generated duration. |
407
+ | `cot_vocal_language` | `str` | `"unknown"` | LM-detected vocal language. |
408
+ | `cot_caption` | `str` | `""` | LM-refined caption. |
409
+ | `cot_lyrics` | `str` | `""` | LM-generated/refined lyrics. |
410
+
411
+ ---
412
+
413
+ ## GenerationConfig Parameters
414
+
415
+ | Parameter | Type | Default | Description |
416
+ |-----------|------|---------|-------------|
417
+ | `batch_size` | `int` | `2` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
418
+ | `allow_lm_batch` | `bool` | `False` | Allow batch processing in LM. Faster when `batch_size >= 2` and `thinking=True`. |
419
+ | `use_random_seed` | `bool` | `True` | Whether to use random seed. `True` for different results each time, `False` for reproducible results. |
420
+ | `seeds` | `Optional[List[int]]` | `None` | Seeds for batch generation; a single int is also accepted. If fewer seeds than `batch_size` are provided, the list is padded with random seeds. |
421
+ | `lm_batch_chunk_size` | `int` | `8` | Maximum batch size per LM inference chunk (GPU memory constraint). |
422
+ | `constrained_decoding_debug` | `bool` | `False` | Enable debug logging for constrained decoding. |
423
+ | `audio_format` | `str` | `"flac"` | Output audio format. Options: `"mp3"`, `"wav"`, `"flac"`. Default is FLAC for fast saving. |
424
+
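The documented `seeds` padding behavior ("padded with random seeds if fewer than batch_size") can be sketched as a standalone helper. `resolve_seeds` is hypothetical and only mirrors the description above, not the library's internal implementation:

```python
import random

def resolve_seeds(batch_size: int, seeds=None) -> list:
    """Hypothetical sketch of the documented seed handling:
    accept a single int or a list, then pad with random seeds
    until there is one seed per batch item."""
    if seeds is None:
        seeds = []
    elif isinstance(seeds, int):
        seeds = [seeds]
    resolved = list(seeds)[:batch_size]
    while len(resolved) < batch_size:
        resolved.append(random.randint(0, 2**31 - 1))
    return resolved
```

For example, `resolve_seeds(4, [7, 8])` keeps the two fixed seeds and draws two random ones, so the first two batch items stay reproducible.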
425
+ ---
426
+
427
+ ## Task Types
428
+
429
+ ACE-Step supports 6 different generation task types, each optimized for specific use cases.
430
+
431
+ ### 1. Text2Music (Default)
432
+
433
+ **Purpose**: Generate music from text descriptions and optional metadata.
434
+
435
+ **Key Parameters**:
436
+ ```python
437
+ params = GenerationParams(
438
+ task_type="text2music",
439
+ caption="energetic rock music with electric guitar",
440
+ lyrics="[Instrumental]", # or actual lyrics
441
+ bpm=140,
442
+ duration=30,
443
+ )
444
+ ```
445
+
446
+ **Required**:
447
+ - `caption` or `lyrics` (at least one)
448
+
449
+ **Optional but Recommended**:
450
+ - `bpm`: Controls tempo
451
+ - `keyscale`: Controls musical key
452
+ - `timesignature`: Controls rhythm structure
453
+ - `duration`: Controls length
454
+ - `vocal_language`: Controls vocal characteristics
455
+
456
+ **Use Cases**:
457
+ - Generate music from text descriptions
458
+ - Create backing tracks from prompts
459
+ - Generate songs with lyrics
460
+
461
+ ---
462
+
463
+ ### 2. Cover
464
+
465
+ **Purpose**: Transform existing audio while maintaining structure but changing style/timbre.
466
+
467
+ **Key Parameters**:
468
+ ```python
469
+ params = GenerationParams(
470
+ task_type="cover",
471
+ src_audio="original_song.mp3",
472
+ caption="jazz piano version",
473
+ audio_cover_strength=0.8, # 0.0-1.0
474
+ )
475
+ ```
476
+
477
+ **Required**:
478
+ - `src_audio`: Path to source audio file
479
+ - `caption`: Description of desired style/transformation
480
+
481
+ **Optional**:
482
+ - `audio_cover_strength`: Controls influence of original audio
483
+ - `1.0`: Strong adherence to original structure
484
+ - `0.5`: Balanced transformation
485
+ - `0.1`: Loose interpretation
486
+ - `lyrics`: New lyrics (if changing vocals)
487
+
488
+ **Use Cases**:
489
+ - Create covers in different styles
490
+ - Change instrumentation while keeping melody
491
+ - Genre transformation
492
+
493
+ ---
494
+
495
+ ### 3. Repaint
496
+
497
+ **Purpose**: Regenerate a specific time segment of audio while keeping the rest unchanged.
498
+
499
+ **Key Parameters**:
500
+ ```python
501
+ params = GenerationParams(
502
+ task_type="repaint",
503
+ src_audio="original.mp3",
504
+ repainting_start=10.0, # seconds
505
+ repainting_end=20.0, # seconds
506
+ caption="smooth transition with piano solo",
507
+ )
508
+ ```
509
+
510
+ **Required**:
511
+ - `src_audio`: Path to source audio file
512
+ - `repainting_start`: Start time in seconds
513
+ - `repainting_end`: End time in seconds (use `-1` for end of file)
514
+ - `caption`: Description of desired content for repainted section
515
+
516
+ **Use Cases**:
517
+ - Fix specific sections of generated music
518
+ - Add variations to parts of a song
519
+ - Create smooth transitions
520
+ - Replace problematic segments
521
+
522
+ ---
523
+
524
+ ### 4. Lego (Base Model Only)
525
+
526
+ **Purpose**: Generate a specific instrument track in context of existing audio.
527
+
528
+ **Key Parameters**:
529
+ ```python
530
+ params = GenerationParams(
531
+ task_type="lego",
532
+ src_audio="backing_track.mp3",
533
+ instruction="Generate the guitar track based on the audio context:",
534
+ caption="lead guitar melody with bluesy feel",
535
+ repainting_start=0.0,
536
+ repainting_end=-1,
537
+ )
538
+ ```
539
+
540
+ **Required**:
541
+ - `src_audio`: Path to source/backing audio
542
+ - `instruction`: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...")
543
+ - `caption`: Description of desired track characteristics
544
+
545
+ **Available Tracks**:
546
+ - `"vocals"`, `"backing_vocals"`, `"drums"`, `"bass"`, `"guitar"`, `"keyboard"`
547
+ - `"percussion"`, `"strings"`, `"synth"`, `"fx"`, `"brass"`, `"woodwinds"`
548
+
549
+ **Use Cases**:
550
+ - Add specific instrument tracks
551
+ - Layer additional instruments over backing tracks
552
+ - Create multi-track compositions iteratively
553
+
554
+ ---
555
+
556
+ ### 5. Extract (Base Model Only)
557
+
558
+ **Purpose**: Extract/isolate a specific instrument track from mixed audio.
559
+
560
+ **Key Parameters**:
561
+ ```python
562
+ params = GenerationParams(
563
+ task_type="extract",
564
+ src_audio="full_mix.mp3",
565
+ instruction="Extract the vocals track from the audio:",
566
+ )
567
+ ```
568
+
569
+ **Required**:
570
+ - `src_audio`: Path to mixed audio file
571
+ - `instruction`: Must specify track to extract
572
+
573
+ **Available Tracks**: Same as Lego task
574
+
575
+ **Use Cases**:
576
+ - Stem separation
577
+ - Isolate specific instruments
578
+ - Create remixes
579
+ - Analyze individual tracks
580
+
581
+ ---
582
+
583
+ ### 6. Complete (Base Model Only)
584
+
585
+ **Purpose**: Complete/extend partial tracks with specified instruments.
586
+
587
+ **Key Parameters**:
588
+ ```python
589
+ params = GenerationParams(
590
+ task_type="complete",
591
+ src_audio="incomplete_track.mp3",
592
+ instruction="Complete the input track with drums, bass, guitar:",
593
+ caption="rock style completion",
594
+ )
595
+ ```
596
+
597
+ **Required**:
598
+ - `src_audio`: Path to incomplete/partial track
599
+ - `instruction`: Must specify which tracks to add
600
+ - `caption`: Description of desired style
601
+
602
+ **Use Cases**:
603
+ - Arrange incomplete compositions
604
+ - Add backing tracks
605
+ - Auto-complete musical ideas
606
+
607
+ ---
608
+
609
+ ## Helper Functions
610
+
611
+ ### understand_music
612
+
613
+ Analyze audio codes to extract metadata about the music.
614
+
615
+ ```python
616
+ from acestep.inference import understand_music
617
+
618
+ result = understand_music(
619
+ llm_handler=llm_handler,
620
+ audio_codes="<|audio_code_123|><|audio_code_456|>...",
621
+ temperature=0.85,
622
+ use_constrained_decoding=True,
623
+ )
624
+
625
+ if result.success:
626
+ print(f"Caption: {result.caption}")
627
+ print(f"Lyrics: {result.lyrics}")
628
+ print(f"BPM: {result.bpm}")
629
+ print(f"Key: {result.keyscale}")
630
+ print(f"Duration: {result.duration}s")
631
+ print(f"Language: {result.language}")
632
+ else:
633
+ print(f"Error: {result.error}")
634
+ ```
635
+
636
+ **Use Cases**:
637
+ - Analyze existing music
638
+ - Extract metadata from audio codes
639
+ - Reverse-engineer generation parameters
640
+
641
+ ---
642
+
643
+ ### create_sample
644
+
645
+ Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature.
646
+
647
+ ```python
648
+ from acestep.inference import create_sample
649
+
650
+ result = create_sample(
651
+ llm_handler=llm_handler,
652
+ query="a soft Bengali love song for a quiet evening",
653
+ instrumental=False,
654
+ vocal_language="bn", # Optional: constrain to Bengali
655
+ temperature=0.85,
656
+ )
657
+
658
+ if result.success:
659
+ print(f"Caption: {result.caption}")
660
+ print(f"Lyrics: {result.lyrics}")
661
+ print(f"BPM: {result.bpm}")
662
+ print(f"Duration: {result.duration}s")
663
+ print(f"Key: {result.keyscale}")
664
+ print(f"Is Instrumental: {result.instrumental}")
665
+
666
+ # Use with generate_music
667
+ params = GenerationParams(
668
+ caption=result.caption,
669
+ lyrics=result.lyrics,
670
+ bpm=result.bpm,
671
+ duration=result.duration,
672
+ keyscale=result.keyscale,
673
+ vocal_language=result.language,
674
+ )
675
+ else:
676
+ print(f"Error: {result.error}")
677
+ ```
678
+
679
+ **Parameters**:
680
+
681
+ | Parameter | Type | Default | Description |
682
+ |-----------|------|---------|-------------|
683
+ | `query` | `str` | required | Natural language description of desired music |
684
+ | `instrumental` | `bool` | `False` | Whether to generate instrumental music |
685
+ | `vocal_language` | `Optional[str]` | `None` | Constrain lyrics to specific language (e.g., "en", "zh", "bn") |
686
+ | `temperature` | `float` | `0.85` | Sampling temperature |
687
+ | `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) |
688
+ | `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) |
689
+ | `repetition_penalty` | `float` | `1.0` | Repetition penalty |
690
+ | `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |
691
+
692
+ ---
693
+
694
+ ### format_sample
695
+
696
+ Format and enhance user-provided caption and lyrics, generating structured metadata.
697
+
698
+ ```python
699
+ from acestep.inference import format_sample
700
+
701
+ result = format_sample(
702
+ llm_handler=llm_handler,
703
+ caption="Latin pop, reggaeton",
704
+ lyrics="[Verse 1]\nBailando en la noche...",
705
+ user_metadata={"bpm": 95}, # Optional: constrain specific values
706
+ temperature=0.85,
707
+ )
708
+
709
+ if result.success:
710
+ print(f"Enhanced Caption: {result.caption}")
711
+ print(f"Formatted Lyrics: {result.lyrics}")
712
+ print(f"BPM: {result.bpm}")
713
+ print(f"Duration: {result.duration}s")
714
+ print(f"Key: {result.keyscale}")
715
+ print(f"Detected Language: {result.language}")
716
+ else:
717
+ print(f"Error: {result.error}")
718
+ ```
719
+
720
+ **Parameters**:
721
+
722
+ | Parameter | Type | Default | Description |
723
+ |-----------|------|---------|-------------|
724
+ | `caption` | `str` | required | User's caption/description |
725
+ | `lyrics` | `str` | required | User's lyrics with structure tags |
726
+ | `user_metadata` | `Optional[Dict]` | `None` | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language) |
727
+ | `temperature` | `float` | `0.85` | Sampling temperature |
728
+ | `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) |
729
+ | `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) |
730
+ | `repetition_penalty` | `float` | `1.0` | Repetition penalty |
731
+ | `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |
732
+
733
+ ---
734
+
735
+ ## Complete Examples
736
+
737
+ ### Example 1: Simple Text-to-Music Generation
738
+
739
+ ```python
740
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music
741
+
742
+ params = GenerationParams(
743
+ task_type="text2music",
744
+ caption="calm ambient music with soft piano and strings",
745
+ duration=60,
746
+ bpm=80,
747
+ keyscale="C Major",
748
+ )
749
+
750
+ config = GenerationConfig(
751
+ batch_size=2, # Generate 2 variations
752
+ audio_format="flac",
753
+ )
754
+
755
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
756
+
757
+ if result.success:
758
+ for i, audio in enumerate(result.audios, 1):
759
+ print(f"Variation {i}: {audio['path']}")
760
+ ```
761
+
762
+ ### Example 2: Song Generation with Lyrics
763
+
764
+ ```python
765
+ params = GenerationParams(
766
+ task_type="text2music",
767
+ caption="pop ballad with emotional vocals",
768
+ lyrics="""Verse 1:
769
+ Walking down the street today
770
+ Thinking of the words you used to say
771
+ Everything feels different now
772
+ But I'll find my way somehow
773
+
774
+ [Chorus]
775
+ I'm moving on, I'm staying strong
776
+ This is where I belong
777
+ """,
778
+ vocal_language="en",
779
+ bpm=72,
780
+ duration=45,
781
+ )
782
+
783
+ config = GenerationConfig(batch_size=1)
784
+
785
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
786
+ ```
787
+
788
+ ### Example 3: Using Custom Timesteps
789
+
790
+ ```python
791
+ params = GenerationParams(
792
+ task_type="text2music",
793
+ caption="jazz fusion with complex harmonies",
794
+ # Custom 9-step schedule
795
+ timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
796
+ thinking=True,
797
+ )
798
+
799
+ config = GenerationConfig(batch_size=1)
800
+
801
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
802
+ ```
803
+
804
+ ### Example 4: Using Shift Parameter (Turbo Model)
805
+
806
+ ```python
807
+ params = GenerationParams(
808
+ task_type="text2music",
809
+ caption="upbeat electronic dance music",
810
+ inference_steps=8,
811
+ shift=3.0, # Recommended for turbo models
812
+ infer_method="ode",
813
+ )
814
+
815
+ config = GenerationConfig(batch_size=2)
816
+
817
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
818
+ ```
819
+
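The effect of `shift` can be illustrated numerically. Flow-matching samplers often implement it as the warp t' = s*t / (1 + (s - 1)*t) applied to a uniform schedule; whether ACE-Step uses exactly this form is an assumption, but the sketch shows why `shift=3.0` spends more steps at high noise levels:

```python
# Hypothetical sketch of a flow-matching "shift" warp -- an assumption
# about the mechanism, not code taken from the ACE-Step source.
def shift_timesteps(steps: int, shift: float) -> list[float]:
    """Warp a uniform 1.0 -> 0.0 schedule: t' = s*t / (1 + (s - 1)*t)."""
    ts = [1 - i / steps for i in range(steps + 1)]  # uniform 1.0 ... 0.0
    return [round(shift * t / (1 + (shift - 1) * t), 4) for t in ts]

print(shift_timesteps(4, 1.0))  # [1.0, 0.75, 0.5, 0.25, 0.0] (unchanged)
print(shift_timesteps(4, 3.0))  # [1.0, 0.9, 0.75, 0.5, 0.0] (pushed toward 1.0)
```

With `shift > 1` the schedule values are pulled toward 1.0, so more of the step budget is spent at high noise levels, which matters most when `inference_steps` is small.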
820
+ ### Example 5: Simple Mode with create_sample
821
+
822
+ ```python
823
+ from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music
824
+
825
+ # Step 1: Create sample from description
826
+ sample = create_sample(
827
+ llm_handler=llm_handler,
828
+ query="energetic K-pop dance track with catchy hooks",
829
+ vocal_language="ko",
830
+ )
831
+
832
+ if sample.success:
833
+ # Step 2: Generate music using the sample
834
+ params = GenerationParams(
835
+ caption=sample.caption,
836
+ lyrics=sample.lyrics,
837
+ bpm=sample.bpm,
838
+ duration=sample.duration,
839
+ keyscale=sample.keyscale,
840
+ vocal_language=sample.language,
841
+ thinking=True,
842
+ )
843
+
844
+ config = GenerationConfig(batch_size=2)
845
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
846
+ ```
847
+
848
+ ### Example 6: Format and Enhance User Input
849
+
850
+ ```python
851
+ from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music
852
+
853
+ # Step 1: Format user input
854
+ formatted = format_sample(
855
+ llm_handler=llm_handler,
856
+ caption="rock ballad",
857
+ lyrics="[Verse]\nIn the darkness I find my way...",
858
+ )
859
+
860
+ if formatted.success:
861
+ # Step 2: Generate with enhanced input
862
+ params = GenerationParams(
863
+ caption=formatted.caption,
864
+ lyrics=formatted.lyrics,
865
+ bpm=formatted.bpm,
866
+ duration=formatted.duration,
867
+ keyscale=formatted.keyscale,
868
+ thinking=True,
869
+ use_cot_metas=False, # Already formatted, skip metas CoT
870
+ )
871
+
872
+ config = GenerationConfig(batch_size=2)
873
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
874
+ ```
875
+
876
+ ### Example 7: Style Cover with LM Reasoning
877
+
878
+ ```python
879
+ params = GenerationParams(
880
+ task_type="cover",
881
+ src_audio="original_pop_song.mp3",
882
+ caption="orchestral symphonic arrangement",
883
+ audio_cover_strength=0.7,
884
+ thinking=True, # Enable LM for metadata
885
+ use_cot_metas=True,
886
+ )
887
+
888
+ config = GenerationConfig(batch_size=1)
889
+
890
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
891
+
892
+ # Access LM-generated metadata
893
+ if result.extra_outputs.get("lm_metadata"):
894
+ lm_meta = result.extra_outputs["lm_metadata"]
895
+ print(f"LM detected BPM: {lm_meta.get('bpm')}")
896
+ print(f"LM detected Key: {lm_meta.get('keyscale')}")
897
+ ```
898
+
899
+ ### Example 8: Batch Generation with Specific Seeds
900
+
901
+ ```python
902
+ params = GenerationParams(
903
+ task_type="text2music",
904
+ caption="epic cinematic trailer music",
905
+ )
906
+
907
+ config = GenerationConfig(
908
+ batch_size=4, # Generate 4 variations
909
+ seeds=[42, 123, 456], # Specify 3 seeds, 4th will be random
910
+ use_random_seed=False, # Use provided seeds
911
+ lm_batch_chunk_size=2, # Process 2 at a time (GPU memory)
912
+ )
913
+
914
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
915
+
916
+ if result.success:
917
+ print(f"Generated {len(result.audios)} variations")
918
+ for audio in result.audios:
919
+ print(f" Seed {audio['params']['seed']}: {audio['path']}")
920
+ ```
921
+
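The seed-padding behavior noted above (three seeds supplied for four variations) can be sketched as follows; `pad_seeds` is a hypothetical helper for illustration, not part of the `acestep` API:

```python
import random

# Hypothetical helper (not part of the acestep API): pad a user-provided
# seed list up to batch_size, filling the remainder with random seeds.
def pad_seeds(seeds: list[int], batch_size: int) -> list[int]:
    padded = list(seeds[:batch_size])
    while len(padded) < batch_size:
        padded.append(random.randint(0, 2**32 - 1))
    return padded

print(pad_seeds([42, 123, 456], 4))  # e.g. [42, 123, 456, <random>]
```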
922
+ ### Example 9: High-Quality Generation (Base Model)
923
+
924
+ ```python
925
+ params = GenerationParams(
926
+ task_type="text2music",
927
+ caption="intricate jazz fusion with complex harmonies",
928
+ inference_steps=64, # High quality
929
+ guidance_scale=8.0,
930
+ use_adg=True, # Adaptive Dual Guidance
931
+ cfg_interval_start=0.0,
932
+ cfg_interval_end=1.0,
933
+ shift=3.0, # Timestep shift
934
+ seed=42, # Reproducible results
935
+ )
936
+
937
+ config = GenerationConfig(
938
+ batch_size=1,
939
+ use_random_seed=False,
940
+ audio_format="wav", # Lossless format
941
+ )
942
+
943
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
944
+ ```
945
+
946
+ ### Example 10: Understand Audio from Codes
947
+
948
+ ```python
949
+ from acestep.inference import understand_music
950
+
951
+ # Analyze audio codes (e.g., from a previous generation)
952
+ result = understand_music(
953
+ llm_handler=llm_handler,
954
+ audio_codes="<|audio_code_10695|><|audio_code_54246|>...",
955
+ temperature=0.85,
956
+ )
957
+
958
+ if result.success:
959
+ print(f"Detected Caption: {result.caption}")
960
+ print(f"Detected Lyrics: {result.lyrics}")
961
+ print(f"Detected BPM: {result.bpm}")
962
+ print(f"Detected Key: {result.keyscale}")
963
+ print(f"Detected Duration: {result.duration}s")
964
+ print(f"Detected Language: {result.language}")
965
+ ```
966
+
967
+ ---
968
+
969
+ ## Best Practices
970
+
971
+ ### 1. Caption Writing
972
+
973
+ **Good Captions**:
974
+ ```python
975
+ # Specific and descriptive
976
+ caption="upbeat electronic dance music with heavy bass and synthesizer leads"
977
+
978
+ # Include mood and genre
979
+ caption="melancholic indie folk with acoustic guitar and soft vocals"
980
+
981
+ # Specify instruments
982
+ caption="jazz trio with piano, upright bass, and brush drums"
983
+ ```
984
+
985
+ **Avoid**:
986
+ ```python
987
+ # Too vague
988
+ caption="good music"
989
+
990
+ # Contradictory
991
+ caption="fast slow music" # Conflicting tempos
992
+ ```
993
+
994
+ ### 2. Parameter Tuning
995
+
996
+ **For Best Quality**:
997
+ - Use base model with `inference_steps=64` or higher
998
+ - Enable `use_adg=True`
999
+ - Set `guidance_scale=7.0-9.0`
1000
+ - Set `shift=3.0` for better timestep distribution
1001
+ - Use lossless audio format (`audio_format="wav"`)
1002
+
1003
+ **For Speed**:
1004
+ - Use turbo model with `inference_steps=8`
1005
+ - Disable ADG (`use_adg=False`)
1006
+ - Use `infer_method="ode"` (default)
1007
+ - Use compressed format (`audio_format="mp3"`) or default FLAC
1008
+
1009
+ **For Consistency**:
1010
+ - Set `use_random_seed=False` in config
1011
+ - Use fixed `seeds` list or single `seed` in params
1012
+ - Keep `lm_temperature` lower (0.7-0.85)
1013
+
1014
+ **For Diversity**:
1015
+ - Set `use_random_seed=True` in config
1016
+ - Increase `lm_temperature` (0.9-1.1)
1017
+ - Use `batch_size > 1` for variations
1018
+
1019
+ ### 3. Duration Guidelines
1020
+
1021
+ - **Instrumental**: 30-180 seconds works well
1022
+ - **With Lyrics**: Auto-detection recommended (set `duration=-1` or leave default)
1023
+ - **Short clips**: 10-20 seconds minimum
1024
+ - **Long form**: Up to 600 seconds (10 minutes) maximum
1025
+
1026
+ ### 4. LM Usage
1027
+
1028
+ **When to Enable LM (`thinking=True`)**:
1029
+ - Need automatic metadata detection
1030
+ - Want caption refinement
1031
+ - Generating from minimal input
1032
+ - Need diverse outputs
1033
+
1034
+ **When to Disable LM (`thinking=False`)**:
1035
+ - Have precise metadata already
1036
+ - Need faster generation
1037
+ - Want full control over parameters
1038
+
1039
+ ### 5. Batch Processing
1040
+
1041
+ ```python
1042
+ # Efficient batch generation
1043
+ config = GenerationConfig(
1044
+ batch_size=8, # Max supported
1045
+ allow_lm_batch=True, # Enable for speed (when thinking=True)
1046
+ lm_batch_chunk_size=4, # Adjust based on GPU memory
1047
+ )
1048
+ ```
1049
+
1050
+ ### 6. Error Handling
1051
+
1052
+ ```python
1053
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
1054
+
1055
+ if not result.success:
1056
+ print(f"Generation failed: {result.error}")
1057
+ print(f"Status: {result.status_message}")
1058
+ else:
1059
+ # Process successful result
1060
+ for audio in result.audios:
1061
+ path = audio['path']
1062
+ key = audio['key']
1063
+ seed = audio['params']['seed']
1064
+ # ... process audio files
1065
+ ```
1066
+
1067
+ ### 7. Memory Management
1068
+
1069
+ For large batch sizes or long durations:
1070
+ - Monitor GPU memory usage
1071
+ - Reduce `batch_size` if OOM errors occur
1072
+ - Reduce `lm_batch_chunk_size` for LM operations
1073
+ - Consider using `offload_to_cpu=True` during initialization
1074
+
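The "reduce `batch_size` if OOM errors occur" advice can be automated with a simple retry loop; `run_generation` below is a stand-in for your own wrapper around `generate_music`, not a library function:

```python
# Sketch of the "halve batch_size on OOM" strategy described above.
# run_generation is a stand-in for your own wrapper around generate_music.
def generate_with_backoff(run_generation, batch_size: int, min_batch: int = 1):
    while True:
        try:
            return run_generation(batch_size)
        except RuntimeError as exc:
            # Only retry on out-of-memory errors, and only while we can shrink.
            if "out of memory" not in str(exc).lower() or batch_size <= min_batch:
                raise
            batch_size = max(min_batch, batch_size // 2)

# Demo with a fake generator that only fits batch_size <= 2:
def fake_run(bs):
    if bs > 2:
        raise RuntimeError("CUDA out of memory")
    return f"ok:{bs}"

print(generate_with_backoff(fake_run, 8))  # "ok:2" after two retries
```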
1075
+ ### 8. Accessing Time Costs
1076
+
1077
+ ```python
1078
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
1079
+
1080
+ if result.success:
1081
+ time_costs = result.extra_outputs.get("time_costs", {})
1082
+ print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
1083
+ print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
1084
+ print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
1085
+ print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")
1086
+ ```
1087
+
1088
+ ---
1089
+
1090
+ ## Troubleshooting
1091
+
1092
+ ### Common Issues
1093
+
1094
+ **Issue**: Out of memory errors
1095
+ - **Solution**: Reduce `batch_size`, `inference_steps`, or enable CPU offloading
1096
+
1097
+ **Issue**: Poor quality results
1098
+ - **Solution**: Increase `inference_steps`, adjust `guidance_scale`, use base model
1099
+
1100
+ **Issue**: Results don't match prompt
1101
+ - **Solution**: Make caption more specific, increase `guidance_scale`, enable LM refinement (`thinking=True`)
1102
+
1103
+ **Issue**: Slow generation
1104
+ - **Solution**: Use turbo model, reduce `inference_steps`, disable ADG
1105
+
1106
+ **Issue**: LM not generating codes
1107
+ - **Solution**: Verify `llm_handler` is initialized, check `thinking=True` and `use_cot_metas=True`
1108
+
1109
+ **Issue**: Seeds not being respected
1110
+ - **Solution**: Set `use_random_seed=False` in config and provide `seeds` list or `seed` in params
1111
+
1112
+ **Issue**: Custom timesteps not working
1113
+ - **Solution**: Ensure timesteps are a list of floats from 1.0 to 0.0, properly ordered
1114
+
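A quick up-front check catches malformed schedules before generation; `validate_timesteps` is an illustrative helper, not part of the `acestep` package:

```python
# Illustrative validator (not part of acestep): custom timesteps must be
# numbers within [0.0, 1.0], strictly decreasing from high noise down to 0.
def validate_timesteps(timesteps) -> bool:
    if not timesteps or not all(isinstance(t, (int, float)) for t in timesteps):
        return False
    if not all(0.0 <= t <= 1.0 for t in timesteps):
        return False
    return all(a > b for a, b in zip(timesteps, timesteps[1:]))

print(validate_timesteps([0.97, 0.76, 0.5, 0.28, 0.0]))  # True
print(validate_timesteps([0.5, 0.76, 0.0]))              # False (not decreasing)
```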
1115
+ ---
1116
+
1117
+ ## API Reference Summary
1118
+
1119
+ ### GenerationParams Fields
1120
+
1121
+ See [GenerationParams Parameters](#generationparams-parameters) for complete documentation.
1122
+
1123
+ ### GenerationConfig Fields
1124
+
1125
+ See [GenerationConfig Parameters](#generationconfig-parameters) for complete documentation.
1126
+
1127
+ ### GenerationResult Fields
1128
+
1129
+ ```python
1130
+ @dataclass
1131
+ class GenerationResult:
1132
+ # Audio Outputs
1133
+ audios: List[Dict[str, Any]]
1134
+ # Each audio dict contains:
1135
+ # - "path": str (file path)
1136
+ # - "tensor": Tensor (audio data)
1137
+ # - "key": str (unique identifier)
1138
+ # - "sample_rate": int (48000)
1139
+ # - "params": Dict (generation params with seed, audio_codes, etc.)
1140
+
1141
+ # Generation Information
1142
+ status_message: str
1143
+ extra_outputs: Dict[str, Any]
1144
+ # extra_outputs contains:
1145
+ # - "lm_metadata": Dict (LM-generated metadata)
1146
+ # - "time_costs": Dict (timing information)
1147
+ # - "latents": Tensor (intermediate latents, if available)
1148
+ # - "masks": Tensor (attention masks, if available)
1149
+
1150
+ # Success Status
1151
+ success: bool
1152
+ error: Optional[str]
1153
+ ```
1154
+
1155
+ ---
1156
+
1157
+ ## Version History
1158
+
1159
+ - **v1.5.2**: Current version
1160
+ - Added `shift` parameter for timestep shifting
1161
+ - Added `infer_method` parameter for ODE/SDE selection
1162
+ - Added `timesteps` parameter for custom timestep schedules
1163
+ - Added `understand_music()` function for audio analysis
1164
+ - Added `create_sample()` function for simple mode generation
1165
+ - Added `format_sample()` function for input enhancement
1166
+ - Added `UnderstandResult`, `CreateSampleResult`, `FormatSampleResult` dataclasses
1167
+
1168
+ - **v1.5.1**: Previous version
1169
+ - Split `GenerationConfig` into `GenerationParams` and `GenerationConfig`
1170
+ - Renamed parameters for consistency (`key_scale` → `keyscale`, `time_signature` → `timesignature`, `audio_duration` → `duration`, `use_llm_thinking` → `thinking`, `audio_code_string` → `audio_codes`)
1171
+ - Added `instrumental` parameter
1172
+ - Added `use_constrained_decoding` parameter
1173
+ - Added CoT auto-filled fields (`cot_*`)
1174
+ - Changed default `audio_format` to "flac"
1175
+ - Changed default `batch_size` to 2
1176
+ - Changed default `thinking` to True
1177
+ - Simplified `GenerationResult` structure with unified `audios` list
1178
+ - Added unified `time_costs` in `extra_outputs`
1179
+
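Code written against v1.5.0 can apply the renames listed above mechanically; this migration helper is illustrative and not shipped with the package:

```python
# Illustrative v1.5.0 -> v1.5.1 key migration for the renames listed above.
# Not shipped with acestep -- adapt to your own call sites.
V151_RENAMES = {
    "key_scale": "keyscale",
    "time_signature": "timesignature",
    "audio_duration": "duration",
    "use_llm_thinking": "thinking",
    "audio_code_string": "audio_codes",
}

def migrate_params(old: dict) -> dict:
    return {V151_RENAMES.get(k, k): v for k, v in old.items()}

print(migrate_params({"key_scale": "C Major", "audio_duration": 60}))
# {'keyscale': 'C Major', 'duration': 60}
```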
1180
+ - **v1.5**: Initial version
1181
+ - Introduced `GenerationConfig` and `GenerationResult` dataclasses
1182
+ - Simplified parameter passing
1183
+ - Added comprehensive documentation
1184
+
1185
+ ---
1186
+
1187
+ For more information, see:
1188
+ - Main README: [`../../README.md`](../../README.md)
1189
+ - REST API Documentation: [`API.md`](API.md)
1190
+ - Gradio Demo Guide: [`GRADIO_GUIDE.md`](GRADIO_GUIDE.md)
1191
+ - Project repository: [ACE-Step-1.5](https://github.com/yourusername/ACE-Step-1.5)
.claude/skills/acestep-docs/guides/SCRIPT_CONFIGURATION.md ADDED
@@ -0,0 +1,615 @@
1
+ # Launch Script Configuration Guide
2
+
3
+ This guide explains how to configure the startup scripts for ACE-Step across all supported platforms: Windows (.bat), Linux (.sh), and macOS (.sh).
4
+
5
+ > **Note for uv/Python users**: If you're using `uv run acestep` or running Python directly (not using launch scripts), configure settings via the `.env` file instead. See [ENVIRONMENT_SETUP.md](ENVIRONMENT_SETUP.md#environment-variables-env) for details.
6
+
7
+ ## How to Modify
8
+
9
+ All configurable options are variables at the top of each script. Open the script with any text editor and modify the values.
10
+
11
+ **Windows (.bat)**:
12
+ - Set a variable: `set VARIABLE=value`
13
+ - Comment out a line: `REM set VARIABLE=value`
14
+ - Uncomment a line: Remove the leading `REM`
15
+
16
+ **Linux/macOS (.sh)**:
17
+ - Set a variable: `VARIABLE="value"`
18
+ - Comment out a line: `# VARIABLE="value"`
19
+ - Uncomment a line: Remove the leading `#`
20
+
21
+ ---
22
+
23
+ ## Available Launch Scripts
24
+
25
+ | Platform | Script | Purpose |
26
+ |----------|--------|---------|
27
+ | Windows (NVIDIA) | `start_gradio_ui.bat` | Gradio Web UI |
28
+ | Windows (NVIDIA) | `start_api_server.bat` | REST API Server |
29
+ | Windows (AMD ROCm) | `start_gradio_ui_rocm.bat` | Gradio Web UI for AMD GPUs |
30
+ | Windows (AMD ROCm) | `start_api_server_rocm.bat` | REST API Server for AMD GPUs |
31
+ | Linux (CUDA) | `start_gradio_ui.sh` | Gradio Web UI |
32
+ | Linux (CUDA) | `start_api_server.sh` | REST API Server |
33
+ | macOS (Apple Silicon) | `start_gradio_ui_macos.sh` | Gradio Web UI (MLX backend) |
34
+ | macOS (Apple Silicon) | `start_api_server_macos.sh` | REST API Server (MLX backend) |
35
+
36
+ ---
37
+
38
+ ## Configuration Sections
39
+
40
+ ### 1. UI Language
41
+
42
+ Controls the language displayed in the Gradio Web UI.
43
+
44
+ **Options**: `en` (English), `zh` (Chinese), `he` (Hebrew), `ja` (Japanese)
45
+
46
+ **Windows (.bat)**:
47
+ ```batch
48
+ REM UI language: en, zh, he, ja
49
+ set LANGUAGE=en
50
+ ```
51
+
52
+ **Linux/macOS (.sh)**:
53
+ ```bash
54
+ # UI language: en, zh, he, ja
55
+ LANGUAGE="en"
56
+ ```
57
+
58
+ **Example -- switch to Chinese**:
59
+
60
+ | Platform | Setting |
61
+ |----------|---------|
62
+ | Windows | `set LANGUAGE=zh` |
63
+ | Linux/macOS | `LANGUAGE="zh"` |
64
+
65
+ > **Note**: The `LANGUAGE` variable is only available in Gradio UI scripts. API server scripts do not have a UI language setting.
66
+
67
+ ---
68
+
69
+ ### 2. Server Port
70
+
71
+ Controls which port the server listens on and which address it binds to.
72
+
73
+ **Gradio UI scripts**:
74
+
75
+ | Platform | Default Port | Default Address |
76
+ |----------|-------------|-----------------|
77
+ | Windows | `7860` | `127.0.0.1` |
78
+ | Linux | `7860` | `127.0.0.1` |
79
+ | macOS | `7860` | `127.0.0.1` |
80
+
81
+ **Windows (.bat)** -- Gradio UI:
82
+ ```batch
83
+ REM Server settings
84
+ set PORT=7860
85
+ set SERVER_NAME=127.0.0.1
86
+ REM set SERVER_NAME=0.0.0.0
87
+ REM set SHARE=--share
88
+ ```
89
+
90
+ **Linux/macOS (.sh)** -- Gradio UI:
91
+ ```bash
92
+ # Server settings
93
+ PORT=7860
94
+ SERVER_NAME="127.0.0.1"
95
+ # SERVER_NAME="0.0.0.0"
96
+ SHARE=""
97
+ # SHARE="--share"
98
+ ```
99
+
100
+ **API Server scripts**:
101
+
102
+ | Platform | Default Port | Default Host |
103
+ |----------|-------------|--------------|
104
+ | Windows | `8001` | `127.0.0.1` |
105
+ | Linux | `8001` | `127.0.0.1` |
106
+ | macOS | `8001` | `127.0.0.1` |
107
+
108
+ **Windows (.bat)** -- API Server:
109
+ ```batch
110
+ set HOST=127.0.0.1
111
+ set PORT=8001
112
+ ```
113
+
114
+ **Linux/macOS (.sh)** -- API Server:
115
+ ```bash
116
+ HOST="127.0.0.1"
117
+ PORT=8001
118
+ ```
119
+
120
+ **Default URLs**:
121
+ - Gradio UI: http://127.0.0.1:7860
122
+ - API Server: http://127.0.0.1:8001
123
+ - API Documentation: http://127.0.0.1:8001/docs
124
+
125
+ **To expose to the network** (allow access from other devices):
126
+ - Set `SERVER_NAME` or `HOST` to `0.0.0.0`
127
+ - Or enable `SHARE` for Gradio's public sharing link
128
+
129
+ ---
130
+
131
+ ### 3. Download Source
132
+
133
+ Controls where model files are downloaded from. Affects all scripts that download models.
134
+
135
+ **Windows (.bat)**:
136
+ ```batch
137
+ REM Download source: auto (default), huggingface, or modelscope
138
+ REM set DOWNLOAD_SOURCE=--download-source modelscope
139
+ REM set DOWNLOAD_SOURCE=--download-source huggingface
140
+ set DOWNLOAD_SOURCE=
141
+ ```
142
+
143
+ **Linux/macOS (.sh)**:
144
+ ```bash
145
+ # Download source: auto (default), huggingface, or modelscope
146
+ DOWNLOAD_SOURCE=""
147
+ # DOWNLOAD_SOURCE="--download-source modelscope"
148
+ # DOWNLOAD_SOURCE="--download-source huggingface"
149
+ ```
150
+
151
+ **Options**:
152
+
153
+ | Value | When to Use | Speed |
154
+ |-------|-------------|-------|
155
+ | (empty) or `auto` | Auto-detect network | Automatic |
156
+ | `modelscope` | China mainland users | Fast in China |
157
+ | `huggingface` | Overseas users | Fast outside China |
158
+
159
+ **How auto-detection works**:
160
+ 1. Tests Google connectivity
161
+ - Can access Google --> uses HuggingFace Hub
162
+ - Cannot access Google --> uses ModelScope
163
+ 2. If primary source fails, falls back to the alternate source
164
+
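The auto-detection steps above can be sketched in Python; this is an illustration of the described behavior, not the launcher's actual code:

```python
import socket

# Illustrative sketch of the auto-detection described above (not the actual
# launcher code). An explicit override always wins; otherwise a quick
# connectivity probe to Google chooses between the two hubs.
def pick_download_source(override: str = "auto") -> str:
    if override in ("huggingface", "modelscope"):
        return override
    try:
        socket.create_connection(("www.google.com", 443), timeout=3).close()
        return "huggingface"  # Google reachable -> HuggingFace Hub
    except OSError:
        return "modelscope"   # Google unreachable -> ModelScope

print(pick_download_source("modelscope"))  # override wins: "modelscope"
```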
165
+ **Examples**:
166
+
167
+ | Platform | China Users | Overseas Users |
168
+ |----------|-------------|----------------|
169
+ | Windows | `set DOWNLOAD_SOURCE=--download-source modelscope` | `set DOWNLOAD_SOURCE=--download-source huggingface` |
170
+ | Linux/macOS | `DOWNLOAD_SOURCE="--download-source modelscope"` | `DOWNLOAD_SOURCE="--download-source huggingface"` |
171
+
172
+ ---
173
+
174
+ ### 4. Update Check
175
+
176
+ Controls whether the script checks GitHub for updates before launching.
177
+
178
+ **Default**: `true` (enabled)
179
+
180
+ **Windows (.bat)**:
181
+ ```batch
182
+ REM Update check on startup (set to false to disable)
183
+ set CHECK_UPDATE=true
184
+ REM set CHECK_UPDATE=false
185
+ ```
186
+
187
+ **Linux/macOS (.sh)**:
188
+ ```bash
189
+ # Update check on startup (set to "false" to disable)
190
+ CHECK_UPDATE="true"
191
+ # CHECK_UPDATE="false"
192
+ ```
193
+
194
+ **Git detection by platform**:
195
+
196
+ | Platform | Git Resolution |
197
+ |----------|---------------|
198
+ | Windows | Tries `PortableGit\bin\git.exe` first, then falls back to system `git` (e.g., Git for Windows) |
199
+ | Linux | Uses system `git` |
200
+ | macOS | Uses system `git` (Xcode Command Line Tools or Homebrew) |
201
+
202
+ > **Important**: On Windows, PortableGit is no longer strictly required. If you have Git for Windows installed system-wide, the update check will find it automatically.
203
+
204
+ **Behavior when enabled**:
205
+ 1. Fetches the latest commits from GitHub with a 10-second timeout
206
+ 2. Compares local commit hash against remote
207
+ 3. If an update is available, shows new commits and prompts `Y/N`
208
+ 4. If the network is unreachable or the fetch times out, automatically skips and continues startup
209
+
210
+ **Timeout handling by platform**:
211
+ - Linux: Uses `timeout` command (10 seconds)
212
+ - macOS: Uses `gtimeout` (from coreutils) or `timeout` if available, otherwise runs without timeout
213
+ - Windows: Network-level timeout via `git fetch`
214
+
215
+ See [UPDATE_AND_BACKUP.md](UPDATE_AND_BACKUP.md) for full details on the update process and file backup.
216
+
217
+ ---
218
+
219
+ ### 5. Model Configuration
220
+
221
+ Controls which DiT model and Language Model (LM) are loaded.
222
+
223
+ **Windows (.bat)** -- Gradio UI:
224
+ ```batch
225
+ REM Model settings
226
+ set CONFIG_PATH=--config_path acestep-v15-turbo
227
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-0.6B
228
+ REM set OFFLOAD_TO_CPU=--offload_to_cpu true
229
+ ```
230
+
231
+ **Linux/macOS (.sh)** -- Gradio UI:
232
+ ```bash
233
+ # Model settings
234
+ CONFIG_PATH="--config_path acestep-v15-turbo"
235
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-0.6B"
236
+ # OFFLOAD_TO_CPU="--offload_to_cpu true"
237
+ OFFLOAD_TO_CPU=""
238
+ ```
239
+
240
+ **API Server** -- Windows (.bat):
241
+ ```batch
242
+ REM LM model path (optional, only used when LLM is enabled)
243
+ REM set LM_MODEL_PATH=--lm-model-path acestep-5Hz-lm-0.6B
244
+ ```
245
+
246
+ **API Server** -- Linux/macOS (.sh):
247
+ ```bash
248
+ # LM model path (optional, only used when LLM is enabled)
249
+ LM_MODEL_PATH=""
250
+ # LM_MODEL_PATH="--lm-model-path acestep-5Hz-lm-0.6B"
251
+ ```
252
+
253
+ > **Note**: The API server uses `--lm-model-path` (hyphens) while the Gradio UI uses `--lm_model_path` (underscores).
254
+
255
+ **Available DiT Models**:
256
+
257
+ | Model | Description |
258
+ |-------|-------------|
259
+ | `acestep-v15-turbo` | Default turbo model (8 steps, no CFG) |
260
+ | `acestep-v15-base` | Base model (50 steps, with CFG, high diversity) |
261
+ | `acestep-v15-sft` | SFT model (50 steps, with CFG, high quality) |
262
+ | `acestep-v15-turbo-shift1` | Turbo with shift1 |
263
+ | `acestep-v15-turbo-shift3` | Turbo with shift3 |
264
+ | `acestep-v15-turbo-continuous` | Turbo with continuous shift (1-5) |
265
+
266
+ **Available Language Models**:
267
+
268
+ | LM Model | Size | Quality |
269
+ |----------|------|---------|
270
+ | `acestep-5Hz-lm-0.6B` | 0.6B | Standard |
271
+ | `acestep-5Hz-lm-1.7B` | 1.7B | Better |
272
+ | `acestep-5Hz-lm-4B` | 4B | Best (requires more VRAM/RAM) |
273
+
274
+ **CPU Offload**: Enable `OFFLOAD_TO_CPU` when using larger models (especially 4B) on GPUs with limited VRAM. Models shuttle between CPU and GPU as needed, adding ~8-10s overhead per generation but preventing VRAM oversubscription.
275
+
276
+ ---
277
+
278
+ ### 6. LLM Initialization Control
279
+
280
+ Controls whether the Language Model (5Hz LM) is initialized at startup. By default, LLM is automatically enabled or disabled based on GPU VRAM:
281
+ - **<=6GB VRAM**: LLM disabled (DiT-only mode)
282
+ - **>6GB VRAM**: LLM enabled
283
+
284
+ **Processing Flow:**
285
+ ```
286
+ GPU Detection (full) --> ACESTEP_INIT_LLM / INIT_LLM Override --> Model Loading
287
+ ```
288
+
289
+ GPU optimizations (offload, quantization, batch limits) are **always applied** regardless of this setting. The override only controls whether to attempt LLM loading.
290
+
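The override precedence can be summarized as a small decision function; this is an illustration of the documented behavior, not the actual detection code (which also applies offload, quantization, and batch limits):

```python
# Illustrative decision logic for the VRAM-based default described above.
# Not the actual detection code -- GPU optimizations are applied separately.
def should_init_llm(vram_gb: float, override: str = "auto") -> bool:
    if override == "true":
        return True   # force enable (may OOM on low-VRAM GPUs)
    if override == "false":
        return False  # pure DiT mode
    return vram_gb > 6  # auto: LLM only when more than 6GB VRAM

print(should_init_llm(4))          # False (<=6GB -> DiT-only mode)
print(should_init_llm(4, "true"))  # True  (forced)
print(should_init_llm(12))         # True
```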
291
+ **Gradio UI** -- Windows (.bat):
292
+ ```batch
293
+ REM LLM initialization: auto (default), true, false
294
+ REM set INIT_LLM=--init_llm auto
295
+ REM set INIT_LLM=--init_llm true
296
+ REM set INIT_LLM=--init_llm false
297
+ ```
298
+
299
+ **Gradio UI** -- Linux/macOS (.sh):
300
+ ```bash
301
+ # LLM initialization: auto (default), true, false
302
+ INIT_LLM=""
303
+ # INIT_LLM="--init_llm auto"
304
+ # INIT_LLM="--init_llm true"
305
+ # INIT_LLM="--init_llm false"
306
+ ```
307
+
308
+ **API Server** -- Windows (.bat):
309
+ ```batch
310
+ REM Values: auto (default), true (force enable), false (force disable)
311
+ REM set ACESTEP_INIT_LLM=auto
312
+ REM set ACESTEP_INIT_LLM=true
313
+ REM set ACESTEP_INIT_LLM=false
314
+ ```
315
+
316
+ **API Server** -- Linux/macOS (.sh):
317
+ ```bash
318
+ # Values: auto (default), true (force enable), false (force disable)
319
+ # export ACESTEP_INIT_LLM=auto
320
+ # export ACESTEP_INIT_LLM=true
321
+ # export ACESTEP_INIT_LLM=false
322
+ ```
323
+
324
+ > **Note**: Gradio UI scripts use `--init_llm` as a command-line argument. API server scripts use the `ACESTEP_INIT_LLM` environment variable.
325
+
326
+ **When to use**:
327
+
328
+ | Setting | Use Case |
329
+ |---------|----------|
330
+ | `auto` (default) | Let GPU detection decide (recommended) |
331
+ | `true` | Force LLM on low VRAM GPU (GPU optimizations still applied, may cause OOM) |
332
+ | `false` | Pure DiT mode for faster generation, no LLM features |
333
+
334
+ **Features affected by LLM**:
335
+ - **Thinking mode**: LLM generates audio codes for better quality
336
+ - **Chain-of-Thought (CoT)**: Auto-enhance captions, detect language, generate metadata
337
+ - **Sample mode**: Generate random songs from descriptions
338
+ - **Format mode**: Enhance user input via LLM
339
+
340
+ When LLM is disabled, these features are automatically disabled, and generation uses pure DiT mode.
341
+
342
+ ---
343
+
344
+ ## Complete Configuration Examples
345
+
346
+ ### Chinese Users
347
+
348
+ **Windows (.bat)** -- `start_gradio_ui.bat`:
349
+ ```batch
350
+ REM UI language
351
+ set LANGUAGE=zh
352
+
353
+ REM Server port
354
+ set PORT=7860
355
+ set SERVER_NAME=127.0.0.1
356
+
357
+ REM Download source
358
+ set DOWNLOAD_SOURCE=--download-source modelscope
359
+
360
+ REM Update check
361
+ set CHECK_UPDATE=true
362
+
363
+ REM Model settings
364
+ set CONFIG_PATH=--config_path acestep-v15-turbo
365
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-0.6B
366
+ ```
367
+
368
+ **Linux (.sh)** -- `start_gradio_ui.sh`:
369
+ ```bash
370
+ # UI language
371
+ LANGUAGE="zh"
372
+
373
+ # Server port
374
+ PORT=7860
375
+ SERVER_NAME="127.0.0.1"
376
+
377
+ # Download source
378
+ DOWNLOAD_SOURCE="--download-source modelscope"
379
+
380
+ # Update check
381
+ CHECK_UPDATE="true"
382
+
383
+ # Model settings
384
+ CONFIG_PATH="--config_path acestep-v15-turbo"
385
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-0.6B"
386
+ ```
387
+
388
+ ---
389
+
390
+ ### Overseas Users
391
+
392
+ **Windows (.bat)** -- `start_gradio_ui.bat`:
393
+ ```batch
394
+ REM UI language
395
+ set LANGUAGE=en
396
+
397
+ REM Server port
398
+ set PORT=7860
399
+ set SERVER_NAME=127.0.0.1
400
+
401
+ REM Download source
402
+ set DOWNLOAD_SOURCE=--download-source huggingface
403
+
404
+ REM Update check
405
+ set CHECK_UPDATE=true
406
+
407
+ REM Model settings
408
+ set CONFIG_PATH=--config_path acestep-v15-turbo
409
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-1.7B
410
+ ```
411
+
412
+ **Linux (.sh)** -- `start_gradio_ui.sh`:
413
+ ```bash
414
+ # UI language
415
+ LANGUAGE="en"
416
+
417
+ # Server port
418
+ PORT=7860
419
+ SERVER_NAME="127.0.0.1"
420
+
421
+ # Download source
422
+ DOWNLOAD_SOURCE="--download-source huggingface"
423
+
424
+ # Update check
425
+ CHECK_UPDATE="true"
426
+
427
+ # Model settings
428
+ CONFIG_PATH="--config_path acestep-v15-turbo"
429
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-1.7B"
430
+ ```
431
+
432
+ ---
433
+
434
+ ### macOS Users (Apple Silicon / MLX)
435
+
436
+ **`start_gradio_ui_macos.sh`**:
437
+ ```bash
438
+ # MLX backend is set automatically by the script:
439
+ # export ACESTEP_LM_BACKEND="mlx"
440
+
441
+ # UI language
442
+ LANGUAGE="en"
443
+
444
+ # Server port
445
+ PORT=7860
446
+ SERVER_NAME="127.0.0.1"
447
+
448
+ # Download source (HuggingFace recommended outside China)
449
+ DOWNLOAD_SOURCE="--download-source huggingface"
450
+
451
+ # Update check
452
+ CHECK_UPDATE="true"
453
+
454
+ # Model settings
455
+ CONFIG_PATH="--config_path acestep-v15-turbo"
456
+ LM_MODEL_PATH="--lm_model_path acestep-5Hz-lm-0.6B"
457
+
458
+ # MLX backend (set automatically, do not change)
459
+ BACKEND="--backend mlx"
460
+
461
+ # CPU offload (enable for models larger than 0.6B on limited memory)
462
+ OFFLOAD_TO_CPU=""
463
+ # OFFLOAD_TO_CPU="--offload_to_cpu true"
464
+ ```
465
+
466
+ > **Note**: The macOS scripts automatically detect Apple Silicon (arm64). On Intel Macs, the MLX backend is unavailable and the script falls back to the PyTorch backend.
467
+
468
+ ---
469
+
470
+ ## ROCm Configuration
471
+
472
+ The `start_gradio_ui_rocm.bat` and `start_api_server_rocm.bat` scripts include additional settings specific to AMD GPUs running ROCm on Windows.
473
+
474
+ ### ROCm-Specific Variables
475
+
476
+ ```batch
477
+ REM ==================== ROCm Configuration ====================
478
+ REM Force PyTorch LM backend (bypasses nano-vllm flash_attn dependency)
479
+ set ACESTEP_LM_BACKEND=pt
480
+
481
+ REM RDNA3 GPU architecture override
482
+ set HSA_OVERRIDE_GFX_VERSION=11.0.0
483
+
484
+ REM Disable torch.compile Triton backend (not available on ROCm Windows)
485
+ set TORCH_COMPILE_BACKEND=eager
486
+
487
+ REM MIOpen: fast heuristic kernel selection instead of exhaustive benchmarking
488
+ set MIOPEN_FIND_MODE=FAST
489
+
490
+ REM HuggingFace tokenizer parallelism
491
+ set TOKENIZERS_PARALLELISM=false
492
+ ```
493
+
494
+ **Variable details**:
495
+
496
+ | Variable | Purpose | Common Values |
497
+ |----------|---------|---------------|
498
+ | `ACESTEP_LM_BACKEND` | Forces PyTorch backend instead of vLLM | `pt` (required for ROCm) |
499
+ | `HSA_OVERRIDE_GFX_VERSION` | Overrides GPU architecture for ROCm compatibility | `11.0.0` (gfx1100, RX 7900 XT/XTX), `11.0.1` (gfx1101, RX 7700/7800 XT), `11.0.2` (gfx1102, RX 7600) |
500
+ | `TORCH_COMPILE_BACKEND` | Sets the torch.compile backend | `eager` (required, Triton unavailable on ROCm Windows) |
501
+ | `MIOPEN_FIND_MODE` | Controls MIOpen kernel selection strategy | `FAST` (recommended; prevents first-run hangs on VAE decode) |
502
+ | `TOKENIZERS_PARALLELISM` | Controls HuggingFace tokenizer parallelism | `false` (suppresses warnings) |
503
+
504
+ **ROCm model settings**:
505
+
506
+ ```batch
507
+ REM Model settings (ROCm)
508
+ set CONFIG_PATH=--config_path acestep-v15-turbo
509
+ set LM_MODEL_PATH=--lm_model_path acestep-5Hz-lm-4B
510
+
511
+ REM CPU offload: required for 4B LM on GPUs with <=20GB VRAM
512
+ set OFFLOAD_TO_CPU=--offload_to_cpu true
513
+
514
+ REM LM backend: pt (PyTorch) recommended for ROCm
515
+ set BACKEND=--backend pt
516
+ ```
517
+
518
+ **ROCm virtual environment**:
519
+
520
+ The ROCm script uses a separate virtual environment (`venv_rocm`) instead of the standard `.venv` or `python_embeded`:
521
+ ```batch
522
+ set VENV_DIR=%~dp0venv_rocm
523
+ ```
524
+
525
+ > **Note**: The ROCm script requires a separate Python environment with ROCm-compatible PyTorch installed. See `requirements-rocm.txt` for setup instructions.
526
+
527
+ ---
528
+
529
+ ## Troubleshooting
530
+
531
+ ### Changes not taking effect
532
+
533
+ **Solution**: Save the file and restart the script. Changes only apply on the next launch.
534
+
535
+ Windows:
536
+ ```batch
537
+ REM Close current process (Ctrl+C), then run again
538
+ start_gradio_ui.bat
539
+ ```
540
+
541
+ Linux/macOS:
542
+ ```bash
543
+ # Close current process (Ctrl+C), then run again
544
+ ./start_gradio_ui.sh
545
+ ```
546
+
547
+ ### Model download is slow
548
+
549
+ **For Chinese users** -- set ModelScope:
550
+
551
+ | Platform | Setting |
552
+ |----------|---------|
553
+ | Windows | `set DOWNLOAD_SOURCE=--download-source modelscope` |
554
+ | Linux/macOS | `DOWNLOAD_SOURCE="--download-source modelscope"` |
555
+
556
+ **For overseas users** -- set HuggingFace:
557
+
558
+ | Platform | Setting |
559
+ |----------|---------|
560
+ | Windows | `set DOWNLOAD_SOURCE=--download-source huggingface` |
561
+ | Linux/macOS | `DOWNLOAD_SOURCE="--download-source huggingface"` |
562
+
563
+ ### Wrong language displayed
564
+
565
+ Verify the `LANGUAGE` variable in your Gradio UI script:
566
+
567
+ | Platform | Chinese | English |
568
+ |----------|---------|---------|
569
+ | Windows | `set LANGUAGE=zh` | `set LANGUAGE=en` |
570
+ | Linux/macOS | `LANGUAGE="zh"` | `LANGUAGE="en"` |
571
+
572
+ ### Port already in use
573
+
574
+ **Error**: `Address already in use`
575
+
576
+ **Solution 1**: Change the port number.
577
+
578
+ | Platform | Setting |
579
+ |----------|---------|
580
+ | Windows | `set PORT=7861` |
581
+ | Linux/macOS | `PORT=7861` |
582
+
583
+ **Solution 2**: Find and close the process using the port.
584
+
585
+ Windows:
586
+ ```batch
587
+ REM Find process using port 7860
588
+ netstat -ano | findstr :7860
589
+
590
+ REM Kill process (replace <PID> with the actual process ID)
591
+ taskkill /PID <PID> /F
592
+ ```
593
+
594
+ Linux/macOS:
595
+ ```bash
596
+ # Find process using port 7860
597
+ lsof -i :7860
598
+
599
+ # Kill process (replace <PID> with the actual process ID)
600
+ kill <PID>
601
+ ```
602
+
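A third option is to probe upward from 7860 until a free port is found. This is a sketch (bash-only, using the `/dev/tcp` pseudo-device instead of `lsof`); set the result as `PORT` in your launch script:

```bash
#!/usr/bin/env bash
# Probe ports upward from 7860 until one is free (bash /dev/tcp, no lsof needed).
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

PORT=7860
while port_in_use "$PORT"; do
  PORT=$((PORT + 1))
done
echo "Free port: $PORT"
```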
603
+ ---
604
+
605
+ ## Best Practices
606
+
607
+ 1. **Backup before editing**: Make a copy of the script before modifying it.
608
+ - Windows: `copy start_gradio_ui.bat start_gradio_ui.bat.backup`
609
+ - Linux/macOS: `cp start_gradio_ui.sh start_gradio_ui.sh.backup`
610
+
611
+ 2. **Use comments to document your changes**: Add a note explaining why you changed a value so you remember later.
612
+ - Windows: `REM Changed to port 8080 for testing`
613
+ - Linux/macOS: `# Changed to port 8080 for testing`
614
+
615
+ 3. **Test after changes**: Save the file, close any running instance, re-launch the script, and verify the changes took effect.
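Combining points 1 and 3, a timestamped backup keeps every edit generation around instead of overwriting a single `.backup` copy. A minimal sketch (function name is illustrative; shown for the Linux/macOS script, adjust the filename for `.bat`):

```bash
#!/usr/bin/env bash
# Keep a dated backup so successive edits never clobber each other.
backup_script() {
  local stamp
  stamp="$(date +%Y%m%d_%H%M%S)"
  cp "$1" "$1.backup_${stamp}" && echo "$1.backup_${stamp}"
}

# Usage: backup_script start_gradio_ui.sh   # prints the backup filename
```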
.claude/skills/acestep-docs/guides/UPDATE_AND_BACKUP.md ADDED
@@ -0,0 +1,496 @@
1
+ # Update and Backup Guide
2
+
3
+ ## Overview
4
+
5
+ All ACE-Step launch scripts check for updates on startup by default. The update check is a lightweight inline operation that runs before the application starts, ensuring you are always notified about new versions without any manual setup.
6
+
7
+ - **Default behavior**: Update checking is enabled (`CHECK_UPDATE=true`) in every launch script.
8
+ - **Platforms supported**: Windows, Linux, and macOS.
9
+ - **Graceful failures**: If git is not installed, the network is unreachable, or the project is not a git repository, the check is skipped (with a brief console note) and the application starts normally.
10
+ - **User control**: You can disable the check at any time by setting `CHECK_UPDATE=false`.
11
+
12
+ ---
13
+
14
+ ## Update Check Feature
15
+
16
+ ### How It Works
17
+
18
+ Each launch script contains a lightweight inline update check that runs before the main application starts. The check does not require any external update service -- it uses git directly to compare your local commit with the remote.
19
+
20
+ **Flow:**
21
+
22
+ ```text
23
+ Startup
24
+ |
25
+ v
26
+ CHECK_UPDATE=true? --No--> Skip, start app
27
+ |
28
+ Yes
29
+ v
30
+ Git available? --No--> Skip, start app
31
+ |
32
+ Yes
33
+ v
34
+ Valid git repo? --No--> Skip, start app
35
+ |
36
+ Yes
37
+ v
38
+ Fetch origin (10s timeout) --Timeout/Error--> Skip, start app
39
+ |
40
+ Success
41
+ v
42
+ Compare local HEAD vs origin HEAD
43
+ |
44
+ +-- Same commit --> "Already up to date", start app
45
+ |
46
+ +-- Different commit --> Show new commits, ask Y/N
47
+ |
48
+ +-- N --> Skip, start app
49
+ |
50
+ +-- Y --> Run check_update.bat / check_update.sh for full update
51
+ |
52
+ v
53
+ Start app
54
+ ```
55
+
56
+ At every failure point (no git, no network, not a repo), the check exits gracefully and the application starts without interruption.
57
+
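The graceful-skip behavior above can be sketched in a few lines of shell. This is an illustrative reimplementation, not the exact code in the launch scripts, and it omits the 10-second fetch timeout for brevity:

```bash
#!/usr/bin/env bash
# Sketch of the inline check: every failure path prints a note and returns 0,
# so the application still starts.
check_update() {
  [ "${CHECK_UPDATE:-true}" = "true" ] || return 0
  command -v git >/dev/null 2>&1 \
    || { echo "[Update] git not found, skipping."; return 0; }
  git rev-parse --is-inside-work-tree >/dev/null 2>&1 \
    || { echo "[Update] Not a git repository, skipping."; return 0; }
  git fetch origin --quiet 2>/dev/null \
    || { echo "[Update] Network unreachable, skipping."; return 0; }
  local local_head remote_head
  local_head="$(git rev-parse --short HEAD)"
  remote_head="$(git rev-parse --short '@{upstream}' 2>/dev/null)" || return 0
  if [ "$local_head" = "$remote_head" ]; then
    echo "[Update] Already up to date (${local_head})."
  else
    echo "[Update] Update available: ${local_head} -> ${remote_head}"
  fi
}
```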
58
+ ### Enabling and Disabling
59
+
60
+ The update check is controlled by the `CHECK_UPDATE` variable near the top of each launch script.
61
+
62
+ **Windows** (`start_gradio_ui.bat`, `start_api_server.bat`):
63
+
64
+ ```batch
65
+ REM Update check on startup (set to false to disable)
66
+ set CHECK_UPDATE=true
67
+ REM set CHECK_UPDATE=false
68
+ ```
69
+
70
+ **Linux / macOS** (`start_gradio_ui.sh`, `start_api_server.sh`, `start_gradio_ui_macos.sh`, `start_api_server_macos.sh`):
71
+
72
+ ```bash
73
+ # Update check on startup (set to "false" to disable)
74
+ CHECK_UPDATE="true"
75
+ # CHECK_UPDATE="false"
76
+ ```
77
+
78
+ To disable, change the active line to `false`. To re-enable, change it back to `true`.
79
+
80
+ ### Git Requirements by Platform
81
+
82
+ The inline update check requires git to be available. How you obtain git depends on your platform.
83
+
84
+ **Windows:**
85
+
86
+ - **Option A -- PortableGit** (no installation required): Download from <https://git-scm.com/download/win>, choose the portable version, and extract to a `PortableGit\` folder in the project root. The launch scripts look for `PortableGit\bin\git.exe` first.
87
+ - **Option B -- System git**: Install git through any standard method (Git for Windows installer, winget, scoop, etc.). The launch scripts fall back to system git if PortableGit is not found.
88
+
89
+ ```text
90
+ Project Root/
91
+ ├── PortableGit/ <-- Optional, checked first on Windows
92
+ │ └── bin/
93
+ │ └── git.exe
94
+ ├── start_gradio_ui.bat
95
+ ├── check_update.bat
96
+ └── ...
97
+ ```
98
+
99
+ **Linux:**
100
+
101
+ Install git through your distribution's package manager:
102
+
103
+ ```bash
104
+ # Ubuntu / Debian
105
+ sudo apt install git
106
+
107
+ # CentOS / RHEL / Fedora
108
+ sudo yum install git
109
+ # or
110
+ sudo dnf install git
111
+
112
+ # Arch Linux
113
+ sudo pacman -S git
114
+ ```
115
+
116
+ **macOS:**
117
+
118
+ Install git through Xcode command-line tools or Homebrew:
119
+
120
+ ```bash
121
+ # Xcode command-line tools (includes git)
122
+ xcode-select --install
123
+
124
+ # Or via Homebrew
125
+ brew install git
126
+ ```
127
+
128
+ ### Example Output
129
+
130
+ **Already up to date:**
131
+
132
+ ```text
133
+ [Update] Checking for updates...
134
+ [Update] Already up to date (abc1234).
135
+
136
+ Starting ACE-Step Gradio Web UI...
137
+ ```
138
+
139
+ **Update available:**
140
+
141
+ ```text
142
+ [Update] Checking for updates...
143
+
144
+ ========================================
145
+ Update available!
146
+ ========================================
147
+ Current: abc1234 -> Latest: def5678
148
+
149
+ Recent changes:
150
+ * def5678 Fix audio processing bug
151
+ * ccc3333 Add new model support
152
+
153
+ Update now before starting? (Y/N):
154
+ ```
155
+
156
+ If you choose **Y**, the script delegates to `check_update.bat` (Windows) or `check_update.sh` (Linux/macOS) for the full update process including backup handling. If you choose **N**, the update is skipped and the application starts with the current version.
157
+
158
+ **Network unreachable (auto-skip):**
159
+
160
+ ```text
161
+ [Update] Checking for updates...
162
+ [Update] Network unreachable, skipping.
163
+
164
+ Starting ACE-Step Gradio Web UI...
165
+ ```
166
+
167
+ ---
168
+
169
+ ## Manual Update
170
+
171
+ You can run the update check manually at any time, outside of the launch scripts.
172
+
173
+ **Windows:**
174
+
175
+ ```batch
176
+ check_update.bat
177
+ ```
178
+
179
+ **Linux / macOS:**
180
+
181
+ ```bash
182
+ ./check_update.sh
183
+ ```
184
+
185
+ The manual update scripts perform the same 4-step process:
186
+
187
+ 1. Detect git and verify the repository
188
+ 2. Fetch from origin with a 10-second timeout
189
+ 3. Compare local and remote commits
190
+ 4. If an update is available, prompt to apply it (with automatic backup of conflicting files)
191
+
192
+ ---
193
+
194
+ ## File Backup During Updates
195
+
196
+ ### Automatic Backup
197
+
198
+ When you choose to update and you have locally modified files that also changed on the remote, ACE-Step automatically creates a backup before applying the update.
199
+
200
+ **Supported file types** (any modified text file is backed up):
201
+
202
+ - Configuration files: `.bat`, `.sh`, `.yaml`, `.json`, `.ini`
203
+ - Python code: `.py`
204
+ - Documentation: `.md`, `.txt`
205
+
206
+ ### Backup Process
207
+
208
+ ```text
209
+ 1. Update detects locally modified files
210
+ that also changed on the remote
211
+ |
212
+ v
213
+ 2. Creates a timestamped backup directory
214
+ .update_backup_YYYYMMDD_HHMMSS/
215
+ |
216
+ v
217
+ 3. Copies conflicting files into the backup
218
+ (preserves directory structure)
219
+ |
220
+ v
221
+ 4. Resets working tree to the remote version
222
+ |
223
+ v
224
+ 5. Displays backup location and instructions
225
+ ```
226
+
227
+ ### Example
228
+
229
+ **Your local modifications:**
230
+
231
+ - `start_gradio_ui.bat` -- Changed language to Chinese
232
+ - `acestep/handler.py` -- Added debug logging
233
+ - `config.yaml` -- Changed model path
234
+
235
+ **Remote updates:**
236
+
237
+ - `start_gradio_ui.bat` -- Added new features
238
+ - `acestep/handler.py` -- Bug fixes
239
+ - `config.yaml` -- New parameters
240
+
241
+ **Backup created:**
242
+
243
+ ```text
244
+ .update_backup_20260205_143022/
245
+ ├── start_gradio_ui.bat (your version)
246
+ ├── config.yaml (your version)
247
+ └── acestep/
248
+ └── handler.py (your version)
249
+ ```
250
+
251
+ **Working tree after update:**
252
+
253
+ ```text
254
+ start_gradio_ui.bat (new version from GitHub)
255
+ config.yaml (new version from GitHub)
256
+ acestep/
257
+ └── handler.py (new version from GitHub)
258
+ ```
259
+
260
+ Your original files are preserved in the backup directory so you can merge your changes back in.
261
+
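The backup step itself amounts to copying each conflicting file into the timestamped directory while preserving its relative path. A minimal sketch (the function name is illustrative, and it takes the file list as arguments rather than computing it from git):

```bash
#!/usr/bin/env bash
# Copy each given file into a timestamped backup dir, preserving structure.
backup_files() {
  local backup_dir=".update_backup_$(date +%Y%m%d_%H%M%S)"
  local f
  for f in "$@"; do
    mkdir -p "${backup_dir}/$(dirname "$f")"
    cp "$f" "${backup_dir}/${f}"
  done
  echo "$backup_dir"
}

# Usage: backup_files start_gradio_ui.bat config.yaml acestep/handler.py
```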
262
+ ---
263
+
264
+ ## Merging Configurations
265
+
266
+ After an update that backed up your files, use the merge helper to compare and restore your settings.
267
+
268
+ ### Windows: merge_config.bat
269
+
270
+ ```batch
271
+ merge_config.bat
272
+ ```
273
+
274
+ When comparing files, this script opens two Notepad windows side by side -- one with the backup version and one with the current version -- so you can manually copy your settings across.
275
+
276
+ ### Linux / macOS: merge_config.sh
277
+
278
+ ```bash
279
+ ./merge_config.sh
280
+ ```
281
+
282
+ When comparing files, this script uses `colordiff` (if installed) or `diff` to display a unified diff in the terminal, showing exactly what changed between your backed-up version and the new version.
283
+
284
+ To install colordiff for colored output:
285
+
286
+ ```bash
287
+ # Ubuntu / Debian
288
+ sudo apt install colordiff
289
+
290
+ # macOS (Homebrew)
291
+ brew install colordiff
292
+
293
+ # Arch Linux
294
+ sudo pacman -S colordiff
295
+ ```
296
+
297
+ ### Menu Options (Both Platforms)
298
+
299
+ Both `merge_config.bat` and `merge_config.sh` present the same interactive menu:
300
+
301
+ ```text
302
+ ========================================
303
+ ACE-Step Backup Merge Helper
304
+ ========================================
305
+
306
+ 1. Compare backup with current files
307
+ 2. Restore a file from backup
308
+ 3. List all backed up files
309
+ 4. Delete old backups
310
+ 5. Exit
311
+ ```
312
+
313
+ | Option | Description |
314
+ |--------|-------------|
315
+ | **1. Compare** | Show differences between your backup and the current (updated) file. On Windows this opens two Notepad windows. On Linux/macOS this prints a unified diff to the terminal. |
316
+ | **2. Restore** | Copy a file from the backup back into the project, overwriting the updated version. Use this only if the new version causes problems. |
317
+ | **3. List** | Display all files stored in backup directories. |
318
+ | **4. Delete** | Permanently remove old backup directories. Only do this after you have finished merging. |
319
+
320
+ ### Merging Common Files
321
+
322
+ **Launch scripts** (`start_gradio_ui.bat`, `start_gradio_ui.sh`, etc.):
323
+
324
+ Look for your custom settings in the backup (language, port, download source, etc.) and copy them into the corresponding lines of the new version.
325
+
326
+ ```bash
327
+ # Example settings you may want to preserve:
328
+ LANGUAGE="zh"
329
+ PORT=8080
330
+ DOWNLOAD_SOURCE="--download-source modelscope"
331
+ ```
332
+
333
+ **Configuration files** (`config.yaml`, `.json`):
334
+
335
+ Compare the two structures: keep your custom values and add any new keys introduced by the updated version.
336
+
337
+ ```yaml
338
+ # Backup (your version)
339
+ model_path: "custom/path"
340
+ custom_setting: true
341
+
342
+ # Current (new version)
343
+ model_path: "default/path"
344
+ new_feature: enabled
345
+
346
+ # Merged result
347
+ model_path: "custom/path" # Keep your setting
348
+ custom_setting: true # Keep your setting
349
+ new_feature: enabled # Add new feature
350
+ ```
351
+
352
+ ---
353
+
354
+ ## Testing Update Functionality
355
+
356
+ Use the test scripts to verify that your git setup and update mechanism are working correctly before relying on them.
357
+
358
+ **Windows:**
359
+
360
+ ```batch
361
+ test_git_update.bat
362
+ ```
363
+
364
+ **Linux / macOS:**
365
+
366
+ ```bash
367
+ ./test_git_update.sh
368
+ ```
369
+
370
+ ### What the Tests Check
371
+
372
+ 1. **Git availability**: Verifies that git can be found (PortableGit or system git on Windows; system git on Linux/macOS).
373
+ 2. **Repository validity**: Confirms the project directory is a valid git repository.
374
+ 3. **Update script presence**: Checks that `check_update.bat` / `check_update.sh` exists.
375
+ 4. **Network connectivity**: Attempts an actual fetch from the remote (with timeout).
376
+
377
+ ### Example Test Output
378
+
379
+ ```text
380
+ ========================================
381
+ Test Git Update Check
382
+ ========================================
383
+
384
+ [Test 1] Checking Git...
385
+ [PASS] Git found
386
+ git version 2.43.0
387
+
388
+ [Test 2] Checking git repository...
389
+ [PASS] Valid git repository
390
+ Branch: main
391
+ Commit: a1b2c3d
392
+
393
+ [Test 3] Checking update script...
394
+ [PASS] check_update.sh found
395
+
396
+ [Test 4] Running update check...
397
+ [PASS] Update check completed successfully
398
+
399
+ [PASS] All tests completed
400
+ ```
401
+
402
+ ---
403
+
404
+ ## Troubleshooting
405
+
406
+ ### Git not found
407
+
408
+ The update check is silently skipped if git is not available. To enable it, install git for your platform:
409
+
410
+ | Platform | Install Command |
411
+ |----------|----------------|
412
+ | **Windows (PortableGit)** | Download from <https://git-scm.com/download/win> and extract to `PortableGit\` in the project root |
413
+ | **Windows (system)** | `winget install --id Git.Git -e` or use the Git for Windows installer |
414
+ | **Ubuntu / Debian** | `sudo apt install git` |
415
+ | **CentOS / RHEL** | `sudo yum install git` |
416
+ | **Arch Linux** | `sudo pacman -S git` |
417
+ | **macOS** | `xcode-select --install` or `brew install git` |
418
+
419
+ ### Network timeout
420
+
421
+ The fetch operation has a 10-second timeout. If it times out, the update check is skipped automatically and the application starts normally. This is expected behavior on slow or restricted networks.
422
+
423
+ On macOS, the timeout mechanism uses `gtimeout` from GNU coreutils if available, or falls back to a plain fetch without a timeout. To get proper timeout support:
424
+
425
+ ```bash
426
+ brew install coreutils
427
+ ```
428
+
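The fallback logic can be sketched as a small wrapper that prefers `gtimeout`, then GNU `timeout`, then runs the command bare. Illustrative only; the actual scripts may differ:

```bash
#!/usr/bin/env bash
# Run a command under a timeout if any timeout tool exists, else run it bare.
run_with_timeout() {
  local secs="$1"
  shift
  if command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$secs" "$@"
  elif command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    "$@"   # no timeout available, plain execution
  fi
}

# Usage: run_with_timeout 10 git fetch origin
```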
429
+ ### Proxy configuration
430
+
431
+ **Windows (`check_update.bat`):**
432
+
433
+ Create a `proxy_config.txt` file in the project root:
434
+
435
+ ```text
436
+ PROXY_ENABLED=1
437
+ PROXY_URL=http://127.0.0.1:7890
438
+ ```
439
+
440
+ Or configure interactively:
441
+
442
+ ```batch
443
+ check_update.bat proxy
444
+ ```
445
+
446
+ Common proxy formats:
447
+
448
+ | Type | Example |
449
+ |------|---------|
450
+ | HTTP proxy | `http://127.0.0.1:7890` |
451
+ | HTTPS proxy | `https://proxy.company.com:8080` |
452
+ | SOCKS5 proxy | `socks5://127.0.0.1:1080` |
453
+
454
+ To disable the proxy, set `PROXY_ENABLED=0` in `proxy_config.txt`.
455
+
456
+ **Linux / macOS:**
457
+
458
+ Set standard environment variables before running the script:
459
+
460
+ ```bash
461
+ export http_proxy="http://127.0.0.1:7890"
462
+ export https_proxy="http://127.0.0.1:7890"
463
+ ./check_update.sh
464
+ ```
465
+
466
+ Or add them to your shell profile (`~/.bashrc`, `~/.zshrc`) for persistence.
467
+
468
+ ### Merge conflicts
469
+
470
+ If the automatic update fails or produces unexpected results:
471
+
472
+ 1. Check for backup directories: look for `.update_backup_*` folders in the project root.
473
+ 2. Use the merge helper (`merge_config.bat` or `./merge_config.sh`) to compare and restore files.
474
+ 3. If needed, manually inspect the diff between your backup and the current files.
475
+
476
+ ### Lost configuration after update
477
+
478
+ 1. Find your backup:
479
+ - **Windows:** `dir /b .update_backup_*`
480
+ - **Linux / macOS:** `ls -d .update_backup_*`
481
+ 2. Use the merge helper (Option 2) to restore specific files, or manually copy settings from the backup.
482
+
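Manual restore is just a copy from the backup directory back into place. A small helper (hypothetical name; the directory in the usage line is the example timestamp from earlier) makes the two paths explicit:

```bash
#!/usr/bin/env bash
# Copy a file from a backup directory back into the project, overwriting it.
restore_from_backup() {
  cp "$1/$2" "$2"
}

# Usage: restore_from_backup .update_backup_20260205_143022 start_gradio_ui.sh
```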
483
+ ---
484
+
485
+ ## Quick Reference
486
+
487
+ | Action | Windows | Linux / macOS |
488
+ |--------|---------|---------------|
489
+ | **Enable update check** | `set CHECK_UPDATE=true` (in `.bat`) | `CHECK_UPDATE="true"` (in `.sh`) |
490
+ | **Disable update check** | `set CHECK_UPDATE=false` (in `.bat`) | `CHECK_UPDATE="false"` (in `.sh`) |
491
+ | **Manual update** | `check_update.bat` | `./check_update.sh` |
492
+ | **Configure proxy** | `check_update.bat proxy` or edit `proxy_config.txt` | `export http_proxy=... && ./check_update.sh` |
493
+ | **Merge configurations** | `merge_config.bat` | `./merge_config.sh` |
494
+ | **Test update setup** | `test_git_update.bat` | `./test_git_update.sh` |
495
+ | **List backups** | `dir /b .update_backup_*` | `ls -d .update_backup_*` |
496
+ | **Delete a backup** | `rmdir /s /q .update_backup_YYYYMMDD_HHMMSS` | `rm -rf .update_backup_YYYYMMDD_HHMMSS` |
.claude/skills/acestep-lyrics-transcription/SKILL.md ADDED
@@ -0,0 +1,173 @@
1
+ ---
2
+ name: acestep-lyrics-transcription
3
+ description: Transcribe audio to timestamped lyrics using OpenAI Whisper or ElevenLabs Scribe API. Outputs LRC, SRT, or JSON with word-level timestamps. Use when users want to transcribe songs, generate LRC files, or extract lyrics with timestamps from audio.
4
+ allowed-tools: Read, Write, Bash
5
+ ---
6
+
7
+ # Lyrics Transcription Skill
8
+
9
+ Transcribe audio files to timestamped lyrics (LRC/SRT/JSON) via OpenAI Whisper or ElevenLabs Scribe API.
10
+
11
+ ## API Key Setup Guide
12
+
13
+ **Before transcribing, you MUST check whether the user's API key is configured.** Run the following command to check:
14
+
15
+ ```bash
16
+ cd "{project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/" && bash ./scripts/acestep-lyrics-transcription.sh config --check-key
17
+ ```
18
+
19
+ This command only reports whether the active provider's API key is set or empty — it does NOT print the actual key value. **NEVER read or display the user's API key content.** Do not use `config --get` on key fields or read `config.json` directly. The `config --list` command is safe — it automatically masks API keys as `***` in output.
20
+
21
+ **If the command reports the key is empty**, you MUST stop and guide the user to configure it before proceeding. Do NOT attempt transcription without a valid key — it will fail.
22
+
23
+ Use `AskUserQuestion` to ask the user to provide their API key, with the following options and guidance:
24
+
25
+ 1. Tell the user which provider is currently active (openai or elevenlabs) and that its API key is not configured. Explain that transcription cannot proceed without it.
26
+ 2. Provide clear instructions on where to obtain a key:
27
+ - **OpenAI**: Get an API key at https://platform.openai.com/api-keys — requires an OpenAI account with billing enabled. The Whisper API costs ~$0.006/min.
28
+ - **ElevenLabs**: Get an API key at https://elevenlabs.io/app/settings/api-keys — requires an ElevenLabs account. Free tier includes limited credits.
29
+ 3. Also offer the option to switch to the other provider if they already have a key for it.
30
+ 4. Once the user provides the key, configure it using:
31
+ ```bash
32
+ cd "{project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/" && bash ./scripts/acestep-lyrics-transcription.sh config --set <provider>.api_key <KEY>
33
+ ```
34
+ 5. If the user wants to switch providers, also run:
35
+ ```bash
36
+ cd "{project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/" && bash ./scripts/acestep-lyrics-transcription.sh config --set provider <provider_name>
37
+ ```
38
+ 6. After configuring, re-run `config --check-key` to verify the key is set before proceeding.
39
+
40
+ **If the API key is already configured**, proceed directly to transcription without asking.
41
+
42
+ ## Quick Start
43
+
44
+ ```bash
45
+ # 1. cd to this skill's directory
46
+ cd {project_root}/{.claude or .codex}/skills/acestep-lyrics-transcription/
47
+
48
+ # 2. Configure API key (choose one)
49
+ ./scripts/acestep-lyrics-transcription.sh config --set openai.api_key sk-...
50
+ # or
51
+ ./scripts/acestep-lyrics-transcription.sh config --set elevenlabs.api_key ...
52
+ ./scripts/acestep-lyrics-transcription.sh config --set provider elevenlabs
53
+
54
+ # 3. Transcribe
55
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio /path/to/song.mp3 --language zh
56
+
57
+ # 4. Output saved to: {project_root}/acestep_output/<filename>.lrc
58
+ ```
59
+
60
+ ## Prerequisites
61
+
62
+ - curl, jq, python3 (or python)
63
+ - An API key for OpenAI or ElevenLabs
64
+
65
+ ## Script Usage
66
+
67
+ ```bash
68
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio <file> [options]
69
+
70
+ Options:
71
+ -a, --audio Audio file path (required)
72
+ -l, --language Language code (zh, en, ja, etc.)
73
+ -f, --format Output format: lrc, srt, json (default: lrc)
74
+ -p, --provider API provider: openai, elevenlabs (overrides config)
75
+ -o, --output Output file path (default: acestep_output/<filename>.lrc)
76
+ ```
77
+
78
+ ## Post-Transcription Lyrics Correction (MANDATORY)
79
+
80
+ **CRITICAL**: After transcription, you MUST manually correct the LRC file before using it for MV rendering. Transcription models frequently produce errors on sung lyrics:
81
+
82
+ - Proper nouns: "ACE-Step" → "AC step", "Spotify" → "spot a fly"
83
+ - Similar-sounding words: "arrives" → "eyes", "open source" → "open sores"
84
+ - Merged/split words: "lighting up" → "lightin' nup"
85
+
86
+ ### Correction Workflow
87
+
88
+ 1. **Read the transcribed LRC file** using the Read tool
89
+ 2. **Read the original lyrics** from the ACE-Step output JSON file
90
+ 3. **Use original lyrics as a whole reference**: Do NOT attempt line-by-line alignment — transcription often splits, merges, or reorders lines differently from the original. Instead, read the original lyrics in full to understand the correct wording, then scan each LRC line and fix any misrecognized words based on your knowledge of what the original lyrics say.
91
+ 4. **Fix transcription errors**: Replace misrecognized words with the correct original words, keeping the timestamps intact
92
+ 5. **Write the corrected LRC** back using the Write tool
93
+
94
+ ### What to Correct
95
+
96
+ - Replace misrecognized words with their correct original versions
97
+ - Keep all `[MM:SS.cc]` timestamps exactly as-is (timestamps from transcription are accurate)
98
+ - Do NOT add structure tags like `[Verse]` or `[Chorus]` — the LRC should only have timestamped text lines
99
+
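As a quick sanity check after editing, every line should start with a `[MM:SS.cc]` timestamp. The (assumed) grep-based helper below returns 0 only when that holds; leftover structure tags like `[Verse]` fail the same check:

```bash
#!/usr/bin/env bash
# Return 0 if every line starts with a [MM:SS.cc] timestamp.
# Structure tags like [Verse] or [Chorus] fail this check too.
lrc_ok() {
  ! grep -qvE '^\[[0-9]{2}:[0-9]{2}\.[0-9]{2}\]' "$1"
}

# Usage: lrc_ok corrected.lrc && echo "LRC looks clean"
```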
100
+ ### Example
101
+
102
+ **Transcribed (wrong):**
103
+ ```
104
+ [00:46.96]AC step alive,
105
+ [00:50.80]one point five eyes.
106
+ ```
107
+
108
+ **Original lyrics reference:**
109
+ ```
110
+ ACE-Step alive
111
+ One point five arrives
112
+ ```
113
+
114
+ **Corrected (right):**
115
+ ```
116
+ [00:46.96]ACE-Step alive,
117
+ [00:50.80]One point five arrives.
118
+ ```
119
+
120
+ ## Configuration
121
+
122
+ Config file: `scripts/config.json`
123
+
124
+ ```bash
125
+ # Switch provider
126
+ ./scripts/acestep-lyrics-transcription.sh config --set provider openai
127
+ ./scripts/acestep-lyrics-transcription.sh config --set provider elevenlabs
128
+
129
+ # Set API keys
130
+ ./scripts/acestep-lyrics-transcription.sh config --set openai.api_key sk-...
131
+ ./scripts/acestep-lyrics-transcription.sh config --set elevenlabs.api_key ...
132
+
133
+ # View config
134
+ ./scripts/acestep-lyrics-transcription.sh config --list
135
+ ```
136
+
137
+ | Option | Default | Description |
138
+ |--------|---------|-------------|
139
+ | `provider` | `openai` | Active provider: `openai` or `elevenlabs` |
140
+ | `output_format` | `lrc` | Default output: `lrc`, `srt`, or `json` |
141
+ | `openai.api_key` | `""` | OpenAI API key |
142
+ | `openai.api_url` | `https://api.openai.com/v1` | OpenAI API base URL |
143
+ | `openai.model` | `whisper-1` | OpenAI model (whisper-1 for word timestamps) |
144
+ | `elevenlabs.api_key` | `""` | ElevenLabs API key |
145
+ | `elevenlabs.api_url` | `https://api.elevenlabs.io/v1` | ElevenLabs API base URL |
146
+ | `elevenlabs.model` | `scribe_v2` | ElevenLabs model |
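Dotted keys like `openai.api_key` are resolved as nested JSON paths (the script hands them to jq as `.openai.api_key`). A minimal Python equivalent of the lookup, with a hypothetical `get_config` helper that falls back to an empty string like the script does:

```python
import json

def get_config(config, dotted_key):
    """Resolve a dotted key such as 'openai.api_key' against a nested
    config dict, returning '' when any path segment is missing."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return ""
        node = node[part]
    return node

config = json.loads('{"provider": "openai", "openai": {"model": "whisper-1"}}')
print(get_config(config, "openai.model"))      # -> whisper-1
print(get_config(config, "elevenlabs.model"))  # -> '' (missing keys fall back to empty)
```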
147
+
148
+ ## Provider Notes
149
+
150
+ | Provider | Model | Word Timestamps | Pricing |
151
+ |----------|-------|-----------------|---------|
152
+ | OpenAI | whisper-1 | Yes (segment + word) | $0.006/min |
153
+ | ElevenLabs | scribe_v2 | Yes (word-level) | Varies by plan |
154
+
155
+ - OpenAI `whisper-1` is the only OpenAI model supporting word-level timestamps
156
+ - ElevenLabs `scribe_v2` returns word-level timestamps with type filtering
157
+ - Both support multilingual transcription
158
+
159
+ ## Examples
160
+
161
+ ```bash
162
+ # Basic transcription (uses config defaults)
163
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3
164
+
165
+ # Chinese song to LRC
166
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3 --language zh
167
+
168
+ # Use ElevenLabs, output SRT
169
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3 --provider elevenlabs --format srt
170
+
171
+ # Custom output path
172
+ ./scripts/acestep-lyrics-transcription.sh transcribe --audio song.mp3 --output ./my_lyrics.lrc
173
+ ```
.claude/skills/acestep-lyrics-transcription/scripts/acestep-lyrics-transcription.sh ADDED
@@ -0,0 +1,584 @@
1
+ #!/bin/bash
2
+ #
3
+ # acestep-lyrics-transcription.sh - Transcribe audio to timestamped lyrics (LRC/SRT/JSON)
4
+ #
5
+ # Requirements: curl, jq
6
+ #
7
+ # Usage:
8
+ # ./acestep-lyrics-transcription.sh transcribe --audio <file> [options]
9
+ # ./acestep-lyrics-transcription.sh config [--get|--set|--reset]
10
+ #
11
+ # Output:
12
+ # - LRC/SRT/JSON files saved to output directory
13
+
14
+ set -e
15
+
16
+ export LANG="${LANG:-en_US.UTF-8}"
17
+ export LC_ALL="${LC_ALL:-en_US.UTF-8}"
18
+
19
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
20
+ CONFIG_FILE="${SCRIPT_DIR}/config.json"
21
+ OUTPUT_DIR="$(cd "${SCRIPT_DIR}/../../../.." && pwd)/acestep_output"
22
+
23
+ # Colors
24
+ RED='\033[0;31m'
25
+ GREEN='\033[0;32m'
26
+ YELLOW='\033[1;33m'
27
+ CYAN='\033[0;36m'
28
+ NC='\033[0m'
29
+
30
+ # Convert MSYS2/Cygwin paths to Windows-native paths for Python
31
+ to_python_path() {
32
+ if command -v cygpath &> /dev/null; then
33
+ cygpath -m "$1"
34
+ else
35
+ echo "$1"
36
+ fi
37
+ }
38
+
39
+ # Detect python executable (python3 or python)
40
+ PYTHON_CMD=""
41
+ find_python() {
42
+ if [ -n "$PYTHON_CMD" ]; then return; fi
43
+ # Test actual execution, not just existence (Windows Store python3 shim returns exit 49)
44
+ if python3 -c "pass" &> /dev/null; then
45
+ PYTHON_CMD="python3"
46
+ elif python -c "pass" &> /dev/null; then
47
+ PYTHON_CMD="python"
48
+ else
49
+ echo -e "${RED}Error: python3 or python is required but not found.${NC}"
50
+ exit 1
51
+ fi
52
+ }
53
+
54
+ # ─── Dependencies ───
55
+
56
+ check_deps() {
57
+ if ! command -v curl &> /dev/null; then
58
+ echo -e "${RED}Error: curl is required but not installed.${NC}"
59
+ exit 1
60
+ fi
61
+ if ! command -v jq &> /dev/null; then
62
+ echo -e "${RED}Error: jq is required but not installed.${NC}"
63
+ echo "Install: apt install jq / brew install jq / choco install jq"
64
+ exit 1
65
+ fi
66
+ }
67
+
68
+ # ─── Config ───
69
+
70
+ DEFAULT_CONFIG='{
71
+ "provider": "openai",
72
+ "output_format": "lrc",
73
+ "openai": {
74
+ "api_key": "",
75
+ "api_url": "https://api.openai.com/v1",
76
+ "model": "whisper-1"
77
+ },
78
+ "elevenlabs": {
79
+ "api_key": "",
80
+ "api_url": "https://api.elevenlabs.io/v1",
81
+ "model": "scribe_v2"
82
+ }
83
+ }'
84
+
85
+ ensure_config() {
86
+ if [ ! -f "$CONFIG_FILE" ]; then
87
+ local example="${SCRIPT_DIR}/config.example.json"
88
+ if [ -f "$example" ]; then
89
+ cp "$example" "$CONFIG_FILE"
90
+ echo -e "${YELLOW}Config file created from config.example.json. Please configure your settings:${NC}"
91
+ echo -e " ${CYAN}./scripts/acestep-lyrics-transcription.sh config --set provider <openai|elevenlabs>${NC}"
92
+ echo -e " ${CYAN}./scripts/acestep-lyrics-transcription.sh config --set <provider>.api_key <key>${NC}"
93
+ else
94
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
95
+ fi
96
+ fi
97
+ }
98
+
99
+ get_config() {
100
+ local key="$1"
101
+ ensure_config
102
+ local jq_path=".${key}"
103
+ local value
104
+ value=$(jq -r "$jq_path" "$CONFIG_FILE" 2>/dev/null)
105
+ if [ "$value" = "null" ]; then
106
+ echo ""
107
+ else
108
+ echo "$value" | tr -d '\r\n'
109
+ fi
110
+ }
111
+
112
+ set_config() {
113
+ local key="$1"
114
+ local value="$2"
115
+ ensure_config
116
+ local tmp_file="${CONFIG_FILE}.tmp"
117
+ local jq_path=".${key}"
118
+
119
+ if [ "$value" = "true" ] || [ "$value" = "false" ]; then
120
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
121
+ elif [[ "$value" =~ ^-?[0-9]+$ ]] || [[ "$value" =~ ^-?[0-9]+\.[0-9]+$ ]]; then
122
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
123
+ else
124
+ jq "$jq_path = \"$value\"" "$CONFIG_FILE" > "$tmp_file"
125
+ fi
126
+
127
+ mv "$tmp_file" "$CONFIG_FILE"
128
+ echo "Set $key = $value"
129
+ }
130
+
131
+ ensure_output_dir() {
132
+ mkdir -p "$OUTPUT_DIR"
133
+ }
134
+
135
+ # ─── Format Conversion ───
136
+
137
+ # Convert word-level timestamps to LRC format
138
+ # Input: JSON array of {word, start, end} on stdin
139
+ # Output: LRC text
140
+ words_to_lrc() {
141
+ local json_file="$(to_python_path "$1")"
142
+ local output_file="$(to_python_path "$2")"
143
+ local line_gap="${3:-1.5}"
144
+ find_python
145
+
146
+ $PYTHON_CMD -c "
147
+ import json, sys
148
+
149
+ def is_cjk(ch):
150
+ cp = ord(ch)
151
+ return (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or
152
+ 0x20000 <= cp <= 0x2A6DF or 0x2A700 <= cp <= 0x2B73F or
153
+ 0x2B740 <= cp <= 0x2B81F or 0x2B820 <= cp <= 0x2CEAF or
154
+ 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F or
155
+ 0x3000 <= cp <= 0x303F or 0x3040 <= cp <= 0x309F or
156
+ 0x30A0 <= cp <= 0x30FF or 0xFF00 <= cp <= 0xFFEF)
157
+
158
+ def smart_join(word_list):
159
+ if not word_list:
160
+ return ''
161
+ result = word_list[0]
162
+ for j in range(1, len(word_list)):
163
+ prev_w = word_list[j-1]
164
+ curr_w = word_list[j]
165
+ prev_last = prev_w[-1] if prev_w else ''
166
+ curr_first = curr_w[0] if curr_w else ''
167
+ if is_cjk(prev_last) or is_cjk(curr_first):
168
+ result += curr_w
169
+ else:
170
+ result += ' ' + curr_w
171
+ return result.strip()
172
+
173
+ with open('$json_file', 'r', encoding='utf-8') as f:
174
+ words = json.load(f)
175
+
176
+ if not words:
177
+ sys.exit(0)
178
+
179
+ lines = []
180
+ current_line = []
181
+ current_start = words[0]['start']
182
+
183
+ for i, w in enumerate(words):
184
+ current_line.append(w['word'])
185
+ is_last = (i == len(words) - 1)
186
+ has_punct = w['word'].rstrip().endswith(('.', '!', '?', '。', '!', '?', ',', ','))
187
+ has_gap = (not is_last and words[i+1]['start'] - w['end'] > $line_gap)
188
+
189
+ if is_last or has_punct or has_gap:
190
+ text = smart_join(current_line)
191
+ text = text.rstrip(',。,.')
192
+ if text:
193
+ mins = int(current_start) // 60
194
+ secs = current_start - mins * 60
195
+ lines.append(f'[{mins:02d}:{secs:05.2f}]{text}')
196
+ current_line = []
197
+ if not is_last:
198
+ current_start = words[i+1]['start']
199
+
200
+ with open('$output_file', 'w', encoding='utf-8') as f:
201
+ for line in lines:
202
+ f.write(line + '\n')
203
+ "
204
+ }
205
+
206
+ # Convert word-level timestamps to SRT format
207
+ words_to_srt() {
208
+ local json_file="$(to_python_path "$1")"
209
+ local output_file="$(to_python_path "$2")"
210
+ local line_gap="${3:-1.5}"
211
+ find_python
212
+
213
+ $PYTHON_CMD -c "
214
+ import json, sys
215
+
216
+ def is_cjk(ch):
217
+ cp = ord(ch)
218
+ return (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or
219
+ 0x20000 <= cp <= 0x2A6DF or 0x2A700 <= cp <= 0x2B73F or
220
+ 0x2B740 <= cp <= 0x2B81F or 0x2B820 <= cp <= 0x2CEAF or
221
+ 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F or
222
+ 0x3000 <= cp <= 0x303F or 0x3040 <= cp <= 0x309F or
223
+ 0x30A0 <= cp <= 0x30FF or 0xFF00 <= cp <= 0xFFEF)
224
+
225
+ def smart_join(word_list):
226
+ if not word_list:
227
+ return ''
228
+ result = word_list[0]
229
+ for j in range(1, len(word_list)):
230
+ prev_w = word_list[j-1]
231
+ curr_w = word_list[j]
232
+ prev_last = prev_w[-1] if prev_w else ''
233
+ curr_first = curr_w[0] if curr_w else ''
234
+ if is_cjk(prev_last) or is_cjk(curr_first):
235
+ result += curr_w
236
+ else:
237
+ result += ' ' + curr_w
238
+ return result.strip()
239
+
240
+ with open('$json_file', 'r', encoding='utf-8') as f:
241
+ words = json.load(f)
242
+
243
+ if not words:
244
+ sys.exit(0)
245
+
246
+ def fmt(t):
247
+ h = int(t) // 3600
248
+ m = (int(t) % 3600) // 60
249
+ s = t - h*3600 - m*60
250
+ return f'{h:02d}:{m:02d}:{s:06.3f}'.replace('.', ',')
251
+
252
+ lines = []
253
+ current_line = []
254
+ current_start = words[0]['start']
255
+ current_end = words[0]['end']
256
+
257
+ for i, w in enumerate(words):
258
+ current_line.append(w['word'])
259
+ current_end = w['end']
260
+ is_last = (i == len(words) - 1)
261
+ has_punct = w['word'].rstrip().endswith(('.', '!', '?', '。', '!', '?', ',', ','))
262
+ has_gap = (not is_last and words[i+1]['start'] - w['end'] > $line_gap)
263
+
264
+ if is_last or has_punct or has_gap:
265
+ text = smart_join(current_line)
266
+ text = text.rstrip(',。,.')
267
+ if text:
268
+ lines.append((current_start, current_end, text))
269
+ current_line = []
270
+ if not is_last:
271
+ current_start = words[i+1]['start']
272
+
273
+ with open('$output_file', 'w', encoding='utf-8') as f:
274
+ for idx, (s, e, text) in enumerate(lines, 1):
275
+ f.write(f'{idx}\n')
276
+ f.write(f'{fmt(s)} --> {fmt(e)}\n')
277
+ f.write(f'{text}\n')
278
+ f.write('\n')
279
+ "
280
+ }
281
+
282
+ # ─── OpenAI Whisper ───
283
+
284
+ transcribe_openai() {
285
+ local audio_file="$1"
286
+ local language="$2"
287
+ local words_file="$3"
288
+
289
+ local api_key=$(get_config "openai.api_key")
290
+ local api_url=$(get_config "openai.api_url")
291
+ local model=$(get_config "openai.model")
292
+
293
+ [ -z "$api_key" ] && { echo -e "${RED}Error: OpenAI API key not configured.${NC}"; echo "Run: ./acestep-lyrics-transcription.sh config --set openai.api_key YOUR_KEY"; exit 1; }
294
+ [ -z "$api_url" ] && api_url="https://api.openai.com/v1"
295
+ [ -z "$model" ] && model="whisper-1"
296
+
297
+ echo -e " Provider: OpenAI (${model})"
298
+
299
+ local resp_file=$(mktemp)
300
+
301
+ # Build curl command
302
+ local curl_args=(
303
+ -s -w "%{http_code}"
304
+ -o "$resp_file"
305
+ -X POST "${api_url}/audio/transcriptions"
306
+ -H "Authorization: Bearer ${api_key}"
307
+ -F "file=@${audio_file}"
308
+ -F "model=${model}"
309
+ -F "response_format=verbose_json"
310
+ -F "timestamp_granularities[]=word"
311
+ -F "timestamp_granularities[]=segment"
312
+ )
313
+
314
+ [ -n "$language" ] && curl_args+=(-F "language=${language}")
315
+
316
+ local http_code
317
+ http_code=$(curl "${curl_args[@]}")
318
+
319
+ if [ "$http_code" != "200" ]; then
320
+ local err
321
+ err=$(jq -r '.error.message // .detail // "Unknown error"' "$resp_file" 2>/dev/null)
322
+ echo -e "${RED}Error: HTTP $http_code - $err${NC}"
323
+ rm -f "$resp_file"
324
+ return 1
325
+ fi
326
+
327
+ # Extract word-level timestamps into normalized format [{word, start, end}]
328
+ jq '[.words[] | {word: .word, start: .start, end: .end}]' "$resp_file" > "$words_file" 2>/dev/null
329
+
330
+ # Show transcription text
331
+ local text
332
+ text=$(jq -r '.text // empty' "$resp_file" 2>/dev/null)
333
+ echo -e " ${GREEN}Transcription complete${NC}"
334
+ echo ""
335
+ echo "$text"
336
+
337
+ rm -f "$resp_file"
338
+ }
339
+
340
+ # ─── ElevenLabs Scribe ───
341
+
342
+ transcribe_elevenlabs() {
343
+ local audio_file="$1"
344
+ local language="$2"
345
+ local words_file="$3"
346
+
347
+ local api_key=$(get_config "elevenlabs.api_key")
348
+ local api_url=$(get_config "elevenlabs.api_url")
349
+ local model=$(get_config "elevenlabs.model")
350
+
351
+ [ -z "$api_key" ] && { echo -e "${RED}Error: ElevenLabs API key not configured.${NC}"; echo "Run: ./acestep-lyrics-transcription.sh config --set elevenlabs.api_key YOUR_KEY"; exit 1; }
352
+ [ -z "$api_url" ] && api_url="https://api.elevenlabs.io/v1"
353
+ [ -z "$model" ] && model="scribe_v2"
354
+
355
+ echo -e " Provider: ElevenLabs (${model})"
356
+
357
+ local resp_file=$(mktemp)
358
+
359
+ local curl_args=(
360
+ -s -w "%{http_code}"
361
+ -o "$resp_file"
362
+ -X POST "${api_url}/speech-to-text"
363
+ -H "xi-api-key: ${api_key}"
364
+ -F "file=@${audio_file}"
365
+ -F "model_id=${model}"
366
+ )
367
+
368
+ [ -n "$language" ] && curl_args+=(-F "language_code=${language}")
369
+
370
+ local http_code
371
+ http_code=$(curl "${curl_args[@]}")
372
+
373
+ if [ "$http_code" != "200" ]; then
374
+ local err
375
+ err=$(jq -r '.detail.message // .detail // "Unknown error"' "$resp_file" 2>/dev/null)
376
+ echo -e "${RED}Error: HTTP $http_code - $err${NC}"
377
+ rm -f "$resp_file"
378
+ return 1
379
+ fi
380
+
381
+ # ElevenLabs response: { text, words: [{text, start, end, type}...] }
382
+ # Normalize to [{word, start, end}], timestamps already in seconds, filter only "word" type
383
+ jq '[.words[] | select(.type == "word") | {word: .text, start: .start, end: .end}]' "$resp_file" > "$words_file" 2>/dev/null
384
+
385
+ local text
386
+ text=$(jq -r '.text // empty' "$resp_file" 2>/dev/null)
387
+ echo -e " ${GREEN}Transcription complete${NC}"
388
+ echo ""
389
+ echo "$text"
390
+
391
+ rm -f "$resp_file"
392
+ }
393
+
394
+ # ─── Commands ───
395
+
396
+ cmd_transcribe() {
397
+ check_deps
398
+ ensure_config
399
+
400
+ local audio="" language="" output="" format="" provider=""
401
+
402
+ while [[ $# -gt 0 ]]; do
403
+ case $1 in
404
+ --audio|-a) audio="$2"; shift 2 ;;
405
+ --language|-l) language="$2"; shift 2 ;;
406
+ --output|-o) output="$2"; shift 2 ;;
407
+ --format|-f) format="$2"; shift 2 ;;
408
+ --provider|-p) provider="$2"; shift 2 ;;
409
+ *) [ -z "$audio" ] && audio="$1"; shift ;;
410
+ esac
411
+ done
412
+
413
+ [ -z "$audio" ] && { echo -e "${RED}Error: --audio is required${NC}"; echo "Usage: $0 transcribe --audio <file> [options]"; exit 1; }
414
+ [ ! -f "$audio" ] && { echo -e "${RED}Error: audio file not found: $audio${NC}"; exit 1; }
415
+
416
+ # Resolve absolute path
417
+ audio="$(cd "$(dirname "$audio")" && pwd)/$(basename "$audio")"
418
+
419
+ [ -z "$provider" ] && provider=$(get_config "provider")
420
+ [ -z "$provider" ] && provider="openai"
421
+
422
+ [ -z "$format" ] && format=$(get_config "output_format")
423
+ [ -z "$format" ] && format="lrc"
424
+
425
+ # Default output path
426
+ if [ -z "$output" ]; then
427
+ ensure_output_dir
428
+ local basename="$(basename "${audio%.*}")"
429
+ output="${OUTPUT_DIR}/${basename}.${format}"
430
+ fi
431
+
432
+ echo "Transcribing..."
433
+ echo " Audio: $(basename "$audio")"
434
+ echo " Format: $format"
435
+
436
+ # Transcribe to normalized word timestamps
437
+ local words_file=$(mktemp)
438
+
439
+ case "$provider" in
440
+ openai) transcribe_openai "$audio" "$language" "$words_file" ;;
441
+ elevenlabs) transcribe_elevenlabs "$audio" "$language" "$words_file" ;;
442
+ *) echo -e "${RED}Error: unknown provider: $provider${NC}"; echo "Supported: openai, elevenlabs"; rm -f "$words_file"; exit 1 ;;
443
+ esac
444
+
445
+ # Check if we got words
446
+ local word_count
447
+ word_count=$(jq 'length' "$words_file" 2>/dev/null)
448
+ if [ -z "$word_count" ] || [ "$word_count" = "0" ]; then
449
+ echo -e "${YELLOW}Warning: no word-level timestamps returned${NC}"
450
+ rm -f "$words_file"
451
+ return 1
452
+ fi
453
+
454
+ echo ""
455
+ echo " Words detected: $word_count"
456
+
457
+ # Convert to output format
458
+ mkdir -p "$(dirname "$output")"
459
+
460
+ case "$format" in
461
+ lrc)
462
+ words_to_lrc "$words_file" "$output"
463
+ ;;
464
+ srt)
465
+ words_to_srt "$words_file" "$output"
466
+ ;;
467
+ json)
468
+ cp "$words_file" "$output"
469
+ ;;
470
+ *)
471
+ echo -e "${RED}Error: unknown format: $format (supported: lrc, srt, json)${NC}"
472
+ rm -f "$words_file"
473
+ exit 1
474
+ ;;
475
+ esac
476
+
477
+ rm -f "$words_file"
478
+
479
+ echo -e " ${GREEN}Saved: $output${NC}"
480
+ echo ""
481
+ echo -e "${GREEN}Done!${NC}"
482
+ }
483
+
484
+ cmd_config() {
485
+ check_deps
486
+ ensure_config
487
+
488
+ local action="" key="" value=""
489
+
490
+ while [[ $# -gt 0 ]]; do
491
+ case $1 in
492
+ --get) action="get"; key="$2"; shift 2 ;;
493
+ --set) action="set"; key="$2"; value="$3"; shift 3 ;;
494
+ --reset) action="reset"; shift ;;
495
+ --list) action="list"; shift ;;
496
+ --check-key) action="check-key"; shift ;;
497
+ *) shift ;;
498
+ esac
499
+ done
500
+
501
+ case "$action" in
502
+ "check-key")
503
+ local provider=$(get_config "provider")
504
+ [ -z "$provider" ] && provider="openai"
505
+ local api_key=$(get_config "${provider}.api_key")
506
+ echo "provider: $provider"
507
+ if [ -n "$api_key" ]; then
508
+ echo "api_key: configured"
509
+ else
510
+ echo "api_key: empty"
511
+ fi
512
+ ;;
513
+ "get")
514
+ [ -z "$key" ] && { echo -e "${RED}Error: --get requires KEY${NC}"; exit 1; }
515
+ local result=$(get_config "$key")
516
+ [ -n "$result" ] && echo "$key = $result" || echo "Key not found: $key"
517
+ ;;
518
+ "set")
519
+ [ -z "$key" ] || [ -z "$value" ] && { echo -e "${RED}Error: --set requires KEY VALUE${NC}"; exit 1; }
520
+ set_config "$key" "$value"
521
+ ;;
522
+ "reset")
523
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
524
+ echo -e "${GREEN}Configuration reset to defaults.${NC}"
525
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
526
+ ;;
527
+ "list")
528
+ echo "Current configuration:"
529
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
530
+ ;;
531
+ *)
532
+ echo "Config file: $CONFIG_FILE"
533
+ echo "----------------------------------------"
534
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
535
+ echo ""
536
+ echo "----------------------------------------"
537
+ echo ""
538
+ echo "Usage:"
539
+ echo " config --list Show config"
540
+ echo " config --get <key> Get value"
541
+ echo " config --set <key> <val> Set value"
542
+ echo " config --reset Reset to defaults"
543
+ echo ""
544
+ echo "Examples:"
545
+ echo " config --set provider elevenlabs"
546
+ echo " config --set openai.api_key sk-..."
547
+ echo " config --set elevenlabs.api_key ..."
548
+ echo " config --set output_format srt"
549
+ ;;
550
+ esac
551
+ }
552
+
553
+ show_help() {
554
+ echo "Lyrics Transcription CLI"
555
+ echo ""
556
+ echo "Requirements: curl, jq, python3"
557
+ echo ""
558
+ echo "Usage: $0 <command> [options]"
559
+ echo ""
560
+ echo "Commands:"
561
+ echo " transcribe Transcribe audio to timestamped lyrics"
562
+ echo " config Manage configuration"
563
+ echo ""
564
+ echo "Transcribe Options:"
565
+ echo " -a, --audio Audio file path (required)"
566
+ echo " -l, --language Language code (e.g. zh, en, ja)"
567
+ echo " -f, --format Output format: lrc, srt, json (default: lrc)"
568
+ echo " -p, --provider API provider: openai, elevenlabs"
569
+ echo " -o, --output Output file path"
570
+ echo ""
571
+ echo "Examples:"
572
+ echo " $0 transcribe --audio song.mp3"
573
+ echo " $0 transcribe --audio song.mp3 --language zh --format lrc"
574
+ echo " $0 config --set provider openai"
575
+ }
576
+
577
+ # ─── Main ───
578
+
579
+ case "$1" in
580
+ transcribe) shift; cmd_transcribe "$@" ;;
581
+ config) shift; cmd_config "$@" ;;
582
+ help|--help|-h) show_help ;;
583
+ *) show_help; exit 1 ;;
584
+ esac
.claude/skills/acestep-lyrics-transcription/scripts/config.example.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "provider": "elevenlabs",
3
+ "output_format": "lrc",
4
+ "openai": {
5
+ "api_key": "",
6
+ "api_url": "https://api.openai.com/v1",
7
+ "model": "whisper-1"
8
+ },
9
+ "elevenlabs": {
10
+ "api_key": "",
11
+ "api_url": "https://api.elevenlabs.io/v1",
12
+ "model": "scribe_v2"
13
+ }
14
+ }
.claude/skills/acestep-simplemv/SKILL.md ADDED
@@ -0,0 +1,133 @@
1
+ ---
2
+ name: acestep-simplemv
3
+ description: Render music videos from audio files and lyrics using Remotion. Accepts audio + LRC/JSON lyrics + title to produce MP4 videos with waveform visualization and synced lyrics display. Use when users mention MV generation, music video rendering, creating video from audio/lyrics, or visualizing songs.
4
+ ---
5
+
6
+ # MV Render
7
+
8
+ Render music videos with waveform visualization and synced lyrics from audio + lyrics input.
9
+
10
+ ## Prerequisites
11
+
12
+ - Remotion project at `scripts/` directory within this skill
13
+ - Node.js + npm dependencies installed
14
+ - ffprobe available (for audio duration detection)
15
+
16
+ ### First-Time Setup
17
+
18
+ Before first use, check and install dependencies:
19
+
20
+ ```bash
21
+ # 1. Check Node.js
22
+ node --version
23
+
24
+ # 2. Install npm dependencies
25
+ cd {project_root}/{.claude or .codex}/skills/acestep-simplemv/scripts && npm install
26
+
27
+ # 3. Check ffprobe
28
+ ffprobe -version
29
+ ```
30
+
31
+ If ffprobe is not available, install ffmpeg (which includes ffprobe):
32
+ - **Windows**: `choco install ffmpeg` or download from https://ffmpeg.org/download.html and add to PATH
33
+ - **macOS**: `brew install ffmpeg`
34
+ - **Linux**: `sudo apt-get install ffmpeg` (Debian/Ubuntu) or `sudo dnf install ffmpeg` (Fedora)
35
+
36
+ ## Quick Start
37
+
38
+ ```bash
39
+ cd {project_root}/{.claude or .codex}/skills/acestep-simplemv/
40
+ ./scripts/render-mv.sh --audio /path/to/song.mp3 --lyrics /path/to/song.lrc --title "Song Title"
41
+ ```
42
+
43
+ Output: MP4 file at `out/<audio_basename>.mp4` (or custom `--output` path).
44
+
45
+ ## Script Usage
46
+
47
+ ```bash
48
+ ./scripts/render-mv.sh --audio <file> --lyrics <lrc_file> --title "Title" [options]
49
+
50
+ Options:
51
+ --audio Audio file path (absolute paths supported)
52
+ --lyrics LRC format lyrics file (timestamped)
53
+ --lyrics-json JSON lyrics file [{start, end, text}] (alternative to --lyrics)
54
+ --title Video title (default: "Music Video")
55
+ --subtitle Subtitle text
56
+ --credit Bottom credit text
57
+ --offset Lyric timing offset in seconds (default: -0.5)
58
+ --output Output file path (default: out/<audio_basename>.mp4)
59
+ --codec h264|h265|vp8|vp9 (default: h264)
60
+ --browser Custom browser executable path (Chrome/Edge/Chromium)
61
+
62
+ Environment variables:
63
+ BROWSER_EXECUTABLE Path to browser executable (overrides auto-detection)
64
+ ```
65
+
66
+ ## Browser Detection
67
+
68
+ Remotion requires a Chromium-based browser for rendering. The script auto-detects browsers in this priority order:
69
+
70
+ 1. `BROWSER_EXECUTABLE` environment variable
71
+ 2. `--browser` CLI argument
72
+ 3. Remotion cache (`chrome-headless-shell`, downloaded by Remotion)
73
+ 4. System Chrome (auto-uses `--chrome-mode=chrome-for-testing`)
74
+ 5. **System Edge** (pre-installed on Windows 10/11, auto-uses `--chrome-mode=chrome-for-testing`)
75
+ 6. System Chromium (auto-uses `--chrome-mode=chrome-for-testing`)
76
+
77
+ **Important**: New versions of Chrome/Edge removed the old headless mode. When using regular Chrome/Edge/Chromium, the script automatically sets `--chrome-mode=chrome-for-testing` (which uses `--headless=new`). When using `chrome-headless-shell`, it uses the default `headless-shell` mode (which uses `--headless=old`). This is handled transparently.
78
+
79
+ If no browser is found, Remotion will attempt to download `chrome-headless-shell` from Google servers. **This will fail if Google servers are inaccessible from your network.**
80
+
81
+ ### Workarounds for restricted networks
82
+
83
+ Since **Edge is pre-installed on Windows 10/11**, it should be auto-detected without any manual configuration. The script automatically detects Chrome/Edge and uses the correct headless mode. If auto-detection fails:
84
+
85
+ ```bash
86
+ # Option 1: Set environment variable
87
+ export BROWSER_EXECUTABLE="/path/to/msedge.exe"
88
+
89
+ # Option 2: Pass as CLI argument
90
+ ./scripts/render-mv.sh --audio song.mp3 --lyrics song.lrc --title "Song" --browser "/path/to/msedge.exe"
91
+
92
+ # Option 3: Enable proxy and let Remotion download chrome-headless-shell
93
+ ```
94
+
95
+ ## Examples
96
+
97
+ ```bash
98
+ # Basic render
99
+ ./scripts/render-mv.sh --audio /tmp/abc123_1.mp3 --lyrics /tmp/abc123.lrc --title "夜桜"
100
+
101
+ # Custom output path
102
+ ./scripts/render-mv.sh --audio song.mp3 --lyrics song.lrc --title "My Song" --output /tmp/my_mv.mp4
103
+
104
+ # With subtitle and credit
105
+ ./scripts/render-mv.sh --audio song.mp3 --lyrics song.lrc --title "Song" --subtitle "Artist Name" --credit "Generated by ACE-Step"
106
+ ```
107
+
108
+ ## File Naming
109
+
110
+ **IMPORTANT**: Use the audio file's job ID as the output filename to avoid overwriting. Do NOT invent custom names like `--output my_song.mp4`; always derive the video name from the audio filename.
111
+
112
+ Default output uses the audio filename as base:
113
+ - Audio: `acestep_output/{job_id}_1.mp3`
114
+ - Lyrics: `acestep_output/{job_id}_1.lrc`
115
+ - Video: Pass `--output acestep_output/{job_id}.mp4` (use the job ID from the audio file)
116
+
117
+ Example: if audio is `chatcmpl-abc123_1.mp3`, pass `--output acestep_output/chatcmpl-abc123.mp4`
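The naming rule above (keep the job ID, drop the extension and any trailing track index) can be sketched as a Python helper. `default_video_path` is a hypothetical name for illustration, not a function in the scripts:

```python
import os
import re

def default_video_path(audio_path, out_dir="acestep_output"):
    """Derive the MV output path from the audio filename: drop the
    extension and any trailing track index like `_1`, keeping the
    job ID as the base name."""
    base = os.path.splitext(os.path.basename(audio_path))[0]
    base = re.sub(r"_\d+$", "", base)  # chatcmpl-abc123_1 -> chatcmpl-abc123
    return os.path.join(out_dir, base + ".mp4")

print(default_video_path("acestep_output/chatcmpl-abc123_1.mp3"))
# -> acestep_output/chatcmpl-abc123.mp4 (on POSIX paths)
```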
118
+
119
+ ## Title Guidelines
120
+
121
+ - Keep `--title` short and single-line (max ~50 chars, auto-truncated)
122
+ - Use `--subtitle` for additional info
123
+ - Do NOT put newlines in `--title`
124
+
125
+ Good: `--title "Open Source" --subtitle "ACE-Step v1.5"`
126
+ Bad: `--title "Open Source - ACE-Step v1.5\nCelebrating Music AI"`
127
+
128
+ ## Notes
129
+
130
+ - Audio files with absolute paths are auto-copied to `public/` by render.mjs
131
+ - Duration is auto-detected via ffprobe
132
+ - Typical render time: ~1-2 minutes for a 90s song
133
+ - Output resolution: 1920x1080, 30fps
.claude/skills/acestep-simplemv/scripts/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
.claude/skills/acestep-simplemv/scripts/package.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "name": "acestep-video",
3
+ "version": "1.0.0",
4
+ "description": "",
5
+ "main": "index.js",
6
+ "scripts": {
7
+ "start": "remotion preview",
8
+ "build": "remotion render MusicVideo out/video.mp4",
9
+ "render": "node render.mjs",
10
+ "upgrade": "remotion upgrade"
11
+ },
12
+ "keywords": [],
13
+ "author": "",
14
+ "license": "ISC",
15
+ "type": "commonjs",
16
+ "dependencies": {
17
+ "@remotion/cli": "^4.0.417",
18
+ "@remotion/media-utils": "^4.0.417",
19
+ "react": "^18.3.1",
20
+ "react-dom": "^18.3.1",
21
+ "remotion": "^4.0.417"
22
+ },
23
+ "devDependencies": {
24
+ "@types/react": "^19.2.13",
25
+ "typescript": "^5.9.3"
26
+ }
27
+ }
.claude/skills/acestep-simplemv/scripts/remotion.config.ts ADDED
@@ -0,0 +1,4 @@
1
+ import {Config} from '@remotion/cli/config';
2
+
3
+ Config.setVideoImageFormat('jpeg');
4
+ Config.setOverwriteOutput(true);
.claude/skills/acestep-simplemv/scripts/render-mv.sh ADDED
@@ -0,0 +1,123 @@
1
+ #!/bin/bash
2
+ # render-mv.sh - Render a music video from audio + lyrics
3
+ #
4
+ # Usage:
5
+ # ./render-mv.sh --audio <file> --lyrics <lrc_file> --title "Title" [options]
6
+ #
7
+ # Options:
8
+ # --audio Audio file path (absolute or relative)
9
+ # --lyrics LRC format lyrics file
10
+ # --lyrics-json JSON lyrics file [{start, end, text}]
11
+ # --title Video title (default: "Music Video")
12
+ # --subtitle Subtitle text
13
+ # --credit Bottom credit text
14
+ # --offset Lyric timing offset in seconds (default: -0.5)
15
+ # --output Output file path (default: out/<audio_basename>.mp4)
16
+ # --codec h264|h265|vp8|vp9 (default: h264)
17
+ # --browser Custom browser executable path (Chrome/Edge/Chromium)
18
+ #
19
+ # Environment variables:
20
+ # BROWSER_EXECUTABLE Path to browser executable (overrides auto-detection)
21
+ #
22
+ # Examples:
23
+ # ./render-mv.sh --audio song.mp3 --lyrics song.lrc --title "My Song"
24
+ # ./render-mv.sh --audio /path/to/abc123_1.mp3 --lyrics /path/to/abc123.lrc --title "夜桜"
25
+
26
+ set -euo pipefail
27
+
28
+ RENDER_DIR="$(cd "$(dirname "$0")" && pwd)"
29
+
30
+ # Ensure output directory exists
31
+ mkdir -p "${RENDER_DIR}/out"
32
+
33
+ # Cross-platform realpath alternative (works on macOS/Linux/Windows MSYS2)
34
+ resolve_path() {
35
+ local dir base
36
+ dir="$(cd "$(dirname "$1")" && pwd)"
37
+ base="$(basename "$1")"
38
+ echo "${dir}/${base}"
39
+ }
40
+
41
+ AUDIO=""
42
+ LYRICS=""
43
+ LYRICS_JSON=""
44
+ TITLE="Music Video"
45
+ SUBTITLE=""
46
+ CREDIT=""
47
+ OFFSET="-0.5"
48
+ OUTPUT=""
49
+ CODEC="h264"
50
+ BROWSER=""
51
+
52
+ # Parse args
53
+ while [[ $# -gt 0 ]]; do
54
+ case "$1" in
55
+ --audio) AUDIO="$2"; shift 2 ;;
56
+ --lyrics) LYRICS="$2"; shift 2 ;;
57
+ --lyrics-json) LYRICS_JSON="$2"; shift 2 ;;
58
+ --title) TITLE="$2"; shift 2 ;;
59
+ --subtitle) SUBTITLE="$2"; shift 2 ;;
60
+ --credit) CREDIT="$2"; shift 2 ;;
61
+ --offset) OFFSET="$2"; shift 2 ;;
62
+ --output) OUTPUT="$2"; shift 2 ;;
63
+ --codec) CODEC="$2"; shift 2 ;;
64
+ --browser) BROWSER="$2"; shift 2 ;;
65
+ -h|--help)
66
+ head -20 "$0" | tail -18
67
+ exit 0
68
+ ;;
69
+ *)
70
+ echo "Error: unknown argument: $1" >&2
71
+ exit 1
72
+ ;;
73
+ esac
74
+ done
75
+
76
+ if [[ -z "$AUDIO" ]]; then
77
+ echo "Error: --audio is required" >&2
78
+ exit 1
79
+ fi
80
+
81
+ if [[ ! -f "$AUDIO" ]]; then
82
+ echo "Error: audio file not found: $AUDIO" >&2
83
+ exit 1
84
+ fi
85
+
86
+ # Resolve absolute path for audio
87
+ AUDIO="$(resolve_path "$AUDIO")"
88
+
89
+ # Default output: acestep_output/<audio_basename>.mp4
90
+ if [[ -z "$OUTPUT" ]]; then
91
+ BASENAME="$(basename "${AUDIO%.*}")"
92
+ # Strip trailing _1, _2 etc from audio filename for cleaner video name
93
+ OUTPUT="${RENDER_DIR}/out/${BASENAME}.mp4"
94
+ fi
95
+
96
+ # Ensure output directory exists
97
+ mkdir -p "$(dirname "$OUTPUT")"
98
+
99
+ # Build node args array (safe quoting, no eval)
100
+ NODE_ARGS=(render.mjs --audio "$AUDIO" --title "$TITLE" --offset "$OFFSET" --output "$OUTPUT" --codec "$CODEC")
101
+
102
+ if [[ -n "$LYRICS" ]]; then
103
+ LYRICS="$(resolve_path "$LYRICS")"
104
+ NODE_ARGS+=(--lyrics "$LYRICS")
105
+ elif [[ -n "$LYRICS_JSON" ]]; then
106
+ LYRICS_JSON="$(resolve_path "$LYRICS_JSON")"
107
+ NODE_ARGS+=(--lyrics-json "$LYRICS_JSON")
108
+ fi
109
+
110
+ [[ -n "$SUBTITLE" ]] && NODE_ARGS+=(--subtitle "$SUBTITLE")
111
+ [[ -n "$CREDIT" ]] && NODE_ARGS+=(--credit "$CREDIT")
112
+ [[ -n "$BROWSER" ]] && NODE_ARGS+=(--browser "$BROWSER")
113
+
114
+ echo "Rendering MV..."
115
+ echo " Audio: $(basename "$AUDIO")"
116
+ echo " Title: $TITLE"
117
+ echo " Output: $OUTPUT"
118
+
119
+ cd "$RENDER_DIR"
120
+ node "${NODE_ARGS[@]}"
121
+
122
+ echo ""
123
+ echo "Output: $OUTPUT"
.claude/skills/acestep-simplemv/scripts/render.mjs ADDED
@@ -0,0 +1,345 @@
+ #!/usr/bin/env node
+
+ /**
+  * CLI entry point for rendering music videos.
+  *
+  * Usage:
+  *   node render.mjs --audio music.mp3 --lyrics lyrics.lrc --title "Song Name" --output out/video.mp4
+  *   node render.mjs --audio music.mp3 --lyrics-json lyrics.json --title "Song Name"
+  *
+  * Options:
+  *   --audio        Audio file path (absolute paths auto-copied to public/) or filename in public/
+  *   --lyrics       Path to LRC format lyrics file
+  *   --lyrics-json  Path to JSON lyrics file [{start, end, text}]
+  *   --title        Main title (default: "Music Video")
+  *   --subtitle     Subtitle (default: "")
+  *   --credit       Bottom credit text (default: "")
+  *   --duration     Audio duration in seconds (auto-detected if omitted)
+  *   --offset       Lyric timing offset in seconds (default: -0.5)
+  *   --output       Output file path (default: out/video.mp4)
+  *   --codec        Video codec: h264, h265, vp8, vp9 (default: h264)
+  */
+
+ import {execSync} from 'child_process';
+ import {readFileSync, readdirSync, existsSync, copyFileSync, mkdirSync, writeFileSync, unlinkSync} from 'fs';
+ import {resolve, basename, isAbsolute, join} from 'path';
+ import {homedir} from 'os';
+
+ /**
+  * Resolve a file path that may be a MSYS2/Cygwin-style path on Windows.
+  * Converts paths like /e/foo/bar to E:/foo/bar for Node.js compatibility.
+  */
+ function resolveFilePath(p) {
+   if (process.platform === 'win32' && /^\/[a-zA-Z]\//.test(p)) {
+     // Convert MSYS2 path /x/... to X:/...
+     return p[1].toUpperCase() + ':' + p.slice(2);
+   }
+   return resolve(p);
+ }
+
+ /**
+  * Find a usable browser executable for Remotion rendering.
+  *
+  * Search priority:
+  *   1. Environment variable BROWSER_EXECUTABLE
+  *   2. CLI argument --browser
+  *   3. Local Remotion cache (node_modules/.remotion, chrome-headless-shell)
+  *   4. User-level Remotion cache (chrome-headless-shell)
+  *   5. System Chrome/Edge/Chromium (require --chrome-mode=chrome-for-testing)
+  *
+  * Returns {path, chromeMode}, or {path: null, chromeMode: 'headless-shell'} if nothing is found.
+  *
+  * chromeMode:
+  *   - 'headless-shell': for the chrome-headless-shell binary (uses --headless=old)
+  *   - 'chrome-for-testing': for regular Chrome/Edge/Chromium (uses --headless=new)
+  */
+ function findBrowserExecutable(cliOverride) {
+   // 1. Environment variable (highest priority)
+   const envExe = process.env.BROWSER_EXECUTABLE;
+   if (envExe && existsSync(envExe)) {
+     const mode = isHeadlessShell(envExe) ? 'headless-shell' : 'chrome-for-testing';
+     return {path: envExe, chromeMode: mode};
+   }
+
+   // 2. CLI argument
+   if (cliOverride && existsSync(cliOverride)) {
+     const mode = isHeadlessShell(cliOverride) ? 'headless-shell' : 'chrome-for-testing';
+     return {path: cliOverride, chromeMode: mode};
+   }
+
+   const platform = process.platform;
+   const home = homedir();
+
+   // 3. Local node_modules/.remotion (chrome-headless-shell, uses --headless=old)
+   const localCacheDir = join(process.cwd(), 'node_modules', '.remotion', 'chrome-headless-shell');
+   if (existsSync(localCacheDir)) {
+     try {
+       // Structure: chrome-headless-shell/linux64/chrome-headless-shell-linux64/chrome-headless-shell
+       const platformDir = platform === 'win32' ? 'win64' : platform === 'darwin' ? 'mac-arm64' : 'linux64';
+       const exeName = platform === 'win32' ? 'chrome-headless-shell.exe' : 'chrome-headless-shell';
+       const platformPath = join(localCacheDir, platformDir);
+
+       if (existsSync(platformPath)) {
+         const subdirs = readdirSync(platformPath);
+         for (const subdir of subdirs) {
+           const exe = join(platformPath, subdir, exeName);
+           if (existsSync(exe)) return {path: exe, chromeMode: 'headless-shell'};
+         }
+       }
+     } catch {}
+   }
+
+   // 4. User home Remotion cache (chrome-headless-shell, uses --headless=old)
+   let cacheDir;
+   if (platform === 'win32') {
+     cacheDir = join(home, 'AppData', 'Local', 'remotion', 'chrome-headless-shell');
+   } else if (platform === 'darwin') {
+     cacheDir = join(home, 'Library', 'Caches', 'remotion', 'chrome-headless-shell');
+   } else {
+     cacheDir = join(home, '.cache', 'remotion', 'chrome-headless-shell');
+   }
+
+   if (existsSync(cacheDir)) {
+     try {
+       const versions = readdirSync(cacheDir).sort().reverse();
+       const exeName = platform === 'win32' ? 'chrome-headless-shell.exe' : 'chrome-headless-shell';
+       for (const ver of versions) {
+         const exe = join(cacheDir, ver, exeName);
+         if (existsSync(exe)) return {path: exe, chromeMode: 'headless-shell'};
+       }
+     } catch {}
+   }
+
+   // 5. System browsers: Chrome, Edge, Chromium (require --chrome-mode=chrome-for-testing)
+   const browserPaths = platform === 'win32' ? [
+     // Chrome
+     'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe',
+     'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe',
+     // Edge (pre-installed on Windows 10/11)
+     'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe',
+     'C:\\Program Files\\Microsoft\\Edge\\Application\\msedge.exe',
+   ] : platform === 'darwin' ? [
+     '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome',
+     '/Applications/Microsoft Edge.app/Contents/MacOS/Microsoft Edge',
+     '/Applications/Chromium.app/Contents/MacOS/Chromium',
+   ] : [
+     '/usr/bin/google-chrome',
+     '/usr/bin/google-chrome-stable',
+     '/usr/bin/chromium',
+     '/usr/bin/chromium-browser',
+     '/usr/bin/microsoft-edge',
+     '/usr/bin/microsoft-edge-stable',
+   ];
+
+   for (const p of browserPaths) {
+     if (existsSync(p)) return {path: p, chromeMode: 'chrome-for-testing'};
+   }
+
+   return {path: null, chromeMode: 'headless-shell'};
+ }
+
+ /**
+  * Check if the given executable path is a chrome-headless-shell binary.
+  */
+ function isHeadlessShell(exePath) {
+   const name = exePath.toLowerCase().replace(/\\/g, '/');
+   return name.includes('chrome-headless-shell');
+ }
+
+ function parseLrc(content) {
+   const lines = content.split(/\r?\n/).filter(l => l.trim());
+   const parsed = [];
+   for (const line of lines) {
+     const match = line.match(/^\[(\d{2}):(\d{2})(?:\.(\d{2,3}))?\]\s*(.*)$/);
+     if (match) {
+       const minutes = parseInt(match[1], 10);
+       const seconds = parseInt(match[2], 10);
+       const cs = match[3] ? parseInt(match[3].padEnd(3, '0'), 10) / 1000 : 0;
+       const time = minutes * 60 + seconds + cs;
+       const text = match[4].trim();
+       parsed.push({time, text});
+     }
+   }
+   const result = [];
+   for (let i = 0; i < parsed.length; i++) {
+     const start = parsed[i].time;
+     const end = i < parsed.length - 1 ? parsed[i + 1].time : start + 5;
+     if (parsed[i].text) {
+       result.push({start, end, text: parsed[i].text});
+     }
+   }
+   return result;
+ }
+
+ function getAudioDuration(filePath) {
+   try {
+     const result = execSync(
+       `ffprobe -v error -show_entries format=duration -of csv=p=0 "${filePath}"`,
+       {encoding: 'utf-8'}
+     ).trim();
+     return parseFloat(result);
+   } catch {
+     return null;
+   }
+ }
+
+ function parseArgs(argv) {
+   const args = {};
+   for (let i = 2; i < argv.length; i++) {
+     const key = argv[i];
+     if (key.startsWith('--') && i + 1 < argv.length) {
+       const name = key.slice(2);
+       args[name] = argv[i + 1];
+       i++;
+     }
+   }
+   return args;
+ }
+
+ const args = parseArgs(process.argv);
+
+ // Validate required args
+ if (!args.audio) {
+   console.error('Error: --audio is required');
+   console.error('Usage: node render.mjs --audio music.mp3 --lyrics lyrics.lrc --title "Song"');
+   process.exit(1);
+ }
+
+ // If audio is an absolute path, copy it into public/ and use the filename
+ let audioFileName = args.audio;
+ const resolvedAudio = resolveFilePath(args.audio);
+ if (isAbsolute(resolvedAudio)) {
+   if (!existsSync(resolvedAudio)) {
+     console.error(`Error: Audio file not found: ${resolvedAudio}`);
+     process.exit(1);
+   }
+   const pubDir = resolve('public');
+   mkdirSync(pubDir, {recursive: true});
+   const fname = basename(resolvedAudio);
+   const dest = resolve(pubDir, fname);
+   if (resolve(resolvedAudio) !== dest) {
+     copyFileSync(resolvedAudio, dest);
+     console.log(`Copied audio to public/${fname}`);
+   }
+   audioFileName = fname;
+ } else {
+   // Relative name: must exist in public/
+   const audioPath = resolve('public', args.audio);
+   if (!existsSync(audioPath)) {
+     console.error(`Error: Audio file not found in public/: ${args.audio}`);
+     process.exit(1);
+   }
+ }
+
+ // Parse lyrics
+ let lyrics = [];
+ if (args.lyrics) {
+   const lrcPath = resolveFilePath(args.lyrics);
+   if (!existsSync(lrcPath)) {
+     console.error(`Error: LRC file not found: ${lrcPath}`);
+     process.exit(1);
+   }
+   lyrics = parseLrc(readFileSync(lrcPath, 'utf-8'));
+   console.log(`Parsed ${lyrics.length} lyric lines from LRC file`);
+ } else if (args['lyrics-json']) {
+   const jsonPath = resolveFilePath(args['lyrics-json']);
+   if (!existsSync(jsonPath)) {
+     console.error(`Error: JSON lyrics file not found: ${jsonPath}`);
+     process.exit(1);
+   }
+   lyrics = JSON.parse(readFileSync(jsonPath, 'utf-8'));
+   console.log(`Loaded ${lyrics.length} lyric lines from JSON file`);
+ }
+
+ // Determine audio duration
+ let duration = args.duration ? parseFloat(args.duration) : null;
+ if (!duration) {
+   const audioPath = resolve('public', audioFileName);
+   if (existsSync(audioPath)) {
+     duration = getAudioDuration(audioPath);
+     if (duration) {
+       console.log(`Auto-detected audio duration: ${duration.toFixed(2)}s`);
+     }
+   }
+ }
+ if (!duration) {
+   console.error('Error: Could not detect audio duration. Please provide --duration');
+   process.exit(1);
+ }
+
+ // Build input props
+ // Sanitize title: single-line, max 50 chars
+ const rawTitle = (args.title || 'Music Video').replace(/[\r\n]+/g, ' ').trim();
+ const title = rawTitle.length > 50 ? rawTitle.slice(0, 47) + '...' : rawTitle;
+
+ const inputProps = {
+   audioFileName: audioFileName,
+   lyrics,
+   title,
+   subtitle: (args.subtitle || '').replace(/[\r\n]+/g, ' ').trim(),
+   creditText: args.credit || '',
+   durationInSeconds: duration,
+   lyricOffset: args.offset ? parseFloat(args.offset) : -0.5,
+ };
+
+ const output = args.output ? resolveFilePath(args.output) : 'out/video.mp4';
+ const codec = args.codec || 'h264';
+
+ // Write props to a temp file to avoid shell escaping issues
+ const propsFile = resolve('.render-props.json');
+ writeFileSync(propsFile, JSON.stringify(inputProps));
+
+ // Find a browser executable to avoid re-downloading
+ const {path: browserExe, chromeMode} = findBrowserExecutable(args.browser);
+
+ if (!browserExe) {
+   console.warn('⚠️  No browser found. Remotion will attempt to download chrome-headless-shell from Google servers.');
+   console.warn('   If the download fails (e.g. Google servers inaccessible), try one of these:');
+   console.warn('   1. Set environment variable: BROWSER_EXECUTABLE=/path/to/chrome-or-edge');
+   console.warn('   2. Pass CLI argument: --browser /path/to/chrome-or-edge');
+   console.warn('   3. Enable a proxy and retry');
+   console.warn('');
+ }
+
+ const cmd = [
+   'npx remotion render',
+   'MusicVideo',
+   `"${output}"`,
+   `--props="${propsFile}"`,
+   `--codec=${codec}`,
+   '--log=error',
+   browserExe ? `--browser-executable="${browserExe}"` : '',
+   chromeMode !== 'headless-shell' ? `--chrome-mode=${chromeMode}` : '',
+ ].filter(Boolean).join(' ');
+
+ console.log(`\nRendering video...`);
+ console.log(`  Audio:    ${args.audio}`);
+ console.log(`  Title:    ${inputProps.title}`);
+ console.log(`  Duration: ${duration.toFixed(1)}s`);
+ console.log(`  Lyrics:   ${lyrics.length} lines`);
+ console.log(`  Output:   ${output}`);
+ console.log(`  Codec:    ${codec}`);
+ if (browserExe) console.log(`  Browser:  ${browserExe}`);
+ if (chromeMode !== 'headless-shell') console.log(`  Chrome mode: ${chromeMode}`);
+ console.log('');
+
+ try {
+   const result = execSync(cmd, {encoding: 'utf-8', stdio: ['pipe', 'pipe', 'pipe']});
+   // Only show the final output file line (starts with '+') and size info
+   const outputLines = result.split(/\r?\n/).filter(l => l.includes(output) || /^\+/.test(l.replace(/\x1b\[[0-9;]*m/g, '').trim()));
+   if (outputLines.length) console.log(outputLines.join('\n'));
+   console.log(`\n✅ Video rendered successfully: ${output}`);
+ } catch (e) {
+   // Show stderr on failure for debugging
+   if (e.stderr) console.error(e.stderr.toString());
+   console.error('\n❌ Render failed');
+   process.exit(1);
+ } finally {
+   // Clean up temp props file
+   try {
+     unlinkSync(propsFile);
+   } catch {}
+ }
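The MSYS2 path handling in `resolveFilePath()` can be isolated into a pure function. `msysToWindows` below is a hypothetical helper for illustration only; it omits the `resolve()` fallback the real function applies to ordinary paths, so it stays platform-independent:

```javascript
// Standalone sketch of the MSYS2-to-Windows drive conversion used by
// resolveFilePath() above: /e/foo/bar -> E:/foo/bar. Paths that do not
// match the /<letter>/ prefix pass through unchanged.
function msysToWindows(p) {
  if (/^\/[a-zA-Z]\//.test(p)) {
    return p[1].toUpperCase() + ':' + p.slice(2);
  }
  return p;
}

console.log(msysToWindows('/e/music/song.mp3')); // E:/music/song.mp3
console.log(msysToWindows('C:/already/windows')); // unchanged
```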
.claude/skills/acestep-simplemv/scripts/render.sh ADDED
@@ -0,0 +1,12 @@
+ #!/bin/bash
+ # render.sh - Convenience wrapper for rendering music videos
+ #
+ # Usage:
+ #   ./render.sh --audio music.mp3 --lyrics lyrics.lrc --title "Song Name"
+ #   ./render.sh --audio music.mp3 --lyrics-json lyrics.json --title "Song" --output out/mv.mp4
+ #
+ # All options are passed through to render.mjs. See render.mjs for the full options list.
+
+ set -e
+ cd "$(dirname "$0")"
+ node render.mjs "$@"
.claude/skills/acestep-simplemv/scripts/src/AudioVisualization.tsx ADDED
@@ -0,0 +1,314 @@
+ import React from 'react';
+ import {
+   AbsoluteFill,
+   Audio,
+   useCurrentFrame,
+   useVideoConfig,
+   interpolate,
+   Easing,
+   staticFile,
+ } from 'remotion';
+ import {useAudioData, visualizeAudio} from '@remotion/media-utils';
+ import {MVInputProps} from './types';
+
+ export const AudioVisualization: React.FC<MVInputProps> = ({
+   audioFileName,
+   lyrics,
+   title,
+   subtitle,
+   creditText,
+   lyricOffset,
+ }) => {
+   const frame = useCurrentFrame();
+   const {fps, durationInFrames} = useVideoConfig();
+
+   const audioSrc = audioFileName.startsWith('http')
+     ? audioFileName
+     : staticFile(audioFileName);
+
+   const audioData = useAudioData(audioSrc);
+
+   if (!audioData) {
+     return null;
+   }
+
+   const visualization = visualizeAudio({
+     fps,
+     frame,
+     audioData,
+     numberOfSamples: 128,
+     optimizeFor: 'speed',
+   });
+
+   const currentTime = frame / fps + lyricOffset;
+
+   const currentLyric = lyrics.find(
+     (lyric) => currentTime >= lyric.start && currentTime < lyric.end
+   );
+
+   const lyricProgress = currentLyric
+     ? interpolate(
+         currentTime,
+         [currentLyric.start, currentLyric.start + 0.3],
+         [0, 1],
+         {extrapolateRight: 'clamp'}
+       )
+     : 0;
+
+   const titleOpacity = interpolate(frame, [0, 30], [0, 1], {
+     extrapolateRight: 'clamp',
+   });
+
+   const titleY = interpolate(frame, [0, 30], [-50, 0], {
+     extrapolateRight: 'clamp',
+     easing: Easing.out(Easing.ease),
+   });
+
+   const hue = interpolate(frame, [0, durationInFrames], [200, 320], {
+     extrapolateRight: 'wrap',
+   });
+
+   const avgAmplitude =
+     visualization.reduce((sum, val) => sum + val, 0) / visualization.length;
+
+   return (
+     <AbsoluteFill>
+       {/* Animated gradient background */}
+       <AbsoluteFill
+         style={{
+           background: `linear-gradient(135deg, hsl(${hue}, 80%, 12%) 0%, hsl(${hue + 80}, 70%, 8%) 100%)`,
+         }}
+       />
+
+       {/* Radial glow effect */}
+       <AbsoluteFill
+         style={{
+           background: `radial-gradient(circle at 50% 50%, hsla(${hue}, 100%, 50%, ${avgAmplitude * 0.3}) 0%, transparent 50%)`,
+         }}
+       />
+
+       {/* Audio source */}
+       <Audio src={audioSrc} />
+
+       {/* Bottom frequency bars */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'flex-end',
+           alignItems: 'center',
+         }}
+       >
+         <div
+           style={{
+             display: 'flex',
+             alignItems: 'flex-end',
+             justifyContent: 'center',
+             gap: 4,
+             height: 350,
+             width: '90%',
+             marginBottom: 180,
+           }}
+         >
+           {visualization.map((value, index) => {
+             const scaledValue = Math.pow(value, 0.6);
+             const barHeight = Math.max(scaledValue * 800, 20);
+             const colorIndex = (index / visualization.length) * 360;
+
+             return (
+               <div
+                 key={index}
+                 style={{
+                   width: `${100 / visualization.length}%`,
+                   height: barHeight,
+                   background: `linear-gradient(to top,
+                     hsl(${(colorIndex + hue) % 360}, 90%, 60%),
+                     hsl(${(colorIndex + hue + 40) % 360}, 90%, 70%))`,
+                   borderRadius: '4px 4px 0 0',
+                   boxShadow: `0 0 ${10 + scaledValue * 30}px hsla(${(colorIndex + hue) % 360}, 100%, 60%, ${scaledValue})`,
+                   transition: 'height 0.05s ease-out',
+                 }}
+               />
+             );
+           })}
+         </div>
+       </AbsoluteFill>
+
+       {/* Symmetrical side bars */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'center',
+           alignItems: 'center',
+         }}
+       >
+         {/* Left bars */}
+         <div
+           style={{
+             position: 'absolute',
+             left: 40,
+             display: 'flex',
+             flexDirection: 'column',
+             gap: 8,
+             height: '80%',
+             justifyContent: 'space-around',
+           }}
+         >
+           {visualization.slice(0, 20).map((value, index) => {
+             const scaledValue = Math.pow(value, 0.6);
+             const barWidth = Math.max(scaledValue * 300, 10);
+             const colorIndex = (index / 20) * 360;
+             return (
+               <div
+                 key={index}
+                 style={{
+                   width: barWidth,
+                   height: 12,
+                   background: `linear-gradient(to right,
+                     hsl(${(colorIndex + hue) % 360}, 90%, 60%),
+                     hsl(${(colorIndex + hue + 40) % 360}, 90%, 70%))`,
+                   borderRadius: '0 6px 6px 0',
+                   boxShadow: `0 0 ${10 + scaledValue * 20}px hsla(${(colorIndex + hue) % 360}, 100%, 60%, ${scaledValue})`,
+                 }}
+               />
+             );
+           })}
+         </div>
+
+         {/* Right bars */}
+         <div
+           style={{
+             position: 'absolute',
+             right: 40,
+             display: 'flex',
+             flexDirection: 'column',
+             gap: 8,
+             height: '80%',
+             justifyContent: 'space-around',
+             alignItems: 'flex-end',
+           }}
+         >
+           {visualization.slice(0, 20).map((value, index) => {
+             const scaledValue = Math.pow(value, 0.6);
+             const barWidth = Math.max(scaledValue * 300, 10);
+             const colorIndex = (index / 20) * 360;
+             return (
+               <div
+                 key={index}
+                 style={{
+                   width: barWidth,
+                   height: 12,
+                   background: `linear-gradient(to left,
+                     hsl(${(colorIndex + hue + 180) % 360}, 90%, 60%),
+                     hsl(${(colorIndex + hue + 220) % 360}, 90%, 70%))`,
+                   borderRadius: '6px 0 0 6px',
+                   boxShadow: `0 0 ${10 + scaledValue * 20}px hsla(${(colorIndex + hue + 180) % 360}, 100%, 60%, ${scaledValue})`,
+                 }}
+               />
+             );
+           })}
+         </div>
+       </AbsoluteFill>
+
+       {/* Center title area */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'flex-start',
+           alignItems: 'center',
+           paddingTop: 60,
+         }}
+       >
+         <div
+           style={{
+             textAlign: 'center',
+             transform: `scale(${1 + avgAmplitude * 0.1})`,
+             transition: 'transform 0.1s ease-out',
+           }}
+         >
+           <div
+             style={{
+               fontSize: 96,
+               fontWeight: 'bold',
+               color: 'white',
+               opacity: titleOpacity,
+               transform: `translateY(${titleY}px)`,
+               textShadow: `0 0 40px hsla(${hue}, 100%, 70%, 0.8), 0 4px 20px rgba(0,0,0,0.5)`,
+               fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+               marginBottom: 10,
+             }}
+           >
+             {title}
+           </div>
+           <div
+             style={{
+               fontSize: 56,
+               fontWeight: '600',
+               color: 'rgba(255,255,255,0.95)',
+               opacity: titleOpacity,
+               transform: `translateY(${titleY}px)`,
+               textShadow: `0 0 30px hsla(${hue + 60}, 100%, 70%, 0.6), 0 2px 10px rgba(0,0,0,0.5)`,
+               fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+               letterSpacing: '4px',
+             }}
+           >
+             {subtitle}
+           </div>
+         </div>
+       </AbsoluteFill>
+
+       {/* Lyrics display */}
+       {currentLyric && currentLyric.text && (
+         <AbsoluteFill
+           style={{
+             justifyContent: 'center',
+             alignItems: 'center',
+             paddingTop: 100,
+           }}
+         >
+           <div
+             style={{
+               fontSize: 48,
+               fontWeight: '600',
+               color: 'white',
+               textAlign: 'center',
+               maxWidth: '85%',
+               opacity: lyricProgress,
+               transform: `translateY(${(1 - lyricProgress) * 30}px)`,
+               textShadow: `0 0 40px hsla(${hue}, 100%, 70%, 0.8), 0 4px 30px rgba(0,0,0,0.9)`,
+               fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+               lineHeight: 1.5,
+               padding: '25px 50px',
+               background: `linear-gradient(135deg, rgba(0,0,0,0.4), rgba(0,0,0,0.2))`,
+               backdropFilter: 'blur(15px)',
+               borderRadius: '20px',
+               border: `2px solid hsla(${hue}, 80%, 60%, 0.3)`,
+               boxShadow: `0 8px 32px rgba(0,0,0,0.5), inset 0 0 40px hsla(${hue}, 100%, 50%, 0.1)`,
+             }}
+           >
+             {currentLyric.text}
+           </div>
+         </AbsoluteFill>
+       )}
+
+       {/* Bottom credit text */}
+       <AbsoluteFill
+         style={{
+           justifyContent: 'flex-end',
+           alignItems: 'center',
+           padding: 50,
+         }}
+       >
+         <div
+           style={{
+             fontSize: 32,
+             fontWeight: '500',
+             color: 'white',
+             opacity: 0.8,
+             textAlign: 'center',
+             textShadow: `0 0 20px hsla(${hue}, 100%, 70%, 0.6), 0 2px 10px rgba(0,0,0,0.7)`,
+             fontFamily: '"Noto Sans CJK JP", "Noto Sans CJK SC", Arial, sans-serif',
+           }}
+         >
+           {creditText}
+         </div>
+       </AbsoluteFill>
+     </AbsoluteFill>
+   );
+ };
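The lyric-selection math in this component (the `currentLyric` lookup plus a 0.3 s fade-in ramp) can be sketched without Remotion. `currentLyricState` is a hypothetical helper for illustration, with `interpolate()` replaced by a plain clamped linear ramp:

```javascript
// Sketch of the lyric timing logic: offset the playhead, find the active
// line, and fade it in over its first 0.3 seconds (clamped to [0, 1]).
function currentLyricState(lyrics, frame, fps, lyricOffset) {
  const t = frame / fps + lyricOffset;
  const lyric = lyrics.find((l) => t >= l.start && t < l.end) ?? null;
  const progress = lyric
    ? Math.min(Math.max((t - lyric.start) / 0.3, 0), 1)
    : 0;
  return {lyric, progress};
}

const lyrics = [{start: 2, end: 5, text: 'hello'}];
// At 30 fps with a -0.5 s offset, frame 90 maps to t = 2.5 s:
// inside the line and past the fade window, so progress is 1.
console.log(currentLyricState(lyrics, 90, 30, -0.5));
```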
.claude/skills/acestep-simplemv/scripts/src/Root.tsx ADDED
@@ -0,0 +1,31 @@
+ import React from 'react';
+ import {Composition, CalculateMetadataFunction} from 'remotion';
+ import {AudioVisualization} from './AudioVisualization';
+ import {MVInputProps, defaultProps} from './types';
+
+ const calculateMetadata: CalculateMetadataFunction<MVInputProps> = ({props}) => {
+   const fps = 30;
+   const durationInFrames = Math.ceil(props.durationInSeconds * fps);
+   return {
+     durationInFrames,
+     fps,
+     width: 1920,
+     height: 1080,
+   };
+ };
+
+ export const RemotionRoot: React.FC = () => {
+   return (
+     <>
+       <Composition
+         id="MusicVideo"
+         component={AudioVisualization}
+         fps={30}
+         width={1920}
+         height={1080}
+         defaultProps={defaultProps}
+         calculateMetadata={calculateMetadata}
+       />
+     </>
+   );
+ };
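The `calculateMetadata` hook above derives the composition length from the audio duration. A minimal sketch of that arithmetic (`durationInFrames` here is a hypothetical stand-in, assuming the same 30 fps):

```javascript
// Partial final frames are rounded up so the audio is never truncated.
const fps = 30;
const durationInFrames = (seconds) => Math.ceil(seconds * fps);

console.log(durationInFrames(150));    // 4500
console.log(durationInFrames(150.02)); // 4501 (partial frame rounds up)
```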
.claude/skills/acestep-simplemv/scripts/src/index.ts ADDED
@@ -0,0 +1,4 @@
+ import {registerRoot} from 'remotion';
+ import {RemotionRoot} from './Root';
+
+ registerRoot(RemotionRoot);
.claude/skills/acestep-simplemv/scripts/src/parseLrc.ts ADDED
@@ -0,0 +1,40 @@
+ import {LyricLine} from './types';
+
+ /**
+  * Parse LRC format lyrics into a LyricLine array.
+  * LRC format: [mm:ss.xx] lyrics text
+  *
+  * Example:
+  *   [00:02.99] Version one point five is here today
+  *   [00:07.00] ACE-Step's rising, leading the way
+  */
+ export function parseLrc(lrcContent: string): LyricLine[] {
+   const lines = lrcContent.split('\n').filter((line) => line.trim());
+   const parsed: {time: number; text: string}[] = [];
+
+   for (const line of lines) {
+     // Match [mm:ss.xx] or [mm:ss] format
+     const match = line.match(/^\[(\d{2}):(\d{2})(?:\.(\d{2,3}))?\]\s*(.*)$/);
+     if (match) {
+       const minutes = parseInt(match[1], 10);
+       const seconds = parseInt(match[2], 10);
+       // 2-3 digit fractional part, padded to milliseconds: "99" -> 0.99 s
+       const fracSeconds = match[3] ? parseInt(match[3].padEnd(3, '0'), 10) / 1000 : 0;
+       const time = minutes * 60 + seconds + fracSeconds;
+       const text = match[4].trim();
+       parsed.push({time, text});
+     }
+   }
+
+   // Convert to LyricLine with start/end: each line ends where the next
+   // begins; the last line gets a fixed 5-second window.
+   const result: LyricLine[] = [];
+   for (let i = 0; i < parsed.length; i++) {
+     const start = parsed[i].time;
+     const end = i < parsed.length - 1 ? parsed[i + 1].time : start + 5;
+     const text = parsed[i].text;
+     if (text) {
+       result.push({start, end, text});
+     }
+   }
+
+   return result;
+ }
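The timestamp arithmetic above (a 2-3 digit fractional part padded to milliseconds, added to minutes and seconds) can be checked in isolation. `lrcTimeToSeconds` is a hypothetical helper mirroring that math:

```javascript
// [mm:ss.xx] -> seconds. frac is the raw 2-3 digit fractional string,
// padded to 3 digits so "99" and "990" both mean 0.99 s.
function lrcTimeToSeconds(mm, ss, frac) {
  const sub = frac ? parseInt(frac.padEnd(3, '0'), 10) / 1000 : 0;
  return parseInt(mm, 10) * 60 + parseInt(ss, 10) + sub;
}

console.log(lrcTimeToSeconds('00', '02', '99'));      // 2.99
console.log(lrcTimeToSeconds('01', '07', undefined)); // 67
```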
.claude/skills/acestep-simplemv/scripts/src/types.ts ADDED
@@ -0,0 +1,32 @@
+ export interface LyricLine {
+   start: number;
+   end: number;
+   text: string;
+ }
+
+ export interface MVInputProps extends Record<string, unknown> {
+   /** Path to audio file (relative to public/ or absolute URL) */
+   audioFileName: string;
+   /** Lyrics as JSON array [{start, end, text}] */
+   lyrics: LyricLine[];
+   /** Main title displayed at top */
+   title: string;
+   /** Subtitle displayed below title */
+   subtitle: string;
+   /** Bottom credit text */
+   creditText: string;
+   /** Audio duration in seconds (used to calculate total frames) */
+   durationInSeconds: number;
+   /** Lyric timing offset in seconds (positive = delay, negative = advance) */
+   lyricOffset: number;
+ }
+
+ export const defaultProps: MVInputProps = {
+   audioFileName: 'celebration.mp3',
+   lyrics: [],
+   title: 'ACE-Step',
+   subtitle: 'v1.5',
+   creditText: 'Powered by Claude Code + ACE-Step',
+   durationInSeconds: 150,
+   lyricOffset: -0.5,
+ };
.claude/skills/acestep-simplemv/scripts/tsconfig.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "compilerOptions": {
+     "target": "ES2022",
+     "module": "ES2022",
+     "moduleResolution": "Bundler",
+     "lib": ["DOM", "ES2022"],
+     "jsx": "react-jsx",
+     "skipLibCheck": true,
+     "strict": true,
+     "esModuleInterop": true,
+     "allowSyntheticDefaultImports": true,
+     "forceConsistentCasingInFileNames": true,
+     "resolveJsonModule": true,
+     "isolatedModules": true,
+     "noEmit": true
+   },
+   "include": ["src/**/*"]
+ }
.claude/skills/acestep-songwriting/SKILL.md ADDED
@@ -0,0 +1,194 @@
---
name: acestep-songwriting
description: Music songwriting guide for ACE-Step. Provides professional knowledge on writing captions and lyrics, choosing BPM/key/duration, and structuring songs. Use this skill when users want to create, write, or plan a song before generating it with ACE-Step.
allowed-tools: Read
---

# ACE-Step Songwriting Guide

Professional music creation knowledge for writing captions, lyrics, and choosing music parameters for ACE-Step.

## Output Format

After using this guide, produce three things for the acestep skill:
1. **Caption** (`-c`): Style/genre/instruments/emotion description
2. **Lyrics** (`-l`): Complete structured lyrics with tags
3. **Parameters**: `--duration`, `--bpm`, `--key`, `--time-signature`, `--language`

---

## Caption: The Most Important Input

**The caption is the single most important factor affecting the generated music.**

It supports multiple formats: simple style words, comma-separated tags, or complex natural-language descriptions.

### Common Dimensions

| Dimension | Examples |
|-----------|----------|
| **Style/Genre** | pop, rock, jazz, electronic, hip-hop, R&B, folk, classical, lo-fi, synthwave |
| **Emotion/Atmosphere** | melancholic, uplifting, energetic, dreamy, dark, nostalgic, euphoric, intimate |
| **Instruments** | acoustic guitar, piano, synth pads, 808 drums, strings, brass, electric bass |
| **Timbre Texture** | warm, bright, crisp, muddy, airy, punchy, lush, raw, polished |
| **Era Reference** | 80s synth-pop, 90s grunge, 2010s EDM, vintage soul, modern trap |
| **Production Style** | lo-fi, high-fidelity, live recording, studio-polished, bedroom pop |
| **Vocal Characteristics** | female vocal, male vocal, breathy, powerful, falsetto, raspy, choir |
| **Speed/Rhythm** | slow tempo, mid-tempo, fast-paced, groovy, driving, laid-back |
| **Structure Hints** | building intro, catchy chorus, dramatic bridge, fade-out ending |

### Caption Writing Principles

1. **Specific beats vague** — "sad piano ballad with breathy female vocal" > "a sad song"
2. **Combine multiple dimensions** — style + emotion + instruments + timbre anchors the direction precisely
3. **Use references well** — "in the style of 80s synthwave" conveys a complex aesthetic quickly
4. **Texture words are useful** — warm, crisp, airy, punchy influence mixing and timbre
5. **Don't pursue perfection** — the caption is a starting point; iterate based on results
6. **Granularity determines freedom** — less detail = more model creativity; more detail = more control
7. **Avoid conflicting words** — "classical strings" + "hardcore metal" degrades output
   - **Fix: repetition reinforcement** — repeat the elements you want more of
   - **Fix: conflict to evolution** — "Start with soft strings, middle becomes metal rock, end turns to hip-hop"
8. **Don't put BPM/key/tempo in the caption** — use the dedicated parameters instead

---

## Lyrics: The Temporal Script

The Lyrics input controls how the music unfolds over time. It carries:
- The lyric text itself
- **Structure tags** ([Verse], [Chorus], [Bridge]...)
- **Vocal style hints** ([raspy vocal], [whispered]...)
- **Instrumental sections** ([guitar solo], [drum break]...)
- **Energy changes** ([building energy], [explosive drop]...)

### Structure Tags

| Category | Tag | Description |
|----------|-----|-------------|
| **Basic Structure** | `[Intro]` | Opening, establish atmosphere |
| | `[Verse]` / `[Verse 1]` | Verse, narrative progression |
| | `[Pre-Chorus]` | Pre-chorus, build energy |
| | `[Chorus]` | Chorus, emotional climax |
| | `[Bridge]` | Bridge, transition or elevation |
| | `[Outro]` | Ending, conclusion |
| **Dynamic Sections** | `[Build]` | Energy gradually rising |
| | `[Drop]` | Electronic music energy release |
| | `[Breakdown]` | Reduced instrumentation, space |
| **Instrumental** | `[Instrumental]` | Pure instrumental, no vocals |
| | `[Guitar Solo]` | Guitar solo |
| | `[Piano Interlude]` | Piano interlude |
| **Special** | `[Fade Out]` | Fade-out ending |
| | `[Silence]` | Silence |

### Combining Tags

Use `-` for finer control, but keep it concise:

```
✅ [Chorus - anthemic]
❌ [Chorus - anthemic - stacked harmonies - high energy - powerful - epic]
```

Put complex style descriptions in the Caption, not in tags.

### Caption-Lyrics Consistency

**Models are not good at resolving conflicts.** Checklist:
- Instruments in Caption ↔ instrumental section tags in Lyrics
- Emotion in Caption ↔ energy tags in Lyrics
- Vocal description in Caption ↔ vocal control tags in Lyrics

### Vocal Control Tags

| Tag | Effect |
|-----|--------|
| `[raspy vocal]` | Raspy, textured vocals |
| `[whispered]` | Whispered delivery |
| `[falsetto]` | Falsetto |
| `[powerful belting]` | Powerful, high-pitched singing |
| `[spoken word]` | Rap/recitation |
| `[harmonies]` | Layered harmonies |
| `[call and response]` | Call and response |
| `[ad-lib]` | Improvised embellishments |

### Energy and Emotion Tags

| Tag | Effect |
|-----|--------|
| `[high energy]` | High energy, passionate |
| `[low energy]` | Low energy, restrained |
| `[building energy]` | Increasing energy |
| `[explosive]` | Explosive energy |
| `[melancholic]` | Melancholic |
| `[euphoric]` | Euphoric |
| `[dreamy]` | Dreamy |
| `[aggressive]` | Aggressive |

### Lyric Writing Tips

1. **6-10 syllables per line** — the model aligns syllables to beats; keep similar counts for lines in the same position (±1-2)
2. **Uppercase = stronger intensity** — `WE ARE THE CHAMPIONS!` (shouting) vs `walking through the streets` (normal)
3. **Parentheses = background vocals** — `We rise together (together)`
4. **Extend vowels** — `Feeeling so aliiive` (use cautiously; effects are unstable)
5. **Clear section separation** — blank lines between sections

### Avoiding "AI-flavored" Lyrics

| Red Flag | Description |
|----------|-------------|
| **Adjective stacking** | "neon skies, electric hearts, endless dreams" — vague imagery filler |
| **Rhyme chaos** | Inconsistent patterns or forced rhymes that break the meaning |
| **Blurred boundaries** | Lyric content crosses structure tags |
| **No breathing room** | Lines too long to sing in one breath |
| **Mixed metaphors** | Water → fire → flying — listeners can't anchor |

**Metaphor discipline**: one core metaphor per song; explore its multiple aspects.

---

## Music Metadata

**Most of the time, let the LM auto-infer these.** Only set them manually when you have clear requirements.

| Parameter | Range | Description |
|-----------|-------|-------------|
| `bpm` | 30–300 | Slow 60–80, mid 90–120, fast 130–180 |
| `keyscale` | Key | e.g. `C Major`, `Am`. Common keys (C, G, D, Am, Em) are most stable |
| `timesignature` | Time sig | `4/4` (most common), `3/4` (waltz), `6/8` (swing) |
| `vocal_language` | Language | Usually auto-detected from lyrics |
| `duration` | Seconds | See duration calculation below |

### When to Set Manually

| Scenario | Set |
|----------|-----|
| Daily generation | Let the LM auto-infer |
| Clear tempo requirement | `bpm` |
| Specific style (waltz) | `timesignature=3/4` |
| Match other material | `bpm` + `duration` |
| Specific key color | `keyscale` |

---

## Duration Calculation

### Estimation Method

- **Intro/Outro**: 5-10 seconds each
- **Instrumental sections**: 5-15 seconds each
- **Typical structures**:
  - 2 verses + 2 choruses: 120-150s minimum
  - 2 verses + 2 choruses + bridge: 180-240s minimum
  - Full song with intro/outro: 210-270s (3.5-4.5 min)

### BPM and Duration Relationship

- **Slower BPM (60-80)**: needs MORE duration for the same lyrics
- **Medium BPM (100-130)**: standard duration
- **Faster BPM (150-180)**: can fit more lyrics, but still needs breathing room

**Rule of thumb**: when in doubt, estimate longer. A song that is too short feels rushed.
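The estimation method above can be sketched as a small helper. The per-section lengths here are assumed midpoints of the heuristic ranges in this guide, not anything ACE-Step enforces:

```shell
#!/bin/sh
# Rough duration estimate from song structure.
# Assumed per-section lengths (seconds): intro/outro 10 each,
# verse 40, chorus 35, bridge 25, instrumental section 10.
estimate_duration() {
  verses=$1; choruses=$2; bridges=$3; instrumentals=$4
  echo $(( 10 + 10 + verses * 40 + choruses * 35 + bridges * 25 + instrumentals * 10 ))
}

# 2 verses + 2 choruses + bridge + one solo, with intro/outro:
estimate_duration 2 2 1 1   # → 205
```

This lands inside the 180-240s band suggested above; round up rather than down when passing `--duration`.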

---

Note: keep Lyrics tags (piano, powerful, whispered) consistent with the Caption (piano ballad, building to powerful chorus, intimate).
.claude/skills/acestep/SKILL.md ADDED
@@ -0,0 +1,253 @@
---
name: acestep
description: Use ACE-Step API to generate music, edit songs, and remix music. Supports text-to-music, lyrics generation, audio continuation, and audio repainting. Use this skill when users mention generating music, creating songs, music production, remixing, or audio continuation.
allowed-tools: Read, Write, Bash, Skill
---

# ACE-Step Music Generation Skill

Use the ACE-Step V1.5 API for music generation. **Always use the `scripts/acestep.sh` script** — do NOT call API endpoints directly.

## Quick Start

```bash
# 1. cd to this skill's directory
cd {project_root}/{.claude or .codex}/skills/acestep/

# 2. Check API service health
./scripts/acestep.sh health

# 3. Generate with lyrics (recommended)
./scripts/acestep.sh generate -c "pop, female vocal, piano" -l "[Verse] Your lyrics here..." --duration 120 --language zh

# 4. Output saved to: {project_root}/acestep_output/
```

## Workflow

For user requests requiring vocals:
1. Use the **acestep-songwriting** skill for lyrics writing, caption creation, and duration/BPM/key selection
2. Write complete, well-structured lyrics yourself based on the songwriting guide
3. Generate using Caption mode with the `-c` and `-l` parameters

Only use Simple/Random mode (`-d` or `random`) for quick inspiration or instrumental exploration.

If the user needs a simple music video, use the **acestep-simplemv** skill to render one with waveform visualization and synced lyrics.

**MV Production Requirements**: Making a simple MV requires three additional skills to be installed:
- **acestep-songwriting** — for writing lyrics and planning song structure
- **acestep-lyrics-transcription** — for transcribing audio to timestamped lyrics (LRC)
- **acestep-simplemv** — for rendering the final music video

## Script Commands

**CRITICAL - Complete Lyrics Input**: When providing lyrics via the `-l` parameter, you MUST pass ALL lyrics content WITHOUT any omission:
- If the user provides lyrics, pass the ENTIRE text they give you
- If you generate lyrics yourself, pass the COMPLETE lyrics you created
- NEVER truncate, shorten, or pass only partial lyrics
- Missing lyrics will result in incomplete or incoherent songs

**Music Parameters**: Use the **acestep-songwriting** skill for guidance on duration, BPM, key scale, and time signature.

```bash
# need to cd to this skill's directory first
cd {project_root}/{.claude or .codex}/skills/acestep/

# Caption mode - RECOMMENDED: Write lyrics first, then generate
./scripts/acestep.sh generate -c "Electronic pop, energetic synths" -l "[Verse] Your complete lyrics
[Chorus] Full chorus here..." --duration 120 --bpm 128

# Instrumental only
./scripts/acestep.sh generate "Jazz with saxophone"

# Quick exploration (Simple/Random mode)
./scripts/acestep.sh generate -d "A cheerful song about spring"
./scripts/acestep.sh random

# Options
./scripts/acestep.sh generate "Rock" --duration 60 --batch 2
./scripts/acestep.sh generate "EDM" --no-thinking   # Faster

# Other commands
./scripts/acestep.sh status <job_id>
./scripts/acestep.sh health
./scripts/acestep.sh models
```

## Output Files

After generation, the script automatically saves results to the `acestep_output` folder in the project root (at the same level as `.claude`):

```
project_root/
├── .claude/
│   └── skills/acestep/...
├── acestep_output/          # Output directory
│   ├── <job_id>.json        # Complete task result (JSON)
│   ├── <job_id>_1.mp3       # First audio file
│   ├── <job_id>_2.mp3       # Second audio file (if batch_size > 1)
│   └── ...
└── ...
```

### JSON Result Structure

**Important**: When LM enhancement is enabled (`use_format=true`), the final synthesized content may differ from your input. Check the JSON file for the actual values:

| Field | Description |
|-------|-------------|
| `prompt` | **Actual caption** used for synthesis (may be LM-enhanced) |
| `lyrics` | **Actual lyrics** used for synthesis (may be LM-enhanced) |
| `metas.prompt` | Original input caption |
| `metas.lyrics` | Original input lyrics |
| `metas.bpm` | BPM used |
| `metas.keyscale` | Key scale used |
| `metas.duration` | Duration in seconds |
| `generation_info` | Detailed timing and model info |
| `seed_value` | Seeds used (for reproducibility) |
| `lm_model` | LM model name |
| `dit_model` | DiT model name |

To get the actual synthesized lyrics, parse the JSON and read the top-level `lyrics` field, not `metas.lyrics`.
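For example, the two fields can be read with jq like this. The file below is a hypothetical, minimal stand-in for a real `acestep_output/<job_id>.json`:

```shell
#!/bin/sh
# Hypothetical minimal job result — real files contain many more fields.
cat > job.json <<'EOF'
{"lyrics": "[Verse] enhanced line", "metas": {"lyrics": "[Verse] original line", "bpm": 120}}
EOF

jq -r '.lyrics' job.json        # actual synthesized lyrics (possibly LM-enhanced)
jq -r '.metas.lyrics' job.json  # original input lyrics
```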

## Configuration

**Important**: Configuration follows this priority (high to low):

1. **Command-line arguments** > **config.json defaults**
2. User-specified parameters **temporarily override** defaults but **do not modify** config.json
3. Only the `config --set` command **permanently modifies** config.json

### Default Config File (`scripts/config.json`)

```json
{
  "api_url": "http://127.0.0.1:8001",
  "api_key": "",
  "api_mode": "completion",
  "generation": {
    "thinking": true,
    "use_format": false,
    "use_cot_caption": true,
    "use_cot_language": false,
    "batch_size": 1,
    "audio_format": "mp3",
    "vocal_language": "en"
  }
}
```

| Option | Default | Description |
|--------|---------|-------------|
| `api_url` | `http://127.0.0.1:8001` | API server address |
| `api_key` | `""` | API authentication key (optional) |
| `api_mode` | `completion` | API mode: `completion` (OpenRouter, default) or `native` (polling) |
| `generation.thinking` | `true` | Enable 5Hz LM (higher quality, slower) |
| `generation.audio_format` | `mp3` | Output format (mp3/wav/flac) |
| `generation.vocal_language` | `en` | Vocal language |

## Prerequisites - ACE-Step API Service

**IMPORTANT**: This skill requires the ACE-Step API server to be running.

### Required Dependencies

The `scripts/acestep.sh` script requires: **curl** and **jq**.

```bash
# Check dependencies
curl --version
jq --version
```

If jq is not installed, the script will attempt to install it automatically. If automatic installation fails:
- **Windows**: `choco install jq` or download from https://jqlang.github.io/jq/download/
- **macOS**: `brew install jq`
- **Linux**: `sudo apt-get install jq` (Debian/Ubuntu) or `sudo dnf install jq` (Fedora)

### Before First Use

**You MUST check the API key and URL status before proceeding.** Run:

```bash
cd "{project_root}/{.claude or .codex}/skills/acestep/" && bash ./scripts/acestep.sh config --check-key
cd "{project_root}/{.claude or .codex}/skills/acestep/" && bash ./scripts/acestep.sh config --get api_url
```

#### Case 1: Using Official Cloud API (`https://api.acemusic.ai`) without API key

If `api_url` is `https://api.acemusic.ai` and `api_key` is `empty`, you MUST stop and guide the user to configure their key:

1. Tell the user: "You're using the ACE-Step official cloud API, but no API key is configured. An API key is required to use this service."
2. Explain how to get a key: API keys are currently available through the official ACE-Step Discord community (https://discord.gg/bGVxwUyD). Additional distribution methods will be added in the future.
3. Use `AskUserQuestion` to ask the user to provide their API key.
4. Once provided, configure it:
   ```bash
   cd "{project_root}/{.claude or .codex}/skills/acestep/" && bash ./scripts/acestep.sh config --set api_key <KEY>
   ```
5. Additionally, inform the user: "If you also want to render music videos (MV), it's recommended to configure a lyrics transcription API key as well (OpenAI Whisper or ElevenLabs Scribe), so that lyrics can be automatically transcribed with accurate timestamps. You can configure it later via the `acestep-lyrics-transcription` skill."

#### Case 2: API key is configured

Verify the API endpoint with `./scripts/acestep.sh health` and proceed with music generation.

#### Case 3: Using local/custom API without key

Local services (`http://127.0.0.1:*`) typically don't require a key. Verify with `./scripts/acestep.sh health` and proceed.

If the health check fails:
- Ask: "Do you have ACE-Step installed?"
- **If installed but not running**: Use the acestep-docs skill to help them start the service
- **If not installed**: Use the acestep-docs skill to guide them through installation

### Service Configuration

**Official Cloud API:** ACE-Step provides an official API endpoint at `https://api.acemusic.ai`. To use it:
```bash
./scripts/acestep.sh config --set api_url "https://api.acemusic.ai"
./scripts/acestep.sh config --set api_key "your-key"
./scripts/acestep.sh config --set api_mode completion
```
API keys are currently available through the official ACE-Step Discord community. Additional distribution methods will be added in the future.

**Local Service (Default):** No configuration needed — connects to `http://127.0.0.1:8001`.

**Custom Remote Service:** Update `scripts/config.json` or use:
```bash
./scripts/acestep.sh config --set api_url "http://remote-server:8001"
./scripts/acestep.sh config --set api_key "your-key"
```

**API Key Handling**: When checking whether an API key is configured, use `config --check-key`, which only reports `configured` or `empty` without printing the actual key. **NEVER use `config --get api_key`** or read `config.json` directly — these would expose the user's API key. The `config --list` command is safe — it automatically masks API keys as `***` in output.

### API Mode

The skill supports two API modes. Switch via `api_mode` in `scripts/config.json`:

| Mode | Endpoint | Description |
|------|----------|-------------|
| `completion` (default) | `/v1/chat/completions` | OpenRouter-compatible, sync request, audio returned as base64 |
| `native` | `/release_task` + `/query_result` | Async polling mode, supports all parameters |

**Switch mode:**
```bash
./scripts/acestep.sh config --set api_mode completion
./scripts/acestep.sh config --set api_mode native
```

**Completion mode notes:**
- No polling needed — a single request returns the result directly
- Audio is base64-encoded inline in the response (auto-decoded and saved)
- `inference_steps`, `infer_method`, `shift` are not configurable (server defaults)
- `--no-wait` and `status` commands are not applicable in completion mode
- Requires the `model` field — auto-detected from `/v1/models` if not specified

### Using acestep-docs Skill for Setup Help

**IMPORTANT**: For installation and startup, always use the acestep-docs skill to get complete and accurate guidance.

**DO NOT provide simplified startup commands** — each user's environment may be different. Always guide them to use acestep-docs for proper setup.

---

For API debugging, see [API Reference](./api-reference.md).
.claude/skills/acestep/api-reference.md ADDED
@@ -0,0 +1,149 @@
# ACE-Step API Reference

> For debugging and advanced usage only. Normal operations should use `scripts/acestep.sh`.

## Native Mode Endpoints

All responses are wrapped: `{"data": <payload>, "code": 200, "error": null, "timestamp": ...}`

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/release_task` | POST | Create generation task |
| `/query_result` | POST | Query task status, body: `{"task_id_list": ["id"]}` |
| `/v1/models` | GET | List available models |
| `/v1/audio?path={path}` | GET | Download audio file |

## Completion Mode Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | POST | Generate music (OpenRouter-compatible) |
| `/v1/models` | GET | List available models (OpenRouter format) |

## Query Result Response

```json
{
  "data": [{
    "task_id": "xxx",
    "status": 1,
    "result": "[{\"file\":\"/v1/audio?path=...\",\"metas\":{\"bpm\":120,\"duration\":60,\"keyscale\":\"C Major\"}}]"
  }]
}
```

Status codes: `0` = processing, `1` = success, `2` = failed
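Note that `result` is itself a JSON *string*, so it needs a second decode. A minimal jq sketch against a hypothetical response body:

```shell
#!/bin/sh
# Hypothetical /query_result response; .result is a JSON string,
# so pipe it through jq's fromjson for the second parse.
response='{"data":[{"task_id":"xxx","status":1,"result":"[{\"file\":\"/v1/audio?path=out.mp3\",\"metas\":{\"bpm\":120,\"duration\":60,\"keyscale\":\"C Major\"}}]"}]}'

status=$(echo "$response" | jq -r '.data[0].status')
file=$(echo "$response"   | jq -r '.data[0].result | fromjson | .[0].file')
bpm=$(echo "$response"    | jq -r '.data[0].result | fromjson | .[0].metas.bpm')
echo "status=$status file=$file bpm=$bpm"
```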

## Completion Mode Request (`/v1/chat/completions`)

**Caption mode** — prompt and lyrics wrapped in XML tags inside the message content:
```json
{
  "model": "acestep/ACE-Step-v1.5",
  "messages": [{"role": "user", "content": "<prompt>Jazz with saxophone</prompt><lyrics>[Verse] Hello...</lyrics>"}],
  "stream": false,
  "thinking": true,
  "use_format": false,
  "audio_config": {"duration": 90, "bpm": 110, "format": "mp3", "vocal_language": "en"}
}
```

**Simple mode** — plain-text message, set `sample_mode: true`:
```json
{
  "model": "acestep/ACE-Step-v1.5",
  "messages": [{"role": "user", "content": "A cheerful pop song about spring"}],
  "stream": false,
  "sample_mode": true,
  "thinking": true
}
```

## Completion Mode Response

```json
{
  "id": "chatcmpl-abc123",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "## Metadata\n**Caption:** ...\n**BPM:** 128\n\n## Lyrics\n...",
      "audio": [{"type": "audio_url", "audio_url": {"url": "data:audio/mpeg;base64,..."}}]
    },
    "finish_reason": "stop"
  }]
}
```

Audio is base64-encoded inline — the script auto-decodes and saves to `acestep_output/`.
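A sketch of that decode step, should you need it outside the script. The response here is hypothetical, with `SGVsbG8=` standing in for real audio bytes:

```shell
#!/bin/sh
# Extract the data: URL from the response, strip its media-type prefix,
# then base64-decode the payload to a file.
response='{"choices":[{"message":{"audio":[{"type":"audio_url","audio_url":{"url":"data:audio/mpeg;base64,SGVsbG8="}}]}}]}'

url=$(echo "$response" | jq -r '.choices[0].message.audio[0].audio_url.url')
printf '%s' "${url#data:audio/mpeg;base64,}" | base64 -d > out.bin
```

(`base64 -d` is the GNU coreutils flag; BSD/macOS also accepts `--decode`.)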

## Request Parameters (`/release_task`)

Parameters can be placed in the `param_obj` object.

### Generation Modes

| Mode | Usage | When to Use |
|------|-------|-------------|
| **Caption** (Recommended) | `generate -c "style" -l "lyrics"` | For vocal songs — write lyrics yourself first |
| **Simple** | `generate -d "description"` | Quick exploration, LM generates everything |
| **Random** | `random` | Random generation for inspiration |

### Core Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `prompt` | string | "" | Music style description (Caption mode) |
| `lyrics` | string | "" | **Full lyrics content** — pass ALL lyrics without omission. Use `[inst]` for instrumental. Partial/truncated lyrics = incomplete songs |
| `sample_mode` | bool | false | Enable Simple/Random mode |
| `sample_query` | string | "" | Description for Simple mode |
| `thinking` | bool | false | Enable 5Hz LM for audio code generation |
| `use_format` | bool | false | Use LM to enhance caption/lyrics |
| `model` | string | - | DiT model name |
| `batch_size` | int | 1 | Number of audio files to generate |

### Music Attributes

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio_duration` | float | - | Duration in seconds |
| `bpm` | int | - | Tempo (beats per minute) |
| `key_scale` | string | "" | Key (e.g. "C Major") |
| `time_signature` | string | "" | Time signature (e.g. "4/4") |
| `vocal_language` | string | "en" | Language code (en, zh, ja, etc.) |
| `audio_format` | string | "mp3" | Output format (mp3/wav/flac) |

### Generation Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `inference_steps` | int | 8 | Diffusion steps |
| `guidance_scale` | float | 7.0 | CFG scale |
| `seed` | int | -1 | Random seed (-1 for random) |
| `infer_method` | string | "ode" | Diffusion method (ode/sde) |

### Audio Task Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | string | "text2music" | text2music / continuation / repainting |
| `src_audio_path` | string | - | Source audio for continuation |
| `repainting_start` | float | 0.0 | Repainting start position (seconds) |
| `repainting_end` | float | - | Repainting end position (seconds) |

### Example Request (Simple Mode)

```json
{
  "sample_mode": true,
  "sample_query": "A cheerful pop song about spring",
  "thinking": true,
  "param_obj": {
    "duration": 60,
    "bpm": 120,
    "language": "en"
  },
  "batch_size": 2
}
```
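For debugging by hand, the same request can be built with jq and (commented out here) sent with curl; the URL and example values are placeholders, and normal use should still go through `scripts/acestep.sh`:

```shell
#!/bin/sh
# Build the Simple-mode /release_task payload programmatically,
# then (commented) POST it to a local server.
payload=$(jq -n \
  --arg query "A cheerful pop song about spring" \
  --argjson thinking true \
  '{sample_mode: true, sample_query: $query, thinking: $thinking,
    param_obj: {duration: 60, bpm: 120, language: "en"}, batch_size: 2}')
echo "$payload"
# curl -s -X POST "http://127.0.0.1:8001/release_task" \
#   -H "Content-Type: application/json" -d "$payload"
```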
.claude/skills/acestep/scripts/acestep.sh ADDED
@@ -0,0 +1,1093 @@
1
+ #!/bin/bash
2
+ #
3
+ # ACE-Step Music Generation CLI (Bash + Curl + jq)
4
+ #
5
+ # Requirements: curl, jq
6
+ #
7
+ # Usage:
8
+ # ./acestep.sh generate "Music description" [options]
9
+ # ./acestep.sh random [--no-thinking]
10
+ # ./acestep.sh status <job_id>
11
+ # ./acestep.sh models
12
+ # ./acestep.sh health
13
+ # ./acestep.sh config [--get|--set|--reset]
14
+ #
15
+ # Output:
16
+ # - Results saved to output/<job_id>.json
17
+ # - Audio files downloaded to output/<job_id>_1.mp3, output/<job_id>_2.mp3, ...
18
+
19
+ set -e
20
+
21
+ # Ensure UTF-8 encoding for non-ASCII characters (Japanese, Chinese, etc.)
22
+ export LANG="${LANG:-en_US.UTF-8}"
23
+ export LC_ALL="${LC_ALL:-en_US.UTF-8}"
24
+
25
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
26
+ CONFIG_FILE="${SCRIPT_DIR}/config.json"
27
+ # Output dir at same level as .claude (go up 4 levels from scripts/)
28
+ OUTPUT_DIR="$(cd "${SCRIPT_DIR}/../../../.." && pwd)/acestep_output"
29
+ DEFAULT_API_URL="http://127.0.0.1:8001"
30
+
31
+ # Colors
32
+ RED='\033[0;31m'
33
+ GREEN='\033[0;32m'
34
+ YELLOW='\033[1;33m'
35
+ CYAN='\033[0;36m'
36
+ NC='\033[0m'
37
+
38
+ # Check dependencies
39
+ check_deps() {
40
+ if ! command -v curl &> /dev/null; then
41
+ echo -e "${RED}Error: curl is required but not installed.${NC}"
42
+ exit 1
43
+ fi
44
+ if ! command -v jq &> /dev/null; then
45
+ echo -e "${RED}Error: jq is required but not installed.${NC}"
46
+ echo "Install: apt install jq / brew install jq / choco install jq"
47
+ exit 1
48
+ fi
49
+ }
50
+
51
+ # JSON value extractor using jq
52
+ # Usage: json_get "$json" ".key" or json_get "$json" ".nested.key"
53
+ json_get() {
54
+ local json="$1"
55
+ local path="$2"
56
+ echo "$json" | jq -r "$path // empty" 2>/dev/null
57
+ }
58
+
59
+ # Extract array values using jq
60
+ json_get_array() {
61
+ local json="$1"
62
+ local path="$2"
63
+ echo "$json" | jq -r "$path[]? // empty" 2>/dev/null
64
+ }
65
+
66
+ # Ensure output directory exists
67
+ ensure_output_dir() {
68
+ mkdir -p "$OUTPUT_DIR"
69
+ }
70
+
71
+ # Default config
72
+ DEFAULT_CONFIG='{
73
+ "api_url": "http://127.0.0.1:8001",
74
+ "api_key": "",
75
+ "api_mode": "native",
76
+ "generation": {
77
+ "thinking": true,
78
+ "use_format": true,
79
+ "use_cot_caption": true,
80
+ "use_cot_language": true,
81
+ "audio_format": "mp3",
82
+ "vocal_language": "en"
83
+ }
84
+ }'
85
+
86
+ # Ensure config file exists
87
+ ensure_config() {
88
+ if [ ! -f "$CONFIG_FILE" ]; then
89
+ local example="${SCRIPT_DIR}/config.example.json"
90
+ if [ -f "$example" ]; then
91
+ cp "$example" "$CONFIG_FILE"
92
+ echo -e "${YELLOW}Config file created from config.example.json. Please configure your settings:${NC}"
93
+ echo -e " ${CYAN}./scripts/acestep.sh config --set api_url <url>${NC}"
94
+ echo -e " ${CYAN}./scripts/acestep.sh config --set api_key <key>${NC}"
95
+ else
96
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
97
+ fi
98
+ fi
99
+ }
100
+
101
+ # Get config value using jq
102
+ get_config() {
103
+ local key="$1"
104
+ ensure_config
105
+ # Convert dot notation to jq path: "generation.thinking" -> ".generation.thinking"
106
+ local jq_path=".${key}"
107
+ local value
108
+ # Don't use // operator as it treats boolean false as falsy
109
+ value=$(jq -r "$jq_path" "$CONFIG_FILE" 2>/dev/null)
110
+ # Remove any trailing whitespace/newlines (Windows compatibility)
111
+ # Return empty string if value is "null" (key doesn't exist)
112
+ if [ "$value" = "null" ]; then
113
+ echo ""
114
+ else
115
+ echo "$value" | tr -d '\r\n'
116
+ fi
117
+ }
118
+
119
+ # Normalize boolean value for jq --argjson
120
+ normalize_bool() {
121
+ local val="$1"
122
+ local default="${2:-false}"
123
+ case "$val" in
124
+ true|True|TRUE|1) echo "true" ;;
125
+ false|False|FALSE|0) echo "false" ;;
126
+ *) echo "$default" ;;
127
+ esac
128
+ }
129
+
130
+ # Set config value using jq
131
+ set_config() {
132
+ local key="$1"
133
+ local value="$2"
134
+ ensure_config
135
+
136
+ local tmp_file="${CONFIG_FILE}.tmp"
137
+ local jq_path=".${key}"
138
+
139
+ # Determine value type and set accordingly
140
+ if [ "$value" = "true" ] || [ "$value" = "false" ]; then
141
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
142
+ elif [[ "$value" =~ ^-?[0-9]+$ ]] || [[ "$value" =~ ^-?[0-9]+\.[0-9]+$ ]]; then
143
+ jq "$jq_path = $value" "$CONFIG_FILE" > "$tmp_file"
144
+ else
145
+ jq "$jq_path = \"$value\"" "$CONFIG_FILE" > "$tmp_file"
146
+ fi
147
+
148
+ mv "$tmp_file" "$CONFIG_FILE"
149
+ echo "Set $key = $value"
150
+ }
151
+
152
+ # Load API URL
153
+ load_api_url() {
154
+ local url=$(get_config "api_url")
155
+ echo "${url:-$DEFAULT_API_URL}"
156
+ }
157
+
158
+ # Load API Key
159
+ load_api_key() {
160
+ local key=$(get_config "api_key")
161
+ echo "${key:-}"
162
+ }
163
+
164
+ # Check API health
165
+ check_health() {
166
+ local url="$1"
167
+ local status
168
+ status=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 "${url}/health" 2>/dev/null) || true
169
+ [ "$status" = "200" ]
170
+ }
171
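The exit-status contract matters more than any output here: callers use `check_health` directly as an `if` condition. With `-o /dev/null`, curl's `-w "%{http_code}"` prints only the status code, and the function succeeds exactly when that code is `200`. A stubbed sketch (the curl call replaced by a fixed code; `check_health_demo` is a hypothetical name):

```shell
check_health_demo() {
  local status="$1"   # stand-in for: curl -s -o /dev/null -w "%{http_code}" "$url/health"
  [ "$status" = "200" ]   # function's exit status IS the test's exit status
}

check_health_demo 200 && echo "healthy"
check_health_demo 503 || echo "unhealthy"
```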
+
172
+ # Build auth header
173
+ build_auth_header() {
174
+ local api_key=$(load_api_key)
175
+ if [ -n "$api_key" ]; then
176
+ echo "-H \"Authorization: Bearer ${api_key}\""
177
+ fi
178
+ }
179
+
180
+ # Prompt for URL
181
+ prompt_for_url() {
182
+ echo ""
183
+ echo -e "${YELLOW}API server is not responding.${NC}"
184
+ echo "Please enter the API URL (or press Enter for default):"
185
+ read -p "API URL [$DEFAULT_API_URL]: " user_input
186
+ echo "${user_input:-$DEFAULT_API_URL}"
187
+ }
188
+
189
+ # Ensure API connection
190
+ ensure_connection() {
191
+ ensure_config
192
+ local api_url=$(load_api_url)
193
+
194
+ if check_health "$api_url"; then
195
+ echo "$api_url"
196
+ return 0
197
+ fi
198
+
199
+ echo -e "${YELLOW}Cannot connect to: $api_url${NC}" >&2
200
+ local new_url=$(prompt_for_url)
201
+
202
+ if check_health "$new_url"; then
203
+ set_config "api_url" "$new_url" > /dev/null
204
+ echo -e "${GREEN}Saved API URL: $new_url${NC}" >&2
205
+ echo "$new_url"
206
+ return 0
207
+ fi
208
+
209
+ echo -e "${RED}Error: Cannot connect to $new_url${NC}" >&2
210
+ exit 1
211
+ }
212
+
213
+ # Save result to JSON file
214
+ save_result() {
215
+ local job_id="$1"
216
+ local result_json="$2"
217
+
218
+ ensure_output_dir
219
+ local output_file="${OUTPUT_DIR}/${job_id}.json"
220
+ echo "$result_json" > "$output_file"
221
+ echo -e "${GREEN}Result saved: $output_file${NC}"
222
+ }
223
+
224
+ # Health command
225
+ cmd_health() {
226
+ check_deps
227
+ ensure_config
228
+ local api_url=$(load_api_url)
229
+
230
+ echo "Checking API at: $api_url"
231
+ if check_health "$api_url"; then
232
+ echo -e "${GREEN}Status: OK${NC}"
233
+ curl -s "${api_url}/health"
234
+ echo ""
235
+ else
236
+ echo -e "${RED}Status: FAILED${NC}"
237
+ exit 1
238
+ fi
239
+ }
240
+
241
+ # Config command
242
+ cmd_config() {
243
+ check_deps
244
+ ensure_config
245
+
246
+ local action=""
247
+ local key=""
248
+ local value=""
249
+
250
+ while [[ $# -gt 0 ]]; do
251
+ case $1 in
252
+ --get) action="get"; key="$2"; shift 2 ;;
253
+ --set) action="set"; key="${2:-}"; value="${3:-}"; shift 3 2>/dev/null || break ;;
254
+ --reset) action="reset"; shift ;;
255
+ --list) action="list"; shift ;;
256
+ --check-key) action="check-key"; shift ;;
257
+ *) shift ;;
258
+ esac
259
+ done
260
+
261
+ case "$action" in
262
+ "check-key")
263
+ local api_key=$(get_config "api_key")
264
+ if [ -n "$api_key" ]; then
265
+ echo "api_key: configured"
266
+ else
267
+ echo "api_key: empty"
268
+ fi
269
+ ;;
270
+ "get")
271
+ [ -z "$key" ] && { echo -e "${RED}Error: --get requires KEY${NC}"; exit 1; }
272
+ local result=$(get_config "$key")
273
+ [ -n "$result" ] && echo "$key = $result" || echo "Key not found: $key"
274
+ ;;
275
+ "set")
276
+ [ -z "$key" ] || [ -z "$value" ] && { echo -e "${RED}Error: --set requires KEY VALUE${NC}"; exit 1; }
277
+ set_config "$key" "$value"
278
+ ;;
279
+ "reset")
280
+ echo "$DEFAULT_CONFIG" > "$CONFIG_FILE"
281
+ echo -e "${GREEN}Configuration reset to defaults.${NC}"
282
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
283
+ ;;
284
+ "list")
285
+ echo "Current configuration:"
286
+ jq 'walk(if type == "object" and has("api_key") and (.api_key | length) > 0 then .api_key = "***" else . end)' "$CONFIG_FILE"
287
+ ;;
288
+ *)
289
+ echo "Config file: $CONFIG_FILE"
290
+ echo "Output dir: $OUTPUT_DIR"
291
+ echo "----------------------------------------"
292
+ cat "$CONFIG_FILE"
293
+ echo "----------------------------------------"
294
+ echo ""
295
+ echo "Usage:"
296
+ echo " config --list Show config"
297
+ echo " config --get <key> Get value"
298
+ echo " config --set <key> <val> Set value"
299
+ echo " config --reset Reset to defaults"
300
+ ;;
301
+ esac
302
+ }
303
+
304
+ # Models command
305
+ cmd_models() {
306
+ check_deps
307
+ local api_url=$(ensure_connection)
308
+ local api_key=$(load_api_key)
309
+
310
+ echo "Available Models:"
311
+ echo "----------------------------------------"
312
+ if [ -n "$api_key" ]; then
313
+ curl -s -H "Authorization: Bearer ${api_key}" "${api_url}/v1/models"
314
+ else
315
+ curl -s "${api_url}/v1/models"
316
+ fi
317
+ echo ""
318
+ }
319
+
320
+ # Query job result via /query_result endpoint
321
+ query_job_result() {
322
+ local api_url="$1"
323
+ local job_id="$2"
324
+ local api_key=$(load_api_key)
325
+
326
+ local payload=$(jq -n --arg id "$job_id" '{"task_id_list": [$id]}')
327
+
328
+ if [ -n "$api_key" ]; then
329
+ curl -s -X POST "${api_url}/query_result" \
330
+ -H "Content-Type: application/json; charset=utf-8" \
331
+ -H "Authorization: Bearer ${api_key}" \
332
+ -d "$payload"
333
+ else
334
+ curl -s -X POST "${api_url}/query_result" \
335
+ -H "Content-Type: application/json; charset=utf-8" \
336
+ -d "$payload"
337
+ fi
338
+ }
339
+
340
+ # Parse query_result response to extract status (0=processing, 1=success, 2=failed)
341
+ # Response is wrapped: {"data": [...], "code": 200, ...}
342
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
343
+ parse_query_status() {
344
+ local response="$1"
345
+ local tmp_file=$(mktemp)
346
+ printf '%s' "$response" > "$tmp_file"
347
+ jq -r '.data[0].status // .[0].status // 0' "$tmp_file"
348
+ rm -f "$tmp_file"
349
+ }
350
+
351
+ # Parse result JSON string from query_result response
352
+ # The result field is a JSON string that needs to be parsed
353
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
354
+ parse_query_result() {
355
+ local response="$1"
356
+ local tmp_file=$(mktemp)
357
+ printf '%s' "$response" > "$tmp_file"
358
+ jq -r '.data[0].result // .[0].result // "[]"' "$tmp_file"
359
+ rm -f "$tmp_file"
360
+ }
361
+
362
+ # Extract audio file paths from result (returns newline-separated paths)
363
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
364
+ parse_audio_files() {
365
+ local result="$1"
366
+ local tmp_file=$(mktemp)
367
+ printf '%s' "$result" > "$tmp_file"
368
+ jq -r '.[].file // empty' "$tmp_file" 2>/dev/null
369
+ rm -f "$tmp_file"
370
+ }
371
+
372
+ # Extract metas value from result
373
+ # Uses temp file to avoid jq pipe issues with special characters on Windows
374
+ parse_metas_value() {
375
+ local result="$1"
376
+ local key="$2"
377
+ local tmp_file=$(mktemp)
378
+ printf '%s' "$result" > "$tmp_file"
379
+ jq -r ".[0].metas.$key // .[0].$key // empty" "$tmp_file" 2>/dev/null
380
+ rm -f "$tmp_file"
381
+ }
382
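The three parsers above all walk the same wrapped shape, and the key subtlety is the double decode: `result` is JSON serialized inside a JSON string, so it must be run through jq twice. A hypothetical `/query_result` response makes this explicit:

```shell
# Hypothetical wrapped response: {"data":[{"status":...,"result":"<json string>"}],"code":200}
resp='{"data":[{"status":1,"result":"[{\"file\":\"/v1/audio/demo.mp3\",\"metas\":{\"bpm\":120}}]"}],"code":200}'

# First level: task status (0=processing, 1=success, 2=failed)
printf '%s' "$resp" | jq -r '.data[0].status // .[0].status // 0'   # 1

# Second level: decode the result string, then extract fields
printf '%s' "$resp" | jq -r '.data[0].result' | jq -r '.[].file // empty'       # /v1/audio/demo.mp3
printf '%s' "$resp" | jq -r '.data[0].result' | jq -r '.[0].metas.bpm // empty' # 120
```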
+
383
+ # Status command
384
+ cmd_status() {
385
+ check_deps
386
+ local job_id="$1"
387
+
388
+ [ -z "$job_id" ] && { echo -e "${RED}Error: job_id required${NC}"; echo "Usage: $0 status <job_id>"; exit 1; }
389
+
390
+ local api_url=$(ensure_connection)
391
+ local response=$(query_job_result "$api_url" "$job_id")
392
+
393
+ local status=$(parse_query_status "$response")
394
+ echo "Job ID: $job_id"
395
+
396
+ case "$status" in
397
+ 0)
398
+ echo "Status: processing"
399
+ ;;
400
+ 1)
401
+ echo "Status: succeeded"
402
+ echo ""
403
+ local result_file=$(mktemp)
404
+ parse_query_result "$response" > "$result_file"
405
+
406
+ local bpm=$(jq -r '.[0].metas.bpm // .[0].bpm // empty' "$result_file" 2>/dev/null)
407
+ local keyscale=$(jq -r '.[0].metas.keyscale // .[0].keyscale // empty' "$result_file" 2>/dev/null)
408
+ local duration=$(jq -r '.[0].metas.duration // .[0].duration // empty' "$result_file" 2>/dev/null)
409
+
410
+ echo "Result:"
411
+ [ -n "$bpm" ] && echo " BPM: $bpm"
412
+ [ -n "$keyscale" ] && echo " Key: $keyscale"
413
+ [ -n "$duration" ] && echo " Duration: ${duration}s"
414
+
415
+ # Save and download
416
+ save_result "$job_id" "$response"
417
+ download_audios "$api_url" "$job_id" "$result_file"
418
+ rm -f "$result_file"
419
+ ;;
420
+ 2)
421
+ echo "Status: failed"
422
+ echo ""
423
+ echo -e "${RED}Task failed${NC}"
424
+ ;;
425
+ *)
426
+ echo "Status: unknown ($status)"
427
+ ;;
428
+ esac
429
+ }
430
+
431
+ # Download audio files from result file
432
+ # Usage: download_audios <api_url> <job_id> <result_file>
433
+ download_audios() {
434
+ local api_url="$1"
435
+ local job_id="$2"
436
+ local result_file="$3"
437
+ local api_key=$(load_api_key)
438
+
439
+ ensure_output_dir
440
+
441
+ local audio_format=$(get_config "generation.audio_format")
442
+ [ -z "$audio_format" ] && audio_format="mp3"
443
+
444
+ # Read result file content and extract audio paths using pipe (avoid temp file path issues on Windows)
445
+ local result_content
446
+ result_content=$(cat "$result_file" 2>/dev/null)
447
+
448
+ if [ -z "$result_content" ]; then
449
+ echo -e " ${RED}Error: Result file is empty or cannot be read${NC}"
450
+ return 1
451
+ fi
452
+
453
+ # Extract audio paths using pipe instead of file (better Windows compatibility)
454
+ local audio_paths
455
+ audio_paths=$(echo "$result_content" | jq -r '.[].file // empty' 2>&1)
456
+ local jq_exit_code=$?
457
+
458
+ if [ $jq_exit_code -ne 0 ]; then
459
+ echo -e " ${RED}Error: Failed to parse result JSON${NC}"
460
+ echo -e " ${RED}jq error: $audio_paths${NC}"
461
+ return 1
462
+ fi
463
+
464
+ if [ -z "$audio_paths" ]; then
465
+ echo -e " ${YELLOW}No audio files found in result${NC}"
466
+ return 0
467
+ fi
468
+
469
+ local count=1
470
+ while IFS= read -r audio_path; do
471
+ # Skip empty lines and remove potential Windows carriage return
472
+ audio_path=$(echo "$audio_path" | tr -d '\r')
473
+ if [ -n "$audio_path" ]; then
474
+ local output_file="${OUTPUT_DIR}/${job_id}_${count}.${audio_format}"
475
+ local download_url="${api_url}${audio_path}"
476
+
477
+ echo -e " ${CYAN}Downloading audio $count...${NC}"
478
+ local curl_output
479
+ local curl_exit_code
480
+ if [ -n "$api_key" ]; then
481
+ curl_output=$(curl -s --connect-timeout 10 --max-time 300 \
482
+ -w "%{http_code}" \
483
+ -o "$output_file" \
484
+ -H "Authorization: Bearer ${api_key}" \
485
+ "$download_url" 2>&1)
486
+ curl_exit_code=$?
487
+ else
488
+ curl_output=$(curl -s --connect-timeout 10 --max-time 300 \
489
+ -w "%{http_code}" \
490
+ -o "$output_file" \
491
+ "$download_url" 2>&1)
492
+ curl_exit_code=$?
493
+ fi
494
+
495
+ if [ $curl_exit_code -ne 0 ]; then
496
+ echo -e " ${RED}Failed to download (curl error $curl_exit_code): $download_url${NC}"
497
+ rm -f "$output_file" 2>/dev/null
498
+ elif [ -f "$output_file" ] && [ -s "$output_file" ]; then
499
+ echo -e " ${GREEN}Saved: $output_file${NC}"
500
+ else
501
+ echo -e " ${RED}Failed to download (HTTP $curl_output): $download_url${NC}"
502
+ rm -f "$output_file" 2>/dev/null
503
+ fi
504
+ count=$((count + 1))
505
+ fi
506
+ done <<< "$audio_paths"
507
+ }
508
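The per-line handling inside the download loop is what keeps it working under Windows shells: jq output read via `while IFS= read -r` may carry a trailing `\r` on MSYS, so each path is scrubbed with `tr -d '\r'` before use. A self-contained sketch with fake paths and the curl call replaced by an echo:

```shell
# Simulated jq -r '.[].file' output, with a stray Windows carriage return.
audio_paths=$(printf '/v1/audio/a.mp3\r\n/v1/audio/b.mp3\n')
count=1
while IFS= read -r audio_path; do
  audio_path=$(echo "$audio_path" | tr -d '\r')   # strip CR before building URLs
  if [ -n "$audio_path" ]; then
    echo "would fetch ${audio_path} -> job_${count}.mp3"
    count=$((count + 1))
  fi
done <<< "$audio_paths"
```

The `<<< "$audio_paths"` here-string (rather than a pipe) keeps the loop in the current shell, so `count` survives past the loop.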
+
509
+ # =============================================================================
510
+ # Completion Mode (OpenRouter /v1/chat/completions)
511
+ # =============================================================================
512
+
513
+ # Load api_mode from config (default: native)
514
+ load_api_mode() {
515
+ local mode=$(get_config "api_mode")
516
+ echo "${mode:-native}"
517
+ }
518
+
519
+ # Get model ID from /v1/models endpoint for completion mode
520
+ get_completion_model() {
521
+ local api_url="$1"
522
+ local user_model="$2"
523
+ local api_key=$(load_api_key)
524
+
525
+ # If user specified a model, prefix with acemusic/ if needed
526
+ if [ -n "$user_model" ]; then
527
+ if [[ "$user_model" == */* ]]; then
528
+ echo "$user_model"
529
+ else
530
+ echo "acemusic/${user_model}"
531
+ fi
532
+ return
533
+ fi
534
+
535
+ # Query /v1/models for the first available model
536
+ local response
537
+ if [ -n "$api_key" ]; then
538
+ response=$(curl -s -H "Authorization: Bearer ${api_key}" "${api_url}/v1/models" 2>/dev/null)
539
+ else
540
+ response=$(curl -s "${api_url}/v1/models" 2>/dev/null)
541
+ fi
542
+
543
+ local model_id
544
+ model_id=$(echo "$response" | jq -r '.data[0].id // empty' 2>/dev/null)
545
+ echo "${model_id:-acemusic/acestep-v15-turbo}"
546
+ }
547
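The prefixing branch above is worth isolating: bare model names get the `acemusic/` vendor prefix, while anything already containing a slash passes through untouched. A hypothetical standalone copy (`qualify_model` is an illustrative name):

```shell
qualify_model() {
  local m="$1"
  if [[ "$m" == */* ]]; then
    echo "$m"                 # fully qualified id: keep as-is
  else
    echo "acemusic/${m}"      # bare name: add the vendor prefix
  fi
}

qualify_model acestep-v15-turbo   # -> acemusic/acestep-v15-turbo
qualify_model vendor/custom-x     # -> vendor/custom-x
```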
+
548
+ # Decode base64 audio data URL and save to file
549
+ # Handles cross-platform compatibility (Linux/macOS/Windows MSYS)
550
+ decode_base64_audio() {
551
+ local data_url="$1"
552
+ local output_file="$2"
553
+
554
+ # Strip data URL prefix: data:audio/mpeg;base64,...
555
+ local b64_data="${data_url#data:*;base64,}"
556
+
557
+ local tmp_b64=$(mktemp)
558
+ printf '%s' "$b64_data" > "$tmp_b64"
559
+
560
+ if command -v base64 &> /dev/null; then
561
+ # Linux / macOS / MSYS2
562
+ base64 -d < "$tmp_b64" > "$output_file" 2>/dev/null || \
563
+ base64 -D < "$tmp_b64" > "$output_file" 2>/dev/null || \
564
+ python3 -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null || \
565
+ python -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null
566
+ else
567
+ # Fallback to python
568
+ python3 -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null || \
569
+ python -c "import base64,sys; sys.stdout.buffer.write(base64.b64decode(sys.stdin.read()))" < "$tmp_b64" > "$output_file" 2>/dev/null
570
+ fi
571
+
572
+ local decode_ok=$?
573
+ rm -f "$tmp_b64"
574
+ return $decode_ok
575
+ }
576
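The decode path can be exercised end to end with a tiny payload. The `${data_url#data:*;base64,}` expansion removes the shortest matching prefix, leaving just the base64 body, which then goes through `base64 -d` (the GNU form; the function above also tries BSD `-D` and a Python fallback):

```shell
# Round-trip a small payload through the same prefix-strip + decode steps.
data_url="data:audio/mpeg;base64,$(printf 'RIFF-demo-bytes' | base64)"
b64_data="${data_url#data:*;base64,}"   # shortest-prefix strip, as in the function above
printf '%s' "$b64_data" | base64 -d     # -> RIFF-demo-bytes
echo
```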
+
577
+ # Parse completion response: extract metadata, save audio files
578
+ # Usage: parse_completion_response <response_file> <job_id>
579
+ parse_completion_response() {
580
+ local resp_file="$1"
581
+ local job_id="$2"
582
+
583
+ ensure_output_dir
584
+
585
+ local audio_format=$(get_config "generation.audio_format")
586
+ [ -z "$audio_format" ] && audio_format="mp3"
587
+
588
+ # Check for error
589
+ local finish_reason
590
+ finish_reason=$(jq -r '.choices[0].finish_reason // "stop"' "$resp_file" 2>/dev/null)
591
+ if [ "$finish_reason" = "error" ]; then
592
+ local err_content
593
+ err_content=$(jq -r '.choices[0].message.content // "Unknown error"' "$resp_file" 2>/dev/null)
594
+ echo -e "${RED}Generation failed: $err_content${NC}"
595
+ return 1
596
+ fi
597
+
598
+ # Extract and display text content (metadata + lyrics)
599
+ local content
600
+ content=$(jq -r '.choices[0].message.content // empty' "$resp_file" 2>/dev/null)
601
+ if [ -n "$content" ]; then
602
+ echo "$content"
603
+ echo ""
604
+ fi
605
+
606
+ # Extract and save audio files
607
+ local audio_count
608
+ audio_count=$(jq -r '.choices[0].message.audio | length // 0' "$resp_file" 2>/dev/null)
609
+
610
+ if [ "$audio_count" -gt 0 ] 2>/dev/null; then
611
+ local i=0
612
+ while [ "$i" -lt "$audio_count" ]; do
613
+ local audio_url
614
+ audio_url=$(jq -r ".choices[0].message.audio[$i].audio_url.url // empty" "$resp_file" 2>/dev/null)
615
+
616
+ if [ -n "$audio_url" ]; then
617
+ local output_file="${OUTPUT_DIR}/${job_id}_$((i+1)).${audio_format}"
618
+ echo -e " ${CYAN}Decoding audio $((i+1))...${NC}"
619
+
620
+ if decode_base64_audio "$audio_url" "$output_file"; then
621
+ if [ -f "$output_file" ] && [ -s "$output_file" ]; then
622
+ echo -e " ${GREEN}Saved: $output_file${NC}"
623
+ else
624
+ echo -e " ${RED}Failed to decode audio $((i+1))${NC}"
625
+ rm -f "$output_file" 2>/dev/null
626
+ fi
627
+ else
628
+ echo -e " ${RED}Failed to decode audio $((i+1))${NC}"
629
+ rm -f "$output_file" 2>/dev/null
630
+ fi
631
+ fi
632
+ i=$((i+1))
633
+ done
634
+ else
635
+ echo -e " ${YELLOW}No audio files in response${NC}"
636
+ fi
637
+
638
+ # Save full response JSON (strip base64 audio to keep file small)
639
+ local clean_resp
640
+ clean_resp=$(jq 'del(.choices[].message.audio[].audio_url.url)' "$resp_file" 2>/dev/null)
641
+ if [ -n "$clean_resp" ]; then
642
+ save_result "$job_id" "$clean_resp"
643
+ else
644
+ save_result "$job_id" "$(cat "$resp_file")"
645
+ fi
646
+ }
647
+
648
+ # Send request to /v1/chat/completions and handle response
649
+ # Usage: send_completion_request <api_url> <payload_file> <job_id_var>
650
+ send_completion_request() {
651
+ local api_url="$1"
652
+ local payload_file="$2"
653
+ local api_key=$(load_api_key)
654
+
655
+ local resp_file=$(mktemp)
656
+
657
+ local http_code
658
+ if [ -n "$api_key" ]; then
659
+ http_code=$(curl -s -w "%{http_code}" --connect-timeout 10 --max-time 660 \
660
+ -o "$resp_file" \
661
+ -X POST "${api_url}/v1/chat/completions" \
662
+ -H "Content-Type: application/json; charset=utf-8" \
663
+ -H "Authorization: Bearer ${api_key}" \
664
+ --data-binary "@${payload_file}")
665
+ else
666
+ http_code=$(curl -s -w "%{http_code}" --connect-timeout 10 --max-time 660 \
667
+ -o "$resp_file" \
668
+ -X POST "${api_url}/v1/chat/completions" \
669
+ -H "Content-Type: application/json; charset=utf-8" \
670
+ --data-binary "@${payload_file}")
671
+ fi
672
+
673
+ rm -f "$payload_file"
674
+
675
+ if [ "$http_code" != "200" ]; then
676
+ local err_detail
677
+ err_detail=$(jq -r '.detail // .error.message // empty' "$resp_file" 2>/dev/null)
678
+ echo -e "${RED}Error: HTTP $http_code${NC}"
679
+ [ -n "$err_detail" ] && echo -e "${RED}$err_detail${NC}"
680
+ rm -f "$resp_file"
681
+ return 1
682
+ fi
683
+
684
+ # Generate a job_id from the completion id
685
+ local job_id
686
+ job_id=$(jq -r '.id // empty' "$resp_file" 2>/dev/null)
687
+ [ -z "$job_id" ] && job_id="completion-$(date +%s)"
688
+
689
+ echo ""
690
+ echo -e "${GREEN}Generation completed!${NC}"
691
+ echo ""
692
+
693
+ parse_completion_response "$resp_file" "$job_id"
694
+ rm -f "$resp_file"
695
+
696
+ echo ""
697
+ echo -e "${GREEN}Done! Files saved to: $OUTPUT_DIR${NC}"
698
+ }
699
+
700
+ # Wait for job and download results
701
+ wait_for_job() {
702
+ local api_url="$1"
703
+ local job_id="$2"
704
+
705
+ echo "Job created: $job_id"
706
+ echo "Output: $OUTPUT_DIR"
707
+ echo ""
708
+
709
+ while true; do
710
+ local response=$(query_job_result "$api_url" "$job_id")
711
+ local status=$(parse_query_status "$response")
712
+
713
+ case "$status" in
714
+ 1)
715
+ echo ""
716
+ echo -e "${GREEN}Generation completed!${NC}"
717
+ echo ""
718
+
719
+ local result_file=$(mktemp)
720
+ parse_query_result "$response" > "$result_file"
721
+
722
+ local bpm=$(jq -r '.[0].metas.bpm // .[0].bpm // empty' "$result_file" 2>/dev/null)
723
+ local keyscale=$(jq -r '.[0].metas.keyscale // .[0].keyscale // empty' "$result_file" 2>/dev/null)
724
+ local duration=$(jq -r '.[0].metas.duration // .[0].duration // empty' "$result_file" 2>/dev/null)
725
+
726
+ echo "Metadata:"
727
+ [ -n "$bpm" ] && echo " BPM: $bpm"
728
+ [ -n "$keyscale" ] && echo " Key: $keyscale"
729
+ [ -n "$duration" ] && echo " Duration: ${duration}s"
730
+ echo ""
731
+
732
+ # Save result JSON
733
+ save_result "$job_id" "$response"
734
+
735
+ # Download audio files
736
+ echo "Downloading audio files..."
737
+ download_audios "$api_url" "$job_id" "$result_file"
738
+ rm -f "$result_file"
739
+
740
+ echo ""
741
+ echo -e "${GREEN}Done! Files saved to: $OUTPUT_DIR${NC}"
742
+ return 0
743
+ ;;
744
+ 2)
745
+ echo ""
746
+ echo -e "${RED}Generation failed!${NC}"
747
+
748
+ # Save error result
749
+ save_result "$job_id" "$response"
750
+ return 1
751
+ ;;
752
+ 0)
753
+ printf "\rProcessing... "
754
+ ;;
755
+ *)
756
+ printf "\rWaiting... "
757
+ ;;
758
+ esac
759
+ sleep 5
760
+ done
761
+ }
762
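Stripped of the download and metadata handling, `wait_for_job` is a poll-until-terminal-status loop over the three status codes. A control-flow sketch with the API query stubbed (status stays 0 for two polls, then flips to 1):

```shell
attempts=0
while true; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 3 ]; then status=1; else status=0; fi  # stubbed query_job_result
  case "$status" in
    1) echo "done after $attempts polls"; break ;;
    2) echo "failed"; break ;;
    *) : ;;   # 0 = processing: the real loop prints progress and sleeps 5s here
  esac
done
```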
+
763
+ # Generate command
764
+ cmd_generate() {
765
+ check_deps
766
+ ensure_config
767
+
768
+ local caption="" lyrics="" description="" thinking="" use_format=""
769
+ local no_thinking=false no_format=false no_wait=false
770
+ local model="" language="" steps="" guidance="" seed="" duration="" bpm="" batch=""
771
+
772
+ while [[ $# -gt 0 ]]; do
773
+ case $1 in
774
+ --caption|-c) caption="$2"; shift 2 ;;
775
+ --lyrics|-l) lyrics="$2"; shift 2 ;;
776
+ --description|-d) description="$2"; shift 2 ;;
777
+ --thinking|-t) thinking="true"; shift ;;
778
+ --no-thinking) no_thinking=true; shift ;;
779
+ --use-format) use_format="true"; shift ;;
780
+ --no-format) no_format=true; shift ;;
781
+ --model|-m) model="$2"; shift 2 ;;
782
+ --language|--vocal-language) language="$2"; shift 2 ;;
783
+ --steps) steps="$2"; shift 2 ;;
784
+ --guidance) guidance="$2"; shift 2 ;;
785
+ --seed) seed="$2"; shift 2 ;;
786
+ --duration) duration="$2"; shift 2 ;;
787
+ --bpm) bpm="$2"; shift 2 ;;
788
+ --batch) batch="$2"; shift 2 ;;
789
+ --no-wait) no_wait=true; shift ;;
790
+ *) [ -z "$caption" ] && caption="$1"; shift ;;
791
+ esac
792
+ done
793
+
794
+ # If no caption but has description, use simple mode
795
+ if [ -z "$caption" ] && [ -z "$description" ]; then
796
+ echo -e "${RED}Error: caption or description required${NC}"
797
+ echo "Usage: $0 generate \"Music description\" [options]"
798
+ echo " $0 generate -d \"Simple description\" [options]"
799
+ exit 1
800
+ fi
801
+
802
+ local api_url=$(ensure_connection)
803
+
804
+ # Get defaults
805
+ local def_thinking=$(get_config "generation.thinking")
806
+ local def_format=$(get_config "generation.use_format")
807
+ local def_cot_caption=$(get_config "generation.use_cot_caption")
808
+ local def_cot_language=$(get_config "generation.use_cot_language")
809
+ local def_language=$(get_config "generation.vocal_language")
810
+ local def_audio_format=$(get_config "generation.audio_format")
811
+
812
+ [ -z "$thinking" ] && thinking="${def_thinking:-true}"
813
+ [ -z "$use_format" ] && use_format="${def_format:-true}"
814
+ [ -z "$language" ] && language="${def_language:-en}"
815
+
816
+ [ "$no_thinking" = true ] && thinking="false"
817
+ [ "$no_format" = true ] && use_format="false"
818
+
819
+ # Normalize boolean values for jq --argjson
820
+ thinking=$(normalize_bool "$thinking" "true")
821
+ use_format=$(normalize_bool "$use_format" "true")
822
+ local cot_caption=$(normalize_bool "$def_cot_caption" "true")
823
+ local cot_language=$(normalize_bool "$def_cot_language" "true")
824
+
825
+ # Build payload using jq for proper escaping
826
+ local payload=$(jq -n \
827
+ --arg prompt "$caption" \
828
+ --arg lyrics "${lyrics:-}" \
829
+ --arg sample_query "${description:-}" \
830
+ --argjson thinking "$thinking" \
831
+ --argjson use_format "$use_format" \
832
+ --argjson use_cot_caption "$cot_caption" \
833
+ --argjson use_cot_language "$cot_language" \
834
+ --arg vocal_language "$language" \
835
+ --arg audio_format "${def_audio_format:-mp3}" \
836
+ '{
837
+ prompt: $prompt,
838
+ lyrics: $lyrics,
839
+ sample_query: $sample_query,
840
+ thinking: $thinking,
841
+ use_format: $use_format,
842
+ use_cot_caption: $use_cot_caption,
843
+ use_cot_language: $use_cot_language,
844
+ vocal_language: $vocal_language,
845
+ audio_format: $audio_format,
846
+ use_random_seed: true
847
+ }')
848
+
849
+ # Add optional parameters
850
+ [ -n "$model" ] && payload=$(echo "$payload" | jq --arg v "$model" '. + {model: $v}')
851
+ [ -n "$steps" ] && payload=$(echo "$payload" | jq --argjson v "$steps" '. + {inference_steps: $v}')
852
+ [ -n "$guidance" ] && payload=$(echo "$payload" | jq --argjson v "$guidance" '. + {guidance_scale: $v}')
853
+ [ -n "$seed" ] && payload=$(echo "$payload" | jq --argjson v "$seed" '. + {seed: $v, use_random_seed: false}')
854
+ [ -n "$duration" ] && payload=$(echo "$payload" | jq --argjson v "$duration" '. + {audio_duration: $v}')
855
+ [ -n "$bpm" ] && payload=$(echo "$payload" | jq --argjson v "$bpm" '. + {bpm: $v}')
856
+ [ -n "$batch" ] && payload=$(echo "$payload" | jq --argjson v "$batch" '. + {batch_size: $v}')
857
+
858
+ local api_mode=$(load_api_mode)
859
+
860
+ echo "Generating music..."
861
+ if [ -n "$description" ]; then
862
+ echo " Mode: Simple (description)"
863
+ echo " Description: ${description:0:50}..."
864
+ else
865
+ echo " Mode: Caption"
866
+ echo " Caption: ${caption:0:50}..."
867
+ fi
868
+ echo " Thinking: $thinking, Format: $use_format"
869
+ echo " API: $api_mode"
870
+ echo " Output: $OUTPUT_DIR"
871
+ echo ""
872
+
873
+ if [ "$api_mode" = "completion" ]; then
874
+ # --- Completion mode: /v1/chat/completions ---
875
+ local model_id=$(get_completion_model "$api_url" "$model")
876
+
877
+ # Build message content
878
+ local message_content=""
879
+ local sample_mode=false
880
+ if [ -n "$description" ]; then
881
+ message_content="$description"
882
+ sample_mode=true
883
+ else
884
+ message_content="<prompt>${caption}</prompt>"
885
+ [ -n "$lyrics" ] && message_content="${message_content}<lyrics>${lyrics}</lyrics>"
886
+ fi
887
+
888
+ # Build completion payload
889
+ local payload_c=$(jq -n \
890
+ --arg model "$model_id" \
891
+ --arg content "$message_content" \
892
+ --argjson thinking "$thinking" \
893
+ --argjson use_format "$use_format" \
894
+ --argjson sample_mode "$sample_mode" \
895
+ --argjson use_cot_caption "$cot_caption" \
896
+ --argjson use_cot_language "$cot_language" \
897
+ --arg vocal_language "$language" \
898
+ --arg format "${def_audio_format:-mp3}" \
899
+ '{
900
+ model: $model,
901
+ messages: [{"role": "user", "content": $content}],
902
+ stream: false,
903
+ thinking: $thinking,
904
+ use_format: $use_format,
905
+ sample_mode: $sample_mode,
906
+ use_cot_caption: $use_cot_caption,
907
+ use_cot_language: $use_cot_language,
908
+ audio_config: {
909
+ format: $format,
910
+ vocal_language: $vocal_language
911
+ }
912
+ }')
913
+
914
+ # Add optional parameters to completion payload
915
+ [ -n "$guidance" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$guidance" '. + {guidance_scale: $v}')
916
+ [ -n "$seed" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$seed" '. + {seed: $v}')
917
+ [ -n "$batch" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$batch" '. + {batch_size: $v}')
918
+ [ -n "$duration" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$duration" '.audio_config.duration = $v')
919
+ [ -n "$bpm" ] && payload_c=$(echo "$payload_c" | jq --argjson v "$bpm" '.audio_config.bpm = $v')
920
+
921
+ local temp_payload=$(mktemp)
922
+ printf '%s' "$payload_c" > "$temp_payload"
923
+
924
+ send_completion_request "$api_url" "$temp_payload"
925
+ else
926
+ # --- Native mode: /release_task + polling ---
927
+ local temp_payload=$(mktemp)
928
+ printf '%s' "$payload" > "$temp_payload"
929
+
930
+ local api_key=$(load_api_key)
931
+ local response
932
+ if [ -n "$api_key" ]; then
933
+ response=$(curl -s -X POST "${api_url}/release_task" \
934
+ -H "Content-Type: application/json; charset=utf-8" \
935
+ -H "Authorization: Bearer ${api_key}" \
936
+ --data-binary "@${temp_payload}")
937
+ else
938
+ response=$(curl -s -X POST "${api_url}/release_task" \
939
+ -H "Content-Type: application/json; charset=utf-8" \
940
+ --data-binary "@${temp_payload}")
941
+ fi
942
+
943
+ rm -f "$temp_payload"
944
+
945
+ local job_id=$(echo "$response" | jq -r '.data.task_id // .task_id // empty')
946
+ [ -z "$job_id" ] && { echo -e "${RED}Error: Failed to create job${NC}"; echo "$response"; exit 1; }
947
+
948
+ if [ "$no_wait" = true ]; then
949
+ echo "Job ID: $job_id"
950
+ echo "Use '$0 status $job_id' to check progress and download"
951
+ else
952
+ wait_for_job "$api_url" "$job_id"
953
+ fi
954
+ fi
955
+ }
956
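`cmd_generate` builds its JSON exclusively through jq rather than string interpolation, and this sketch shows why: `--arg` safely escapes arbitrary user text (quotes, ampersands, newlines), while `--argjson` keeps booleans and numbers typed instead of stringifying them. Optional fields are then layered on with `. + {...}`, exactly as the command does:

```shell
payload=$(jq -n \
  --arg prompt 'Pop with "quotes" & ampersands' \
  --argjson thinking true \
  '{prompt: $prompt, thinking: $thinking, use_random_seed: true}')

echo "$payload" | jq -r '.thinking'   # true -- a JSON boolean, not the string "true"

# Layer an optional field on afterwards, mirroring the --seed handling above.
payload=$(echo "$payload" | jq --argjson v 42 '. + {seed: $v, use_random_seed: false}')
echo "$payload" | jq -r '.seed'       # 42
```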
+
957
+ # Random command
958
+ cmd_random() {
959
+ check_deps
960
+ ensure_config
961
+
962
+ local thinking="" no_thinking=false no_wait=false
963
+
964
+ while [[ $# -gt 0 ]]; do
965
+ case $1 in
966
+ --thinking|-t) thinking="true"; shift ;;
967
+ --no-thinking) no_thinking=true; shift ;;
968
+ --no-wait) no_wait=true; shift ;;
969
+ *) shift ;;
970
+ esac
971
+ done
972
+
973
+ local api_url=$(ensure_connection)
974
+
975
+ local def_thinking=$(get_config "generation.thinking")
976
+ [ -z "$thinking" ] && thinking="${def_thinking:-true}"
977
+ [ "$no_thinking" = true ] && thinking="false"
978
+
979
+ # Normalize boolean for jq --argjson
980
+ thinking=$(normalize_bool "$thinking" "true")
981
+
982
+ local api_mode=$(load_api_mode)
983
+
984
+ echo "Generating random music..."
985
+ echo " Thinking: $thinking"
986
+ echo " API: $api_mode"
987
+ echo " Output: $OUTPUT_DIR"
988
+ echo ""
989
+
990
+ if [ "$api_mode" = "completion" ]; then
991
+ # --- Completion mode ---
992
+ local model_id=$(get_completion_model "$api_url" "")
993
+ local def_audio_format=$(get_config "generation.audio_format")
994
+
995
+ local payload_c=$(jq -n \
996
+ --arg model "$model_id" \
997
+ --argjson thinking "$thinking" \
998
+ --arg format "${def_audio_format:-mp3}" \
999
+ '{
1000
+ model: $model,
1001
+ messages: [{"role": "user", "content": "Generate a random song"}],
1002
+ stream: false,
1003
+ sample_mode: true,
1004
+ thinking: $thinking,
1005
+ audio_config: { format: $format }
1006
+ }')
1007
+
1008
+ local temp_payload=$(mktemp)
1009
+ printf '%s' "$payload_c" > "$temp_payload"
1010
+
1011
+ send_completion_request "$api_url" "$temp_payload"
1012
+ else
1013
+ # --- Native mode ---
1014
+ local payload=$(jq -n --argjson thinking "$thinking" '{sample_mode: true, thinking: $thinking}')
1015
+
1016
+ local temp_payload=$(mktemp)
1017
+ printf '%s' "$payload" > "$temp_payload"
1018
+
1019
+ local api_key=$(load_api_key)
1020
+ local response
1021
+ if [ -n "$api_key" ]; then
1022
+ response=$(curl -s -X POST "${api_url}/release_task" \
1023
+ -H "Content-Type: application/json; charset=utf-8" \
1024
+ -H "Authorization: Bearer ${api_key}" \
1025
+ --data-binary "@${temp_payload}")
1026
+ else
1027
+ response=$(curl -s -X POST "${api_url}/release_task" \
1028
+ -H "Content-Type: application/json; charset=utf-8" \
1029
+ --data-binary "@${temp_payload}")
1030
+ fi
1031
+
1032
+ rm -f "$temp_payload"
1033
+
1034
+ local job_id=$(echo "$response" | jq -r '.data.task_id // .task_id // empty')
1035
+ [ -z "$job_id" ] && { echo -e "${RED}Error: Failed to create job${NC}"; echo "$response"; exit 1; }
1036
+
1037
+ if [ "$no_wait" = true ]; then
1038
+ echo "Job ID: $job_id"
1039
+ echo "Use '$0 status $job_id' to check progress and download"
1040
+ else
1041
+ wait_for_job "$api_url" "$job_id"
1042
+ fi
1043
+ fi
1044
+ }
1045
+
1046
+ # Help
1047
+ show_help() {
1048
+ echo "ACE-Step Music Generation CLI"
1049
+ echo ""
1050
+ echo "Requirements: curl, jq"
1051
+ echo ""
1052
+ echo "Usage: $0 <command> [options]"
1053
+ echo ""
1054
+ echo "Commands:"
1055
+ echo " generate Generate music from text"
1056
+ echo " random Generate random music"
1057
+ echo " status Check job status and download results"
1058
+ echo " models List available models"
1059
+ echo " health Check API health"
1060
+ echo " config Manage configuration"
1061
+ echo ""
1062
+ echo "Output:"
1063
+ echo " Results saved to: $OUTPUT_DIR/<job_id>.json"
1064
+ echo " Audio files: $OUTPUT_DIR/<job_id>_1.mp3, ..."
1065
+ echo ""
1066
+ echo "Generate Options:"
1067
+ echo " -c, --caption Music style/genre description (caption mode)"
1068
+ echo " -d, --description Simple description, LM auto-generates caption/lyrics"
1069
+ echo " -l, --lyrics Lyrics text"
1070
+ echo " -t, --thinking Enable thinking mode (default: true)"
1071
+ echo " --no-thinking Disable thinking mode"
1072
+ echo " --no-format Disable format enhancement"
1073
+ echo ""
1074
+ echo "Examples:"
1075
+ echo " $0 generate \"Pop music with guitar\" # Caption mode"
1076
+ echo " $0 generate -d \"A February love song\" # Simple mode (LM generates)"
1077
+ echo " $0 generate -c \"Jazz\" -l \"[Verse] Hello\" # With lyrics"
1078
+ echo " $0 random"
1079
+ echo " $0 status <job_id>"
1080
+ echo " $0 config --set generation.thinking false"
1081
+ }
1082
+
1083
+ # Main
1084
+ case "$1" in
1085
+ generate) shift; cmd_generate "$@" ;;
1086
+ random) shift; cmd_random "$@" ;;
1087
+ status) shift; cmd_status "$@" ;;
1088
+ models) cmd_models ;;
1089
+ health) cmd_health ;;
1090
+ config) shift; cmd_config "$@" ;;
1091
+ help|--help|-h) show_help ;;
1092
+ *) show_help; exit 1 ;;
1093
+ esac
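The `case "$1" in … esac` block above is the whole command router: each subcommand shifts itself off the argument list and hands the rest to its handler. The same pattern in miniature (the `greet` and `health` commands here are hypothetical, not part of the real CLI):

```shell
#!/bin/sh
# Minimal sketch of the subcommand-dispatch pattern used by the CLI above.
# "greet" and "health" are illustrative commands only.
dispatch() {
  case "$1" in
    greet) shift; echo "hello $1" ;;          # consume subcommand, use remaining args
    health) echo "ok" ;;
    *) echo "usage: dispatch <greet|health>" >&2; return 1 ;;
  esac
}

dispatch greet world   # prints "hello world"
dispatch health        # prints "ok"
```

Unknown subcommands fall through to the `*` arm, print usage, and return nonzero, mirroring the `*) show_help; exit 1` arm above.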
.claude/skills/acestep/scripts/config.example.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "api_url": "https://api.acemusic.ai",
3
+ "api_key": "",
4
+ "api_mode": "completion",
5
+ "generation": {
6
+ "thinking": true,
7
+ "use_format": false,
8
+ "use_cot_caption": true,
9
+ "use_cot_language": false,
10
+ "audio_format": "mp3",
11
+ "batch_size": 1,
12
+ "vocal_language": "en"
13
+ }
14
+ }
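A script consuming a config file like this typically lets environment variables override the file's defaults. A minimal sketch of that precedence, assuming a hypothetical `ACESTEP_CLI_API_URL` override variable (the real CLI's lookup order may differ):

```shell
# Sketch: environment variable overrides the config-file default.
# ACESTEP_CLI_API_URL is a hypothetical name for illustration.
config_api_url="https://api.acemusic.ai"           # value as read from the JSON file
api_url="${ACESTEP_CLI_API_URL:-$config_api_url}"  # env wins when set and non-empty
echo "$api_url"
```

With the variable unset this prints the config default; exporting `ACESTEP_CLI_API_URL` before running would switch the endpoint without editing the file.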
.dockerignore ADDED
@@ -0,0 +1,42 @@
1
+ # Reduce build context; models are downloaded at runtime from HuggingFace
2
+ .git
3
+ .gitignore
4
+ .dockerignore
5
+ *.md
6
+ !README.md
7
+
8
+ __pycache__
9
+ *.py[cod]
10
+ *$py.class
11
+ .venv
12
+ venv
13
+ .env
14
+ .env.*
15
+ !.env.example
16
+
17
+ checkpoints/
18
+ gradio_outputs/
19
+ datasets/
20
+ lora_output/
21
+ lokr_output/
22
+ *.log
23
+ .cache
24
+ .pytest_cache
25
+ .ruff_cache
26
+ .mypy_cache
27
+ torchinductor_root/
28
+ PortableGit/
29
+ proxy_config.txt
30
+ *.7z
31
+ .history/
32
+ discord_bot/
33
+ feishu_bot/
34
+ test_*.py
35
+ **/tests/
36
+ **/test_*.py
37
+ playground.ipynb
38
+ issues/
39
+ checkpoints_legacy/
40
+ checkpoints_pack/
41
+ python_embeded/
42
+ acestep/third_parts/vllm/
.editorconfig ADDED
@@ -0,0 +1,16 @@
1
+ root = true
2
+
3
+ [*]
4
+ charset = utf-8
5
+ end_of_line = lf
6
+ insert_final_newline = true
7
+ trim_trailing_whitespace = true
8
+
9
+ [*.{bat,cmd,ps1}]
10
+ end_of_line = crlf
11
+
12
+ [*.png]
13
+ charset = unset
14
+ end_of_line = unset
15
+ insert_final_newline = false
16
+ trim_trailing_whitespace = false
.env.example ADDED
@@ -0,0 +1,78 @@
1
+ # ACE-Step Environment Configuration
2
+ # Copy this file to .env and modify as needed
3
+ #
4
+ # This file is used by:
5
+ # - Python scripts (acestep_v15_pipeline.py, api_server.py, etc.)
6
+ # - Windows launcher (start_gradio_ui.bat)
7
+ # - Linux/macOS launchers (start_gradio_ui.sh, start_gradio_ui_macos.sh)
8
+ #
9
+ # Settings in .env will survive repository updates, unlike hardcoded values
10
+ # in launcher scripts which get overwritten on each update.
11
+
12
+ # ==================== Model Settings ====================
13
+ # DiT model path
14
+ ACESTEP_CONFIG_PATH=acestep-v15-turbo
15
+
16
+ # LM model path (used when LLM is enabled)
17
+ # Available: acestep-5Hz-lm-0.6B, acestep-5Hz-lm-1.7B, acestep-5Hz-lm-4B
18
+ ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-1.7B
19
+
20
+ # Device selection: auto, cuda, cpu, xpu
21
+ ACESTEP_DEVICE=auto
22
+
23
+ # LM backend: vllm (faster) or pt (PyTorch native)
24
+ ACESTEP_LM_BACKEND=vllm
25
+
26
+ # ==================== LLM Initialization ====================
27
+ # Controls whether to initialize the Language Model (LLM/5Hz LM)
28
+ #
29
+ # Flow: GPU detection (full) → ACESTEP_INIT_LLM override → Model loading
30
+ # GPU optimizations (offload, quantization, batch limits) are ALWAYS applied.
31
+ # ACESTEP_INIT_LLM only overrides the "should we try to load LLM" decision.
32
+ #
33
+ # Values:
34
+ # auto (or empty) = Use GPU auto-detection result (recommended)
35
+ # true/1/yes = Force enable LLM after GPU detection (may cause OOM)
36
+ # false/0/no = Force disable LLM (pure DiT mode, faster)
37
+ #
38
+ # Examples:
39
+ # ACESTEP_INIT_LLM=auto # Let GPU detection decide (recommended)
40
+ # ACESTEP_INIT_LLM= # Same as auto
41
+ # ACESTEP_INIT_LLM=true # Force enable even on low VRAM GPU
42
+ # ACESTEP_INIT_LLM=false # Force disable for pure DiT mode
43
+ #
44
+ # When LLM is disabled, these features are unavailable:
45
+ # - Thinking mode (thinking=true)
46
+ # - Chain-of-Thought caption/language detection
47
+ # - Sample mode (generate from description)
48
+ # - Format mode (LLM-enhanced input)
49
+ #
50
+ # Default: auto (based on GPU VRAM detection)
51
+ ACESTEP_INIT_LLM=auto
52
+
53
+ # ==================== Download Settings ====================
54
+ # Preferred download source: auto, huggingface, modelscope
55
+ # ACESTEP_DOWNLOAD_SOURCE=auto
56
+
57
+ # ==================== API Server Settings ====================
58
+ # API key for authentication (optional)
59
+ # ACESTEP_API_KEY=sk-your-secret-key
60
+
61
+ # ==================== Gradio UI Settings ====================
62
+ # Server port (default: 7860)
63
+ # PORT=7860
64
+
65
+ # Server name/host (default: 127.0.0.1 for local only, 0.0.0.0 for network access)
66
+ # SERVER_NAME=127.0.0.1
67
+
68
+ # UI language: en, zh, he, ja (default: en)
69
+ # LANGUAGE=en
70
+
71
+ # Default batch size for generation (1 to GPU-dependent max)
72
+ # When not specified, defaults to min(2, GPU_max)
73
+ # ACESTEP_BATCH_SIZE=2
74
+
75
+ # ==================== Startup Settings ====================
76
+ # Skip model loading at startup (models will be lazy-loaded on first request)
77
+ # Set to true to start server quickly without loading models
78
+ # ACESTEP_NO_INIT=false
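The launchers read this file at startup. One common way a shell launcher sources a `KEY=VALUE` file is `set -a`, which exports every variable assigned while the file is sourced; this is a sketch only, and the actual launchers may parse quoting and comments differently:

```shell
# Minimal sketch of loading a .env-style file in a launcher script.
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
# comments and blank lines are harmless when sourcing

ACESTEP_DEVICE=auto
ACESTEP_LM_BACKEND=vllm
EOF

set -a            # auto-export every variable assigned below
. "$env_file"     # source the file in the current shell
set +a
rm -f "$env_file"

echo "device=$ACESTEP_DEVICE backend=$ACESTEP_LM_BACKEND"
```

Because the file is sourced rather than parsed, values are plain shell assignments, which is why the comments above recommend `.env` over hardcoding values in the launcher scripts themselves.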
.github/ISSUE_TEMPLATE/bug_report.md ADDED
@@ -0,0 +1,38 @@
1
+ ---
2
+ name: Bug report
3
+ about: Create a report to help us improve
4
+ title: ''
5
+ labels: ''
6
+ assignees: ''
7
+
8
+ ---
9
+
10
+ **Describe the bug**
11
+ A clear and concise description of what the bug is.
12
+
13
+ **To Reproduce**
14
+ Steps to reproduce the behavior:
15
+ 1. Go to '...'
16
+ 2. Click on '....'
17
+ 3. Scroll down to '....'
18
+ 4. See error
19
+
20
+ **Expected behavior**
21
+ A clear and concise description of what you expected to happen.
22
+
23
+ **Screenshots**
24
+ If applicable, add screenshots to help explain your problem.
25
+
26
+ **Desktop (please complete the following information):**
27
+ - OS: [e.g. iOS]
28
+ - Browser [e.g. chrome, safari]
29
+ - Version [e.g. 22]
30
+
31
+ **Smartphone (please complete the following information):**
32
+ - Device: [e.g. iPhone6]
33
+ - OS: [e.g. iOS8.1]
34
+ - Browser [e.g. stock browser, safari]
35
+ - Version [e.g. 22]
36
+
37
+ **Additional context**
38
+ Add any other context about the problem here.
.github/ISSUE_TEMPLATE/feature_request.md ADDED
@@ -0,0 +1,20 @@
1
+ ---
2
+ name: Feature request
3
+ about: Suggest an idea for this project
4
+ title: ''
5
+ labels: ''
6
+ assignees: ''
7
+
8
+ ---
9
+
10
+ **Is your feature request related to a problem? Please describe.**
11
+ A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12
+
13
+ **Describe the solution you'd like**
14
+ A clear and concise description of what you want to happen.
15
+
16
+ **Describe alternatives you've considered**
17
+ A clear and concise description of any alternative solutions or features you've considered.
18
+
19
+ **Additional context**
20
+ Add any other context or screenshots about the feature request here.
.github/copilot-instructions.md ADDED
@@ -0,0 +1,67 @@
1
+ # ACE-Step 1.5 - GitHub Copilot Instructions
2
+
3
+ ## Project Overview
4
+
5
+ ACE-Step 1.5 is an open-source music foundation model combining a Language Model (LM) as a planner with a Diffusion Transformer (DiT) for audio synthesis. It generates commercial-grade music on consumer hardware (< 4GB VRAM).
6
+
7
+ ## Tech Stack
8
+
9
+ - **Python 3.11-3.12** (ROCm on Windows requires 3.12; other platforms use 3.11)
10
+ - **PyTorch 2.7+** with CUDA 12.8 (Windows/Linux), MPS (macOS ARM64)
11
+ - **Transformers 4.51.0-4.57.x** for LLM inference
12
+ - **Diffusers** for diffusion models
13
+ - **Gradio 6.2.0** for web UI
14
+ - **FastAPI + Uvicorn** for REST API server
15
+ - **uv** for dependency management
16
+ - **MLX** (Apple Silicon native acceleration, macOS ARM64)
17
+ - **nano-vllm** (optimized LLM inference, non-macOS ARM64)
18
+
19
+ ## Multi-Platform Support
20
+
21
+ **CRITICAL**: Supports CUDA, ROCm, Intel XPU, MPS, MLX, and CPU. When fixing bugs or adding features:
22
+ - **DO NOT alter non-target platform paths** unless explicitly required
23
+ - Changes to CUDA code should not affect MPS/XPU/CPU paths
24
+ - Use `gpu_config.py` for hardware detection and configuration
25
+
26
+ ## Code Organization
27
+
28
+ ### Main Entry Points
29
+ - `acestep/acestep_v15_pipeline.py` - Gradio UI pipeline
30
+ - `acestep/api_server.py` - REST API server
31
+ - `cli.py` - Command-line interface
32
+ - `acestep/model_downloader.py` - Model downloader
33
+
34
+ ### Core Modules
35
+ - `acestep/handler.py` - Audio generation handler (AceStepHandler)
36
+ - `acestep/llm_inference.py` - LLM handler for text processing
37
+ - `acestep/inference.py` - Generation logic and parameters
38
+ - `acestep/gpu_config.py` - Hardware detection and GPU configuration
39
+ - `acestep/audio_utils.py` - Audio processing utilities
40
+ - `acestep/constants.py` - Global constants
41
+
42
+ ### UI & Internationalization
43
+ - `acestep/gradio_ui/` - Gradio interface components
44
+ - `acestep/gradio_ui/i18n.py` - i18n system (50+ languages)
45
+ - All user-facing strings must use i18n translation keys
46
+
47
+ ### Training
48
+ - `acestep/training/` - LoRA training pipeline
49
+ - `acestep/dataset/` - Dataset handling
50
+
51
+ ## Key Conventions
52
+
53
+ - **Python style**: PEP 8, 4 spaces, double quotes for strings
54
+ - **Naming**: `snake_case` functions/variables, `PascalCase` classes, `UPPER_SNAKE_CASE` constants
55
+ - **Logging**: Use `loguru` logger (not `print()` except CLI output)
56
+ - **Dependencies**: Use `uv add <package>` to add to `pyproject.toml`
57
+
58
+ ## Performance
59
+
60
+ - Target: 4GB VRAM - minimize memory allocations
61
+ - Lazy load models when needed
62
+ - Batch operations supported (up to 8 songs)
63
+
64
+ ## Additional Resources
65
+
66
+ - **AGENTS.md**: Detailed guidance for AI coding agents
67
+ - **CONTRIBUTING.md**: Contribution workflow and guidelines
.github/workflows/codeql.yml ADDED
@@ -0,0 +1,99 @@
1
+ # For most projects, this workflow file will not need changing; you simply need
2
+ # to commit it to your repository.
3
+ #
4
+ # You may wish to alter this file to override the set of languages analyzed,
5
+ # or to provide custom queries or build logic.
6
+ #
7
+ # ******** NOTE ********
8
+ # We have attempted to detect the languages in your repository. Please check
9
+ # the `language` matrix defined below to confirm you have the correct set of
10
+ # supported CodeQL languages.
11
+ #
12
+ name: "CodeQL Advanced"
13
+
14
+ on:
15
+ push:
16
+ branches: [ "main" ]
17
+ pull_request:
18
+ branches: [ "main" ]
19
+ schedule:
20
+ - cron: '26 2 * * 5'
21
+
22
+ jobs:
23
+ analyze:
24
+ name: Analyze (${{ matrix.language }})
25
+ # Runner size impacts CodeQL analysis time. To learn more, please see:
26
+ # - https://gh.io/recommended-hardware-resources-for-running-codeql
27
+ # - https://gh.io/supported-runners-and-hardware-resources
28
+ # - https://gh.io/using-larger-runners (GitHub.com only)
29
+ # Consider using larger runners or machines with greater resources for possible analysis time improvements.
30
+ runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}
31
+ permissions:
32
+ # required for all workflows
33
+ security-events: write
34
+
35
+ # required to fetch internal or private CodeQL packs
36
+ packages: read
37
+
38
+ # only required for workflows in private repositories
39
+ actions: read
40
+ contents: read
41
+
42
+ strategy:
43
+ fail-fast: false
44
+ matrix:
45
+ include:
46
+ - language: python
47
+ build-mode: none
48
+ # CodeQL supports the following values keywords for 'language': 'actions', 'c-cpp', 'csharp', 'go', 'java-kotlin', 'javascript-typescript', 'python', 'ruby', 'rust', 'swift'
49
+ # Use `c-cpp` to analyze code written in C, C++ or both
50
+ # Use 'java-kotlin' to analyze code written in Java, Kotlin or both
51
+ # Use 'javascript-typescript' to analyze code written in JavaScript, TypeScript or both
52
+ # To learn more about changing the languages that are analyzed or customizing the build mode for your analysis,
53
+ # see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning.
54
+ # If you are analyzing a compiled language, you can modify the 'build-mode' for that language to customize how
55
+ # your codebase is analyzed, see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/codeql-code-scanning-for-compiled-languages
56
+ steps:
57
+ - name: Checkout repository
58
+ uses: actions/checkout@v4
59
+
60
+ # Add any setup steps before running the `github/codeql-action/init` action.
61
+ # This includes steps like installing compilers or runtimes (`actions/setup-node`
62
+ # or others). This is typically only required for manual builds.
63
+ # - name: Setup runtime (example)
64
+ # uses: actions/setup-example@v1
65
+
66
+ # Initializes the CodeQL tools for scanning.
67
+ - name: Initialize CodeQL
68
+ uses: github/codeql-action/init@v4
69
+ with:
70
+ languages: ${{ matrix.language }}
71
+ build-mode: ${{ matrix.build-mode }}
72
+ # If you wish to specify custom queries, you can do so here or in a config file.
73
+ # By default, queries listed here will override any specified in a config file.
74
+ # Prefix the list here with "+" to use these queries and those in the config file.
75
+
76
+ # For more details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
77
+ # queries: security-extended,security-and-quality
78
+
79
+ # If the analyze step fails for one of the languages you are analyzing with
80
+ # "We were unable to automatically build your code", modify the matrix above
81
+ # to set the build mode to "manual" for that language. Then modify this step
82
+ # to build your code.
83
+ # ℹ️ Command-line programs to run using the OS shell.
84
+ # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun
85
+ - name: Run manual build steps
86
+ if: matrix.build-mode == 'manual'
87
+ shell: bash
88
+ run: |
89
+ echo 'If you are using a "manual" build mode for one or more of the' \
90
+ 'languages you are analyzing, replace this with the commands to build' \
91
+ 'your code, for example:'
92
+ echo ' make bootstrap'
93
+ echo ' make release'
94
+ exit 1
95
+
96
+ - name: Perform CodeQL Analysis
97
+ uses: github/codeql-action/analyze@v4
98
+ with:
99
+ category: "/language:${{matrix.language}}"
.gitignore ADDED
@@ -0,0 +1,250 @@
1
+ # HF Spaces reject binaries in git
2
+ assets/*.png
3
+ assets/*.gif
4
+ acestep/third_parts/nano-vllm/assets/
5
+
6
+ # Exclude potentially copyrighted training data
7
+ *_lyrics.txt
8
+ *.mp3
9
+ AlbumArt*.jpg
10
+ Folder.jpg
11
+
12
+
13
+ data/
14
+ *.mp3
15
+ *.wav
16
+
17
+ # Byte-compiled / optimized / DLL files
18
+ __pycache__/
19
+ *.py[codz]
20
+ *$py.class
21
+
22
+ # C extensions
23
+ *.so
24
+
25
+ # Distribution / packaging
26
+ .Python
27
+ build/
28
+ develop-eggs/
29
+ dist/
30
+ downloads/
31
+ eggs/
32
+ .eggs/
33
+ lib/
34
+ lib64/
35
+ parts/
36
+ sdist/
37
+ var/
38
+ wheels/
39
+ share/python-wheels/
40
+ *.egg-info/
41
+ .installed.cfg
42
+ *.egg
43
+ MANIFEST
44
+
45
+ # PyInstaller
46
+ # Usually these files are written by a python script from a template
47
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
48
+ *.manifest
49
+ *.spec
50
+
51
+ # Installer logs
52
+ pip-log.txt
53
+ pip-delete-this-directory.txt
54
+
55
+ # Unit test / coverage reports
56
+ htmlcov/
57
+ .tox/
58
+ .nox/
59
+ .coverage
60
+ .coverage.*
61
+ .cache
62
+ nosetests.xml
63
+ coverage.xml
64
+ *.cover
65
+ *.py.cover
66
+ .hypothesis/
67
+ .pytest_cache/
68
+ cover/
69
+
70
+ # Translations
71
+ *.mo
72
+ *.pot
73
+
74
+ # Django stuff:
75
+ *.log
76
+ local_settings.py
77
+ db.sqlite3
78
+ db.sqlite3-journal
79
+
80
+ # Flask stuff:
81
+ instance/
82
+ .webassets-cache
83
+
84
+ # Scrapy stuff:
85
+ .scrapy
86
+
87
+ # Sphinx documentation
88
+ docs/_build/
89
+
90
+ # PyBuilder
91
+ .pybuilder/
92
+ target/
93
+
94
+ # Jupyter Notebook
95
+ .ipynb_checkpoints
96
+
97
+ # IPython
98
+ profile_default/
99
+ ipython_config.py
100
+
101
+ # pyenv
102
+ # For a library or package, you might want to ignore these files since the code is
103
+ # intended to run in multiple environments; otherwise, check them in:
104
+ # .python-version
105
+
106
+ # pipenv
107
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
108
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
109
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
110
+ # install all needed dependencies.
111
+ #Pipfile.lock
112
+
113
+ # UV
114
+ # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
115
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
116
+ # commonly ignored for libraries.
117
+ uv.lock
118
+
119
+ # poetry
120
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
121
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
122
+ # commonly ignored for libraries.
123
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
124
+ #poetry.lock
125
+ #poetry.toml
126
+
127
+ # pdm
128
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
129
+ # pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
130
+ # https://pdm-project.org/en/latest/usage/project/#working-with-version-control
131
+ #pdm.lock
132
+ #pdm.toml
133
+ .pdm-python
134
+ .pdm-build/
135
+
136
+ # pixi
137
+ # Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
138
+ #pixi.lock
139
+ # Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
140
+ # in the .venv directory. It is recommended not to include this directory in version control.
141
+ .pixi
142
+
143
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
144
+ __pypackages__/
145
+
146
+ # Celery stuff
147
+ celerybeat-schedule
148
+ celerybeat.pid
149
+
150
+ # SageMath parsed files
151
+ *.sage.py
152
+
153
+ # Environments
154
+ .env
155
+ .envrc
156
+ .venv
157
+ env/
158
+ venv/
159
+ ENV/
160
+ env.bak/
161
+ venv.bak/
162
+
163
+ # Spyder project settings
164
+ .spyderproject
165
+ .spyproject
166
+
167
+ # Rope project settings
168
+ .ropeproject
169
+
170
+ # mkdocs documentation
171
+ /site
172
+
173
+ # mypy
174
+ .mypy_cache/
175
+ .dmypy.json
176
+ dmypy.json
177
+
178
+ # Pyre type checker
179
+ .pyre/
180
+
181
+ # pytype static type analyzer
182
+ .pytype/
183
+
184
+ # Cython debug symbols
185
+ cython_debug/
186
+
187
+ # PyCharm
188
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
189
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
190
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
191
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
192
+ #.idea/
193
+
194
+ # Abstra
195
+ # Abstra is an AI-powered process automation framework.
196
+ # Ignore directories containing user credentials, local state, and settings.
197
+ # Learn more at https://abstra.io/docs
198
+ .abstra/
199
+
200
+ # Visual Studio Code
201
+ # Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
202
+ # that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
203
+ # and can be added to the global gitignore or merged into this file. However, if you prefer,
204
+ # you could uncomment the following to ignore the entire vscode folder
205
+ # .vscode/
206
+
207
+ # Ruff stuff:
208
+ .ruff_cache/
209
+
210
+ # PyPI configuration file
211
+ .pypirc
212
+
213
+ # Cursor
214
+ # Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
215
+ # exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
216
+ # refer to https://docs.cursor.com/context/ignore-files
217
+ .cursorignore
218
+ .cursorindexingignore
219
+
220
+ # Marimo
221
+ marimo/_static/
222
+ marimo/_lsp/
223
+ __marimo__/
224
+ tests/
225
+ checkpoints/
226
+ playground.ipynb
227
+ .history/
228
+ upload_checkpoints.sh
229
+ checkpoints.7z
230
+ README_old.md
231
+ discord_bot/
232
+ feishu_bot/
233
+ tmp*
234
+ torchinductor_root/
235
+ scripts/*.pyc
236
+ scripts/__pycache__/
237
+ !scripts/check_gpu.py
238
+ !scripts/prepare_vae_calibration_data.py
239
+ checkpoints_legacy/
240
+ lora_output/
241
+ datasets/
242
+ python_embeded/
243
+ checkpoints_pack/
244
+ issues/
245
+ PortableGit/
246
+ proxy_config.txt
247
+ gradio_outputs/
248
+ acestep/third_parts/vllm/
249
+ test_lora_scale_fix.py
250
+ lokr_output/
AGENTS.md ADDED
@@ -0,0 +1,96 @@
1
+ # AGENTS.md
2
+
3
+ Guidance for AI coding agents working in `ace-step/ACE-Step-1.5`.
4
+
5
+ This document is aligned with the intent from:
6
+ - Discussion #408: functional decomposition to reduce risk from large mixed-responsibility files.
7
+ - Discussion #365: low-risk contribution workflow, minimal scope, and review rigor.
8
+
9
+ ## Primary Objectives
10
+
11
+ 1. Keep changes safe and reviewable.
12
+ 2. Prefer small, maintainable, decomposed modules.
13
+ 3. Preserve behavior outside the target fix.
14
+ 4. Validate with focused Python unit tests.
15
+
16
+ ## Scope and Change Control (Required)
17
+
18
+ - Solve one problem per task/PR.
19
+ - Keep edits minimal: touch only files/functions required for the requested change.
20
+ - Do not make drive-by refactors, formatting sweeps, or opportunistic cleanups.
21
+ - Do not alter non-target hardware/runtime paths (CPU/CUDA/MPS/XPU) unless required by the task.
22
+ - If any cross-path change is necessary, isolate it and justify it in the PR notes.
23
+ - Preserve existing public interfaces unless the task explicitly requires an interface change.
24
+
25
+ ## Decomposition and Module Size Policy
26
+
27
+ - Prefer single-responsibility modules with clear boundaries.
28
+ - Target module size:
29
+ - Optimal: `<= 150` LOC @coderabbit: Exceeding this limit is NIGGLE, not hard cap or major issue.
30
+ - Hard cap: `200` LOC
31
+ - Function decomposition rules:
32
+ - Do one thing at a time; if a function description naturally contains "and", split it.
33
+ - Split by responsibility, not by convenience.
34
+ - Keep data flow explicit (`data in, data out`); side effects must be obvious and deliberate.
35
+ - Push decisions up and push work down (orchestration at higher layers, execution details in lower layers).
36
+ - The call graph should read clearly from top-level orchestration to leaf operations.
37
+ - If a module would exceed `200` LOC:
38
+ - Split by responsibility before merging, or
39
+ - Add a short justification in PR notes and include a concrete follow-up split plan.
40
+ - Keep orchestrator/facade modules thin. Move logic into focused helpers/services.
41
+ - Preserve stable facade imports when splitting large files so external callers are not broken.
42
+
43
+ ## Python Unit Testing Expectations
44
+
45
+ - Add or update tests for every behavior change and bug fix.
46
+ - Match repository conventions:
47
+ - Use `unittest`-style tests.
48
+ - Name test files as `*_test.py` or `test_*.py`.
49
+ - Keep tests deterministic, fast, and scoped to changed behavior.
50
+ - Use mocks/fakes for GPU, filesystem, network, and external services where possible.
51
+ - If a change requires mocking a large portion of the system to test one unit, treat that as a decomposition smell and refactor boundaries.
52
+ - Include at least:
53
+ - One success-path test.
54
+ - One regression/edge-case test for the bug being fixed.
55
+ - One non-target behavior check when relevant.
56
+ - Run targeted tests locally before submitting.
57
+
58
+ ## Feature Gating and WIP Safety
59
+
60
+ - Do not expose unfinished or non-functional user-facing flows by default.
61
+ - Gate WIP or unstable UI/API paths behind explicit feature/release flags.
62
+ - Keep default behavior stable; "coming soon" paths must not appear as usable functionality unless they are operational and tested.
63
+
64
+ ## Python Coding Best Practices
65
+
66
+ - Use explicit, readable code over clever shortcuts.
67
+ - Docstrings are mandatory for all new or modified Python modules, classes, and functions.
68
+ - Docstrings must be concise and include purpose plus key inputs/outputs (and raised exceptions when relevant).
69
+ - Add type hints for new/modified functions when practical.
70
+ - Keep functions focused and short; extract helpers instead of nesting complexity.
71
+ - Use clear names that describe behavior, not implementation trivia.
72
+ - Prefer pure functions for logic-heavy paths where possible.
73
+ - Avoid duplicated logic, but do not introduce broad abstractions too early; prefer simple local duplication over unstable premature abstraction.
74
+ - Handle errors explicitly; avoid bare `except`.
75
+ - Keep logging actionable; avoid noisy logs and `print` debugging in committed code.
76
+ - Avoid hidden state and unintended side effects.
77
+ - Write comments only where intent is non-obvious; keep comments concise and technical.
78
+
79
+ ## AI-Agent Workflow (Recommended)
80
+
81
+ 1. Understand the task and define explicit in-scope/out-of-scope boundaries.
82
+ 2. Propose a minimal patch plan before editing.
83
+ 3. Implement the smallest viable change.
84
+ 4. Add/update focused tests.
85
+ 5. Self-review only changed hunks for regressions and scope creep.
86
+ 6. Summarize risk, validation, and non-target impact in PR notes.
87
+
88
+ ## PR Readiness Checklist
89
+
90
+ - [ ] Change is tightly scoped to one problem.
91
+ - [ ] Non-target paths are unchanged, or changes are explicitly justified.
92
+ - [ ] New/updated tests cover changed behavior and edge cases.
93
+ - [ ] No unrelated refactor/formatting churn.
94
+ - [ ] Required docstrings are present for all new/modified modules, classes, and functions.
95
+ - [ ] WIP/unstable functionality is feature-flagged and not exposed as default-ready behavior.
96
+ - [ ] Module LOC policy is met (`<=150` target, `<=200` hard cap or justified exception).
CONTRIBUTING.md ADDED
@@ -0,0 +1,175 @@
1
+ Hopefully this will provide a simple, easy to understand guide to making safe contributions to the project, happy coding!
2
+ ---
3
+
4
+ ## Why This Matters
5
+
6
+ This project supports **many hardware and runtime combinations**.
7
+ A change that works perfectly on one setup can unintentionally break another if scope is not tightly controlled.
8
+
9
+ The project has kind of gone viral, and has thousands of users, amature, semi professional and professional, technical and none technical, it is important that Ace-Step has reliable builds to maintain user trust and engagement.
10
+
11
+ Recent PR patterns have shown avoidable regressions, for example:
12
+
13
+ - Fixes that changed behaviour outside the intended target path
14
+ - Hardware-specific assumptions leaking into general code paths
15
+ - String / status handling changes that broke downstream logic
16
+ - Missing or weak review before merge
17
+
18
+ The goal here is **not blame**.
19
+ The goal is **predictable, low-risk contributions** that maintainers can trust and merge with confidence.
20
+
21
+ ---
22
+
23
+ ## Core Principles for Contributors
24
+
25
+ ### Solve One Problem at a Time
26
+ - Keep each PR focused on **a single bug or feature**.
27
+ - Do **not** mix refactors, formatting, and behaviour changes unless absolutely required.
28
+
29
+ ### Minimize Blast Radius
30
+ - Touch **only** the files and functions required for the fix.
31
+ - Avoid “drive-by improvements” in unrelated code.
32
+
33
+ ### Preserve Non-Target Platforms
34
+ - If fixing **CUDA behaviour**, do not change **CPU / MPS / XPU** paths unless needed.
35
+ - Explicitly state **“non-target platforms unchanged”** in the PR notes — and verify it.
36
+
37
+ ### Prove the Change
38
+ - Add or run **targeted checks** for the affected path.
39
+ - Include a short **regression checklist** in the PR description.
40
+
41
+ ### Be Explicit About Risk
42
+ - Call out edge cases and trade-offs up front.
43
+ - If uncertain, say so and ask maintainers or experienced contributors for preferred direction.
44
+
45
+ Clarity beats confidence.
46
+
47
+ ---
48
+ ## AI Prompt Guardrails for Multi-Platform Projects
49
+
50
+ Tell your coding agent explicitly:
51
+
52
+ - Ask for a proposal and plan before making code changes.
53
+ - Make only the **minimum required changes** for the target issue.
54
+ - Do **not** refactor unrelated code.
55
+ - Do **not** alter non-target hardware/runtime paths unless required.
56
+ - If a cross-platform change is necessary, **isolate and justify it explicitly**.
57
+ - Preserve existing behaviour and interfaces unless the bug fix requires change.
58
+
59
+ These guardrails dramatically reduce accidental regressions from broad AI edits.
60
+
61
+ ---
62
+
63
+ ## Recommended AI-Assisted Workflow (Copilot / CodePilot / Codex)
64
+
65
+ ### Step 1: Commit-Scoped Review (First Pass)
66
+ Once you feel work is complete, and whatever manual or automated testing passes, commit your work to your local project. Note the commit number, or ask your agent to provide the number for your latest commit.
67
+
68
+ **Use a different agent to review your work than was used to produce the work**
69
+
70
+ If you use Claud or OpenAI codex, use your free Copilot tokens in VScode to get a Copilot review. If in doubt, ask your main agent to formulate a prompt for the review agent. It will 'know' what it has worked on and can suggest appropriate focus areas for the review agent.
71
+
72
+ Ask the agent to review **only your commit diff**, not the whole repo.
73
+
74
+ Prompt example:
75
+
76
+ Review commit <sha> only.
77
+ Focus on regressions, behaviour changes, and missing tests.
78
+ Ignore pre-existing issues outside changed hunks.
79
+ Output findings by severity with file/line references.
80
+
81
+ Fix the issues raised by the review, then rerun the review until only trivial, non-breaking issues remain. This may take several iterations before the commit is clean, but watch that the cycle does not expand scope beyond what the primary fix requires.
82
+ ---
83
+
84
+ ### Step 2: Validate Findings
85
+
86
+ Classify each finding as:
87
+
88
+ - **Accept** — real issue introduced or exposed by your change
89
+ - **Rebut** — incorrect or out-of-scope concern
90
+ - **Pre-existing** — not introduced by this PR (note separately)
91
+
92
+ ---
93
+
94
+ ### Step 3: Apply Minimal Fixes
95
+ - Fix **accepted** issues with the **smallest possible patch**.
96
+ - Do **not** broaden scope or refactor opportunistically.
97
+
98
+ ---
99
+
100
+ ### Step 4: PR-Scoped Review (Second Pass)
101
+
102
+ Run the review on the **entire PR diff**: all changes together, but nothing outside them.
103
+
104
+ Prompt example:
105
+
106
+ Review PR diff only (base <base>, head <head>),
107
+ (or alternatively, "treat commit a/b/c... as a whole.")
108
+ Prioritize regression risk across hardware paths.
109
+ Verify unchanged behaviour on non-target platforms.
110
+ Flag only issues in changed code.
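The PR-scoped diff itself can be produced with Git's three-dot syntax, which compares the head against the merge base (a sketch; `main` and `my-feature` are placeholder branch names):

```shell
# Everything the PR adds on top of the base branch, and nothing else.
git diff --stat "main...my-feature"
git diff "main...my-feature" > pr.diff
```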
111
+
112
+ ---
113
+
114
+ ### Step 5: Write Reviewer Responses
115
+
116
+ For each reviewer comment:
117
+
118
+ - Quote the concern
119
+ - Respond in one line
120
+ - Mark disposition clearly
121
+ - Link to fix if applicable
122
+
123
+ ---
124
+
125
+ ## Review / Accept / Rebut / Fix Cycle (Practical Template)
126
+
127
+ Use this structure in your notes:
128
+
129
+ Comment: <reviewer concern>
130
+ Disposition: Accepted | Rebutted | Pre-existing
131
+ Response: <one-line rationale>
132
+ Action: <commit / file / line> or No code change
133
+
134
+ This keeps discussion **objective, fast, and easy to follow**.
135
+
136
+ ---
137
+
138
+ ## PR Description Template (Recommended)
139
+
140
+ ### Summary
141
+ - What bug or feature is addressed
142
+ - Why this change is needed
143
+
144
+ ### Scope
145
+ - Files changed
146
+ - What is explicitly **out of scope**
147
+
148
+ ### Risk and Compatibility
149
+ - Target platform / path
150
+ - Confirmation that **non-target paths are unchanged**
151
+ (or describe exactly what changed and why)
152
+
153
+ ### Regression Checks
154
+ - Checks run (manual and/or automated)
155
+ - Key scenarios validated
156
+
157
+ ### Reviewer Notes
158
+ - Known pre-existing issues not addressed
159
+ - Follow-up items (if any)
160
+
161
+ Your PR description should look something like [this](https://github.com/ace-step/ACE-Step-1.5/pull/309), demonstrating the care and rigor the author applied before opening the PR. If a review bot (CodeRabbit/Copilot) leaves multiple rounds of comments on your PR, it's probably a good idea to close the PR, fix the issues it raised, and resubmit.
162
+
163
+ ---
164
+
165
+ Maintainers are balancing **correctness, stability, and review bandwidth**.
166
+
167
+ PRs that are:
168
+ - tightly scoped
169
+ - clearly explained
170
+ - minimally risky
171
+ - easy to reason about
172
+
173
+ are **much more likely to be reviewed and merged quickly**.
174
+
175
+ Thanks for helping keep the project stable and enjoyable to work on.
Dockerfile ADDED
@@ -0,0 +1,28 @@
1
+ # ACE-Step 1.5 - Hugging Face Docker Space (GPU)
2
+ # Uses CUDA base; no GPU at build time. Port 7860 for Gradio.
3
+ # See https://huggingface.co/docs/hub/spaces-sdks-docker
4
+
5
+ FROM nvidia/cuda:12.4.0-cudnn8-runtime-ubuntu22.04
6
+
7
+ ENV DEBIAN_FRONTEND=noninteractive
8
+ RUN apt-get update && apt-get install -y --no-install-recommends \
9
+ python3 python3-pip python3-venv python3-dev \
10
+ git build-essential \
11
+ && rm -rf /var/lib/apt/lists/*
12
+
13
+ # HF Spaces run as user 1000
14
+ RUN useradd -m -u 1000 user
15
+ USER user
16
+ ENV HOME=/home/user PATH=/home/user/.local/bin:$PATH
17
+ WORKDIR /home/user/app
18
+
19
+ # Install Python deps (no GPU ops at build time)
20
+ COPY --chown=user requirements.txt .
21
+ RUN pip install --no-cache-dir --upgrade pip && \
22
+ pip install --no-cache-dir -r requirements.txt
23
+
24
+ # App code (acestep/, configs/, app.py, etc.) - copied from Space repo root
25
+ COPY --chown=user . .
26
+
27
+ EXPOSE 7860
28
+ CMD ["python3", "app.py"]
README.md CHANGED
@@ -1,285 +1,16 @@
1
  ---
2
- title: Ace Step Munk
3
  emoji: 🎵
4
- colorFrom: indigo
5
- colorTo: blue
6
- sdk: gradio
7
- sdk_version: 6.2.0
8
- app_file: app.py
9
- hf_oauth: true
10
  pinned: false
 
11
  ---
12
 
13
- <h1 align="center">ACE-Step 1.5</h1>
14
- <h1 align="center">Pushing the Boundaries of Open-Source Music Generation</h1>
15
- <p align="center">
16
- <a href="https://ace-step.github.io/ace-step-v1.5.github.io/">Project</a> |
17
- <a href="https://huggingface.co/ACE-Step/Ace-Step1.5">Hugging Face</a> |
18
- <a href="https://modelscope.cn/models/ACE-Step/Ace-Step1.5">ModelScope</a> |
19
- <a href="https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5">Space Demo</a> |
20
- <a href="https://discord.gg/PeWDxrkdj7">Discord</a> |
21
- <a href="https://arxiv.org/abs/2602.00744">Technical Report</a>
22
- </p>
23
 
24
- <p align="center">
25
- <img src="./assets/orgnization_logos.png" width="100%" alt="StepFun Logo">
26
- </p>
27
 
28
- ## Table of Contents
29
-
30
- - [✨ Features](#-features)
31
- - [⚡ Quick Start](#-quick-start)
32
- - [🚀 Launch Scripts](#-launch-scripts)
33
- - [📚 Documentation](#-documentation)
34
- - [📖 Tutorial](#-tutorial)
35
- - [🏗️ Architecture](#️-architecture)
36
- - [🦁 Model Zoo](#-model-zoo)
37
- - [🔬 Benchmark](#-benchmark)
38
-
39
- ## 📝 Abstract
40
- 🚀 We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast—under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style.
41
-
42
- 🌉 At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints—scaling from short loops to 10-minute compositions—while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). ⚡ Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. 🎚️
43
-
44
- 🔮 Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities—such as cover generation, repainting, and vocal-to-BGM conversion—while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. 🎸
45
-
46
-
47
- ## ✨ Features
48
-
49
- <p align="center">
50
- <img src="./assets/application_map.png" width="100%" alt="ACE-Step Framework">
51
- </p>
52
-
53
- ### ⚡ Performance
54
- - ✅ **Ultra-Fast Generation** — Under 2s per full song on A100, under 10s on RTX 3090 (0.5s to 10s on A100 depending on think mode & diffusion steps)
55
- - ✅ **Flexible Duration** — Supports 10 seconds to 10 minutes (600s) audio generation
56
- - ✅ **Batch Generation** — Generate up to 8 songs simultaneously
57
-
58
- ### 🎵 Generation Quality
59
- - ✅ **Commercial-Grade Output** — Quality beyond most commercial music models (between Suno v4.5 and Suno v5)
60
- - ✅ **Rich Style Support** — 1000+ instruments and styles with fine-grained timbre description
61
- - ✅ **Multi-Language Lyrics** — Supports 50+ languages with lyrics prompt for structure & style control
62
-
63
- ### 🎛️ Versatility & Control
64
-
65
- | Feature | Description |
66
- |---------|-------------|
67
- | ✅ Reference Audio Input | Use reference audio to guide generation style |
68
- | ✅ Cover Generation | Create covers from existing audio |
69
- | ✅ Repaint & Edit | Selective local audio editing and regeneration |
70
- | ✅ Track Separation | Separate audio into individual stems |
71
- | ✅ Multi-Track Generation | Add layers like Suno Studio's "Add Layer" feature |
72
- | ✅ Vocal2BGM | Auto-generate accompaniment for vocal tracks |
73
- | ✅ Metadata Control | Control duration, BPM, key/scale, time signature |
74
- | ✅ Simple Mode | Generate full songs from simple descriptions |
75
- | ✅ Query Rewriting | Auto LM expansion of tags and lyrics |
76
- | ✅ Audio Understanding | Extract BPM, key/scale, time signature & caption from audio |
77
- | ✅ LRC Generation | Auto-generate lyric timestamps for generated music |
78
- | ✅ LoRA Training | One-click annotation & training in Gradio. 8 songs, 1 hour on 3090 (12GB VRAM) |
79
- | ✅ Quality Scoring | Automatic quality assessment for generated audio |
80
-
81
- ## Staying ahead
82
- -----------------
83
- Star ACE-Step on GitHub and be instantly notified of new releases
84
- ![](assets/star.gif)
85
-
86
- ## ⚡ Quick Start
87
-
88
- > **Requirements:** Python 3.11-3.12, CUDA GPU recommended (also supports MPS / ROCm / Intel XPU / CPU)
89
- >
90
- > **Note:** ROCm on Windows requires Python 3.12 (AMD officially provides Python 3.12 wheels only)
91
-
92
- ```bash
93
- # 1. Install uv
94
- curl -LsSf https://astral.sh/uv/install.sh | sh # macOS / Linux
95
- # powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex" # Windows
96
-
97
- # 2. Clone & install
98
- git clone https://github.com/ACE-Step/ACE-Step-1.5.git
99
- cd ACE-Step-1.5
100
- uv sync
101
-
102
- # 3. Launch Gradio UI (models auto-download on first run)
103
- uv run acestep
104
-
105
- # Or launch REST API server
106
- uv run acestep-api
107
- ```
108
-
109
- Open http://localhost:7860 (Gradio) or http://localhost:8001 (API).
110
-
111
- > 📦 **Windows users:** A [portable package](https://files.acemusic.ai/acemusic/win/ACE-Step-1.5.7z) with pre-installed dependencies is available. See [Installation Guide](./docs/en/INSTALL.md#-windows-portable-package).
112
-
113
- > 📖 **Full installation guide** (AMD/ROCm, Intel GPU, CPU, environment variables, command-line options): [English](./docs/en/INSTALL.md) | [中文](./docs/zh/INSTALL.md) | [日本語](./docs/ja/INSTALL.md)
114
-
115
- ### 💡 Which Model Should I Choose?
116
-
117
- | Your GPU VRAM | Recommended LM Model | Backend | Notes |
118
- |---------------|---------------------|---------|-------|
119
- | **≤6GB** | None (DiT only) | — | LM disabled by default; INT8 quantization + full CPU offload |
120
- | **6-8GB** | `acestep-5Hz-lm-0.6B` | `pt` | Lightweight LM with PyTorch backend |
121
- | **8-16GB** | `acestep-5Hz-lm-0.6B` / `1.7B` | `vllm` | 0.6B for 8-12GB, 1.7B for 12-16GB |
122
- | **16-24GB** | `acestep-5Hz-lm-1.7B` | `vllm` | 4B available on 20GB+; no offload needed on 20GB+ |
123
- | **≥24GB** | `acestep-5Hz-lm-4B` | `vllm` | Best quality, all models fit without offload |
124
-
125
- The UI automatically selects the best configuration for your GPU. All settings (LM model, backend, offloading, quantization) are tier-aware and pre-configured.
126
-
127
- > 📖 GPU compatibility details: [English](./docs/en/GPU_COMPATIBILITY.md) | [中文](./docs/zh/GPU_COMPATIBILITY.md) | [日本語](./docs/ja/GPU_COMPATIBILITY.md) | [한국어](./docs/ko/GPU_COMPATIBILITY.md)
128
-
129
- ## 🚀 Launch Scripts
130
-
131
- Ready-to-use launch scripts for all platforms with auto environment detection, update checking, and dependency installation.
132
-
133
- | Platform | Scripts | Backend |
134
- |----------|---------|---------|
135
- | **Windows** | `start_gradio_ui.bat`, `start_api_server.bat` | CUDA |
136
- | **Windows (ROCm)** | `start_gradio_ui_rocm.bat`, `start_api_server_rocm.bat` | AMD ROCm |
137
- | **Linux** | `start_gradio_ui.sh`, `start_api_server.sh` | CUDA |
138
- | **macOS** | `start_gradio_ui_macos.sh`, `start_api_server_macos.sh` | MLX (Apple Silicon) |
139
-
140
- ```bash
141
- # Windows
142
- start_gradio_ui.bat
143
-
144
- # Linux
145
- chmod +x start_gradio_ui.sh && ./start_gradio_ui.sh
146
-
147
- # macOS (Apple Silicon)
148
- chmod +x start_gradio_ui_macos.sh && ./start_gradio_ui_macos.sh
149
- ```
150
-
151
- ### ⚙️ Customizing Launch Settings
152
-
153
- **Recommended:** Create a `.env` file to customize models, ports, and other settings. Your `.env` configuration will survive repository updates.
154
-
155
- ```bash
156
- # Copy the example file
157
- cp .env.example .env
158
-
159
- # Edit with your preferred settings
160
- # Examples in .env:
161
- ACESTEP_CONFIG_PATH=acestep-v15-turbo
162
- ACESTEP_LM_MODEL_PATH=acestep-5Hz-lm-1.7B
163
- PORT=7860
164
- LANGUAGE=en
165
- ```
166
-
167
- > 📖 **Script configuration & customization:** [English](./docs/en/INSTALL.md#-launch-scripts) | [中文](./docs/zh/INSTALL.md#-启动脚本) | [日本語](./docs/ja/INSTALL.md#-起動スクリプト)
168
-
169
- ## 📚 Documentation
170
-
171
- ### Usage Guides
172
-
173
- | Method | Description | Documentation |
174
- |--------|-------------|---------------|
175
- | 🖥️ **Gradio Web UI** | Interactive web interface for music generation | [Guide](./docs/en/GRADIO_GUIDE.md) |
176
- | 🎚️ **Studio UI** | Optional HTML frontend (DAW-like) | [Guide](./docs/en/studio.md) |
177
- | 🐍 **Python API** | Programmatic access for integration | [Guide](./docs/en/INFERENCE.md) |
178
- | 🌐 **REST API** | HTTP-based async API for services | [Guide](./docs/en/API.md) |
179
- | ⌨️ **CLI** | Interactive wizard and configuration | [Guide](./docs/en/CLI.md) |
180
-
181
- ### Setup & Configuration
182
-
183
- | Topic | Documentation |
184
- |-------|---------------|
185
- | 📦 Installation (all platforms) | [English](./docs/en/INSTALL.md) \| [中文](./docs/zh/INSTALL.md) \| [日本語](./docs/ja/INSTALL.md) |
186
- | 🎮 GPU Compatibility | [English](./docs/en/GPU_COMPATIBILITY.md) \| [中文](./docs/zh/GPU_COMPATIBILITY.md) \| [日本語](./docs/ja/GPU_COMPATIBILITY.md) |
187
- | 🔧 GPU Troubleshooting | [English](./docs/en/GPU_TROUBLESHOOTING.md) |
188
- | 🔬 Benchmark & Profiling | [English](./docs/en/BENCHMARK.md) \| [中文](./docs/zh/BENCHMARK.md) |
189
-
190
- ### Multi-Language Docs
191
-
192
- | Language | API | Gradio | Inference | Tutorial | Install | Benchmark |
193
- |----------|-----|--------|-----------|----------|---------|-----------|
194
- | 🇺🇸 English | [Link](./docs/en/API.md) | [Link](./docs/en/GRADIO_GUIDE.md) | [Link](./docs/en/INFERENCE.md) | [Link](./docs/en/Tutorial.md) | [Link](./docs/en/INSTALL.md) | [Link](./docs/en/BENCHMARK.md) |
195
- | 🇨🇳 中文 | [Link](./docs/zh/API.md) | [Link](./docs/zh/GRADIO_GUIDE.md) | [Link](./docs/zh/INFERENCE.md) | [Link](./docs/zh/Tutorial.md) | [Link](./docs/zh/INSTALL.md) | [Link](./docs/zh/BENCHMARK.md) |
196
- | 🇯🇵 日本語 | [Link](./docs/ja/API.md) | [Link](./docs/ja/GRADIO_GUIDE.md) | [Link](./docs/ja/INFERENCE.md) | [Link](./docs/ja/Tutorial.md) | [Link](./docs/ja/INSTALL.md) | — |
197
- | 🇰🇷 한국어 | [Link](./docs/ko/API.md) | [Link](./docs/ko/GRADIO_GUIDE.md) | [Link](./docs/ko/INFERENCE.md) | [Link](./docs/ko/Tutorial.md) | — | — |
198
-
199
- ## 📖 Tutorial
200
-
201
- **🎯 Must Read:** Comprehensive guide to ACE-Step 1.5's design philosophy and usage methods.
202
-
203
- | Language | Link |
204
- |----------|------|
205
- | 🇺🇸 English | [English Tutorial](./docs/en/Tutorial.md) |
206
- | 🇨🇳 中文 | [中文教程](./docs/zh/Tutorial.md) |
207
- | 🇯🇵 日本語 | [日本語チュートリアル](./docs/ja/Tutorial.md) |
208
-
209
- This tutorial covers: mental models and design philosophy, model architecture and selection, input control (text and audio), inference hyperparameters, random factors and optimization strategies.
210
-
211
- ## 🔨 Train
212
-
213
- See the **LoRA Training** tab in Gradio UI for one-click training, or check [Gradio Guide - LoRA Training](./docs/en/GRADIO_GUIDE.md#lora-training) for details.
214
-
215
- ## 🏗️ Architecture
216
-
217
- <p align="center">
218
- <img src="./assets/ACE-Step_framework.png" width="100%" alt="ACE-Step Framework">
219
- </p>
220
-
221
- ## 🦁 Model Zoo
222
-
223
- <p align="center">
224
- <img src="./assets/model_zoo.png" width="100%" alt="Model Zoo">
225
- </p>
226
-
227
- ### DiT Models
228
-
229
- | DiT Model | Pre-Training | SFT | RL | CFG | Step | Refer audio | Text2Music | Cover | Repaint | Extract | Lego | Complete | Quality | Diversity | Fine-Tunability | Hugging Face |
230
- |-----------|:------------:|:---:|:--:|:---:|:----:|:-----------:|:----------:|:-----:|:-------:|:-------:|:----:|:--------:|:-------:|:---------:|:---------------:|--------------|
231
- | `acestep-v15-base` | ✅ | ❌ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | High | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-base) |
232
- | `acestep-v15-sft` | ✅ | ✅ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | High | Medium | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-sft) |
233
- | `acestep-v15-turbo` | ✅ | ✅ | ❌ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |
234
- | `acestep-v15-turbo-rl` | ✅ | ✅ | ✅ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | To be released |
235
-
236
- ### LM Models
237
-
238
- | LM Model | Pretrain from | Pre-Training | SFT | RL | CoT metas | Query rewrite | Audio Understanding | Composition Capability | Copy Melody | Hugging Face |
239
- |----------|---------------|:------------:|:---:|:--:|:---------:|:-------------:|:-------------------:|:----------------------:|:-----------:|--------------|
240
- | `acestep-5Hz-lm-0.6B` | Qwen3-0.6B | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | Medium | Weak | ✅ |
241
- | `acestep-5Hz-lm-1.7B` | Qwen3-1.7B | ✅ | ✅ | ✅ | ✅ | ✅ | Medium | Medium | Medium | ✅ |
242
- | `acestep-5Hz-lm-4B` | Qwen3-4B | ✅ | ✅ | ✅ | ✅ | ✅ | Strong | Strong | Strong | ✅ |
243
-
244
- ## 🔬 Benchmark
245
-
246
- ACE-Step 1.5 includes `profile_inference.py`, a profiling & benchmarking tool that measures LLM, DiT, and VAE timing across devices and configurations.
247
-
248
- ```bash
249
- python profile_inference.py # Single-run profile
250
- python profile_inference.py --mode benchmark # Configuration matrix
251
- ```
252
-
253
- > 📖 **Full guide** (all modes, CLI options, output interpretation): [English](./docs/en/BENCHMARK.md) | [中文](./docs/zh/BENCHMARK.md)
254
-
255
- ## 📜 License & Disclaimer
256
-
257
- This project is licensed under [MIT](./LICENSE)
258
-
259
- ACE-Step enables original music generation across diverse genres, with applications in creative production, education, and entertainment. While designed to support positive and artistic use cases, we acknowledge potential risks such as unintentional copyright infringement due to stylistic similarity, inappropriate blending of cultural elements, and misuse for generating harmful content. To ensure responsible use, we encourage users to verify the originality of generated works, clearly disclose AI involvement, and obtain appropriate permissions when adapting protected styles or materials. By using ACE-Step, you agree to uphold these principles and respect artistic integrity, cultural diversity, and legal compliance. The authors are not responsible for any misuse of the model, including but not limited to copyright violations, cultural insensitivity, or the generation of harmful content.
260
-
261
- 🔔 Important Notice
262
- The only official website for the ACE-Step project is our GitHub Pages site.
263
- We do not operate any other websites.
264
- 🚫 Fake domains include but are not limited to:
265
- ac\*\*p.com, a\*\*p.org, a\*\*\*c.org
266
- ⚠️ Please be cautious. Do not visit, trust, or make payments on any of those sites.
267
-
268
- ## 🙏 Acknowledgements
269
-
270
- This project is co-led by ACE Studio and StepFun.
271
-
272
-
273
- ## 📖 Citation
274
-
275
- If you find this project useful for your research, please consider citing:
276
-
277
- ```BibTeX
278
- @misc{gong2026acestep,
279
- title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
280
- author={Junmin Gong, Yulin Song, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo},
281
- howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
282
- year={2026},
283
- note={GitHub repository}
284
- }
285
- ```
 
1
  ---
2
+ title: ACE-Step 1.5 Music Gen
3
  emoji: 🎵
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: docker
7
+ app_port: 7860
 
 
8
  pinned: false
9
+ license: mit
10
  ---
11
 
12
+ # ACE-Step 1.5 (Docker)
 
 
 
 
 
 
 
 
 
13
 
14
+ Lyric-controllable, open-source text-to-music. Runs as a Docker Space with GPU.
 
 
15
 
16
+ Models are downloaded from the Hub on first run (ACE-Step/Ace-Step1.5). Select **GPU** (e.g. T4 or A10G) in Space Settings.
SECURITY.md ADDED
@@ -0,0 +1,27 @@
1
+ # Security Policy
2
+
3
+ ## Reporting a Vulnerability
4
+
5
+ We take security issues seriously and appreciate responsible disclosure.
6
+
7
+ If you believe you have found a security vulnerability, **please do not report it in a public GitHub issue**.
8
+
9
+ Instead, use one of the following private channels:
10
+
11
+ - Open a **GitHub Security Advisory** for this repository (preferred)
12
+ - Or contact the maintainers directly if a private email channel is listed
13
+
14
+ Please include:
15
+ - A clear description of the issue
16
+ - Steps to reproduce (if applicable)
17
+ - Potential impact
18
+ - Any relevant proof-of-concept or logs
19
+
20
+ We will acknowledge receipt and work to assess the issue as quickly as possible.
21
+
22
+ ## Bug Bounties
23
+
24
+ At this time, this project does **not** operate a formal bug bounty program.
25
+ However, valid and responsibly disclosed security issues may be acknowledged in release notes or documentation at the maintainers’ discretion.
26
+
27
+ Thank you for helping keep the project and its users safe.
app.py CHANGED
@@ -1,21 +1,26 @@
 
 
 
 
1
  import os
2
  import sys
3
 
4
- # Ensure current directory is in sys.path
5
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
 
 
 
 
 
 
 
 
 
 
 
 
6
 
7
  from acestep.acestep_v15_pipeline import main
8
 
9
  if __name__ == "__main__":
10
- # ZeroGPU specific settings if needed
11
- # Usually ZeroGPU works out of the box with @spaces.GPU
12
-
13
- # Run the main function from the pipeline
14
- # We pass arguments as if they were from command line
15
- import sys
16
- sys.argv = [
17
- "app.py",
18
- "--server-name", "0.0.0.0",
19
- "--port", "7860",
20
- ]
21
  main()
 
1
+ """
2
+ Hugging Face Space entry point for ACE-Step 1.5.
3
+ Runs the Gradio app bound to 0.0.0.0:7860 with init_service enabled so models load on startup.
4
+ """
5
  import os
6
  import sys
7
 
8
+ # Ensure this repo root is on path (Space repo contains app.py + acestep/ at same level)
9
+ _REPO_ROOT = os.path.dirname(os.path.abspath(__file__))
10
+ if _REPO_ROOT not in sys.path:
11
+ sys.path.insert(0, _REPO_ROOT)
12
+
13
+ # Override argv for Space: bind all interfaces, port 7860, init service
14
+ os.chdir(_REPO_ROOT)
15
+ sys.argv = [
16
+ sys.argv[0],
17
+ "--server_name", "0.0.0.0",
18
+ "--port", "7860",
19
+ "--init_service", "true",
20
+ "--download-source", "huggingface",
21
+ ]
22
 
23
  from acestep.acestep_v15_pipeline import main
24
 
25
  if __name__ == "__main__":
 
26
  main()
check_update.bat ADDED
@@ -0,0 +1,609 @@
1
+ @echo off
2
+ REM Git Update Check Utility
3
+ REM This script checks for updates from GitHub and optionally updates the repository
4
+
5
+ setlocal enabledelayedexpansion
6
+
7
+ REM Configuration
8
+ set TIMEOUT_SECONDS=10
9
+ set GIT_PORTABLE_PATH=%~dp0PortableGit\bin\git.exe
10
+ set GIT_PATH=
11
+ set REPO_PATH=%~dp0
12
+ set PROXY_CONFIG_FILE=%~dp0proxy_config.txt
13
+
14
+ echo ========================================
15
+ echo ACE-Step Update Check
16
+ echo ========================================
17
+ echo.
18
+
19
+ REM Check for Git: first try PortableGit, then system Git
20
+ if exist "%GIT_PORTABLE_PATH%" (
21
+ set "GIT_PATH=%GIT_PORTABLE_PATH%"
22
+ echo [Git] Using PortableGit
23
+ ) else (
24
+ REM Try to find git in system PATH
25
+ where git >nul 2>&1
26
+ if !ERRORLEVEL! EQU 0 (
27
+ for /f "tokens=*" %%i in ('where git 2^>nul') do (
28
+ if not defined GIT_PATH set "GIT_PATH=%%i"
29
+ )
30
+ echo [Git] Using system Git: !GIT_PATH!
31
+ ) else (
32
+ echo [Error] Git not found.
33
+ echo - PortableGit not found at: %GIT_PORTABLE_PATH%
34
+ echo - System Git not found in PATH
35
+ echo.
36
+ echo Please either:
37
+ echo 1. Install PortableGit in the PortableGit folder, or
38
+ echo 2. Install Git and add it to your system PATH
39
+ echo.
40
+ echo ========================================
41
+ echo Press any key to close...
42
+ echo ========================================
43
+ pause >nul
44
+ exit /b 1
45
+ )
46
+ )
47
+ echo.
48
+
49
+ REM Check if this is a git repository
50
+ cd /d "%REPO_PATH%"
51
+ "!GIT_PATH!" rev-parse --git-dir >nul 2>&1
52
+ if %ERRORLEVEL% NEQ 0 (
53
+ echo [Error] Not a git repository.
54
+ echo This folder does not appear to be a git repository.
55
+ echo.
56
+ echo ========================================
57
+ echo Press any key to close...
58
+ echo ========================================
59
+ pause >nul
60
+ exit /b 1
61
+ )
62
+
63
+ REM Load proxy configuration if exists
64
+ set PROXY_ENABLED=0
65
+ set PROXY_URL=
66
+ if exist "%PROXY_CONFIG_FILE%" (
67
+ for /f "usebackq tokens=1,* delims==" %%a in ("%PROXY_CONFIG_FILE%") do (
68
+ if /i "%%a"=="PROXY_ENABLED" set PROXY_ENABLED=%%b
69
+ if /i "%%a"=="PROXY_URL" set PROXY_URL=%%b
70
+ )
71
+
72
+ if "!PROXY_ENABLED!"=="1" (
73
+ if not "!PROXY_URL!"=="" (
74
+ echo [Proxy] Using proxy server: !PROXY_URL!
75
+ "!GIT_PATH!" config --local http.proxy "!PROXY_URL!"
76
+ "!GIT_PATH!" config --local https.proxy "!PROXY_URL!"
77
+ echo.
78
+ )
79
+ )
80
+ )
81
+
82
+ echo [1/4] Checking current version...
83
+ REM Get current branch
84
+ for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --abbrev-ref HEAD 2^>nul') do set CURRENT_BRANCH=%%i
85
+ if "%CURRENT_BRANCH%"=="" set CURRENT_BRANCH=main
86
+
87
+ REM Get current commit
88
+ for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --short HEAD 2^>nul') do set CURRENT_COMMIT=%%i
89
+
90
+ echo Branch: %CURRENT_BRANCH%
91
+ echo Commit: %CURRENT_COMMIT%
92
+ echo.
93
+
94
+ echo [2/4] Checking for updates (timeout: %TIMEOUT_SECONDS%s)...
95
+ echo Connecting to GitHub...
96
+
97
+ :FetchRetry
98
+ REM Fetch remote with timeout (stderr visible so "Bad credentials" etc. are shown)
99
+ set FETCH_SUCCESS=0
100
+ "!GIT_PATH!" fetch origin --quiet
101
+ if %ERRORLEVEL% EQU 0 (
102
+ set FETCH_SUCCESS=1
103
+ )
104
+ if !FETCH_SUCCESS! EQU 1 goto :FetchDone
105
+
106
+ REM Try with timeout using a temp marker file
107
+ set TEMP_MARKER=%TEMP%\acestep_git_fetch_%RANDOM%.tmp
108
+
109
+ REM Start fetch in background
110
+ set "FETCH_CMD=!GIT_PATH! fetch origin --quiet"
111
+ start /b "" cmd /c "!FETCH_CMD! >nul 2>&1 && echo SUCCESS > "!TEMP_MARKER!""
112
+
113
+ REM Wait with timeout
114
+ set /a COUNTER=0
115
+ :WaitLoop
116
+ if exist "!TEMP_MARKER!" (
117
+ set FETCH_SUCCESS=1
118
+ del "!TEMP_MARKER!" >nul 2>&1
119
+ goto :FetchDone
120
+ )
121
+
122
+ timeout /t 1 /nobreak >nul
123
+ set /a COUNTER+=1
124
+ if !COUNTER! LSS %TIMEOUT_SECONDS% goto :WaitLoop
125
+
126
+ REM Timeout reached
127
+ echo [Timeout] Could not connect to GitHub within %TIMEOUT_SECONDS% seconds.
128
+
129
+ :FetchDone
130
+ if %FETCH_SUCCESS% EQU 0 (
131
+ echo [Failed] Could not fetch from GitHub.
132
+ echo If the error above is "Bad credentials", update or clear stored Git credentials.
133
+ echo This repo is public and does not require login: https://docs.github.com/en/get-started/getting-started-with-git/caching-your-github-credentials-in-git
134
+ echo Otherwise check your internet connection or proxy.
135
+ echo.
136
+
137
+     REM Ask if user wants to configure proxy
+     set /p PROXY_CHOICE="Do you want to configure a proxy server to retry? (Y/N): "
+     if /i "!PROXY_CHOICE!"=="Y" (
+         call :ConfigureProxy
+         if !ERRORLEVEL! EQU 0 (
+             echo.
+             echo [Proxy] Retrying with proxy configuration...
+             echo.
+             goto :FetchRetry
+         )
+     )
+
+     echo.
+     echo ========================================
+     echo Press any key to close...
+     echo ========================================
+     pause >nul
+     exit /b 2
+ )
+
+ echo [Success] Fetched latest information from GitHub.
+ echo.
+
+ echo [3/4] Comparing versions...
+ REM Get remote commit
+ for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --short origin/%CURRENT_BRANCH% 2^>nul') do set REMOTE_COMMIT=%%i
+
+ if "%REMOTE_COMMIT%"=="" (
+     echo [Warning] Remote branch 'origin/%CURRENT_BRANCH%' not found.
+     echo.
+     echo Your current branch '%CURRENT_BRANCH%' does not exist on the remote repository.
+     echo This might be a local development branch.
+     echo.
+
+     REM Try to get main branch instead
+     set FALLBACK_BRANCH=main
+     echo Checking main branch instead...
+     for /f "tokens=*" %%i in ('"!GIT_PATH!" rev-parse --short origin/!FALLBACK_BRANCH! 2^>nul') do set REMOTE_COMMIT=%%i
+
+     if "!REMOTE_COMMIT!"=="" (
+         echo [Error] Could not find remote main branch either.
+         echo Please ensure you are connected to the correct repository.
+         echo.
+         echo ========================================
+         echo Press any key to close...
+         echo ========================================
+         pause >nul
+         exit /b 1
+     )
+
+     echo Found main branch: !REMOTE_COMMIT!
+     echo.
+     echo Recommendation: Switch to main branch to check for official updates.
+     echo Command: git checkout main
+     echo.
+
+     set /p SWITCH_BRANCH="Do you want to switch to main branch now? (Y/N): "
+     if /i "!SWITCH_BRANCH!"=="Y" (
+         echo.
+         echo Switching to main branch...
+         "!GIT_PATH!" checkout main
+
+         if !ERRORLEVEL! EQU 0 (
+             echo [Success] Switched to main branch.
+             echo.
+             echo Please run this script again to check for updates.
+             echo.
+             echo ========================================
+             echo Press any key to close...
+             echo ========================================
+             pause >nul
+             exit /b 0
+         ) else (
+             echo [Error] Failed to switch branch.
+             echo.
+             echo ========================================
+             echo Press any key to close...
+             echo ========================================
+             pause >nul
+             exit /b 1
+         )
+     ) else (
+         echo.
+         echo Staying on branch '%CURRENT_BRANCH%'. No update performed.
+         echo.
+         echo ========================================
+         echo Press any key to close...
+         echo ========================================
+         pause >nul
+         exit /b 0
+     )
+ )
+
+ echo Local:  %CURRENT_COMMIT%
+ echo Remote: %REMOTE_COMMIT%
+ echo.
+
+ REM Compare commits
+ if "%CURRENT_COMMIT%"=="%REMOTE_COMMIT%" (
+     echo [4/4] Result: Already up to date!
+     echo You have the latest version.
+     echo.
+     echo ========================================
+     echo Press any key to close...
+     echo ========================================
+     pause >nul
+     exit /b 0
+ ) else (
+     echo [4/4] Result: Update available!
+
+     REM Check if local is behind remote
+     "!GIT_PATH!" merge-base --is-ancestor HEAD origin/%CURRENT_BRANCH% 2>nul
+     if !ERRORLEVEL! EQU 0 (
+         echo A new version is available on GitHub.
+         echo.
+
+         REM Show commits behind (do not suppress stderr so ref/encoding errors are visible)
+         echo New commits:
+         "!GIT_PATH!" --no-pager log --oneline --graph --decorate "HEAD..origin/!CURRENT_BRANCH!"
+         if !ERRORLEVEL! NEQ 0 (
+             echo [Could not show commit log. Check branch name and network.]
+         )
+         echo.
+
+         REM Ask if user wants to update
+         set /p UPDATE_CHOICE="Do you want to update now? (Y/N): "
+         if /i "!UPDATE_CHOICE!"=="Y" (
+             echo.
+             echo Updating...
+
+             REM First, refresh the index to avoid false positives from line ending changes
+             "!GIT_PATH!" update-index --refresh >nul 2>&1
+
+             REM Check for uncommitted changes
+             "!GIT_PATH!" diff-index --quiet HEAD -- 2>nul
+             if !ERRORLEVEL! NEQ 0 (
+                 echo.
+                 echo [Info] Checking for potential conflicts...
+
+                 REM Get list of locally modified files
+                 set TEMP_LOCAL_CHANGES=%TEMP%\acestep_local_changes_%RANDOM%.txt
+                 "!GIT_PATH!" diff --name-only HEAD 2>nul > "!TEMP_LOCAL_CHANGES!"
+
+                 REM Get list of files changed in remote
+                 set TEMP_REMOTE_CHANGES=%TEMP%\acestep_remote_changes_%RANDOM%.txt
+                 "!GIT_PATH!" diff --name-only HEAD..origin/%CURRENT_BRANCH% 2>nul > "!TEMP_REMOTE_CHANGES!"
+
+                 REM Check for conflicts
+                 set HAS_CONFLICTS=0
+                 REM Use wmic to get locale-independent date/time format (YYYYMMDDHHMMSS)
+                 for /f "tokens=2 delims==" %%a in ('wmic os get localdatetime /value 2^>nul') do set "DATETIME=%%a"
+                 set "BACKUP_DIR=%~dp0.update_backup_!DATETIME:~0,8!_!DATETIME:~8,6!"
+
+                 REM Find conflicting files
+                 for /f "usebackq delims=" %%a in ("!TEMP_LOCAL_CHANGES!") do (
+                     findstr /x /c:"%%a" "!TEMP_REMOTE_CHANGES!" >nul 2>&1
+                     if !ERRORLEVEL! EQU 0 (
+                         REM Found a conflict
+                         set HAS_CONFLICTS=1
+
+                         REM Create backup directory if not exists
+                         if not exist "!BACKUP_DIR!" (
+                             mkdir "!BACKUP_DIR!"
+                             echo.
+                             echo [Backup] Creating backup directory: !BACKUP_DIR!
+                         )
+
+                         REM Backup the file
+                         echo [Backup] Backing up: %%a
+                         set FILE_PATH=%%a
+                         set FILE_DIR=
+                         for %%i in ("!FILE_PATH!") do set FILE_DIR=%%~dpi
+
+                         REM Create subdirectories in backup if needed
+                         if not "!FILE_DIR!"=="" (
+                             if not "!FILE_DIR!"=="." (
+                                 if not exist "!BACKUP_DIR!\!FILE_DIR!" (
+                                     mkdir "!BACKUP_DIR!\!FILE_DIR!" 2>nul
+                                 )
+                             )
+                         )
+
+                         REM Copy file to backup
+                         copy "%%a" "!BACKUP_DIR!\%%a" >nul 2>&1
+                     )
+                 )
+
+                 REM Clean up temp files
+                 del "!TEMP_LOCAL_CHANGES!" >nul 2>&1
+                 del "!TEMP_REMOTE_CHANGES!" >nul 2>&1
+
+                 if !HAS_CONFLICTS! EQU 1 (
+                     echo.
+                     echo ========================================
+                     echo [Warning] Potential conflicts detected!
+                     echo ========================================
+                     echo.
+                     echo Your modified files may conflict with remote updates.
+                     echo Your changes have been backed up to:
+                     echo !BACKUP_DIR!
+                     echo.
+                     echo Update will restore these files to the remote version.
+                     echo You can manually merge your changes later.
+                     echo.
+                     set /p CONFLICT_CHOICE="Continue with update? (Y/N): "
+
+                     if /i "!CONFLICT_CHOICE!"=="Y" (
+                         echo.
+                         echo [Restore] Proceeding with update...
+                         echo [Restore] Files will be updated to remote version.
+                     ) else (
+                         echo.
+                         echo Update cancelled.
+                         echo Your backup remains at: !BACKUP_DIR!
+                         echo.
+                         echo ========================================
+                         echo Press any key to close...
+                         echo ========================================
+                         pause >nul
+                         exit /b 0
+                     )
+                 ) else (
+                     echo.
+                     echo [Info] No conflicts detected. Safe to stash and update.
+                     echo.
+                     set /p STASH_CHOICE="Stash your changes and continue? (Y/N): "
+                     if /i "!STASH_CHOICE!"=="Y" (
+                         echo Stashing changes...
+                         "!GIT_PATH!" stash push -m "Auto-stash before update - %date% %time%"
+                     ) else (
+                         echo.
+                         echo Update cancelled.
+                         echo.
+                         echo ========================================
+                         echo Press any key to close...
+                         echo ========================================
+                         pause >nul
+                         exit /b 0
+                     )
+                 )
+             )
+
+             REM Check for untracked files that could be overwritten
+             set STASHED_UNTRACKED=0
+             set TEMP_UNTRACKED=%TEMP%\acestep_untracked_%RANDOM%.txt
+             "!GIT_PATH!" ls-files --others --exclude-standard 2>nul > "!TEMP_UNTRACKED!"
+
+             REM Check if there are any untracked files
+             set HAS_UNTRACKED=0
+             for /f "usebackq delims=" %%u in ("!TEMP_UNTRACKED!") do set HAS_UNTRACKED=1
+
+             if !HAS_UNTRACKED! EQU 1 (
+                 REM Get files added in remote
+                 set TEMP_REMOTE_ADDED=%TEMP%\acestep_remote_added_%RANDOM%.txt
+                 "!GIT_PATH!" diff --name-only --diff-filter=A HEAD..origin/%CURRENT_BRANCH% 2>nul > "!TEMP_REMOTE_ADDED!"
+
+                 set HAS_UNTRACKED_CONFLICTS=0
+                 for /f "usebackq delims=" %%u in ("!TEMP_UNTRACKED!") do (
+                     findstr /x /c:"%%u" "!TEMP_REMOTE_ADDED!" >nul 2>&1
+                     if !ERRORLEVEL! EQU 0 (
+                         if !HAS_UNTRACKED_CONFLICTS! EQU 0 (
+                             echo.
+                             echo ========================================
+                             echo [Warning] Untracked files conflict with update!
+                             echo ========================================
+                             echo.
+                             echo The following untracked files would be overwritten:
+                         )
+                         set HAS_UNTRACKED_CONFLICTS=1
+                         echo %%u
+                     )
+                 )
+
+                 del "!TEMP_REMOTE_ADDED!" >nul 2>&1
+
+                 if !HAS_UNTRACKED_CONFLICTS! EQU 1 (
+                     echo.
+                     set /p STASH_UNTRACKED_CHOICE="Stash untracked files before updating? (Y/N): "
+                     if /i "!STASH_UNTRACKED_CHOICE!"=="Y" (
+                         echo Stashing all changes including untracked files...
+                         "!GIT_PATH!" stash push --include-untracked -m "pre-update-%RANDOM%" >nul 2>&1
+                         if !ERRORLEVEL! EQU 0 (
+                             set STASHED_UNTRACKED=1
+                             echo [Stash] Changes stashed successfully.
+                         ) else (
+                             echo [Error] Failed to stash changes. Update aborted.
+                             del "!TEMP_UNTRACKED!" >nul 2>&1
+                             echo.
+                             echo ========================================
+                             echo Press any key to close...
+                             echo ========================================
+                             pause >nul
+                             exit /b 1
+                         )
+                     ) else (
+                         echo.
+                         echo Update cancelled. Please move or remove the conflicting files manually.
+                         del "!TEMP_UNTRACKED!" >nul 2>&1
+                         echo.
+                         echo ========================================
+                         echo Press any key to close...
+                         echo ========================================
+                         pause >nul
+                         exit /b 1
+                     )
+                     echo.
+                 )
+             )
+
+             del "!TEMP_UNTRACKED!" >nul 2>&1
+
+             REM Pull changes
+             echo Pulling latest changes...
+             REM Force update by resetting to remote branch (discards any remaining local changes)
+             "!GIT_PATH!" reset --hard origin/%CURRENT_BRANCH% >nul 2>&1
+
+             if !ERRORLEVEL! EQU 0 (
+                 echo.
+                 echo ========================================
+                 echo Update completed successfully!
+                 echo ========================================
+                 echo.
+
+                 REM Check if backup was created
+                 if defined BACKUP_DIR (
+                     if exist "!BACKUP_DIR!" (
+                         echo [Important] Your modified files were backed up to:
+                         echo !BACKUP_DIR!
+                         echo.
+                         echo To restore your changes:
+                         echo 1. Run merge_config.bat to compare and merge files
+                         echo 2. Or manually compare backup with new version
+                         echo.
+                         echo Backed up files:
+                         set "BACKUP_DIR_DISPLAY=!BACKUP_DIR!"
+                         for /f "delims=" %%f in ('dir /b /s "!BACKUP_DIR!\*.*" 2^>nul') do (
+                             set "FILEPATH=%%f"
+                             REM Use call to safely handle the string replacement
+                             call set "FILEPATH=%%FILEPATH:!BACKUP_DIR_DISPLAY!\=%%"
+                             echo - !FILEPATH!
+                         )
+                         echo.
+                     )
+                 )
+
+                 if !STASHED_UNTRACKED! EQU 1 (
+                     echo [Stash] Untracked files were stashed before the update.
+                     echo To restore them: git stash pop
+                     echo To discard them: git stash drop
+                     echo.
+                     echo Note: 'git stash pop' may produce merge conflicts if
+                     echo the update modified the same files. Resolve manually.
+                     echo.
+                 )
+
+                 echo Please restart the application to use the new version.
+                 echo.
+                 echo ========================================
+                 echo Press any key to close...
+                 echo ========================================
+                 pause >nul
+                 exit /b 0
+             ) else (
+                 echo.
+                 echo [Error] Update failed.
+                 echo Please check the error messages above.
+
+                 if !STASHED_UNTRACKED! EQU 1 (
+                     echo.
+                     echo [Stash] Restoring stashed changes...
+                     "!GIT_PATH!" stash pop >nul 2>&1
+                     if !ERRORLEVEL! EQU 0 (
+                         echo [Stash] Changes restored successfully.
+                     ) else (
+                         echo [Stash] Could not auto-restore. Run 'git stash pop' manually.
+                     )
+                 )
+
+                 REM If backup exists, mention it
+                 if defined BACKUP_DIR (
+                     if exist "!BACKUP_DIR!" (
+                         echo.
+                         echo Your backup is still available at: !BACKUP_DIR!
+                     )
+                 )
+
+                 echo.
+                 echo ========================================
+                 echo Press any key to close...
+                 echo ========================================
+                 pause >nul
+                 exit /b 1
+             )
+         ) else (
+             echo.
+             echo Update skipped.
+             echo.
+             echo ========================================
+             echo Press any key to close...
+             echo ========================================
+             pause >nul
+             exit /b 0
+         )
+     ) else (
+         echo [Warning] Local version has diverged from remote.
+         echo This might be because you have local commits.
+         echo Please update manually or consult the documentation.
+         echo.
+         echo ========================================
+         echo Press any key to close...
+         echo ========================================
+         pause >nul
+         exit /b 0
+     )
+ )
+
+ REM ========================================
+ REM Function: ConfigureProxy
+ REM Configure proxy server for git
+ REM ========================================
+ :ConfigureProxy
+ echo.
+ echo ========================================
+ echo Proxy Server Configuration
+ echo ========================================
+ echo.
+ echo Please enter your proxy server URL.
+ echo.
+ echo Examples:
+ echo - HTTP proxy: http://127.0.0.1:7890
+ echo - HTTPS proxy: https://proxy.example.com:8080
+ echo - SOCKS5: socks5://127.0.0.1:1080
+ echo.
+ echo Leave empty to disable proxy.
+ echo.
+ set /p NEW_PROXY_URL="Proxy URL: "
+
+ if "!NEW_PROXY_URL!"=="" (
+     echo.
+     echo [Proxy] Disabling proxy...
+
+     REM Remove proxy configuration
+     "!GIT_PATH!" config --local --unset http.proxy 2>nul
+     "!GIT_PATH!" config --local --unset https.proxy 2>nul
+
+     REM Update config file
+     (
+         echo PROXY_ENABLED=0
+         echo PROXY_URL=
+     ) > "%PROXY_CONFIG_FILE%"
+
+     echo [Proxy] Proxy disabled.
+     exit /b 0
+ ) else (
+     echo.
+     echo [Proxy] Configuring proxy: !NEW_PROXY_URL!
+
+     REM Apply proxy to git
+     "!GIT_PATH!" config --local http.proxy "!NEW_PROXY_URL!"
+     "!GIT_PATH!" config --local https.proxy "!NEW_PROXY_URL!"
+
+     REM Save to config file
+     (
+         echo PROXY_ENABLED=1
+         echo PROXY_URL=!NEW_PROXY_URL!
+     ) > "%PROXY_CONFIG_FILE%"
+
+     echo [Proxy] Proxy configured successfully.
+     echo [Proxy] Configuration saved to: %PROXY_CONFIG_FILE%
+     exit /b 0
+ )
+
+ endlocal
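Aside (not part of the commit): the batch script names its backup folder with a locale-independent timestamp by slicing the `YYYYMMDDHHMMSS...` value that `wmic os get localdatetime` returns. A minimal Python sketch of that slicing, with a hypothetical helper name and sample value:

```python
# Mirrors the batch substring ops `!DATETIME:~0,8!` (date) and
# `!DATETIME:~8,6!` (time) used to build the backup directory suffix.
def backup_suffix(localdatetime: str) -> str:
    # wmic returns e.g. "20240131093045.123456+480" regardless of locale
    return f"{localdatetime[0:8]}_{localdatetime[8:14]}"

print(backup_suffix("20240131093045.123456+480"))  # -> 20240131_093045
```

Using a fixed-width, locale-independent source string is what makes blind index slicing safe here; parsing the localized output of `date /t` would break on non-English Windows installs.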
check_update.sh ADDED
@@ -0,0 +1,330 @@
+ #!/usr/bin/env bash
+ # Git Update Check Utility - Linux/macOS
+ # This script checks for updates from GitHub and optionally updates the repository
+
+ set -euo pipefail
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+ # Configuration
+ TIMEOUT_SECONDS=10
+ GIT_PATH=""
+ REPO_PATH="$SCRIPT_DIR"
+
+ echo "========================================"
+ echo "ACE-Step Update Check"
+ echo "========================================"
+ echo
+
+ # Find git
+ if command -v git &>/dev/null; then
+     GIT_PATH="$(command -v git)"
+     echo "[Git] Using system Git: $GIT_PATH"
+ else
+     echo "[Error] Git not found."
+     echo
+     if [[ "$(uname)" == "Darwin" ]]; then
+         echo "Please install Git:"
+         echo " xcode-select --install"
+         echo " or: brew install git"
+     else
+         echo "Please install Git:"
+         echo " Ubuntu/Debian: sudo apt install git"
+         echo " CentOS/RHEL: sudo yum install git"
+         echo " Arch: sudo pacman -S git"
+     fi
+     echo
+     exit 1
+ fi
+ echo
+
+ # Check if this is a git repository
+ cd "$REPO_PATH"
+ if ! "$GIT_PATH" rev-parse --git-dir &>/dev/null; then
+     echo "[Error] Not a git repository."
+     echo "This folder does not appear to be a git repository."
+     echo
+     exit 1
+ fi
+
+ echo "[1/4] Checking current version..."
+ CURRENT_BRANCH="$("$GIT_PATH" rev-parse --abbrev-ref HEAD 2>/dev/null || echo "main")"
+ CURRENT_COMMIT="$("$GIT_PATH" rev-parse --short HEAD 2>/dev/null || echo "unknown")"
+
+ echo " Branch: $CURRENT_BRANCH"
+ echo " Commit: $CURRENT_COMMIT"
+ echo
+
+ echo "[2/4] Checking for updates (timeout: ${TIMEOUT_SECONDS}s)..."
+ echo " Connecting to GitHub..."
+
+ # Fetch remote with timeout (stderr visible so "Bad credentials" etc. are shown)
+ # Use GNU timeout (Linux) or gtimeout (macOS with coreutils) when available;
+ # otherwise fall back to a plain fetch without a timeout.
+ TIMEOUT_CMD=""
+ if command -v timeout &>/dev/null; then
+     TIMEOUT_CMD="timeout"
+ elif command -v gtimeout &>/dev/null; then
+     TIMEOUT_CMD="gtimeout"
+ fi
+
+ FETCH_SUCCESS=0
+ if [[ -n "$TIMEOUT_CMD" ]]; then
+     if "$TIMEOUT_CMD" "$TIMEOUT_SECONDS" "$GIT_PATH" fetch origin --quiet; then
+         FETCH_SUCCESS=1
+     fi
+ else
+     if "$GIT_PATH" fetch origin --quiet; then
+         FETCH_SUCCESS=1
+     fi
+ fi
+
+ if [[ $FETCH_SUCCESS -eq 0 ]]; then
+     echo " [Failed] Could not fetch from GitHub."
+     echo " If the error above is 'Bad credentials', update or clear stored Git credentials."
+     echo " This repo is public and does not require login: https://docs.github.com/en/get-started/getting-started-with-git/caching-your-github-credentials-in-git"
+     echo " Otherwise check your internet connection or proxy."
+     echo
+     exit 2
+ fi
+
+ echo " [Success] Fetched latest information from GitHub."
+ echo
+
+ echo "[3/4] Comparing versions..."
+ REMOTE_COMMIT="$("$GIT_PATH" rev-parse --short "origin/$CURRENT_BRANCH" 2>/dev/null || echo "")"
+
+ if [[ -z "$REMOTE_COMMIT" ]]; then
+     echo " [Warning] Remote branch 'origin/$CURRENT_BRANCH' not found."
+     echo
+     echo " Checking main branch instead..."
+     FALLBACK_BRANCH="main"
+     REMOTE_COMMIT="$("$GIT_PATH" rev-parse --short "origin/$FALLBACK_BRANCH" 2>/dev/null || echo "")"
+
+     if [[ -z "$REMOTE_COMMIT" ]]; then
+         echo " [Error] Could not find remote main branch either."
+         exit 1
+     fi
+
+     echo " Found main branch: $REMOTE_COMMIT"
+     echo
+
+     read -rp " Switch to main branch? (Y/N): " SWITCH_BRANCH
+     # Pattern match keeps this compatible with macOS's bash 3.2 (no ${var^^})
+     if [[ "$SWITCH_BRANCH" == [Yy] ]]; then
+         echo
+         echo " Switching to main branch..."
+         if "$GIT_PATH" checkout main; then
+             echo " [Success] Switched to main branch."
+             echo " Please run this script again to check for updates."
+             exit 0
+         else
+             echo " [Error] Failed to switch branch."
+             exit 1
+         fi
+     else
+         echo
+         echo " Staying on branch '$CURRENT_BRANCH'. No update performed."
+         exit 0
+     fi
+ fi
+
+ echo " Local:  $CURRENT_COMMIT"
+ echo " Remote: $REMOTE_COMMIT"
+ echo
+
+ # Compare commits
+ if [[ "$CURRENT_COMMIT" == "$REMOTE_COMMIT" ]]; then
+     echo "[4/4] Result: Already up to date!"
+     echo " You have the latest version."
+     echo
+     exit 0
+ fi
+
+ echo "[4/4] Result: Update available!"
+
+ # Check if local is behind remote
+ if "$GIT_PATH" merge-base --is-ancestor HEAD "origin/$CURRENT_BRANCH" 2>/dev/null; then
+     echo " A new version is available on GitHub."
+     echo
+
+     # Show new commits (do not suppress stderr so ref/encoding errors are visible)
+     echo " New commits:"
+     if ! "$GIT_PATH" --no-pager log --oneline --graph --decorate "HEAD..origin/$CURRENT_BRANCH"; then
+         echo " [Could not show commit log. Check branch name and network.]"
+     fi
+     echo
+
+     read -rp "Do you want to update now? (Y/N): " UPDATE_CHOICE
+     if [[ "$UPDATE_CHOICE" != [Yy] ]]; then
+         echo
+         echo "Update skipped."
+         exit 0
+     fi
+
+     echo
+     echo "Updating..."
+
+     # Refresh index
+     "$GIT_PATH" update-index --refresh &>/dev/null || true
+
+     # Check for uncommitted changes
+     if ! "$GIT_PATH" diff-index --quiet HEAD -- 2>/dev/null; then
+         echo
+         echo "[Info] Checking for potential conflicts..."
+
+         # Get locally modified files
+         LOCAL_CHANGES="$("$GIT_PATH" diff --name-only HEAD 2>/dev/null || echo "")"
+         REMOTE_CHANGES="$("$GIT_PATH" diff --name-only "HEAD..origin/$CURRENT_BRANCH" 2>/dev/null || echo "")"
+
+         # Check for conflicting files
+         HAS_CONFLICTS=0
+         BACKUP_DIR="$SCRIPT_DIR/.update_backup_$(date +%Y%m%d_%H%M%S)"
+
+         while IFS= read -r local_file; do
+             [[ -z "$local_file" ]] && continue
+             if echo "$REMOTE_CHANGES" | grep -qxF "$local_file"; then
+                 HAS_CONFLICTS=1
+
+                 # Create backup directory if not exists
+                 if [[ ! -d "$BACKUP_DIR" ]]; then
+                     mkdir -p "$BACKUP_DIR"
+                     echo
+                     echo "[Backup] Creating backup directory: $BACKUP_DIR"
+                 fi
+
+                 # Backup the file
+                 echo "[Backup] Backing up: $local_file"
+                 FILE_DIR="$(dirname "$local_file")"
+                 if [[ "$FILE_DIR" != "." ]]; then
+                     mkdir -p "$BACKUP_DIR/$FILE_DIR"
+                 fi
+                 cp "$local_file" "$BACKUP_DIR/$local_file" 2>/dev/null || true
+             fi
+         done <<< "$LOCAL_CHANGES"
+
+         if [[ $HAS_CONFLICTS -eq 1 ]]; then
+             echo
+             echo "========================================"
+             echo "[Warning] Potential conflicts detected!"
+             echo "========================================"
+             echo
+             echo "Your modified files may conflict with remote updates."
+             echo "Your changes have been backed up to:"
+             echo " $BACKUP_DIR"
+             echo
+
+             read -rp "Continue with update? (Y/N): " CONFLICT_CHOICE
+             if [[ "$CONFLICT_CHOICE" != [Yy] ]]; then
+                 echo
+                 echo "Update cancelled. Your backup remains at: $BACKUP_DIR"
+                 exit 0
+             fi
+             echo
+             echo "[Restore] Proceeding with update..."
+         else
+             echo
+             echo "[Info] No conflicts detected. Safe to stash and update."
+             echo
+
+             read -rp "Stash your changes and continue? (Y/N): " STASH_CHOICE
+             if [[ "$STASH_CHOICE" == [Yy] ]]; then
+                 echo "Stashing changes..."
+                 "$GIT_PATH" stash push -m "Auto-stash before update - $(date)"
+             else
+                 echo
+                 echo "Update cancelled."
+                 exit 0
+             fi
+         fi
+     fi
+
+     # Check for untracked files that could be overwritten
+     UNTRACKED_FILES="$("$GIT_PATH" ls-files --others --exclude-standard 2>/dev/null || echo "")"
+     STASHED_UNTRACKED=0
+
+     if [[ -n "$UNTRACKED_FILES" ]]; then
+         # Check if any untracked files conflict with incoming changes
+         REMOTE_ALL_FILES="$("$GIT_PATH" diff --name-only --diff-filter=A "HEAD..origin/$CURRENT_BRANCH" 2>/dev/null || echo "")"
+         CONFLICTING_UNTRACKED=""
+
+         while IFS= read -r ufile; do
+             [[ -z "$ufile" ]] && continue
+             if echo "$REMOTE_ALL_FILES" | grep -qxF "$ufile"; then
+                 CONFLICTING_UNTRACKED="${CONFLICTING_UNTRACKED}${ufile}"$'\n'
+             fi
+         done <<< "$UNTRACKED_FILES"
+
+         if [[ -n "$CONFLICTING_UNTRACKED" ]]; then
+             echo
+             echo "========================================"
+             echo "[Warning] Untracked files conflict with update!"
+             echo "========================================"
+             echo
+             echo "The following untracked files would be overwritten:"
+             echo "$CONFLICTING_UNTRACKED" | sed '/^$/d; s/^/ /'
+             echo
+
+             read -rp "Stash untracked files before updating? (Y/N): " STASH_UNTRACKED_CHOICE
+             if [[ "$STASH_UNTRACKED_CHOICE" != [Yy] ]]; then
+                 echo
+                 echo "Update cancelled. Please move or remove the conflicting files manually."
+                 exit 1
+             fi
+
+             echo "Stashing all changes including untracked files..."
+             if "$GIT_PATH" stash push --include-untracked -m "pre-update-$(date +%s)"; then
+                 STASHED_UNTRACKED=1
+                 echo "[Stash] Changes stashed successfully."
+             else
+                 echo "[Error] Failed to stash changes. Update aborted."
+                 exit 1
+             fi
+             echo
+         fi
+     fi
+
+     # Pull changes
+     echo "Pulling latest changes..."
+     if "$GIT_PATH" reset --hard "origin/$CURRENT_BRANCH" &>/dev/null; then
+         echo
+         echo "========================================"
+         echo "Update completed successfully!"
+         echo "========================================"
+         echo
+
+         if [[ -d "${BACKUP_DIR:-}" ]]; then
+             echo "[Important] Your modified files were backed up to:"
+             echo " $BACKUP_DIR"
+             echo
+             echo "To restore your changes:"
+             echo " 1. Run ./merge_config.sh to compare and merge files"
+             echo " 2. Or manually compare backup with new version"
+             echo
+         fi
+
+         if [[ $STASHED_UNTRACKED -eq 1 ]]; then
+             echo "[Stash] Untracked files were stashed before the update."
+             echo " To restore them: git stash pop"
+             echo " To discard them: git stash drop"
+             echo
+             echo " Note: 'git stash pop' may produce merge conflicts if"
+             echo " the update modified the same files. Resolve manually."
+             echo
+         fi
+
+         echo "Please restart the application to use the new version."
+         exit 0
+     else
+         echo
+         echo "[Error] Update failed."
+         if [[ $STASHED_UNTRACKED -eq 1 ]]; then
+             echo "[Stash] Restoring stashed changes..."
+             if "$GIT_PATH" stash pop &>/dev/null; then
+                 echo "[Stash] Changes restored successfully."
+             else
+                 echo "[Stash] Could not auto-restore. Run 'git stash pop' manually."
+             fi
+         fi
+         if [[ -d "${BACKUP_DIR:-}" ]]; then
+             echo "Your backup is still available at: $BACKUP_DIR"
+         fi
+         exit 1
+     fi
+ else
+     echo " [Warning] Local version has diverged from remote."
+     echo " This might be because you have local commits."
+     echo " Please update manually or consult the documentation."
+     exit 0
+ fi
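Aside (not part of the commit): in both update scripts, the conflict check reduces to a set intersection between files modified locally (`git diff --name-only HEAD`) and files changed upstream (`git diff --name-only HEAD..origin/<branch>`). A minimal Python sketch of that idea, with illustrative names and sample file lists:

```python
# Only files in BOTH lists need a backup before `git reset --hard`:
# locally edited AND changed by the incoming update.
def find_conflicts(local_changes: list[str], remote_changes: list[str]) -> set[str]:
    return set(local_changes) & set(remote_changes)

conflicts = find_conflicts(
    ["config.yaml", "notes.txt"],      # e.g. git diff --name-only HEAD
    ["config.yaml", "cli.py"],         # e.g. git diff --name-only HEAD..origin/main
)
print(sorted(conflicts))  # -> ['config.yaml']
```

The shell versions implement the same intersection with `grep -qxF` (bash) and `findstr /x /c:` (batch), both of which do exact whole-line matching so that `cli.py` does not falsely match `tools/cli.py`.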
cli.py ADDED
@@ -0,0 +1,1998 @@
+ import argparse
+ import re
+ import ast
+ import os
+ import sys
+ import toml
+ from pathlib import Path
+ from typing import List, Optional, Tuple
+
+ # Load environment variables from .env or .env.example (if available)
+ try:
+     from dotenv import load_dotenv
+     _current_file = os.path.abspath(__file__)
+     _project_root = os.path.dirname(_current_file)
+     _env_path = os.path.join(_project_root, '.env')
+     _env_example_path = os.path.join(_project_root, '.env.example')
+
+     if os.path.exists(_env_path):
+         load_dotenv(_env_path)
+         print(f"Loaded configuration from {_env_path}")
+     elif os.path.exists(_env_example_path):
+         load_dotenv(_env_example_path)
+         print(f"Loaded configuration from {_env_example_path} (fallback)")
+ except ImportError:
+     pass
+
+ # Clear proxy settings that may affect network behavior
+ for _proxy_var in ['http_proxy', 'https_proxy', 'HTTP_PROXY', 'HTTPS_PROXY', 'ALL_PROXY']:
+     os.environ.pop(_proxy_var, None)
+
+ def _configure_logging(
+     level: Optional[str] = None,
+     suppress_audio_tokens: Optional[bool] = None,
+ ) -> None:
+     try:
+         from loguru import logger
+     except Exception:
+         return
+
+     if suppress_audio_tokens is None:
+         suppress_audio_tokens = os.environ.get("ACE_STEP_SUPPRESS_AUDIO_TOKENS", "1") not in {"0", "false", "False"}
+     if level is None:
+         level = "INFO"
+     level = str(level).upper()
+
+     def _log_filter(record) -> bool:
+         message = record.get("message", "")
+         # Suppress duplicate DiT prompt logs (we print a single final prompt in cli.py)
+         if (
+             "DiT TEXT ENCODER INPUT" in message
+             or "text_prompt:" in message
+             or (message.strip() and set(message.strip()) == {"="})
+         ):
+             return False
+         if not suppress_audio_tokens:
+             return True
+         return "<|audio_code_" not in message
+
+     logger.remove()
+     logger.add(sys.stderr, level=level, filter=_log_filter)
+
+
+ _configure_logging()
+
+ from acestep.handler import AceStepHandler
+ from acestep.llm_inference import LLMHandler
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music, create_sample, format_sample
+ from acestep.constants import DEFAULT_DIT_INSTRUCTION, TASK_INSTRUCTIONS
+ from acestep.gpu_config import get_gpu_config, set_global_gpu_config, is_mps_platform
+ import torch
+
+
+ TRACK_CHOICES = [
+     "vocals",
+     "backing_vocals",
+     "drums",
+     "bass",
+     "guitar",
+     "keyboard",
+     "percussion",
+     "strings",
+     "synth",
+     "fx",
+     "brass",
+     "woodwinds",
+ ]
+
+
+ def _get_project_root() -> str:
+     return os.path.dirname(os.path.abspath(__file__))
+
+
+ def _parse_description_hints(description: str) -> tuple[Optional[str], bool]:
+     import re
+
+     if not description:
+         return None, False
+
+     description_lower = description.lower().strip()
+
+     language_mapping = {
+         'english': 'en', 'en': 'en',
+         'chinese': 'zh', '中文': 'zh', 'zh': 'zh', 'mandarin': 'zh',
+         'japanese': 'ja', '日本語': 'ja', 'ja': 'ja',
+         'korean': 'ko', '한국어': 'ko', 'ko': 'ko',
+         'spanish': 'es', 'español': 'es', 'es': 'es',
+         'french': 'fr', 'français': 'fr', 'fr': 'fr',
+         'german': 'de', 'deutsch': 'de', 'de': 'de',
+         'italian': 'it', 'italiano': 'it', 'it': 'it',
+         'portuguese': 'pt', 'português': 'pt', 'pt': 'pt',
+         'russian': 'ru', 'русский': 'ru', 'ru': 'ru',
+         'bengali': 'bn', 'bn': 'bn',
+         'hindi': 'hi', 'hi': 'hi',
+         'arabic': 'ar', 'ar': 'ar',
+         'thai': 'th', 'th': 'th',
+         'vietnamese': 'vi', 'vi': 'vi',
+         'indonesian': 'id', 'id': 'id',
+         'turkish': 'tr', 'tr': 'tr',
+         'dutch': 'nl', 'nl': 'nl',
+         'polish': 'pl', 'pl': 'pl',
+     }
+
+     detected_language = None
+     for lang_name, lang_code in language_mapping.items():
+         if len(lang_name) <= 2:
+             pattern = r'(?:^|\s|[.,;:!?])' + re.escape(lang_name) + r'(?:$|\s|[.,;:!?])'
+         else:
+             pattern = r'\b' + re.escape(lang_name) + r'\b'
+         if re.search(pattern, description_lower):
+             detected_language = lang_code
+             break
+
+     is_instrumental = False
+     if 'instrumental' in description_lower:
+         is_instrumental = True
+     elif 'pure music' in description_lower or 'pure instrument' in description_lower:
+         is_instrumental = True
+     elif description_lower.endswith(' solo') or description_lower == 'solo':
+         is_instrumental = True
+
+     return detected_language, is_instrumental
+
+
+ def _prompt_non_empty(prompt: str) -> str:
+     value = input(prompt).strip()
+     while not value:
+         value = input(prompt).strip()
+     return value
+
+
+ def _prompt_with_default(prompt: str, default: Optional[str] = None, required: bool = False) -> str:
+     while True:
+         suffix = f" [{default}]" if default not in (None, "") else ""
154
+ value = input(f"{prompt}{suffix}: ").strip()
155
+ if value:
156
+ return value
157
+ if default not in (None, ""):
158
+ return str(default)
159
+ if not required:
160
+ return ""
161
+ print("This value is required. Please try again.")
162
+
163
+
164
+ def _prompt_bool(prompt: str, default: bool) -> bool:
165
+ default_str = "y" if default else "n"
166
+ while True:
167
+ value = input(f"{prompt} (y/n) [default: {default_str}]: ").strip().lower()
168
+ if not value:
169
+ return default
170
+ if value in {"y", "yes", "1", "true"}:
171
+ return True
172
+ if value in {"n", "no", "0", "false"}:
173
+ return False
174
+ print("Please enter 'y' or 'n'.")
175
+
176
+
177
+ def _prompt_choice_from_list(
178
+ prompt: str,
179
+ options: List[str],
180
+ default: Optional[str] = None,
181
+ allow_custom: bool = True,
182
+ custom_validator=None,
183
+ custom_error: Optional[str] = None,
184
+ ) -> Optional[str]:
185
+ if not options:
186
+ return default
187
+ print("\n" + prompt)
188
+ for idx, option in enumerate(options, start=1):
189
+ print(f"{idx}. {option}")
190
+ default_display = default if default not in (None, "") else "auto"
191
+ while True:
192
+ choice = input(f"Choose a model (number or name) [default: {default_display}]: ").strip()
193
+ if not choice:
194
+ return None if default_display == "auto" else default
195
+ if choice.lower() == "auto":
196
+ return None
197
+ if choice.isdigit():
198
+ idx = int(choice)
199
+ if 1 <= idx <= len(options):
200
+ return options[idx - 1]
201
+ print("Invalid selection. Please choose a valid number.")
202
+ continue
203
+ if allow_custom:
204
+ if custom_validator and not custom_validator(choice):
205
+ print(custom_error or "Invalid selection. Please try again.")
206
+ continue
207
+ if choice not in options:
208
+ print("Unknown model. Using as-is.")
209
+ return choice
210
+ print("Please choose a valid option.")
211
+
212
+
213
+ def _edit_formatted_prompt_via_file(formatted_prompt: str, instruction_path: str) -> str:
214
+ """Write formatted prompt to file, wait for user edits, then read back."""
215
+ try:
216
+ with open(instruction_path, "w", encoding="utf-8") as f:
217
+ f.write(formatted_prompt)
218
+ except Exception as e:
219
+ print(f"WARNING: Failed to write {instruction_path}: {e}")
220
+ return formatted_prompt
221
+
222
+ print("\n--- Final Draft Saved ---")
223
+ print(f"Saved to {instruction_path}")
224
+ print("Edit the file now. Press Enter when ready to continue.")
225
+ input()
226
+
227
+ try:
228
+ with open(instruction_path, "r", encoding="utf-8") as f:
229
+ return f.read()
230
+ except Exception as e:
231
+ print(f"WARNING: Failed to read {instruction_path}: {e}")
232
+ return formatted_prompt
233
+
234
+
235
+ def _extract_caption_lyrics_from_formatted_prompt(formatted_prompt: str) -> Tuple[Optional[str], Optional[str]]:
236
+ """Best-effort extraction of caption/lyrics from a formatted prompt string."""
237
+ matches = list(re.finditer(r"# Caption\n(.*?)\n+# Lyric\n(.*)", formatted_prompt, re.DOTALL))
238
+ if not matches:
239
+ return None, None
240
+
241
+ caption = matches[-1].group(1).strip()
242
+ lyrics = matches[-1].group(2)
243
+
244
+ # Trim lyrics if chat-template markers appear after the user message.
245
+ cut_markers = ["<|eot_id|>", "<|start_header_id|>", "<|assistant|>", "<|user|>", "<|system|>", "<|im_end|>", "<|im_start|>"]
246
+ cut_at = len(lyrics)
247
+ for marker in cut_markers:
248
+ pos = lyrics.find(marker)
249
+ if pos != -1:
250
+ cut_at = min(cut_at, pos)
251
+ lyrics = lyrics[:cut_at].rstrip()
252
+
253
+ return caption or None, lyrics or None
254
+
255
+
256
+ def _extract_instruction_from_formatted_prompt(formatted_prompt: str) -> Optional[str]:
257
+ """Best-effort extraction of instruction text from a formatted prompt string."""
258
+ match = re.search(r"# Instruction\n(.*?)\n\n", formatted_prompt, re.DOTALL)
259
+ if not match:
260
+ return None
261
+ instruction = match.group(1).strip()
262
+ return instruction or None
263
+
264
+
265
+ def _extract_cot_metadata_from_formatted_prompt(formatted_prompt: str) -> dict:
266
+ """Best-effort extraction of COT metadata from a formatted prompt string,
267
+ supporting multi-line values.
268
+ """
269
+ matches = list(re.finditer(r"<think>\n(.*?)\n</think>", formatted_prompt, re.DOTALL))
270
+ if not matches:
271
+ return {}
272
+ block = matches[-1].group(1)
273
+ metadata = {}
274
+ current_key = None
275
+ current_value_lines = []
276
+
277
+ for line in block.splitlines():
278
+ line = line.strip()
279
+ if not line:
280
+ continue
281
+
282
+ key_match = re.match(r"^(\w+):\s*(.*)", line)
283
+ if key_match:
284
+ if current_key:
285
+ metadata[current_key] = " ".join(current_value_lines).strip()
286
+
287
+ current_key = key_match.group(1).strip().lower()
288
+ current_value_lines = [key_match.group(2).strip()]
289
+ else:
290
+ if current_key:
291
+ current_value_lines.append(line)
292
+
293
+ if current_key and current_value_lines:
294
+ metadata[current_key] = " ".join(current_value_lines).strip()
295
+
296
+ return metadata
297
+
298
+
299
+ def _parse_number(value: str) -> Optional[float]:
300
+ try:
301
+ match = re.search(r"[-+]?\d*\.?\d+", value)
302
+ if not match:
303
+ return None
304
+ return float(match.group(0))
305
+ except Exception:
306
+ return None
307
+
308
+
309
+ def _parse_timesteps_input(value) -> Optional[List[float]]:
310
+ if value is None:
311
+ return None
312
+ if isinstance(value, list):
313
+ if all(isinstance(t, (int, float)) for t in value):
314
+ return [float(t) for t in value]
315
+ return None
316
+ if not isinstance(value, str):
317
+ return None
318
+ raw = value.strip()
319
+ if not raw:
320
+ return None
321
+ if raw.startswith("[") or raw.startswith("("):
322
+ try:
323
+ parsed = ast.literal_eval(raw)
324
+ except Exception:
325
+ return None
326
+ if isinstance(parsed, list) and all(isinstance(t, (int, float)) for t in parsed):
327
+ return [float(t) for t in parsed]
328
+ return None
329
+ try:
330
+ return [float(t.strip()) for t in raw.split(",") if t.strip()]
331
+ except Exception:
332
+ return None
333
+
334
+
335
+ def _install_prompt_edit_hook(
+     llm_handler: LLMHandler,
+     instruction_path: str,
+     preloaded_prompt: Optional[str] = None,
+ ) -> None:
+     """Intercept formatted prompt generation to allow user editing before audio tokens."""
+     original = llm_handler.build_formatted_prompt_with_cot
+     cache = {}
+
+     def wrapped(caption, lyrics, cot_text, is_negative_prompt=False, negative_prompt="NO USER INPUT"):
+         prompt = original(
+             caption,
+             lyrics,
+             cot_text,
+             is_negative_prompt=is_negative_prompt,
+             negative_prompt=negative_prompt,
+         )
+         if is_negative_prompt:
+             conditional_prompt = original(
+                 caption,
+                 lyrics,
+                 cot_text,
+                 is_negative_prompt=False,
+                 negative_prompt=negative_prompt,
+             )
+             cached = cache.get(conditional_prompt)
+             if cached and (cached.get("edited_caption") or cached.get("edited_lyrics")):
+                 edited_caption = cached.get("edited_caption") or caption
+                 edited_lyrics = cached.get("edited_lyrics") or lyrics
+                 return original(
+                     edited_caption,
+                     edited_lyrics,
+                     cot_text,
+                     is_negative_prompt=True,
+                     negative_prompt=negative_prompt,
+                 )
+             return prompt
+         cached = cache.get(prompt)
+         if cached:
+             return cached["edited_prompt"]
+         if getattr(llm_handler, "_skip_prompt_edit", False):
+             cache[prompt] = {
+                 "edited_prompt": prompt,
+                 "edited_caption": None,
+                 "edited_lyrics": None,
+             }
+             return prompt
+         if preloaded_prompt is not None:
+             edited = preloaded_prompt
+         else:
+             edited = _edit_formatted_prompt_via_file(prompt, instruction_path)
+         edited_caption, edited_lyrics = _extract_caption_lyrics_from_formatted_prompt(edited)
+         if edited != prompt:
+             print("INFO: Using edited draft for audio-token prompt.")
+         if edited_caption or edited_lyrics:
+             llm_handler._edited_caption = edited_caption
+             llm_handler._edited_lyrics = edited_lyrics
+         edited_instruction = _extract_instruction_from_formatted_prompt(edited)
+         if edited_instruction:
+             llm_handler._edited_instruction = edited_instruction
+         edited_metas = _extract_cot_metadata_from_formatted_prompt(edited)
+         if edited_metas:
+             llm_handler._edited_metas = edited_metas
+         cache[prompt] = {
+             "edited_prompt": edited,
+             "edited_caption": edited_caption,
+             "edited_lyrics": edited_lyrics,
+         }
+         return edited
+
+     llm_handler.build_formatted_prompt_with_cot = wrapped
+
+
+ def _prompt_int(prompt: str, default: Optional[int] = None, min_value: Optional[int] = None,
+                 max_value: Optional[int] = None) -> Optional[int]:
+     default_display = "auto" if default is None else default
+     while True:
+         value = input(f"{prompt} [{default_display}]: ").strip()
+         if not value:
+             return default
+         try:
+             parsed = int(value)
+         except ValueError:
+             print("Invalid input. Please enter an integer.")
+             continue
+         if min_value is not None and parsed < min_value:
+             print(f"Please enter a value >= {min_value}.")
+             continue
+         if max_value is not None and parsed > max_value:
+             print(f"Please enter a value <= {max_value}.")
+             continue
+         return parsed
+
+
+ def _prompt_float(prompt: str, default: Optional[float] = None, min_value: Optional[float] = None,
+                   max_value: Optional[float] = None) -> Optional[float]:
+     default_display = "auto" if default is None else default
+     while True:
+         value = input(f"{prompt} [{default_display}]: ").strip()
+         if not value:
+             return default
+         try:
+             parsed = float(value)
+         except ValueError:
+             print("Invalid input. Please enter a number.")
+             continue
+         if min_value is not None and parsed < min_value:
+             print(f"Please enter a value >= {min_value}.")
+             continue
+         if max_value is not None and parsed > max_value:
+             print(f"Please enter a value <= {max_value}.")
+             continue
+         return parsed
+
+
+ def _prompt_existing_file(prompt: str, default: Optional[str] = None) -> str:
+     while True:
+         suffix = f" [{default}]" if default else ""
+         path = input(f"{prompt}{suffix}: ").strip()
+         if not path and default:
+             path = default
+         if os.path.isfile(path):
+             return _expand_audio_path(path)
+         print("Invalid file path. Please try again.")
+
+
+ def _expand_audio_path(path_str: Optional[str]) -> Optional[str]:
+     if not path_str or not isinstance(path_str, str):
+         return path_str
+     try:
+         return Path(path_str).expanduser().resolve(strict=False).as_posix()
+     except Exception:
+         return Path(path_str).expanduser().absolute().as_posix()
+
+
+ def _parse_bool(value: str) -> bool:
+     return str(value).lower() in {"true", "1", "yes", "y"}
+
+
+ def _resolve_device(device: str) -> str:
+     if device == "auto":
+         if hasattr(torch, 'xpu') and torch.xpu.is_available():
+             return "xpu"
+         if torch.cuda.is_available():
+             return "cuda"
+         if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
+             return "mps"
+         return "cpu"
+     return device
+
+
+ def _default_instruction_for_task(task_type: str, tracks: Optional[List[str]] = None) -> str:
+     if task_type == "lego":
+         track = tracks[0] if tracks else "guitar"
+         return TASK_INSTRUCTIONS["lego"].format(TRACK_NAME=track.upper())
+     if task_type == "extract":
+         track = tracks[0] if tracks else "vocals"
+         return TASK_INSTRUCTIONS["extract"].format(TRACK_NAME=track.upper())
+     if task_type == "complete":
+         tracks_list = ", ".join(tracks) if tracks else "drums, bass, guitar"
+         return TASK_INSTRUCTIONS["complete"].format(TRACK_CLASSES=tracks_list)
+     return DEFAULT_DIT_INSTRUCTION
+
+
+ def _apply_optional_defaults(args, params_defaults: GenerationParams, config_defaults: GenerationConfig) -> None:
+     optional_defaults = {
+         "duration": params_defaults.duration,
+         "bpm": params_defaults.bpm,
+         "keyscale": params_defaults.keyscale,
+         "timesignature": params_defaults.timesignature,
+         "vocal_language": params_defaults.vocal_language,
+         "inference_steps": params_defaults.inference_steps,
+         "seed": params_defaults.seed,
+         "guidance_scale": params_defaults.guidance_scale,
+         "use_adg": params_defaults.use_adg,
+         "cfg_interval_start": params_defaults.cfg_interval_start,
+         "cfg_interval_end": params_defaults.cfg_interval_end,
+         "shift": 3.0,
+         "infer_method": params_defaults.infer_method,
+         "timesteps": None,
+         "repainting_start": params_defaults.repainting_start,
+         "repainting_end": params_defaults.repainting_end,
+         "audio_cover_strength": params_defaults.audio_cover_strength,
+         "thinking": params_defaults.thinking,
+         "lm_temperature": params_defaults.lm_temperature,
+         "lm_cfg_scale": params_defaults.lm_cfg_scale,
+         "lm_top_k": params_defaults.lm_top_k,
+         "lm_top_p": params_defaults.lm_top_p,
+         "lm_negative_prompt": params_defaults.lm_negative_prompt,
+         "use_cot_metas": params_defaults.use_cot_metas,
+         "use_cot_caption": params_defaults.use_cot_caption,
+         "use_cot_lyrics": params_defaults.use_cot_lyrics,
+         "use_cot_language": params_defaults.use_cot_language,
+         "use_constrained_decoding": params_defaults.use_constrained_decoding,
+         "batch_size": config_defaults.batch_size,
+         "allow_lm_batch": config_defaults.allow_lm_batch,
+         "use_random_seed": config_defaults.use_random_seed,
+         "seeds": config_defaults.seeds,
+         "lm_batch_chunk_size": config_defaults.lm_batch_chunk_size,
+         "constrained_decoding_debug": config_defaults.constrained_decoding_debug,
+         "audio_format": config_defaults.audio_format,
+         "sample_mode": False,
+         "sample_query": "",
+         "use_format": False,
+     }
+
+     for key, default_value in optional_defaults.items():
+         if getattr(args, key, None) is None:
+             setattr(args, key, default_value)
+
+
+ def _summarize_lyrics(lyrics: Optional[str]) -> str:
+     if not lyrics:
+         return "none"
+     if isinstance(lyrics, str):
+         stripped = lyrics.strip()
+         if not stripped:
+             return "none"
+         if os.path.isfile(stripped):
+             return f"file: {os.path.basename(stripped)}"
+         if len(stripped) <= 60:
+             return stripped.replace("\n", " ")
+         return f"text ({len(stripped)} chars)"
+     return "provided"
+
+
+ def _print_final_parameters(
+     args,
+     params: GenerationParams,
+     config: GenerationConfig,
+     params_defaults: GenerationParams,
+     config_defaults: GenerationConfig,
+     compact: bool,
+     resolved_device: Optional[str] = None,
+ ) -> None:
+     if not compact:
+         print("\n--- Final Parameters (Args) ---")
+         for k in sorted(vars(args).keys()):
+             print(f"{k}: {getattr(args, k)}")
+         print("------------------------------")
+         print("\n--- Final Parameters (GenerationParams) ---")
+         for k in sorted(vars(params).keys()):
+             print(f"{k}: {getattr(params, k)}")
+         print("-------------------------------------------")
+         print("\n--- Final Parameters (GenerationConfig) ---")
+         for k in sorted(vars(config).keys()):
+             print(f"{k}: {getattr(config, k)}")
+         print("-------------------------------------------\n")
+         return
+
+     device_display = args.device
+     if resolved_device and resolved_device != args.device:
+         device_display = f"{args.device} -> {resolved_device}"
+
+     print("\n--- Final Parameters (Summary) ---")
+     print(f"task_type: {params.task_type}")
+     print(f"caption: {params.caption or 'none'}")
+     print(f"lyrics: {_summarize_lyrics(params.lyrics)}")
+     print(f"duration: {params.duration}s")
+     print(f"outputs: {config.batch_size}")
+     if params.bpm not in (None, params_defaults.bpm):
+         print(f"bpm: {params.bpm}")
+     if params.keyscale not in (None, params_defaults.keyscale):
+         print(f"keyscale: {params.keyscale}")
+     if params.timesignature not in (None, params_defaults.timesignature):
+         print(f"timesignature: {params.timesignature}")
+     print(f"instrumental: {params.instrumental}")
+     print(f"thinking: {params.thinking}")
+     print(f"lm_model: {args.lm_model_path or 'auto'}")
+     print(f"dit_model: {args.config_path or 'auto'}")
+     print(f"backend: {args.backend}")
+     print(f"device: {device_display}")
+     print(f"audio_format: {config.audio_format}")
+     print(f"save_dir: {args.save_dir}")
+     if config.seeds:
+         print(f"seeds: {config.seeds}")
+     else:
+         print(f"seed: {params.seed} (random={config.use_random_seed})")
+     print("-------------------------------\n")
+
+
+ def _build_meta_dict(params: GenerationParams) -> Optional[dict]:
617
+ meta = {}
618
+ if params.bpm is not None:
619
+ meta["bpm"] = params.bpm
620
+ if params.timesignature:
621
+ meta["timesignature"] = params.timesignature
622
+ if params.keyscale:
623
+ meta["keyscale"] = params.keyscale
624
+ if params.duration is not None:
625
+ meta["duration"] = params.duration
626
+ return meta or None
627
+
628
+
629
+ def _print_dit_prompt(dit_handler: "AceStepHandler", params: GenerationParams) -> None:
630
+ meta = _build_meta_dict(params)
631
+ caption_input, lyrics_input = dit_handler.build_dit_inputs(
632
+ task=params.task_type,
633
+ instruction=params.instruction,
634
+ caption=params.caption or "",
635
+ lyrics=params.lyrics or "",
636
+ metas=meta,
637
+ vocal_language=params.vocal_language or "unknown",
638
+ )
639
+ print("\n--- Final DiT Prompt (Caption Branch) ---")
640
+ print(caption_input)
641
+ print("\n--- Final DiT Prompt (Lyrics Branch) ---")
642
+ print(lyrics_input)
643
+ print("----------------------------------------\n")
644
+
645
+
646
+ def run_wizard(args, configure_only: bool = False, default_config_path: Optional[str] = None,
647
+ params_defaults: Optional[GenerationParams] = None,
648
+ config_defaults: Optional[GenerationConfig] = None):
649
+ """
650
+ Runs an interactive wizard to set generation parameters.
651
+ """
652
+ print("Welcome to the ACE-Step Music Generation Wizard!")
653
+ print("This will guide you through creating your music.")
654
+ print("Press Ctrl+C at any time to exit.")
655
+ print("Note: Required models will be auto-downloaded if missing.")
656
+ print("-" * 30)
657
+
658
+ try:
659
+ # Task selection
660
+ print("\n--- Task Type ---")
661
+ print("1. text2music - generate music from text/lyrics.")
662
+ print("2. cover - transform existing audio into a new style.")
663
+ print("3. repaint - regenerate a specific time segment of audio.")
664
+ print("4. lego - generate a specific instrument track in context.")
665
+ print("5. extract - isolate a specific instrument track from a mix.")
666
+ print("6. complete - complete/extend partial tracks with new instruments.")
667
+ task_map = {
668
+ "1": "text2music",
669
+ "2": "cover",
670
+ "3": "repaint",
671
+ "4": "lego",
672
+ "5": "extract",
673
+ "6": "complete",
674
+ }
675
+ current_task = args.task_type or "text2music"
676
+ task_default = next((k for k, v in task_map.items() if v == current_task), "1")
677
+ task_choice = input(f"Choose a task (1-6) [default: {task_default}]: ").strip()
678
+ if not task_choice:
679
+ task_choice = task_default
680
+ args.task_type = task_map.get(task_choice, "text2music")
681
+ if args.task_type in {"lego", "extract", "complete"}:
682
+ print("Note: This task requires a base DiT model (acestep-v15-base). It will be auto-downloaded if missing.")
683
+
684
+ # Model selection (DiT)
685
+ dit_handler = AceStepHandler()
686
+ available_dit_models = dit_handler.get_available_acestep_v15_models()
687
+ base_only = args.task_type in {"lego", "extract", "complete"}
688
+ if base_only and available_dit_models:
689
+ available_dit_models = [m for m in available_dit_models if "base" in m.lower()]
690
+
691
+ if base_only and args.config_path and "base" not in str(args.config_path).lower():
692
+ args.config_path = None
693
+
694
+ if base_only:
695
+ if available_dit_models:
696
+ if args.config_path in available_dit_models:
697
+ selected = args.config_path
698
+ else:
699
+ selected = available_dit_models[0]
700
+ args.config_path = selected
701
+ print(f"\nNote: This task requires a base model. Using: {selected}")
702
+ else:
703
+ print("\nNote: This task requires a base model (e.g., 'acestep-v15-base'). It will be auto-downloaded if missing.")
704
+ elif available_dit_models:
705
+ selected = _prompt_choice_from_list(
706
+ "--- Available DiT Models ---",
707
+ available_dit_models,
708
+ default=args.config_path,
709
+ allow_custom=True,
710
+ )
711
+ if selected is not None:
712
+ args.config_path = selected
713
+ else:
714
+ print("\nNote: No local DiT models found. The main model will be auto-downloaded during initialization.")
715
+
716
+ # Model selection (LM)
717
+ llm_handler = LLMHandler()
718
+ available_lm_models = llm_handler.get_available_5hz_lm_models()
719
+ if available_lm_models:
720
+ selected_lm = _prompt_choice_from_list(
721
+ "--- Available LM Models ---",
722
+ available_lm_models,
723
+ default=args.lm_model_path,
724
+ allow_custom=True,
725
+ )
726
+ if selected_lm is not None:
727
+ args.lm_model_path = selected_lm
728
+ else:
729
+ print("\nNote: No local LM models found. If LM features are enabled, a default LM will be auto-downloaded.")
730
+
731
+ # Task-specific inputs
732
+ if args.task_type in {"cover", "repaint", "lego", "extract", "complete"}:
733
+ args.src_audio = _prompt_existing_file("Enter path to source audio file", default=args.src_audio)
734
+
735
+ if args.task_type == "repaint":
736
+ args.repainting_start = _prompt_float(
737
+ "Repaint start time in seconds", args.repainting_start
738
+ )
739
+ args.repainting_end = _prompt_float(
740
+ "Repaint end time in seconds", args.repainting_end
741
+ )
742
+
743
+ if args.task_type in {"lego", "extract"}:
744
+ print("\nAvailable tracks:")
745
+ print(", ".join(TRACK_CHOICES))
746
+ track_default = args.lego_track if args.task_type == "lego" else args.extract_track
747
+ track = _prompt_with_default("Choose a track", track_default, required=True)
748
+ if track not in TRACK_CHOICES:
749
+ print("Unknown track. Using as-is.")
750
+ if args.task_type == "lego":
751
+ args.lego_track = track
752
+ else:
753
+ args.extract_track = track
754
+ if not args.instruction or args.instruction == DEFAULT_DIT_INSTRUCTION:
755
+ args.instruction = _default_instruction_for_task(args.task_type, [track])
756
+ args.instruction = _prompt_with_default("Instruction", args.instruction, required=True)
757
+
758
+ if args.task_type == "complete":
759
+ print("\nAvailable tracks:")
760
+ print(", ".join(TRACK_CHOICES))
761
+ tracks_raw = _prompt_with_default("Choose tracks (comma-separated)", args.complete_tracks, required=True)
762
+ tracks = [t.strip() for t in tracks_raw.split(",") if t.strip()]
763
+ args.complete_tracks = ",".join(tracks)
764
+ if not args.instruction or args.instruction == DEFAULT_DIT_INSTRUCTION:
765
+ args.instruction = _default_instruction_for_task(args.task_type, tracks)
766
+ args.instruction = _prompt_with_default("Instruction", args.instruction, required=True)
767
+
768
+ if args.task_type in {"cover", "repaint", "lego", "complete"}:
769
+ args.caption = _prompt_with_default(
770
+ "Enter a music description (e.g., 'upbeat electronic dance music')",
771
+ args.caption,
772
+ required=True,
773
+ )
774
+ elif args.task_type == "text2music":
775
+ args.sample_mode = _prompt_bool("Use Simple Mode (auto-generate caption/lyrics via LM)", args.sample_mode)
776
+ if args.sample_mode:
777
+ args.sample_query = _prompt_with_default(
778
+ "Describe the music you want (for auto-generation)",
779
+ args.sample_query,
780
+ required=False,
781
+ )
782
+ if not args.sample_mode:
783
+ caption = _prompt_with_default(
784
+ "Enter a music description (optional if you provide lyrics)",
785
+ args.caption,
786
+ required=False,
787
+ )
788
+ if caption:
789
+ args.caption = caption
790
+
791
+ # Lyrics
792
+ if args.task_type in {"text2music", "cover", "repaint", "lego", "complete"} and not args.sample_mode:
793
+ print("\n--- Lyrics Options ---")
794
+ print("1. Instrumental (no lyrics).")
795
+ print("2. Generate lyrics automatically.")
796
+ print("3. Provide path to a .txt file.")
797
+ print("4. Paste lyrics directly.")
798
+
799
+ if args.instrumental or args.lyrics == "[Instrumental]":
800
+ default_choice = "1"
801
+ elif args.use_cot_lyrics:
802
+ default_choice = "2"
803
+ elif args.lyrics and isinstance(args.lyrics, str) and os.path.isfile(args.lyrics):
804
+ default_choice = "3"
805
+ elif args.lyrics:
806
+ default_choice = "4"
807
+ else:
808
+ default_choice = "1"
809
+ choice = input(f"Your choice (1-4) [default: {default_choice}]: ").strip()
810
+ if not choice:
811
+ choice = default_choice
812
+
813
+ if choice == "1": # Instrumental
814
+ args.instrumental = True
815
+ args.lyrics = "[Instrumental]"
816
+ args.use_cot_lyrics = False
817
+ print("Instrumental music will be generated.")
818
+ elif choice == "2": # Generate lyrics automatically
819
+ args.use_cot_lyrics = True
820
+ args.lyrics = ""
821
+ args.instrumental = False
822
+ print("Lyrics will be generated automatically.")
823
+ elif choice == "3":
824
+ args.instrumental = False
825
+ args.use_cot_lyrics = False
826
+ default_lyrics_path = args.lyrics if isinstance(args.lyrics, str) and os.path.isfile(args.lyrics) else None
827
+ while True:
828
+ lyrics_path = _prompt_existing_file("Please enter the path to your .txt lyrics file", default_lyrics_path)
829
+ if lyrics_path.endswith('.txt'):
830
+ args.lyrics = lyrics_path
831
+ print(f"Lyrics will be loaded from: {lyrics_path}")
832
+ break
833
+ print("Invalid file path or not a .txt file. Please try again.")
834
+ elif choice == "4":
835
+ args.instrumental = False
836
+ args.use_cot_lyrics = False
837
+ default_lyrics = args.lyrics if isinstance(args.lyrics, str) and args.lyrics and not os.path.isfile(args.lyrics) else None
838
+ args.lyrics = _prompt_with_default("Paste lyrics (single line or use \\n)", default_lyrics, required=True)
839
+
840
+ if not args.instrumental:
841
+ lang = _prompt_with_default(
842
+ "Vocal language (e.g., 'en', 'zh', 'unknown')",
843
+ args.vocal_language,
844
+ required=False
845
+ ).lower()
846
+ if lang:
847
+ args.vocal_language = lang
848
+
849
+ if args.use_cot_lyrics:
850
+ if not args.caption:
851
+ args.caption = _prompt_non_empty("Enter a music description for lyric generation: ")
852
+ if not args.thinking:
853
+ print("INFO: Automatic lyric generation requires the LM handler. Enabling LM 'thinking'.")
854
+ args.thinking = True
855
+
856
+ args.batch_size = _prompt_int(
857
+ "Number of outputs (audio clips) to generate",
858
+ args.batch_size if args.batch_size is not None else 2,
859
+ min_value=1,
860
+ )
861
+
862
+ advanced = input("\nConfigure advanced parameters? (y/n) [default: n]: ").lower()
863
+ if advanced == 'y':
864
+ if args.task_type == "text2music" and not args.sample_mode:
865
+ args.use_format = _prompt_bool("Use format_sample to enhance caption/lyrics", args.use_format)
866
+ print("\n--- Optional Metadata ---")
867
+ args.duration = _prompt_float("Duration in seconds (10-600)", args.duration, min_value=10, max_value=600)
868
+ args.bpm = _prompt_int("BPM (30-300, empty for auto)", args.bpm, min_value=30, max_value=300)
869
+ args.keyscale = _prompt_with_default("Keyscale (e.g., 'C Major', empty for auto)", args.keyscale)
870
+ args.timesignature = _prompt_with_default("Time signature (e.g., '4/4', empty for auto)", args.timesignature)
871
+ args.vocal_language = _prompt_with_default("Vocal language (e.g., 'en', 'zh', 'unknown')", args.vocal_language)
+
+ print("\n--- Advanced DiT Settings ---")
+ args.seed = _prompt_int("Random seed (-1 for random)", args.seed)
+ args.inference_steps = _prompt_int("Inference steps", args.inference_steps, min_value=1)
+ if args.config_path and 'base' in args.config_path:
+ args.guidance_scale = _prompt_float("Guidance scale (for base models)", args.guidance_scale)
+ args.use_adg = _prompt_bool("Enable Adaptive Dual Guidance (ADG)", args.use_adg)
+ args.cfg_interval_start = _prompt_float("CFG interval start (0.0-1.0)", args.cfg_interval_start, 0.0, 1.0)
+ args.cfg_interval_end = _prompt_float("CFG interval end (0.0-1.0)", args.cfg_interval_end, 0.0, 1.0)
+ args.shift = _prompt_float("Timestep shift (1.0-5.0)", args.shift, 1.0, 5.0)
+ args.infer_method = _prompt_with_default("Inference method (ode/sde)", args.infer_method)
+ timesteps_input = _prompt_with_default(
+ "Custom timesteps list (e.g., [0.97, 0.5, 0])",
+ args.timesteps,
+ required=False,
+ )
+ if timesteps_input:
+ args.timesteps = timesteps_input
+
+ if args.task_type == "cover":
+ args.audio_cover_strength = _prompt_float(
+ "Audio cover strength (0.0-1.0)", args.audio_cover_strength, 0.0, 1.0
+ )
+
+ print("\n--- Advanced LM Settings ---")
+ args.thinking = _prompt_bool("Enable LM 'thinking'", args.thinking)
+ args.lm_temperature = _prompt_float("LM temperature (0.0-2.0)", args.lm_temperature, 0.0, 2.0)
+ args.lm_cfg_scale = _prompt_float("LM CFG scale", args.lm_cfg_scale)
+ args.lm_top_k = _prompt_int("LM top-k (0 disables)", args.lm_top_k, min_value=0)
+ args.lm_top_p = _prompt_float("LM top-p (0.0-1.0)", args.lm_top_p, 0.0, 1.0)
+ args.lm_negative_prompt = _prompt_with_default("LM negative prompt", args.lm_negative_prompt)
+ args.use_cot_metas = _prompt_bool("Use CoT for metadata", args.use_cot_metas)
+ args.use_cot_caption = _prompt_bool("Use CoT for caption refinement", args.use_cot_caption)
+ args.use_cot_lyrics = _prompt_bool("Use CoT for lyrics generation", args.use_cot_lyrics)
+ args.use_cot_language = _prompt_bool("Use CoT for language detection", args.use_cot_language)
+ args.use_constrained_decoding = _prompt_bool("Use constrained decoding", args.use_constrained_decoding)
+
+ print("\n--- Output Settings ---")
+ args.save_dir = _prompt_with_default("Save directory", args.save_dir)
+ args.audio_format = _prompt_with_default("Audio format (mp3/wav/flac)", args.audio_format)
+ # Batch size already captured above.
+ args.use_random_seed = _prompt_bool("Use random seed per batch", args.use_random_seed)
+ seeds_input = _prompt_with_default(
+ "Custom seeds (comma/space separated, leave empty for random)",
+ "",
+ required=False,
+ )
+ if seeds_input:
+ seeds = [s for s in seeds_input.replace(",", " ").split() if s.strip()]
+ try:
+ args.seeds = [int(s) for s in seeds]
+ except ValueError:
+ print("Invalid seeds input. Ignoring custom seeds.")
+ args.allow_lm_batch = _prompt_bool("Allow LM batch processing", args.allow_lm_batch)
+ args.lm_batch_chunk_size = _prompt_int("LM batch chunk size", args.lm_batch_chunk_size, min_value=1)
+ args.constrained_decoding_debug = _prompt_bool("Constrained decoding debug", args.constrained_decoding_debug)
+ else:
+ if params_defaults and config_defaults:
+ _apply_optional_defaults(args, params_defaults, config_defaults)
+
+ # Ensure LM thinking is enabled when lyric generation is requested.
+ if args.use_cot_lyrics and not args.thinking:
+ print("INFO: Automatic lyric generation requires the LM handler. Enabling LM 'thinking'.")
+ args.thinking = True
+
+ print("\n--- Summary ---")
+ print(f"Task: {args.task_type}")
+ if args.caption:
+ print(f"Description: {args.caption}")
+ if args.task_type in {"lego", "extract", "complete"}:
+ print(f"Instruction: {args.instruction}")
+ if args.src_audio:
+ print(f"Source audio: {args.src_audio}")
+ print(f"Duration: {args.duration}s")
+ print(f"Outputs: {args.batch_size}")
+ if args.instrumental:
+ print("Lyrics: Instrumental")
+ elif args.use_cot_lyrics:
+ print(f"Lyrics: Auto-generated ({args.vocal_language})")
+ elif args.lyrics and os.path.isfile(args.lyrics):
+ print(f"Lyrics: Provided from file ({args.lyrics})")
+ elif args.lyrics:
+ print("Lyrics: Provided as text")
+
+ print("-" * 30)
+ if not configure_only:
+ confirm = input("Start generation with these settings? (y/n) [default: y]: ").lower()
+ if confirm == 'n':
+ print("Generation cancelled.")
+ sys.exit(0)
+
+ default_filename = default_config_path or "config.toml"
+ config_filename = input(f"\nEnter filename to save configuration [{default_filename}]: ")
+ if not config_filename:
+ config_filename = default_filename
+ if not config_filename.endswith(".toml"):
+ config_filename += ".toml"
+
+ try:
+ config_to_save = {
+ k: v for k, v in vars(args).items()
+ if k not in ['config'] and not k.startswith('_')
+ }
+ with open(config_filename, 'w') as f:
+ toml.dump(config_to_save, f)
+ print(f"Configuration saved to {config_filename}")
+ print(f"You can reuse it next time with: python cli.py -c {config_filename}")
+ except Exception as e:
+ print(f"Error saving configuration: {e}. Please try again.")
+
+ except (KeyboardInterrupt, EOFError):
+ print("\nWizard cancelled. Exiting.")
+ sys.exit(0)
+
+ return args, not configure_only
+
+
+ def main():
+ """
+ Main function to run ACE-Step music generation from the command line.
+ """
+
+ gpu_config = get_gpu_config()
+ set_global_gpu_config(gpu_config)
+ mps_available = is_mps_platform()
+ # Mac (Apple Silicon) uses unified memory — offloading provides no benefit
+ auto_offload = (not mps_available) and gpu_config.gpu_memory_gb > 0 and gpu_config.gpu_memory_gb < 16
+ print(f"\n{'='*60}")
+ print("GPU Configuration Detected:")
+ print(f"{'='*60}")
+ print(f" GPU Memory: {gpu_config.gpu_memory_gb:.2f} GiB")
+ print(f" Configuration Tier: {gpu_config.tier}")
+ print(f" Max Duration (with LM): {gpu_config.max_duration_with_lm}s ({gpu_config.max_duration_with_lm // 60} min)")
+ print(f" Max Duration (without LM): {gpu_config.max_duration_without_lm}s ({gpu_config.max_duration_without_lm // 60} min)")
+ print(f" Max Batch Size (with LM): {gpu_config.max_batch_size_with_lm}")
+ print(f" Max Batch Size (without LM): {gpu_config.max_batch_size_without_lm}")
+ print(f" Default LM Init: {gpu_config.init_lm_default}")
+ print(f" Available LM Models: {gpu_config.available_lm_models or 'None'}")
+ print(f"{'='*60}\n")
+
+ if auto_offload:
+ print("Auto-enabling CPU offload (GPU < 16GB)")
+ elif gpu_config.gpu_memory_gb > 0:
+ print("CPU offload disabled by default (GPU >= 16GB)")
+ elif mps_available:
+ print("MPS detected, running on Apple GPU")
+ else:
+ print("No GPU detected, running on CPU")
+
+ params_defaults = GenerationParams()
+ config_defaults = GenerationConfig()
+
+ parser = argparse.ArgumentParser(
+ description="ACE-Step 1.5: Music generation (wizard/config only).",
+ formatter_class=argparse.ArgumentDefaultsHelpFormatter
+ )
+ parser.add_argument("-c", "--config", type=str, help="Path to a TOML configuration file to load.")
+ parser.add_argument("--configure", action="store_true", help="Run wizard to save configuration without generating.")
+ parser.add_argument(
+ "--backend",
+ type=str,
+ default=None,
+ choices=["vllm", "pt", "mlx"],
+ help="5Hz LM backend. Auto-detected if not specified: 'mlx' on Apple Silicon, 'vllm' on CUDA, 'pt' otherwise.",
+ )
+ parser.add_argument(
+ "--log-level",
+ type=str,
+ default="INFO",
+ help="Logging level for internal modules (TRACE/DEBUG/INFO/WARNING/ERROR/CRITICAL).",
+ )
+ cli_args = parser.parse_args()
+
+ _configure_logging(level=cli_args.log_level)
+
+ default_batch_size = 1 if not cli_args.config else config_defaults.batch_size
+
+ # Auto-detect MLX on Apple Silicon, fall back to vllm
+ if mps_available:
+ try:
+ import mlx.core # noqa: F401
+ default_backend = "mlx"
+ print("Apple Silicon detected with MLX available. Using MLX backend.")
+ except ImportError:
+ default_backend = "vllm"
+ else:
+ default_backend = "vllm"
+
+ defaults = {
+ "project_root": _get_project_root(),
+ "config_path": None,
+ "checkpoint_dir": os.path.join(_get_project_root(), "checkpoints"),
+ "lm_model_path": None,
+ "backend": default_backend,
+ "device": "auto",
+ "use_flash_attention": None,
+ "offload_to_cpu": auto_offload,
+ "offload_dit_to_cpu": False,
+ "save_dir": "output",
+ "audio_format": config_defaults.audio_format,
+ "caption": "",
+ "prompt": "",
+ "lyrics": None,
+ "duration": params_defaults.duration,
+ "instrumental": False,
+ "bpm": params_defaults.bpm,
+ "keyscale": params_defaults.keyscale,
+ "timesignature": params_defaults.timesignature,
+ "vocal_language": params_defaults.vocal_language,
+ "task_type": params_defaults.task_type,
+ "instruction": params_defaults.instruction,
+ "reference_audio": params_defaults.reference_audio,
+ "src_audio": params_defaults.src_audio,
+ "repainting_start": params_defaults.repainting_start,
+ "repainting_end": params_defaults.repainting_end,
+ "audio_cover_strength": params_defaults.audio_cover_strength,
+ "lego_track": "",
+ "extract_track": "",
+ "complete_tracks": "",
+ "sample_mode": False,
+ "sample_query": "",
+ "use_format": False,
+ "inference_steps": params_defaults.inference_steps,
+ "seed": params_defaults.seed,
+ "guidance_scale": params_defaults.guidance_scale,
+ "use_adg": params_defaults.use_adg,
+ "shift": 3.0,
+ "infer_method": params_defaults.infer_method,
+ "timesteps": None,
+ "thinking": gpu_config.init_lm_default,
+ "lm_temperature": params_defaults.lm_temperature,
+ "lm_cfg_scale": params_defaults.lm_cfg_scale,
+ "lm_top_k": params_defaults.lm_top_k,
+ "lm_top_p": params_defaults.lm_top_p,
+ "use_cot_metas": params_defaults.use_cot_metas,
+ "use_cot_caption": params_defaults.use_cot_caption,
+ "use_cot_lyrics": params_defaults.use_cot_lyrics,
+ "use_cot_language": params_defaults.use_cot_language,
+ "use_constrained_decoding": params_defaults.use_constrained_decoding,
+ "batch_size": default_batch_size,
+ "seeds": None,
+ "use_random_seed": config_defaults.use_random_seed,
+ "allow_lm_batch": config_defaults.allow_lm_batch,
+ "lm_batch_chunk_size": config_defaults.lm_batch_chunk_size,
+ "constrained_decoding_debug": config_defaults.constrained_decoding_debug,
+ "audio_codes": "",
+ "cfg_interval_start": params_defaults.cfg_interval_start,
+ "cfg_interval_end": params_defaults.cfg_interval_end,
+ "lm_negative_prompt": params_defaults.lm_negative_prompt,
+ "log_level": cli_args.log_level,
+ }
+
+ args = argparse.Namespace(**defaults)
+ args.config = None
+ if cli_args.config:
+ if not os.path.exists(cli_args.config):
+ parser.error(f"Config file not found: {cli_args.config}")
+ try:
+ with open(cli_args.config, 'r') as f:
+ config_from_file = toml.load(f)
+ print(f"Configuration loaded from {cli_args.config}")
+ except Exception as e:
+ parser.error(f"Error loading TOML config file {cli_args.config}: {e}")
+ for key, value in config_from_file.items():
+ setattr(args, key, value)
+ args.config = cli_args.config
+
+ # CLI --backend overrides config file and auto-detection
+ if cli_args.backend is not None:
+ args.backend = cli_args.backend
+
+ if cli_args.configure:
+ args, _ = run_wizard(
+ args,
+ configure_only=True,
+ default_config_path=cli_args.config,
+ params_defaults=params_defaults,
+ config_defaults=config_defaults,
+ )
+ print("Configuration complete. Exiting without generation.")
+ sys.exit(0)
+
+ if not cli_args.config:
+ args, should_generate = run_wizard(
+ args,
+ configure_only=False,
+ default_config_path=None,
+ params_defaults=params_defaults,
+ config_defaults=config_defaults,
+ )
+ if not should_generate:
+ print("Configuration complete. Exiting without generation.")
+ sys.exit(0)
+
+ # --- Post-parsing Setup ---
+ if args.use_cot_lyrics and not args.thinking:
+ print("INFO: Automatic lyric generation requires the LM handler. Forcing --thinking=True.")
+ args.thinking = True
+
+ if not args.project_root:
+ args.project_root = _get_project_root()
+ else:
+ args.project_root = os.path.abspath(os.path.expanduser(str(args.project_root)))
+
+ if args.checkpoint_dir:
+ args.checkpoint_dir = os.path.expanduser(str(args.checkpoint_dir))
+ if not os.path.isabs(args.checkpoint_dir):
+ args.checkpoint_dir = os.path.join(args.project_root, args.checkpoint_dir)
+
+ if args.src_audio:
+ args.src_audio = _expand_audio_path(args.src_audio)
+ if args.reference_audio:
+ args.reference_audio = _expand_audio_path(args.reference_audio)
+
+ device = _resolve_device(args.device)
+
+ # --- Argument Post-processing ---
+ try:
+ timesteps = _parse_timesteps_input(args.timesteps)
+ if args.timesteps and timesteps is None:
+ raise ValueError("Timesteps must be a list of numbers or a comma-separated string.")
+ except ValueError as e:
+ parser.error(f"Invalid format for timesteps. Expected a list of numbers (e.g., '[1.0, 0.5, 0.0]' or '0.97,0.5,0'). Error: {e}")
+
+ if args.seeds:
+ args.batch_size = len(args.seeds)
+ args.use_random_seed = False
+ args.seed = -1
+
+ if args.instrumental and not args.lyrics:
+ args.lyrics = "[Instrumental]"
+ elif isinstance(args.lyrics, str) and args.lyrics.strip().lower() in {"[inst]", "[instrumental]"}:
+ args.instrumental = True
+
+ # --- Task-specific validation and instruction helpers ---
+ if args.task_type in {"cover", "repaint", "lego", "extract", "complete"}:
+ if not args.src_audio:
+ parser.error(f"--src_audio is required for task_type '{args.task_type}'.")
+
+ if args.task_type in {"cover", "repaint", "lego", "complete"}:
+ if not args.caption:
+ parser.error(f"--caption is required for task_type '{args.task_type}'.")
+
+ if args.task_type == "text2music":
+ if not args.caption and not args.lyrics:
+ if not args.sample_mode and not args.sample_query:
+ parser.error("--caption or --lyrics is required for text2music.")
+ if args.use_cot_lyrics and not args.caption:
+ parser.error("--use_cot_lyrics requires --caption for lyric generation.")
+ if args.sample_mode or args.sample_query:
+ args.sample_mode = True
+ else:
+ if args.sample_mode or args.sample_query:
+ parser.error("--sample_mode/sample_query are only supported for task_type 'text2music'.")
+
+ if args.sample_mode and args.use_cot_lyrics:
+ print("INFO: sample_mode enabled. Disabling --use_cot_lyrics.")
+ args.use_cot_lyrics = False
+
+ # Auto-select instruction based on task_type if user didn't provide a custom instruction.
+ # Align with api_server behavior and TASK_INSTRUCTIONS defaults.
+ if args.instruction == DEFAULT_DIT_INSTRUCTION and args.task_type in TASK_INSTRUCTIONS:
+ if args.task_type in {"text2music", "cover", "repaint"}:
+ args.instruction = TASK_INSTRUCTIONS[args.task_type]
+
+ # Base-model-only task enforcement
+ base_only_tasks = {"lego", "extract", "complete"}
+ if args.task_type in base_only_tasks and args.config_path:
+ if "base" not in str(args.config_path).lower():
+ parser.error(f"task_type '{args.task_type}' requires a base model config (e.g., 'acestep-v15-base').")
+
+ if args.task_type == "repaint":
+ if args.repainting_end != -1 and args.repainting_end <= args.repainting_start:
+ parser.error("--repainting_end must be greater than --repainting_start (or -1).")
+
+ if args.task_type in {"lego", "extract", "complete"}:
+ has_custom_instruction = bool(args.instruction and args.instruction.strip() and args.instruction.strip() != params_defaults.instruction)
+ if not has_custom_instruction:
+ if args.task_type == "lego":
+ if not args.lego_track:
+ parser.error("--instruction or --lego_track is required for lego task.")
+ args.instruction = _default_instruction_for_task("lego", [args.lego_track.strip()])
+ elif args.task_type == "extract":
+ if not args.extract_track:
+ parser.error("--instruction or --extract_track is required for extract task.")
+ args.instruction = _default_instruction_for_task("extract", [args.extract_track.strip()])
+ elif args.task_type == "complete":
+ if not args.complete_tracks:
+ parser.error("--instruction or --complete_tracks is required for complete task.")
+ tracks = [t.strip() for t in args.complete_tracks.split(",") if t.strip()]
+ if not tracks:
+ parser.error("--complete_tracks must contain at least one track.")
+ args.instruction = _default_instruction_for_task("complete", tracks)
+
+ # Handle lyrics argument
+ lyrics_arg = args.lyrics
+ if isinstance(lyrics_arg, str) and lyrics_arg:
+ lyrics_arg = os.path.expanduser(lyrics_arg)
+ if not os.path.isabs(lyrics_arg):
+ # Resolve relative lyrics path against config file location first, then project_root.
+ resolved = None
+ if args.config:
+ config_dir = os.path.dirname(os.path.abspath(args.config))
+ candidate = os.path.join(config_dir, lyrics_arg)
+ if os.path.isfile(candidate):
+ resolved = candidate
+ if resolved is None and args.project_root:
+ candidate = os.path.join(os.path.abspath(args.project_root), lyrics_arg)
+ if os.path.isfile(candidate):
+ resolved = candidate
+ if resolved is not None:
+ lyrics_arg = resolved
+
+ if lyrics_arg is not None:
+ if lyrics_arg == "generate":
+ args.use_cot_lyrics = True
+ args.lyrics = ""
+ print("Lyrics generation enabled.")
+ elif os.path.isfile(lyrics_arg):
+ print(f"INFO: Attempting to load lyrics from file: {lyrics_arg}")
+ try:
+ with open(lyrics_arg, 'r', encoding='utf-8') as f:
+ args.lyrics = f.read()
+ print(f"Lyrics loaded from file: {lyrics_arg}")
+ except Exception as e:
+ parser.error(f"Could not read lyrics file {lyrics_arg}. Error: {e}")
+ # else: lyrics is a string, use as is.
+
+ # --- Handler Initialization ---
+ if args.backend == "pyTorch":
+ args.backend = "pt"
+ if args.backend not in {"vllm", "pt", "mlx"}:
+ args.backend = "vllm"
+
+ print("Initializing ACE-Step handlers...")
+ dit_handler = AceStepHandler()
+ llm_handler = LLMHandler()
+
+ base_only_tasks = {"lego", "extract", "complete"}
+ skip_lm_tasks = {"cover", "repaint"}
+ requires_lm = (
+ args.task_type not in skip_lm_tasks and (
+ args.thinking
+ or args.sample_mode
+ or bool(args.sample_query and str(args.sample_query).strip())
+ or args.use_format
+ or args.use_cot_metas
+ or args.use_cot_caption
+ or args.use_cot_lyrics
+ or args.use_cot_language
+ )
+ )
+
+ if args.config_path is None:
+ available_models = dit_handler.get_available_acestep_v15_models()
+ if args.task_type in base_only_tasks and available_models:
+ available_models = [m for m in available_models if "base" in m.lower()]
+ if not available_models:
+ print("No DiT models found. Downloading main model (acestep-v15-turbo + core components)...")
+ from acestep.model_downloader import ensure_main_model, get_checkpoints_dir
+ checkpoints_dir = get_checkpoints_dir()
+ success, msg = ensure_main_model(checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download main model: {msg}")
+ available_models = dit_handler.get_available_acestep_v15_models()
+ if args.task_type in base_only_tasks and available_models:
+ available_models = [m for m in available_models if "base" in m.lower()]
+ if args.task_type in base_only_tasks and not available_models:
+ print("Base-only task selected. Downloading base DiT model (acestep-v15-base)...")
+ from acestep.model_downloader import ensure_dit_model, get_checkpoints_dir
+ checkpoints_dir = get_checkpoints_dir()
+ success, msg = ensure_dit_model("acestep-v15-base", checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download base DiT model: {msg}")
+ available_models = dit_handler.get_available_acestep_v15_models()
+ if available_models:
+ available_models = [m for m in available_models if "base" in m.lower()]
+ if available_models:
+ if args.task_type in {"lego", "extract", "complete"}:
+ preferred = "acestep-v15-base"
+ else:
+ preferred = "acestep-v15-turbo"
+ args.config_path = preferred if preferred in available_models else available_models[0]
+ print(f"Auto-selected config_path: {args.config_path}")
+ else:
+ parser.error("No available DiT models found. Please specify --config_path.")
+ if args.task_type in {"lego", "extract", "complete"} and "base" not in str(args.config_path).lower():
+ parser.error(f"task_type '{args.task_type}' requires a base model config (e.g., 'acestep-v15-base').")
+
+ # Ensure required DiT/main models are present for the selected task/model.
+ from acestep.model_downloader import (
+ ensure_main_model,
+ ensure_dit_model,
+ get_checkpoints_dir,
+ check_main_model_exists,
+ check_model_exists,
+ SUBMODEL_REGISTRY,
+ )
+ checkpoints_dir = get_checkpoints_dir()
+ if not check_main_model_exists(checkpoints_dir):
+ print("Main model components not found. Downloading main model...")
+ success, msg = ensure_main_model(checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download main model: {msg}")
+ if args.config_path:
+ config_name = str(args.config_path)
+ known_models = {"acestep-v15-turbo"} | set(SUBMODEL_REGISTRY.keys())
+ if check_model_exists(config_name, checkpoints_dir):
+ pass
+ elif config_name in known_models:
+ success, msg = ensure_dit_model(config_name, checkpoints_dir)
+ if not success:
+ parser.error(f"Failed to download DiT model '{config_name}': {msg}")
+ else:
+ print(f"Warning: DiT model '{config_name}' not found locally and not in registry. Skipping auto-download.")
+
+ use_flash_attention = args.use_flash_attention
+ if use_flash_attention is None:
+ use_flash_attention = dit_handler.is_flash_attention_available(device)
+
+ compile_model = os.environ.get("ACESTEP_COMPILE_MODEL", "").strip().lower() in {
+ "1", "true", "yes", "y", "on",
+ }
+
+ print(f"Initializing DiT handler with model: {args.config_path}")
+ dit_handler.initialize_service(
+ project_root=args.project_root,
+ config_path=args.config_path,
+ device=device,
+ use_flash_attention=use_flash_attention,
+ compile_model=compile_model,
+ offload_to_cpu=args.offload_to_cpu,
+ offload_dit_to_cpu=args.offload_dit_to_cpu,
+ )
+
+ if requires_lm:
+ from acestep.model_downloader import ensure_lm_model
+ if args.lm_model_path is None:
+ available_lm_models = llm_handler.get_available_5hz_lm_models()
+ if available_lm_models:
+ args.lm_model_path = available_lm_models[0]
+ print(f"Using default LM model: {args.lm_model_path}")
+ else:
+ success, msg = ensure_lm_model(checkpoints_dir=checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error("No LM models available. Please specify --lm_model_path or disable --thinking.")
+ available_lm_models = llm_handler.get_available_5hz_lm_models()
+ if not available_lm_models:
+ parser.error("No LM models available after download. Please specify --lm_model_path or disable --thinking.")
+ args.lm_model_path = available_lm_models[0]
+ print(f"Using default LM model: {args.lm_model_path}")
+ else:
+ lm_model_path = str(args.lm_model_path)
+ if os.path.isabs(lm_model_path) and os.path.exists(lm_model_path):
+ pass
+ elif check_model_exists(lm_model_path, checkpoints_dir):
+ pass
+ elif lm_model_path in SUBMODEL_REGISTRY:
+ success, msg = ensure_lm_model(lm_model_path, checkpoints_dir=checkpoints_dir)
+ print(msg)
+ if not success:
+ parser.error(f"Failed to download LM model '{lm_model_path}': {msg}")
+ else:
+ parser.error(f"LM model '{lm_model_path}' not found locally and not in registry. Please provide a valid --lm_model_path.")
+
+ print(f"Initializing LM handler with model: {args.lm_model_path}")
+ llm_handler.initialize(
+ checkpoint_dir=args.checkpoint_dir,
+ lm_model_path=args.lm_model_path,
+ backend=args.backend,
+ device=device,
+ offload_to_cpu=args.offload_to_cpu,
+ dtype=None,
+ )
+ else:
+ if args.task_type in skip_lm_tasks:
+ print(f"LM is not required for task_type '{args.task_type}'. Skipping LM handler initialization.")
+ else:
+ print("LM 'thinking' is disabled. Skipping LM handler initialization.")
+
+ print("Handlers initialized.")
+
+ format_has_duration = False
+
+ # --- Sample Mode / Description-based Auto-Generation ---
+ if args.sample_mode or (args.sample_query and str(args.sample_query).strip()):
+ if not llm_handler.llm_initialized:
+ parser.error("--sample_mode/sample_query requires the LM handler, but it's not initialized.")
+
+ sample_query = args.sample_query if args.sample_query and str(args.sample_query).strip() else "NO USER INPUT"
+ parsed_language, parsed_instrumental = _parse_description_hints(sample_query)
+
+ if args.vocal_language and args.vocal_language not in ("en", "unknown", ""):
+ sample_language = args.vocal_language
+ else:
+ sample_language = parsed_language
+
+ print("\nINFO: Creating sample via 'create_sample'...")
+ sample_result = create_sample(
+ llm_handler=llm_handler,
+ query=sample_query,
+ instrumental=parsed_instrumental,
+ vocal_language=sample_language,
+ temperature=args.lm_temperature,
+ top_k=args.lm_top_k,
+ top_p=args.lm_top_p,
+ )
+
+ if sample_result.success:
+ args.caption = sample_result.caption
+ args.lyrics = sample_result.lyrics
+ args.instrumental = bool(sample_result.instrumental)
+ if args.bpm is None:
+ args.bpm = sample_result.bpm
+ if not args.keyscale:
+ args.keyscale = sample_result.keyscale
+ if not args.timesignature:
+ args.timesignature = sample_result.timesignature
+ if args.duration <= 0:
+ args.duration = sample_result.duration
+ if args.vocal_language in ("unknown", "", None):
+ args.vocal_language = sample_result.language
+ args.sample_mode = True
+ print("✓ Sample created. Using generated parameters.")
+ else:
+ parser.error(f"create_sample failed: {sample_result.error or sample_result.status_message}")
+
+ # --- Format caption/lyrics if requested ---
+ if args.use_format and (args.caption or args.lyrics):
+ if not llm_handler.llm_initialized:
+ parser.error("--use_format requires the LM handler, but it's not initialized.")
+
+ user_metadata_for_format = {}
+ if args.bpm is not None:
+ user_metadata_for_format["bpm"] = args.bpm
+ if args.duration is not None and float(args.duration) > 0:
+ user_metadata_for_format["duration"] = float(args.duration)
+ if args.keyscale:
+ user_metadata_for_format["keyscale"] = args.keyscale
+ if args.timesignature:
+ user_metadata_for_format["timesignature"] = args.timesignature
+ if args.vocal_language and args.vocal_language != "unknown":
+ user_metadata_for_format["language"] = args.vocal_language
+
+ print("\nINFO: Formatting caption/lyrics via 'format_sample'...")
+ format_result = format_sample(
+ llm_handler=llm_handler,
+ caption=args.caption or "",
+ lyrics=args.lyrics or "",
+ user_metadata=user_metadata_for_format if user_metadata_for_format else None,
+ temperature=args.lm_temperature,
+ top_k=args.lm_top_k,
+ top_p=args.lm_top_p,
+ )
+
+ if format_result.success:
+ args.caption = format_result.caption or args.caption
+ args.lyrics = format_result.lyrics or args.lyrics
+ if format_result.duration:
+ args.duration = format_result.duration
+ format_has_duration = True
+ if format_result.bpm:
+ args.bpm = format_result.bpm
+ if format_result.keyscale:
+ args.keyscale = format_result.keyscale
+ if format_result.timesignature:
+ args.timesignature = format_result.timesignature
+ print("✓ Format complete.")
+ else:
+ parser.error(f"format_sample failed: {format_result.error or format_result.status_message}")
+
+ # --- Auto-generate Lyrics if Requested ---
+ if args.use_cot_lyrics:
+ if not llm_handler.llm_initialized:
+ parser.error("--use_cot_lyrics requires the LM handler, but it's not initialized. Ensure --thinking is enabled.")
+
+ print("\nINFO: Generating lyrics and metadata via 'create_sample'...")
+ sample_result = create_sample(
+ llm_handler=llm_handler,
+ query=args.caption,
+ instrumental=False,
+ vocal_language=args.vocal_language if args.vocal_language != 'unknown' else None,
+ temperature=args.lm_temperature,
+ top_k=args.lm_top_k,
+ top_p=args.lm_top_p,
+ )
+
+ if sample_result.success:
+ print("✓ Automatic sample creation successful. Using generated parameters:")
+ # Update args with values from create_sample, respecting user-provided values
+ args.caption = sample_result.caption
+ args.lyrics = sample_result.lyrics
+ if args.bpm is None: args.bpm = sample_result.bpm
+ if not args.keyscale: args.keyscale = sample_result.keyscale
+ if not args.timesignature: args.timesignature = sample_result.timesignature
+ if args.duration <= 0: args.duration = sample_result.duration
+ if args.vocal_language == 'unknown': args.vocal_language = sample_result.language
+
+ print(f" - Caption: {args.caption}")
+ lyrics_preview = args.lyrics[:150].strip().replace("\n", " ")
+ print(f" - Lyrics: '{lyrics_preview}...'")
+ print(f" - Metadata: BPM={args.bpm}, Key='{args.keyscale}', Lang='{args.vocal_language}'")
+
+ # Disable subsequent CoT steps to avoid redundancy and save time
+ args.use_cot_metas = False
+ args.use_cot_caption = False
+ else:
+ print(f"⚠️ WARNING: Automatic lyric generation via 'create_sample' failed: {sample_result.error}")
+ print(" Proceeding with an instrumental track instead.")
+ args.lyrics = "[Instrumental]"
+ args.instrumental = True
+
+ # Flag has served its purpose, disable it to avoid issues with GenerationParams
+ args.use_cot_lyrics = False
+
+ if args.sample_mode or format_has_duration:
+ args.use_cot_metas = False
+
+ # --- Prompt Editing Hook for LLM Audio Tokens ---
+ if args.thinking and args.task_type not in skip_lm_tasks:
+ instruction_path = os.path.join(
+ os.path.abspath(args.project_root) if args.project_root else os.getcwd(),
+ "instruction.txt",
+ )
+ preloaded_prompt = None
+ use_instruction_file = False
+ if args.config and os.path.exists(instruction_path):
+ use_instruction_file = True
+ try:
+ with open(instruction_path, "r", encoding="utf-8") as f:
+ preloaded_prompt = f.read()
+ except Exception as e:
+ print(f"WARNING: Failed to read {instruction_path}: {e}")
+ preloaded_prompt = None
+ use_instruction_file = False
+ if use_instruction_file:
+ print(f"INFO: Found {instruction_path}. Using it without editing.")
+ if preloaded_prompt is not None and not preloaded_prompt.strip():
+ preloaded_prompt = None
+ _install_prompt_edit_hook(llm_handler, instruction_path, preloaded_prompt=preloaded_prompt)
+
+ # --- Configure Generation ---
+ params = GenerationParams(
+ task_type=args.task_type,
+ instruction=args.instruction,
+ reference_audio=args.reference_audio,
+ src_audio=args.src_audio,
+ audio_codes=args.audio_codes,
+ caption=args.caption,
+ lyrics=args.lyrics,
+ instrumental=args.instrumental,
+ vocal_language=args.vocal_language,
+ bpm=args.bpm,
+ keyscale=args.keyscale,
+ timesignature=args.timesignature,
+ duration=args.duration,
+ inference_steps=args.inference_steps,
+ seed=args.seed,
+ guidance_scale=args.guidance_scale,
+ use_adg=args.use_adg,
+ cfg_interval_start=args.cfg_interval_start,
+ cfg_interval_end=args.cfg_interval_end,
+ shift=args.shift,
+ infer_method=args.infer_method,
+ timesteps=timesteps,
+ repainting_start=args.repainting_start,
+ repainting_end=args.repainting_end,
+ audio_cover_strength=args.audio_cover_strength,
+ thinking=args.thinking,
+ lm_temperature=args.lm_temperature,
+ lm_cfg_scale=args.lm_cfg_scale,
+ lm_top_k=args.lm_top_k,
+ lm_top_p=args.lm_top_p,
+ lm_negative_prompt=args.lm_negative_prompt,
+ use_cot_metas=args.use_cot_metas,
+ use_cot_caption=args.use_cot_caption,
+ use_cot_lyrics=args.use_cot_lyrics,
+ use_cot_language=args.use_cot_language,
+ use_constrained_decoding=args.use_constrained_decoding
+ )
+
+ config = GenerationConfig(
+ batch_size=args.batch_size,
+ allow_lm_batch=args.allow_lm_batch,
+ use_random_seed=args.use_random_seed,
+ seeds=args.seeds,
+ lm_batch_chunk_size=args.lm_batch_chunk_size,
+ constrained_decoding_debug=args.constrained_decoding_debug,
+ audio_format=args.audio_format
+ )
+
+    # --- Generate Music ---
+    log_level = getattr(args, "log_level", "INFO")
+    log_level_upper = str(log_level).upper()
+    compact_logs = log_level_upper != "DEBUG"
+    _print_final_parameters(
+        args,
+        params,
+        config,
+        params_defaults,
+        config_defaults,
+        compact=compact_logs,
+        resolved_device=device,
+    )
+
+    print("\n--- Starting Generation ---")
+    print(f"Caption: \"{params.caption}\"")
+    print(f"Duration: {params.duration}s | Outputs: {config.batch_size}")
+    if config.seeds:
+        print(f"Custom Seeds: {config.seeds}")
+    print("---------------------------\n")
+
+    manual_edit_pipeline = (
+        args.thinking
+        and args.task_type not in skip_lm_tasks
+        and not (params.audio_codes and str(params.audio_codes).strip())
+    )
+
+    lm_time_costs = None
+    if manual_edit_pipeline:
+        top_k_value = None if not params.lm_top_k or params.lm_top_k == 0 else int(params.lm_top_k)
+        top_p_value = None if not params.lm_top_p or params.lm_top_p >= 1.0 else params.lm_top_p
+
+        actual_batch_size = config.batch_size if config.batch_size is not None else 1
+        seed_for_generation = ""
+        if config.seeds is not None:
+            if isinstance(config.seeds, list) and len(config.seeds) > 0:
+                seed_for_generation = ",".join(str(s) for s in config.seeds)
+            elif isinstance(config.seeds, int):
+                seed_for_generation = str(config.seeds)
+        actual_seed_list, _ = dit_handler.prepare_seeds(actual_batch_size, seed_for_generation, config.use_random_seed)
+
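For clarity, the seed handling above first flattens `config.seeds` (an int or a list of ints) into the comma-separated string that `prepare_seeds` expects. A standalone sketch of that normalization (`normalize_seed_string` is a name invented here, not part of the committed script):

```python
def normalize_seed_string(seeds):
    """Flatten an int or non-empty list of ints into the comma-separated
    string consumed by prepare_seeds; "" means "no fixed seeds"."""
    if isinstance(seeds, list) and len(seeds) > 0:
        return ",".join(str(s) for s in seeds)
    if isinstance(seeds, int):
        return str(seeds)
    return ""

print(normalize_seed_string([7, 13, 42]))  # -> 7,13,42
print(repr(normalize_seed_string(None)))   # -> ''
```

An empty string result lets `prepare_seeds` fall back to its own (random or default) seeding.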
+        original_target_duration = params.duration
+        original_bpm = params.bpm
+        original_keyscale = params.keyscale
+        original_timesignature = params.timesignature
+        original_vocal_language = params.vocal_language
+        lm_result = None
+        lm_metadata = {}
+        edited_caption = None
+        edited_lyrics = None
+        edited_instruction = None
+        edited_metas = {}
+        lm_time_costs = {
+            "phase1_time": 0.0,
+            "phase2_time": 0.0,
+            "total_time": 0.0,
+        }
+        for attempt in range(2):
+            user_metadata = {}
+            if params.bpm is not None:
+                try:
+                    bpm_value = float(params.bpm)
+                    if bpm_value > 0:
+                        user_metadata["bpm"] = int(bpm_value)
+                except (ValueError, TypeError):
+                    pass
+            if params.keyscale and params.keyscale.strip() and params.keyscale.strip().lower() not in ["n/a", ""]:
+                user_metadata["keyscale"] = params.keyscale.strip()
+            if params.timesignature and params.timesignature.strip() and params.timesignature.strip().lower() not in ["n/a", ""]:
+                user_metadata["timesignature"] = params.timesignature.strip()
+            if params.duration is not None:
+                try:
+                    duration_value = float(params.duration)
+                    if duration_value > 0:
+                        user_metadata["duration"] = int(duration_value)
+                except (ValueError, TypeError):
+                    pass
+            # Only include caption and language in user_metadata on
+            # regeneration attempts. On the first attempt the LM should
+            # generate/expand these via CoT (matching inference.py behaviour).
+            if attempt > 0:
+                if params.caption and params.caption.strip():
+                    user_metadata["caption"] = params.caption.strip()
+                if params.vocal_language and params.vocal_language not in ("", "unknown"):
+                    user_metadata["language"] = params.vocal_language
+            user_metadata_to_pass = user_metadata if user_metadata else None
+
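The per-attempt `user_metadata` assembly above coerces BPM and duration to positive ints and drops blank or "N/A" strings. The same filtering can be sketched as a self-contained helper (the function name and signature are illustrative only, not part of the committed script):

```python
def build_user_metadata(bpm=None, keyscale=None, timesignature=None, duration=None):
    """Collect only well-formed metadata: positive numeric bpm/duration
    become ints; blank or "n/a" strings are dropped entirely."""
    meta = {}
    for key, value in (("bpm", bpm), ("duration", duration)):
        try:
            number = float(value)
            if number > 0:
                meta[key] = int(number)
        except (ValueError, TypeError):
            pass
    for key, value in (("keyscale", keyscale), ("timesignature", timesignature)):
        # keyscale/timesignature are expected to be strings here
        if value and value.strip() and value.strip().lower() not in ["n/a", ""]:
            meta[key] = value.strip()
    return meta

print(build_user_metadata(bpm="120.0", keyscale=" C minor ", timesignature="N/A", duration=-3))
# -> {'bpm': 120, 'keyscale': 'C minor'}
```

Returning an empty dict maps onto the script's `user_metadata_to_pass = user_metadata if user_metadata else None` step.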
+            lm_result = llm_handler.generate_with_stop_condition(
+                caption=params.caption or "",
+                lyrics=params.lyrics or "",
+                infer_type="llm_dit",
+                temperature=params.lm_temperature,
+                cfg_scale=params.lm_cfg_scale,
+                negative_prompt=params.lm_negative_prompt,
+                top_k=top_k_value,
+                top_p=top_p_value,
+                target_duration=params.duration,
+                user_metadata=user_metadata_to_pass,
+                use_cot_caption=params.use_cot_caption,
+                use_cot_language=params.use_cot_language,
+                use_cot_metas=params.use_cot_metas,
+                use_constrained_decoding=params.use_constrained_decoding,
+                constrained_decoding_debug=config.constrained_decoding_debug,
+                batch_size=actual_batch_size,
+                seeds=actual_seed_list,
+            )
+            lm_extra_time = (lm_result.get("extra_outputs") or {}).get("time_costs", {})
+            if lm_extra_time:
+                lm_time_costs["phase1_time"] += float(lm_extra_time.get("phase1_time", 0.0) or 0.0)
+                lm_time_costs["phase2_time"] += float(lm_extra_time.get("phase2_time", 0.0) or 0.0)
+                lm_time_costs["total_time"] += float(
+                    lm_extra_time.get(
+                        "total_time",
+                        (lm_extra_time.get("phase1_time", 0.0) or 0.0)
+                        + (lm_extra_time.get("phase2_time", 0.0) or 0.0),
+                    )
+                    or 0.0
+                )
+
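The timing code above sums per-attempt LM costs across regeneration attempts, falling back to `phase1 + phase2` when the handler omits `total_time`. Isolated, the update step looks like this (`accumulate_lm_time` is a hypothetical name for illustration):

```python
def accumulate_lm_time(totals, extra):
    """Add one attempt's time_costs dict into the running totals,
    deriving total_time from the two phases when it is missing."""
    if not extra:
        return totals
    phase1 = float(extra.get("phase1_time", 0.0) or 0.0)
    phase2 = float(extra.get("phase2_time", 0.0) or 0.0)
    totals["phase1_time"] += phase1
    totals["phase2_time"] += phase2
    totals["total_time"] += float(extra.get("total_time", phase1 + phase2) or 0.0)
    return totals

totals = {"phase1_time": 0.0, "phase2_time": 0.0, "total_time": 0.0}
accumulate_lm_time(totals, {"phase1_time": 1.5, "phase2_time": 2.5})  # no total_time key
print(totals["total_time"])  # -> 4.0
```

The `or 0.0` guards also cover handlers that report `None` for a timing field.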
+            if not lm_result.get("success", False):
+                error_msg = lm_result.get("error", "Unknown LM error")
+                print(f"\n❌ Generation failed: {error_msg}")
+                print(f"   Status: {lm_result.get('error', '')}")
+                return
+
+            if actual_batch_size > 1:
+                lm_metadata = (lm_result.get("metadata") or [{}])[0]
+                audio_codes = lm_result.get("audio_codes", [])
+            else:
+                lm_metadata = lm_result.get("metadata", {}) or {}
+                audio_codes = lm_result.get("audio_codes", "")
+
+            if audio_codes:
+                params.audio_codes = audio_codes
+            else:
+                print("WARNING: LM did not return audio codes; proceeding without codes.")
+
+            edited_caption = getattr(llm_handler, "_edited_caption", None)
+            edited_lyrics = getattr(llm_handler, "_edited_lyrics", None)
+            edited_instruction = getattr(llm_handler, "_edited_instruction", None)
+            edited_metas = getattr(llm_handler, "_edited_metas", {})
+
+            parsed_duration = None
+            parsed_bpm = None
+            parsed_keyscale = None
+            parsed_timesignature = None
+            parsed_language = None
+            if edited_metas:
+                bpm_value = edited_metas.get("bpm")
+                if bpm_value:
+                    parsed = _parse_number(bpm_value)
+                    if parsed is not None and parsed > 0:
+                        parsed_bpm = int(parsed)
+                duration_value = edited_metas.get("duration")
+                if duration_value:
+                    parsed = _parse_number(duration_value)
+                    if parsed is not None and parsed > 0:
+                        parsed_duration = float(parsed)
+                keyscale_value = edited_metas.get("keyscale")
+                if keyscale_value:
+                    parsed_keyscale = keyscale_value
+                timesignature_value = edited_metas.get("timesignature")
+                if timesignature_value:
+                    parsed_timesignature = timesignature_value
+                language_value = edited_metas.get("language") or edited_metas.get("vocal_language")
+                if language_value:
+                    parsed_language = language_value
+
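`_parse_number` is defined elsewhere in this script; the parsing above only requires that it return a number or `None`. A plausible minimal implementation with that contract (an assumption for illustration, not the committed code):

```python
import re

def parse_number(value):
    """Best-effort numeric parse: accept ints/floats directly and pull the
    first numeric token out of strings like "120 BPM"; return None otherwise."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.search(r"-?\d+(?:\.\d+)?", str(value))
    return float(match.group()) if match else None

print(parse_number("120 BPM"))   # -> 120.0
print(parse_number("N/A"))       # -> None
```

With this contract, the `parsed is not None and parsed > 0` checks above reject both unparsable and non-positive values.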
+            if attempt == 0:
+                duration_changed = parsed_duration is not None and (
+                    original_target_duration is None
+                    or float(original_target_duration) <= 0
+                    or abs(float(original_target_duration) - parsed_duration) > 1e-6
+                )
+                bpm_changed = parsed_bpm is not None and parsed_bpm != original_bpm
+                keyscale_changed = parsed_keyscale is not None and parsed_keyscale != original_keyscale
+                timesignature_changed = parsed_timesignature is not None and parsed_timesignature != original_timesignature
+                language_changed = parsed_language is not None and parsed_language != original_vocal_language
+                if duration_changed or bpm_changed or keyscale_changed or timesignature_changed or language_changed:
+                    if duration_changed:
+                        params.duration = parsed_duration
+                    if bpm_changed:
+                        params.bpm = parsed_bpm
+                    if keyscale_changed:
+                        params.keyscale = parsed_keyscale
+                    if timesignature_changed:
+                        params.timesignature = parsed_timesignature
+                    if language_changed:
+                        params.vocal_language = parsed_language
+                    # Carry forward the expanded caption so the second
+                    # attempt's <think> block (and user_metadata) use it
+                    # instead of the short original caption.
+                    edited_caption_for_regen = edited_metas.get("caption") if edited_metas else None
+                    if edited_caption_for_regen and edited_caption_for_regen.strip():
+                        params.caption = edited_caption_for_regen
+                    print("INFO: Edited metadata detected. Regenerating audio codes with updated values.")
+                    llm_handler._skip_prompt_edit = True
+                    continue
+            break
+
+        edited_meta_caption = edited_metas.get("caption") if edited_metas else None
+        if edited_meta_caption and edited_meta_caption.strip():
+            params.caption = edited_meta_caption
+        elif edited_caption:
+            params.caption = edited_caption
+        elif params.use_cot_caption and lm_metadata.get("caption"):
+            params.caption = lm_metadata.get("caption")
+
+        if edited_lyrics:
+            params.lyrics = edited_lyrics
+        elif not params.lyrics and lm_metadata.get("lyrics"):
+            params.lyrics = lm_metadata.get("lyrics")
+
+        if edited_instruction:
+            params.instruction = edited_instruction
+
+        if edited_metas:
+            bpm_value = edited_metas.get("bpm")
+            if bpm_value:
+                parsed = _parse_number(bpm_value)
+                if parsed is not None:
+                    params.bpm = int(parsed)
+            duration_value = edited_metas.get("duration")
+            if duration_value:
+                parsed = _parse_number(duration_value)
+                if parsed is not None:
+                    params.duration = float(parsed)
+            keyscale_value = edited_metas.get("keyscale")
+            if keyscale_value:
+                params.keyscale = keyscale_value
+            timesignature_value = edited_metas.get("timesignature")
+            if timesignature_value:
+                params.timesignature = timesignature_value
+            language_value = edited_metas.get("language") or edited_metas.get("vocal_language")
+            if language_value:
+                params.vocal_language = language_value
+        else:
+            if params.bpm is None and lm_metadata.get("bpm") not in (None, "N/A", ""):
+                parsed = _parse_number(str(lm_metadata.get("bpm")))
+                if parsed is not None:
+                    params.bpm = int(parsed)
+            if not params.keyscale and lm_metadata.get("keyscale"):
+                params.keyscale = lm_metadata.get("keyscale")
+            if not params.timesignature and lm_metadata.get("timesignature"):
+                params.timesignature = lm_metadata.get("timesignature")
+            if params.duration is None and lm_metadata.get("duration") not in (None, "N/A", ""):
+                parsed = _parse_number(str(lm_metadata.get("duration")))
+                if parsed is not None:
+                    params.duration = float(parsed)
+            if params.vocal_language in (None, "", "unknown"):
+                language_value = lm_metadata.get("vocal_language") or lm_metadata.get("language")
+                if language_value:
+                    params.vocal_language = language_value
+
+        # use_cot_language: override vocal_language with LM detection unless
+        # the user explicitly edited the language in the think block.
+        if params.use_cot_language:
+            edited_lang = (edited_metas.get("language") or edited_metas.get("vocal_language")) if edited_metas else None
+            if not edited_lang:
+                lm_lang = lm_metadata.get("vocal_language") or lm_metadata.get("language")
+                if lm_lang:
+                    params.vocal_language = lm_lang
+
+        # Populate cot_* fields for downstream reporting (mirrors inference.py)
+        if lm_metadata:
+            if original_bpm is None:
+                params.cot_bpm = params.bpm
+            if not original_keyscale:
+                params.cot_keyscale = params.keyscale
+            if not original_timesignature:
+                params.cot_timesignature = params.timesignature
+            if original_target_duration is None or float(original_target_duration) <= 0:
+                params.cot_duration = params.duration
+            if original_vocal_language in (None, "", "unknown"):
+                params.cot_vocal_language = params.vocal_language
+            if not params.caption:
+                params.cot_caption = lm_metadata.get("caption", "")
+            if not params.lyrics:
+                params.cot_lyrics = lm_metadata.get("lyrics", "")
+
+        params.thinking = False
+        params.use_cot_caption = False
+        params.use_cot_language = False
+        params.use_cot_metas = False
+        if hasattr(llm_handler, "_skip_prompt_edit"):
+            llm_handler._skip_prompt_edit = False
+
+        if log_level_upper in {"INFO", "DEBUG"}:
+            _print_dit_prompt(dit_handler, params)
+        print("Running DiT generation with edited prompt and cached audio codes...")
+        result = generate_music(dit_handler, llm_handler, params, config, save_dir=args.save_dir)
+    else:
+        if log_level_upper in {"INFO", "DEBUG"}:
+            _print_dit_prompt(dit_handler, params)
+        result = generate_music(dit_handler, llm_handler, params, config, save_dir=args.save_dir)
+
+    # --- Process Results ---
+    if result.success:
+        print(f"\n✅ Generation successful! {len(result.audios)} audio(s) saved in '{args.save_dir}/'")
+        for i, audio in enumerate(result.audios):
+            print(f"  [{i+1}] Path: {audio['path']} | Seed: {audio['params']['seed']}")
+
+        time_costs = result.extra_outputs.get("time_costs", {})
+        if manual_edit_pipeline and lm_time_costs and time_costs is not None:
+            if not isinstance(time_costs, dict):
+                time_costs = {}
+            result.extra_outputs["time_costs"] = time_costs
+            if lm_time_costs["total_time"] > 0.0:
+                time_costs["lm_phase1_time"] = lm_time_costs["phase1_time"]
+                time_costs["lm_phase2_time"] = lm_time_costs["phase2_time"]
+                time_costs["lm_total_time"] = lm_time_costs["total_time"]
+                dit_total = float(time_costs.get("dit_total_time_cost", 0.0) or 0.0)
+                time_costs["pipeline_total_time"] = time_costs["lm_total_time"] + dit_total
+        if time_costs:
+            print("\n--- Performance ---")
+            total_time = time_costs.get('pipeline_total_time', 0)
+            print(f"Total time: {total_time:.2f}s")
+            if args.thinking:
+                lm1_time = time_costs.get('lm_phase1_time', 0)
+                lm2_time = time_costs.get('lm_phase2_time', 0)
+                print(f"  - LM time: {lm1_time + lm2_time:.2f}s")
+            dit_time = time_costs.get('dit_total_time_cost', 0)
+            print(f"  - DiT time: {dit_time:.2f}s")
+            print("-------------------\n")
+
+    else:
+        print(f"\n❌ Generation failed: {result.error}")
+        print(f"   Status: {result.status_message}")
+
+
+if __name__ == "__main__":
+    main()