# ACE-Step OpenRouter API Documentation

> OpenAI Chat Completions-compatible API for AI music generation

**Base URL:** `http://{host}:{port}` (default `http://127.0.0.1:8002`)

---

## Table of Contents

- [Authentication](#authentication)
- [Endpoints](#endpoints)
  - [POST /v1/chat/completions - Generate Music](#1-generate-music)
  - [GET /v1/models - List Models](#2-list-models)
  - [GET /health - Health Check](#3-health-check)
- [Input Modes](#input-modes)
- [Audio Input](#audio-input)
- [Streaming Responses](#streaming-responses)
- [Examples](#examples)
- [Error Codes](#error-codes)
- [Server Configuration](#server-configuration-environment-variables)

---

## Authentication

If the server is configured with an API key (via the `OPENROUTER_API_KEY` environment variable or `--api-key` CLI flag), all requests must include the following header:

```
Authorization: Bearer <your-api-key>
```

No authentication is required when no API key is configured.
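
As a minimal client-side sketch of the rule above, the helper below adds the `Authorization` header only when a key is supplied. The function name `build_headers` is hypothetical and not part of the API.

```python
def build_headers(api_key=None):
    """Return request headers, adding Authorization only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Header format taken from the Authentication section
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Example: pass the key the server was started with, or nothing if none is configured
headers = build_headers("my-secret-key")
```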

---

## Endpoints

### 1. Generate Music

**POST** `/v1/chat/completions`

Generates music from chat messages and returns audio data along with LM-generated metadata.

#### Request Parameters

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | No | auto | Model ID (obtain from `/v1/models`) |
| `messages` | array | **Yes** | - | Chat message list. See [Input Modes](#input-modes) |
| `stream` | boolean | No | `false` | Enable streaming response. See [Streaming Responses](#streaming-responses) |
| `audio_config` | object | No | `null` | Audio generation configuration. See below |
| `temperature` | float | No | `0.85` | LM sampling temperature |
| `top_p` | float | No | `0.9` | LM nucleus sampling parameter |
| `seed` | int \| string | No | `null` | Random seed. When `batch_size > 1`, use comma-separated values, e.g. `"42,123,456"` |
| `lyrics` | string | No | `""` | Lyrics passed directly; takes priority over lyrics parsed from `messages`. When set, the message text is used as the prompt |
| `sample_mode` | boolean | No | `false` | Enable LLM sample mode. The message text is used as `sample_query`, and the LLM generates the prompt and lyrics |
| `thinking` | boolean | No | `false` | Enable LLM thinking mode for deeper reasoning |
| `use_format` | boolean | No | `false` | Enhance the user-provided prompt and lyrics via LLM formatting |
| `use_cot_caption` | boolean | No | `true` | Rewrite/enhance the music description via Chain-of-Thought |
| `use_cot_language` | boolean | No | `true` | Auto-detect vocal language via Chain-of-Thought |
| `guidance_scale` | float | No | `7.0` | Classifier-free guidance scale |
| `batch_size` | int | No | `1` | Number of audio samples to generate |
| `task_type` | string | No | `"text2music"` | Task type. See [Audio Input](#audio-input) |
| `repainting_start` | float | No | `0.0` | Repaint region start position (seconds) |
| `repainting_end` | float | No | `null` | Repaint region end position (seconds) |
| `audio_cover_strength` | float | No | `1.0` | Cover strength (0.0 to 1.0) |
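
The `seed` field accepts either an integer or a comma-separated string. The sketch below builds a request body from a list of seeds; the helper name `build_batch_request` is hypothetical, and setting `batch_size` to the number of seeds is an assumption based on the batch example in this document.

```python
def build_batch_request(prompt, seeds):
    """Build a request body that generates one sample per seed."""
    if not seeds:
        raise ValueError("at least one seed is required")
    body = {
        "messages": [{"role": "user", "content": f"<prompt>{prompt}</prompt>"}],
        # Assumption: one generated sample per seed
        "batch_size": len(seeds),
    }
    # A single seed is sent as an int; several seeds are comma-separated
    if len(seeds) == 1:
        body["seed"] = seeds[0]
    else:
        body["seed"] = ",".join(str(s) for s in seeds)
    return body


body = build_batch_request("Lo-fi hip hop beat", [42, 123, 456])
```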

#### audio_config Object

| Field | Type | Default | Description |
|---|---|---|---|
| `duration` | float | `null` | Audio duration in seconds. If omitted, determined automatically by the LM |
| `bpm` | integer | `null` | Beats per minute. If omitted, determined automatically by the LM |
| `vocal_language` | string | `"en"` | Vocal language code (e.g. `"zh"`, `"en"`, `"ja"`) |
| `instrumental` | boolean | `null` | Whether to generate instrumental-only (no vocals). If omitted, auto-determined from lyrics |
| `format` | string | `"mp3"` | Output audio format |
| `key_scale` | string | `null` | Musical key (e.g. `"C major"`) |
| `time_signature` | string | `null` | Time signature (e.g. `"4/4"`) |

> **How the message text is interpreted depends on the mode:**
> - If `lyrics` is set → message text = prompt (music description)
> - If `sample_mode: true` is set → message text = sample_query (the LLM generates everything)
> - If neither is set → the mode is auto-detected: tagged mode if the text contains tags, lyrics mode if it resembles lyrics, sample mode otherwise

#### messages Format

Supports both plain text and multimodal (text + audio) formats:

**Plain text:**

```json
{
  "messages": [
    {"role": "user", "content": "Your input content"}
  ]
}
```

**Multimodal (with audio input):**

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Cover this song"},
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64 audio data>",
            "format": "mp3"
          }
        }
      ]
    }
  ]
}
```
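
A multimodal message like the one above can be assembled in Python as follows. This is a sketch: the helper name `audio_message` is hypothetical, and the audio bytes would normally come from reading a file in binary mode.

```python
import base64


def audio_message(text, audio_bytes, audio_format="mp3"):
    """Build a user message carrying text plus one base64-encoded audio block."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "input_audio",
                "input_audio": {"data": encoded, "format": audio_format},
            },
        ],
    }


# Placeholder bytes; in practice use open("song.mp3", "rb").read()
message = audio_message("Cover this song", b"abc")
```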

---

#### Non-Streaming Response (`stream: false`)

```json
{
  "id": "chatcmpl-a1b2c3d4e5f6g7h8",
  "object": "chat.completion",
  "created": 1706688000,
  "model": "acemusic/acestep-v15-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "## Metadata\n**Caption:** Upbeat pop song...\n**BPM:** 120\n**Duration:** 30s\n**Key:** C major\n\n## Lyrics\n[Verse 1]\nHello world...",
        "audio": [
          {
            "type": "audio_url",
            "audio_url": {
              "url": "data:audio/mpeg;base64,SUQzBAAAAAAAI1RTU0UAAAA..."
            }
          }
        ]
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 100,
    "total_tokens": 110
  }
}
```

**Response Fields:**

| Field | Description |
|---|---|
| `choices[0].message.content` | Text information generated by the LM, including Metadata (Caption/BPM/Duration/Key/Time Signature/Language) and Lyrics. Returns `"Music generated successfully."` if LM was not involved |
| `choices[0].message.audio` | Audio data array. Each item contains `type` (`"audio_url"`) and `audio_url.url` (Base64 Data URL in format `data:audio/mpeg;base64,...`) |
| `choices[0].finish_reason` | `"stop"` indicates normal completion |

**Decoding Audio:**

The `audio_url.url` value is a Data URL: `data:audio/mpeg;base64,<base64_data>`

Extract the base64 portion after the comma and decode it to get the MP3 file:

```python
import base64

url = response["choices"][0]["message"]["audio"][0]["audio_url"]["url"]
# Strip the "data:audio/mpeg;base64," prefix
b64_data = url.split(",", 1)[1]
audio_bytes = base64.b64decode(b64_data)

with open("output.mp3", "wb") as f:
    f.write(audio_bytes)
```

```javascript
const url = response.choices[0].message.audio[0].audio_url.url;
const b64Data = url.split(",")[1];
// atob returns a binary string; convert it to raw bytes
const audioBytes = Uint8Array.from(atob(b64Data), (c) => c.charCodeAt(0));
// Or use the Data URL directly in an Audio element
const audio = new Audio(url);
audio.play();
```
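
Since `audio` is an array, a response may carry more than one item. The sketch below decodes every item and picks a file extension from the MIME type in the Data URL. The helper names and the extension mapping are assumptions; this document only shows `audio/mpeg`.

```python
import base64

# Assumed MIME-to-extension mapping; only audio/mpeg appears in this document
EXTENSIONS = {"audio/mpeg": "mp3", "audio/wav": "wav", "audio/flac": "flac"}


def decode_data_url(url):
    """Split a base64 Data URL into (mime_type, raw_bytes)."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 Data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def save_all_audio(response, stem="output"):
    """Write every audio item of a parsed response to disk; return the paths."""
    paths = []
    for i, item in enumerate(response["choices"][0]["message"]["audio"]):
        mime_type, data = decode_data_url(item["audio_url"]["url"])
        path = f"{stem}_{i}.{EXTENSIONS.get(mime_type, 'bin')}"
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths
```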

---

### 2. List Models

**GET** `/v1/models`

Returns available model information.

#### Response

```json
{
  "data": [
    {
      "id": "acemusic/acestep-v15-turbo",
      "name": "ACE-Step",
      "created": 1706688000,
      "description": "High-performance text-to-music generation model. Supports multiple styles, lyrics input, and various audio durations.",
      "input_modalities": ["text", "audio"],
      "output_modalities": ["audio", "text"],
      "context_length": 4096,
      "pricing": {"prompt": "0", "completion": "0", "request": "0"},
      "supported_sampling_parameters": ["temperature", "top_p"]
    }
  ]
}
```
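
A client can read the model ID from this response before calling `/v1/chat/completions`. The sketch below works on the parsed JSON; `pick_model_id` is a hypothetical helper, and falling back to the first listed model is an assumption.

```python
def pick_model_id(models_response, preferred=None):
    """Return the preferred model ID if listed, else the first one, else None."""
    ids = [model["id"] for model in models_response.get("data", [])]
    if preferred in ids:
        return preferred
    return ids[0] if ids else None


models_response = {"data": [{"id": "acemusic/acestep-v15-turbo", "name": "ACE-Step"}]}
model_id = pick_model_id(models_response)
```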

---

### 3. Health Check

**GET** `/health`

#### Response

```json
{
  "status": "ok",
  "service": "ACE-Step OpenRouter API",
  "version": "1.0"
}
```
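
A startup script might poll this endpoint before sending requests. The sketch below takes any callable that performs `GET /health` and returns the parsed JSON, so it stays independent of the HTTP library. `wait_until_healthy` is a hypothetical helper, and treating exceptions as "not ready yet" is an assumption, since this document does not describe the response while the server is starting.

```python
import time


def wait_until_healthy(fetch_health, attempts=10, delay=1.0):
    """Poll fetch_health() until it reports status "ok" or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch_health().get("status") == "ok":
                return True
        except Exception:
            # Assumption: connection or HTTP errors mean the server is not ready yet
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

With `httpx`, `fetch_health` could be `lambda: httpx.get("http://127.0.0.1:8002/health").json()`.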

---

## Input Modes

The system selects the input mode automatically from the content of the last `user` message. You can also choose the mode explicitly with the `lyrics` or `sample_mode` fields.

### Mode 1: Tagged Mode (Recommended)

Use `<prompt>` and `<lyrics>` tags to explicitly specify the music description and lyrics:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "<prompt>A gentle acoustic ballad in C major, female vocal</prompt>\n<lyrics>[Verse 1]\nSunlight through the window\nA brand new day begins\n\n[Chorus]\nWe are the dreamers\nWe are the light</lyrics>"
    }
  ],
  "audio_config": {
    "duration": 30,
    "vocal_language": "en"
  }
}
```

- `<prompt>...</prompt>` — Music style/scene description (caption)
- `<lyrics>...</lyrics>` — Lyrics content
- Either tag can be used alone
- When `use_format: true`, the LLM automatically enhances both prompt and lyrics

### Mode 2: Natural Language Mode (Sample Mode)

Describe the desired music in natural language. The system uses the LLM to generate the prompt and lyrics automatically:

```json
{
  "messages": [
    {"role": "user", "content": "Generate an upbeat pop song about summer and travel"}
  ],
  "sample_mode": true,
  "audio_config": {
    "vocal_language": "en"
  }
}
```

Trigger condition: `sample_mode: true`, or message content contains no tags and does not resemble lyrics.

### Mode 3: Lyrics-Only Mode

Pass in lyrics with structural markers directly. The system identifies them automatically:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "[Verse 1]\nWalking down the street\nFeeling the beat\n\n[Chorus]\nDance with me tonight\nUnder the moonlight"
    }
  ],
  "audio_config": {"duration": 30}
}
```

Trigger condition: Message content contains `[Verse]`, `[Chorus]`, or similar markers, or has a multi-line short-text structure.
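
The trigger conditions for Modes 1 to 3 can be approximated on the client side to predict which mode a message will select. This is an illustrative heuristic, not the server's implementation: the regular expressions, the thresholds for "multi-line short-text structure" (4 lines, 60 characters), and the precedence of `lyrics` over `sample_mode` are all assumptions.

```python
import re

TAG_RE = re.compile(r"<(prompt|lyrics)>", re.IGNORECASE)
# A line consisting only of a bracketed marker such as [Verse 1] or [Chorus]
MARKER_RE = re.compile(r"^\[[^\]\n]+\]\s*$", re.MULTILINE)


def guess_input_mode(text, lyrics="", sample_mode=False):
    """Guess which input mode a request would select (heuristic sketch)."""
    if lyrics:
        return "prompt+lyrics"  # explicit lyrics field: text is the prompt
    if sample_mode:
        return "sample"
    if TAG_RE.search(text):
        return "tagged"
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumed thresholds for "multi-line short-text structure"
    short_multiline = len(lines) >= 4 and all(len(line) <= 60 for line in lines)
    if MARKER_RE.search(text) or short_multiline:
        return "lyrics"
    return "sample"
```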

### Mode 4: Lyrics + Prompt Separation

Pass the lyrics directly in the `lyrics` field; the message text is then used as the prompt:

```json
{
  "messages": [
    {"role": "user", "content": "Energetic EDM with heavy bass drops"}
  ],
  "lyrics": "[Verse 1]\nFeel the rhythm in your soul\nLet the music take control\n\n[Drop]\n(instrumental break)",
  "audio_config": {
    "bpm": 128,
    "duration": 60
  }
}
```

### Instrumental Mode

Set `audio_config.instrumental: true`:

```json
{
  "messages": [
    {"role": "user", "content": "<prompt>Epic orchestral cinematic score, dramatic and powerful</prompt>"}
  ],
  "audio_config": {
    "instrumental": true,
    "duration": 30
  }
}
```

---

## Audio Input

Audio files can be passed via multimodal messages (base64 encoded) for cover, repaint, and other tasks.

### task_type Values

| task_type | Description | Audio Input Required |
|---|---|---|
| `text2music` | Text to music (default) | Optional (as reference) |
| `cover` | Cover/style transfer | Requires src_audio |
| `repaint` | Partial repaint | Requires src_audio |
| `lego` | Audio splicing | Requires src_audio |
| `extract` | Audio extraction | Requires src_audio |
| `complete` | Audio continuation | Requires src_audio |

### Audio Routing Rules

Multiple `input_audio` blocks are routed to different parameters in order (similar to multi-image upload):

| task_type | audio[0] | audio[1] |
|---|---|---|
| `text2music` | reference_audio (style reference) | - |
| `cover`, `repaint`, `lego`, `extract`, `complete` | src_audio (audio to edit) | reference_audio (optional style reference) |
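
The routing table can be expressed as a small function, which is useful for checking a request before sending it. This is a client-side illustration of the table, not the server's code; `route_audio` is a hypothetical name, and raising an error for a missing source is an assumption.

```python
EDIT_TASKS = {"cover", "repaint", "lego", "extract", "complete"}


def route_audio(task_type, audio_blocks):
    """Map ordered input_audio blocks to src_audio / reference_audio."""
    routed = {"src_audio": None, "reference_audio": None}
    if task_type == "text2music":
        # The first block, if any, is the optional style reference
        if audio_blocks:
            routed["reference_audio"] = audio_blocks[0]
    elif task_type in EDIT_TASKS:
        if not audio_blocks:
            raise ValueError(f"{task_type} requires a source audio block")
        routed["src_audio"] = audio_blocks[0]
        if len(audio_blocks) > 1:
            routed["reference_audio"] = audio_blocks[1]
    else:
        raise ValueError(f"unknown task_type: {task_type}")
    return routed
```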

### Audio Input Examples

**Cover Task:**

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "<prompt>Jazz style cover with saxophone</prompt>"},
        {
          "type": "input_audio",
          "input_audio": {"data": "<base64 source audio>", "format": "mp3"}
        }
      ]
    }
  ],
  "task_type": "cover",
  "audio_cover_strength": 0.8,
  "audio_config": {"duration": 30}
}
```

**Repaint Task:**

```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "<prompt>Replace with guitar solo</prompt>"},
        {
          "type": "input_audio",
          "input_audio": {"data": "<base64 source audio>", "format": "mp3"}
        }
      ]
    }
  ],
  "task_type": "repaint",
  "repainting_start": 10.0,
  "repainting_end": 20.0,
  "audio_config": {"duration": 30}
}
```
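
Before sending a repaint request, a client may want to check the region bounds locally. The sketch below builds the repaint-specific fields; `repaint_fields` is a hypothetical helper, and the validation rules (non-negative start, end after start) are assumptions rather than documented server behavior. Omitting `repainting_end` leaves it at its documented default of `null`.

```python
def repaint_fields(start, end=None):
    """Return the top-level request fields for a repaint task."""
    if start < 0:
        raise ValueError("repainting_start must be non-negative")
    if end is not None and end <= start:
        raise ValueError("repainting_end must be greater than repainting_start")
    fields = {"task_type": "repaint", "repainting_start": float(start)}
    if end is not None:
        fields["repainting_end"] = float(end)
    return fields


# Merge into a request body alongside "messages" and "audio_config"
fields = repaint_fields(10, 20)
```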

---

## Streaming Responses

Set `"stream": true` to enable SSE (Server-Sent Events) streaming.

### Event Format

Each event consists of `data: ` followed by a JSON payload and is terminated by a blank line (`\n\n`):

```
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v15-turbo","choices":[{"index":0,"delta":{...},"finish_reason":null}]}

```

### Streaming Event Sequence

| Phase | Delta Content | Description |
|---|---|---|
| 1. Initialization | `{"role":"assistant","content":""}` | Establishes the connection |
| 2. LM Content | `{"content":"\n\n## Metadata\n..."}` | Metadata and lyrics pushed after LM generation (if LM was used) |
| 3. Heartbeat | `{"content":"."}` | Sent every 2 seconds during audio generation to keep the connection alive |
| 4. Audio Data | `{"audio":[{"type":"audio_url","audio_url":{"url":"data:..."}}]}` | Audio base64 data |
| 5. Finish | `finish_reason: "stop"` | Generation complete |
| 6. Termination | `data: [DONE]` | End-of-stream marker |

### Streaming Response Example

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v15-turbo","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v15-turbo","choices":[{"index":0,"delta":{"content":"\n\n## Metadata\n**Caption:** Upbeat pop\n**BPM:** 120"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v15-turbo","choices":[{"index":0,"delta":{"content":"."},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v15-turbo","choices":[{"index":0,"delta":{"audio":[{"type":"audio_url","audio_url":{"url":"data:audio/mpeg;base64,..."}}]},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706688000,"model":"acemusic/acestep-v15-turbo","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

```
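
The event sequence above can be parsed without any network code, which makes the logic easy to test. The sketch below consumes an iterable of already-split SSE lines; `parse_sse_lines` is a hypothetical helper, and treating a delta whose content is exactly `"."` as a heartbeat to be dropped is an assumption based on the event table.

```python
import json


def parse_sse_lines(lines):
    """Collect text, audio URLs, and the finish flag from SSE lines."""
    text_parts, audio_urls, finished = [], [], False
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        delta = choice.get("delta", {})
        content = delta.get("content")
        # Assumption: a lone "." is a keep-alive heartbeat, not real content
        if content and content != ".":
            text_parts.append(content)
        for item in delta.get("audio") or []:
            audio_urls.append(item["audio_url"]["url"])
        if choice.get("finish_reason") == "stop":
            finished = True
    return "".join(text_parts), audio_urls, finished
```

With `httpx`, the result of `response.iter_lines()` can be passed in directly.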

### Client-Side Streaming Handling

```python
import base64
import json

import httpx

payload = {
    "messages": [{"role": "user", "content": "Generate a cheerful guitar piece"}],
    "sample_mode": True,
    "stream": True,
    "audio_config": {"instrumental": True},
}

# Generous timeout: generation may pause longer than httpx's 5-second default
with httpx.stream(
    "POST", "http://127.0.0.1:8002/v1/chat/completions", json=payload, timeout=600.0
) as response:
    response.raise_for_status()
    content_parts = []
    audio_url = None

    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue
        if line == "data: [DONE]":
            break

        chunk = json.loads(line[6:])
        delta = chunk["choices"][0]["delta"]

        if delta.get("content"):
            content_parts.append(delta["content"])  # includes "." heartbeats

        if delta.get("audio"):
            audio_url = delta["audio"][0]["audio_url"]["url"]

        if chunk["choices"][0].get("finish_reason") == "stop":
            print("Generation complete!")

print("Content:", "".join(content_parts))
if audio_url:
    # Strip the "data:audio/mpeg;base64," prefix and decode the rest
    b64_data = audio_url.split(",", 1)[1]
    with open("output.mp3", "wb") as f:
        f.write(base64.b64decode(b64_data))
```

```javascript
const response = await fetch("http://127.0.0.1:8002/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Generate a cheerful guitar piece" }],
    sample_mode: true,
    stream: true,
    audio_config: { instrumental: true }
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let audioUrl = null;
let content = "";
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  // A network chunk can end mid-event, so buffer until a full line arrives
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop(); // keep the trailing partial line for the next read

  for (const rawLine of lines) {
    const line = rawLine.trimEnd();
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;

    const chunk = JSON.parse(line.slice(6));
    const delta = chunk.choices[0].delta;

    if (delta.content) content += delta.content;
    if (delta.audio) audioUrl = delta.audio[0].audio_url.url;
  }
}

// audioUrl can be used directly as <audio src="...">
```

---

## Examples

### Example 1: Natural Language Generation (Simplest Usage)

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "A soft folk song about hometown and memories"}
    ],
    "sample_mode": true,
    "audio_config": {"vocal_language": "en"}
  }'
```

### Example 2: Tagged Mode with Specific Parameters

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Energetic EDM track with heavy bass drops and synth leads</prompt><lyrics>[Verse 1]\nFeel the rhythm in your soul\nLet the music take control\n\n[Drop]\n(instrumental break)</lyrics>"
      }
    ],
    "audio_config": {
      "bpm": 128,
      "duration": 60,
      "vocal_language": "en"
    }
  }'
```

### Example 3: Instrumental with LM Enhancement Disabled

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<prompt>Peaceful piano solo, slow tempo, jazz harmony</prompt>"
      }
    ],
    "use_cot_caption": false,
    "audio_config": {
      "instrumental": true,
      "duration": 45
    }
  }'
```

### Example 4: Streaming Request

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "messages": [
      {"role": "user", "content": "Generate a happy birthday song"}
    ],
    "sample_mode": true,
    "stream": true
  }'
```

### Example 5: Multi-Seed Batch Generation

```bash
curl -X POST http://127.0.0.1:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "<prompt>Lo-fi hip hop beat</prompt>"}
    ],
    "batch_size": 3,
    "seed": "42,123,456",
    "audio_config": {
      "instrumental": true,
      "duration": 30
    }
  }'
```

---

## Error Codes

| HTTP Status | Description |
|---|---|
| 400 | Invalid request format or missing valid input |
| 401 | Missing or invalid API key |
| 429 | Service busy, queue full |
| 500 | Internal error during music generation |
| 503 | Model not yet initialized |
| 504 | Generation timeout |

Error response format:

```json
{
  "detail": "Error description message"
}
```
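
A client can turn these status codes into a simple retry decision. The sketch below is illustrative: `describe_error` is a hypothetical helper, and treating 429 (queue full) and 503 (model not yet initialized) as retryable is a client-side assumption, not documented server guidance.

```python
# Assumption: these conditions are transient and worth retrying after a delay
RETRYABLE = {429, 503}


def describe_error(status_code, body):
    """Summarize an error response and whether a retry seems reasonable."""
    detail = body.get("detail", "") if isinstance(body, dict) else ""
    return {
        "status": status_code,
        "detail": detail,
        "retry": status_code in RETRYABLE,
    }


info = describe_error(429, {"detail": "Service busy, queue full"})
```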

---

## Server Configuration (Environment Variables)

The following environment variables configure the server (for operators):

| Variable | Default | Description |
|---|---|---|
| `OPENROUTER_API_KEY` | None | API authentication key |
| `OPENROUTER_HOST` | `127.0.0.1` | Listen address |
| `OPENROUTER_PORT` | `8002` | Listen port |
| `ACESTEP_CONFIG_PATH` | `acestep-v15-turbo` | DiT model configuration path |
| `ACESTEP_DEVICE` | `auto` | Inference device |
| `ACESTEP_LM_MODEL_PATH` | `acestep-5Hz-lm-0.6B` | LLM model path |
| `ACESTEP_LM_BACKEND` | `vllm` | LLM inference backend |
| `ACESTEP_QUEUE_MAXSIZE` | `200` | Task queue max capacity |
| `ACESTEP_GENERATION_TIMEOUT` | `600` | Non-streaming request timeout (seconds) |
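
When a client runs on the same machine as the server, it can derive the base URL from the same variables. This is a sketch: `base_url_from_env` is a hypothetical helper, and reusing the server's variables on the client side is an assumption. The defaults match the table above.

```python
import os


def base_url_from_env(env=None):
    """Build the API base URL from OPENROUTER_HOST / OPENROUTER_PORT."""
    env = os.environ if env is None else env
    host = env.get("OPENROUTER_HOST", "127.0.0.1")
    port = env.get("OPENROUTER_PORT", "8002")
    return f"http://{host}:{port}"


# Passing an empty mapping yields the documented default base URL
default_url = base_url_from_env({})
```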