File size: 37,031 Bytes

9f5c8f7

# ACE-Step Inference API Documentation

**Language / 语言 / 言語:** [English](INFERENCE.md) | [中文](../zh/INFERENCE.md) | [日本語](../ja/INFERENCE.md)

---

This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types.

## Table of Contents

- [Quick Start](#quick-start)
- [API Overview](#api-overview)
- [GenerationParams Parameters](#generationparams-parameters)
- [GenerationConfig Parameters](#generationconfig-parameters)
- [Task Types](#task-types)
- [Helper Functions](#helper-functions)
- [Complete Examples](#complete-examples)
- [Best Practices](#best-practices)

---

## Quick Start

### Basic Usage

```python
from acestep.handler import AceStepHandler
from acestep.llm_inference import LLMHandler
from acestep.inference import GenerationParams, GenerationConfig, generate_music

# Initialize handlers
dit_handler = AceStepHandler()
llm_handler = LLMHandler()

# Initialize services
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda"
)

llm_handler.initialize(
    checkpoint_dir="/path/to/checkpoints",
    lm_model_path="acestep-5Hz-lm-0.6B",
    backend="vllm",
    device="cuda"
)

# Configure generation parameters
params = GenerationParams(
    caption="upbeat electronic dance music with heavy bass",
    bpm=128,
    duration=30,
)

# Configure generation settings
config = GenerationConfig(
    batch_size=2,
    audio_format="flac",
)

# Generate music
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")

# Access results
if result.success:
    for audio in result.audios:
        print(f"Generated: {audio['path']}")
        print(f"Key: {audio['key']}")
        print(f"Seed: {audio['params']['seed']}")
else:
    print(f"Error: {result.error}")
```

---

## API Overview

### Main Functions

#### generate_music

```python
def generate_music(
    dit_handler,
    llm_handler,
    params: GenerationParams,
    config: GenerationConfig,
    save_dir: Optional[str] = None,
    progress=None,
) -> GenerationResult
```

Main function for generating music using the ACE-Step model.

#### understand_music

```python
def understand_music(
    llm_handler,
    audio_codes: str,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> UnderstandResult
```

Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.).

#### create_sample

```python
def create_sample(
    llm_handler,
    query: str,
    instrumental: bool = False,
    vocal_language: Optional[str] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> CreateSampleResult
```

Generate a complete music sample (caption, lyrics, metadata) from a natural language description.

#### format_sample

```python
def format_sample(
    llm_handler,
    caption: str,
    lyrics: str,
    user_metadata: Optional[Dict[str, Any]] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> FormatSampleResult
```

Format and enhance user-provided caption and lyrics, generating structured metadata.

### Configuration Objects

The API uses two configuration dataclasses:

**GenerationParams** - Contains all music generation parameters:

```python
@dataclass
class GenerationParams:
    # Task & Instruction
    task_type: str = "text2music"
    instruction: str = "Fill the audio semantic mask based on the given conditions:"
    
    # Audio Uploads
    reference_audio: Optional[str] = None
    src_audio: Optional[str] = None
    
    # LM Codes Hints
    audio_codes: str = ""
    
    # Text Inputs
    caption: str = ""
    lyrics: str = ""
    instrumental: bool = False
    
    # Metadata
    vocal_language: str = "unknown"
    bpm: Optional[int] = None
    keyscale: str = ""
    timesignature: str = ""
    duration: float = -1.0
    
    # Advanced Settings
    inference_steps: int = 8
    seed: int = -1
    guidance_scale: float = 7.0
    use_adg: bool = False
    cfg_interval_start: float = 0.0
    cfg_interval_end: float = 1.0
    shift: float = 1.0                    # NEW: Timestep shift factor
    infer_method: str = "ode"             # NEW: Diffusion inference method
    timesteps: Optional[List[float]] = None  # NEW: Custom timesteps
    
    repainting_start: float = 0.0
    repainting_end: float = -1
    audio_cover_strength: float = 1.0
    
    # 5Hz Language Model Parameters
    thinking: bool = True
    lm_temperature: float = 0.85
    lm_cfg_scale: float = 2.0
    lm_top_k: int = 0
    lm_top_p: float = 0.9
    lm_negative_prompt: str = "NO USER INPUT"
    use_cot_metas: bool = True
    use_cot_caption: bool = True
    use_cot_lyrics: bool = False
    use_cot_language: bool = True
    use_constrained_decoding: bool = True
    
    # CoT Generated Values (auto-filled by LM)
    cot_bpm: Optional[int] = None
    cot_keyscale: str = ""
    cot_timesignature: str = ""
    cot_duration: Optional[float] = None
    cot_vocal_language: str = "unknown"
    cot_caption: str = ""
    cot_lyrics: str = ""
```

**GenerationConfig** - Contains batch and output configuration:

```python
@dataclass
class GenerationConfig:
    batch_size: int = 2
    allow_lm_batch: bool = False
    use_random_seed: bool = True
    seeds: Optional[List[int]] = None
    lm_batch_chunk_size: int = 8
    constrained_decoding_debug: bool = False
    audio_format: str = "flac"
```

### Result Objects

**GenerationResult** - Result of music generation:

```python
@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]  # List of audio dictionaries
    
    # Generation Information
    status_message: str           # Status message from generation
    extra_outputs: Dict[str, Any] # Extra outputs (latents, masks, lm_metadata, time_costs)
    
    # Success Status
    success: bool                 # Whether generation succeeded
    error: Optional[str]          # Error message if failed
```

**Audio Dictionary Structure:**

Each item in `audios` list contains:

```python
{
    "path": str,           # File path to saved audio
    "tensor": Tensor,      # Audio tensor [channels, samples], CPU, float32
    "key": str,            # Unique audio key (UUID based on params)
    "sample_rate": int,    # Sample rate (default: 48000)
    "params": Dict,        # Generation params for this audio (includes seed, audio_codes, etc.)
}
```

**UnderstandResult** - Result of music understanding:

```python
@dataclass
class UnderstandResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

**CreateSampleResult** - Result of sample creation:

```python
@dataclass
class CreateSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    instrumental: bool = False
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

**FormatSampleResult** - Result of sample formatting:

```python
@dataclass
class FormatSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

---

## GenerationParams Parameters

### Text Inputs

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `caption` | `str` | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or detailed description with genre, mood, instruments, etc. Max 512 characters. |
| `lyrics` | `str` | `""` | Lyrics text for vocal music. Use `"[Instrumental]"` for instrumental tracks. Supports multiple languages. Max 4096 characters. |
| `instrumental` | `bool` | `False` | If True, generate instrumental music regardless of lyrics. |

### Music Metadata

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `bpm` | `Optional[int]` | `None` | Beats per minute (30-300). `None` enables auto-detection via LM. |
| `keyscale` | `str` | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
| `timesignature` | `str` | `""` | Time signature (2 for '2/4', 3 for '3/4', 4 for '4/4', 6 for '6/8'). Empty string enables auto-detection. |
| `vocal_language` | `str` | `"unknown"` | Language code for vocals (ISO 639-1). Supported: `"en"`, `"zh"`, `"ja"`, `"es"`, `"fr"`, etc. Use `"unknown"` for auto-detection. |
| `duration` | `float` | `-1.0` | Target audio length in seconds (10-600). If <= 0 or None, model chooses automatically based on lyrics length. |

### Generation Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `inference_steps` | `int` | `8` | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
| `guidance_scale` | `float` | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to text prompt. Only supported for non-turbo model. Typical range: 5.0-9.0. |
| `seed` | `int` | `-1` | Random seed for reproducibility. Use `-1` for random seed, or any positive integer for fixed seed. |

### Advanced DiT Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `use_adg` | `bool` | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
| `cfg_interval_start` | `float` | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
| `cfg_interval_end` | `float` | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
| `shift` | `float` | `1.0` | Timestep shift factor (range 1.0-5.0, default 1.0). When != 1.0, applies `t = shift * t / (1 + (shift - 1) * t)` to timesteps. Recommended 3.0 for turbo models. |
| `infer_method` | `str` | `"ode"` | Diffusion inference method. `"ode"` (Euler) is faster and deterministic. `"sde"` (stochastic) may produce different results with variance. |
| `timesteps` | `Optional[List[float]]` | `None` | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., `[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]`). If provided, overrides `inference_steps` and `shift`. |

### Task-Specific Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | `str` | `"text2music"` | Generation task type. See [Task Types](#task-types) section for details. |
| `instruction` | `str` | `"Fill the audio semantic mask based on the given conditions:"` | Task-specific instruction prompt. |
| `reference_audio` | `Optional[str]` | `None` | Path to reference audio file for style transfer or continuation tasks. |
| `src_audio` | `Optional[str]` | `None` | Path to source audio file for audio-to-audio tasks (cover, repaint, etc.). |
| `audio_codes` | `str` | `""` | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
| `repainting_start` | `float` | `0.0` | Repainting start time in seconds (for repaint/lego tasks). |
| `repainting_end` | `float` | `-1` | Repainting end time in seconds. Use `-1` for end of audio. |
| `audio_cover_strength` | `float` | `1.0` | Strength of audio cover/codes influence (0.0-1.0). Set smaller (0.2) for style transfer tasks. |

### 5Hz Language Model Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `thinking` | `bool` | `True` | Enable 5Hz Language Model "Chain-of-Thought" reasoning for semantic/music metadata and codes. |
| `lm_temperature` | `float` | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
| `lm_cfg_scale` | `float` | `2.0` | LM classifier-free guidance scale. Higher = stronger adherence to prompt. |
| `lm_top_k` | `int` | `0` | LM top-k sampling. `0` disables top-k filtering. Typical values: 40-100. |
| `lm_top_p` | `float` | `0.9` | LM nucleus sampling (0.0-1.0). `1.0` disables nucleus sampling. Typical values: 0.9-0.95. |
| `lm_negative_prompt` | `str` | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
| `use_cot_metas` | `bool` | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). |
| `use_cot_caption` | `bool` | `True` | Refine user caption using LM CoT reasoning. |
| `use_cot_language` | `bool` | `True` | Detect vocal language using LM CoT reasoning. |
| `use_cot_lyrics` | `bool` | `False` | (Reserved for future use) Generate/refine lyrics using LM CoT. |
| `use_constrained_decoding` | `bool` | `True` | Enable constrained decoding for structured LM output. |

### CoT Generated Values

These fields are automatically populated by the LM when CoT reasoning is enabled:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cot_bpm` | `Optional[int]` | `None` | LM-generated BPM value. |
| `cot_keyscale` | `str` | `""` | LM-generated key/scale. |
| `cot_timesignature` | `str` | `""` | LM-generated time signature. |
| `cot_duration` | `Optional[float]` | `None` | LM-generated duration. |
| `cot_vocal_language` | `str` | `"unknown"` | LM-detected vocal language. |
| `cot_caption` | `str` | `""` | LM-refined caption. |
| `cot_lyrics` | `str` | `""` | LM-generated/refined lyrics. |

---

## GenerationConfig Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `batch_size` | `int` | `2` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
| `allow_lm_batch` | `bool` | `False` | Allow batch processing in LM. Faster when `batch_size >= 2` and `thinking=True`. |
| `use_random_seed` | `bool` | `True` | Whether to use random seed. `True` for different results each time, `False` for reproducible results. |
| `seeds` | `Optional[List[int]]` | `None` | List of seeds for batch generation. If provided, will be padded with random seeds if fewer than batch_size. Can also be single int. |
| `lm_batch_chunk_size` | `int` | `8` | Maximum batch size per LM inference chunk (GPU memory constraint). |
| `constrained_decoding_debug` | `bool` | `False` | Enable debug logging for constrained decoding. |
| `audio_format` | `str` | `"flac"` | Output audio format. Options: `"mp3"`, `"wav"`, `"flac"`. Default is FLAC for fast saving. |

---

## Task Types

ACE-Step supports 6 different generation task types, each optimized for specific use cases.

### 1. Text2Music (Default)

**Purpose**: Generate music from text descriptions and optional metadata.

**Key Parameters**:
```python
params = GenerationParams(
    task_type="text2music",
    caption="energetic rock music with electric guitar",
    lyrics="[Instrumental]",  # or actual lyrics
    bpm=140,
    duration=30,
)
```

**Required**:
- `caption` or `lyrics` (at least one)

**Optional but Recommended**:
- `bpm`: Controls tempo
- `keyscale`: Controls musical key
- `timesignature`: Controls rhythm structure
- `duration`: Controls length
- `vocal_language`: Controls vocal characteristics

**Use Cases**:
- Generate music from text descriptions
- Create backing tracks from prompts
- Generate songs with lyrics

---

### 2. Cover

**Purpose**: Transform existing audio while maintaining structure but changing style/timbre.

**Key Parameters**:
```python
params = GenerationParams(
    task_type="cover",
    src_audio="original_song.mp3",
    caption="jazz piano version",
    audio_cover_strength=0.8,  # 0.0-1.0
)
```

**Required**:
- `src_audio`: Path to source audio file
- `caption`: Description of desired style/transformation

**Optional**:
- `audio_cover_strength`: Controls influence of original audio
  - `1.0`: Strong adherence to original structure
  - `0.5`: Balanced transformation
  - `0.1`: Loose interpretation
- `lyrics`: New lyrics (if changing vocals)

**Use Cases**:
- Create covers in different styles
- Change instrumentation while keeping melody
- Genre transformation

---

### 3. Repaint

**Purpose**: Regenerate a specific time segment of audio while keeping the rest unchanged.

**Key Parameters**:
```python
params = GenerationParams(
    task_type="repaint",
    src_audio="original.mp3",
    repainting_start=10.0,  # seconds
    repainting_end=20.0,    # seconds
    caption="smooth transition with piano solo",
)
```

**Required**:
- `src_audio`: Path to source audio file
- `repainting_start`: Start time in seconds
- `repainting_end`: End time in seconds (use `-1` for end of file)
- `caption`: Description of desired content for repainted section

**Use Cases**:
- Fix specific sections of generated music
- Add variations to parts of a song
- Create smooth transitions
- Replace problematic segments

---

### 4. Lego (Base Model Only)

**Purpose**: Generate a specific instrument track in context of existing audio.

**Key Parameters**:
```python
params = GenerationParams(
    task_type="lego",
    src_audio="backing_track.mp3",
    instruction="Generate the guitar track based on the audio context:",
    caption="lead guitar melody with bluesy feel",
    repainting_start=0.0,
    repainting_end=-1,
)
```

**Required**:
- `src_audio`: Path to source/backing audio
- `instruction`: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...")
- `caption`: Description of desired track characteristics

**Available Tracks**:
- `"vocals"`, `"backing_vocals"`, `"drums"`, `"bass"`, `"guitar"`, `"keyboard"`, 
- `"percussion"`, `"strings"`, `"synth"`, `"fx"`, `"brass"`, `"woodwinds"`

**Use Cases**:
- Add specific instrument tracks
- Layer additional instruments over backing tracks
- Create multi-track compositions iteratively

---

### 5. Extract (Base Model Only)

**Purpose**: Extract/isolate a specific instrument track from mixed audio.

**Key Parameters**:
```python
params = GenerationParams(
    task_type="extract",
    src_audio="full_mix.mp3",
    instruction="Extract the vocals track from the audio:",
)
```

**Required**:
- `src_audio`: Path to mixed audio file
- `instruction`: Must specify track to extract

**Available Tracks**: Same as Lego task

**Use Cases**:
- Stem separation
- Isolate specific instruments
- Create remixes
- Analyze individual tracks

---

### 6. Complete (Base Model Only)

**Purpose**: Complete/extend partial tracks with specified instruments.

**Key Parameters**:
```python
params = GenerationParams(
    task_type="complete",
    src_audio="incomplete_track.mp3",
    instruction="Complete the input track with drums, bass, guitar:",
    caption="rock style completion",
)
```

**Required**:
- `src_audio`: Path to incomplete/partial track
- `instruction`: Must specify which tracks to add
- `caption`: Description of desired style

**Use Cases**:
- Arrange incomplete compositions
- Add backing tracks
- Auto-complete musical ideas

---

## Helper Functions

### understand_music

Analyze audio codes to extract metadata about the music.

```python
from acestep.inference import understand_music

result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_123|><|audio_code_456|>...",
    temperature=0.85,
    use_constrained_decoding=True,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Key: {result.keyscale}")
    print(f"Duration: {result.duration}s")
    print(f"Language: {result.language}")
else:
    print(f"Error: {result.error}")
```

**Use Cases**:
- Analyze existing music
- Extract metadata from audio codes
- Reverse-engineer generation parameters

---

### create_sample

Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature.

```python
from acestep.inference import create_sample

result = create_sample(
    llm_handler=llm_handler,
    query="a soft Bengali love song for a quiet evening",
    instrumental=False,
    vocal_language="bn",  # Optional: constrain to Bengali
    temperature=0.85,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Is Instrumental: {result.instrumental}")
    
    # Use with generate_music
    params = GenerationParams(
        caption=result.caption,
        lyrics=result.lyrics,
        bpm=result.bpm,
        duration=result.duration,
        keyscale=result.keyscale,
        vocal_language=result.language,
    )
else:
    print(f"Error: {result.error}")
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `query` | `str` | required | Natural language description of desired music |
| `instrumental` | `bool` | `False` | Whether to generate instrumental music |
| `vocal_language` | `Optional[str]` | `None` | Constrain lyrics to specific language (e.g., "en", "zh", "bn") |
| `temperature` | `float` | `0.85` | Sampling temperature |
| `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) |
| `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) |
| `repetition_penalty` | `float` | `1.0` | Repetition penalty |
| `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |

---

### format_sample

Format and enhance user-provided caption and lyrics, generating structured metadata.

```python
from acestep.inference import format_sample

result = format_sample(
    llm_handler=llm_handler,
    caption="Latin pop, reggaeton",
    lyrics="[Verse 1]\nBailando en la noche...",
    user_metadata={"bpm": 95},  # Optional: constrain specific values
    temperature=0.85,
)

if result.success:
    print(f"Enhanced Caption: {result.caption}")
    print(f"Formatted Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Detected Language: {result.language}")
else:
    print(f"Error: {result.error}")
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `caption` | `str` | required | User's caption/description |
| `lyrics` | `str` | required | User's lyrics with structure tags |
| `user_metadata` | `Optional[Dict]` | `None` | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language) |
| `temperature` | `float` | `0.85` | Sampling temperature |
| `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) |
| `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) |
| `repetition_penalty` | `float` | `1.0` | Repetition penalty |
| `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |

---

## Complete Examples

### Example 1: Simple Text-to-Music Generation

```python
from acestep.inference import GenerationParams, GenerationConfig, generate_music

params = GenerationParams(
    task_type="text2music",
    caption="calm ambient music with soft piano and strings",
    duration=60,
    bpm=80,
    keyscale="C Major",
)

config = GenerationConfig(
    batch_size=2,  # Generate 2 variations
    audio_format="flac",
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    for i, audio in enumerate(result.audios, 1):
        print(f"Variation {i}: {audio['path']}")
```

### Example 2: Song Generation with Lyrics

```python
params = GenerationParams(
    task_type="text2music",
    caption="pop ballad with emotional vocals",
    lyrics="""Verse 1:
Walking down the street today
Thinking of the words you used to say
Everything feels different now
But I'll find my way somehow

Chorus:
I'm moving on, I'm staying strong
This is where I belong
""",
    vocal_language="en",
    bpm=72,
    duration=45,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 3: Using Custom Timesteps

```python
params = GenerationParams(
    task_type="text2music",
    caption="jazz fusion with complex harmonies",
    # Custom 9-step schedule
    timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
    thinking=True,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 4: Using Shift Parameter (Turbo Model)

```python
params = GenerationParams(
    task_type="text2music",
    caption="upbeat electronic dance music",
    inference_steps=8,
    shift=3.0,  # Recommended for turbo models
    infer_method="ode",
)

config = GenerationConfig(batch_size=2)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 5: Simple Mode with create_sample

```python
from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Create sample from description
sample = create_sample(
    llm_handler=llm_handler,
    query="energetic K-pop dance track with catchy hooks",
    vocal_language="ko",
)

if sample.success:
    # Step 2: Generate music using the sample
    params = GenerationParams(
        caption=sample.caption,
        lyrics=sample.lyrics,
        bpm=sample.bpm,
        duration=sample.duration,
        keyscale=sample.keyscale,
        vocal_language=sample.language,
        thinking=True,
    )
    
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 6: Format and Enhance User Input

```python
from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Format user input
formatted = format_sample(
    llm_handler=llm_handler,
    caption="rock ballad",
    lyrics="[Verse]\nIn the darkness I find my way...",
)

if formatted.success:
    # Step 2: Generate with enhanced input
    params = GenerationParams(
        caption=formatted.caption,
        lyrics=formatted.lyrics,
        bpm=formatted.bpm,
        duration=formatted.duration,
        keyscale=formatted.keyscale,
        thinking=True,
        use_cot_metas=False,  # Already formatted, skip metas CoT
    )
    
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 7: Style Cover with LM Reasoning

```python
params = GenerationParams(
    task_type="cover",
    src_audio="original_pop_song.mp3",
    caption="orchestral symphonic arrangement",
    audio_cover_strength=0.7,
    thinking=True,  # Enable LM for metadata
    use_cot_metas=True,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

# Access LM-generated metadata
if result.extra_outputs.get("lm_metadata"):
    lm_meta = result.extra_outputs["lm_metadata"]
    print(f"LM detected BPM: {lm_meta.get('bpm')}")
    print(f"LM detected Key: {lm_meta.get('keyscale')}")
```

### Example 8: Batch Generation with Specific Seeds

```python
params = GenerationParams(
    task_type="text2music",
    caption="epic cinematic trailer music",
)

config = GenerationConfig(
    batch_size=4,           # Generate 4 variations
    seeds=[42, 123, 456],   # Specify 3 seeds, 4th will be random
    use_random_seed=False,  # Use provided seeds
    lm_batch_chunk_size=2,  # Process 2 at a time (GPU memory)
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    print(f"Generated {len(result.audios)} variations")
    for audio in result.audios:
        print(f"  Seed {audio['params']['seed']}: {audio['path']}")
```

### Example 9: High-Quality Generation (Base Model)

```python
params = GenerationParams(
    task_type="text2music",
    caption="intricate jazz fusion with complex harmonies",
    inference_steps=64,     # High quality
    guidance_scale=8.0,
    use_adg=True,           # Adaptive Dual Guidance
    cfg_interval_start=0.0,
    cfg_interval_end=1.0,
    shift=3.0,              # Timestep shift
    seed=42,                # Reproducible results
)

config = GenerationConfig(
    batch_size=1,
    use_random_seed=False,
    audio_format="wav",     # Lossless format
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 10: Understand Audio from Codes

```python
from acestep.inference import understand_music

# Analyze audio codes (e.g., from a previous generation)
result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_10695|><|audio_code_54246|>...",
    temperature=0.85,
)

if result.success:
    print(f"Detected Caption: {result.caption}")
    print(f"Detected Lyrics: {result.lyrics}")
    print(f"Detected BPM: {result.bpm}")
    print(f"Detected Key: {result.keyscale}")
    print(f"Detected Duration: {result.duration}s")
    print(f"Detected Language: {result.language}")
```

---

## Best Practices

### 1. Caption Writing

**Good Captions**:
```python
# Specific and descriptive
caption="upbeat electronic dance music with heavy bass and synthesizer leads"

# Include mood and genre
caption="melancholic indie folk with acoustic guitar and soft vocals"

# Specify instruments
caption="jazz trio with piano, upright bass, and brush drums"
```

**Avoid**:
```python
# Too vague
caption="good music"

# Contradictory
caption="fast slow music"  # Conflicting tempos
```

### 2. Parameter Tuning

**For Best Quality**:
- Use base model with `inference_steps=64` or higher
- Enable `use_adg=True`
- Set `guidance_scale=7.0-9.0`
- Set `shift=3.0` for better timestep distribution
- Use lossless audio format (`audio_format="wav"`)

**For Speed**:
- Use turbo model with `inference_steps=8`
- Disable ADG (`use_adg=False`)
- Use `infer_method="ode"` (default)
- Use compressed format (`audio_format="mp3"`) or default FLAC

**For Consistency**:
- Set `use_random_seed=False` in config
- Use fixed `seeds` list or single `seed` in params
- Keep `lm_temperature` lower (0.7-0.85)

**For Diversity**:
- Set `use_random_seed=True` in config
- Increase `lm_temperature` (0.9-1.1)
- Use `batch_size > 1` for variations

### 3. Duration Guidelines

- **Instrumental**: 30-180 seconds works well
- **With Lyrics**: Auto-detection recommended (set `duration=-1` or leave default)
- **Short clips**: 10-20 seconds minimum
- **Long form**: Up to 600 seconds (10 minutes) maximum

### 4. LM Usage

**When to Enable LM (`thinking=True`)**:
- Need automatic metadata detection
- Want caption refinement
- Generating from minimal input
- Need diverse outputs

**When to Disable LM (`thinking=False`)**:
- Have precise metadata already
- Need faster generation
- Want full control over parameters

### 5. Batch Processing

```python
# Efficient batch generation
config = GenerationConfig(
    batch_size=8,           # Max supported
    allow_lm_batch=True,    # Enable for speed (when thinking=True)
    lm_batch_chunk_size=4,  # Adjust based on GPU memory
)
```

### 6. Error Handling

```python
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if not result.success:
    print(f"Generation failed: {result.error}")
    print(f"Status: {result.status_message}")
else:
    # Process successful result
    for audio in result.audios:
        path = audio['path']
        key = audio['key']
        seed = audio['params']['seed']
        # ... process audio files
```

### 7. Memory Management

For large batch sizes or long durations:
- Monitor GPU memory usage
- Reduce `batch_size` if OOM errors occur
- Reduce `lm_batch_chunk_size` for LM operations
- Consider using `offload_to_cpu=True` during initialization

### 8. Accessing Time Costs

```python
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    time_costs = result.extra_outputs.get("time_costs", {})
    print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
    print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
    print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
    print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")
```

---

## Troubleshooting

### Common Issues

**Issue**: Out of memory errors
- **Solution**: Reduce `batch_size`, `inference_steps`, or enable CPU offloading

**Issue**: Poor quality results
- **Solution**: Increase `inference_steps`, adjust `guidance_scale`, use base model

**Issue**: Results don't match prompt
- **Solution**: Make caption more specific, increase `guidance_scale`, enable LM refinement (`thinking=True`)

**Issue**: Slow generation
- **Solution**: Use turbo model, reduce `inference_steps`, disable ADG

**Issue**: LM not generating codes
- **Solution**: Verify `llm_handler` is initialized, check `thinking=True` and `use_cot_metas=True`

**Issue**: Seeds not being respected
- **Solution**: Set `use_random_seed=False` in config and provide `seeds` list or `seed` in params

**Issue**: Custom timesteps not working
- **Solution**: Ensure timesteps are a list of floats from 1.0 to 0.0, properly ordered

---

## API Reference Summary

### GenerationParams Fields

See [GenerationParams Parameters](#generationparams-parameters) for complete documentation.

### GenerationConfig Fields

See [GenerationConfig Parameters](#generationconfig-parameters) for complete documentation.

### GenerationResult Fields

```python
@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]
    # Each audio dict contains:
    #   - "path": str (file path)
    #   - "tensor": Tensor (audio data)
    #   - "key": str (unique identifier)
    #   - "sample_rate": int (48000)
    #   - "params": Dict (generation params with seed, audio_codes, etc.)
    
    # Generation Information
    status_message: str
    extra_outputs: Dict[str, Any]
    # extra_outputs contains:
    #   - "lm_metadata": Dict (LM-generated metadata)
    #   - "time_costs": Dict (timing information)
    #   - "latents": Tensor (intermediate latents, if available)
    #   - "masks": Tensor (attention masks, if available)
    
    # Success Status
    success: bool
    error: Optional[str]
```

---

## Version History

- **v1.5.2**: Current version
  - Added `shift` parameter for timestep shifting
  - Added `infer_method` parameter for ODE/SDE selection
  - Added `timesteps` parameter for custom timestep schedules
  - Added `understand_music()` function for audio analysis
  - Added `create_sample()` function for simple mode generation
  - Added `format_sample()` function for input enhancement
  - Added `UnderstandResult`, `CreateSampleResult`, `FormatSampleResult` dataclasses

- **v1.5.1**: Previous version
  - Split `GenerationConfig` into `GenerationParams` and `GenerationConfig`
  - Renamed parameters for consistency (`key_scale` → `keyscale`, `time_signature` → `timesignature`, `audio_duration` → `duration`, `use_llm_thinking` → `thinking`, `audio_code_string` → `audio_codes`)
  - Added `instrumental` parameter
  - Added `use_constrained_decoding` parameter
  - Added CoT auto-filled fields (`cot_*`)
  - Changed default `audio_format` to "flac"
  - Changed default `batch_size` to 2
  - Changed default `thinking` to True
  - Simplified `GenerationResult` structure with unified `audios` list
  - Added unified `time_costs` in `extra_outputs`

- **v1.5**: Initial version
  - Introduced `GenerationConfig` and `GenerationResult` dataclasses
  - Simplified parameter passing
  - Added comprehensive documentation

---

For more information, see:
- Main README: [`../../README.md`](../../README.md)
- REST API Documentation: [`API.md`](API.md)
- Gradio Demo Guide: [`GRADIO_GUIDE.md`](GRADIO_GUIDE.md)
- Project repository: [ACE-Step-1.5](https://github.com/yourusername/ACE-Step-1.5)