# MOSS-TTS-Nano API Documentation

This document describes the REST API endpoints exposed by MOSS-TTS-Nano. The API is built on FastAPI; while the server is running, interactive OpenAPI (Swagger UI) documentation is available at `http://127.0.0.1:18083/docs`.

---

## 1. Synchronous Synthesis Endpoint

### `POST /api/generate`
Generates a complete synthesized audio file for a given text and reference prompt.

- **Request Type**: `multipart/form-data`
- **Parameters**:

| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `text` | `string` | *Required* | The target text to be synthesized. |
| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
| `tts_max_batch_size` | `integer` | `0` | Batch size constraint for TTS generation (0 = auto). |
| `codec_max_batch_size` | `integer` | `0` | Batch size constraint for Codec decoding (0 = auto). |
| `enable_text_normalization` | `string` | `"1"` | Enables WeTextProcessing frontend (`"1"` or `"0"`). |
| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
| `cpu_threads` | `integer` | `0` | Number of CPU threads to use (`0` = system default). |
| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
| `seed` | `string` | `"0"` | Random seed. Set to `"0"` or an empty string for a random, non-deterministic seed. |

- **Response Example (`200 OK`)**:
```json
{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
```
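As a minimal sketch of how a client might prepare this request with only the Python standard library, the example below builds (but does not send) a `multipart/form-data` request. The helper names `encode_multipart` and `build_generate_request` are hypothetical, the base URL is assumed from the docs address above, and only text fields are encoded (uploading `prompt_audio` would need an extra file part).

```python
import urllib.request
import uuid

# Assumed host and port, taken from the docs URL; adjust for your deployment.
BASE_URL = "http://127.0.0.1:18083"


def encode_multipart(fields: dict[str, str]) -> tuple[bytes, str]:
    """Encode text-only form fields as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_generate_request(text: str, demo_id: str, **overrides) -> urllib.request.Request:
    """Build a POST request for /api/generate using a preset voice."""
    fields = {"text": text, "demo_id": demo_id}
    for name, value in overrides.items():
        # The parameter table documents flags as the strings "1" and "0".
        if isinstance(value, bool):
            fields[name] = "1" if value else "0"
        else:
            fields[name] = str(value)
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        f"{BASE_URL}/api/generate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_generate_request(
    "Welcome to MOSS TTS.", "en_3", do_sample=False, seed=42
)
```

With a server running, the request could be sent with `urllib.request.urlopen(request)` and the JSON body parsed with `json.load`.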

---

## 2. Asynchronous Streaming API

Streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio while the model is still generating.

### `POST /api/generate-stream/start`
Starts a background streaming generation job and registers a `stream_id`.

- **Request Type**: `multipart/form-data`
- **Parameters**: Same parameters as `/api/generate`.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}
```
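The `*_url` fields in the example response are relative paths, so a client would typically join them with the server's base URL before calling them. A small sketch, assuming the base URL from the docs address above (the helper name `resolve_stream_urls` is hypothetical):

```python
from urllib.parse import urljoin

# Assumed host and port, taken from the docs URL; adjust for your deployment.
BASE_URL = "http://127.0.0.1:18083"


def resolve_stream_urls(base_url: str, start_response: dict) -> dict:
    """Turn the relative *_url fields of a start response into absolute URLs."""
    return {
        key: urljoin(base_url, value)
        for key, value in start_response.items()
        if key.endswith("_url")
    }


# Subset of the example response shown above.
start_response = {
    "stream_id": "job_a1b2c3d4",
    "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
    "status_url": "/api/generate-stream/job_a1b2c3d4/status",
    "result_url": "/api/generate-stream/job_a1b2c3d4/result",
}

urls = resolve_stream_urls(BASE_URL, start_response)
```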

### `GET /api/generate-stream/{stream_id}/status`
Returns the current state, progress metrics, and timing information for a streaming job. Clients can poll this endpoint while the job runs.

- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
```
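One way a client might poll this endpoint is sketched below. It assumes that `ready` becomes `true` once the job has finished and that `failed` signals an unrecoverable error; neither transition is spelled out above, so treat both as assumptions. The `fetch_status` callable is a hypothetical stand-in for whatever HTTP client performs the `GET` and returns the parsed JSON.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_status: Callable[[], dict],
    poll_interval: float = 0.25,
    timeout: float = 60.0,
) -> dict:
    """Poll a status callable until it reports ready, failure, or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("failed"):
            # Assumption: status_text carries a human-readable error message.
            raise RuntimeError(status.get("status_text", "stream failed"))
        if status.get("ready"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("stream did not become ready in time")
        time.sleep(poll_interval)


# Stand-in for real HTTP calls: two canned responses, running then done.
canned = iter(
    [
        {"state": "running", "ready": False, "failed": False},
        {"state": "done", "ready": True, "failed": False},
    ]
)
final_status = wait_until_ready(lambda: next(canned), poll_interval=0.0)
```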

### `GET /api/generate-stream/{stream_id}/audio`
Streams raw audio packets from the generation queue.

- **Response Type**: Chunked binary stream (`application/octet-stream`)
- **Format**: Raw PCM 16-bit Little-Endian (`pcm_s16le`)
- **Headers**:
  - `X-Audio-Codec`: `pcm_s16le`
  - `X-Audio-Sample-Rate`: `48000`
  - `X-Audio-Channels`: `1`
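Because the stream is headerless PCM, a client has to supply the sample rate and channel count itself (from the headers above) before the audio can be played or saved. The sketch below shows how received chunks could be wrapped in a WAV container with Python's `wave` module and how duration follows from the byte count; the helper names are hypothetical.

```python
import io
import wave

BYTES_PER_SAMPLE = 2  # pcm_s16le: signed 16-bit samples, little-endian


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 48000, channels: int = 1) -> float:
    """Duration of a raw PCM buffer: bytes / (bytes per sample * channels * rate)."""
    return num_bytes / (BYTES_PER_SAMPLE * channels * sample_rate)


def pcm_chunks_to_wav(chunks, sample_rate: int = 48000, channels: int = 1) -> bytes:
    """Wrap raw PCM chunks in a WAV container.

    Chunks are concatenated first, so a network chunk that ends in the
    middle of a 2-byte sample is handled correctly.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(BYTES_PER_SAMPLE)
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(chunks))
    return buffer.getvalue()


# Stand-in for streamed data: 2000 silent samples split at an odd byte offset.
chunks = [b"\x00" * 2001, b"\x00" * 1999]
wav_bytes = pcm_chunks_to_wav(chunks)

with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    frame_count = wav.getnframes()
```

At 48 kHz mono, one second of audio is 96,000 bytes, which is a useful figure when sizing playback buffers.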

### `GET /api/generate-stream/{stream_id}/result`
Retrieves the finished synthesis payload once the job completes, including the fully stitched base64-encoded audio and the timestamp ranges of each audio chunk.

- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [
    [0.0, 4.5, 0]
  ],
  "audio_base64": "UklGRi..."
}
```
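The sketch below shows how a client might decode `audio_base64` and inspect it. It assumes the field holds a complete WAV file, which the `UklGRi` prefix in the example suggests (it is the base64 encoding of a `RIFF` header). The helper names are hypothetical, and a locally generated silent WAV stands in for a real response so the example runs without a server.

```python
import base64
import io
import wave


def decode_result_audio(result: dict) -> bytes:
    """Return the WAV bytes carried in the audio_base64 field."""
    return base64.b64decode(result["audio_base64"])


def describe_wav(wav_bytes: bytes) -> dict:
    """Read sample rate, channel count, and duration from the WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "duration_seconds": wav.getnframes() / wav.getframerate(),
        }


def make_silent_wav(sample_rate: int, seconds: float) -> bytes:
    """Build a mono 16-bit silent WAV to use as stand-in data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()


# Stand-in for a real result payload; only the field used here is filled in.
fake_result = {
    "audio_base64": base64.b64encode(make_silent_wav(48000, 0.5)).decode("ascii"),
}

wav_bytes = decode_result_audio(fake_result)
info = describe_wav(wav_bytes)
```

To keep the audio, the decoded bytes can be written straight to disk, for example with `pathlib.Path("output.wav").write_bytes(wav_bytes)`.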

### `POST /api/generate-stream/{stream_id}/close`
Closes the active queue session, terminates background threads, and purges intermediate disk caches.

---

## 3. Metadata & Health Endpoints

### `GET /api/metadata`
Returns information about the loaded checkpoint and audio tokenizer, the execution device, the checkpoint's default attention implementation, and the text-normalization status.

- **Response Example (`200 OK`)**:
```json
{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}
```

### `GET /api/warmup-status`
Returns the status of the model warmup.

- **Response Example (`200 OK`)**:
```json
{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}
```

### `GET /api/text-normalization-status`
Returns diagnostics for the Chinese/English WeTextProcessing text-normalization front end.

---

## 4. Preset Voices

### `GET /api/demo-prompt-audio/{demo_id}`
Streams a reference preset voice file directly.

- **Parameters**: `demo_id` (e.g. `zh_1`, `en_3`).
- **Response**: Binary audio file stream (`audio/wav`).

---

## 5. Live Deployments

A live deployment of this refactored, modular version is available here:
- **Hugging Face Space (Refactored)**: [Tom199328/MOSS-TTS-Nano](https://huggingface.co/spaces/Tom199328/MOSS-TTS-Nano)