# MOSS-TTS-Nano API Documentation
This document describes the modular REST API exposed by MOSS-TTS-Nano. The API is built on FastAPI; while the server is running, the interactive OpenAPI (Swagger UI) docs are available at `http://127.0.0.1:18083/docs`.
---
## 1. Synchronous Synthesis Endpoint
### `POST /api/generate`
Generates a complete synthesized audio file for a given text and reference prompt.
- **Request Type**: `multipart/form-data`
- **Parameters**:
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `text` | `string` | *Required* | The target text to be synthesized. |
| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
| `tts_max_batch_size` | `integer` | `0` | Batch size constraint for TTS generation (0 = auto). |
| `codec_max_batch_size` | `integer` | `0` | Batch size constraint for Codec decoding (0 = auto). |
| `enable_text_normalization` | `string` | `"1"` | Enables WeTextProcessing frontend (`"1"` or `"0"`). |
| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
| `cpu_threads` | `integer` | `0` | Number of CPU threads to use (0 = use the system default). |
| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
| `seed` | `string` | `"0"` | Random seed. `"0"` or an empty string selects a random, non-deterministic seed. |
- **Response Example (`200 OK`)**:
```json
{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
```
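The request and response described above can be exercised with a small standard-library client. This is a minimal sketch, not the project's own client: the helper names (`build_form_fields`, `encode_multipart`, `generate`, `decode_audio`) are hypothetical, and the base URL assumes the default local address mentioned at the top of this document. The `UklGR` prefix in the example decodes to `RIFF`, which suggests a WAV container, although the document does not state the container format explicitly.

```python
import base64
import json
import urllib.request
import uuid


def build_form_fields(text, demo_id="", do_sample=True, seed=None, **extra):
    """Convert Python values into the string form fields listed in the table."""
    fields = {
        "text": text,
        "demo_id": demo_id,
        # Boolean switches are documented as the strings "1" and "0".
        "do_sample": "1" if do_sample else "0",
        # Per the table, "0" (or an empty string) requests a random seed.
        "seed": "0" if seed is None else str(seed),
    }
    for name, value in extra.items():
        fields[name] = ("1" if value else "0") if isinstance(value, bool) else str(value)
    return fields


def encode_multipart(fields, files=None):
    """Encode fields and files as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode("utf-8")
        )
    for name, (filename, data) in (files or {}).items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def generate(fields, files=None, base_url="http://127.0.0.1:18083"):
    """POST to /api/generate and return the parsed JSON response."""
    body, content_type = encode_multipart(fields, files)
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def decode_audio(payload):
    """Return the raw audio bytes carried in the audio_base64 field."""
    return base64.b64decode(payload["audio_base64"])
```

With a server running, `payload = generate(build_form_fields("Welcome to MOSS TTS.", demo_id="en_3"))` followed by writing `decode_audio(payload)` to a file should produce a playable clip. To clone a voice instead, pass `files={"prompt_audio": ("ref.wav", wav_bytes)}`.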
---
## 2. Asynchronous Streaming API
The streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio while the model is still generating.
### `POST /api/generate-stream/start`
Starts a background streaming generation job and registers a `stream_id`.
- **Request Type**: `multipart/form-data`
- **Parameters**: Same parameters as `/api/generate`.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}
```
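The URLs in the start response are relative, so a client has to resolve them against the server address before following them. A minimal sketch, assuming the default local base URL; `resolve_stream_urls` is a hypothetical helper, and the `close_url` entry is built from the `/close` endpoint path documented further down rather than taken from the response:

```python
from urllib.parse import urljoin


def resolve_stream_urls(start_response, base_url="http://127.0.0.1:18083"):
    """Turn the relative URLs from the start response into absolute ones."""
    urls = {
        key: urljoin(base_url, start_response[key])
        for key in ("audio_url", "status_url", "result_url")
    }
    # Not part of the response: derived from the documented close endpoint path.
    urls["close_url"] = urljoin(
        base_url, f"/api/generate-stream/{start_response['stream_id']}/close"
    )
    return urls
```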
### `GET /api/generate-stream/{stream_id}/status`
Returns the current state, progress metrics, and timing information of a streaming job. Clients are expected to poll this endpoint.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
```
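A polling loop over this endpoint might look like the sketch below. It assumes that `ready: true` means the result is available and that `failed: true` is terminal; the document shows both fields but does not spell out their exact semantics. `wait_for_stream` and its parameters are hypothetical, and `fetch_status` stands for any zero-argument callable that returns the parsed status JSON.

```python
import time


def wait_for_stream(fetch_status, interval=0.25, timeout=120.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reports ready or failed, or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("failed"):
            # Assumption: a failed job never recovers, so stop polling.
            raise RuntimeError(f"stream failed: {status.get('status_text', '')}")
        if status.get("ready"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"stream not ready after {timeout} s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; in normal use the defaults are enough.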
### `GET /api/generate-stream/{stream_id}/audio`
Streams raw audio packets from the generation queue.
- **Response Type**: Chunked binary stream (`application/octet-stream`)
- **Format**: Raw PCM 16-bit Little-Endian (`pcm_s16le`)
- **Headers**:
  - `X-Audio-Codec`: `pcm_s16le`
  - `X-Audio-Sample-Rate`: `48000`
  - `X-Audio-Channels`: `1`
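Because the stream is headerless PCM, a client needs the sample rate and channel count from the headers above to interpret the bytes. The following sketch reads the stream in chunks and wraps the result in a WAV container using only the standard library. The function names are hypothetical, and the defaults mirror the header values shown above.

```python
import io
import urllib.request
import wave


def read_pcm_stream(audio_url, chunk_size=4096):
    """Yield raw PCM chunks from the audio endpoint as they arrive."""
    with urllib.request.urlopen(audio_url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk


def pcm_duration_seconds(num_bytes, sample_rate=48000, channels=1):
    """Duration of a pcm_s16le buffer: 2 bytes per sample per channel."""
    return num_bytes / (2 * channels * sample_rate)


def pcm_chunks_to_wav(chunks, sample_rate=48000, channels=1):
    """Wrap raw pcm_s16le chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()
```

For example, `pcm_chunks_to_wav(read_pcm_stream(audio_url))` collects a whole stream into a WAV file in memory. A real-time player would instead hand each chunk to an audio output device as it arrives.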
### `GET /api/generate-stream/{stream_id}/result`
Retrieves the finished synthesis payload once the job completes, including the fully stitched, base64-encoded audio and the timestamp range of each audio chunk.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [
    [0.0, 4.5, 0]
  ],
  "audio_base64": "UklGRi..."
}
```
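The layout of `audio_chunk_ranges` is not documented here. The sketch below assumes each entry is `[start_seconds, end_seconds, chunk_index]`, where `chunk_index` points into the `text_chunks` list returned by the start endpoint. That reading fits the example values but is a guess and should be checked against the server code. `label_chunk_ranges` is a hypothetical helper.

```python
def label_chunk_ranges(audio_chunk_ranges, text_chunks):
    """Pair each range with its text chunk.

    Assumption (not confirmed by the document): every entry is
    [start_seconds, end_seconds, chunk_index].
    """
    labelled = []
    for start, end, index in audio_chunk_ranges:
        labelled.append({
            "start": start,
            "end": end,
            "duration": end - start,
            "text": text_chunks[int(index)],
        })
    return labelled
```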
### `POST /api/generate-stream/{stream_id}/close`
Closes the active queue session, terminates background threads, and purges intermediate disk caches.
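Since closing releases server-side resources, a client may want to guarantee the call even when playback fails midway. One way to do that is a small context manager, sketched here with injected callables so it stays independent of any HTTP library. `stream_session`, `start_job`, and `close_job` are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def stream_session(start_job, close_job):
    """Start a streaming job and always close it, even if the consumer raises.

    start_job: zero-argument callable returning the parsed start response.
    close_job: callable taking a stream_id and issuing the POST .../close request.
    """
    job = start_job()
    try:
        yield job
    finally:
        close_job(job["stream_id"])
```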
---
## 3. Metadata & Health Endpoints
### `GET /api/metadata`
Returns information about the loaded checkpoint and audio tokenizer, the execution device, and the checkpoint's default attention implementation.
- **Response Example (`200 OK`)**:
```json
{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}
```
### `GET /api/warmup-status`
Returns the status of the model warmup (pre-warming) pass.
- **Response Example (`200 OK`)**:
```json
{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}
```
### `GET /api/text-normalization-status`
Returns diagnostics for the WeTextProcessing text-normalization frontend (Chinese and English).
---
## 4. Preset Voices
### `GET /api/demo-prompt-audio/{demo_id}`
Streams a reference preset voice file directly.
- **Parameters**: `demo_id` (e.g. `zh_1`, `en_3`).
- **Response**: Binary audio file stream (`audio/wav`).
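To check what a preset sounds like or how long it is, the file can be downloaded and inspected with the standard `wave` module. This is a sketch under the assumption that presets are plain PCM WAV files, which `wave` can parse; the helper names are hypothetical and the base URL is the default local address.

```python
import io
import urllib.request
import wave
from urllib.parse import quote


def demo_prompt_url(demo_id, base_url="http://127.0.0.1:18083"):
    """Build the preset URL, percent-encoding the identifier."""
    return f"{base_url}/api/demo-prompt-audio/{quote(demo_id, safe='')}"


def fetch_demo_prompt(demo_id, base_url="http://127.0.0.1:18083"):
    """Download the preset voice file and return its bytes."""
    with urllib.request.urlopen(demo_prompt_url(demo_id, base_url)) as response:
        return response.read()


def describe_wav(data):
    """Report channel count, sample rate, and duration of WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": wav.getnframes() / rate,
        }
```

The bytes returned by `fetch_demo_prompt` could also be re-uploaded as `prompt_audio` to `/api/generate`, although passing the same `demo_id` is simpler.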
---
## 5. Live Deployments
A live deployment of this refactored, modular version is available here:
- **Hugging Face Space (Refactored)**: [Tom199328/MOSS-TTS-Nano](https://huggingface.co/spaces/Tom199328/MOSS-TTS-Nano)