# MOSS-TTS-Nano API Documentation
This document describes the modular REST API endpoints exposed by MOSS-TTS-Nano. The API is built with FastAPI; while the server is running, interactive OpenAPI (Swagger UI) docs are available at `http://127.0.0.1:18083/docs`.
---
## 1. Synchronous Synthesis Endpoint
### `POST /api/generate`
Generates a complete synthesized audio file for a given text and reference prompt.
- **Request Type**: `multipart/form-data`
- **Parameters**:
| Parameter | Type | Default | Description |
| :--- | :--- | :--- | :--- |
| `text` | `string` | *Required* | The target text to be synthesized. |
| `demo_id` | `string` | `""` | The identifier of a preset voice (e.g. `zh_1`, `en_3`). Required if `prompt_audio` is not provided. |
| `prompt_audio` | `file` | `None` | Reference audio file for zero-shot voice cloning. Overrides `demo_id` if uploaded. |
| `max_new_frames` | `integer` | `375` | Maximum number of new frames generated by the model. |
| `voice_clone_max_text_tokens` | `integer` | `75` | Maximum context length for text-based voice cloning. |
| `tts_max_batch_size` | `integer` | `0` | Batch size constraint for TTS generation (0 = auto). |
| `codec_max_batch_size` | `integer` | `0` | Batch size constraint for Codec decoding (0 = auto). |
| `enable_text_normalization` | `string` | `"1"` | Enables the WeTextProcessing front end (`"1"` or `"0"`). |
| `enable_normalize_tts_text` | `string` | `"1"` | Enables secondary robust text normalization (`"1"` or `"0"`). |
| `cpu_threads` | `integer` | `0` | Number of CPU threads to use (0 = system default). |
| `attn_implementation` | `string` | `"model_default"` | Attention implementation (e.g. `"model_default"`, `"sdpa"`, `"eager"`). |
| `do_sample` | `string` | `"1"` | Whether to use sampling during decoding (`"1"` or `"0"`). |
| `text_temperature` | `float` | `1.0` | Sampling temperature for text tokens. |
| `text_top_p` | `float` | `1.0` | Top-p filtering threshold for text. |
| `text_top_k` | `integer` | `50` | Top-k filtering threshold for text. |
| `audio_temperature` | `float` | `0.8` | Sampling temperature for audio tokens. |
| `audio_top_p` | `float` | `0.95` | Top-p filtering threshold for audio. |
| `audio_top_k` | `integer` | `25` | Top-k filtering threshold for audio. |
| `audio_repetition_penalty` | `float` | `1.2` | Repetition penalty scaling factor for audio generation. |
| `seed` | `string` | `"0"` | Random seed (set to `"0"` or empty string for random/non-deterministic). |
- **Response Example (`200 OK`)**:
```json
{
  "audio_base64": "UklGRi...",
  "sample_rate": 48000,
  "run_status": "Time: 1.2s | Speed: 25.4 frames/s",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "warmup_status_text": "Model Ready",
  "text_chunks": ["Welcome to MOSS TTS."],
  "normalized_text": "welcome to moss tts",
  "normalization_method": "wetext",
  "text_normalization_language": "en"
}
```
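The request and response above can be exercised with a small client. The following is a minimal sketch, not an official client: `build_generate_form`, `decode_audio`, and `generate` are hypothetical helper names, and the third-party `requests` library is assumed to be installed for the network call. The `UklGR` prefix in the example response is consistent with a RIFF/WAV header, but the container format is not stated in this document, so treat that as an assumption.

```python
import base64
import os

BASE_URL = "http://127.0.0.1:18083"  # host and port as shown in the docs URL


def build_generate_form(text, demo_id="", **overrides):
    """Collect form fields as strings; fields left out keep the server defaults."""
    if not text:
        raise ValueError("`text` is required")
    fields = {"text": text, "demo_id": demo_id, **overrides}
    return {name: str(value) for name, value in fields.items()}


def decode_audio(payload):
    """Return the raw bytes stored in the `audio_base64` response field."""
    return base64.b64decode(payload["audio_base64"])


def generate(text, demo_id="zh_1", prompt_audio_path=None, **overrides):
    """POST to /api/generate and return the parsed JSON response."""
    import requests  # third-party; imported lazily so the helpers above need no install

    form = build_generate_form(text, demo_id, **overrides)
    # (None, value) tuples make requests send plain fields as multipart/form-data.
    files = {name: (None, value) for name, value in form.items()}
    handle = open(prompt_audio_path, "rb") if prompt_audio_path else None
    try:
        if handle is not None:
            # The "audio/wav" content type is an assumption about the upload format.
            files["prompt_audio"] = (
                os.path.basename(prompt_audio_path), handle, "audio/wav")
        response = requests.post(
            f"{BASE_URL}/api/generate", files=files, timeout=300)
        response.raise_for_status()
        return response.json()
    finally:
        if handle is not None:
            handle.close()
```

With a server running, `decode_audio(generate("Welcome to MOSS TTS.", demo_id="en_3"))` would return the audio bytes, which can then be written to a file.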
---
## 2. Asynchronous Streaming API
Streaming endpoints are designed for low-latency playback: they deliver chunked PCM audio packets while the model is still generating.
### `POST /api/generate-stream/start`
Starts a background streaming generation job and registers a `stream_id`.
- **Request Type**: `multipart/form-data`
- **Parameters**: Same parameters as `/api/generate`.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "audio_url": "/api/generate-stream/job_a1b2c3d4/audio",
  "status_url": "/api/generate-stream/job_a1b2c3d4/status",
  "result_url": "/api/generate-stream/job_a1b2c3d4/result",
  "sample_rate": 48000,
  "channels": 1,
  "run_status": "Streaming realtime audio...",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "text_chunks": ["Welcome to the streaming engine."]
}
```
### `GET /api/generate-stream/{stream_id}/status`
Polls the active streaming status, progress metrics, and elapsed times.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "state": "running",
  "ready": false,
  "failed": false,
  "emitted_audio_seconds": 2.45,
  "lead_seconds": 1.28,
  "first_audio_latency_seconds": 0.45,
  "status_text": "Streaming...",
  "stream_metrics": "state=running | emitted=2.45s | lead=1.28s | first_audio=0.45s"
}
```
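A client would typically poll this endpoint until the job finishes. The sketch below shows one way to do that, assuming `ready` becomes `true` once the job is done and `failed` becomes `true` on error; the document shows these fields but does not spell out their exact semantics. `wait_for_stream` and its `fetch_status` callable are hypothetical names; `fetch_status` stands in for whatever HTTP call returns the parsed status JSON.

```python
import time


def wait_for_stream(fetch_status, interval=0.5, timeout=120.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until the job reports ready, fails, or times out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("failed"):
            raise RuntimeError(status.get("status_text", "stream failed"))
        if status.get("ready"):
            return status
        if clock() >= deadline:
            raise TimeoutError("stream did not become ready in time")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real delays.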
### `GET /api/generate-stream/{stream_id}/audio`
Streams raw audio packets from the generation queue.
- **Response Type**: Chunked binary stream (`application/octet-stream`)
- **Format**: Raw PCM 16-bit Little-Endian (`pcm_s16le`)
- **Headers**:
  - `X-Audio-Codec`: `pcm_s16le`
  - `X-Audio-Sample-Rate`: `48000`
  - `X-Audio-Channels`: `1`
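
Because the stream is headerless PCM, a client has to supply the sample rate and channel count itself when saving or playing it. The following sketch wraps incoming chunks in a WAV container using Python's standard `wave` module. `pcm_stream_to_wav` is a hypothetical helper; the defaults mirror the header values listed above, and the chunk iterable is assumed to come from the HTTP response body.

```python
import io
import wave


def pcm_stream_to_wav(chunks, out, sample_rate=48000, channels=1):
    """Write pcm_s16le chunks to `out` as WAV; return the duration in seconds."""
    frame_size = 2 * channels  # 16-bit samples: 2 bytes per channel per frame
    pending = b""
    frames_written = 0
    with wave.open(out, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Network chunks may split a frame, so carry leftover bytes over.
            pending += chunk
            usable = len(pending) - len(pending) % frame_size
            if usable:
                wav.writeframes(pending[:usable])
                frames_written += usable // frame_size
                pending = pending[usable:]
    return frames_written / sample_rate
```

With the `requests` library, for example, the chunks could come from `response.iter_content(chunk_size=4096)` on a request made with `stream=True`.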
### `GET /api/generate-stream/{stream_id}/result`
Retrieves the finished synthesis payload (including fully-stitched base64-encoded audio) and chunk-to-timestamp range indexes once the job completes.
- **Response Example (`200 OK`)**:
```json
{
  "stream_id": "job_a1b2c3d4",
  "ready": true,
  "state": "done",
  "prompt_audio_path": "assets/audio/zh_1.wav",
  "run_status": "Finished successfully.",
  "stream_metrics": "state=done | emitted=4.50s",
  "audio_chunk_ranges": [
    [0.0, 4.5, 0]
  ],
  "audio_base64": "UklGRi..."
}
```
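The `audio_chunk_ranges` field can be used to find which text chunk is playing at a given time. The sketch below assumes each entry is `[start_seconds, end_seconds, chunk_index]`; that layout is inferred from the example and the description above, not confirmed by this document. `chunk_index_at` is a hypothetical helper.

```python
def chunk_index_at(audio_chunk_ranges, t):
    """Return the chunk index whose [start, end) range contains time `t`."""
    for start, end, index in audio_chunk_ranges:
        if start <= t < end:
            return index
    return None  # `t` falls outside every range


ranges = [[0.0, 4.5, 0], [4.5, 7.0, 1]]
print(chunk_index_at(ranges, 1.0))  # prints 0
print(chunk_index_at(ranges, 4.5))  # prints 1
print(chunk_index_at(ranges, 9.0))  # prints None
```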
### `POST /api/generate-stream/{stream_id}/close`
Closes the active queue session, terminates background threads, and purges intermediate disk caches.
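Since closing releases server-side resources, it helps to guarantee the call even when the client fails mid-stream. One way to do that is a context manager, sketched below. `stream_job` is a hypothetical helper, and `post(path, form)` is a caller-supplied function (also hypothetical) that performs the HTTP POST and returns the parsed JSON.

```python
from contextlib import contextmanager


@contextmanager
def stream_job(post, form):
    """Start a streaming job and always close it when the block exits."""
    job = post("/api/generate-stream/start", form)
    try:
        yield job
    finally:
        # Runs on normal exit and on exceptions raised inside the block.
        post(f"/api/generate-stream/{job['stream_id']}/close", None)
```

A caller would wrap its status polling and audio reads in `with stream_job(post, form) as job:` so the close request is sent regardless of how the block ends.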
---
## 3. Metadata & Health Endpoints
### `GET /api/metadata`
Returns information regarding loaded tokenizer weights, hardware attention defaults, and runtime modules.
- **Response Example (`200 OK`)**:
```json
{
  "is_onnx": false,
  "is_cpu_only": true,
  "execution_provider": "cpu",
  "device_type": "cpu",
  "checkpoint_default_attn_implementation": "eager",
  "checkpoint_path": "models/MOSS-TTS-Nano-100M",
  "audio_tokenizer_path": "models/MOSS-Audio-Tokenizer-Nano",
  "text_normalization_status": "WeTextProcessing ready."
}
```
### `GET /api/warmup-status`
Returns the status of model warmup (graph pre-warming).
- **Response Example (`200 OK`)**:
```json
{
  "ready": true,
  "progress": 1.0,
  "message": "Warmup completed successfully.",
  "failed": false,
  "status_text": "Warmup Ready"
}
```
### `GET /api/text-normalization-status`
Returns diagnostics for the Chinese/English WeTextProcessing text-normalization front end.
---
## 4. Preset Voices
### `GET /api/demo-prompt-audio/{demo_id}`
Streams a reference preset voice file directly.
- **Parameters**: `demo_id` (e.g. `zh_1`, `en_3`).
- **Response**: Binary audio file stream (`audio/wav`).
---
## 5. Live Deployments
A live deployment of this refactored, modular version is available here:
- **Hugging Face Space (Refactored)**: [Tom199328/MOSS-TTS-Nano](https://huggingface.co/spaces/Tom199328/MOSS-TTS-Nano)